5.0 Released to Appstream – Finally

5.0 User View

This release comes about two weeks later than I hoped when I last posted here. We are a small company. We had complications with some of the final bits of this release, and at the same time we were also facing some unusual problems with the infrastructure behind our main platform. I can’t tell you how many nights I have been sitting up watching resource monitors and waiting for thread activity exceptions while trying to solve some of these problems.

The next time you log into our platform the image above is what you will see. As I noted before, one seemingly minor but important change is that the application will know about all of the document indexes and databases that are available.

One of the more challenging parts of the rollout was preserving your existing preferences and settings. I was quite excited when that little piece was finally solved last week. It was much harder to make that happen than I ever imagined. The setting I care about the most is my search history. The image below is from my personal search history.

Search History

One of the key reasons we wanted this upgrade was to provide simpler and faster access to the data that currently requires a request file and . . . You can see on the menu bar there is a Query Databases item. When you select it, the interface switches over to the database tool and all of the available databases will be listed in the Database to Query panel.

Before I go further, I need to make clear that because we are behind on the application rollout, we are consequently behind on database development as well. For the moment, please consider the EXEC_COMP database a practice database. One of our interns is working hard on the transformations we need to bring that database online. Eventually we will maintain a document listing the databases and the meaning of their keys.

While I will be prepping some videos to support your use of the tool – I want to walk through a simple problem. Suppose I have a list of CIKs and I want to identify all 8-K filing events that affected those registrants for some window. In the steps below I am presuming you have logged into the system.

Step 1 – create a CSV file that has your list of CIKs (as always the CIKs must be integers) and transfer that file to the session.

Step 2 – Start the application and select the Query Databases menu item.

Step 3 – Select the 8Kmeta database.

Step 4 – Look at the Display Columns panel and adjust to suit your preferences. This panel specifies the columns that will be displayed in the viewer after the query and defines the columns available in the output. My practice is to select ALL.

To query we have to select Criteria Columns. In this case I am going to select SEC_FILING_DATE. The available operators will change based on the nature of the criteria. Since the application recognizes the SEC_FILING_DATE as a date value the application offers Comparison Operators that relate to dates.

Step 5 – Select SEC_FILING_DATE and select Between in the Comparison Operator box. (I selected 8/2/2001 for my From Date and 5/16/2018 for my To Date).

Step 6 – Hit the Add Criteria button.

Step 7 – select the Use CIK checkbox and then use the CIK Filter button to navigate to and select the CSV file that has your CIK list (remember – this file has to be available within your session).

Prepping a Query for all 8-K events filed by a specific list of registrants.

Step 8 – Hit the Execute button to execute the query.

If the application reports Not Responding while the query runs, that simply means it does not have a listener available to accept your input. This query took me approximately 30 seconds.

My Query Results

Notice the Save Results button. Hit that to save the results to a CSV file. Saving happens pretty quickly, and when the file has been written you will see a confirmation message.

Confirmation of Saving Query Results

To transfer the results to your local computer – select the Files icon from the control bar and select the Temporary Files folder:

Accessing the Folders Tool to Download

Use the control to the right of the Size column to select Download.

Download Results
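Going back to Step 1, the input file is just a one-column CSV of integer CIKs. A minimal sketch of building one (the example CIKs and the filename are my own, not a platform requirement):

```python
import csv

# Example CIKs as integers, per the Step 1 requirement.
ciks = [320193, 789019, 1018724]

with open('cik_list.csv', 'w', newline='') as fh:
    writer = csv.writer(fh)
    for cik in ciks:
        writer.writerow([cik])
```

Any tool that writes one integer per line will do; the key constraint is that the CIKs are plain integers and the file is transferred into your session before you hit CIK Filter.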

I am very excited about this improvement. We have a lot to do to make the data transition successful but now that the application piece is in place we can shift our focus to that aspect. The automatic update of the available databases alone will be beneficial as I have seen too many cases where users have not known that new filings were available.

Version 5.0 Beta is Finally in Final Stages of Internal Testing

This has been a long time coming. But the release is getting closer.

The most significant change we are making with this release is to incorporate a database query and extraction feature. Here is a screen shot of a test query – I am looking for all 8-Ks filed between 2/11/2014 and 7/16/2021 that included ITEM 4.02 (Non-Reliance on Previously Issued Financials) as one of the reason codes.

This was brutally fast. The query executed before I could pretty much blink. Notice the Save Results button. If you hit that, you get a regular file dialog to name a csv file to save the results.

We have two strategic goals with this new version. One is to rationalize the introduction of document metadata. Last year I tried to add metadata to individual documents and while it worked, I heard from some users about how cumbersome it was. Frankly we found it painful too, because every time we want to add some new metadata to a document we have to do a ton of work, change the document and then re-index the document collection. Now we can separate the metadata from the documents but provide you a link to get from a db query to the original documents (and vice-versa). In plainer English – we intend for you to be able to interact with the databases and, if you desire, identify specific documents that interest you based on a database query. Run the query, use the output to run a full-text search, and filter the full-text search based strictly on the existence of the document in the CSV file from the db query.

We also intend – to allow you to run a full text search, save those results and then use those results to pull the metadata you want from a database query. Like you, I am getting a headache as I am writing this because it is complicated – but it is also powerful.

The second strategic goal is to give you better access to the other data we have available on the platform. In this second image you see a new db listed as available. The application will inspect the db folder when started and list all databases available at that moment. You don’t have to do anything but open the tool. We will start moving all of our existing tabular data to this platform so you no longer have to use the ExtractionPreprocessed interface. In the screenshot below I am querying our Executive Compensation data for any officer who has the word counsel or legal in their title, earned a salary greater than $350,000, and was female. Two key issues here. First, we are going to shift data availability to data year rather than document year. Second, you can do some advanced filtering using the query interface to select your sample.

Query of EC Data using new query tool
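Since these are SQLite databases under the hood, the same sort of query can be expressed directly in SQL if you ever work with a downloaded copy. A minimal sketch with an invented schema – the real EXEC_COMP table and column names may differ:

```python
import sqlite3

# Hypothetical schema and rows mirroring the screenshot's query;
# the platform's actual table/column names are not guaranteed to match.
con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE EXEC_COMP (CNAME TEXT, TITLE TEXT, SALARY REAL, GENDER TEXT)")
con.executemany("INSERT INTO EXEC_COMP VALUES (?, ?, ?, ?)", [
    ('Acme Corp', 'General Counsel',     425000, 'F'),
    ('Acme Corp', 'Chief Legal Officer', 300000, 'F'),  # salary too low
    ('Beta Inc',  'VP Legal Affairs',    510000, 'M'),  # wrong gender
])
# SQLite's LIKE is case-insensitive for ASCII, so '%counsel%'
# also matches 'General Counsel'.
rows = con.execute(
    "SELECT CNAME, TITLE, SALARY FROM EXEC_COMP "
    "WHERE (TITLE LIKE '%counsel%' OR TITLE LIKE '%legal%') "
    "AND SALARY > 350000 AND GENDER = 'F'"
).fetchall()
```

The query tool builds this kind of WHERE clause for you from the criteria panel, which is the point – you get the filtering without writing SQL.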

There are other changes, most are relatively minor. I was helping a client who had not used the Options feature to update the indexed documents. Because we now control more of your experience we made the process of updating document index libraries automatic on restart of your session. While we will announce new document collections through the blog you will not need to do anything special to access the indexes.

All of this is a process. We have some folks working on the steps required to move our existing JSON based data into the SQLite databases. I personally am working on adding institutional trading data (from 13F-HR filings) into a format to make it accessible through this interface.

My guesstimate right now is that we will switch you over to the new version around the 4th of July. Your search history and other settings will not be affected by the transition.

Switching to a db model is really exciting for us. Once we are comfortable that the process of converting the existing data is solid we plan to experiment with developing simpler but powerful ways to JOIN data across databases. For now, if you want the effect of a join you will have to make two queries and merge the results on shared keys like CIK and YEAR.
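The merge side of that DIY join is only a couple of lines once both query results are saved to CSV. A sketch with pandas – the column names here are invented for illustration:

```python
import pandas as pd

# Two hypothetical query outputs; in practice these would come from
# pd.read_csv on the files the query tool saved.
comp = pd.DataFrame({'CIK': [320193, 789019], 'YEAR': [2020, 2020],
                     'SALARY': [3000000, 2500000]})
filings = pd.DataFrame({'CIK': [320193, 789019], 'YEAR': [2020, 2020],
                        'N_8K': [9, 12]})

# Inner merge on the shared keys - the DIY join.
merged = comp.merge(filings, on=['CIK', 'YEAR'], how='inner')
```

An inner merge keeps only the CIK/YEAR pairs present in both results; switch `how='left'` if you want to keep every row from the first query.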

It is never simple – We don’t know what we don’t know!

I had a PhD student ask a really interesting question last week. Because I don’t want to disclose their research goals it took me a bit of time to come up with a good analogy. They had a search with more than 100 search terms. They did a summary extraction and were scanning the summary file with the columns that list the terms and the number of hits found in each document. They would then periodically look at the source document in our viewer to check how their words/phrases were actually used in context. The problem they identified was that they started to see cases where a search term was in the document but it was not used in the right context.

So my example is going to be – suppose I want to find all 10-K filings that mention Texas. I believe that if the word Texas appears in a 10-K that provides strong evidence that the company has operations of some sort in the state. However, once I start scanning the results I find plenty of cases where the word TEXAS is in a filing – but it is used as part of a noun phrase or other construction that does not actually name Texas as a location of operations. For example, West Texas Intermediate is a benchmark used for pricing oil transactions, and Texas Instruments or Texas Pacific may be mentioned as competitors. So the question is – if we don’t know the context of the word in use, how can we be sure the word actually signifies what we hope it signifies? In other words, the existence of the word may not be sufficient evidence that the instance is meaningful in our case. Further, we do not know in advance all of the possible noun phrases and proper names that include the word TEXAS, so we can’t exclude them or account for them in our search. (If you do know all of the proper nouns and noun phrases to exclude in advance, then modify the search to account for them with the ANDANY operator: TEXAS andany (TEXAS INSTRUMENTS OR WEST TEXAS INTERMEDIATE OR . . .).)
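Once a few of those noise phrases become known, a cheap alternative to rebuilding the search is to filter the extracted contexts after the fact. The phrase list and helper below are purely illustrative, not part of the platform:

```python
# Known noisy noun phrases containing TEXAS (illustrative, never complete).
NOISE = ('TEXAS INSTRUMENTS', 'WEST TEXAS INTERMEDIATE', 'TEXAS A&M')

def is_noise(context):
    """True when every TEXAS mention sits inside a known noise phrase."""
    upper = context.upper()
    for phrase in NOISE:
        upper = upper.replace(phrase, '')
    return 'TEXAS' not in upper

contexts = [
    'our royalties track West Texas Intermediate prices',
    'we operate three plants in Texas and Oklahoma',
]
kept = [c for c in contexts if not is_noise(c)]
```

The catch, as discussed above, is that the NOISE list can only grow as you scan results – which is exactly why the ContextExtraction scan comes first.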

Here is an image of the search results, the filer is TEJON RANCH CO – the name has a Texas sort of ring to it but they are a real estate development and agribusiness company. They realize some royalties from mineral right leases on their land – which is in California. They appear to have no operations in Texas, but the royalties they receive seem to be tied to West Texas Intermediate.

One way to help identify and get a better handle on how TEXAS is used in a document is to do a ContextExtraction (this is the label we use on the platform) and set a really tight span. In this particular case I suggested to the PhD student that they set a span of 1 ‘Words’ as illustrated in the next image.

Setting ContextExtraction Span

By doing so and scanning the results it becomes clear that there is a lot of noise in assuming just the mention of the word TEXAS in a 10-K filing is meaningful. We find cases where Texas Instruments is mentioned as a competitor. There are cases where Texas A&M and other universities with the word Texas in their name have a patent relationship with the registrant or one of the executive officers earned a degree at the university. Restricting the context to one word may not be the best choice in every case. That is okay because it is cheap to rerun and alter the span to test alternative strategies.

The point is that there is no way I could have known in advance all of the ways the word Texas might be used in the filing and be confident that the use of Texas was evidence of corporate activity in Texas. But by extracting the limited context and scanning it I can more confidently look for ways to better measure evidence of business activity.

I will disclose that in our exchange, they were wondering if this was the point they needed to start learning Python. I do encourage folks to start learning Python – but this is not a problem well solved by Python. We had the context around the word TEXAS from every 10-K filed from 2016-2020 in a csv file about four minutes after we started. Now it is going to take some effort to learn what should be included or excluded to make sure their measure reflects what they hope it reflects. Being able to look at these results is what is going to give them the understanding they need to move forward.

Examples of Texas Results that are Likely Noisy

Random – It is about the people. We have amazing interns!

I often tell people that my day job is about the best job in the world. Every semester I get to meet some outstanding young people who are just starting to make their mark on the world. I love class when someone challenges me and asks hard questions. My role here also gives me the same opportunity – we hire late juniors and seniors in high school and try to keep them as long as we can. What I am looking for is a little bit of arrogance (confidence), a tiny bit of humility, and a lot of curiosity and persistence – no experience necessary or really even wanted. Most importantly, I am looking for integrity. They work remotely and I don’t want to invest in monitoring systems, so I need them to report quickly when they make an error.

Our training is pretty ad-hoc and is initially focused on helping them learn the importance of details. We have tools that we use to identify, extract and normalize executive and director compensation. If we knew everything about the way the data is going to be presented in the filings we wouldn’t have to use humans because we could address the issues in code. But there are a lot of nuances that get added each year. Some days it feels like we are playing whack-a-mole with the choices registrants make. We used to believe that II was always part of the name in a Name/Title cell – but today one company used it in the title. I have gone too far into the weeds. The point is we need our interns to be really curious and questioning when they are looking at details.

They start off doing tasks that keep our processes running and learning how to question everything they see in one of our dashboards (does that II really belong in the title). At some point we start teaching them Python. When we start teaching them to code – the focus is on learning how to break tasks into the smallest possible step. It takes roughly a year before they are proficient enough to start making independent contributions to our code base. I will give them some goal and when they ask questions I try mostly to make sure they are asking the right question and then rather than giving them the answer I send them into our Experiments channel on Slack or to Stack Overflow.

One of our interns, Michael Pineda, had a really interesting weekend. Michael is a Mechanical Engineering sophomore at the University of Nebraska at Lincoln (UNL). He started with us late in the fall of his senior year in high school. He is a member of the UNL Society of Automotive Engineers (SAE) club. Each year SAE clubs at colleges across the country compete in the Formula SAE. They build a small formula style race car from scratch. The competition includes the presentation of a business case for their car. This weekend they had to get the car in front of alumni for the first public viewing. Michael is a lead on the suspension team. He shared these pictures in Slack with the rest of our team at 5:45 AM Sunday. Here are some pictures of their car:

Suspension

Here is the car:

University of Nebraska Lincoln Formula SAE Racer

Do you think Michael is curious and questioning? He has been at work on a new project that is going to be disclosed about the time we release version 5 (I hope). Michael has been offered a fantastic mechanical engineering internship over the summer. Like all of our interns – he will be leaving us to bigger and better things – but it sure is fun to work with them at this point in their lives.

Between them, two of our interns are competing at the national level in seven academic competitions in the next month (Mock Trial, Speech & Debate, Academic Decathlon . . .). Three of our interns are enrolled in 12 total AP courses this semester. The two that are graduating from high school this year have been offered academic scholarships in excess of $250,000. Our newest intern is keeping up with a challenging academic schedule and doing some amazing things in the long jump and relays for her high school.

I know I am rambling, but I would be remiss if I did not mention Shelby Lesseig. Shelby was an intern while she was working on her BBA (Accounting)/MPA at UT Austin. I just checked out her LinkedIn profile:

Linked-In Description of Work with AcademicEDGAR

Shelby came on while I was still learning about the characteristics we really needed in interns and she unwittingly helped me better identify some of those qualities – she set an early standard that we still use today. We still use some of Shelby’s early work to manage our data extraction and normalization processes. Shelby’s husband and his colleagues have actually benefited from some of her work. Is that cool or what!

This post was prompted by Michael’s excitement over his contribution to the race car and me coming to terms with the fact that he is going to be moving on soon. It made me think more about this journey and while it has been challenging at times – I really do think the best part of it continues to be getting to work with such bright interns (I was going to say kids but I don’t want to diminish them at all).

DOCTAGS An Overview

I was answering a client question this morning about limiting search results to particular documents and decided that it was probably time to post here about our DOCTAG filtering.

An SEC filing includes a form and might also include exhibits. In conversation and generally in writing about filings we don’t often separate the form from the exhibits even when our work might be focused on the form rather than the filing (inclusive of exhibits). As part of our process we collect filings from the SEC and then parse the filing to separate the form from the exhibits. We then tag the form and the exhibits to allow you to select, search and manipulate search results based on the type of document.

The tag for the form is the name of the filing (10-K, 10-K/A) with all of the spaces, dashes (-) and slashes (/) removed, so the 10-K becomes 10K, the 10-K/A becomes 10KA, and an SC 13D becomes SC13D. The SEC mandates that exhibits follow a convention with respect to the description field when they are included in a filing (see https://www.law.cornell.edu/cfr/text/17/229.601). We follow the same rule for converting the Exhibit Description to our DOCTYPE code, except we remove everything to the right of any decimal in the EXHIBIT TYPE field. While filers may have their own internal system that they use to add meaning to the DESCRIPTION field of an exhibit in a filing, that system is not available to us. So when a filer uses EX-10.17, our DOCTYPE code is EX10. Here is an image from an Apple 10-K filing; we coded the 10-K as 10K and the exhibits as EX4, EX10, EX10, EX21 . . .

Document List for Apple 2020 10-K Filing

At one time I speculated that Apple’s convention is to indicate the sequential order of the exhibit type included in an SEC filing in the fiscal year. I no longer believe that to be true. While we have seen cases where a particular filer seems to have a coding scheme (10.1X is a debt contract and 10.2X is a compensation related contract . .) these practices are internal not externally driven.
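The conversion rule is simple enough to sketch in a few lines. This is my own illustration of the rule as described above, not our production code:

```python
import re

def doctag(label):
    """Convert a form name or exhibit type to a DOCTAG: drop anything
    after a decimal point, then strip spaces, dashes, and slashes."""
    label = label.split('.', 1)[0]       # EX-10.17 -> EX-10
    return re.sub(r'[ \-/]', '', label)  # 10-K/A -> 10KA
```

So `doctag('10-K/A')` gives `10KA` and `doctag('EX-10.17')` gives `EX10`, which is why two different Exhibit 10s in the same filing both show up as EX10.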

I think I have addressed this before – the best way to begin identifying particular types of contracts is to use the DOCTYPE filter and specify EX10 and then use the XFIRSTWORD search with the within operator (W/#) and then key words that would be expected to be within some N words of a particular type of contract. For example (DOCTYPE contains(ex10)) and (xfirstword w/10 (debt or credit)) will return all Exhibit 10s that contain either the word debt or credit within 10 words of the first word in the document.

Search is an art, it is important to play around with the span (w/N) and compare the results. I have seen cases where our clients have more than 400 words/phrases to check – this is no problem at all for the parsing engine – you just have to be careful about the grouping of your phrases/terms. When I hear from a client that they can’t get a search to work – invariably it turns out to be a problem with parentheses placement.

As an aside – I really find the Cornell Law link above to be one of the best resources to use when trying to understand what I can expect to find in a filing.

Version 5.0 is coming!

Let me start with the exciting stuff – here is an image of a key feature in the next version:

First Image of New Database Interface

We are building a query tool that will be incorporated into our ExtractionEngine. If you have seen some of our past posts – we have loaded some metadata databases to the cloud. We built the databases initially as temporary containers to hold metadata until we fully settled on the specific data we wanted to add to the filings. Some users asked for access to some of the data and because it is complicated to fully incorporate the data into the filings directly we thought a reasonable intermediate step was to make these databases available for you to download.

The problem is that of course you have to download the database, find and install a viewer, and learn how to work with the viewer . . .

Further, I have been wondering about making more of our data available in a more direct fashion. We also have a secret project – one we have made really decent progress on – to bring another very useful data set into our system. So I wondered if we could create our own data query/view feature – the image above is a first view of this feature.

Generally, when you select the query tool from the menu the application will check our repository to determine what databases are available (because we are going to add lots of data) and load the available databases into the application. You will select a database and the application will dynamically identify the fields and their characteristics (TYPE) and then offer you a panel to build a query. Once the query has run you will have the option of saving the results to a CSV file.

Most of the other features we will be adding to 5.0 are more incremental. For example, today if you want to access new document indexes you have to run the File/Options/Index Library/Generate Library utility. With our cloud deployment this is unnecessary, so we are making the application dynamically respond to the available index collection when it starts. Once we implement this you will not receive any more emails from me announcing that a new index has been added – the application will just know what indexes are available.

I don’t have a precise date yet for this new release. We are in the midst of proxy season. This means that a lot of our focus is on making sure the executive and director compensation tables included in each day’s filings are available to you by the close of business. Our system is really good but there are always new challenges that take some attention and focus. I am not sure whether we will be able to finish all of the development work before the end of proxy season. If not, it should be soon thereafter.

Accessing Metadata from Documents Extracted Through our Platform

A fascinating question: a PhD student is doing some work with filings they extracted from our platform. They ran some complex searches to identify the documents and saved them to their computer. However, they did not generate a summary of the search results, so they lost immediate access to the metadata associated with the documents. After some extensive work they realized that they would like to use the metadata in their analysis. So the question was, can they get the metadata without rerunning the search? The answer is yes, particularly since they are working in Python. Below is some code to create a list of dictionaries of the metadata embedded in the htm files in the source folder. We use the LXML library. I think many of you might be using BeautifulSoup; if so, a small modification is needed. The key is that we add the meta as elements with attributes, so we can pull those elements and get their attributes rather cleanly.

import glob
import os

from lxml import html

# source_dir should point at the folder holding the saved .htm documents
meta_list = []
for htm_document in glob.glob(source_dir + os.sep + '*.htm'):
    with open(htm_document, 'rb') as fh:
        b_string = fh.read()
    meta_dict = dict()
    tree = html.fromstring(b_string)
    meta_e = tree.xpath('.//meta')
    if len(meta_e) == 0:
        print(f'no meta {htm_document}')
    for m in meta_e:
        attrib = m.attrib
        meta_dict[attrib['name']] = attrib['content']
    meta_dict['source_path'] = htm_document
    meta_list.append(meta_dict)

I pulled out an 8-K filing and ran the above code on an 8-K filed by Apple. Here is the result:

{'DOCTYPE': '8K', 'SECPATH': 'https://www.sec.gov/Archives/edgar/data/320193/000119312521001982/d29637d8k.htm', 'ACCEPTANCE': '20210105163216', 'SICCODE': '3571', 'CNAME': 'Apple Inc.', 'FYEND': '0925', 'ITEM_5.02': 'YES'}
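If you are headed toward analysis, the resulting meta_list drops straight into a pandas DataFrame, with one column per meta tag:

```python
import pandas as pd

# One-row example using a subset of the dictionary shown above.
meta_list = [{'DOCTYPE': '8K', 'CNAME': 'Apple Inc.',
              'SICCODE': '3571', 'FYEND': '0925'}]
df = pd.DataFrame(meta_list)
# Documents missing a given meta tag simply show NaN in that column.
```

This is convenient because different filings can carry different item tags (like the ITEM_5.02 above), and the DataFrame will align them for you.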

Reminder – Searches with WORDS that are Operators Require ~ Appended to the Operator

A client was trying to sort out how to run a specific search. They wanted to use a phrase containing the word and – they were getting anomalous results and so dropped us a question. Anytime you need to search for a word that is also a primary operator, you need to append a tilde (~) to it so the engine treats it as a word rather than an operator.

I am reluctant to share their specific search. I was looking at some audit proposals for the ratification of the Independent Registered Accounting firm and so that is where this example comes from. Suppose you want to search for cases where the proxy reports that the auditor is not expected to be present at the annual meeting. NOT is a search operator and it is a strong one – it will eliminate documents/results with the word/phrase that follow. (It has more complicated uses but let me avoid a rabbit hole here).

To identify those documents where there is an explicit indication that the auditor is not expected to be present at the meeting I ran the following search:

((not~ expected) w/10 present) w/50 (audit* or account*)
Auditor Not Expected to be Present

If we run the search without the tilde, the result would be those cases where the word expected was not within 10 words of the word present and expected was within 50 words of words rooted on audit or account. Are you confused? Sometimes I am too – search is an art.

I will admit, I find these results interesting – there are not that many though and it seems that a good number of the cases are those where the auditor from the prior year is not continuing – but not all. The image above came from Forward Air’s 2021 DEF 14A.

Of course, I immediately wondered if those cases were an indication of an intent to dismiss the auditing firm in the near future. Unfortunately, when I expanded my search to cover all years I found cases where the auditor is routinely not expected to be present – for example CECO ENVIRONMENTAL, CULLEN FROST BANKERS and DOVER CORP. Shucks, I thought that would be an interesting research paper.

2020 Insider Trading Data Updating With ACCEPTANCE-DATETIME Field

The system is currently running to update all of the 2020 SECTION-16-SUMMARY data with the ACCEPTANCE-DATETIME field. The process started about 1:00 PM on 1/8/2022 – I estimate that it will be complete by 4:00 PM (ish).

I did make a dreadful mistake during this update. I pulled the 2020 data offline while I was preparing the code. I received an email, and while I was able to address that user’s requirements, I realized it was not necessary to pull everything offline. I will not make that mistake again. We are now working on the prior years. Thank you for your patience – this should have been addressed when we first handled these files.

When I posted that the 2021 data was available I noted that my sense was that there were more insider transactions in 2021 than in 2020. This was confirmed: in 2021 there were 875,357 total processed rows pulled from 224,474 unique filings, while in 2020 there were only 717,239 rows pulled from 203,566 filings. I think we expected this because we saw many more new directors in 2021 than we have seen in quite a while.

We are still heavily involved in some of the transition work, so it may take a while, but we will generate the DIRECTOR-RELATIONSHIP artifact for 2021 soon. That will be interesting, as my sense during the year was that more of the newest directors are female. Once that artifact is created it will be easier to confirm that observation.

I would like to observe that you can map the trading data to the Director/Executive compensation data by the PERSON-CIK value. In those tables we have GENDER. We have AGE and TENURE fields in the Director data.

Context Normalization – Spelling Counts

I had an interesting email from a hard at work PhD student who was using the ContextNormalization feature of our platform to normalize some data. Because they are collecting a piece of data I have not seen used in research before I am going to describe their problem using AUDITOR TENURE data collection. The nature of their problem manifests itself in the same way in almost every Context Normalization case.

As a result of a PCAOB rule change, registrants are supposed to disclose their tenure with the client. The most common expression tends to be “We have served as the Company’s auditor since YYYY.” Below is an image from running a search for auditor* since on 2021 10-K filings.

Auditor Since Search

I ran the search with auditor* because I also want to catch the expression auditors since.

I set a really tight span for the context since this is one of those binary cases – it will be concisely expressed or it is not likely to be expressed. Remember – to set the span for Context – use Options/Context and specify the span you need.

Setting Context Span

Once we’ve done that we are ready to set the parameters for the ContextNormalization. Notice I did not include auditor in the Extraction Pattern. This assures me that the processor will not discard those cases where the phrase is auditors since. Since the processor is working on the active search results I have no concern about phrases like we have been making amazing products since XXXX. Our search was for auditor* since.

This is one of the ‘spelling matters’ issues – if I specify auditor since, the engine will not normalize auditors since. The word auditor was critical to get the right context, but using it in the Extraction Pattern would reduce the yield, since there will not be an exact match to auditor since when the filer wrote auditors since.

The second spelling issue occurs because of formatting errors or typos. When I sorted the results by the value of tenure – you can see I had some results that didn’t make sense.

Inconsistent Tenure Values

Either someone accidentally inserted an extra space as they were typing the year values, or the underlying HTML has a tag separating parts of the number.

Then of course we have these cases:

More Context Errors

In the cases shown above the search correctly identified the context but there are words intervening between the word since and the value we want for year.

We collected the year value from 6,745 documents based strictly on the existence of a valid number following the word since. There were 115 documents with language “since at least YYYY” or “since fiscal YYYY” and other permutations and there were a total of 4 typos.
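For those comfortable with Python, a repair pass for exactly these failure modes can be sketched with a forgiving regular expression. This is tuned only to the cases described above (a few intervening words, one stray space inside the digits), so treat it as a starting point rather than a general solution:

```python
import re

# Year after "since", tolerating up to three intervening words
# ("since at least 1998", "since fiscal 2015") and a stray space
# inside the digits ("19 98").
PATTERN = re.compile(r'since(?:\s+\w+){0,3}?\s+((?:19|20)\s?\d\s?\d)', re.I)

def tenure_year(context):
    m = PATTERN.search(context)
    return int(m.group(1).replace(' ', '')) if m else None
```

Because the pattern caps the intervening words at three, it stays conservative and returns None rather than grabbing a year from an unrelated clause further along in the sentence.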

I keep playing around with a Python library called FuzzyWuzzy to improve this yield, and while we can make significant improvements for specific use cases, the problem is that I can’t anticipate all use cases in a way that makes me comfortable implementing the library inside one of our functions. However, if you do a ContextExtraction and have some time on your hands, I would encourage you to poke at the normalization with that library.
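If you want to experiment without installing anything, FuzzyWuzzy’s simple ratio falls back to the standard library’s difflib.SequenceMatcher when python-Levenshtein is not installed, so the core idea can be sketched with stdlib alone (the example phrases below are mine):

```python
from difflib import SequenceMatcher

def ratio(a, b):
    """0-100 similarity score, the same scale FuzzyWuzzy reports."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())

score_typo = ratio('auditor since', 'audittor since')          # near miss
score_other = ratio('auditor since', 'making products since')  # unrelated
```

A near-miss like a doubled letter scores high while an unrelated phrase scores much lower, which is the property you would exploit to catch typo variants of the extraction pattern.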