It is never simple – We don’t know what we don’t know!

I had a PhD student ask a really interesting question last week. Because I don’t want to disclose their research goals, it took me a bit of time to come up with a good analogy. They had a search with more than 100 search terms. They ran a summary extraction and were scanning the summary file, whose columns list the terms and the number of hits found in each document. They would then periodically look at the source document in our viewer to check how their words/phrases were actually used in context. The problem they identified was that they started to see cases where a search term appeared in a document but was not used in the right context.

So my example is going to be this: suppose I want to find all 10-K filings that mention Texas. I believe that if the word Texas appears in a 10-K, that provides strong evidence that the company has operations of some sort in the state. However, once I start scanning the results I find plenty of cases where the word TEXAS is in a filing, but it is used as part of a noun phrase or other construction that does not actually name Texas as a location of operations. For example, West Texas Intermediate is a benchmark used for pricing oil transactions, and Texas Instruments or Texas Pacific may be mentioned as competitors. So the question is: if we don’t know the context of the word in use, how can we be sure the word actually signifies what we hope it signifies? In other words, the existence of the word may not be sufficient evidence that the instance is meaningful in our case. Further, we do not know in advance all of the possible noun phrases and proper names that include the word TEXAS, so we can’t exclude them or account for them in our search. (If you do know all of the proper nouns and noun phrases to exclude in advance, modify the search to account for those with the ANDANY operator: TEXAS andany (TEXAS INSTRUMENTS OR WEST TEXAS INTERMEDIATE OR . . .).)
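If you did have an exclusion list, the idea behind that kind of filter can be sketched in plain Python. The phrase list below is purely illustrative; as noted, in practice you never know the full list in advance.

```python
# Illustrative only: keep documents where TEXAS appears outside
# every known noun phrase. The exclusion list is hypothetical and
# necessarily incomplete.
EXCLUDE_PHRASES = [
    "TEXAS INSTRUMENTS",
    "WEST TEXAS INTERMEDIATE",
    "TEXAS PACIFIC",
]

def has_standalone_texas(text: str) -> bool:
    """Return True if TEXAS appears outside all excluded phrases."""
    upper = text.upper()
    # Blank out the known phrases, then look for any remaining TEXAS.
    for phrase in EXCLUDE_PHRASES:
        upper = upper.replace(phrase, " " * len(phrase))
    return "TEXAS" in upper
```

A document whose only mention of Texas is inside "West Texas Intermediate" would be filtered out, while a sentence like "Our plant in Texas expanded" would still match.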

Here is an image of the search results; the filer is TEJON RANCH CO – the name has a Texas sort of ring to it, but they are a real estate development and agribusiness company. They realize some royalties from mineral rights leases on their land – which is in California. They appear to have no operations in Texas, but the royalties they receive seem to be tied to West Texas Intermediate.

One way to help identify and get a better handle on how TEXAS is used in a document is to do a ContextExtraction (this is the label we use on the platform) and set a really tight span. In this particular case I suggested that the PhD student set a span of 1 ‘Words’, as illustrated in the next image.

Setting ContextExtraction Span

By doing so and scanning the results, it becomes clear that there is a lot of noise in assuming that the mere mention of the word TEXAS in a 10-K filing is meaningful. We find cases where Texas Instruments is mentioned as a competitor. There are cases where Texas A&M and other universities with the word Texas in their name have a patent relationship with the registrant, or one of the executive officers earned a degree at the university. Restricting the context to one word may not be the best choice in every case. That is okay, because it is cheap to rerun the extraction and alter the span to test alternative strategies.
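For readers curious what a one-word span looks like mechanically, here is a rough Python sketch. The whitespace tokenization is naive and the function name is mine; the platform’s ContextExtraction is the practical tool, as discussed below.

```python
import re

def context_spans(text: str, term: str, span: int = 1):
    """Yield each hit of `term` with `span` words of context on each side.

    Naive whitespace tokenization -- a rough stand-in for a real
    context-extraction step, for illustration only.
    """
    words = re.findall(r"\S+", text)
    for i, word in enumerate(words):
        if term.lower() in word.lower():
            lo = max(0, i - span)
            hi = min(len(words), i + span + 1)
            yield " ".join(words[lo:hi])

text = "Royalties are tied to West Texas Intermediate crude prices."
print(list(context_spans(text, "Texas")))
# -> ['West Texas Intermediate']
```

Even this tiny window is enough to see that the hit belongs to a pricing benchmark rather than a statement about operations.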

The point is that there is no way I could have known in advance all of the ways the word Texas might be used in the filing and be confident that the use of Texas was evidence of corporate activity in Texas. But by extracting the limited context and scanning it I can more confidently look for ways to better measure evidence of business activity.

I will disclose that in our exchange, they wondered if this was the point where they needed to start learning Python. I do encourage folks to start learning Python – but this is not a problem well solved by Python. We had the context around the word TEXAS from every 10-K filed from 2016-2020 in a CSV file about four minutes after we started. Now it is going to take some effort to learn what should be included or excluded to make sure their measure reflects what they hope it reflects. Being able to look at these results is what will give them the understanding they need to move forward.

Examples of Texas Results that are Likely Noisy

Random – It is about the people. We have amazing interns!

I often tell people that my day job is about the best job in the world. Every semester I get to meet some outstanding young people who are just starting to make their mark on the world. I love class when someone challenges me and asks hard questions. My role here also gives me the same opportunity – we hire late juniors and seniors in high school and try to keep them as long as we can. What I am looking for is a little bit of arrogance (confidence), a little tiny bit of humility and a lot of curiosity and persistence – no experience necessary or really even wanted. Most importantly, I am looking for integrity. They work remotely and I don’t want to invest in monitoring systems and I need them to quickly report when they make an error.

Our training is pretty ad-hoc and is initially focused on helping them learn the importance of details. We have tools that we use to identify, extract and normalize executive and director compensation. If we knew everything about the way the data is going to be presented in the filings, we wouldn’t have to use humans because we could address the issues in code. But there are a lot of nuances that get added each year. Some days it feels like we are playing whack-a-mole with the choices registrants make. We used to believe that II was part of the name in a Name/Title cell – but today one company used it in the title. I have gone too far into the weeds. The point is we need our interns to be really curious and questioning when they are looking at details.

They start off doing tasks that keep our processes running and learning how to question everything they see in one of our dashboards (does that II really belong in the title). At some point we start teaching them Python. When we start teaching them to code – the focus is on learning how to break tasks into the smallest possible step. It takes roughly a year before they are proficient enough to start making independent contributions to our code base. I will give them some goal and when they ask questions I try mostly to make sure they are asking the right question and then rather than giving them the answer I send them into our Experiments channel on Slack or to Stack Overflow.

One of our interns, Michael Pineda, had a really interesting weekend. Michael is a Mechanical Engineering sophomore at the University of Nebraska at Lincoln (UNL). He started with us late in the fall of his senior year in high school. He is a member of the UNL Society of Automotive Engineers (SAE) club. Each year SAE clubs at colleges across the country compete in the Formula SAE. They build a small formula style race car from scratch. The competition includes the presentation of a business case for their car. This weekend they had to get the car in front of alumni for the first public viewing. Michael is a lead on the suspension team. He shared these pictures in Slack with the rest of our team at 5:45 AM Sunday. Here are some pictures of their car:

Suspension

Here is the car:

University of Nebraska Lincoln Formula SAE Racer

Do you think Michael is curious and questioning? He has been at work on a new project that is going to be disclosed about the time we release version 5 (I hope). Michael has been offered a fantastic mechanical engineering internship over the summer. Like all of our interns – he will be leaving us to bigger and better things – but it sure is fun to work with them at this point in their lives.

Between them, two of our interns are competing at the national level in seven academic competitions in the next month (Mock Trial, Speech & Debate, Academic Decathlon . . .). Three of our interns are enrolled in 12 total AP courses this semester. The two who are graduating from high school this year have been offered academic scholarships in excess of $250,000. Our newest intern is keeping up with a challenging academic schedule and doing some amazing things in the long-jump and relays for her high school.

I know I am rambling, but I would be remiss if I did not mention Shelby Lesseig. Shelby was an intern while she was working on her BBA (Accounting)/MPA at UT Austin. I just checked out her LinkedIn profile:

LinkedIn Description of Work with AcademicEDGAR

Shelby came on while I was still learning about the characteristics we really needed in interns and she unwittingly helped me better identify some of those qualities – she set an early standard that we still use today. We still use some of Shelby’s early work to manage our data extraction and normalization processes. Shelby’s husband and his colleagues have actually benefited from some of her work. Is that cool or what!

This post was prompted by Michael’s excitement over his contribution to the race car and me coming to terms with the fact that he is going to be moving on soon. It made me think more about this journey and while it has been challenging at times – I really do think the best part of it continues to be getting to work with such bright interns (I was going to say kids but I don’t want to diminish them at all).

DOCTAGS An Overview

I was answering a client question this morning about limiting search results to particular documents and decided that it was probably time to post here about our DOCTAG filtering.

An SEC filing includes a form and might also include exhibits. In conversation, and generally in writing about filings, we don’t often separate the form from the exhibits, even when our work might be focused on the form rather than the filing (inclusive of exhibits). As part of our process we collect filings from the SEC and then parse each filing to separate the form from the exhibits. We then tag the form and the exhibits to allow you to select, search and manipulate search results based on the type of document.

The tag for the form is the name of the filing (10-K, 10-K/A) with all of the spaces, dashes (-) and slashes (/) removed, so the 10-K becomes 10K, the 10-K/A becomes 10KA, and an SC 13D becomes SC13D. The SEC mandates that exhibits follow a convention with respect to the description field when they are included in a filing (see https://www.law.cornell.edu/cfr/text/17/229.601). We follow the same rule for converting the exhibit description to our DOCTYPE code, except we remove everything to the right of any decimal in the exhibit type field. While filers may have their own internal system that adds meaning to the description field of an exhibit in a filing, that system is not available to us. So when a filer uses EX-10.17, our DOCTYPE code is EX10. Here is an image from an Apple 10-K filing; we coded the 10-K as 10K and the exhibits as EX4, EX10, EX10, EX21 . . .
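The tagging rules above can be sketched as a small normalization function. This is my reading of the rules as stated, not the platform’s actual code:

```python
import re

def doctype_code(label: str) -> str:
    """Normalize a form name or exhibit type to a DOCTYPE-style code:
    drop everything right of any decimal, then remove spaces,
    dashes, and slashes. A sketch of the rules as described.
    """
    # Keep only the part left of any decimal (EX-10.17 -> EX-10).
    label = label.split(".", 1)[0]
    # Remove spaces, dashes, and slashes.
    return re.sub(r"[ \-/]", "", label).upper()

print(doctype_code("10-K/A"))    # -> 10KA
print(doctype_code("SC 13D"))    # -> SC13D
print(doctype_code("EX-10.17"))  # -> EX10
```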

Document List for Apple 2020 10-K Filing

At one time I speculated that Apple’s convention was to indicate the sequential order of the exhibit type included in an SEC filing in the fiscal year. I no longer believe that to be true. While we have seen cases where a particular filer seems to have a coding scheme (10.1X is a debt contract and 10.2X is a compensation-related contract . . .), these practices are internal, not externally driven.

I think I have addressed this before: the best way to begin identifying particular types of contracts is to use the DOCTYPE filter to specify EX10 and then use the XFIRSTWORD search with the within operator (W/#) and keywords that would be expected to appear within some N words of the start of a particular type of contract. For example, (DOCTYPE contains(ex10)) and (xfirstword w/10 (debt or credit)) will return all Exhibit 10s that contain either the word debt or credit within 10 words of the first word in the document.
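A rough plain-Python approximation of that xfirstword filter might look like the sketch below. The function name and the punctuation stripping are my own; the platform’s engine is the practical route, and I read "within 10 words of the first word" as the first word plus the ten that follow.

```python
def matches_first_words(text: str, keywords, n: int = 10) -> bool:
    """Rough stand-in for `xfirstword w/N (kw1 or kw2)`:
    True if any keyword appears among the first word of the
    document and the N words that follow it.
    """
    head = text.split()[: n + 1]
    # Strip common punctuation so "CREDIT," still matches "credit".
    head_lower = {w.strip(".,;:()").lower() for w in head}
    return any(kw.lower() in head_lower for kw in keywords)

print(matches_first_words("CREDIT AGREEMENT dated as of January 1",
                          ["debt", "credit"]))
# -> True
```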

Search is an art; it is important to play around with the span (w/N) and compare the results. I have seen cases where our clients have more than 400 words/phrases to check. That is no problem at all for the parsing engine; you just have to be careful about the grouping of your phrases/terms. When I hear from a client that they can’t get a search to work, invariably it turns out to be a problem with parentheses placement.

As an aside – I really find the Cornell Law link above to be one of the best resources to use when trying to understand what I can expect to find in a filing.