I received an interesting email from a client this morning asking about including a dollar sign in a search to cut down on noise: they wanted disclosures only when a monetary amount was reported in proximity to the search phrase. Rather than share their search I will describe a similar one. Suppose you want to find the amounts reported as expenditures for research and development. A natural starting point would be to search for research and~ development. Note that the ~ appended to a search operator causes the search engine to treat the word as a term, not an operator, so a search for research and~ development returns the phrase research and development. The search research and development (without the ~), though, returns any document with both words (no proximity constraint).
The problem (as indicated in the email) with the search research and~ development is that the phrase could exist in many places in a document without a disclosure of the amounts.
For example Apple used the phrase nine times in their 2017 10-K – most of the hits were noise – The Company believes ongoing investment in research and development (“R&D”), marketing and advertising is critical to the development and sale of innovative products, services and technologies.
If we add a number range constraint to the search we can significantly reduce the noise. Our application does not index dots, dashes, commas, dollar signs . . . but we do index number groups. We can search for ranges of numbers by inserting the lower and upper bounds of the range separated by two ~ symbols.
To achieve the goal of identifying disclosures that might describe the amount of expenditures for research and development I proposed this search (research and~ development) w/10 1~~999. Clearly this search will take longer because the search engine is going to have to inspect every instance of the R&D phrase for proximity to any number in the range 1 to 999. But it will significantly reduce the noise from the first search. The first search yielded 19,391 documents when applied to the 10-K filings filed in 2016-2020. The second search returned only 13,466 documents. The noise is of course not completely eliminated but it is greatly reduced.
To use a number range in your search remember that dollar signs, dots and commas are not indexed. However, the digits are. So a search for a number in the range of $1 to $999,999,999,999 can be reduced to 1~~999, because the leading digit group of any such amount falls between 1 and 999. If a number is found that meets the criteria, only the digits to the left of the decimal or comma will be highlighted (but the full amount can be extracted with the Context Extraction feature).
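To make the logic of the number range search concrete, here is a minimal Python sketch of the same idea outside the application. It is not how our search engine is implemented. The function names, the tokenizer and the symmetric ten-token window are all assumptions made for illustration; the only things taken from the discussion above are that punctuation and dollar signs are dropped, that digit groups are kept, and that we want the phrase within ten words of a number from 1 to 999.

```python
import re


def tokenize(text):
    # Keep only runs of letters or runs of digits. This mimics an index that
    # drops dollar signs, commas, dots and dashes, so "$1,234.5" becomes the
    # three digit groups "1", "234" and "5". (Assumed behavior, for illustration.)
    return re.findall(r"[A-Za-z]+|[0-9]+", text.lower())


def phrase_near_number(text, phrase="research and development",
                       low=1, high=999, window=10):
    """Return (token_position, numbers) for each occurrence of `phrase`
    that has a digit group between `low` and `high` within `window` tokens."""
    tokens = tokenize(text)
    words = phrase.lower().split()
    n = len(words)
    hits = []
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] != words:
            continue
        # Tokens up to `window` positions before and after the phrase.
        start = max(0, i - window)
        end = min(len(tokens), i + n + window)
        nearby = tokens[start:i] + tokens[i + n:end]
        numbers = [int(t) for t in nearby
                   if t.isdigit() and low <= int(t) <= high]
        if numbers:
            hits.append((i, numbers))
    return hits
```

Calling phrase_near_number on a sentence that only mentions the phrase returns an empty list, while a sentence that reports an amount next to the phrase returns the position of the phrase and the digit groups that fell in the range. Note that a year such as 2017 is ignored because it is outside 1 to 999.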
I received an interesting email from a client this week. They are trying to match comment letters (SEC Form UPLOAD) to the subject 10-K filing. The timing was perfect because I wanted to find a unique use case for a presentation at the University of Nebraska at Omaha to highlight how directEDGAR’s feature set can really accelerate data collection – especially if you are handy with coding in Python. I made a claim in the announcement for the presentation that with directEDGAR I could accomplish something probably ten times faster than relying just on Python.
The nice thing about comment letters is that they generally identify the filing in the subject area of the letter. I decided to begin this work by running a search over all comment letters for FORM 10 K* or FORM 10K*. I used the 10 K* to make sure I captured amendments and 10-KT. The FORM 10K* search was to capture any 10KSB, 10-KT and 10K405 filings. To make sure the search was focused on 10-K forms and variants as the subject of the comment letter I restricted the search to those where the reference appeared within 200 words of the first word of the document. My final search was xfirstword pre/200 ((form 10 K*) or (form 10K*)). This search identified 79,213 comment letters.
Search Results Displaying Comment Letters Relating to 10-K filings
You can see in the image above I found 79,213 comment letters in my older copy of our filings in one minute and 15 seconds. I observe in those comment letters that the FORM reference is immediately followed by the date the filing was made. I want to capture that and I don’t want to spend a chunk of time to do this. One of the features of our application is that we create a text (txt) version of every document and load it into the index. This is good because most of the comment letters are pdf files. I do not need to mess with a library in Python to extract the text – I can just use the EXTRACTION\DocumentExtraction TextOnly feature of the application to dump a well named text only version of each letter into a folder on my computer.
That took about fifteen minutes – at the end of the fifteen minutes I had a text version of the 79K+ documents available on my desktop ready for use with Python. Below is an image of the directory – each file is named using our CIK-RDATE-CDATE-stuff convention so that is useful for making sure I have an audit trail back to the comment letter.
Text Version of Comment Letters Ready for Processing
I want to scan some of these to make sure I understand their structure. Thus I used the SmartBrowser to quickly review the output.
Reviewing the Text Version of the Comment Letters Using the SmartBrowser
Now I am ready to write some code to parse out the specific filing type and the filed date from the filing. I am fairly new to Python 3.9 – we anchored on 2.9 many moons ago. Since I am expecting most of our users are using one of the more modern versions of Python I wrote the code in 3.9.0. My general strategy is to read the first 30 lines of each text file as a list, find the line that begins with the word FORM, and confirm it has 10-K (actually I am going to look for FORM 10). I will then inspect the lines that follow to see if one begins with the word FILED. If I find the word FILED I will split and try to convert the remainder of the line into a date. I am also going to convert the date in the line that has FORM 10-K into a date object as well – this is our CDATE, better known as the balance sheet date. Below is an image of the lines I processed for this particular filing. I spent approximately 30 minutes on the code – and it is not perfect but my goal in this experiment is to demonstrate how our platform with your skills can accelerate your work.
Parsing Example from Reading Comment Letters
Notice – I am trying to avoid going into the weeds here. If you get a chance to replicate what I am doing you will see why I made the decisions I did to identify the relevant dates as well as allow for the possibility that the RDATE could be on the next line after the line that described the FORM or the second line after.
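For readers who want a starting point before opening my rough code, here is a minimal sketch of the strategy described above. It is not the code I ran. The function names, the regular expressions, and the assumption that dates are written like December 31, 2008 are mine for illustration, and the sample letter lines in the usage note are invented.

```python
import re
from datetime import datetime

# Assumed date style: "December 31, 2008" (comma optional).
DATE_RE = re.compile(r"([A-Za-z]+)\s+(\d{1,2}),?\s+(\d{4})")
# Assumed subject-line style: the line starts with FORM 10-K, FORM 10 K, FORM 10K...
FORM_RE = re.compile(r"FORM\s+(10[-\s]?K[A-Z0-9/]*)")


def parse_date(text):
    """Return the first 'Month D, YYYY' date found in text, or None."""
    for m in DATE_RE.finditer(text):
        candidate = f"{m.group(1)} {m.group(2)} {m.group(3)}"
        for fmt in ("%B %d %Y", "%b %d %Y"):
            try:
                return datetime.strptime(candidate, fmt).date()
            except ValueError:
                continue  # not a real month/day, e.g. a typo in the letter
    return None


def parse_letter_head(lines, max_lines=30, lookahead=2):
    """Scan the first max_lines lines for a FORM 10-K line and a FILED line.

    The FILED line may be the next line or the second line after the FORM
    line (lookahead=2). Returns a dict, or None if no FORM line is found.
    """
    head = [ln.strip() for ln in lines[:max_lines]]
    for i, line in enumerate(head):
        m = FORM_RE.match(line.upper())
        if not m:
            continue
        filed = None
        for nxt in head[i + 1:i + 1 + lookahead]:
            if nxt.upper().startswith("FILED"):
                filed = parse_date(nxt[len("FILED"):])
                break
        return {"form": m.group(1), "cdate": parse_date(line), "rdate": filed}
    return None
```

To use it, read a letter with something like open(path, errors="ignore").read().splitlines() and pass the list to parse_letter_head. Dates that cannot be converted come back as None, which is a convenient way to flag the letters with typos for manual review.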
I want to observe – these documents were created by humans – there are errors – some of the dates are outside the bounds of what would be expected:
Typo in Comment Letter
Now that I have the balance sheet date (CDATE) and the dissemination date I am ready to access the actual filings. We have a CIK-DATE search feature on our application that will limit the results to documents/filings made by a unique list of CIK-DATE pairs – we can set a window around the date parameter if we choose. Since the SEC’s filing date may deviate in some cases from the dissemination date I am going to set a five day window.
Setting Application Parameters to Find Specific 10-Ks referenced in Comment Letters.
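If you want to check the matching logic yourself, the idea behind a CIK-DATE filter with a window can be sketched in a few lines of Python. This is only an illustration, not the application's implementation: the function names are hypothetical, I am assuming the window is symmetric around the date, and the filings are represented as simple (CIK, date, document id) tuples.

```python
from datetime import date


def in_window(filing_date, target_date, days=5):
    # True when the two dates are no more than `days` apart (either direction).
    return abs((filing_date - target_date).days) <= days


def match_filings(requests, filings, days=5):
    """Keep filings whose CIK is requested and whose date is within the window.

    requests: iterable of (cik, target_date) pairs
    filings:  iterable of (cik, filing_date, doc_id) tuples
    """
    targets_by_cik = {}
    for cik, target in requests:
        targets_by_cik.setdefault(cik, []).append(target)
    return [
        (cik, filed, doc_id)
        for cik, filed, doc_id in filings
        if any(in_window(filed, t, days) for t in targets_by_cik.get(cik, []))
    ]
```

With a five day window, a request for a made-up CIK 1111111 and February 27, 2009 would keep a filing dated March 2, 2009 for that CIK and drop one dated March 10, 2009.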
The search/filtering process on this type of search can take a bit longer since we have to filter on CIK and date – but we are way ahead – in the test I just ran it took 12 minutes. So roughly a bit less than two hours invested. I am not going to claim that this is the end point. There were some exceptions I could clean up – but my goal was to demonstrate the possibility. In summary I spent less than two hours to go from an idea to a result. You can see in the image below – I can ‘touch’ the filing mentioned in that comment letter above (as well as all of Abbott’s other 10-K filings that were the subject of a comment letter).
I am going to share the code here. I am brand new to 3.9 and some of the conventions are different from 2.9 so if you are an expert 3.9 coder and want to improve the code – share your results. Otherwise – for those of you just starting out and want to play with the interaction of directEDGAR and Python this seems like a worthwhile place to muck around. My goal with this exercise was to highlight another way our platform can accelerate your data collection if you just think outside the box.
Rough Code to Parse Comment Letters
This post is a bit wonkish – but update instructions are near the end of this post. We made an update to our CIK mapping file late last week. This is the file the application uses to retrieve filings by companies that have completed some reorganization that triggers filings under a new CIK. Our file maps between the new and the old CIK so if you are trying to match data based on an old CIK but the registrant is filing under a new CIK the application will retrieve the data you requested even if you are using the new (or old) CIK.
For example – the entity now known as The Walt Disney Company files under the current CIK 1744489. Prior to their acquisition/merger with subsidiaries of The Fox Corporation they (Disney) made filings under CIK 1001039 (from late 1995 until late 2019). Prior to 1995 they filed under CIK 29082.
Our mapping file was developed to anticipate you collecting some data from another source that might have one or more of these CIKs as the key and you wanting to match that data to some data you would anticipate finding in their filings. If you use the CIK filtering feature or if you retrieve any of our preprocessed data based on CIK – the application interface will have a box to check asking whether you want to use historical CIKs in your search/retrieval. The image below shows the check box to select if you were to run a search and included a CIK filtering file.
Include Historical CIKs Check Box
It is your option to determine if you want only data for your CIK file or if you want us to augment your CIK list with the values from our mapping file. If you select the Include Historical CIKs option the application will augment your CIK list with all of the additional CIKs that have been associated with the entity. So for example, if you have CIK 1744489 in your sample the application will automatically add CIK 29082 and CIK 1001039 to the in-memory version of your list as it processes the list for the task. If you have a request file that has CIK 1744489 and the YEAR values of 2021, 2020, 2019 and 2018 – the application will extend your list to include each of the additional CIKs and the YEAR values you identified for your original CIK list. To make this clear – the image below has the original request in black and the extended list as determined by the application in red.
CIK/YEAR request augmented by the application
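The same augmentation can be sketched in Python if you want to see which pairs would be added before you run a request. This is a rough illustration only: the function names and the way the mapping is stored (a list of CIK groups) are my assumptions, not the format of our mapping file. The three Disney CIKs are the ones described above.

```python
# One group per entity; each group lists every CIK associated with it.
# (Assumed structure, for illustration only.)
CIK_GROUPS = [
    [1744489, 1001039, 29082],  # The Walt Disney Company
]


def build_lookup(groups):
    """Map each CIK to the full group of CIKs for the same entity."""
    lookup = {}
    for group in groups:
        for cik in group:
            lookup[cik] = group
    return lookup


def augment(requests, lookup):
    """Extend (cik, year) pairs with the related CIKs for the same year.

    The requested CIK comes first, followed by its historical/successor CIKs.
    Duplicates are dropped and the original request list is left untouched.
    """
    seen = set()
    extended = []
    for cik, year in requests:
        group = lookup.get(cik, [cik])
        related = [cik] + [c for c in group if c != cik]
        for c in related:
            pair = (c, year)
            if pair not in seen:
                seen.add(pair)
                extended.append(pair)
    return extended
```

For a request of CIK 1744489 with YEAR values 2021 and 2020, augment returns six pairs: the two you asked for plus the same years for CIK 1001039 and CIK 29082. A CIK that is not in the mapping is passed through unchanged.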
The values in red are not added to your version of the list – but they are used by the application. However, any missing values (in red or black) will be reported in the missing_cik_year_pairs.csv file at the end of the search/extraction. Sorry for getting lost in the details – but they are important. The real reason for this post is to make sure you remember to periodically update the mapping file on your version of directEDGAR and since we just updated the file this is a perfect time for you to update yours – the process is simple.
From the File menu select Options and then select Update Historical CIK. Press the Perform Update button.
Options panel – Update Historical CIK
The application will call home, license validity will be established and our server will return a copy of the latest mapping file which will be saved for use by the application. When the process is complete (usually a second or two) a confirmation message will appear.
Update Successful response message
We are still working/struggling to communicate with you about the results – which current CIKs mapped to an older CIK so you are not fully surprised by the fact that you asked for say the MDA from CIK 1744489 from 2018 but instead you received the MDA from CIK 1001039. The challenge is how to do this in real time – for instance APA (CIK 1841666) became the successor registrant to Apache Corp (CIK 6769) on March 1, 2021. While adding the CIK matching to the mapping file is trivial, it is much more complicated to go back and embed a new CIK in all of the prior documents.