DB Document Results Filter is More Versatile than It Seems

Our directEDGAR platform offers at least ten ways to filter search results. The effect of most of these is obvious (DATE, CIK, CIK/DATE, 8-K FILING REASON, DOCTYPE, SICCODE, CNAME, . . .). I used the DB DOCUMENT RESULTS filter today and realized that I have not done a good enough job of describing at least one important use case.

I had an email from a PhD student who was trying to collect a table from some specific filings and needed some guidance. I want to obscure their data, so the following example is based on snipping the Executive Compensation (EC) table. I encourage our users to start table snipping with a really narrow and explicit focus. The idea is that the narrower the focus, the less noise in the output. For example, our process for identifying EC tables starts by requiring the table to have PRINCIPAL, NAME, YEAR, POSITION, SALARY, EQUITY, AWARDS, OTHER and TOTAL. I do not have the test numbers in front of me, but my recollection is that fewer than 1% of the tables found with this set of requirements are Type I errors (false positives). And that is the heart of what I want to explain. It is absolutely true that I will not find an EC table in every DEF 14A with those requirements, but my results will be very clean. Now the problem is to collect tables from the filings that we missed on the first pass. For example, a reasonable proportion of filers do not use NAME, PRINCIPAL POSITION or YEAR in the column headings.
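To make the strict first pass concrete, here is a minimal sketch of an all-terms-required check on a table's column headings. It is not our parser's actual code; the function names and the sample headings are made up for illustration, and only the list of required words comes from the description above.

```python
import re

# Required words from the strict first pass described above.
STRICT_TERMS = {"PRINCIPAL", "NAME", "YEAR", "POSITION", "SALARY",
                "EQUITY", "AWARDS", "OTHER", "TOTAL"}


def header_words(header_cells):
    """Collect the upper-cased alphabetic words found in the header cells."""
    words = set()
    for cell in header_cells:
        words.update(re.findall(r"[A-Z]+", cell.upper()))
    return words


def has_all_terms(header_cells, required=STRICT_TERMS):
    """Return True only when every required word appears in the headings."""
    return set(required) <= header_words(header_cells)


# Hypothetical headings, for illustration only.
labeled = ["Name and Principal Position", "Year", "Salary ($)",
           "Stock Awards ($)", "Non-Equity Incentive Plan Compensation ($)",
           "All Other Compensation ($)", "Total ($)"]
unlabeled_year = [""] + labeled[:1] + labeled[2:]  # same table, no YEAR label

print(has_all_terms(labeled))         # True
print(has_all_terms(unlabeled_year))  # False
```

The second table is a perfectly good EC table, but the strict check rejects it because the YEAR column has no label. That is the trade-off: very few false positives, at the cost of missing some filings.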

When our table parser finishes, one of the outputs is a CSV file named MISSING. The file contains a single column labeled FILENAME, which lists the files from our search results where the application did not identify a table. Here is a screenshot of the file.

If we rename the FILENAME column to DE_PATH, we can then use that file to filter our previous search down to just those documents. That is important: presumably our original search set out some important parameters, and we want to focus our data collection on just those firms. So I renamed the FILENAME column to DE_PATH and saved the missing file. I then selected the DB Document Results Filter by activating the Use DB check box and then clicking the Set DB File button.
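The rename is easy to do by hand in a spreadsheet, but if you are iterating it may be quicker to script. Here is a minimal sketch using Python's standard csv module. The file names MISSING.csv and MISSING_DE_PATH.csv are assumptions on my part; use whatever names and locations your output actually has.

```python
import csv


def rename_column(src_path, dst_path, old="FILENAME", new="DE_PATH"):
    """Copy a CSV file, renaming one header column; return the data row count."""
    # utf-8-sig tolerates a byte-order mark if a spreadsheet program added one.
    with open(src_path, newline="", encoding="utf-8-sig") as src:
        rows = list(csv.reader(src))
    if not rows or old not in rows[0]:
        raise ValueError(f"No column named {old!r} in {src_path}")
    rows[0] = [new if name == old else name for name in rows[0]]
    with open(dst_path, "w", newline="", encoding="utf-8") as dst:
        csv.writer(dst).writerows(rows)
    return len(rows) - 1


# Example call (file names are hypothetical):
# rename_column("MISSING.csv", "MISSING_DE_PATH.csv")
```

Writing to a new file rather than overwriting keeps the original parser output intact in case you want to compare passes later.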

Once I hit the Okay button and then Perform Search, the results are filtered to only those filings/documents that were listed in the CSV file. Here is the result of that search. Now I can review these presumably relevant filings to work out how to identify the EC table using less restrictive criteria. I brought the focus to a filing where the EC table did not have a label for the YEAR column.

I would have to dig through our system to say this with absolute authority but my recollection is that there are some tables where the only way we can grab them is by using the words SALARY and TOTAL. However, if we were to use only those words to identify tables on the first pass then we would get a huge amount of noise.
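Here is a rough sketch of that two-pass idea: apply the strict word list to every filing, then apply the loose SALARY and TOTAL test only to the filings the strict pass missed. This illustrates the logic only, not our production code. The function names, the file names and the sample headings are all hypothetical.

```python
import re

STRICT = {"PRINCIPAL", "NAME", "YEAR", "POSITION", "SALARY",
          "EQUITY", "AWARDS", "OTHER", "TOTAL"}
RELAXED = {"SALARY", "TOTAL"}


def matches(header_text, terms):
    """True when every term appears as a word in the table's heading text."""
    return terms <= set(re.findall(r"[A-Z]+", header_text.upper()))


def two_pass(headers_by_file):
    """headers_by_file maps a file path to the heading text of each of its tables."""
    found, missing = {}, []
    # Pass 1: strict criteria on every filing.
    for path, headers in headers_by_file.items():
        hits = [h for h in headers if matches(h, STRICT)]
        if hits:
            found[path] = hits
        else:
            missing.append(path)
    # Pass 2: relaxed criteria, but only on filings the first pass missed.
    still_missing = []
    for path in missing:
        hits = [h for h in headers_by_file[path] if matches(h, RELAXED)]
        if hits:
            found[path] = hits
        else:
            still_missing.append(path)
    return found, still_missing


# Hypothetical data, for illustration only.
ec = ("Name and Principal Position Year Salary Stock Awards "
      "Non-Equity Incentive All Other Compensation Total")
sample = {
    "a.htm": [ec, "Base Salary Bonus Total"],             # EC table plus a noisy one
    "b.htm": ["Salary Bonus Stock Awards Total"],         # EC table, sparse labels
    "c.htm": ["Fees Earned or Paid in Cash"],             # nothing usable
}
found, still_missing = two_pass(sample)
```

In this sample, the noisy second table in a.htm never reaches the output because the relaxed test is not run on filings that already matched. If the relaxed test had been applied everywhere on the first pass, that table would have come through as noise.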

We have some obligations to deliver normalized Executive and Director compensation data by 7:00 PM each filing day, so we absolutely cannot waste time having a person trudge through noisy data. All of this is automated now, but our process started with what I described above. We are getting ready to add two new tables to our system. The collection of these tables will be automated. But the way we learn to automate it is to begin by manually snipping and reviewing tables as they exist, and then identifying the characteristics that will produce the least noise in the output on the first pass. For me to have this coded, I have to be able to describe how to filter the tables and settle on the exact criteria to apply. The only way I can do that is to review filings, snip some tables and then review what was missed to understand why it was missed. I use the DB Document Results Filter repeatedly as I iterate through the process.

If the world were perfect, we could use science alone to do this work. Filings are created by humans and thus are not perfect, so we have to bring some tenacity to this process. Features like the DB Document Results Filter reduce at least some of the friction associated with being persistent.
