We are preparing to drop a fairly comprehensive Audit Fee database. It has been a long time coming, but one of the real challenges is that there can be some wonky data and it is what it is. I am spending time trying to devise tests to evaluate the quality of our parsing and data assignment code. One of the tests I dreamed up checks whether the date of the audit report precedes the dissemination date by more than 10 days. Here is a screenshot of the audit report for CSW Industrials (CIK 1624794) for the year ended 3/31/2022.
Here is a screenshot of the auditor’s signature:
It is not a huge issue; the filing was made on 5/18/2022 and perhaps someone felt harried when doing the final checks. For us, though, it raises a significant question – when do we have liberty to change data after it is reported? I am so reluctant to make changes because that seems like a slippery slope. What is the source of truth? As an aside, the auditor submitted an Exhibit-18 describing the change from LIFO to FIFO that was filed with the 10-K. That exhibit is dated 5/18/2021.
We have identified about 100 different cases of similar/adjacent issues. Trying to validate that what we are doing is correct is our primary focus. A secondary issue though is to decide what to do. As I write this I am more inclined to use the 5/18/2021 date just because the 10-K should be the source of truth.
We just updated the TableExtraction and Dehydrator code to create new outputs.
For TableExtraction we added a new search_phrase.txt file that is only generated when you use the Search Query option to identify tables for extraction. The file contains both your original set of parameters as well as the transformation(s) we apply to generate the code that is then delivered to the TableExtraction engine. Here is a sample:
Input: (acquired or price or assumed) and (goodwill and intangible and asset and liabil)
Parsed: AND(OR(Contains('acquired'), Contains('price'), Contains('assumed')), AND(Contains('goodwill'), Contains('intangible'), Contains('asset'), Contains('liabil')))
The Input line reports the text you submitted and the Parsed line reports how that text was transformed. We hope this gives you more visibility into why the results contained (or did not contain) particular tables.
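To make the Parsed form concrete, here is a minimal Python sketch (an illustration only, not the TableExtraction engine's actual code) showing how the nested AND/OR/Contains expression reads as a set of substring tests:

# Each Contains() is a case-insensitive substring test; AND/OR combine the tests.
def Contains(term):
    return lambda text: term.lower() in text.lower()

def AND(*preds):
    return lambda text: all(p(text) for p in preds)

def OR(*preds):
    return lambda text: any(p(text) for p in preds)

query = AND(OR(Contains('acquired'), Contains('price'), Contains('assumed')),
            AND(Contains('goodwill'), Contains('intangible'),
                Contains('asset'), Contains('liabil')))

sample = ("The purchase price was allocated to goodwill, intangible assets, "
          "and liabilities assumed at the acquisition date.")
print(query(sample))  # True -> a table containing this text would be extracted

A table has to satisfy every Contains() inside the inner AND and at least one inside the OR in order to be snipped.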
We also redesigned the logging to create a new log file named SnipSummary_DATE_TIME.csv. This file contains all of the details from the input file and we have added a column named COUNT to report to you the number of snips that were extracted from each of the source files. Here is a screenshot of this file (I hid all of the metadata columns except for CIK).
The intention is to give you clearer visibility into the results. In the example above I was snipping purchase price allocations, and there is not necessarily an upper limit on the number of those tables that might be reported in any particular 10-K. However, for many of the snips above the snipped tables included the Statement of Cash Flows (as it included all of the strings I set as my parameters). I discovered that easily by initially reviewing the snips from those CIKs. There are many cases where you expect only one table from a filing – in that case you might see counts of 2/3/4 and you can identify those to review.
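If it helps, here is a hedged sketch (assuming pandas; the SnipSummary file name below is a placeholder) of how you might pull out the rows to review when you expect a single table per filing:

import pandas as pd

# Load the snip summary produced by TableExtraction.
summary = pd.read_csv("SnipSummary_20250101_120000.csv")

# When you expect one table per filing, anything with COUNT > 1 is worth a look.
to_review = summary[summary["COUNT"] > 1]
print(to_review[["CIK", "COUNT"]])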
The MISSING.csv file has also been modified. In the last build the missing file (the list of documents for which no tables were found) was a txt file. It is now a csv file. All of the metadata from the original file is present and there is an additional column, DE_PATH. That column exists so that you can run a search focused on just those documents. There is a demonstration of how to use the MISSING.csv file to run that search in this video (the search example begins at the 4'28" mark). What is key here is that you can run the exact same search you ran initially, but the output is limited to only these specific documents.
Finally, another update was made to the Dehydration output. Malformed tables are now saved in a PROBLEM_TABLES subdirectory with an adjacent csv file that has the same name as the snip. The csv file contains all of the usual metadata from Dehydration/Rehydration and then we have parsed each line so that the content from the td/th elements ends up in a single cell. Here is a screenshot of this:
As you can see from that real example, this file will be easy to prepare for your data pipeline. You would just rename COL3, COL5 and COL7, delete COL2, COL4 and COL6, and then delete the two rows below (FISCAL 2022 DIRECTOR COMPENSATION and the row that has all of the original column headings). Earlier we stacked these in one csv file, but after working with that for a while we believe this output is much easier to work with.
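For those of you working in pandas, a minimal sketch of that cleanup might look like the following (the file names, the new column names, and the assumption that the two junk rows are the first two rows of the frame are all placeholders for this example):

import pandas as pd

# Placeholder file name; each csv in PROBLEM_TABLES carries the same name as its snip.
df = pd.read_csv("PROBLEM_TABLES/example_snip.csv")

# Drop the unneeded columns and give the remaining value columns names of your choosing.
df = df.drop(columns=["COL2", "COL4", "COL6"])
df = df.rename(columns={"COL3": "FEES", "COL5": "STOCK_AWARDS", "COL7": "TOTAL"})

# Drop the table title row and the row repeating the original column headings,
# assumed here to be the first two rows of the frame.
df = df.iloc[2:]

df.to_csv("example_snip_clean.csv", index=False)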
As a side note – after I saw the results with the SCF tables – I deleted all of the tables and output from that run and changed my Search Query to
Input: (acquired or price or assumed) and (goodwill and intangible and asset and liabil) not (invest or financ or operat)
This modification reduced the noise in my output.
We are frantically trying to wrap up a complete auditor/audit fee database. We need to plug a pretty significant hole with respect to the AUDITOR, TENURE, REPORT DATE and LOCATION for 10-K filings made in 2020. Why only that year? That is another story.
I ran a search for ((DOCTYPE contains(10k*)) AND date(1/1/2020- 12/31/2020) pre/5 we have served). In English, I am looking for cases where a date in the range of 1/1/2020 through 12/31/2020 precedes the phrase we have served by no more than five words. Further, I wanted to limit my search to only 10-K or 10-K/A documents. (Note – most auditors report their tenure before the report date – for those searches we reversed the order of the date and the we have served.) Below are the results of the search where the date precedes the tenure.
I set a box around the Baltimore Gas & Electric Co. (BG&E) summary to highlight that there were 55 matches in the document. There were 55 matches in that filing because it has nine cofilers (and because what counts as a match is a bit more complicated). Here is a screenshot of the context extraction from that search with a focus on the lines belonging to BG&E. As you can see, PricewaterhouseCoopers served as the auditor for each of the subsidiaries included (the parent company is Exelon Corp). However, the context does not identify the entity to which the block of text refers.
As a researcher you have to decide if you want this data. If you need it then you will actually find 81 total rows in the context extraction results (9 10-K filings with 9 disclosures in each 10-K filing). Unfortunately, the only way to identify the correct disclosure per CIK is to dig back into the filing. It turns out that the disclosure that belongs to BG&E is the first one in the list above with the phrase since at least 1993.
I actually imagine in many cases that these particular disclosures are not important because you are likely to eliminate these types of filers from your sample. Again though, if they are important it is critical to review the duplicates carefully to match the data to the Central Index Key that it is associated with.
As a random aside – it is/was interesting to me to observe that most of the cases with this form of the disclosure (Auditor, location, date and then tenure) were found in filings with COFILERS. The more common form of the disclosure (as pictured below) had many fewer total COFILERS in the results.
I ran these searches separately to simplify parsing the data (AUDITOR, TENURE, LOCATION CITY, LOCATION STATE, REPORT DATE) from the context. I am lazy; I wanted more certainty about the order I was expecting the analysis to follow so the parsing code had less to test.
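For what it is worth, here is a rough sketch of the kind of ordered extraction that becomes possible once you pin down the form of the disclosure. Everything here is illustrative – the context string is made up, and a production pattern has to tolerate far more variation:

import re

# Assumes the order: auditor signature, city, state, report date, tenure phrase.
context = ("/s/ PricewaterhouseCoopers LLP Chicago, Illinois February 11, 2020 "
           "We have served as the Company's auditor since at least 1993.")

pattern = (
    r"/s/\s*(?P<auditor>.+)\s+"
    r"(?P<city>[A-Z][A-Za-z.\s]+?),\s*(?P<state>[A-Z][A-Za-z\s]+?)\s+"
    r"(?P<report_date>[A-Z][a-z]+\s+\d{1,2},\s*\d{4})\s+"
    r"we have served as the company['’]s auditor since\s+(?P<tenure>[^.]+)"
)
match = re.search(pattern, context, re.IGNORECASE)
if match:
    # Pulls out the auditor, city, state, report date and tenure as named groups.
    print(match.groupdict())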
Much earlier in the week I posted about Excel auto-populating a new mapping based on the column headings found in a collection of tables that report the Equity Compensation Plan Information disclosure required of SEC Filers. Today, I am not frustrated but I do want to share my experience – particularly since I know some folks are all into AI.
I am still working on that table and I am reviewing the labels before we start discussing moving this table into production. We are too small to have a human compare the original tables with the mappings, so this step has to be handled with significant care. We have 908 original column headings pulled from about 1,600 original tables. These column headings should generally map into three semantic concepts: the number of shares to be issued, the weighted-average exercise price of any options reflected in the number of shares, and the remaining number of shares available for future issuance. In about 2% of the cases there are four columns of data presented, but the overwhelming structure is consistent with what is set out in the CFR section which mandates this disclosure.
I wanted to do a test to accomplish two goals. First, I wanted an independent review of the mappings. Since I had completed 909 rows, I thought asking Manish to review it would be worse than drudgery. I also wanted to evaluate whether or not I could have saved the approximately four hours the original mapping took. In other words, could I trust this tool for this task? I spent about an hour outlining the problem to a very popular AI tool and submitted the list of column headings. Below is a screenshot of the original labels as parsed from the disclosures.
I then merged the output from that process with my original labels to identify any differences. Here is a screenshot of the cases where the tool's mapping did not match my original mapping.
This is not horrible but it is concerning. My instructions described three main concepts – TO_BE_ISSUED, AVERAGE_EXERCISE_PRICE and REMAINING. I did not recall any heading that seemed to map into a concept of OUTSTANDING, and I do not think any of the above are real examples of cases where, in this context, OUTSTANDING is a better choice than TO_BE_ISSUED. The first two do contain the word OUTSTANDING, but I would argue with an editor about the word choice if I saw OUTSTANDING used in this context.
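As an aside on mechanics, the comparison itself was simple. Here is a hedged sketch (the file and column names are placeholders) of the merge used to surface the disagreements:

import pandas as pd

# Assumed layout: both files are keyed on the original heading text (ORIG_LABEL),
# with my label in MY_MAP and the tool's label in AI_MAP.
mine = pd.read_csv("my_mappings.csv")
tool = pd.read_csv("ai_mappings.csv")

merged = mine.merge(tool, on="ORIG_LABEL", how="outer")
differences = merged[merged["MY_MAP"] != merged["AI_MAP"]]
differences.to_csv("mapping_differences.csv", index=False)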
I asked the AI tool why, and its response really seemed to be more of a justification than a valid answer. Here is an excerpt:
Step B — But the phrase also contains both a past and future state
“issued or to be issued”
This is a compound construction:
“issued” → past
“to be issued” → future
In equity-comp language, humans sometimes use:
“issued and outstanding” interchangeably (sloppily)
“issued or issuable” to refer to currently outstanding awards
The diagnostic logic (incorrectly, in hindsight) treated:
“issued OR to be issued”
as a state aggregate rather than a flow category
That aggregate was internally labeled OUTSTANDING — meaning:
“currently existing instruments, whether already issued or issuable”
Maybe an English major would draw the same conclusion – I’ll have to ask one.
Unfortunately, I did not accomplish what I set out to accomplish – an evaluation of my original labels. So I took a different tack: I provided my mappings and asked my trusty intern (the AI tool) to evaluate them. According to my intern, my mappings were perfect. I doubted that; any reader of this blog knows I am an expert error maker. So I went through the mappings line by line. Hugely tedious, but I found two errors. One case where I mapped “OF SECURITIES REMAINING AVAILABLE FOR FUTURE ISSUANCE UNDER EQUITY COMPENSATION PLANS (EXCLUDING SECURITIES REFLECTED IN COLUMN )” to “TO_BE_ISSUED”, and one where I mapped a label that communicated information about the exercise price into REMAINING.
This is interesting to me because I am trying to figure out the value of somehow integrating AI into our client-facing tools. I understand there is no joy in doing the original mapping and then checking it. Wouldn’t it be hugely amazing to offload that work completely? I just don’t see it yet. I think we perhaps need to streamline (pre-map) the original labels with better cleaning, so maybe my 908 rows could be reduced to 300 or so. But if I have to stand by the results (in my case it is the quality of the data we deliver; in yours it might be a really interesting and novel research finding that you want to publish), I am not ready to have any of the existing tools take over data cleaning. They are a huge help in other ways, but when it comes down to verifying the accuracy of data, I think we have some ways to go.
I will say there is probably some significant value in letting the tools do the heavy lifting. If in fact you can take the time to carefully explain the nature and features of the disclosures and you dump a significant list, then you are likely to save some initial time. But I would also think about how to audit the results.
We have been working to add some new data to the platform – specifically the breakdown of the disclosure of the Equity Compensation Plan Information (EQCPI). I have been evaluating the tests we need to identify the cases that need to be inspected. As good as our parser is, we still get some bad parses, and we need to be able to systematically identify those and shunt them to a review queue for a person to look at. For context, with Executive or Director Compensation, if the sum of the reported compensation differs from the reported total by more than $10,000, a human looks to see if the issue can be addressed (maybe a value is reported as 123.456 and on closer examination it appears that it should have been reported as 123,456).
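A minimal sketch of that tolerance test (the component values below are placeholders; the $10,000 threshold is the one described above):

def needs_review(components, reported_total, tolerance=10_000):
    # Flag for human review when the parts do not sum to the reported total.
    return abs(sum(components) - reported_total) > tolerance

# A 123.456 that should have been 123,456 trips the check; the corrected value does not.
print(needs_review([500_000, 250_000, 123.456], 873_456))  # True
print(needs_review([500_000, 250_000, 123_456], 873_456))  # False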
What we are seeing in the EQCPI disclosures are errors in the disclosure where we can’t initially be certain of the source of the error. Here is what I saw in a summary file after I ran the code we are working on to identify potential errors.
This was reported in a proxy filing made by OXFORD INDUSTRIES (CIK: 75288). If I were collecting this data for research I would be concerned about the reliability of these values since there is no total for TO_BE_ISSUED. However, this is what was reported (other than the normalization choices we made) as you can see in the next image.
So while this merits review, the captured data accurately represents the disclosure that was made. For internal purposes, though, I need to be more confident that the data was correctly captured. I have decided that we should stop dropping dashes and dash-like characters and instead replace them with something consistent. Currently we replace dashes with nothing. I am leaning toward replacing them with something clever like wasDASH. I am also thinking about replacing blanks (the value for total under Number of Securities . . .) with wasBLANK.
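To be concrete, here is a minimal sketch of the substitution I have in mind (the sentinel names come straight from above; the cleaning function and the set of dash characters are illustrative, not exhaustive):

# Hyphen, en dash, em dash and minus sign are treated as dash-like here.
DASH_LIKE = {"-", "\u2013", "\u2014", "\u2212"}

def normalize_cell(value):
    text = "" if value is None else str(value).strip()
    if text == "":
        return "wasBLANK"
    if text in DASH_LIKE:
        return "wasDASH"
    return text

print(normalize_cell("\u2014"))   # wasDASH
print(normalize_cell(""))         # wasBLANK
print(normalize_cell("123,456"))  # 123,456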
The advantage of doing so is that we can reduce significantly the cases we have to review. If we do this internally, no big deal. However, I was thinking about whether or not this would benefit you when you are normalizing tabular data. Here is a screenshot of what I am considering.
In some ways this might seem excessive, but I don’t know what I don’t know. My point is that to get it right we have to consider that something happened between extraction and dehydration. My lovely wife will tell you quickly that I am not perfect; the point I am trying to make is that some of these html source files can be in a shape and structure we have never seen before, and so the error could be that one of the assumptions embedded in our code is wrong for this case.
The next time I send out an update I am going to survey you as to whether or not we should make this part of the normalization so that you see these values when the tables are rehydrated.
Is it an age thing – I really dislike all of the ‘AI’ junk that invades my work! How about when you are trying to compose an email and your provider happily types in front of you? Today I am normalizing column headings extracted from tables that report the status of shares to be issued and available in Equity Compensation Plans (often reported in ITEM 12 of the 10-K). Step 1 is tedious because of slight but potentially meaningful differences in column headings, even though the actual column headings are specified by 17 CFR 229.201.
I want to map the standard column headings into TO_BE_ISSUED, AVERAGE EXERCISE PRICE and AVAILABLE. So I am carefully reading the original column headings and making the determination. I accidentally hit enter when Excel analyzed what I was typing, determined that I was typing AVAILABLE, and generously populated the cells below – with nonsense no less. Here are some of the 64 Automatic Mappings that were auto-populated:
ORIG_LABEL → NEW_LABEL (from Excel)
NUMBER OF SECURITIES AVAILABLE FOR FUTURE ISSUANCE UNDER EQUITY COMPENSATION PLAN (EXCLUDING SECURITIES REFLECTED IN COLUMN) (# OF SHARES) → AVAIABLE
WEIGHTED AVERAGE EXERCISE PRICE OF OUTSTANDING OPTIONS AND RIGHTS 1 EQUITY COMPENSATION PLANS APPROVED BY SHAREHOLDERS: → AVERAGE
PLAN CATEGORY NUMBER OF SECURITIES TO BE ISSUED UPON EXERCISE OF OUTSTANDING OPTIONS, WARRANTS AND RIGHTS → CATEGORY
EQUITY PLAN SUMMARY COLUMN NUMBER OF SECURITIES TO BE ISSUED UPON EXERCISE OF OUTSTANDING OPTIONS, WARRANTS, AND RIGHTS → COLUMN
NUMBER OF SECURITIES TO BE ISSUED UPON EXERCISE OF OUTSTANDING OPTIONS, WARRANTS AND RIGHTS AND THE VESTING OF UNVESTED RESTRICTED STOCK UNITS → EXERCISE
EQUITY COMPENSATION PLAN INFORMATION NUMBER OF SECURITIES TO BE ISSUED UPON EXERCISE OF OUTSTANDING OPTIONS WARRANTS AND RIGHTS → INFORMATION
NUMBER OF SECURITIES TO BE ISSUED UPON EXERCISE OF OUTSTANDING OPTIONS, WARRANTS AND RIGHTS OR SETTLEMENT OF RESTRICTED STOCK UNITS → ISSUED
NUMBER OF SECURITIES TO BE ISSUED UPON EXERCISE OR SETTLEMENT OF OUTSTANDING OPTIONS, WARRANTS AND RIGHTS → OF
EQUITY COMPENSATION PLAN INFORMATION AT DECEMBER 31, 2024 WEIGHTED-AVERAGE EXERCISE PRICE OF OUTSTANDING OPTIONS, WARRANTS AND RIGHTS → PLAN
NUMBER OF DEFINED SECURITIES TO BE ISSUED UPON EXERCISE OF OUTSTANDING OPTIONS, WARRANTS AND RIGHTS DECEMBER 31, 2024 → SECURITIES
NUMBER OF SECURITIES THAT MAY BE ISSUED UPON EXERCISE OR VESTING OF OUTSTANDING OPTIONS, WARRANTS AND RIGHTS AT TARGET → THATMAY
NUMBER OF SECURITIES TO BE ISSUED UPON EXERCISE OF OUTSTANDING OPTIONS OR SETTLEMENT OF LTIP PSUS AND RSUS → TO
NUMBER OF SECURITIES TO BE ISSUED UPON EXERCISE OF OUTSTANDING OPTION AWARDS, RESTRICTED STOCK UNITS, AND PERFORMANCE STOCK UNITS → TO B ISS
NUMBER OF SECURITIES AVAILABLE FOR FUTURE ISSUANCE UNDER EQUITY COMPENSATION PLANS (EXCLUDING SECURITIES TO BE ISSUED UPON EXERCISE OF OUTSTANDING SHARE OPTIONS) → UNDER
NUMBER OF SECURITIES AVAILABLE FOR FUTURE ISSUANCE UNDER EQUITY COMPENSATION PLANS (EXCLUDING SECURITIES TO BE ISSUED UPON EXERCISE OF OUTSTANDING SHARE OPTIONS) → UPON
Those of you that are avid readers know that I can substitute their for there. I have never lost sleep over those errors – I only have one brain cell. This interference is frustrating and unproductive.
I am not sure how Excel does this – but I think it could border on industrial espionage. It might be that Excel is using a C library similar to FuzzyWuzzy and the analysis and substitution decisions are happening locally. But what if they are communicating with a thread in the cloud? There are two concerns I have about this.
First, what if I am working with confidential data – does #Microsoft see all of our data and are they able to act on it? Second, I had to correct these errors, am I serving as an uncompensated trainer for Microsoft? The initial predictions were delivered, I corrected them. Did that info get delivered back to Microsoft and is that going to feed the monster? Two years ago I would have thought that possibility was idiotic – I am not so sure now.
I published this originally about an hour ago. It took me twenty minutes to clean up the mess that was caused by the original automation. I got back on track, happily making my mappings, and it happened again. I was ready for it this time – when Excel started volunteering to do my work I froze and captured it.
As you can see above, I had typed REM and Excel volunteered AINDING – and look below – if I had not caught that we would have had ANCEMAINCURITIES – tell me this is not a problem.
We are desperately trying to finish up a new comprehensive audit fee database. One of the fields is the AUDIT REPORT DATE. But then we run into issues like this one. Here is a screenshot of the audit signature following the audit report included in New Concept Energy’s 10-K for the FYE 12/31/2024. Here is a link to the filing CIK 105744 10-K.
The dissemination date of this 10-K was 3/24/2025 – approximately one year and one week after the signature date. We discovered this while stress testing our data collection strategies. We have a fairly robust regular expression for finding these dates, but the discrepancy between the dissemination date and the audit report date warranted further research (we don’t know how we make mistakes until we find the mistakes).
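Here is a sketch of the sanity test we are applying (the 10-day threshold is the one I described for the CSW Industrials case; the example dates are the ones discussed in this post):

from datetime import date

def flag_report_date(report_date, dissemination_date, max_gap_days=10):
    # Flag when the audit report date is unusually far before the dissemination
    # date, or somehow falls after it.
    gap = (dissemination_date - report_date).days
    return gap > max_gap_days or gap < 0

print(flag_report_date(date(2025, 3, 17), date(2025, 3, 24)))  # False - a 7-day gap
print(flag_report_date(date(2024, 3, 17), date(2025, 3, 24)))  # True - about a year stale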
So now the question is what to do about this? I wonder if this is a clue (gosh, I hate giving away this research idea) that the “Company’s internal control over financial reporting was effective as of December 31, 2024” should perhaps have read ineffective. Note, the quote is from the 10-K, but the date issue, while perhaps trivial, might be evidence that the company does not have effective controls. I started this paragraph wondering what we should do about this. We have a couple of choices:
Leave as is.
Leave as is and flag.
Make an educated guess and change the date to 3/17/2025.
Leave the field blank.
I think what we need to do is confirm that we correctly pulled the value that is being asserted (the audit report date, in this case) – and leave the value if we have collected it correctly, but flag the issue.
I teased in the middle of the prior paragraph – it is kind of interesting the number and nature of errors that we find. Could this be evidence of ineffective controls in that the company does not have enough staff to carefully review and evaluate their disclosures?
As illustrated in the screenshot below – I updated the Dehydrator to include what I have cleverly named the hydrator-{n}-grid-data.csv.
The n represents the count of dehydration artifacts in the directory that was analyzed. This new output reports the results of dehydration on a row by column basis. To keep the focus on the data I have hidden most of the metadata fields in the file – but here is a screenshot of the data values:
The RID represents the index of the data row the value was pulled from. The ROW-LABEL reports the actual content of that row’s first (left-most) column. The CID represents the column index, where 1 represents the first column to the right of the ROW-LABEL column. The value I have blocked above (818019) can be seen in this table, which was the source of the data.
While it may look like the ROW-LABEL is just JOHN M. HOLMES, the actual html tells a different story. In this case both the name and the title are marked in the html as data belonging to each of the first three rows.
The entire Dehydration/Rehydration process is designed to allow you to more easily normalize the column headings (and/or row labels) to create a more compact matrix of data when working with hundreds or thousands of snipped tables from the filings. For example, while 3,111 DEF 14A filings made in 2024 and so far in 2025 used OPTION AWARDS as a column heading, there were 83 other variants used (OPTION-BASED AWARDS, OPTIONS/SAR AWARDS, . . . AWARDED OPTIONS). With the column-list.csv file you are able to identify one label to be applied to all of these phrases that carry the same semantic meaning. This is great when you want the entire result set normalized.
If you are not looking to normalize the whole table and instead only want particular data values, then this feature is designed to help you get faster access to the data you are looking for. In this case, if all you wanted were option-like awards, you can easily sort on the relevant words in the COLUMN-LABEL and that data is quickly available in a very compact form. I had two different PhD students who described this issue (not with compensation data) over the summer, so I wrote the code to address their particular problems and have finally been able to integrate it into dehydration. It does not add much overhead to the Dehydration process, so after agonizing a bit about offering a switch for it I just decided to make it a default feature.
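Here is a hedged sketch of both uses of the new grid output (the grid file name, the column-list.csv column names and the filter term are placeholders; RID, CID, ROW-LABEL and COLUMN-LABEL are the fields described above):

import pandas as pd

grid = pd.read_csv("hydrator-1-grid-data.csv")
mapping = pd.read_csv("column-list.csv")   # assumed columns: RAW_LABEL, NORMALIZED_LABEL
lookup = dict(zip(mapping["RAW_LABEL"], mapping["NORMALIZED_LABEL"]))

# Collapse the 80-plus OPTION AWARDS variants down to one normalized heading.
grid["COLUMN-LABEL"] = grid["COLUMN-LABEL"].map(lambda label: lookup.get(label, label))

# Or, if all you want are the option-like awards, filter on the label text directly.
options = grid[grid["COLUMN-LABEL"].str.contains("OPTION", case=False, na=False)]
print(options[["ROW-LABEL", "COLUMN-LABEL"]].head())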
I am going into the weeds next with this. There is another very important reason for this artifact. We have been trying to identify a reliable, but auditable way to reduce the column headings and row labels that are written to the column-list.csv file. We can’t really do this until we have access to all of the column headings and row labels in a particular collection of tables. This output gives us that access. If I am analyzing every single column heading at one time then I can apply another set of rules to reduce them as long as I can also keep track of where each particular column heading was used and if I have a way to share that information with you when complete. Seeing a label in a list of labels is not quite as useful as seeing it in the actual context adjacent to the other labels and data values.
In addition to the creation of this new artifact we did some more tweaking to the parser rules to improve the header and data row separation process. The new version incorporates these additional rules.
A PhD student reached out a couple of weeks ago because the dehydrator stopped during execution and she wasn’t sure why and how to get the data she needed. As always, I am going to obscure the data that our users are asking about – so in the example below I am going to use Executive Compensation since that is part of our normal processing.
The root cause of the failure was fascinating – there were html comments as children of td elements in the source tables she was trying to normalize. I had never even thought of that, and so our parser crashed on her. The standard procedure has been to read the log, find the last file parsed, then use that to identify the next one and remove it, as the crash would have been caused by that file. Since this was parallel to some work I have been focused on for a while, I decided to use her problem as a test bed for some improvements that have been needed.
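To make the root cause concrete, here is a hedged sketch (using BeautifulSoup as a stand-in; our production sanitizer is different) of stripping comments that appear as children of td elements – or anywhere else in a table:

from bs4 import BeautifulSoup, Comment

html = "<table><tr><td><!-- hidden note -->123,456</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")

# Remove every html comment node, wherever it sits in the tree.
for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
    comment.extract()

print(soup)  # <table><tr><td>123,456</td></tr></table>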
First, I changed our sanitizer to anticipate comments as children of any element, not just the table element (which is all I had ever seen before). So those will not be a problem. But I also decided that we could better handle ‘failed’ cases. Previously, if the error was related to something that we anticipated, the hydrator wrote to the log file that the processing failed for some unspecified error. Below is a screenshot with an example of this prior treatment highlighted.
When this happened we just passed over the file, and you had to identify these and manually do something to capture the data. Now we parse these files without any attempt to identify the column headings. When we have a failed file we now write the rows of data from the html into a csv file cleverly named hydrator-{n}-non-rectangular-summary-data.csv, where the n represents the hydration attempt in that particular folder. All we are doing is stacking the content. We write the content one failed file at a time, and that was deliberate. I was mulling this over and was initially going to write everything at the end with uniform column headings, but we cannot necessarily expect the data in column 3 for CIK 1234567 to mean the same thing as it does for CIK 7653421. So we write a new set of column headings for each individual file that gets diverted to this part of the processor. Here is the new log entry for the table that was flagged above:
Below is an image of the content from that file after it has been stacked in the non-rectangular-summary-data.csv file. When I took the screenshot I also included the next file, to hopefully add additional context to my observation above – that it is a reach to try to predict what values belong in which column when considering the entire universe of tables and data types that you might be extracting. This is why we stack them in this manner.
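A minimal sketch of that stacking approach (the file and header names are placeholders; the point is that every failed file gets its own header row immediately before its data rows):

import csv

def append_failed_table(out_path, source_name, header, rows):
    # Append one failed file's content with its own column headings.
    with open(out_path, "a", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["SOURCE", source_name])
        writer.writerow(header)
        writer.writerows(rows)

append_failed_table(
    "hydrator-1-non-rectangular-summary-data.csv",
    "example_snip.htm",
    ["NAME", "YEAR", "SALARY", "TOTAL"],
    [["PLACEHOLDER NAME", "2024", "0", "0"]],  # placeholder row, not real data
)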
For testing, I pulled 10,364 executive compensation snips. There were 186 that failed on the first pass and the data from those files was accurately deposited in the new csv file. I should be careful here: I did not inspect each of the rows of the 186 source files to compare them to the corresponding rows in the csv file. I checked this one to confirm the data was accurately deposited, and about five others. I hope I am not caught later with inaccuracies; if so, I will try to address them.
I did some additional re-engineering of the table parsers in the dehydrator. When I received the email from the PhD student I had been spending quite a bit of time looking at tables and code. To identify the column headings in a file we start by trying to use deterministic characteristics of the html. There are two that we have historically relied on: the presence of comments that define the column heading rows and the presence of th elements in the rows. However, historically these have been infrequently used. We process thousands of tables each month and we look closely at their construction and structure. We have noticed additional class attributes that have become more common and that are also very deterministic identifiers of column heading boundaries. I added a new parser step that evaluates the existence of these attributes and assigns them a higher priority for setting those boundaries. With this new parser step approximately 20% of the table column headings are parsed deterministically, as opposed to less than 3% before, when the only deterministic identifiers were comments and th elements.
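The actual attributes we key on are part of the parser, so the class names below are stand-ins, but the idea looks roughly like this: a row is treated as a deterministic header row when it uses th cells or carries one of the known class markers.

from bs4 import BeautifulSoup

HEADER_CLASSES = {"heading-row"}  # stand-in for the real attribute values

def is_deterministic_header(row):
    # th cells or a recognized class attribute mark a header row deterministically.
    if row.find("th") is not None:
        return True
    return bool(set(row.get("class") or []) & HEADER_CLASSES)

html = ("<table><tr class='heading-row'><td>Name</td><td>Salary</td></tr>"
        "<tr><td>PLACEHOLDER NAME</td><td>0</td></tr></table>")
rows = BeautifulSoup(html, "html.parser").find_all("tr")
print([is_deterministic_header(r) for r in rows])  # [True, False]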
If a table does not have deterministic indicators of column headings we then use a series of heuristics. I am not going to share much about those as I consider those to be our ‘secret-sauce’. But I was watching an Apple TV show called Severance one night after wrestling with how to implement more refined heuristics. If you have watched the show, the employees look at numbers for patterns. I ended up deciding to build my own Severance tool to integrate into the dehydrator to test for patterns that would help us deliver better results to you. It is working and it has made it easier to tweak some of our heuristic separators. I have some more on the way. Prior to this update we had four classes of markers, we now have six and I am working on three more. These will slow down dehydration – this new build added four minutes to parsing my sample folder but the results are great – it reduced the number of column headings from over 2,000 in this sample to 1,600 and I hope in the next build to reduce it even more significantly.
This update should be fully available by Monday 10/27. I will post about the next update when it is ready.
Our directEDGAR platform offers at least ten ways to filter search results. The effect of most of these is obvious (DATE, CIK, CIK/DATE, 8-K FILING REASON, DOCTYPE, SICCODE, CNAME, . . .). I used the DB DOCUMENT RESULTS filter today and realized that I have not done a good enough job of describing at least one important use case.
I had an email from a PhD student who was trying to collect a table from some specific filings and needed some guidance. I want to obscure their data, so the following example is based on snipping the Executive Compensation (EC) table. I encourage our users to start table snipping with a really narrow and explicit focus – the narrower the focus, the less noise in the output. For example, our process for identifying EC tables starts by requiring the table to have PRINCIPAL, NAME, YEAR, POSITION, SALARY, EQUITY, AWARDS, OTHER and TOTAL. I can’t remember the exact number from our testing, but my recollection is that less than 1% of the tables found with this set of requirements are Type 1 errors. And that is the heart of what I want to explain. It is absolutely true that I will not find an EC table in every DEF 14A with those requirements, but my results will be very clean. The problem then is to collect tables from the filings we missed on the first pass – for example, a reasonable proportion of filers do not use NAME, PRINCIPAL POSITION or YEAR in the column headings.
When our table parser finishes one of the outputs is a csv file named MISSING. The file contains a single column labeled FILENAME. This column contains a list of files from our search results where the application did not return/identify a table. Here is a screenshot of the file.
If we rename the FILENAME column to DE_PATH we can then use that file to filter our previous search on just those documents. That is important – presumably our original search set out some important parameters and we want to focus our data collection on just those firms. So I renamed the FILENAME column to DE_PATH and saved the missing file. I then selected the DB Document Results Filter by activating the Use DB check box and then clicking the Set DB File button.
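If you prefer to do the renaming programmatically rather than in Excel, here is a minimal pandas sketch of that one step:

import pandas as pd

missing = pd.read_csv("MISSING.csv")
missing = missing.rename(columns={"FILENAME": "DE_PATH"})
missing.to_csv("MISSING.csv", index=False)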
Once I hit the Okay button and then Perform Search, the results are filtered to only those filings/documents that were listed in the csv file. Here is the result of that search. Now I can review these presumably relevant filings to work out how to identify the EC table using less restrictive criteria. I brought the focus to a filing where the EC table did not have a label for the YEAR column.
I would have to dig through our system to say this with absolute authority but my recollection is that there are some tables where the only way we can grab them is by using the words SALARY and TOTAL. However, if we were to use only those words to identify tables on the first pass then we would get a huge amount of noise.
We have some obligations to deliver normalized Executive and Director compensation data by 7:00 PM each filing day, so we absolutely cannot waste time having a person trudge through noisy data. While all of this is automated, our process started with what I described above. We are getting ready to add two new tables to our system. The collection of these tables will be automated. But the way we learn to automate it is to begin by manually snipping and reviewing tables as they exist and identifying the characteristics that will produce the least amount of noise in the output on the first pass. For me to have this coded I have to be able to describe how to filter the tables to settle on the exact criteria to apply. The only way I can do that is to review filings, snip some tables, and then review what was missed to understand why it was missed. I use the DB Document Results Filter repeatedly as I iterate through the process.
If the world were perfect we could use science alone to do this work. Filings are created by humans and thus they are not perfect, so we have to bring some tenacity to this process. Features like the DB Document Results Filter reduce at least some of the friction associated with being persistent.