New Update to TableExtraction & Dehydrator Outputs

We just updated the TableExtraction and Dehydrator code to create new outputs.

For TableExtraction we added a new search_phrase.txt file that is only generated when you use the Search Query option to identify tables for extraction. The file contains both your original set of parameters and the transformation(s) we apply to generate the code that is then delivered to the TableExtraction engine. Here is a sample:

Input:  (acquired or price or assumed) and (goodwill  and intangible and asset and 
         liabil)
Parsed: AND(OR(Contains('acquired'), Contains('price'), Contains('assumed')),
        AND(Contains('goodwill'), Contains('intangible'), Contains('asset'),  
         Contains('liabil')))

The Input line reports the text you submitted and the Parsed line reports how that text was transformed. We hope this gives you more visibility into why the results contained (or did not contain) particular tables.
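To make the Parsed line easier to read, here is a minimal Python sketch of how an expression of that shape could be evaluated against the text of a table. This is not the TableExtraction engine's code. The function names simply mirror the Parsed output, and the matching rules (case-insensitive substring matching) are my assumption for illustration.

```python
# Illustrative only: these helpers mirror the names in the Parsed line.
# Case-insensitive substring matching is an assumption, not confirmed behavior.

def Contains(term):
    """Build a test that is True when `term` appears anywhere in the text."""
    needle = term.lower()
    return lambda text: needle in text.lower()

def AND(*tests):
    """True only when every sub-test is True."""
    return lambda text: all(t(text) for t in tests)

def OR(*tests):
    """True when at least one sub-test is True."""
    return lambda text: any(t(text) for t in tests)

# The Parsed line from the sample, written as nested calls.
matches = AND(
    OR(Contains('acquired'), Contains('price'), Contains('assumed')),
    AND(Contains('goodwill'), Contains('intangible'), Contains('asset'),
        Contains('liabil')),
)

# Two made-up table texts to try the expression on.
ppa = "Purchase price allocation: goodwill, intangible assets, liabilities assumed"
other = "Director compensation: fees earned, stock awards"
print(matches(ppa), matches(other))
```

Under these assumptions the partial string liabil matches "liabilities", and a table has to satisfy both groups to qualify.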

We also redesigned the logging to create a new log file named SnipSummary_DATE_TIME.csv. This file contains all of the details from the input file and we have added a column named COUNT to report to you the number of snips that were extracted from each of the source files. Here is a screenshot of this file (I hid all of the metadata columns except for CIK).

The intention is to give you clearer visibility into the results. In the example above I was snipping purchase price allocations, and there is not necessarily an upper limit on the number of those tables that might be reported in any particular 10-K. However, for many of the filings above the snipped tables included the Statement of Cash Flows (as it included all of the strings I set as my parameters). I discovered that easily by first reviewing the snips from those CIKs. There are many cases where you expect only one table from a filing; in those cases, counts of 2, 3 or 4 identify the filings to review.
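If you would rather flag those filings programmatically than scan the file by eye, a sketch along these lines could work. It assumes only the CIK and COUNT columns described above. The sample rows are made up, and the function name is hypothetical.

```python
import csv
import io

def files_with_multiple_snips(csv_file, expected=1):
    """Return (CIK, COUNT) pairs for rows whose COUNT exceeds `expected`."""
    flagged = []
    for row in csv.DictReader(csv_file):
        count = int(row["COUNT"])
        if count > expected:
            flagged.append((row["CIK"], count))
    return flagged

# Made-up rows standing in for a SnipSummary_DATE_TIME.csv file.
sample = io.StringIO("CIK,COUNT\n1111,1\n2222,3\n3333,0\n4444,2\n")
for cik, count in files_with_multiple_snips(sample):
    print(cik, count)
```

With a real file you would pass an open file handle (opened with newline="") in place of the sample.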

The MISSING.csv file has also been modified. In the last build, the missing list (the documents for which no tables were found) was a txt file. It is now a csv file. All of the metadata from the original file is present and there is an additional column, DE_PATH. That column lets you run a search focused on just those documents. There is a demonstration of how to use the MISSING.csv file to run that search in this video (the search example begins at the 4’28” mark in the video). What is key here is that you can run the exact same search you ran initially, but the output is limited to only these specific documents.
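If you want to inspect the list of missing documents outside the application, a short sketch like the following could pull the DE_PATH values out of the file. The DE_PATH column name comes from the description above. The sample rows and paths are invented, and the function name is hypothetical.

```python
import csv
import io

def missing_paths(csv_file):
    """Collect the non-empty DE_PATH values from a MISSING.csv style file."""
    return [row["DE_PATH"] for row in csv.DictReader(csv_file) if row.get("DE_PATH")]

# Invented rows; real files carry all of the original metadata columns too.
sample = io.StringIO(
    "CIK,DE_PATH\n"
    "1111,example/path/one.htm\n"
    "2222,\n"
    "3333,example/path/three.htm\n"
)
print(missing_paths(sample))
```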

Finally, another update was made to the Dehydration output. Malformed tables are now saved in a PROBLEM_TABLES subdirectory with an adjacent csv file that has the same name as the snip. The csv file contains all of the usual metadata from Dehydration/Rehydration and then we have parsed each line so that the content from the td/th elements ends up in a single cell. Here is a screenshot of this:

As you can see from that real example, this file will be easy to prepare for your data pipeline. You would just rename COL3, COL5 and COL7, delete COL2, COL4 and COL6 and then delete the two rows below (FISCAL 2022 DIRECTOR COMPENSATION and the row that has all of the original column headings).
Earlier, we stacked these in one csv file, but after working with that for a while we believe this output is much easier to work with.
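The cleanup steps described above (rename some columns, delete others, drop the title and heading rows) can be sketched in Python as follows. Everything specific here is an assumption for illustration: the sample rows are invented, the new column names are hypothetical, the metadata columns are omitted, and the rows to drop are identified by their COL1 text.

```python
import csv
import io

def tidy_problem_table(rows, rename, drop_cols, drop_row_if):
    """Rename and delete columns, skipping rows the caller flags for removal."""
    cleaned = []
    for row in rows:
        if drop_row_if(row):
            continue
        cleaned.append({rename.get(col, col): value
                        for col, value in row.items() if col not in drop_cols})
    return cleaned

# Invented stand-in for a PROBLEM_TABLES csv file (metadata columns left out).
sample = io.StringIO(
    "COL1,COL2,COL3,COL4,COL5,COL6,COL7\n"
    "FISCAL 2022 DIRECTOR COMPENSATION,,,,,,\n"
    "Name,,Fees Earned,,Stock Awards,,Total\n"
    "Jane Doe,$,50000,$,100000,$,150000\n"
)

result = tidy_problem_table(
    list(csv.DictReader(sample)),
    # Hypothetical new names for the three value columns.
    rename={"COL3": "FEES_EARNED", "COL5": "STOCK_AWARDS", "COL7": "TOTAL"},
    drop_cols={"COL2", "COL4", "COL6"},
    # Drop the title row and the row of original column headings.
    drop_row_if=lambda r: r["COL1"] in {"FISCAL 2022 DIRECTOR COMPENSATION", "Name"},
)
print(result)
```

Passing the rename map and the row test in as arguments keeps the function reusable across tables whose layouts differ.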

As a side note, after I saw the results with the SCF tables, I deleted all of the tables and output from that run and changed my Search Query to:


Input:  (acquired or price or assumed) and (goodwill  and intangible and asset and 
         liabil) not (invest or financ or operat)

This modification reduced the noise in my output.

directEDGAR

Search, Extraction & Normalization Engine
