A PhD student reached out a couple of weeks ago because the dehydrator stopped during execution and she wasn't sure why, or how to get the data she needed. As always, I am going to obscure the data our users are asking about, so in the example below I will use Executive Compensation since that is part of our normal processing.
The root cause of the failure was fascinating: there were html comments as children of td elements in the source tables she was trying to normalize. I had never even considered that possibility, so our parser crashed on her. The standard procedure had been to read the log, find the last file parsed, use that to identify the next file, and remove it, since the crash would have been caused by that file. Since this was parallel to some work I have been focused on for a while, I decided to use her problem as a test bed for some improvements that have been needed.
First, I changed our sanitizer to anticipate comments as children of any element, not just the table element (which is all I had ever seen before). So those are no longer a problem. But I also decided that we could better handle 'failed' cases. Previously, even if the error was something we anticipated, the hydrator simply wrote to the log file that the processing failed for some unspecified error. Below is a screenshot with an example of this prior treatment highlighted.
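To make the comment problem concrete, here is a minimal sketch of the idea, not our actual sanitizer: strip html comments wherever they occur before the table parser ever sees them. The function name and the snippet are illustrative only.

```python
import re

# Matches an html comment anywhere in the markup, including inside
# td elements (non-greedy so adjacent comments are handled separately).
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)

def strip_html_comments(html: str) -> str:
    """Remove html comments from any element, not just table children."""
    return COMMENT_RE.sub("", html)

# A td with a comment child -- the kind of input that used to crash the parser.
snippet = "<table><tr><td><!-- hidden note -->1,250,000</td></tr></table>"
print(strip_html_comments(snippet))
# -> <table><tr><td>1,250,000</td></tr></table>
```

A regex is enough for this sketch because comments cannot nest in html; a production sanitizer would do this inside a real parse pass.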
When this happened we just passed over the file, and you had to identify these files and manually do something to capture the data. Now we parse these files without any attempt to identify the column headings. When we have a failed file, we write the rows of data from the html into a csv file cleverly named hydrator-{n}-non-rectangular-summary-data.csv, where n represents the hydration attempt in that particular folder. All we are doing is stacking the content. We write the content one failed file at a time, and this was deliberate. I was mulling this over and initially planned to write everything at the end with uniform column headings, but we cannot necessarily expect the data in column 3 for CIK 1234567 to mean the same thing as it does for CIK 7653421. So we write a new set of column headings for each individual file that gets diverted to this part of the processor. Here is the new log entry for the table that was flagged above:
Below is an image of the content from that file after it has been stacked in the non-rectangular-summary-data.csv file. When I took the screenshot I also included the next file, to add context to my observation above: it is a reach to try to predict which values belong in which column when you consider the entire universe of tables and data types you might be extracting. This is why we stack them in this manner.
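The stacking described above can be sketched in a few lines. This is an illustration of the append-with-per-file-headings approach, not our actual code; the headings and rows here are made up, and only the file-name pattern comes from the post.

```python
import csv
from pathlib import Path

def stack_failed_file(out_path: Path, headings: list[str], rows: list[list[str]]) -> None:
    """Append one failed file's table to the shared csv, writing a fresh
    heading row first so each file keeps its own column meanings."""
    with out_path.open("a", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(headings)  # per-file headings, not uniform ones
        writer.writerows(rows)

out = Path("hydrator-1-non-rectangular-summary-data.csv")
# Two diverted files stacked one at a time, each with its own headings.
stack_failed_file(out, ["Name", "Year", "Salary"], [["J. Doe", "2023", "1,250,000"]])
stack_failed_file(out, ["Name", "Total Comp"], [["A. Smith", "2,000,000"]])
```

Appending one file at a time means a crash partway through still leaves everything processed so far on disk, which suits a recovery path for already-failed files.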
For testing, I pulled 10,364 executive compensation snips. There were 186 that failed on the first pass, and the data from those files was accurately deposited in the new csv file. I should be careful here: I did not inspect every row of all 186 source files to compare against the corresponding rows in the csv file. I checked the one above to confirm the data was accurately deposited, and about five others. I hope I am not caught later with inaccuracies; if so, I will try to address them.
I also did some additional re-engineering of the table parsers in the dehydrator. When I received the email from the PhD student, I had already been spending quite a bit of time looking at tables and code. To identify the column headings in a file, we start by trying to use deterministic characteristics of the html. Historically we have relied on two: the presence of comments that define the column heading rows, and the presence of th elements in those rows. However, these appear infrequently. We process thousands of tables each month, and we look closely at their construction and structure. We have noticed additional class attributes that have become more common and that are also very deterministic identifiers of column heading boundaries. I added a new parser step that checks for these attributes and assigns them a higher priority for setting those boundaries. With this new step, approximately 20% of table column headings are parsed deterministically, compared to less than 3% before, when the only deterministic identifiers were comments and th elements.
If a table does not have deterministic indicators of column headings, we then use a series of heuristics. I am not going to share much about those, as I consider them our 'secret sauce'. But one night, after wrestling with how to implement more refined heuristics, I was watching an Apple TV show called Severance. If you have watched the show, the employees stare at numbers looking for patterns. I ended up deciding to build my own Severance tool to integrate into the dehydrator to test for patterns that would help us deliver better results to you. It is working, and it has made it easier to tweak some of our heuristic separators. More are on the way: prior to this update we had four classes of markers; we now have six, and I am working on three more. These will slow down dehydration. This new build added four minutes to parsing my sample folder, but the results are great: it reduced the number of column headings from over 2,000 in this sample to 1,600, and I hope the next build reduces that even more significantly.
This update should be fully available by Monday 10/27. I will post about the next update when it is ready.




































