Another New Dehydrator Feature

As illustrated in the screenshot below, I updated the Dehydrator to write a new output file that I have cleverly named hydrator-{n}-grid-data.csv.

The n represents the count of dehydration artifacts in the directory that was analyzed. This new output reports the results of dehydration on a row-by-column basis. To keep the focus on the data I have hidden most of the metadata fields in the file, but here is a screenshot of the data values:

The RID is the index of the data row each value was pulled from. The ROW-LABEL reports the actual content of that row in the first column from the left. The CID is the column index, where 1 represents the first column to the right of the ROW-LABEL column. The value I have blocked above (818019) can be seen in this table, which was the source of the data.

While it may look like the ROW-LABEL is just JOHN M. HOLMES, the actual HTML tells a different story. In this case both the name and the title are marked in the HTML as data belonging to each of the first three rows.
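The field layout described above can be sketched in a few lines of Python. This is only an illustration, not the Dehydrator's actual code: the function name `table_to_grid`, the `VALUE` field name, the 1-based RID, and the sample numbers are all assumptions made for the example.

```python
def table_to_grid(header, rows):
    """Flatten a table into one record per data cell."""
    records = []
    # Assumption: RID counts data rows starting at 1.
    for rid, row in enumerate(rows, start=1):
        row_label, *values = row
        # CID 1 is the first column to the right of the row-label column.
        for cid, value in enumerate(values, start=1):
            records.append({
                "RID": rid,
                "ROW-LABEL": row_label,
                "CID": cid,
                # header[0] sits above the row-label column, so header[cid] lines up.
                "COLUMN-LABEL": header[cid],
                "VALUE": value,  # hypothetical field name
            })
    return records


# Made-up sample table, not taken from any filing.
header = ["NAME", "YEAR", "SALARY", "OPTION AWARDS"]
rows = [
    ["EXAMPLE OFFICER A", "2024", "500000", "120000"],
    ["EXAMPLE OFFICER A", "2023", "480000", "100000"],
]
grid = table_to_grid(header, rows)
```

Each cell ends up as its own row, which is what makes the file easy to sort and filter.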

The entire Dehydration/Rehydration process is designed to make it easier to normalize the column headings (and/or row labels) so you get a more compact matrix of data when working with hundreds or thousands of snipped tables from the filings. For example, while 3,111 DEF 14A filings from 2024 and so far in 2025 used OPTION AWARDS as a column heading, 83 other variants were also in use (OPTION-BASED AWARDS, OPTIONS/SAR AWARDS, . . . AWARDED OPTIONS). With the column-list.csv file you can identify one label to apply to all of the phrases that carry the same semantic meaning. This is great when you want the entire result set normalized.
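The idea of collapsing many variants into one label can be sketched as a simple lookup. This is a minimal illustration under stated assumptions: the layout of column-list.csv is not shown here, so the mapping is written as a plain dictionary, and `LABEL_MAP` and `normalize_label` are hypothetical names.

```python
# Hypothetical mapping from variant headings to the one label you chose.
# Only the variants named in the text are listed.
LABEL_MAP = {
    "OPTION-BASED AWARDS": "OPTION AWARDS",
    "OPTIONS/SAR AWARDS": "OPTION AWARDS",
    "AWARDED OPTIONS": "OPTION AWARDS",
}


def normalize_label(label, label_map=LABEL_MAP):
    """Return the chosen label for a heading, or the cleaned heading if unmapped."""
    # Upper-case and collapse runs of whitespace before looking the heading up.
    key = " ".join(label.upper().split())
    return label_map.get(key, key)
```

For instance, `normalize_label("Option-Based  Awards")` returns `"OPTION AWARDS"`, while a heading with no entry in the map, such as `"SALARY"`, comes back unchanged.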

If you are not looking to normalize the whole table and only want particular data values, this feature is designed to give you faster access to the data you are looking for. In this case, if all you wanted were option-like awards, you can sort on the relevant words in the COLUMN-LABEL and that data is quickly available in a very compact form. Two different PhD students described this issue to me over the summer (not with compensation data), so I wrote the code to address their particular problems and have finally been able to integrate it into dehydration. This does not add much overhead to the Dehydration process, so after agonizing a bit about offering a switch for it I just decided to make it a default feature.
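If you would rather script the selection than sort in a spreadsheet, something along these lines would work. It is a sketch, not part of the Dehydrator: `rows_matching` is a hypothetical helper, the `VALUE` column name is an assumption, and the sample rows are made up so the example runs on its own.

```python
import csv
import io


def rows_matching(csv_file, keyword, label_field="COLUMN-LABEL"):
    """Return the grid rows whose column label contains the keyword (case-insensitive)."""
    reader = csv.DictReader(csv_file)
    return [row for row in reader if keyword.upper() in row[label_field].upper()]


# Made-up stand-in for a hydrator-{n}-grid-data.csv file.
sample = io.StringIO(
    "RID,ROW-LABEL,CID,COLUMN-LABEL,VALUE\n"
    "1,EXAMPLE OFFICER A,1,SALARY,500000\n"
    "1,EXAMPLE OFFICER A,2,OPTION AWARDS,120000\n"
    "2,EXAMPLE OFFICER B,2,OPTIONS/SAR AWARDS,90000\n"
)
option_rows = rows_matching(sample, "OPTION")
```

With a real file you would pass `open(path, newline="")` in place of the `StringIO` object.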

I am going into the weeds next with this. There is another very important reason for this artifact. We have been trying to identify a reliable but auditable way to reduce the column headings and row labels that are written to the column-list.csv file. We can’t really do this until we have access to all of the column headings and row labels in a particular collection of tables. This output gives us that access. If I am analyzing every single column heading at one time, then I can apply another set of rules to reduce them, as long as I can also keep track of where each particular column heading was used and have a way to share that information with you when complete. Seeing a label in a list of labels is not quite as useful as seeing it in its actual context, adjacent to the other labels and data values.
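To make the "keep track of where each heading was used" idea concrete, here is a rough sketch. It does not describe the rules we are actually working on. It assumes each grid record carries some source identifier, shown here as a hypothetical `SOURCE` field, since the metadata fields are hidden in the screenshot above.

```python
from collections import defaultdict


def label_usage(records, label_field="COLUMN-LABEL"):
    """Map each distinct label to the sorted (source, column index) places it appears."""
    usage = defaultdict(set)
    for rec in records:
        # A set drops the repeats that come from many data rows sharing one column.
        usage[rec[label_field]].add((rec["SOURCE"], rec["CID"]))
    return {label: sorted(places) for label, places in usage.items()}


# Made-up records; SOURCE is a hypothetical field identifying the originating table.
records = [
    {"SOURCE": "table-001", "CID": 2, "COLUMN-LABEL": "OPTION AWARDS"},
    {"SOURCE": "table-002", "CID": 3, "COLUMN-LABEL": "OPTION AWARDS"},
    {"SOURCE": "table-002", "CID": 1, "COLUMN-LABEL": "SALARY"},
    {"SOURCE": "table-001", "CID": 2, "COLUMN-LABEL": "OPTION AWARDS"},
]
usage = label_usage(records)
```

Any reduction rule applied to the distinct labels can then be audited by walking back from a label to every table and column where it was used.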

In addition to creating this new artifact, we did some more tweaking of the parser rules to improve the separation of header rows from data rows. The new version incorporates these additional rules.
