Working With Tables

Overview

directEDGAR’s Dehydration and Rehydration tools were designed to assist in the process of consolidating large volumes of tabular data. Consolidating tabular data is challenging because filers use a number of subtly different words and phrases to describe the same concept. The Dehydrator and Rehydrator tools extract the column and row headings from every file, provide a way for you to review and standardize them, and then apply the standardized headings to the original data, reducing the variation caused by inconsistent labeling by filers.

To describe the use of our tools I have a sample of 86 Beneficial Ownership tables that were snipped using the TableExtraction tool. Our Dehydration/Rehydration tools only work on what we describe as table snips, which have been extracted from the filings with our tools. These snips are essentially copies of the original tables that are stored in individual htm/txt files. The snips are named using the CIK-RDATE_CDATE-F##-## system used with all directEDGAR extraction artifacts.

We never begin a table normalization process until we have reviewed any regulatory, GAAP or other guidelines and/or limitations that specify the nature and form of the disclosure. The requirement to disclose beneficial ownership in tables is set out in 17 CFR 229.403. In those requirements the model table has the following set of column headings:

(1) Title of Class
(2) Name and address of beneficial owner
(3) Amount and nature of beneficial ownership
(4) Percent of class

https://www.law.cornell.edu/cfr/text/17/240.13d-3

This is a very limited set of column headings, so initially it looks as if this data would be very easy to normalize. The reality of the disclosure form, though, is very different. As we work through the example below we will point out why it is important to be aware of the regulatory framework that sets the disclosure requirements, to be sure we are capturing the right data.

Dehydration

Dehydration is the first step. The engine parses the table snips and tries to identify every unique row and column label. It then organizes them into separate files that summarize the labels, report their frequency and identify the filers that used each particular label. To open the Dehydrator, go to the Normalization tab in the application and click “Dehydrator.” This window will open:

Select the folder containing your snips as the source directory. If this is your first time, don’t pre-populate from an existing dictionary; leave it blank. We’ll create our first dictionary when we Rehydrate. Running the Dehydrator will create the following new files in your source directory.

File | Purpose
dehy.log | Lists specific processing errors
hydrator-1-column-list.csv | Every identifiable column label from snips
hydrator-1-dehydrate.log | Lists files that were processed and whether or not they were successfully processed
hydrator-1-labels.db | A special database that has the row and column label files as well as all of the data values at their intersection. This file is not user editable.
hydrator-1-row-list.csv | Every identifiable row label from snips

As an aside, remember that the CFR describes four columns that should be reported in the table. In our sample of 86 tables we found 120 unique column headings/labels. One of the two issues that drive that variation is the language choices made by different registrants, who use different words to describe the same concept. For example, we found that registrants used these phrases (and many others) to describe the number of shares: “BENEFICIAL OWNERSHIP NUMBER,” “BENEFICIAL OWNERSHIP NUMBER OF SHARES,” and “BENEFICIALLY OWNED NUMBER OF SHARES.” There is as much variation in the labels used to describe the percentage of ownership. Without the ability to standardize these labels any matrix of this data would have the 120 column headings. After Dehydration, Normalization and Rehydration the number of unique column headings will be significantly lower, and thus this data can be used in our research.

hydrator-1-column-list.csv/hydrator-1-row-list.csv files

These files are at the heart of the normalization process. As mentioned earlier, the Dehydrator parses the snips, tries to identify all unique column headings and row labels, and creates a record of their frequency in the population of tables being analyzed. Both files have the same structure; for simplicity our focus in this example will be on the hydrator-1-column-list.csv file. Here is a partial screenshot from our work:

As I have already mentioned, there were 120 unique labels after dehydration from the original 86 unique tables. Without normalization, if we took these tables directly to Excel there would be 120 column headings. To make this data usable we would have to look at the columns and somehow consolidate them in Excel, or perhaps write some intermediate code. In each of these alternatives we would be looking at the columns and their values and having to do some thinking. Our process keeps the focus on the column heading.

Both the column-list and row-list files have the same format – here is an explanation of the various column headings.

Column | Purpose
ORIG_LABEL | Column or row label pulled by dehydrator from original table
NEW_LABEL | User- or Dictionary-assigned labels to be used in rehydration
CIK_COUNT | The number of CIKs with a given original label
CIK_1, CIK_2... | An individual CIK with a given original label.
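As a rough illustration of how the column-list or row-list file can be inspected outside the application, here is a minimal Python sketch. It assumes the file has a header row using the column names described above (ORIG_LABEL, NEW_LABEL, CIK_COUNT, CIK_1, ...); the helper name `summarize_label_list` and the sample CIK values are made up for the example and are not part of directEDGAR.

```python
import csv
import io

def summarize_label_list(csv_file):
    """Return (total_labels, unmapped_labels) for a column-list or row-list file."""
    reader = csv.DictReader(csv_file)
    total = 0
    unmapped = []
    for row in reader:
        total += 1
        # NEW_LABEL stays empty until a user or a dictionary assigns a value.
        if not (row.get("NEW_LABEL") or "").strip():
            unmapped.append(row["ORIG_LABEL"])
    return total, unmapped

# Tiny in-memory stand-in for hydrator-1-column-list.csv (CIK values are invented).
sample = io.StringIO(
    "ORIG_LABEL,NEW_LABEL,CIK_COUNT,CIK_1,CIK_2\n"
    "%,PERCENT,2,1111,2222\n"
    "% OF CLASS,,1,3333,\n"
)
total, unmapped = summarize_label_list(sample)
print(total, unmapped)  # 2 ['% OF CLASS']
```

To run this against a real file, pass an open file handle (for example `open("hydrator-1-column-list.csv", newline="")`) in place of the in-memory sample; the file's encoding is not stated in this article, so you may need to set it explicitly.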

Normalization

Normalization is the word we use to describe the process of organizing the existing labels and mapping those with a common meaning but a different expression into a common heading. For example, the first three rows contain the values %, %% and % OF CLASS. My first inclination is to normalize those headings to PERCENT, but it is important to be careful and avoid making too many presumptions about the intended meaning of the column headings if mislabeling will lead to corrupted data. Thus, one of the reasons we provide the list of CIKs associated with a particular label is to make it easier to review the source table to confirm the appropriateness of the normalized label we are considering. In my case, I am just a little concerned that a table with the label % OF CLASS might report multiple classes under the same heading, so I want to review those snips to establish what is being reported. Both seem to be reporting on common stock based on the actual table:

The point I am hoping to make here is that caution is important at this stage. While it may slow down the process a bit initially, and there is a learning curve, our experience is that the payoff from this initial caution is significant, since we are ultimately going to create a dictionary with these mappings to process more tables in the future.
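The cautious workflow described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the rows mimic the column-list layout, the CIK values are invented, and `ciks_for_label`, `proposed` and `needs_review` are hypothetical names, not features of the application. The idea is to assign NEW_LABEL only to labels you are confident about and to pull the CIK list for any label you still want to check against the source snips.

```python
def ciks_for_label(rows, label):
    """Collect the CIK values recorded for one ORIG_LABEL."""
    for row in rows:
        if row["ORIG_LABEL"] == label:
            return [
                value
                for key, value in row.items()
                if key.startswith("CIK_") and key != "CIK_COUNT" and value
            ]
    return []

# Rows shaped like the column-list file; the CIK values are invented.
rows = [
    {"ORIG_LABEL": "%", "NEW_LABEL": "", "CIK_COUNT": "1", "CIK_1": "1111", "CIK_2": ""},
    {"ORIG_LABEL": "%%", "NEW_LABEL": "", "CIK_COUNT": "1", "CIK_1": "2222", "CIK_2": ""},
    {"ORIG_LABEL": "% OF CLASS", "NEW_LABEL": "", "CIK_COUNT": "2", "CIK_1": "3333", "CIK_2": "4444"},
]

proposed = {"%": "PERCENT", "%%": "PERCENT", "% OF CLASS": "PERCENT"}
needs_review = {"% OF CLASS"}  # hold these back until the snips are checked

for row in rows:
    label = row["ORIG_LABEL"]
    if label in proposed and label not in needs_review:
        row["NEW_LABEL"] = proposed[label]

# The snips from these filers are the ones to open before deciding.
review_ciks = ciks_for_label(rows, "% OF CLASS")
print(review_ciks)  # ['3333', '4444']
```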

We had no control over how the registrants organized this data. There will be cases where extraneous data is included in the table. For example, one of the cells in our hydrator-1-column-list.csv file had the column heading POSITION. A review of that table reveals that the column values do describe the relationship of the named person with the company. We normalized this to POSITION with the intent of deleting the column in our final output.

Probably the most challenging problems are the cases where a registrant uses a label in a way that indicates the meaning of the data is significantly different from the way other registrants have used the label. To get a sense of this, review this image of the beneficial ownership table as reported by Emcore Corporation:

The language they provide in the proxy to describe the meaning of this data conforms to the language used by the SEC to define what must be included/classified as Shares Beneficially Owned.

Shares beneficially owned include shares of Common Stock, options to acquire shares of Common Stock and restricted stock units that are exercisable or will vest within sixty (60) days of December 31, 2022. Unless otherwise indicated, the address of each of the beneficial owners is c/o EMCORE Corporation, (EMCORE proxy).

However, Analog Devices deviates from the definition of Beneficially Owned as set out in the CFR and uses this reporting format.

Analog Devices (6281)

This is important to catch. I suspect that if I had not reviewed these carefully I might have labeled Total Beneficial Ownership in the same way as I would have labeled Shares Beneficially Owned. This could lead to corrupt data in the final results. In this case I really don’t care about the Shares Acquirable Within 60 Days. The ultimate value I want to collect is what is reported in the Total Beneficial Ownership column. Reviewing this presentation has alerted me to the fact that, to be safe, I need a unique label for Shares Beneficially Owned and another for Total Beneficial Ownership.

Ultimately we ended up mapping the 120 column headings into 9 unique labels. This is still more than what we want in the output but we are exercising caution to try to prevent data loss. Here is a screen shot of part of the normalized column headings.

Notice the NOISE column. I used that label to map together all of the data columns that I am confident I will not use.

Rehydration

Now it’s time to Rehydrate. The Rehydrator is right below the Dehydrator in the Normalization drop-down. The Rehydrator takes the row and column labels we just assigned, combines them with the original data from the snips, and consolidates the individual snips into one CSV file, which will be named hydrator-1-summaries.csv.

The source directory is the same as before. Unless you’re working with multiple datasets in the same folder, the dataset number will always be 1. You’re almost always going to want to select original column labels, unless you have a reason to transpose the data or in those cases where the none option is the most productive option.
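Conceptually, rehydration relabels each snip's columns with the normalized labels and stacks the results. The Python sketch below shows that idea only; it is not the Rehydrator's actual implementation. The function name `rehydrate`, the `TABLE` key, the snip identifiers, the FOOTNOTE heading and the sample values are all invented, and dropping NOISE columns here stands in for the manual deletion a user would do on the final output.

```python
def rehydrate(tables, mapping, drop=("NOISE",)):
    """Relabel each table's columns and stack the rows into one list."""
    combined = []
    for table_id, rows in tables.items():
        for row in rows:
            out = {"TABLE": table_id}
            for orig_label, value in row.items():
                # Labels without a mapping keep their original heading.
                new_label = mapping.get(orig_label, orig_label)
                if new_label in drop:
                    continue
                out[new_label] = value
            combined.append(out)
    return combined

# Two invented snips that express the same concept with different headings.
tables = {
    "snip_a": [{"NAME": "Jane Doe", "% OF CLASS": "5.1", "FOOTNOTE": "(1)"}],
    "snip_b": [{"NAME": "John Roe", "%": "2.0"}],
}
mapping = {"% OF CLASS": "PERCENT", "%": "PERCENT", "FOOTNOTE": "NOISE"}

combined = rehydrate(tables, mapping)
for row in combined:
    print(row)
```

After relabeling, both rows share a single PERCENT column, which is the reduction in column headings that normalization is meant to deliver.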

Creating a Dictionary

The dictionary feature lets us ensure the row and column labels which we manually mapped earlier will persist and be mapped in future normalizations. To set up a dictionary, first select “Navigate to Other/New” then click Browse.

This will open up the file explorer. Navigate to the directory where you do your directEDGAR work, create a new folder devoted to dictionaries, give the dictionary a title related to the nature of the data you’re working with, and hit Open. You don’t need to create a file; you just need to create a title in the file explorer.

Then click “Okay” in the Rehydrator panel and the application will create the dictionary and Rehydrate the data using the new column headings in place of the original ones.

The Rehydration process creates several new files in the directory. These are listed in the following table.

File | Purpose
hydrator-1-rehydrator.log | Lists files that were processed and whether or not they were successfully processed
hydrator-1-summaries.csv | The normalized data with the old ROW/COLUMN labels mapped into new values based on your dictionary

Here is a screenshot of the file after Rehydration.

There is additional clean-up to do. For this data, the most important step is to sort on STOCK and confirm that a value for BENEFICIALOWNERSHIP is present. Where it is not, the value for STOCK needs to be moved over to the BENEFICIALOWNERSHIP column.
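If you prefer to script this clean-up rather than do it by hand, the step could look something like the Python sketch below. It assumes the rows have already been read from hydrator-1-summaries.csv into dictionaries keyed by column heading (for example with `csv.DictReader`); the function name and the sample values are invented for illustration.

```python
def move_stock_to_ownership(rows):
    """Where BENEFICIALOWNERSHIP is blank, move the STOCK value into it."""
    moved = 0
    for row in rows:
        ownership = (row.get("BENEFICIALOWNERSHIP") or "").strip()
        stock = (row.get("STOCK") or "").strip()
        if not ownership and stock:
            row["BENEFICIALOWNERSHIP"] = stock
            row["STOCK"] = ""
            moved += 1
    return moved

# Invented rows: the second one has its share count in the wrong column.
rows = [
    {"NAME": "Jane Doe", "STOCK": "", "BENEFICIALOWNERSHIP": "12,000"},
    {"NAME": "John Roe", "STOCK": "8,500", "BENEFICIALOWNERSHIP": ""},
]
moved = move_stock_to_ownership(rows)
print(moved, rows[1])
```

Rows that already have a BENEFICIALOWNERSHIP value are left untouched, so any STOCK value they carry still needs a manual look.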

Using Dictionaries

The benefit of creating a dictionary is realized with the next batch of tables that get normalized. With a dictionary available, the Dehydrator application will pre-populate the NEW_LABEL column with the values from the dictionary.

Here is an image of the beneficial_ownership_user-column-list.csv file that has my beneficial ownership mappings.

So the next time I work with these tables, if I select the dictionary before Dehydration, any time an ORIG_LABEL value from the dictionary is found in the new data, the NEW_LABEL value from the dictionary will be included in the column-list/row-list files.
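The pre-population behavior can be pictured as a simple lookup. The Python sketch below is an assumption-laden illustration, not the application's code: it treats the dictionary as an exact-match mapping from ORIG_LABEL to NEW_LABEL, and the function name, the specific mappings and the unmatched label are invented for the example.

```python
def prepopulate(rows, dictionary):
    """Fill NEW_LABEL from the dictionary wherever ORIG_LABEL matches exactly."""
    for row in rows:
        if not row.get("NEW_LABEL"):
            row["NEW_LABEL"] = dictionary.get(row["ORIG_LABEL"], "")
    return rows

# Illustrative dictionary entries; the real mappings are whatever you assigned.
dictionary = {
    "BENEFICIAL OWNERSHIP NUMBER": "BENEFICIALOWNERSHIP",
    "% OF CLASS": "PERCENT",
}

# Labels found in a new batch of tables.
new_rows = [
    {"ORIG_LABEL": "% OF CLASS", "NEW_LABEL": ""},
    {"ORIG_LABEL": "SHARES HELD IN TRUST", "NEW_LABEL": ""},
]
prepopulate(new_rows, dictionary)
print(new_rows)
```

Labels that are not in the dictionary come through with an empty NEW_LABEL, which is the work left for you to review and assign.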

To use a dictionary from your dictionary folder during Dehydration, first specify the directory that has the files you want to Dehydrate. If this is the first time you are using a particular dictionary, select the Navigate to Other button and find the folder where your dictionaries have been saved.

You need to select the column-list.csv file, not the row-list.csv file. If you have normalized row labels the system will automatically populate both row and column headings. Here is an image of the hydrator-1-column-list.csv file after dehydrating a new set of tables.

Adding to Dictionaries

Presumably I will continue this work and assign values to the NEW_LABEL column as appropriate. When I am finished I can then add these mappings into the dictionary. These can be added to an existing dictionary during Rehydration. As illustrated in the image below, hit the radio button next to Select Recent Dictionary and your recent dictionary should be available.

If the dictionary you want to work with is not listed, select Navigate to Other/New and when the directory is loaded select the dictionary-column-list.csv file. Note – if you do not select a dictionary-column-list.csv file the application will warn you.

Since the dictionary files are CSV files they can be edited using Excel, Notepad or many other tools. You can open them, add to them or modify the mappings as you need. One of the reasons we added the Navigate to Other/New capability is that we have clients at the same facility and on the same network who are working on large datasets. This feature allows you to share a dictionary that is located in a network location.
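When several people edit a shared dictionary, it can help to check new mappings against the existing ones before merging them. The Python sketch below is one hypothetical way to do that outside the application. How the Rehydrator itself resolves a label that is already in the dictionary with a different value is not described here, so this sketch simply keeps the existing entry and reports the conflict. All names and sample mappings are invented.

```python
def merge_mappings(dictionary, new_rows):
    """Merge newly assigned labels into a copy of the dictionary, listing conflicts."""
    merged = dict(dictionary)
    conflicts = []
    for row in new_rows:
        orig = row["ORIG_LABEL"]
        new = (row.get("NEW_LABEL") or "").strip()
        if not new:
            continue  # nothing assigned yet, so nothing to add
        if orig in merged and merged[orig] != new:
            # Keep the existing entry and flag the disagreement for review.
            conflicts.append((orig, merged[orig], new))
            continue
        merged[orig] = new
    return merged, conflicts

existing = {"% OF CLASS": "PERCENT"}
new_rows = [
    {"ORIG_LABEL": "PERCENT OF CLASS", "NEW_LABEL": "PERCENT"},
    {"ORIG_LABEL": "% OF CLASS", "NEW_LABEL": "PERCENT_CLASS_A"},
    {"ORIG_LABEL": "POSITION", "NEW_LABEL": ""},
]
merged, conflicts = merge_mappings(existing, new_rows)
print(merged)
print(conflicts)
```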

Metadata in Extracted Table Results

All csv files created during any form of table extraction have a common set of metadata included in the output. Here is a screenshot of the metadata section of the csv file generated during the work to create this documentation:

The meaning of each of the fields is described in the following table:

FIELD | DESCRIPTION
NO | Row number in the csv file
CIK | Central Index Key of the filer
RDATE | R + The dissemination date per the header of the filing the table was extracted from
CDATE | C + The conformed date per the header of the filing the table was extracted from - meaning varies between filing types and not always consistent
FNAME | F + The last two digits of the ACCESSION number of the filing the table was extracted from
TID | Table Index - represents the index of the table when counting all tables in the source document
NAME-LABEL | The column heading of the first column with text/values in the table
NAME | The values in the first column in the table with data
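The same metadata is encoded in the snip file names, so a file name can be split back into these fields. The Python sketch below is based on the one file name that appears in this article's log example (3570-R20240415-C20240523-F42-419.htm). It assumes the dates are written as YYYYMMDD and that the final number is the table index; both are inferences from that single example rather than documented facts, and the pattern accepts either a hyphen or an underscore between RDATE and CDATE.

```python
import re
from datetime import datetime

# Pattern inferred from one example file name; treat it as an assumption.
SNIP_NAME = re.compile(
    r"^(?P<cik>\d+)-R(?P<rdate>\d{8})[-_]C(?P<cdate>\d{8})"
    r"-F(?P<fname>\d{2})-(?P<tid>\d+)\.(?:htm|txt)$"
)

def parse_snip_name(name):
    """Split a snip file name into the metadata fields it encodes."""
    match = SNIP_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized snip name: {name}")
    parts = match.groupdict()
    return {
        "CIK": parts["cik"],
        "RDATE": datetime.strptime(parts["rdate"], "%Y%m%d").date(),
        "CDATE": datetime.strptime(parts["cdate"], "%Y%m%d").date(),
        "FNAME": "F" + parts["fname"],
        "TID": int(parts["tid"]),
    }

info = parse_snip_name("3570-R20240415-C20240523-F42-419.htm")
print(info)
```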

New Output File – hydrator-n-non-rectangular-summary-data.csv

As of 10/27/2025 a new output file will sometimes be created during dehydration. This file contains the raw data from tables that the application was not able to rectangularize (yes, a made-up word). If the underlying html has rows with unequal numbers of cells, the table cannot be rectangularized, and our algorithm for mapping column headings to data could cause alignment problems between some data rows and column headings. Previously the application reported in the log that dehydration failed for these specific tables. Now they are “stacked” in a csv file with the metadata you need to identify the html file, followed by each value from each cell based on its original placement in the table. The log will now report something like 2025-10-26 15:36:08,102:processing of 3570-R20240415-C20240523-F42-419.htm skipped (non-rectangular (extracted to CSV)). Here is a screenshot of the contents of this new output.

Notice the red arrow above. We are writing to this file incrementally during processing. It is important that you realize that, other than the metadata rows, you should not expect similar content to be present across table boundaries for particular columns. To make that clear, the column headings are repeated for each file.
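To get a feel for which snips are likely to end up in this file, you can count the cells in each row of a snip yourself. The Python sketch below uses only the standard library and is a simplified stand-in, not the application's actual test: it counts `td` and `th` tags per `tr`, ignores `colspan`/`rowspan`, and does not handle nested tables. The class and function names are invented.

```python
from html.parser import HTMLParser

class RowCellCounter(HTMLParser):
    """Count the td/th cells that open inside each tr."""

    def __init__(self):
        super().__init__()
        self.cells_per_row = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.cells_per_row.append(0)
        elif tag in ("td", "th") and self.cells_per_row:
            self.cells_per_row[-1] += 1

def is_rectangular(html):
    """True when every row of the snip has the same number of cells."""
    counter = RowCellCounter()
    counter.feed(html)
    return len(set(counter.cells_per_row)) <= 1

even = "<table><tr><td>a</td><td>b</td></tr><tr><td>c</td><td>d</td></tr></table>"
uneven = "<table><tr><td>a</td><td>b</td></tr><tr><td>c</td></tr></table>"
print(is_rectangular(even), is_rectangular(uneven))  # True False
```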

Working with Tables on Your Local Computer

We always allow our clients to work with our tools on their local computer. This is one of those cases where a local install might provide some benefits. Please contact us to make arrangements for a no-cost local version.
