When “Structured Data” Isn’t What the Company Reported

One of the promises of inline XBRL was simplicity.
If a data point appears on the cover page of a 10-K and it’s tagged, then—at least in theory—it should be easy to extract, compare, and analyze.

Public float is a good example.

The SEC requires filers to report their public float on the cover page, and with tagged cover pages that value is now explicitly labeled and machine-readable. Many researchers and data users understandably rely on the SEC’s structured data feeds to capture it.

But there is a quiet problem.

In a subset of filings, the tagged public float value does not match what the company actually reports on the cover page.

Here’s what that looks like in practice (Packaging Corp of America 12/31/2024 10-K):

  • Reported on the HTML cover page:
    • $16,124,636,098
  • Tagged value in XBRL:
    • 16,124,636,098,000,000

The widely used SEC structured data files contain the same value as the tagged XBRL. To see it yourself, download the February 2025 data folder and, after opening it, look at line 7,553 in the sub.tsv file.

That’s not a rounding issue or a scaling choice. It’s a conversion error—effectively inflating the reported public float by a factor of one million.
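A check for this particular failure mode is mechanical, because the bad tagged values are exactly the reported values times one million. A minimal sketch (the function and variable names are illustrative; in practice the tagged value would come from the cover-page public float fact, dei:EntityPublicFloat):

```python
def scale_mismatch(reported, tagged, factor=1_000_000):
    """Flag tagged values that equal the reported value times a suspicious factor."""
    return reported > 0 and tagged == reported * factor

# Packaging Corp of America, 12/31/2024 10-K
reported_float = 16_124_636_098         # HTML cover page
tagged_float = 16_124_636_098_000_000   # value in the XBRL / structured data files
scale_mismatch(reported_float, tagged_float)  # True
```

The same test with other common factors (1,000; 1,000,000,000) would catch the analogous scaling slips.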

What’s especially interesting is that these filings share a common fingerprint. They appear to have been prepared using a specific filing platform, as indicated by metadata embedded directly in the submission document.

<!-- DFIN New ActiveDisclosure (SM) Inline XBRL Document -->
<!-- Copyright (c) 2025 Donnelley Financial Solutions, Inc. -->

(Note: the creation dates varied across the roughly 30 affected filings, so this was not an issue tied to a specific date.)

Based on the above, this doesn’t look like a company-specific mistake. It looks like a systematic transformation issue introduced during the filing conversion process. However, it’s more complicated than that. Based on my analysis, this software was used to file 892 10-Ks in 2025 (I excluded 10-K/As), but the problem seems to have occurred in only roughly 30 of them.

Why this matters

If you rely on the SEC’s structured data endpoint you’ll ingest the overstated value without any obvious warning. The number is valid XBRL; it parses cleanly. Unless you validate it against the rendered filing, it looks perfectly legitimate. Why would you question it?

That’s a problem for:

  • researchers using filer-size thresholds in their analysis,
  • analysts filtering by public float,
  • and anyone using public float as a screening criterion.

A simple safeguard

In our own workflows we actually start with the HTML filing itself and then map tagged XBRL values back to what the company actually reports on the cover page. This makes discrepancies easier to flag. Even in our client-facing databases we report both the original value as reported in the HTML and the tagged value, as you can see in the next image.

This imposes additional work, but it also gives you a point of reference that is lost with the raw XBRL data no matter how you access it. Ingesting just the XBRL gives no reference point for the truth. Inline XBRL makes more data accessible; the HTML provides a clear way of evaluating the results.
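As a rough sketch of that safeguard, assuming the cover-page text has already been extracted from the HTML (the regex and names here are simplified illustrations, not our production code):

```python
import re

def cover_page_amount(cover_text):
    """Pull the first $-prefixed dollar amount from cover-page text (simplified)."""
    m = re.search(r"\$([\d,]+)", cover_text)
    return int(m.group(1).replace(",", "")) if m else None

cover_text = ("the aggregate market value of voting stock held by "
              "non-affiliates was $16,124,636,098")
reported = cover_page_amount(cover_text)
tagged = 16_124_636_098_000_000  # value from the tagged XBRL
discrepancy = reported is not None and tagged != reported  # True -> flag for review
```

A real cover page needs more careful extraction (multiple amounts, scaling language like "in millions"), but the principle is the same: the rendered document is the reference.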

Conclusion

When the first XBRL filings were introduced, we were ready—and genuinely excited. The promise was compelling: standardized, machine-readable financial statements that could be reliably constructed directly from tagged data.

That excitement didn’t last long.

As we began validating what we were deriving from the filings, it became clear that the work required was far more extensive than anticipated. In one early effort, we constructed roughly 5,000 income statements from XBRL data and compared them to the original financial statements. Approximately 7% contained problems.

What made this especially frustrating was not the error rate itself, but the nature of the errors. We were unable to devise any rule, heuristic, or algorithm that could reliably identify which statements were wrong. The issues were only visible when a constructed statement was compared to the one actually presented in the filing. So XBRL alone could not serve as both the starting point and the endpoint.

Inline XBRL has materially improved this situation. By embedding tagged data directly within the rendered document, it provides the necessary context to test, validate, and reconcile structured data against what filers actually report. That context doesn’t eliminate errors—but it makes them observable.

The lesson hasn’t changed: structured data creates opportunity, but only when it is paired with validation, traceability, and context. Inline XBRL doesn’t solve data quality problems on its own—but it finally gives us the tools to see them.

Should We Make Up Data?

We are preparing to release a fairly comprehensive audit fee database. It has been a long time coming, but one of the real challenges is that some of the underlying data is wonky, and it is what it is. I am spending time devising tests to evaluate the quality of our parsing and data-assignment code. One of the tests I dreamed up checks whether the date of the audit report precedes the dissemination date by more than 10 days. Here is a screenshot of the audit report for CSW Industrials (CIK 1624794) for the year ended 3/31/2022.

Here is a screenshot of the auditor’s signature:

It is not a huge issue; the filing was made on 5/18/2022, and perhaps someone felt harried when doing the final checks. For us, though, it raises a significant question: when do we have liberty to change data after it is reported? I am very reluctant to make changes because that seems like a slippery slope. What is the source of truth? As an aside, the auditor submitted an Exhibit-18 describing the change from LIFO to FIFO that was filed with the 10-K. That has the date 5/18/2021.

We have identified about 100 different cases of similar/adjacent issues. Trying to validate that what we are doing is correct is our primary focus. A secondary issue though is to decide what to do. As I write this I am more inclined to use the 5/18/2021 date just because the 10-K should be the source of truth.
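The 10-day test itself is simple to express. A minimal sketch, assuming both dates have already been parsed (the function name is mine, not from our codebase):

```python
from datetime import date

def suspicious_report_date(report_date, filing_date, max_gap_days=10):
    """Flag audit reports dated more than max_gap_days before the filing,
    or dated after the filing itself."""
    gap = (filing_date - report_date).days
    return gap > max_gap_days or gap < 0

# CSW Industrials: exhibit dated 5/18/2021, 10-K filed 5/18/2022
suspicious_report_date(date(2021, 5, 18), date(2022, 5, 18))  # True (365-day gap)
```

The threshold is a judgment call; the point of the test is only to surface candidates for human review, not to decide which date is correct.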

New Update to TableExtraction & Dehydrator Outputs

We just updated the TableExtraction and Dehydrator code to create new outputs.

For TableExtraction we added a new search_phrase.txt file that is generated only when you use the Search Query option to identify tables for extraction. The file contains both your original set of parameters and the transformation(s) we apply to generate the code that is delivered to the TableExtraction engine. Here is a sample:

  Input:  (acquired or price or assumed) and (goodwill and intangible and asset and liabil)
  Parsed: AND(OR(Contains('acquired'), Contains('price'), Contains('assumed')),
              AND(Contains('goodwill'), Contains('intangible'), Contains('asset'),
                  Contains('liabil')))


The Input line reports the text you submitted, and the Parsed line reports how that text was transformed. We hope this gives you more visibility into why the results contained (or did not contain) particular tables.
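To make the Parsed form concrete, here is a minimal sketch of how that predicate tree could be evaluated against a table's text. This illustrates the semantics of the AND/OR/Contains nodes; it is not the TableExtraction engine itself:

```python
def contains(substr):
    """Case-insensitive substring test, matching the Contains(...) nodes."""
    return lambda text: substr in text.lower()

def and_(*preds):
    return lambda text: all(p(text) for p in preds)

def or_(*preds):
    return lambda text: any(p(text) for p in preds)

# The Parsed line, rendered as an executable predicate
matches = and_(
    or_(contains('acquired'), contains('price'), contains('assumed')),
    and_(contains('goodwill'), contains('intangible'),
         contains('asset'), contains('liabil')),
)

matches("Goodwill and intangible assets acquired; liabilities assumed")  # True
matches("Total revenue for the period")                                  # False
```

Note that `liabil` matches "liabilities" because the test is a plain substring check, which is why truncated stems work as search terms.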

We also redesigned the logging to create a new log file named SnipSummary_DATE_TIME.csv. This file contains all of the details from the input file, and we have added a column named COUNT that reports the number of snips extracted from each source file. Here is a screenshot of this file (I hid all of the metadata columns except for CIK).

The intention is to give you clearer visibility into the results. In the example above I was snipping purchase price allocations, and there is not necessarily an upper limit on the number of those tables that might be reported in any particular 10-K. However, for many of the CIKs above the snipped tables included the Statement of Cash Flows, because it contained all of the strings I set as my parameters. I discovered that easily by first reviewing the snips from those CIKs. There are many cases where you expect only one table from a filing; in that case you might see counts of 2, 3, or 4 and can identify those filings to review.
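With the COUNT column in place, filtering for filings to review is a one-liner. A toy example (the DataFrame stands in for a real SnipSummary file, and the CIKs are made up):

```python
import pandas as pd

# Stand-in for SnipSummary_DATE_TIME.csv with the metadata columns omitted
summary = pd.DataFrame({"CIK": [1111111, 2222222, 3333333],
                        "COUNT": [1, 3, 2]})

# If you expect exactly one table per filing, counts above 1 mark filings to review
review = summary.loc[summary["COUNT"] > 1, "CIK"].tolist()  # [2222222, 3333333]
```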

The MISSING.csv file has also been modified. In the last build the missing list (the documents for which no tables were found) was a txt file. It is now a csv file. All of the metadata from the original file is present, and there is an additional column, DE_PATH. That column lets you run a search focused on just those documents. There is a demonstration of how to use the MISSING.csv file to run that search in this video (the search example begins at the 4’28” mark). What is key here is that you can run the exact same search you ran initially, but the output is limited to only these specific documents.

Finally, another update was made to the Dehydration output. Malformed tables are now saved in a PROBLEM_TABLES subdirectory with an adjacent csv file that has the same name as the snip. The csv file contains all of the usual metadata from Dehydration/Rehydration and then we have parsed each line so that the content from the td/th elements ends up in a single cell. Here is a screenshot of this:

As you can see from that real example, this file will be easy to prepare for your data pipeline. You would just rename COL3, COL5 and COL7, delete COL2, COL4 and COL6, and then delete the two rows below (the FISCAL 2022 DIRECTOR COMPENSATION title row and the row with all of the original column headings). Earlier we stacked these in one csv file, but after working with that for a while we believe this output is much easier to work with.
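The cleanup steps described above translate directly into a few pandas calls. A sketch against a toy frame with the same shape (the column contents here are illustrative, not taken from the screenshot):

```python
import pandas as pd

# Toy stand-in for a PROBLEM_TABLES csv: a title row, the original headings,
# and currency-symbol columns interleaved with the values
df = pd.DataFrame({
    "COL1": ["FISCAL 2022 DIRECTOR COMPENSATION", "Name", "A. Smith"],
    "COL2": ["", "", "$"],
    "COL3": ["", "Fees Earned", "100,000"],
    "COL4": ["", "", "$"],
    "COL5": ["", "Stock Awards", "50,000"],
    "COL6": ["", "", "$"],
    "COL7": ["", "Total", "150,000"],
})

clean = (df.drop(columns=["COL2", "COL4", "COL6"])   # delete the $ columns
           .drop(index=[0, 1])                        # delete title + heading rows
           .rename(columns={"COL3": "Fees Earned",
                            "COL5": "Stock Awards", "COL7": "Total"})
           .reset_index(drop=True))
```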

As a side note – after I saw the results with the SCF tables – I deleted all of the tables and output from that run and changed my Search Query to


  Input:  (acquired or price or assumed) and (goodwill and intangible and asset and liabil) not (invest or financ or operat)

This modification reduced the noise in my output.