New Fields in Tax Recon DB

We are continuing to test and refine our approach to capturing the impact of ASU 2023-09 on the structure of the Effective Tax Rate Reconciliation table. Our work is grounded in the actual table structures that companies have used to implement this expanded disclosure.

One of the key attributes we believe is important to capture is the adoption method (retrospective or prospective). In many filings, the adoption method is not explicitly stated. Instead, it must be inferred from the structure and content of the reconciliation tables. To address this, we developed an algorithm that analyzes table structure and determines the adoption method based on observable disclosure patterns. As a result, we have introduced a new field, ASU_STANDARD, to identify disclosures that conform to ASU 2023-09.

If the ASU_STANDARD field is blank, the associated rows were not disclosed in conformity with ASU 2023-09.

If the field reports ASU_2023_09_PROSPECTIVE, two implications follow for that accession number:

  1. All rows labeled with ASU_2023_09_PROSPECTIVE represent the ASU 2023-09–mandated disclosure.
  2. A second reconciliation table will typically exist for the same accession number that presents the Effective Tax Rate reconciliation using the prior disclosure format. This reflects the prospective adoption method, under which ASU 2023-09 applies only to the current period, while prior periods continue to be presented under the previous disclosure requirements.

As a practical matter, this assumes the filer is not a new registrant, since a newly reporting entity may lack prior-period operating history and therefore may not present a legacy-format reconciliation table.

If the ASU_STANDARD field has the value ASU_2023_09_RETROSPECTIVE, only one logical reconciliation table is expected for that ACCESSION, and that table will typically report up to three years of data.
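
To make these expectations testable, here is a minimal sketch in Python (pandas), assuming the rows are loaded into a DataFrame with the ACCESSION, ASU_STANDARD, and TABLE_NO fields described above. The check is illustrative, not our production logic:

import pandas as pd

# Illustrative check of the table-count implications described above. Assumes
# a DataFrame with ACCESSION, ASU_STANDARD, and TABLE_NO columns.
def flag_unexpected_counts(rows: pd.DataFrame) -> pd.DataFrame:
    summary = rows.groupby("ACCESSION").agg(
        logical_tables=("TABLE_NO", "nunique"),
        prospective=("ASU_STANDARD", lambda s: (s == "ASU_2023_09_PROSPECTIVE").any()),
    ).reset_index()
    # Prospective adopters should usually show two logical tables (the new-format
    # table plus a legacy-format table for the prior periods); retrospective
    # adopters and non-conforming filers should show one.
    summary["expected"] = summary["prospective"].map({True: 2, False: 1})
    return summary[summary["logical_tables"] != summary["expected"]]

Because of the new-registrant caveat above, anything flagged by a check like this goes to review rather than automatic correction.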

The term “table” is used here in a logical rather than strictly physical sense. In many filings, the ASU 2023-09 reconciliation disclosure spans multiple pages in the Form 10-K and is therefore physically presented as two or three separate tables. However, these tables collectively represent a single reconciliation disclosure.

When this occurs, we treat the sequence as one logical table. All rows from the component physical tables are assigned the same TABLE_NO, which corresponds to the index of the first table in the sequence. This distinction between physical tables and logical tables reflects the presentation structure of SEC filings while preserving a consistent analytical unit for downstream processing. This approach ensures that the full multi-page disclosure can be reconstructed and analyzed as a single, continuous reconciliation table.
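
As a trivial sketch of that convention (the PHYSICAL_TABLE_NO name below is hypothetical and used only for illustration):

# Every row from a multi-page disclosure receives the TABLE_NO of the first
# physical table in the sequence.
def assign_logical_table_no(rows, physical_table_indices):
    logical_no = min(physical_table_indices)
    for row in rows:
        if row["PHYSICAL_TABLE_NO"] in physical_table_indices:
            row["TABLE_NO"] = logical_no
    return rows

rows = [
    {"PHYSICAL_TABLE_NO": 4, "label": "Federal statutory rate"},
    {"PHYSICAL_TABLE_NO": 5, "label": "State taxes, net of federal benefit"},
]
assign_logical_table_no(rows, {4, 5})   # both rows now carry TABLE_NO = 4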

There is a structural issue I would like to address, but I think it is too early to do so. The new standard requires separate disclosures by both the jurisdiction and the nature of the tax/benefit if they are 5% or more of the total income taxes paid. The jurisdiction can be identified in two ways. The first is by the country name in the row label, as you can see in the next image:

The problem with relying strictly on the row label is that the row label may be a caption for a subsection. To understand my meaning, review these rows in the database from the 10-K filing submitted by TEVA PHARMACEUTICAL INDUSTRIES INC (CIK: 818686).

If you look closely you will see that none of the rows with the names of foreign jurisdictions have data. That is because those are section identifiers. To determine the country-related tax effect, it is necessary to use the dimensional information provided by the filer. When this information is disclosed, we capture it in the explicit_member_dim field (which describes the nature of the intended disclosure) and the explicit_member_text field (which provides the identifying information), as you can see in the next image.
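
Once those dimensional fields are populated, pulling the country-level effects is straightforward. Here is a minimal sketch; the VALUE and ACCESSION column names, and the substring test on explicit_member_dim, are assumptions for illustration:

import pandas as pd

# Keep only rows that carry data and whose dimensional member identifies a
# jurisdiction; the caption rows with no data fall out automatically.
def country_effects(rows: pd.DataFrame) -> pd.DataFrame:
    has_data = rows["VALUE"].notna()
    has_country_dim = rows["explicit_member_dim"].fillna("").str.contains(
        "country|jurisdiction", case=False
    )
    return rows.loc[has_data & has_country_dim,
                    ["ACCESSION", "explicit_member_text", "VALUE"]]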

The structural issue I would like to address is adding country-name fields with a value of 1/Null. However, I don’t think we have quite enough information yet to do so. We need a much larger sample of these disclosures to develop the right system for mapping each disclosure to the right country.

Speaking of a much larger sample, March 2 is the filing deadline for 12/31 Large Accelerated Filers. For those of you eager for this data to beat the crowd, we are updating our data nightly and our client-facing data weekly. We will also make a special update early in the morning of March 3rd with the filings through 3/2. Be warned, however, that about 2-4% of filers will be one or two days late. Further, the Accelerated Filer deadline follows on 3/16.

When “Structured Data” Isn’t What the Company Reported

One of the promises of inline XBRL was simplicity.
If a data point appears on the cover page of a 10-K and it’s tagged, then—at least in theory—it should be easy to extract, compare, and analyze.

Public float is a good example.

The SEC requires filers to report their public float on the cover page, and with tagged cover pages that value is now explicitly labeled and machine-readable. Many researchers and data users understandably rely on the SEC’s structured data feeds to capture it.

But there is a quiet problem.

In a subset of filings, the tagged public float value does not match what the company actually reports on the cover page.

Here’s what that looks like in practice (Packaging Corp of America 12/31/2024 10-K):

  • Reported on the HTML cover page: $16,124,636,098
  • Tagged value in XBRL: 16,124,636,098,000,000

The widely used SEC structured data files contain the same value as the tagged XBRL. To explore, you can download the February 2025 data folder and, after opening it, look at line 7,553 in the sub.tsv file.

That’s not a rounding issue or a scaling choice. It’s a conversion error—effectively inflating the reported public float by a factor of one million.

What’s especially interesting is that these filings share a common fingerprint. They appear to have been prepared using a specific filing platform, as indicated by metadata embedded directly in the submission document.

<!-- DFIN New ActiveDisclosure (SM) Inline XBRL Document -->
<!-- Copyright (c) 2025 Donnelley Financial Solutions, Inc. -->

Note: the creation date varied across the roughly 30 affected filings, so this was not a specific-date issue.

Based on the above, this doesn’t look like a company-specific mistake. It looks like a systematic transformation issue introduced during the filing conversion process. However, it’s more complicated than that. Based on my analysis, this software was used to file 892 10-K filings (I excluded 10-K/A) in 2025, but the problem seems to have occurred in only roughly 30 of them.

Why this matters

If you rely on the SEC’s structured data endpoint, you’ll ingest the overstated value without any obvious warning. The number is valid XBRL; it parses cleanly. Unless you validate it against the rendered filing, it looks perfectly legitimate. Why would you question it?

That’s a problem for:

  • researchers using filer size thresholds in their analysis,
  • analysts filtering by public float,
  • and anyone using public float as a screening criterion.

A simple safeguard

In our own workflows we actually start with the HTML filing itself and then map the tagged XBRL values back to what the company actually reports on the cover page. This makes discrepancies easier to flag because, even for our client-facing databases, we report both the original value as reported in the HTML and the tagged value, as you can see in the next image.

This imposes additional work, but it also gives you a point of reference that is lost with the raw XBRL data no matter how you access it. Ingesting just the XBRL gives no reference to truth. Inline XBRL makes more data accessible; the HTML provides a clear way of evaluating the results.
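
As a rough sketch of the kind of cross-check involved (not our actual pipeline), comparing the cover-page value with the tagged value and flagging suspicious ratios is enough to catch the example above:

# Compare the public float rendered on the HTML cover page with the tagged
# XBRL value and classify the discrepancy.
def check_public_float(html_value: float, tagged_value: float, tol: float = 0.005) -> str:
    if html_value == 0:
        return "review"
    ratio = tagged_value / html_value
    if abs(ratio - 1) <= tol:
        return "ok"
    if round(ratio) in (1_000, 1_000_000, 1_000_000_000):
        return f"scaling error (x{round(ratio):,})"
    return "review"

print(check_public_float(16_124_636_098, 16_124_636_098_000_000))
# scaling error (x1,000,000)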

Conclusion

When the first XBRL filings were introduced, we were ready—and genuinely excited. The promise was compelling: standardized, machine-readable financial statements that could be reliably constructed directly from tagged data.

That excitement didn’t last long.

As we began validating what we were deriving from the filings, it became clear that the work required was far more exhaustive than anticipated. In one early effort, we constructed roughly 5,000 income statements from XBRL data and compared those to the original financial statements. Approximately 7% contained problems.

What made this especially frustrating was not the error rate itself, but the nature of the errors. We were unable to devise any rule, heuristic, or algorithm that could reliably identify which statements were wrong. The issues were only visible when a constructed statement was compared to the one actually presented in the filing. So XBRL could not be a starting and ending point.

Inline XBRL has materially improved this situation. By embedding tagged data directly within the rendered document, it provides the necessary context to test, validate, and reconcile structured data against what filers actually report. That context doesn’t eliminate errors—but it makes them observable.

The lesson hasn’t changed: structured data creates opportunity, but only when it is paired with validation, traceability, and context. Inline XBRL doesn’t solve data quality problems on its own—but it finally gives us the tools to see them.

Should We Make Up Data?

We are preparing to drop a fairly comprehensive Audit Fee database. It has been a long time coming, but one of the real challenges is that there can be some wonky data, and it is what it is. I am spending time trying to devise tests to evaluate the quality of our parsing and data assignment code. One of the tests I dreamed up is checking whether the date of the audit report precedes the dissemination date by more than 10 days. Here is a screenshot of the audit report for CSW Industrials (CIK 1624794) for the year ended 3/31/2022.

Here is a screenshot of the auditor’s signature:

It is not a huge issue; the filing was made on 5/18/2022, and perhaps someone felt harried when doing the final checks. For us, though, it raises a significant question – when do we have the liberty to change data after it is reported? I am so reluctant to make changes because that seems like a slippery slope. What is the source of truth? As an aside, the auditor submitted an Exhibit-18 describing the change from LIFO to FIFO that was filed with the 10-K. That exhibit has the date 5/18/2021.

We have identified about 100 cases of similar or adjacent issues. Validating that what we are doing is correct is our primary focus. A secondary issue, though, is deciding what to do. As I write this, I am more inclined to use the 5/18/2021 date, simply because the 10-K should be the source of truth.
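
For what it is worth, the screening test itself is simple. A minimal sketch using the dates from this example:

from datetime import date

# Flag filings where the audit report date precedes the dissemination date by
# more than a threshold number of days.
def flag_stale_report_date(report_date: date, dissemination_date: date,
                           max_days: int = 10) -> bool:
    return (dissemination_date - report_date).days > max_days

print(flag_stale_report_date(date(2021, 5, 18), date(2022, 5, 18)))   # True - flag for review

The hard part is not the test; it is deciding what to do with the rows it flags.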

New Update to TableExtraction & Dehydrator Outputs

We just updated the TableExtraction and Dehydrator code to create new outputs.

For TableExtraction we added a new search_phrase.txt file that is only generated when you use the Search Query option to identify tables for extraction. The file contains both your original set of parameters as well as the transformation(s) we apply to generate the code that is then delivered to the TableExtraction engine. Here is a sample:

Input:  (acquired or price or assumed) and (goodwill  and intangible and asset and 
         liabil)
Parsed: AND(OR(Contains('acquired'), Contains('price'), Contains('assumed')),
        AND(Contains('goodwill'), Contains('intangible'), Contains('asset'),  
         Contains('liabil')))


The Input line reports the text you submitted and the Parsed line reports how that text was transformed. We hope this provides more visibility into why the results contained (or did not contain) particular tables.
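
To illustrate what the Parsed line means in practice, here is a minimal sketch (not the TableExtraction engine itself) that evaluates the same expression against a table's text, assuming Contains() is a case-insensitive substring test:

# Evaluate the parsed expression against a candidate table's text.
def contains(text: str, term: str) -> bool:
    return term.lower() in text.lower()

def matches(text: str) -> bool:
    return (
        any(contains(text, t) for t in ("acquired", "price", "assumed"))
        and all(contains(text, t) for t in ("goodwill", "intangible", "asset", "liabil"))
    )

sample = "The purchase price was allocated to goodwill, intangible assets and assumed liabilities."
print(matches(sample))   # True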

We also redesigned the logging to create a new log file named SnipSummary_DATE_TIME.csv. This file contains all of the details from the input file, and we have added a column named COUNT that reports the number of snips extracted from each source file. Here is a screenshot of this file (I hid all of the metadata columns except for CIK).

The intention is to give you clearer visibility into the results. In the example above I was snipping purchase price allocations, and there is not necessarily an upper limit on the number of those tables that might be reported in any particular 10-K. However, for many of the snips above the snipped tables included the Statement of Cash Flows (as it included all of the strings I set as my parameters). I discovered that easily by initially reviewing the snips from those CIKs. There are many cases where you expect only one table from a filing – in that case you might see counts of 2/3/4 and can identify those to review.
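
A quick way to act on the COUNT column, assuming you expect one table per filing (the file name below is a placeholder for the actual date/time stamp):

import pandas as pd

# Load the new log and surface the filings that produced more snips than expected.
summary = pd.read_csv("SnipSummary_DATE_TIME.csv")
suspect = summary[summary["COUNT"] > 1].sort_values("COUNT", ascending=False)
print(suspect[["CIK", "COUNT"]])   # start your review with these CIKs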

The MISSING.csv file has also been modified. In the last build, the missing list (the list of documents for which no tables were found) was a txt file. It is now a csv file. All of the metadata from the original file is present, and there is an additional column, DE_PATH. That column exists so that you can run a follow-up search focused on just those documents. There is a demonstration of how to use the MISSING.csv file to run that search in this video (the search example begins at the 4’28” mark). What is key here is that you can run the exact same search you ran initially, but the output is limited to only these specific documents.

Finally, another update was made to the Dehydration output. Malformed tables are now saved in a PROBLEM_TABLES subdirectory with an adjacent csv file that has the same name as the snip. The csv file contains all of the usual metadata from Dehydration/Rehydration and then we have parsed each line so that the content from the td/th elements ends up in a single cell. Here is a screenshot of this:

As you can see from that real example, this file will be easy to prepare for your data pipeline. You would just rename COL3, COL5 and COL7, delete COL2, COL4 and COL6, and then delete the two rows below (FISCAL 2022 DIRECTOR COMPENSATION and the row that has all of the original column headings); a sketch of these steps follows below. Earlier, we stacked these in one csv file, but after working with that for a while we believe this output is much easier to work with.
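
Here is that cleanup in pandas; the file name and the replacement column names are hypothetical, while the COLn names come from the output itself:

import pandas as pd

df = pd.read_csv("PROBLEM_TABLES/example_snip.csv")   # placeholder file name

df = df.drop(columns=["COL2", "COL4", "COL6"])        # spacer columns
df = df.rename(columns={"COL3": "FEES_EARNED",        # pick names that fit your pipeline
                        "COL5": "STOCK_AWARDS",
                        "COL7": "TOTAL"})
# Drop the caption row (FISCAL 2022 DIRECTOR COMPENSATION) and the row that
# repeats the original column headings, assuming they are the first two rows.
df = df.iloc[2:].reset_index(drop=True)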

As a side note – after I saw the results with the SCF tables – I deleted all of the tables and output from that run and changed my Search Query to


Input:  (acquired or price or assumed) and (goodwill  and intangible and asset and 
         liabil) not (invest or financ or operat)

This modification reduced the noise in my output.  

Details matter – another benefit of adding COFILERS

We are frantically trying to wrap up a complete auditor/audit fee database. We need to plug a pretty significant hole with respect to the AUDITOR, TENURE, REPORT DATE and LOCATION for 10-K filings made in 2020. Why only that year? That is another story.

I ran a search for ((DOCTYPE contains(10k*)) AND date(1/1/2020- 12/31/2020) pre/5 we have served). In English, I am looking for cases where there is a date in the range of 1/1/2020 through 12/31/2020 and the date precedes the phrase “we have served” by no more than five words. Further, I wanted to limit my search to only 10-K or 10-K/A documents. (Note – most auditors report their tenure before the report date; for those searches we changed the order of the date and the “we have served” phrase.) Below are the results of the search where the date precedes the tenure.

I set a box around the Baltimore Gas & Electric Co. (BG&E) summary to highlight that there were 55 matches in the document. There were 55 matches in that filing because the filing has nine cofilers (and because what counts as a match is a bit more complicated). Here is a screenshot of the context extraction from that search, with a focus on the lines belonging to BG&E. As you can see, PricewaterhouseCoopers served as the auditor for each of the subsidiaries included (the parent company is Exelon Corp). However, the context does not identify the entity to which the block of text refers.

As a researcher you have to decide if you want this data. If you need it, you will actually find 81 total rows in the context extraction results (9 10-K filings with 9 disclosures in each 10-K filing). Unfortunately, the only way to identify the correct disclosure per CIK is to dig back into the filing. It turns out that the disclosure that belongs to BG&E is the first one in the list above, with the phrase “since at least 1993”.

I actually imagine in many cases that these particular disclosures are not important because you are likely to eliminate these types of filers from your sample. Again though, if they are important it is critical to review the duplicates carefully to match the data to the Central Index Key that it is associated with.

As a random aside – it is/was interesting to me to observe that most of the cases with this form of the disclosure (Auditor, location, date and then tenure) were found in filings with COFILERS. The more common form of the disclosure (as pictured below) had many fewer total COFILERS in the results.

I ran these searches separately to simplify the parsing of the data (AUDITOR, TENURE, LOCATION CITY, LOCATION STATE, REPORT DATE) from the context. I am lazy; I wanted more certainty about the order I was expecting the analysis to follow so the parsing code had less to test.
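
For those who want to replicate the idea outside the search tool, here is a rough regex approximation of the date pre/5 “we have served” pattern (this is not the search engine’s own syntax, and the sample string is invented):

import re

# A month-name date followed by at most five words and then "we have served".
pattern = re.compile(
    r"(?P<date>(January|February|March|April|May|June|July|August|September|"
    r"October|November|December)\s+\d{1,2},\s+2020)"
    r"(?:\W+\w+){0,5}?\W+we\s+have\s+served",
    re.IGNORECASE,
)

sample = ("Philadelphia, Pennsylvania February 11, 2020 "
          "We have served as the Company's auditor since at least 1993.")
m = pattern.search(sample)
print(m.group("date") if m else "no match")   # February 11, 2020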

More AI Automation Does Not Make the World Better

Much earlier in the week I posted about Excel auto-populating a new mapping based on the column headings found in a collection of tables that report the Equity Compensation Plan Information disclosure required of SEC Filers. Today, I am not frustrated but I do want to share my experience – particularly since I know some folks are all into AI.

I am still working on that table and I am reviewing the labels before we start discussing moving this table into production. We are too small to have a human compare the original tables with the mappings, so this step has to be handled with significant care. We have 908 original column headings pulled from about 1,600 original tables. These column headings should generally map into three semantic concepts: the number of shares to be issued, the weighted-average exercise price of any options reflected in that number of shares, and the remaining number of shares available for future issuance. In about 2% of the cases there are four columns of data presented, but the overwhelming structure is consistent with what is set out in the CFR section that mandates this disclosure.

I wanted to do a test to accomplish two goals. First, I wanted an independent review of the mappings. Since I had completed 909 rows, I thought asking Manish to review it would be worse than drudgery. I also wanted to evaluate whether or not I could have saved the approximately four hours the original mapping took. In other words, could I trust this tool for this task? I spent about an hour outlining the problem to a very popular AI tool and submitted the list of column headings. Below is a screenshot of the original labels as parsed from the disclosures.

I then merged the output from that process with my original labels to identify any differences. Here is a screenshot of the cases where the tool’s mapping did not match my original mapping.
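
The comparison itself was a simple merge. A sketch, with hypothetical file and column names:

import pandas as pd

# Merge the tool's mappings with my original mappings on the heading text and
# keep only the disagreements.
mine = pd.read_csv("original_mappings.csv")   # columns: ORIG_LABEL, MY_MAPPING
tool = pd.read_csv("ai_mappings.csv")         # columns: ORIG_LABEL, AI_MAPPING

merged = mine.merge(tool, on="ORIG_LABEL", how="outer", indicator=True)
diffs = merged[(merged["_merge"] != "both") | (merged["MY_MAPPING"] != merged["AI_MAPPING"])]
print(diffs[["ORIG_LABEL", "MY_MAPPING", "AI_MAPPING"]])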

This is not horrible but it is concerning. My instructions described three main concepts – TO_BE_ISSUED, AVERAGE_EXERCISE_PRICE and REMAINING. I did not recall any heading that seemed to map into a concept of OUTSTANDING, and I do not think any of the above are real examples of cases where, in this context, OUTSTANDING is a better choice than TO_BE_ISSUED. The first two do have the word OUTSTANDING, but I would argue with an editor about the word choice if I saw the word OUTSTANDING used in this context.

I asked the AI tool why, and its response really seemed to be more of a justification than a valid answer. Here is an excerpt:

Step B — But the phrase also contains both a past and future state

“issued or to be issued”

This is a compound construction:

“issued” → past

“to be issued” → future

In equity-comp language, humans sometimes use:

“issued and outstanding” interchangeably (sloppily)

“issued or issuable” to refer to currently outstanding awards

The diagnostic logic (incorrectly, in hindsight) treated:

“issued OR to be issued”
as a state aggregate rather than a flow category

That aggregate was internally labeled OUTSTANDING — meaning:

“currently existing instruments, whether already issued or issuable”

Maybe an English major would draw the same conclusion – I’ll have to ask one.

Unfortunately, I did not accomplish what I set out to accomplish – which was an evaluation of my original labels. So I took a different tack, provided my mappings, and asked my trusty intern (the AI tool) to evaluate them. According to my intern, my mappings were perfect. I doubted that; any reader of this blog knows I am an expert error maker. So I went through the mappings line by line. Hugely tedious, but I found two errors: one case where I mapped “OF SECURITIES REMAINING AVAILABLE FOR FUTURE ISSUANCE UNDER EQUITY COMPENSATION PLANS (EXCLUDING SECURITIES REFLECTED IN COLUMN )” to TO_BE_ISSUED, and one where I mapped a label that communicated information about the exercise price to REMAINING.

This is interesting to me because I am trying to figure out the value of somehow integrating AI into our client-facing tools. I understand there is no joy in doing the original mappings and then checking them. Wouldn’t it be hugely amazing to offload that work completely? I just don’t see it yet. I think we perhaps need to streamline (pre-map) the original labels with better cleaning, so maybe my 908 rows could be reduced to 300 or so. But if I have to stand by the results (in my case it is the quality of the data we deliver; in yours it might be a really interesting and novel research finding that you want to publish), I am not ready to have any of the existing tools take over data cleaning. They are a huge help in other ways, but when it comes down to verifying the accuracy of data, I think we have some ways to go.

I will say there is probably some significant value in letting the tools do the heavy lifting. If in fact you can take the time to carefully explain the nature and features of the disclosures and you dump a significant list, then you are likely to save some initial time. But I would also think about how to audit the results.

Survey in Next Update Email

We have been working to add some new data to the platform: specifically, the breakdown of the Equity Compensation Plan Information (EQCPI) disclosure. I have been evaluating the tests we need to identify the cases that need to be inspected. As good as our parser is, we still get some bad parses, and we need to be able to systematically identify those and shunt them to a review queue for a person to look at. For context, with Executive or Director Compensation, if the sum of the reported compensation differs from the reported total by more than $10,000, a human looks to see if the issue can be addressed (maybe a value is reported as 123.456 and on closer examination it appears that it should have been reported as 123,456).

What we are seeing in the EQCPI disclosures are errors in the disclosures whose source we can’t initially be certain of. Here is what I saw in a summary file after I ran the code we are working on to identify potential errors.

This was reported in a proxy filing made by OXFORD INDUSTRIES (CIK: 75288). If I were collecting this data for research I would be concerned about the reliability of these values since there is no total for TO_BE_ISSUED. However, this is what was reported (other than the normalization choices we made) as you can see in the next image.

So while this merits review, the captured data accurately represents the disclosure that was made. For internal purposes, though, I need to be more confident that the data was correctly captured. I have decided that we should stop ignoring dashes and dash-like characters and instead replace them with something consistent. Currently, we replace dashes with nothing. I am leaning toward replacing them with something clever like wasDASH. I am also thinking about replacing blanks (the value for total under Number of Securities . . .) with wasBLANK.
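
A minimal sketch of the rule I have in mind (the character class used to detect dash-like characters is an approximation):

import re

# Dash-like cells become wasDASH, empty cells become wasBLANK, everything else
# passes through untouched.
DASH_LIKE = re.compile(r"^[\-\u2010-\u2015\u2212]+$")   # hyphens, en/em dashes, minus sign

def normalize_cell(value: str) -> str:
    text = (value or "").strip()
    if text == "":
        return "wasBLANK"
    if DASH_LIKE.match(text):
        return "wasDASH"
    return text

print([normalize_cell(v) for v in ["123,456", "—", "", "-"]])
# ['123,456', 'wasDASH', 'wasBLANK', 'wasDASH']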

The advantage of doing so is that we can significantly reduce the number of cases we have to review. If we do this internally, no big deal. However, I was thinking about whether or not this would benefit you when you are normalizing tabular data. Here is a screenshot of what I am considering.

In some ways this might seem excessive, but I don’t know what I don’t know. My point is that to get it right we have to consider that something happened between extraction and dehydration. My lovely wife will tell you quickly that I am not perfect; the point I am trying to make is that some of these HTML source files can have a shape and structure we have never seen before, so the error could be that one of the assumptions embedded in our code is wrong for this case.

The next time I send out an update I am going to survey you as to whether or not we should make this part of the normalization so that you see these values when the tables are rehydrated.

AI Automation Does Not Make the World Better

Is it an age thing? I really dislike all of the ‘AI’ junk that invades my work! How about when you are trying to compose an email and your provider happily types in front of you? Today I am normalizing column headings extracted from tables that report the status of shares to be issued and available in Equity Compensation Plans (often reported in ITEM 12 of the 10-K). Step 1 is tedious because of slight but potentially meaningful differences in column headings, even though the actual column headings are specified by 17 CFR 229.201.

I want to map the standard column headings into TO_BE_ISSUED, AVERAGE EXERCISE PRICE and AVAILABLE. So I am carefully reading the original column headings and making the determination. I accidentally hit Enter when Excel analyzed my typing, decided I was entering AVAILABLE, and generously populated the cells below – with nonsense no less. Here are some of the 64 automatic mappings that were auto-populated:

ORIG_LABEL → NEW_LABEL (from Excel)

NUMBER OF SECURITIES AVAILABLE FOR FUTURE ISSUANCE UNDER EQUITY COMPENSATION PLAN (EXCLUDING SECURITIES REFLECTED IN COLUMN) (# OF SHARES) → AVAIABLE
WEIGHTED AVERAGE EXERCISE PRICE OF OUTSTANDING OPTIONS AND RIGHTS 1 EQUITY COMPENSATION PLANS APPROVED BY SHAREHOLDERS: → AVERAGE
PLAN CATEGORY NUMBER OF SECURITIES TO BE ISSUED UPON EXERCISE OF OUTSTANDING OPTIONS, WARRANTS AND RIGHTS → CATEGORY
EQUITY PLAN SUMMARY COLUMN NUMBER OF SECURITIES TO BE ISSUED UPON EXERCISE OF OUTSTANDING OPTIONS, WARRANTS, AND RIGHTS → COLUMN
NUMBER OF SECURITIES TO BE ISSUED UPON EXERCISE OF OUTSTANDING OPTIONS, WARRANTS AND RIGHTS AND THE VESTING OF UNVESTED RESTRICTED STOCK UNITS → EXERCISE
EQUITY COMPENSATION PLAN INFORMATION NUMBER OF SECURITIES TO BE ISSUED UPON EXERCISE OF OUTSTANDING OPTIONS WARRANTS AND RIGHTS → INFORMATION
NUMBER OF SECURITIES TO BE ISSUED UPON EXERCISE OF OUTSTANDING OPTIONS, WARRANTS AND RIGHTS OR SETTLEMENT OF RESTRICTED STOCK UNITS → ISSUED
NUMBER OF SECURITIES TO BE ISSUED UPON EXERCISE OR SETTLEMENT OF OUTSTANDING OPTIONS, WARRANTS AND RIGHTS → OF
EQUITY COMPENSATION PLAN INFORMATION AT DECEMBER 31, 2024 WEIGHTED-AVERAGE EXERCISE PRICE OF OUTSTANDING OPTIONS, WARRANTS AND RIGHTS → PLAN
NUMBER OF DEFINED SECURITIES TO BE ISSUED UPON EXERCISE OF OUTSTANDING OPTIONS, WARRANTS AND RIGHTS DECEMBER 31, 2024 → SECURITIES
NUMBER OF SECURITIES THAT MAY BE ISSUED UPON EXERCISE OR VESTING OF OUTSTANDING OPTIONS, WARRANTS AND RIGHTS AT TARGET → THATMAY
NUMBER OF SECURITIES TO BE ISSUED UPON EXERCISE OF OUTSTANDING OPTIONS OR SETTLEMENT OF LTIP PSUS AND RSUS → TO
NUMBER OF SECURITIES TO BE ISSUED UPON EXERCISE OF OUTSTANDING OPTION AWARDS, RESTRICTED STOCK UNITS, AND PERFORMANCE STOCK UNITS → TO B ISS
NUMBER OF SECURITIES AVAILABLE FOR FUTURE ISSUANCE UNDER EQUITY COMPENSATION PLANS (EXCLUDING SECURITIES TO BE ISSUED UPON EXERCISE OF OUTSTANDING SHARE OPTIONS) → UNDER
NUMBER OF SECURITIES AVAILABLE FOR FUTURE ISSUANCE UNDER EQUITY COMPENSATION PLANS (EXCLUDING SECURITIES TO BE ISSUED UPON EXERCISE OF OUTSTANDING SHARE OPTIONS) → UPON

Those of you that are avid readers know that I can substitute their for there. I have never lost sleep over those errors – I only have one brain cell. This interference is frustrating and unproductive.

I am not sure how Excel does this – but I think it could border on industrial espionage. It might be that Excel is using a C library similar to FuzzyWuzzy and the analysis and substitution decisions are happening locally. But what if they are communicating with a thread in the cloud? There are two concerns I have about this.

First, what if I am working with confidential data – does #Microsoft see all of our data, and are they able to act on it? Second, I had to correct these errors; am I serving as an uncompensated trainer for Microsoft? The initial predictions were delivered, and I corrected them. Did that info get delivered back to Microsoft, and is that going to feed the monster? Two years ago I would have thought that possibility was idiotic – I am not so sure now.

I published this originally about an hour ago. It took me twenty minutes to clean up the mess that was caused by the original automation. I got back on track, happily making my mappings, and it happened again. I was ready for it this time – when Excel started volunteering to do my work I froze and captured it.

As you can see above, I had typed REM and Excel volunteered AINDING – and look below: if I had not caught that, we would have had ANCEMAINCURITIES – tell me this is not a problem.

What is the truth?

We are desperately trying to finish up a new comprehensive audit fee database. One of the fields is the AUDIT REPORT DATE. But then we run into issues like this one. Here is a screenshot of the audit signature following the audit report included in New Concept Energy’s 10-K for the FYE 12/31/2024. Here is a link to the filing: CIK 105744 10-K.

The dissemination date of this 10-K was 3/24/2025, approximately one year and one week after the signature date. We discovered this while stress testing our data collection strategies. We have a fairly robust regular expression that we use to find these dates, but the discrepancy between the dissemination date and the audit report date warranted further research (we don’t know how we make mistakes until we find the mistakes).
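
For context, the extraction side is the easy part. A simplified sketch of the kind of pattern involved (not our production regex, and the signature text below is invented):

import re

# Pull a signature-block date such as "March 17, 2024".
DATE_RE = re.compile(
    r"\b(January|February|March|April|May|June|July|August|September|October|"
    r"November|December)\s+\d{1,2},\s+\d{4}\b"
)

signature_block = "/s/ Example Audit Firm LLP\nDallas, Texas\nMarch 17, 2024"
match = DATE_RE.search(signature_block)
print(match.group(0) if match else "no date found")   # March 17, 2024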

So now the question is what to do about this? I wonder if this is a clue (gosh, I hate giving away this research idea): is this a clue that the statement that the “Company’s internal control over financial reporting was effective as of December 31, 2024” should perhaps have read “ineffective”? Note, the quote is from the 10-K, but the date issue, while perhaps trivial, might be evidence that the company does not have effective controls. I started this paragraph wondering what we should do about this. We have a couple of choices:

  1. Leave as is.
  2. Leave as is and flag.
  3. Make an educated guess and change the date to 3/17/2025.
  4. Leave the field blank.

I think what we need to do is confirm that we correctly pulled the value that is being asserted (the audit report date in this case), leave the value if we have collected it correctly, and flag the issue.

I teased this in the middle of the prior paragraph – the number and nature of the errors that we find is kind of interesting. Could this be evidence of ineffective controls, in that the company does not have enough staff to carefully review and evaluate its disclosures?

Another New Dehydrator Feature

As illustrated in the screenshot below – I updated the Dehydrator to include what I have cleverly named the hydrator-{n}-grid-data.csv.

The n represents the count of dehydration artifacts in the directory that was analyzed. This new output reports the results of dehydration on a row-by-column basis. To keep the focus on the data I have hidden most of the metadata fields in the file, but here is a screenshot of the data values:

The RID represents the index of the data row the value was pulled from. The ROW-LABEL reports the actual content of that row’s first (leftmost) column. The CID represents the column index, where 1 represents the first column to the right of the ROW-LABEL column. The value I have blocked above (818019) can be seen in this table, which was the source of the data.

While it may look like the ROW-LABEL is just JOHN M. HOLMES, the actual HTML tells a different story. In this case both the name and the title are marked in the HTML as data belonging to each of the first three rows.

The entire Dehydration/Rehydration process is designed to allow you to more easily normalize the column headings (and/or row labels) to create a more compact matrix of data when working with hundreds or thousands of snipped tables from the filings. For example, while 3,111 DEF 14A filings filed in 2024 and so far in 2025 used OPTION AWARDS as a column heading, there were 83 other variants in use (OPTION-BASED AWARDS, OPTIONS/SAR AWARDS, . . . AWARDED OPTIONS). With the column-list.csv file you are able to specify one label to be applied to all of these phrases that carry the same semantic meaning. This is great when you want the entire result set normalized.

If you are not looking to normalize the whole table and instead only want particular data values, this feature is designed to help you get faster access to the data you are looking for. In this case, if all you wanted were option-like awards, you can easily sort on the relevant words in the COLUMN-LABEL and that data is quickly available in a very compact form, as sketched below. I had two different PhD students describe this issue (not with compensation data) over the summer, so I wrote the code to address their particular problems and have finally been able to integrate it into dehydration. This does not add much overhead to the Dehydration process, so after agonizing a bit about offering a switch for it I just decided to make it a default feature.
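
Here is what that quick access might look like, assuming header spellings that follow the description above (RID, CID, ROW-LABEL, COLUMN-LABEL) plus hypothetical CIK and VALUE columns:

import pandas as pd

# Pull just the option-like award columns from the new grid output.
grid = pd.read_csv("hydrator-12-grid-data.csv")   # hypothetical run where n = 12

option_like = grid[grid["COLUMN-LABEL"].str.contains("OPTION", case=False, na=False)]
print(option_like[["CIK", "ROW-LABEL", "COLUMN-LABEL", "VALUE"]])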

I am going into the weeds next with this. There is another very important reason for this artifact. We have been trying to identify a reliable but auditable way to reduce the column headings and row labels that are written to the column-list.csv file. We can’t really do this until we have access to all of the column headings and row labels in a particular collection of tables. This output gives us that access. If I am analyzing every single column heading at one time, then I can apply another set of rules to reduce them, as long as I can also keep track of where each particular column heading was used and have a way to share that information with you when complete. Seeing a label in a list of labels is not quite as useful as seeing it in the actual context adjacent to the other labels and data values.

In addition to the creation of this new artifact we did some more tweaking to the parser rules to improve the header and data row separation process. The new version incorporates these additional rules.