Two New Databases Added to the Platform

As I have mentioned before we are trying to port all of the data from the existing ExtractionPreprocessed distribution system to our new environment. Of course this is moving much slower than we would like. However, while some of the challenges relate to discovering things we didn’t know before – some of the challenges have been due to our attempts to improve the process of collecting the data from top to bottom.

I am delighted to tell you that while we are not finished porting the old TAXRECON data over to the new system – we have made significant progress and have introduced two new databases for you to use. We now have the TAXRECON and a DEFERRED_TAX databases available for your research enjoyment!

You can see them in the image above, I appended the word TEST to each of these just in case one of the tax experts tells me that I have made a mistake (because I didn’t know something I didn’t know I needed to know!).

To start with we identified all 10-K filings made in 2019 through April 7th or so 2023 (hint – these will be updated monthly). We eliminated all 10-K’s that were filed by a non-US filer in this first iteration and those that were associated with multiple CIKs (more later on these choices). We then found the tax recon and deferred tax asset/liability tables in the 10-K (and Exhibit-13s), parsed and normalized them. We then matched the data from the html to the tagging in the iXBRL instance documents to collect the tag used in the XBRL. This tag in the database in a name field. There is also a field called orig_row_label. This field has the label that was displayed in the html.

We were struggling with the best way to present this data. After consulting with a number of clients we decided to present it in a manner as close to the original organization as possible. Rather then using the original column labels which are generally representative of a balance sheet date (deferred tax data) or fiscal period (tax recon data) we analyzed the column headings and labeled the column with the data for the fiscal year ended as CURRENTYEAR. We found as many as four periods presented and so the prior year data columns were labeled PRIORYEAR, TWOPRIOR and THREEPRIOR.

Adjacent to each data column we have columns that report the date values that would have been in the data value column heading. Because the taxrecon data summarizes data for a fiscal year we report CURRENTYEAR_START_DATE and CURRENTYEAR_END_DATE (and appropriate labels for the additional columns that exist). The deferred tax data represents balances as of a particular date so we label the date related columns as CURRENTYEAR_FYE, PRIORYEAR_FYE, . . .

In those cases where the registrant reports their taxrecon data in both dollars and percentages we captured both and stack them in the database. Boeing is one example. When Boeing filed their 10-K in 2020 for the fiscal year ended 12/31/2019 they began reporting their tax recon data in both percentages and dollars (prior years were reported in percentages).

Here is a partial view of how we organized this data – there is a field in the database that indicates the nature of the presentation (DOLLAR/PERCENT) at both the table level and the row level.

The rows above are not really adjacent to one another (not that it matters), I am just trying to illustrate that we captured all of the data and organized it at least initially by table type. I cut off the name column because I wanted the focus to be on the DOLLARS/PERCENT presentation – but you can see we have the label that was used in the html as well as the tag.

The next step in this process is to consolidate all of our existing taxrecon data and reorganize it for this format. While we do that we will also use the XBRL to pull the tags so we can add the name field for this data as well. We don’t have prior year deferred tax data so we will also be working to parse and add that.

One of the reasons I started mucking around with this stuff again in November was because of some assertions I had heard from the SEC and others about the quality of the data available through the iXBRL. I was actually pretty eager and imagined that we could make short work of this by identifying the taxrecon table in the iXBRL instance and speeding up our collection and normalization process. Once again, the real-life XBRL fails to live up to the potential it has because of how it is created and vetted. One of the most important things we learned is that we absolutely cannot rely on indications in the XBRL about the sign (+/-) of the reported numeric values. We abandoned that idea fairly early and instead use the presentation information in the html to assign a sign to the data in our database. I don’t have the exact stats but as I recall when we compared signed values we pulled from the html with the values from the XBRL instances the sign discrepancy was greater than about 10%. This is brutal and very boring work. We persisted because we were hoping that there was some clue somewhere that would allow us to rely on the iXBRL data. It was just too uncertain. The fact is that we (outsiders) don’t know for sure if a number should be reported as a subtraction or an addition so we have to rely on the fact that presumably the totals were audited and the disclosures in the html were screened for accuracy.

To finish off this post I want to share my experience using ChatGPT to access this data. I had a sense from working in the middle of the data that the addition of the name field would save some effort on your part. I believe it does but not as much as I imagined. Anyway, once Manish had the final databases available I wanted to know how many unique values were present in the orig_row_label column. I was in a hurry and am still a real amateur with SQLITE so I asked for some help:

I asked ChatGPT the following – I have an sqlite3 db at “S:\directEDGAR_DATA\TAX_RECON_TEST.db” the db has a table named TAXRECON there is a field named name – I want to know how many distinct/unique values there are in this field using Python 3.9.

Below is the code that was generated. I connected to directEDGAR on AWS (the same way you would connect), opened Idle and pasted the code. This time it ran immediately.

import sqlite3

# Connect to the database
conn = sqlite3.connect("S:\directEDGAR_DATA\TAX_RECON_TEST.db")

# Create a cursor object to execute queries
cursor = conn.cursor()

# Execute a query to count the number of distinct values in the name field
query = "SELECT COUNT(DISTINCT orig_row_label) FROM TAX_RECON"
cursor.execute(query)

# Fetch the result and print it
result = cursor.fetchone()[0]
conn.close()
print(f"There are {result} distinct values in the name field.")

I have to admit – I was really disappointed – there are 15,626 unique values for name. That is clearly another significant problem with XBRL. That compares to the fact that there are 22,436 unique values for orig_row_label. So name reduces the noise but I was really hoping for a more significant decrease as compared to the row labels.

I asked for a code modification to generate a list of all unique values for name and their frequency. The query changed to

# Execute a query to get the frequency of each unique value in the name field
query = "SELECT name, COUNT(*) FROM TAXRECON GROUP BY name"
cursor.execute(query)

I then iterated through the results to print all cases where the frequency was greater than 900. Here are those results:

('us-gaap:EffectiveIncomeTaxRateContinuingOperations', 8443)
('us-gaap:EffectiveIncomeTaxRateReconciliationAtFederalStatutoryIncomeTaxRate', 8509)
('us-gaap:EffectiveIncomeTaxRateReconciliationChangeInDeferredTaxAssetsValuationAllowance', 5494)
('us-gaap:EffectiveIncomeTaxRateReconciliationChangeInEnactedTaxRate', 2116)
('us-gaap:EffectiveIncomeTaxRateReconciliationForeignIncomeTaxRateDifferential', 3472)
('us-gaap:EffectiveIncomeTaxRateReconciliationNondeductibleExpense', 1572)
('us-gaap:EffectiveIncomeTaxRateReconciliationNondeductibleExpenseShareBasedCompensationCost', 2359)
('us-gaap:EffectiveIncomeTaxRateReconciliationOtherAdjustments', 5573)
('us-gaap:EffectiveIncomeTaxRateReconciliationOtherReconcilingItemsPercent', 1146)
('us-gaap:EffectiveIncomeTaxRateReconciliationShareBasedCompensationExcessTaxBenefitAmount', 950)
('us-gaap:EffectiveIncomeTaxRateReconciliationShareBasedCompensationExcessTaxBenefitPercent', 1297)
('us-gaap:EffectiveIncomeTaxRateReconciliationStateAndLocalIncomeTaxes', 7077)
('us-gaap:EffectiveIncomeTaxRateReconciliationTaxCredits', 1688)
('us-gaap:EffectiveIncomeTaxRateReconciliationTaxCreditsResearch', 1577)
('us-gaap:IncomeTaxExpenseBenefit', 6812)
('us-gaap:IncomeTaxReconciliationChangeInDeferredTaxAssetsValuationAllowance', 4761)
('us-gaap:IncomeTaxReconciliationChangeInEnactedTaxRate', 2166)
('us-gaap:IncomeTaxReconciliationForeignIncomeTaxRateDifferential', 3106)
('us-gaap:IncomeTaxReconciliationIncomeTaxExpenseBenefitAtFederalStatutoryIncomeTaxRate', 7124)
('us-gaap:IncomeTaxReconciliationNondeductibleExpense', 1628)
('us-gaap:IncomeTaxReconciliationNondeductibleExpenseOther', 924)
('us-gaap:IncomeTaxReconciliationNondeductibleExpenseShareBasedCompensationCost', 2414)
('us-gaap:IncomeTaxReconciliationOtherAdjustments', 4892)
('us-gaap:IncomeTaxReconciliationOtherReconcilingItems', 1775)
('us-gaap:IncomeTaxReconciliationStateAndLocalIncomeTaxes', 6213)
('us-gaap:IncomeTaxReconciliationTaxContingencies', 927)
('us-gaap:IncomeTaxReconciliationTaxCredits', 1602)
('us-gaap:IncomeTaxReconciliationTaxCreditsResearch', 1322)
('us-gaap:IncomeTaxReconciliationTaxExemptIncome', 1246)

directEDGAR

Search, Extraction & Normalization Engine

Two New Databases Added to the Platform

Like this:

Related

One thought on “Two New Databases Added to the Platform”

Leave a ReplyCancel reply

Share this:

Like this:

Related

One thought on “Two New Databases Added to the Platform”

Leave a ReplyCancel reply

Discover more from directEDGAR