XBRL Data – Interesting Assumption in the Research

For a number of reasons I have been reading some of the academic research that uses XBRL data. There is a fascinating underlying assumption – this is not a quote, but the sense of the following statement pervades a fair number of these papers: before the introduction of tagged financial data, users were constrained to either wait until the commercial data services normalized the data and added it to their databases, or they were stuck consuming 10-K filings one at a time.

I don’t think that is true at all. It took me about 10 minutes to write 28 lines of Python code to parse the income statement for Apple from every 10-K filed from 2005 to 2023. Here is a screenshot of one of those tables:

Here is the code. It relies on a directEDGAR-generated summary file: I searched for the 10-K filings of Apple and saved the summary file to get the file paths.

from lxml import html, etree
import csv

# Match tables whose lower-cased text contains all of the income-statement
# signal words (basic, diluted, operating, research, years) and none of the
# words that indicate a quarterly or "as reported" table.
xpath_expr = (
    "//table["
    "contains(translate(., 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'basic') and "
    "contains(translate(., 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'diluted') and "
    "contains(translate(., 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'operating') and "
    "contains(translate(., 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'research') and "
    "contains(translate(., 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'years') and "
    "not(contains(translate(., 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'quarter') or "
    "contains(translate(., 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'months') or "
    "contains(translate(., 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'reported'))"
    "]"
)

with open(r"H:\apple\summary.csv") as csv_fh:
    csv_reader = csv.DictReader(csv_fh)
    for row in csv_reader:
        file_path = row['FILENAME']
        # The CIK and report date are components of the directEDGAR file path.
        cik, rdate = file_path.split('\\')[5:7]
        with open(file_path, 'r', encoding="utf-8") as doc_fh:
            filing = doc_fh.read()
        apple_tree = html.fromstring(filing)
        tables = apple_tree.xpath(xpath_expr)
        for indx, table in enumerate(tables):
            # Wrap each matched table in a minimal html/body shell so it can
            # be saved as a standalone document.
            new_html = etree.Element("html")
            body = etree.SubElement(new_html, "body")
            body.append(table)
            new_html_str = etree.tostring(new_html, pretty_print=True, method="html", encoding="unicode")
            # Replace non-breaking spaces with ordinary spaces.
            new_html_str = new_html_str.replace('\xa0', ' ')
            new_name = '-'.join([cik, rdate, str(indx + 1)]) + '.htm'
            with open('H:\\apple\\newtables\\' + new_name, 'w', encoding='utf-8') as out_fh:
                out_fh.write(new_html_str)

It took me another couple of minutes to normalize the row labels – I did that with directEDGAR’s Dehydrator/Rehydrator tools.

When our Dehydrator/Rehydrator features were added to directEDGAR, I had four calls from some of the largest hedge funds in the US. My sense after the second call was that they were wondering whether we had something better than what they were using. Think about it: if you had a billion dollars plus to invest in, say, 2007, wouldn’t you have a team that could parse out the critical data you wanted for your trading strategy? I was pretty sure the folks I spoke with were way ahead of where we were when I had those conversations.

My point is that I don’t think XBRL changed anything for these folks. In my opinion, especially before the introduction of iXBRL, it was a lot easier to parse the financial statements from the HTML than to attempt to construct them from the XBRL. One reason is that you can test every td element in a table to determine whether or not it has a value. If something is missing from the XBRL, there is no way to know that it is missing. We discovered that when attempting to build the Effective Tax Rate table from the XBRL: the table would not add up, and when we inspected it, it turned out there was data that was simply missing. The tags are nice – and they can be used to speed up normalization. To do that, you take the row label from the table and go find the associated tag.
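To make the td-testing point concrete, here is a minimal sketch of checking every data cell in a parsed table for a value – something you simply cannot do from the XBRL side, where an absent fact leaves no trace. The HTML fragment here is a made-up example, not an actual filing:

```python
from lxml import html

# Hypothetical income-statement fragment with one missing cell.
table_html = """
<table>
  <tr><td>Net sales</td><td>383,285</td><td>394,328</td></tr>
  <tr><td>Cost of sales</td><td>214,137</td><td></td></tr>
</table>
"""

tree = html.fromstring(table_html)
empty_cells = []
for tr in tree.xpath(".//tr"):
    cells = tr.xpath("./td")
    label = cells[0].text_content().strip() if cells else ""
    # Test every value cell (everything after the row label) for content.
    for col_idx, cell in enumerate(cells[1:], start=1):
        if not cell.text_content().strip():
            empty_cells.append((label, col_idx))

print(empty_cells)  # [('Cost of sales', 2)] -> a gap we can flag for review
```

With the HTML, the gap in Cost of sales is detectable the moment the table is parsed; with XBRL alone, the fact is simply not there and nothing signals its absence.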

To understand what I mean by that, here is CA’s income statement as filed in 2010.

What exactly does Product development and enhancements mean? If you have parsed the income statement from the 10-K and collected the row labels, you can take a row label back into the associated label file and find the tag (ResearchAndDevelopmentExpenseSoftwareExcludingAcquiredInProcessCost). In this case the availability of the XBRL would have helped clean/normalize this data faster – if you had never encountered that row label before.
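The label-to-tag lookup described above can be sketched as a simple mapping. The CSV layout here is an assumption for illustration – it is not the actual directEDGAR label-file format:

```python
import csv
import io

# Hypothetical label file: once a row label has been matched to its XBRL
# tag, a lookup table normalizes that label automatically the next time
# it appears in any filing.
label_file = io.StringIO(
    "LABEL,TAG\n"
    "Product development and enhancements,"
    "ResearchAndDevelopmentExpenseSoftwareExcludingAcquiredInProcessCost\n"
    "Selling and marketing,SellingAndMarketingExpense\n"
)

# Lower-case the labels so the lookup tolerates capitalization differences.
label_to_tag = {row["LABEL"].lower(): row["TAG"] for row in csv.DictReader(label_file)}

tag = label_to_tag.get("Product development and enhancements".lower())
print(tag)  # ResearchAndDevelopmentExpenseSoftwareExcludingAcquiredInProcessCost
```

The tags do the disambiguation work; the HTML table remains the source of the values themselves.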

The ability to do that is a huge benefit. But if I were running technical operations at a hedge fund, I would not attempt to use XBRL to collect fundamental data. I would integrate the tags, but I would still parse the HTML tables as my authoritative source.

As an aside – keep reading, though this next bit is a little off topic.

I took the code above and ran it against all 10-K filings filed from 1/1/2021 to 12/31/2023. There were 25,263 10-K filings in our archive for that span. Don’t laugh – I only collected data from 1,565 filings. I have two really strong constraints up there: one is that the table must contain the word years, and the other is that it must contain the word research. I looked at Microsoft’s 10-K and discovered that they used the word year. I also confirmed that Meta Platforms and Google used the same word. So I changed the code to require year instead of years. After that change I was able to increase the number of tables pulled to 2,335. Again, don’t laugh – keep reading.
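The change amounts to one word in the XPath expression. Since the string "years" contains "year", the relaxed test matches both Microsoft-style "Year Ended" headers and Apple-style "Years ended" headers – a quick sketch with a made-up table fragment:

```python
from lxml import html

# The translate() call lower-cases the table text, exactly as in the
# parsing code above; only the search term changes from 'years' to 'year'.
lower = "translate(., 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz')"
strict = f"//table[contains({lower}, 'years')]"
relaxed = f"//table[contains({lower}, 'year')]"

# Hypothetical Microsoft-style header: singular "Year Ended".
doc = html.fromstring(
    "<html><body><table><tr><td>Year Ended June 30</td></tr></table></body></html>"
)
print(len(doc.xpath(strict)), len(doc.xpath(relaxed)))  # 0 1
```

The strict pattern misses the table entirely; the relaxed one picks it up, which is where the jump from 1,565 to 2,335 tables came from.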

The constraint that the income statement must contain the word research is a really strong one as well. I did a search over those 10-K filings and identified 7,650 that did not have the word research anywhere in the filing.

If the word research is not in the 10-K, it can’t be in the income statement. And as academic researchers we know that even if that word is in the filing, it does not have to be in the income statement. I need a sample of ten of those filings, and I suspect I can modify the code above to capture another large segment of the filers.

Here is why I don’t think you should be dismissive of my observations and results. The knowledge needed to do what I describe above is not a significant hurdle. And while I only have 2,335 tables, I spent very little time. If I were smart enough to have a trading strategy that I wanted to implement in 2007, I am very confident that I could have had the data from the 10-K (and 10-Q and 8-K) available within milliseconds of the filing being delivered. So I am just not convinced that XBRL moved the dial a huge amount.
