Details matter – another benefit of adding COFILERS

We are frantically trying to wrap up a complete auditor/audit fee database. We need to plug a pretty significant hole with respect to the AUDITOR, TENURE, REPORT DATE and LOCATION for 10-K filings made in 2020. Why only that year? That is another story.

I ran a search for ((DOCTYPE contains(10k*)) AND date(1/1/2020-12/31/2020) pre/5 we have served). In English: I am looking for cases where a date in the range of 1/1/2020 through 12/31/2020 precedes the phrase we have served by no more than five words. Further, I wanted to limit my search to only 10-K or 10-K/A documents. (Note – most auditors report their tenure before the report date; for those searches we reversed the order of the date and we have served.) Below are the results of the search where the date precedes the tenure.
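For anyone trying to replicate this proximity test outside the search tool, here is a rough Python sketch. The month-name date format and the five-word window are my approximation of the pre/5 operator, not the actual query engine:

```python
import re

# Approximate the search: a 2020 date followed by the phrase
# "we have served" within five words.
DATE = (r"(?:January|February|March|April|May|June|July|August|September|"
        r"October|November|December)\s+\d{1,2},\s+2020")
# the date, then at most five intervening words, then the tenure phrase
PATTERN = re.compile(DATE + r"(?:\W+\w+){0,5}?\W+we have served", re.IGNORECASE)

def find_matches(text):
    """Return every date-before-tenure snippet found in the filing text."""
    return [m.group(0) for m in PATTERN.finditer(text)]
```

Reversing the order of the date and the phrase (for the more common tenure-before-date form) is just a matter of swapping the two halves of the pattern.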

I drew a box around the Baltimore Gas & Electric Co. (BG&E) summary to highlight that there were 55 matches in the document. There were that many matches because the filing has nine cofilers (and because what counts as a match is a bit more complicated). Here is a screenshot of the context extraction from that search with a focus on the lines belonging to BG&E. As you can see, PricewaterhouseCoopers served as the auditor for each of the subsidiaries included (the parent company is Exelon Corp). However, the context does not identify the entity to which the block of text refers.

As a researcher you have to decide if you want this data. If you need it, you will actually find 81 total rows in the context extraction results (9 10-K filings with 9 disclosures in each). Unfortunately, the only way to identify the correct disclosure per CIK is to dig back into the filing. It turns out that the disclosure that belongs to BG&E is the first one in the list above, with the phrase since at least 1993.
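If you do keep these filers, the first triage step can be automated: any CIK that shows up with more than one distinct disclosure has to go to manual review. A minimal sketch, assuming the extraction rows reduce to (cik, context) pairs (the real export has more columns):

```python
from collections import defaultdict

# Flag CIKs whose extraction rows carry more than one distinct context.
# Those are the cofiler cases where only the filing itself can tell you
# which disclosure belongs to which entity.
def needs_review(rows):
    by_cik = defaultdict(set)
    for cik, context in rows:
        by_cik[cik].add(context)
    return sorted(cik for cik, contexts in by_cik.items() if len(contexts) > 1)
```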

I imagine that in many cases these particular disclosures are not important because you are likely to eliminate these types of filers from your sample. Again though, if they are important, it is critical to review the duplicates carefully to match the data to the Central Index Key it is associated with.

As a random aside – it is/was interesting to me to observe that most of the cases with this form of the disclosure (Auditor, location, date and then tenure) were found in filings with COFILERS. The more common form of the disclosure (as pictured below) had many fewer total COFILERS in the results.

I ran these searches separately to simplify the parsing of the data (AUDITOR, TENURE, LOCATION CITY, LOCATION STATE, REPORT DATE) from the context. I am lazy – I wanted more certainty about the order the analysis would follow so the parsing code had less to test.
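With the order fixed, the parsing step reduces to essentially one pattern. This is a simplified sketch of what I mean – the field layout (auditor, city, state, report date, then tenure) follows the search above, but the real code tests many more edge cases:

```python
import re

# One context line in the expected order:
# "<Auditor> <City>, <State> <Month Day, Year> We have served ... since <year>"
LINE = re.compile(
    r"(?P<auditor>.+?)\s+"
    r"(?P<city>[A-Z][a-z]+(?:[ -][A-Z][a-z]+)*),\s+"
    r"(?P<state>[A-Z][a-z]+(?: [A-Z][a-z]+)?)\s+"
    r"(?P<date>[A-Z][a-z]+ \d{1,2}, \d{4})\s+"
    r"[Ww]e have served.+?since (?P<tenure>(?:at least )?\d{4})"
)

def parse_context(line):
    """Return the five fields as a dict, or None if the line does not fit."""
    m = LINE.search(line)
    return m.groupdict() if m else None
```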

More AI Automation Does Not Make the World Better

Much earlier in the week I posted about Excel auto-populating a new mapping based on the column headings found in a collection of tables that report the Equity Compensation Plan Information disclosure required of SEC Filers. Today, I am not frustrated but I do want to share my experience – particularly since I know some folks are all into AI.

I am still working on that table and I am reviewing the labels before we start discussing moving this table into production. We are too small to have a human compare the original tables with the mappings, so this step has to be handled with significant care. We have 908 original column headings pulled from about 1,600 original tables. These column headings should generally map into three semantic concepts: the number of shares to be issued, the weighted-average exercise price of any options reflected in that number of shares, and the remaining number of shares available for future issuance. In about 2% of the cases there are four columns of data presented, but the overwhelming majority are consistent with the structure set out in the CFR section which mandates this disclosure.
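A first-cut keyword pass gets most headings into the three concepts before a human ever looks at them. The rules below are my own guesses at a minimal version – anything they cannot classify falls out for review:

```python
# Map a raw column heading into one of the three semantic concepts.
# Rule order matters: the exercise-price and future-issuance cues are
# tested before the bare "ISSUED" cue so those columns are not
# misclassified as TO_BE_ISSUED.
def map_heading(heading: str) -> str:
    h = heading.upper()
    if "EXERCISE PRICE" in h or "WEIGHTED" in h:
        return "AVERAGE_EXERCISE_PRICE"
    if "REMAINING" in h or "FUTURE ISSUANCE" in h or "AVAILABLE" in h:
        return "REMAINING"
    if "TO BE ISSUED" in h or "ISSUED" in h:
        return "TO_BE_ISSUED"
    return "NEEDS_REVIEW"
```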

I wanted to do a test to accomplish two goals. First, I wanted an independent review of the mappings. Since I had completed 908 rows, I thought asking Manish to review them would be worse than drudgery. I also wanted to evaluate whether or not I could have saved the approximately four hours the original mapping took. In other words, could I trust this tool for this task? I spent about an hour outlining the problem to a very popular AI tool and submitted the list of column headings. Below is a screenshot of the original labels as parsed from the disclosures.

I then merged the output from that process with my original labels to identify any differences. Here is a screenshot of the cases where the tool's mapping did not match my original mapping.
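The comparison itself is trivial once both sets of labels are keyed by heading. A sketch, assuming each mapping is a heading-to-concept dictionary:

```python
# Keep only the headings where the two mappings disagree, pairing my
# concept with the tool's concept for side-by-side review.
def diff_mappings(mine: dict, tool: dict) -> dict:
    return {
        heading: (mine[heading], tool.get(heading))
        for heading in mine
        if tool.get(heading) != mine[heading]
    }
```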

This is not horrible but it is concerning. My instructions described three main concepts – TO_BE_ISSUED, AVERAGE_EXERCISE_PRICE and REMAINING. I did not recall any heading that seemed to map into a concept of OUTSTANDING, and I do not think any of the above are real examples of cases where, in this context, OUTSTANDING is a better choice than TO_BE_ISSUED. The first two do have the word OUTSTANDING, but I would argue with an editor about the word choice if I saw OUTSTANDING used in this context.

I asked the AI tool why, and its response really seemed to be more of a justification than a valid answer. Here is an excerpt:

Step B — But the phrase also contains both a past and future state

“issued or to be issued”

This is a compound construction:

“issued” → past

“to be issued” → future

In equity-comp language, humans sometimes use:

“issued and outstanding” interchangeably (sloppily)

“issued or issuable” to refer to currently outstanding awards

The diagnostic logic (incorrectly, in hindsight) treated:

“issued OR to be issued”
as a state aggregate rather than a flow category

That aggregate was internally labeled OUTSTANDING — meaning:

“currently existing instruments, whether already issued or issuable”

Maybe an English major would draw the same conclusion – I’ll have to ask one.

Unfortunately, I did not accomplish what I set out to accomplish – an evaluation of my original labels. So I took a different tack and provided my mappings and asked my trusty intern (the AI tool) to evaluate them. According to my intern, my mappings were perfect. I doubted that; any reader of this blog knows I am an expert error maker. So I went through the mappings line by line. Hugely tedious, but I found two errors. One case where I mapped “OF SECURITIES REMAINING AVAILABLE FOR FUTURE ISSUANCE UNDER EQUITY COMPENSATION PLANS (EXCLUDING SECURITIES REFLECTED IN COLUMN )” to “TO_BE_ISSUED”, and one where I mapped a label that communicated information about the exercise price into REMAINING.

This is interesting to me because I am trying to figure out the value of somehow integrating AI into our client-facing tools. I understand there is no joy in doing the original mapping and then checking it. Wouldn’t it be hugely amazing to offload that work completely? I just don’t see it yet. I think we perhaps need to streamline (pre-map) the original labels with better cleaning, so maybe my 908 rows could be reduced to 300 or so. But if I have to stand by the results (in my case it is the quality of the data we deliver; in yours it might be a really interesting and novel research finding that you want to publish), I am not ready to have any of the existing tools take over data cleaning. They are a huge help in other ways, but when it comes down to verifying the accuracy of data, I think we have some ways to go.
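The pre-mapping I have in mind is mostly mechanical cleaning, so that cosmetic variants of the same heading collapse into one canonical string before anyone (human or tool) sees them. A sketch of the kind of cleaning I mean:

```python
import re

# Collapse cosmetic variation: case, punctuation, footnote markers like
# (a) or (1), and runs of whitespace. Many of the 908 raw headings should
# fold together under a pass like this.
def canonicalize(heading: str) -> str:
    h = heading.upper()
    h = re.sub(r"\([A-Z0-9]\)", "", h)     # drop single-character footnote markers
    h = re.sub(r"[^A-Z0-9 ]", " ", h)      # strip remaining punctuation
    return re.sub(r"\s+", " ", h).strip()  # collapse whitespace
```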

I will say there is probably some significant value in letting the tools do the heavy lifting. If in fact you can take the time to carefully explain the nature and features of the disclosures and you dump a significant list, then you are likely to save some initial time. But I would also think about how to audit the results.

Survey in Next Update Email

We have been working to add some new data to the platform – specifically the breakdown of the disclosure of the Equity Compensation Plan Information (EQCPI). I have been evaluating the tests we need to identify the cases that need to be inspected. As good as our parser is, we still get some bad parses, and we need to be able to systematically identify those and shunt them to a review queue for a person to look at. For context, with Executive or Director Compensation, if the sum of the reported compensation differs from the reported total by more than $10,000, a human looks to see if the issue can be addressed (maybe a value is reported as 123.456 and on closer examination it appears that it should have been reported as 123,456).
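The EQCPI version of that test is the same idea. A sketch, assuming each parsed row gives us the component values and the reported total:

```python
# Shunt a row to the review queue when the components and the reported
# total disagree by more than the $10,000 tolerance, the same threshold
# described above for Executive and Director Compensation.
REVIEW_THRESHOLD = 10_000

def flag_total(components, reported_total):
    return abs(sum(components) - reported_total) > REVIEW_THRESHOLD
```

The 123.456-for-123,456 case shows up here naturally: the mis-parsed value makes the components fall short of the total by far more than the tolerance.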

What we are seeing in the EQCPI disclosures are errors whose source we cannot initially be certain of. Here is what I saw in a summary file after I ran the code we are working on to identify potential errors.

This was reported in a proxy filing made by OXFORD INDUSTRIES (CIK: 75288). If I were collecting this data for research I would be concerned about the reliability of these values since there is no total for TO_BE_ISSUED. However, this is what was reported (other than the normalization choices we made) as you can see in the next image.

So while this merits review, the captured data accurately represents the disclosure that was made. For internal purposes though, I need to be more confident that the data was correctly captured. I have decided that we should stop ignoring dashes and dash-like characters and instead replace them with something consistent. Currently, we replace dashes with nothing. I am leaning toward replacing them with something clever like wasDASH. I am also thinking about replacing blanks (the value for the total under Number of Securities . . .) with wasBLANK.

The advantage of doing so is that we can significantly reduce the number of cases we have to review. If we only do this internally, it is no big deal. However, I was thinking about whether or not this would benefit you when you are normalizing tabular data. Here is a screenshot of what I am considering.

In some ways this might seem excessive, but I don’t know what I don’t know. My point is that to get it right we have to consider that something happened between extraction and dehydration. My lovely wife will tell you quickly that I am not perfect; the point I am trying to make is that some of these html source files can be in a shape and structure we have never seen before, and so the error could be that one of our assumptions embedded in the code is wrong for this case.

The next time I send out an update I am going to survey you as to whether or not we should make this part of the normalization so that you see these values when the tables are rehydrated.