Accessing Metadata from Documents Extracted Through our Platform

Fascinating question. A PhD student is working with some filings they extracted from our platform. They ran some complex searches to identify the documents and saved them to their computer. However, they did not generate a summary of the search results, so they lost immediate access to the metadata associated with the documents. After some extensive work they realized that they would like to use the metadata in their analysis. So the question was: can they get the metadata without rerunning the search?

The answer is yes, particularly since they are working in Python. Below is some code to create a list of dictionaries of the metadata embedded in the htm files in the source folder. We use the lxml library. I think many of you might be using BeautifulSoup; if so, I think only a small modification is needed. The key, though, is that we add the metadata as meta elements with name and content attributes, so we can pull those elements and read their attributes rather cleanly.

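import glob
import os
from lxml import html

# Placeholder: replace with the folder that holds your saved .htm files
source_dir = 'path/to/saved/filings'

meta_list = []
for htm_document in glob.glob(os.path.join(source_dir, '*.htm')):
    # Read raw bytes and let lxml sort out the encoding
    with open(htm_document, 'rb') as fh:
        b_string = fh.read()
    meta_dict = dict()
    tree = html.fromstring(b_string)
    # Each piece of metadata is a meta element with name and content attributes
    meta_e = tree.xpath('.//meta')
    if len(meta_e) == 0:
        print(f'no meta {htm_document}')
    for m in meta_e:
        attrib = m.attrib
        # Skip meta tags (e.g. charset declarations) that lack name or content
        if 'name' in attrib and 'content' in attrib:
            meta_dict[attrib['name']] = attrib['content']
    meta_dict['source_path'] = htm_document
    meta_list.append(meta_dict)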

I ran the above code on an 8-K filed by Apple. Here is the result:

{'DOCTYPE': '8K', 'SECPATH': 'https://www.sec.gov/Archives/edgar/data/320193/000119312521001982/d29637d8k.htm', 'ACCEPTANCE': '20210105163216', 'SICCODE': '3571', 'CNAME': 'Apple Inc.', 'FYEND': '0925', 'ITEM_5.02': 'YES'}
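If the goal is to carry the metadata into an analysis, one option is to flatten meta_list into a CSV file. The sketch below is only an illustration: it assumes ACCEPTANCE is a YYYYMMDDHHMMSS timestamp (which is what the Apple value looks like), and the function names and the ACCEPTANCE_ISO column are made up for this example. Because keys such as ITEM_5.02 only show up for some filings, the header is built from the union of keys across all of the dictionaries.

```python
import csv
import io
from datetime import datetime


def parse_acceptance(value):
    """Parse an ACCEPTANCE value, assumed here to be YYYYMMDDHHMMSS."""
    return datetime.strptime(value, '%Y%m%d%H%M%S')


def write_meta_csv(meta_list, out_file):
    """Write a list of metadata dictionaries to an open text file as CSV."""
    # Collect every key in first-seen order so no column is dropped
    fieldnames = []
    for meta_dict in meta_list:
        for key in meta_dict:
            if key not in fieldnames:
                fieldnames.append(key)
    fieldnames.append('ACCEPTANCE_ISO')  # hypothetical extra column
    writer = csv.DictWriter(out_file, fieldnames=fieldnames, restval='')
    writer.writeheader()
    for meta_dict in meta_list:
        row = dict(meta_dict)
        if 'ACCEPTANCE' in row:
            row['ACCEPTANCE_ISO'] = parse_acceptance(row['ACCEPTANCE']).isoformat()
        writer.writerow(row)


# The Apple 8-K result shown above, used as sample input
sample = [{
    'DOCTYPE': '8K',
    'SECPATH': 'https://www.sec.gov/Archives/edgar/data/320193/000119312521001982/d29637d8k.htm',
    'ACCEPTANCE': '20210105163216',
    'SICCODE': '3571',
    'CNAME': 'Apple Inc.',
    'FYEND': '0925',
    'ITEM_5.02': 'YES',
}]
buffer = io.StringIO()
write_meta_csv(sample, buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with open(path, 'w', newline='') in place of the StringIO buffer.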

Reminder – Searches with WORDS that are Operators Require ~ Appended to the Operator

A client was trying to sort out how to run a specific search. They wanted to use a phrase containing the word and. They were getting anomalous results and so dropped us a question. Anytime you need to search for a word that is also a primary operator, you need to append a tilde (~) to that word.

I am reluctant to share their specific search. I was looking at some audit proposals for the ratification of the Independent Registered Accounting firm and so that is where this example comes from. Suppose you want to search for cases where the proxy reports that the auditor is not expected to be present at the annual meeting. NOT is a search operator and it is a strong one: it will eliminate documents/results with the word/phrase that follows. (It has more complicated uses but let me avoid a rabbit hole here.)

To identify those documents where there is an explicit indication that the auditor is not expected to be present at the meeting I ran the following search:

((not~ expected) w/10 present) w/50 (audit* or account*)

If we run the search without the tilde, the result would be those cases where the word expected was not within 10 words of the word present and expected was within 50 words of words rooted on audit or account. Are you confused? Sometimes I am too. Search is an art.
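If you build search strings in code, a small helper can make sure operator words inside a phrase get the tilde. This is only a sketch: the function name is hypothetical, and the set of operator words below (and, or, not) is an assumption for illustration, not the platform's full operator list.

```python
import re

# Assumed operator words, for illustration only
OPERATOR_WORDS = {'and', 'or', 'not'}


def escape_operator_words(phrase):
    """Append ~ to any word in the phrase that doubles as a search operator."""
    def add_tilde(match):
        word = match.group(0)
        return word + '~' if word.lower() in OPERATOR_WORDS else word

    # Match whole alphabetic words that are not already followed by a tilde
    return re.sub(r'\b[A-Za-z]+\b(?!~)', add_tilde, phrase)


# Build the inner phrase of the search shown above
inner = escape_operator_words('not expected')
search = f'(({inner}) w/10 present) w/50 (audit* or account*)'
print(search)
```

Only the literal phrase is passed through the helper. The w/10, w/50 and or in the rest of the string are meant to act as operators, so they are left alone.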

I will admit, I find these results interesting – there are not that many though and it seems that a good number of the cases are those where the auditor from the prior year is not continuing – but not all. The image above came from Forward Air’s 2021 DEF 14A.

Of course, I immediately wondered if those cases were an indication of an intent to dismiss the auditing firm in the near future. Unfortunately, when I expanded my search to cover all years I found cases where the auditor is routinely not expected to be present. For example, CECO ENVIRONMENTAL, CULLEN FROST BANKERS and DOVER CORP. Shucks, I thought that would be an interesting research paper.

2020 Insider Trading Data Updating With ACCEPTANCE-DATETIME Field

The system is currently running to update all of the 2020 SECTION-16-SUMMARY data with the ACCEPTANCE-DATETIME field. The process started about 1:00 PM on 1/8/2022 – I estimate that it will be complete by 4:00 PM (ish).

I did make a dreadful mistake during this update. I pulled the 2020 data offline while I was preparing the code. I received an email and, while I was able to address the requirements for that user, I realized it was not necessary to pull everything offline. I will not make that mistake again. We are now working on the prior years. Thank you for your patience. This should have been addressed when we first handled these files.

When I posted that the 2021 data was available I noted that my sense was that there were more insider transactions in 2021 than in 2020. This was confirmed: in 2021 there were 875,357 total processed rows pulled from 224,474 unique filings. In 2020 there were only 717,239 rows pulled from 203,566 filings. I think we expected this because we saw many more new directors in 2021 than we have seen in some while.
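For anyone who wants the comparison as percentages, the arithmetic is short. Using only the four counts above, rows grew by roughly 22% and filings by roughly 10%, so the average number of rows per filing also rose (about 3.9 versus about 3.5). The variable names are just for this sketch.

```python
# Counts reported above
rows_2021, filings_2021 = 875_357, 224_474
rows_2020, filings_2020 = 717_239, 203_566

# Year-over-year growth in processed rows and in unique filings
row_growth = rows_2021 / rows_2020 - 1
filing_growth = filings_2021 / filings_2020 - 1

# Average number of rows contributed by each filing
rows_per_filing_2021 = rows_2021 / filings_2021
rows_per_filing_2020 = rows_2020 / filings_2020

print(f'row growth: {row_growth:.1%}, filing growth: {filing_growth:.1%}')
print(f'rows per filing: {rows_per_filing_2021:.2f} vs {rows_per_filing_2020:.2f}')
```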

We are still heavily involved in some of the transition stuff and so it may take a while, however, we will generate the DIRECTOR-RELATIONSHIP artifact for 2021 soon. That will be interesting as my sense during the year was that more of the newest directors are female. Once that artifact is created it will be easier to confirm that observation.

I would like to observe that you can map the trading data to the Director/Executive compensation data by the PERSON-CIK value. In those tables we have GENDER. We have AGE and TENURE fields in the Director data.
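A minimal sketch of that mapping, assuming both files have been read into lists of dictionaries (for example with csv.DictReader) and that the key column is spelled PERSON-CIK in each. The function name and the sample rows are made up for illustration, and the way GENDER is coded here is an assumption.

```python
def attach_person_fields(trades, people, fields=('GENDER', 'AGE', 'TENURE')):
    """Copy selected person-level fields onto each trade row via PERSON-CIK."""
    # Index the compensation rows by PERSON-CIK for constant-time lookups
    lookup = {person['PERSON-CIK']: person for person in people}
    merged = []
    for trade in trades:
        person = lookup.get(trade['PERSON-CIK'], {})
        # Fields missing for a person (or an unmatched person) come back blank
        extra = {field: person.get(field, '') for field in fields}
        merged.append({**trade, **extra})
    return merged


# Made-up placeholder rows, not real data
trades = [
    {'PERSON-CIK': 'P1', 'datatype': 'nonderivativetransaction'},
    {'PERSON-CIK': 'P2', 'datatype': 'derivativetransaction'},
]
directors = [
    {'PERSON-CIK': 'P1', 'GENDER': 'F', 'AGE': '54', 'TENURE': '3'},
]
for row in attach_person_fields(trades, directors):
    print(row)
```

If the same person appears in the compensation data for several years, the lookup above keeps only the last row it sees, so in that case key on PERSON-CIK plus the year instead.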

Context Normalization – Spelling Counts

I had an interesting email from a hard at work PhD student who was using the ContextNormalization feature of our platform to normalize some data. Because they are collecting a piece of data I have not seen used in research before I am going to describe their problem using AUDITOR TENURE data collection. The nature of their problem manifests itself in the same way in almost every Context Normalization case.

As a result of a PCAOB rule change auditors are supposed to disclose their tenure with the client. The most common expression of that tends to be We have served as the Company’s auditor since YYYY. Below is an image from running a search for auditor* since on 2021 10-K filings.

I used auditor* in the search because I also want to catch the expression auditors since.

I set a really tight span for the context since this is one of those binary cases – it will be concisely expressed or it is not likely to be expressed. Remember – to set the span for Context – use Options/Context and specify the span you need.

Once we’ve done that we are ready to set the parameters for the ContextNormalization. Notice I did not include auditor in the Extraction Pattern. This assures me that the processor will not discard those cases where the phrase is auditors since. Since the processor is working on the active search results I have no concern about phrases like we have been making amazing products since XXXX. Our search was for auditor* since.

This is one of the ‘spelling matters’ issues – if I specify auditor since the engine will not normalize auditors since. The word auditor was critical to get the right context but using it in the Extraction Pattern will reduce the yield since there will not be an exact match to auditor since when they have used auditors since.

The second spelling issue occurs because of formatting errors or typos. When I sorted the results by the value of tenure – you can see I had some results that didn’t make sense.

Someone accidentally inserted an extra space as they were typing in the year values or the underlying html has a tag separating parts of the number.

Then of course we have these cases:

In the cases shown above the search correctly identified the context but there are words intervening between the word since and the value we want for year.

We collected the year value from 6,745 documents based strictly on the existence of a valid number following the word since. There were 115 documents with language “since at least YYYY” or “since fiscal YYYY” and other permutations and there were a total of 4 typos.

I keep trying to play around with a Python library called FuzzyWuzzy to improve this yield. While we can make significant improvements for specific use cases, the problem is that I can’t seem to anticipate all use cases in a way that makes me comfortable implementing the library inside one of our functions. However, if you do a ContextExtraction and have some time on your hands I would encourage you to poke at the normalization with that library.
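For this particular use case, the intervening-word and stray-space cases described above can also be picked up in post-processing with a regular expression. The sketch below is not how the platform's ContextNormalization works and it is not fuzzy matching. It is a plain regex that assumes at most three ordinary words sit between since and the year, and at most one stray whitespace character between digits. The names are made up for this example.

```python
import re

YEAR_AFTER_SINCE = re.compile(
    r'\bsince\b'                     # anchor word from the search
    r'(?:\s+[A-Za-z]+){0,3}?'        # up to three intervening words ("at least", "fiscal")
    r'\s+([12]\s?[09]\s?\d\s?\d)\b', # 19xx/20xx year, tolerating single stray spaces
    re.IGNORECASE,
)


def extract_since_year(context):
    """Return the year following 'since' as an int, or None if none is found."""
    match = YEAR_AFTER_SINCE.search(context)
    if match is None:
        return None
    # Remove any stray whitespace inside the digits before converting
    return int(re.sub(r'\s', '', match.group(1)))


examples = [
    'We have served as the Company’s auditor since 2007.',
    'We have served as the Company’s auditor since at least 1998.',
    'has served as our auditors since fiscal 2015',
    'We have served as the Company’s auditor since 20 07.',
]
for text in examples:
    print(extract_since_year(text), '<-', text)
```

A year split by an html tag would still need the tag stripped first, so treat anything this returns as a candidate to spot-check rather than a final value.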

Detailed 2021 Insider Transaction Data Now Available

We completed processing of the 2021 Form 3/4/5 and their amendments today and started uploading the parsed and normalized data to the PREPROCESSED_EXTRACTION platform at about 10:00 AM (CT) (1/2/2022). It should be fully available by 2:00 PM.

As a reminder, the data is available through a request file that contains the fields CIK, YEAR and PF. The CIK is the CIK of the ISSUER, YEAR is the transaction year and the PF value needs to be FY. Because of the density of this data we have a hard limit of 2,500 CIK-YEAR pairs per request file. One of the challenges is that when we are building the CSV file to return to you we don’t know all of the column headings until the request is processed. Believe it or not, there are some forms that contain more than 10 footnotes for a number of fields. We have captured all of the footnotes and labeled them based on the cell/data value they relate to. In the example below there are 18 additional columns that report the content of those 18 footnotes. (Link to Original Filing)
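If your sample is larger than the limit, the request has to be split across several files. Here is a minimal sketch of one way to do that. It assumes the request file is a CSV with a header row of CIK, YEAR, PF, which you should confirm against an existing request file, and the function names are made up for this example.

```python
import csv
import io

MAX_PAIRS = 2500  # hard limit on CIK-YEAR pairs per request file


def build_request_chunks(ciks, years, max_pairs=MAX_PAIRS):
    """Pair every CIK with every year and split the pairs into chunks."""
    pairs = [(cik, year) for cik in ciks for year in years]
    return [pairs[i:i + max_pairs] for i in range(0, len(pairs), max_pairs)]


def write_request_file(pairs, out_file):
    """Write one request file; PF is always FY for this data."""
    writer = csv.writer(out_file)
    writer.writerow(['CIK', 'YEAR', 'PF'])
    for cik, year in pairs:
        writer.writerow([cik, year, 'FY'])


# Example: Apple's issuer CIK for two transaction years
chunks = build_request_chunks([320193], [2020, 2021])
buffer = io.StringIO()
write_request_file(chunks[0], buffer)
print(buffer.getvalue())
```

With a real sample you would loop over chunks and write each one to its own file opened with newline=''.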

The return data has every transaction reported in all filings made during the calendar year. We include the ACCESSION-NUMBER of the filing to help you identify the source filing. Each separate transaction is assigned one line. We include a field called datatype to indicate whether the row is describing either a derivativetransaction, nonderivativetransaction or a remark. REMARKS are assigned a row because in general a remark relates to the filing rather than any one specific transaction that is reported in the filing. This is clearly not the case in all instances but we cannot discern those remarks that relate to one specific transaction or those that are general to the filing.
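Since remarks sit in their own rows, most analyses will want to separate them from the transaction rows first. A small sketch, assuming the columns are spelled ACCESSION-NUMBER and datatype and that the datatype values are the lowercase strings below. The sample rows are made-up placeholders, not real filings.

```python
from collections import Counter

# datatype values as described above; the exact strings in the CSV are assumed
TRANSACTION_TYPES = {'derivativetransaction', 'nonderivativetransaction'}


def summarize_rows(rows):
    """Return (row counts by datatype, transaction counts per filing)."""
    by_type = Counter(row['datatype'] for row in rows)
    # Remarks are excluded so the per-filing count reflects transactions only
    per_filing = Counter(
        row['ACCESSION-NUMBER']
        for row in rows
        if row['datatype'] in TRANSACTION_TYPES
    )
    return by_type, per_filing


# Made-up placeholder rows; the accession numbers are not real filings
sample_rows = [
    {'ACCESSION-NUMBER': 'FILING-A', 'datatype': 'nonderivativetransaction'},
    {'ACCESSION-NUMBER': 'FILING-A', 'datatype': 'nonderivativetransaction'},
    {'ACCESSION-NUMBER': 'FILING-A', 'datatype': 'remark'},
    {'ACCESSION-NUMBER': 'FILING-B', 'datatype': 'derivativetransaction'},
]
by_type, per_filing = summarize_rows(sample_rows)
print(by_type)
print(per_filing)
```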

I wanted to run some stats to compare this year’s activity to last but time has interfered. My gut tells me that this was a busier year for insiders than recent years – but don’t take that to the bank. 6,620 issuers reported trading data with 224,474 filings. We have a total of 875,357 transactions/remarks.

If you want to review the data I have included two attachments. One is a list of all of the column headings (COLUMN_HEADINGS) and the second is a CSV file with all of APPLE’s transactions (320193_SUMMARY).

I go back and forth on whether or not we should index (make them full text searchable) these filings. It is not that much of a challenge – but I am just not understanding the utility. Comments are always appreciated. I will observe that my guess is that you might need/want to identify those filers with 10B-5 transactions. I suspect though that once you did that the next step is to have simple access to the details of the transactions – so the search step is just to identify the filer. To that end I have run code to identify all issuers where there are one or more mentions of 10B-5 in the summary for 2021. You can access that list from here (10B5).

I will be pulling the 2020 insider data offline (and then each prior year one at a time) so we can add ACCEPTANCE-DATETIME to each transaction. This was not even a thought when we first worked with this data. It was my mistake and the only way to fix it is to reprocess each of the years. My apologies.

Updated 10 hours after initial posting. I had a question about footnote mapping. If there is only one footnote attached to a data value in a cell then the footnote is placed in a column with the name datavalue.footnote. There is no number assigned. If a cell has two or more footnotes, then each subsequent footnote in that cell is assigned a number beginning with 1. So the second footnote is assigned datavalue.footnote_1. We do not assign the numeric value the filer has assigned because it would explode the number of columns, since we need to key the footnote to the data value. For example, look at this Form 4 (Elon Musk Form 4). There are 29 non-derivative security transactions that each have a unique footnote attached to the data value for the transaction price. However, since there is only one footnote for each data value, each of those footnotes is in the same column of the results. There are two footnotes in the cell for exercise date value for the one non-derivative transaction; those footnotes are in columns with the labels exerciseprice.footnote and then exerciseprice.footnote_1.

If we used the number key of the footnote as part of the footnote name then we would have significantly more columns and in my opinion the meaning would be less clear.
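Given that naming rule, the footnote columns for any one data value can be collected from the CSV header in order. The sketch below assumes only the datavalue.footnote and datavalue.footnote_N pattern described above. The function name is made up, and otherfield is a placeholder column name.

```python
import re


def footnote_columns(headers, data_value):
    """Return the footnote columns for one data value, first footnote first."""
    pattern = re.compile(re.escape(data_value) + r'\.footnote(?:_(\d+))?$')
    found = []
    for header in headers:
        match = pattern.match(header)
        if match:
            # The unnumbered column is the first footnote; _1 is the second, and so on
            order = int(match.group(1)) if match.group(1) else 0
            found.append((order, header))
    return [header for _, header in sorted(found)]


headers = [
    'exerciseprice',
    'exerciseprice.footnote_1',
    'exerciseprice.footnote',
    'otherfield.footnote',  # placeholder column name
]
print(footnote_columns(headers, 'exerciseprice'))
```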

directEDGAR

Search, Extraction & Normalization Engine

Month: January 2022
