New Metadata (Fields) is Starting to Appear

I stated in my last client update that we would start adding some new fields to our filings in about two weeks. Unfortunately, and as is normal, we ran into unexpected problems that needed to be addressed before we could start the process. I think all of the problems have been addressed and so this process has started. The screenshot below contains a small section from a search of the 2025 Proxy filings with the new metadata field names and content highlighted in red.

ACCEPTANCE is the EDGAR system acceptance time. COFILERS is a pipe-delimited list of all CIKs associated with the filing. FILINGDATE is the SEC-determined date for the filing based on EDGAR system rules. This will generally match the RDATE; when the two differ, the RDATE represents the date this version of the filing was disseminated through EDGAR. SECPATH is the path to the document that was returned in the search. This is the path to the actual document from the filing, not the landing page (accession.index.htm) for the filing. The exception is those cases where the filing exists on EDGAR as an ACCESSION.txt file without separation of each of the constituent documents.
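Since COFILERS is pipe-delimited, breaking it into individual CIKs takes very little code. Here is a minimal sketch, assuming the value reaches you as a plain string. The function name split_cofilers and the sample CIK values are made up for illustration and are not part of our tools.

```python
def split_cofilers(cofilers):
    """Split a pipe-delimited COFILERS string into a list of CIK strings."""
    # Skip empty pieces so an empty field or a trailing pipe does not yield ''.
    return [cik.strip() for cik in cofilers.split('|') if cik.strip()]


# Made-up sample value, for illustration only.
sample = '1234567|7654321'
print(split_cofilers(sample))
```

The CIKs are kept as strings here so that any leading zeros in the field survive; convert to int only if your other data source stores them that way.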

Most of this, except for COFILERS, was (and still is) available by using one or another of the db. I wanted this particular information to just be immediately available.

The new metadata serves two purposes. The first is to give you immediate access to data that might be relevant for your research. For example, if you are going to run an event study using tick data, you might want the value for ACCEPTANCE. If you are trying to match this data with data from other sources that provide the accession number, that value is a component of the SECPATH. I think this is one of the most common requests we get from clients: help matching a document to an accession number so the results can be more reliably matched to something else. The second purpose is to improve filtering (delivering only the results that are relevant to your search needs).
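To illustrate pulling the accession number out of a SECPATH value, here is a rough sketch. It assumes the path contains the accession number either in its dashed form (10 digits, 2 digits, 6 digits) or as the 18-digit undashed folder name EDGAR uses. The function name accession_from_secpath and the sample path are hypothetical, so check the pattern against the actual SECPATH values in your results before relying on it.

```python
import re

# 10 digits, 2 digits, 6 digits, with or without dashes, not embedded in a longer number.
ACCESSION_RE = re.compile(r'(?<!\d)(\d{10})-?(\d{2})-?(\d{6})(?!\d)')


def accession_from_secpath(secpath):
    """Return the accession number in dashed form, or None if no match is found."""
    match = ACCESSION_RE.search(secpath)
    if match is None:
        return None
    return '-'.join(match.groups())


# Made-up path, for illustration only.
sample_path = 'https://www.sec.gov/Archives/edgar/data/1234567/000123456725000012/example.htm'
print(accession_from_secpath(sample_path))  # 0001234567-25-000012
```

Returning the dashed form in every case gives you one consistent key to merge on, whichever form the path happens to carry.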

As part of this process I wanted to improve your ability to write code against the individual documents, so we restructured the layout of the documents. If you are writing Python code against the archive, all of the content should be a child of the body element when the document is an HTML file. Thus a very quick and direct way to access the body element to do more substantive work is:

# presuming you are just walking the 10-K Y2025 Archive
import glob
import os
from lxml import html
for cik_dir in glob.glob(r'S:\directEDGAR_ACCOUNTING\10KMASTER\Y2025-Y2025\FILINGS\*'):
    for rdate_dir in glob.glob(cik_dir + os.sep + 'R*'):
        for htm_f in glob.glob(rdate_dir + os.sep + '*.htm'):
            with open(htm_f, 'r', encoding='utf-8') as htm_fh:
                htm_text = htm_fh.read()
            # parse the document; the root of the tree is the html element
            htm_tree = html.fromstring(htm_text)
            body = htm_tree.find('body')
            # now you can operate on the body and directly access the structural content, text and other features
            # let's pretend you want the meta elements
            my_dict = {m.attrib['name']: m.attrib['content']
                       for m in htm_tree.findall('.//head/meta')
                       if 'name' in m.attrib and 'content' in m.attrib}
            # we often use the cik-rdate stuff - this is our deid
            cik = cik_dir.split(os.sep)[-1]
            rdate = rdate_dir.split(os.sep)[-1]
            deid = cik + '-' + rdate
            my_dict['deid'] = deid

As I am writing this, the 2025 Proxy and 10-K archives have been updated. I hope that before Monday (7/21) the 2025 10-Q and 8-K archives will be completed, and then we will just be moving back in time. How long it will take is anyone’s guess. We will have to download the filings from 2022 to 1994. Unfortunately, this will just take time. I know there are various cheats to get around the SEC rate limits, but I have never felt they were worth the effort. We set a half-second pause after every header/accession.txt file we download. This keeps us well within the limit of 10 requests per second that the SEC has imposed. It makes things slower, but EDGAR is an amazing resource and I like having access to it.
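If you want to pace your own downloads in a similar way, here is one possible sketch. It is not our production code. It is a small variation on a fixed pause: it enforces a minimum gap between requests, so time already spent on the download counts toward the half second. The names Throttle and download_all, and the fetch callable you would supply, are all hypothetical.

```python
import time


class Throttle:
    """Enforce a minimum gap (in seconds) between successive calls to wait()."""

    def __init__(self, min_interval=0.5, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the class can be tested without real delays
        self._sleep = sleep
        self._last = None      # time of the previous call, None before the first one

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def download_all(names, fetch, throttle):
    """Call the caller-supplied fetch(name) for each name, pausing between calls."""
    results = {}
    for name in names:
        throttle.wait()
        results[name] = fetch(name)
    return results
```

With min_interval=0.5 this never issues more than two requests per second, which stays far below the SEC limit.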