New Features for Dehydrator & Severance was Inspirational!

A PhD student reached out a couple of weeks ago because the dehydrator stopped during execution and she wasn’t sure why, or how to get the data she needed. As always, I am going to obscure the data our users are asking about – so in the example below I am going to use Executive Compensation since that is part of our normal processing.

The root cause of the failure was fascinating – there were html comments as children of td elements in the source tables she was trying to normalize. I had never even considered that possibility, so our parser crashed on her. The standard recovery procedure has been to read the log, find the last file parsed successfully, use that to identify the next file, and remove it, since the crash would have been caused by that file. Since this was parallel to some work I have been focused on for a while, I decided to use her problem as a test bed for some improvements that have been needed.

First, I changed our sanitizer to anticipate comments as children of any element, not just the table element (which is all I had ever seen before). So those will no longer be a problem. But I also decided that we could better handle ‘failed’ cases. Previously, if the error was related to something that we anticipated, the hydrator wrote to the log file that the processing failed for some unspecified error. Below is a screenshot with an example of this prior treatment highlighted.
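As a sketch of the idea (this is not our actual sanitizer), lxml can strip comment nodes anywhere in a parsed tree, not just under the table element:

```python
from lxml import etree, html

# a td with comment children - the case that crashed the parser
snippet = "<table><tr><td><!-- flag -->Salary</td><td>100<!-- note --></td></tr></table>"
tree = html.fromstring(snippet)

# strip_tags removes comment nodes wherever they appear in the tree,
# while preserving the surrounding text
etree.strip_tags(tree, etree.Comment)

cells = [td.text_content().strip() for td in tree.findall('.//td')]
# cells -> ['Salary', '100']
```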

When this happened we just passed over the file and you had to identify these and manually do something to capture the data. Now we parse these files without any attempt to identify the column headings. When we have a failed file we write the rows of data from the html into a csv file cleverly named hydrator-{n}-non-rectangular-summary-data.csv, where n represents the hydration attempt in that particular folder. All we are doing is stacking the content.

We write the content one failed file at a time, and this was deliberate. I was mulling this over and was initially going to write everything at the end with uniform column headings, but we cannot necessarily expect the data in column 3 for CIK 1234567 to be the same for CIK 7653421. So we write a new set of column headings for each individual file that gets diverted to this part of the processor. Here is the new log entry for the table that was flagged above:
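A minimal sketch of the stacking behavior (the data and file name are made up for illustration):

```python
import csv
import io

# hypothetical per-file output: each failed file contributes its own
# heading row followed by its data rows
failed_tables = [
    (['NAME', 'YEAR', 'SALARY'], [['J SMITH', '2024', '500,000']]),
    (['NAME', 'SALARY', 'BONUS'], [['A JONES', '450,000', '50,000']]),
]

buffer = io.StringIO()  # stands in for the non-rectangular-summary-data csv file
writer = csv.writer(buffer)
for headings, rows in failed_tables:
    writer.writerow(headings)   # a fresh set of column headings per failed file
    writer.writerows(rows)

stacked = buffer.getvalue().splitlines()
# stacked alternates heading rows and data rows, one block per failed file
```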

Below is an image of the content from that file after it has been stacked in the non-rectangular-summary-data.csv file. When I took the screenshot I also included the next file to add context to my observation above: it is a reach to try to predict which values belong in which column when considering the entire universe of tables and data types you might be extracting. This is why we stack them in this manner.

For testing, I pulled 10,364 executive compensation snips. There were 186 that failed on the first pass, and the data from those files was accurately deposited in the new csv file. I should be careful here: I did not inspect each of the rows of the 186 source files to compare them to the corresponding rows in the csv file. I checked this one and about five others to confirm the data was accurately deposited. I hope I am not caught later with inaccuracies; if so, I will address them.

I did some additional re-engineering of the table parsers in the dehydrator. When I received the email from the PhD student I had been spending quite a bit of time looking at tables and code. To identify the column headings in a file we start by trying to use deterministic characteristics of the html. There are two that we have historically relied on: the presence of comments that define the column heading rows, and the presence of th elements in those rows. Historically, however, these have been used infrequently. We process thousands of tables each month and we look closely at their construction and structure. We have noticed additional class attributes that have become more common and that are also very deterministic identifiers of column heading boundaries. I added a new parser step that evaluates the existence of these attributes and assigns them a higher priority for helping set those boundaries. With this new parser step approximately 20% of the table column headings are parsed deterministically, as opposed to less than 3% before, when the only deterministic identifiers were comments and th elements.
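To illustrate one of the deterministic signals (rows made entirely of th elements), here is a toy version; our real parser uses several signals, including the class attributes described above:

```python
from lxml import html

def leading_th_rows(table):
    """Count leading rows whose cells are all th elements -
    one deterministic signal of a column-heading boundary."""
    count = 0
    for tr in table.findall('.//tr'):
        tags = [cell.tag for cell in tr]
        if tags and all(tag == 'th' for tag in tags):
            count += 1
        else:
            break
    return count

table = html.fromstring(
    "<table><tr><th>Name</th><th>Salary</th></tr>"
    "<tr><td>J Smith</td><td>500,000</td></tr></table>")
n = leading_th_rows(table)  # -> 1 heading row
```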

If a table does not have deterministic indicators of column headings we then use a series of heuristics. I am not going to share much about those as I consider them our ‘secret sauce’. But I was watching an Apple TV show called Severance one night after wrestling with how to implement more refined heuristics. If you have watched the show, the employees look at numbers for patterns. I ended up deciding to build my own Severance tool to integrate into the dehydrator to test for patterns that would help us deliver better results to you. It is working, and it has made it easier to tweak some of our heuristic separators. I have more on the way. Prior to this update we had four classes of markers; we now have six and I am working on three more. These will slow down dehydration – this new build added four minutes to parsing my sample folder – but the results are great: it reduced the number of column headings in this sample from over 2,000 to 1,600, and I hope in the next build to reduce it even more significantly.

This update should be fully available by Monday 10/27. I will post about the next update when it is ready.

DB Document Results Filter is More Versatile than It Seems

Our directEDGAR platform offers at least ten ways to filter search results. The effect of most of these is obvious (DATE, CIK, CIK/DATE, 8-K FILING REASON, DOCTYPE, SICCODE, CNAME, . . .). I used the DB DOCUMENT RESULTS filter today and realized that I have not done a good enough job of describing at least one important use case.

I had an email from a PhD student who was trying to collect a table from some specific filings. They needed some guidance. I want to obscure their data, so the following example is based on snipping the Executive Compensation (EC) table. I encourage our users to start table snipping with a really narrow and explicit focus: the narrower the focus, the less noise in the output. For example, our process for identifying EC tables starts by requiring the table to have PRINCIPAL, NAME, YEAR, POSITION, SALARY, EQUITY, AWARDS, OTHER and TOTAL. I can’t remember from our testing, but my recollection is that less than 1% of the tables found with this set of requirements are Type I errors. And that is the heart of what I want to explain. It is absolutely true that I will not find an EC table in every DEF 14A with those requirements, but my results will be very clean. Now the problem is to collect tables from those filings that we missed on the first pass – for example, a reasonable proportion of filers do not use NAME, PRINCIPAL POSITION or YEAR in the column headings.

When our table parser finishes one of the outputs is a csv file named MISSING. The file contains a single column labeled FILENAME. This column contains a list of files from our search results where the application did not return/identify a table. Here is a screenshot of the file.

If we rename the FILENAME column to DE_PATH we can then use that file to filter our previous search on just those documents. That is important – presumably our original search set out some important parameters and we want to focus our data collection on just those firms. So I renamed the FILENAME column to DE_PATH and saved the MISSING file. I then selected the DB Document Results Filter by activating the Use DB check box and clicking the Set DB File button.
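The renaming step is a one-liner if you prefer pandas to editing the csv by hand (the file names and path below are made up for illustration):

```python
import pandas as pd

# stand-in for the MISSING file produced by the parser
missing = pd.DataFrame({'FILENAME': [r'\\share\filings\0000320193\def14a.htm']})

# rename the column so the DB Document Results Filter can use it
missing = missing.rename(columns={'FILENAME': 'DE_PATH'})
missing.to_csv('missing_as_db_filter.csv', index=False)  # hypothetical output name
```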

Once I hit the Okay button and then Perform Search, the results are filtered to only those filings/documents listed in the CSV file. Here is the result of that search. Now I can review these presumably relevant filings to determine how to identify the EC table using less restrictive criteria. I brought the focus to a filing where the EC table did not have a label for the YEAR column.

I would have to dig through our system to say this with absolute authority but my recollection is that there are some tables where the only way we can grab them is by using the words SALARY and TOTAL. However, if we were to use only those words to identify tables on the first pass then we would get a huge amount of noise.

We have some obligations to deliver normalized Executive and Director compensation data by 7:00 PM each filing day, so we absolutely cannot waste time having a person trudge through noisy data. While all of this is automated, our process started with what I described above. We are getting ready to add two new tables to our system. The collection of these tables will be automated. But the way we learn to automate it is to begin by manually snipping and reviewing tables as they exist, and identifying the characteristics that will produce the least noise in the output on the first pass. For me to have this coded I have to be able to describe how to filter the tables to settle on the exact criteria to apply. The only way I can do that is to review filings, snip some tables, and then review what was missed to understand why it was missed. I use the DB Document Results Filter repeatedly as I iterate through this process.

If the world were perfect we could use science alone to do this work. Filings are created by humans, and thus they are not perfect, so we have to bring some tenacity to this process. Features like the DB Document Results Filter reduce at least some of the friction associated with being persistent.

Remember the Include Historical CIKs Option

I received an email from a new user at one of our client schools. They were asking about the checkbox control in the CIK filter option. This is an amazing feature. We were the first to offer CIK filtering, the first to offer CIK/date filtering, and I think we are still the only search application that has this Historical CIK feature, though we have had it for almost 8 years. It was hard to implement but I think it is important. Let’s take a look at it. The check box can be seen in the image below.

If you are running a search and are filtering on CIK or CIK/DATE, this option reviews your CIK or CIK/DATE file and looks to see if you have any CIKs that we have identified as a successor or predecessor to another filer. If so, your file is augmented with the successor or predecessor (whichever is missing).

To illustrate the value – suppose you intend to identify 8-K filings made in the 2016-2020 window for your research. The firms in your sample were identified using some other data source and you have a list of their CIKs. For the purposes of this demo I have four (4) CIKs listed, as you can see in the image below.

Uh-oh – I had no results at all.

If I rerun the search and this time check the Include Historical CIKs button the results will be different. Here is a screenshot after submitting the same search but selecting the Include Historical CIKs.

Notice that the application reports that 8 CIK(s) were submitted, but we only had 4 in our list. That is because of the mapping process that occurs when you start the search. The application reads your list of CIKs and checks for the existence of related CIKs, then adds additional rows to the filtering list based on our mapping. In the case of CIK/Date filtering, for each CIK-MATCHDATE pair you have included, if the CIK has one or more related CIKs we augment the file with a new row for each additional CIK-MATCHDATE pair.
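A sketch of the augmentation logic with a made-up related-CIK mapping (the real mapping comes from our CIK_MAPPING database):

```python
# hypothetical related-CIK mapping for illustration only
related = {777676: 1341439}

pairs = [(777676, '20180630'), (320193, '20180630')]  # CIK-MATCHDATE pairs

augmented = list(pairs)
for cik, matchdate in pairs:
    if cik in related:
        # add a new row pairing the related CIK with the same MATCHDATE
        augmented.append((related[cik], matchdate))
# augmented now holds three pairs instead of two
```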

The next question, of course, is how to see the mapping that was made. There are two ways to do this. If you have a large number of CIKs it might be easiest to review the CIK_MAPPING database that is accessible from the Query Databases viewer. Just select the database from the Query List and hit Execute to access the entire list. Scroll through the list to find your CIKs and identify the values for RELATED~

In the next image I pointed arrows to the CIK_MAPPING database and the Execute button. I then boxed around the BASE_CIK for 912752 and 1971213 – you can see that we are asserting that one of these is the successor entity to the other.

The second way to identify the augmented list is to access the filter file that was generated from your selection of CIKs. If you click on the Set CIK File button after you have completed a search, the Select Existing CIK File box will be populated with the name of the file that was actually used in the search. Even if you used your own file, the application reads it and stores a copy (augmented or not) in the D:\PhotonUser\AppData\Local\Temp\somenumber directory. Note this is a session directory, not a permanent directory.

While it is a bit painful to navigate to that directory, you can, and you can open the file to examine it. One problem, though, is that the file does not map the relationship, as you can see below. Further, you cannot predict the relationship based on order.

One of the sanity checks we perform when we make these links is to compare the last balance sheet in the last 10-K for the predecessor to the same balance sheet in the first 10-K of the successor. In this case the issues are even more complicated because CIK 912752 has a continuing filing obligation because of some debt securities. Nonetheless, based on my review of all of the facts available, I believe the time series of data related to CIK 1971213 (Sinclair Inc) extends to the filings made by the entity that has the current name Sinclair Broadcast Group LLC with the CIK 912752.

Names are not always predictors of the predecessor-successor relationships. For example, CIK 1999001 Six Flags Entertainment Corporation/NEW is mapped as the successor to CIK 811532 (CEDAR FAIR L P) despite the fact that the name of the filer with the CIK 701374 is Six Flags Entertainment Corp/OLD.

The last 10-K filed by CIK 701374 for the year ended 12/31/2023 reported total assets of 2,711,474,000. The first 10-K filed by CIK 1999001 for the year ended 12/31/2024 listed comparative values for 12/31/2023 with total assets of 2,240,533,000. While these don’t match, CEDAR FAIR’s last 10-K reports total assets of 2,240,533,000 for the fiscal year ended 12/31/2023 which matches the amount reported by Six Flags Entertainment Corporation/NEW.

We created this because early on clients and potential clients would lament the fact that they were losing observations – the most common example they would cite was Oracle (777676 -> 1341439). These links are often hard to identify. The best check is when we can compare the balance sheets. However, even knowing that these events happened is hard. These entities do have some reporting obligations, but my estimate is that only about 60% of filers do everything correctly. The good thing is that these events are relatively rare.

One last note: we don’t automatically update your mapping dictionary; you have to update it yourself. To do so, from the File menu of the application select Options. From Options select Update Historical CIK and then press the Perform Update button.

I cannot guarantee that we have identified all mappings. The next time I get some bandwidth I am going to see if I can use our compensation databases to identify others. My hypothesis is that if one filer has executive (or director) compensation in 2023 and none in 2024, another has compensation in 2024 but not 2023, and there is overlap in the PERSON-CIKs of the executives, then we should research those to see if the 2024 filer is the successor. However, overlap alone will not be sufficient. There are a number of executives and directors who had reporting obligations because of their roles at World Wrestling Entertainment Inc (WWE) and have similar reporting obligations with TKO Group Holdings, Inc, but TKO acquired WWE; it is not a successor entity.

As an aside – the reason I selected the 2016-2020 span is that none of the successor filers had filings in that span. Both the successor and predecessors had filings in the 2021-2024 span.

Interesting Problem – New Code Added to the directEDGAR Code Repository

I was contacted by a research assistant this morning who was trying to collect some interesting data. They had a list of CIKs with a date variable, and they wanted to identify any 8-K filed by those companies in a ten-day window around the date variable. I was going to point them to our CIK/Date Filter capability (CIK_DATE_FILTER), except I noticed that their list covered the entirety of EDGAR. Further, we have not finished standardizing the metadata at the document level, and they wanted the SEC FILING date as well as the acceptance date-time. The first constraint meant that post-query filtering was going to be a drag – much better to search one index at a time. The second meant that the inconsistent metadata would waste some of their time.

I ended up writing two bits of code. The first took in their original list and added additional rows and a new field. The code checked each CIK and determined whether or not the CIK was in our CIK_MAPPING database, which is where we try to archive CIK changes. The student had CIK 1532063 with an event date of 6/30/1996. The problem is that CIK 1532063 is the successor to CIK 885721, and CIK 1532063 made its first filing on 10/6/2011. Therefore, without augmenting their CIK list to add an entry for CIK 885721 with the date 6/30/1996, the entry for 1532063 for 6/30/1996 would return nothing.

The first bit of code is called augment_cik_data.py. It takes an input csv file with columns labeled ‘cik’ and ‘last_date’ and checks the CIK values against the CIK_MAPPING database. If a CIK is present, the entry is duplicated, with the value from the RELATED_CIK_1 field in the database as the new value in the ‘cik’ column and the original ‘cik’ value added to a new column named ‘BASE_CIK’. Below is a screenshot of the csv file. The entries in red text are the augmented entries.
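The core of that step is presumably along these lines (a sketch, not the actual script; the mapping dict here stands in for the CIK_MAPPING / RELATED_CIK_1 lookup):

```python
import pandas as pd

# stand-in for the CIK_MAPPING lookup (original CIK -> related CIK)
mapping = {1532063: 885721}

df = pd.DataFrame({'cik': [1532063, 320193],
                   'last_date': [19960630, 20200101]})

extra = df[df['cik'].isin(mapping)].copy()
extra['BASE_CIK'] = extra['cik']           # remember the original CIK
extra['cik'] = extra['cik'].map(mapping)   # swap in the related CIK

augmented = pd.concat([df, extra], ignore_index=True)
# the 1532063 row is duplicated with cik=885721 and BASE_CIK=1532063
```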

I then used this file as the input to a second script named pull_matching_from_8K_meta.py. This reads the newly augmented csv file and finds matching entries based on the CIK and date criteria.
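The matching step presumably looks something like this sketch (the column names and the reading of a “ten day window” as +/- 5 days are my assumptions, not the actual script):

```python
import pandas as pd

# toy stand-ins for the augmented sample and the 8-K metadata table
sample = pd.DataFrame({'cik': [885721], 'last_date': ['19960630']})
meta = pd.DataFrame({'CIK': [885721, 885721],
                     'SEC_FILING_DATE': ['19960627', '19970102']})

sample['last_date'] = pd.to_datetime(sample['last_date'], format='%Y%m%d')
meta['SEC_FILING_DATE'] = pd.to_datetime(meta['SEC_FILING_DATE'], format='%Y%m%d')

merged = sample.merge(meta, left_on='cik', right_on='CIK')
window = pd.Timedelta(days=5)  # assumption: ten-day window means +/- 5 days
hits = merged[(merged['SEC_FILING_DATE'] - merged['last_date']).abs() <= window]
# only the 6/27/1996 filing survives the window filter
```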

To run the first script, log into the platform and upload a csv file with your sample. To follow this example exactly you need a column labeled ‘cik’ and a column labeled ‘last_date’. The cik values need to be integers and the last_date values need to be in the form YYYYMMDD. From the Catalog icon at the top of the frame select IDLE – see the image below.

When Python opens from the File menu select Open as illustrated in the next screenshot:

Navigate to the platform share and the folder PythonCode to select the augment_cik_data.py script.

When the file opens you will need to modify the input and output paths as illustrated in the next image.

You need to save the file. However, you don’t have write privileges on the share, so you will want to save it to the same folder as your csv file. Once you have saved it with the edited paths (and any other edits you might make), use the Run Module command to run the file.

When the code finishes (it is very quick) there will be a completion message in IDLE. Now that you have supplemented the original sample with any matched CIKs, use that output file as the input to pull_matching_from_8K_meta.py. Follow the same steps as above: find the file on the share, modify the paths, save it to your Temporary Files folder, and run it.

The first time I ran this it took about six minutes, so I added an index on SEC_FILING_DATE and it then took less than a minute.
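For anyone replicating this, adding the index is a single statement (the table and column layout below is a toy schema, not our actual database):

```python
import sqlite3

con = sqlite3.connect(':memory:')  # stand-in for the real metadata database
con.execute("CREATE TABLE meta (CIK INTEGER, SEC_FILING_DATE TEXT)")

# the one-line change that cut the matching run from ~6 minutes to under 1
con.execute("CREATE INDEX idx_filing_date ON meta (SEC_FILING_DATE)")

index_names = [row[1] for row in con.execute("PRAGMA index_list('meta')")]
# index_names -> ['idx_filing_date']
```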

Are 10-Qs Going to Go Away?

The Wall Street Journal just reported that President Trump is considering eliminating quarterly reporting for US issuers (WSJ Article). Wow! It is an interesting argument – will companies somehow perform better if they do not have to report earnings four times a year? What about investor uncertainty? I will try later to post links to some academic research that has established that information in 10-Qs has significant price implications. I spent much of the weekend trying to sort out how we can quickly get the new data that I expect to see when companies start filing under ASU-2024-03, but frankly I was wondering in the back of my mind if the SEC would somehow lessen the impact of that standard. I did not imagine this.

Why Redact After the Fact?

There is a joke in my household – how many trips to Lowe’s will it take Burch (dad) to fix something? It takes multiple trips because I often don’t know the full scope of what I am dealing with until I am in the middle of it. That same concept applies to the rebuild of our indexes with the new metadata – it seemed like an easy enough task until we hit issues that we were not expecting and I feel like I need to understand them.

As you can imagine, we have various logging processes. One of those records details about the filings we download into our EdgarIndex database. This is based on the RSS feed, which we inspect about every 10 minutes. I was using our EdgarIndex to collect fresh copies of 2021 10-K filings and there was a logging message that this filing was not available (www.sec.gov/Archives/edgar/data/1108524/000110852421000014/0001108524-21-000014.txt) today (8/8/2025). This seemed weird because our system is pretty robust, and I was pretty sure that entry was not the result of a system hallucination. I went to EDGAR and confirmed that the filing is no longer accessible. The CIK (1108524) is Salesforce.com’s.

Time to do some research – I verified that the log entry was valid in the sense that there was nothing unexpected about it. I then checked whether we had a 10-K filing that matched the expected RDATE-CDATE-FVALUE based on the log entry. Since the log entry showed the filing was their 12/31/2020 10-K, which was accepted at 4:51 PM on 3/17/2021 and which we downloaded at 5:05 PM, I expected to see a 10-K in an RDATE-CDATE-FVALUE folder labeled R20210317-C20201231-F14. The 10-K was present.

However, there is no filing on EDGAR – so I was curious and went to their landing page on EDGAR and I saw instead an entry for a filing with a calendar year 2022 accession number.

We have that one as well. This is another example of why we don’t use the Filing Date as a primary indicator in our archive. That particular version was submitted on 2/24/2022 at 4:13 PM. So it turns out that we have two copies of this filing: the original and the replacement. I pulled the feed from 2/24/2022 and confirmed that this was one of two filings deleted that day by Salesforce.com (the other was a 10-Q originally submitted on 6/1/2020).

Because we have the original (3/17/2021) filing as well as the replacement submitted on 2/24/2022, I was more than curious about what changed, so I compared the documents between the two filings. The only content changes were additional redactions in the newer version of an exhibit related to the aircraft time-sharing agreement between Mr. Benioff and Salesforce.com. I was lazy: I just used the DocumentExtractionTextOnly feature to pull the text from the filing and each of the exhibits. I then read each document in as a list of lines and also created a set (to address blank lines), and compared each of the documents from the old and new filings, looking for lines in one that were not in the other using set operations.
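The comparison amounts to simple set operations; a sketch with made-up exhibit lines:

```python
# made-up lines illustrating the old/new exhibit comparison
old_lines = ["Aircraft: certain Gulfstream GVI (G650ER)", "Term: 12 months", ""]
new_lines = ["Aircraft: [***]", "Term: 12 months", ""]

# building sets also collapses the blank lines mentioned above
old_set = {line for line in old_lines if line.strip()}
new_set = {line for line in new_lines if line.strip()}

removed = old_set - new_set  # present only in the original filing
added = new_set - old_set    # present only in the replacement
```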

The newer version had the model/type of the plane redacted as well as the home base and contact information for someone who I suspect is Mr. Benioff’s CPA/Financial Advisor.

Finally – to my point – why redact once the filing has been released? Given the length of time between the original filing and the deletion, I idly searched for some of the lines that were newly redacted. I found four hits when searching Google for “certain Gulfstream GVI (G650ER)”. Here is a link to one of those four hits (Justia Hit). That content was redacted in the replacement filing. Here is the result of that search in directEDGAR in the 2021-2024 10-K index collection.

I reported above that a 10-Q filing was also replaced on the same date. The original 10-Q was filed on 6/1/2020. It had even more details about the aircraft subject to the time-sharing agreement than the 10-K did. The agreement in that 10-Q is also publicly available through other sources.

Given the persistence of information disseminated on the internet, I just don’t understand why Salesforce.com devoted the money, and especially the time, to do these redactions. It is not automatic: they had to contact the SEC, there were probably two attorneys who reviewed the content, and maybe 19 letters (that is an exaggeration) to the SEC to convince them there was a valid reason to pull and replace the filings. Sure, the filings are no longer directly available on EDGAR, but they were disseminated both to users who redistribute them to other platforms and to users who retain them for internal use only. And then there are folks like us who collect them almost immediately, archive them, and make the content available. I just can’t figure out the benefit of the additional redactions.

For our clients, I hope this helps you understand why this transformation is taking so darn long. It is more than just re-tagging the filings. As we encounter these hurdles/unexpected issues, we have to stop and consider our options to make sure we handle these in an optimal way.

As a total aside – the tail number of the plane was available in one of the earlier (and since deleted) filings. I clearly had too much time on my hands. I was curious what a search for it would reveal. The tail number is N650HA – and it is for sale. This link was active as of 8/9/2025 (Gulfstream for sale).

New Metadata (Fields) is Starting to Appear

I stated in my last client update that we would start adding some new fields to our filings in about two weeks. Unfortunately, and as is normal, we ran into unexpected problems that needed to be addressed before we could start the process. I think all of the problems have been addressed and so this process has started. The screenshot below contains a small section from a search of the 2025 Proxy filings with the new metadata field names and content highlighted in red.

ACCEPTANCE is the EDGAR system acceptance time. COFILERS is a pipe-delimited list of all CIKs associated with the filing. FILINGDATE is the SEC-determined date for the filing based on EDGAR system rules. This will generally match the RDATE – when they don’t, the RDATE represents the date this version of the filing was disseminated through EDGAR. SECPATH is the path to the document that was returned in the search. This is the path to the actual document from the filing, not the landing page (accession.index.htm) for the filing, except in those cases where the filing exists on EDGAR as an ACCESSION.txt file without separation of the constituent documents.

Most of this, except for COFILERS, was/is available by using one or another of the databases. I wanted this particular information to be immediately available.

The new metadata serves two purposes. First, to give you immediate access to data that might be relevant for your research. For example, if you are going to run an event study using tick data you might want the value for ACCEPTANCE. If you are trying to match this data with data from other sources that provide the accession number that value is a component of the SECPATH. I think this is one of the most common requests we get from clients – asking for help to match the document to an accession number so the results can more reliably be matched to something else. The other reason is to improve filtering (delivering only the results that are relevant to your search needs).

As part of this process I wanted to improve your ability to write code against the individual documents, so we restructured the layout of the documents. If you are writing Python code against the archive, all of the content should be a child of the body element if the document is an html file. Thus a very quick and direct way to access the body element to do more substantive work is:

# presuming you are just walking the 10-K Y2025 Archive
import glob
import os
from lxml import html

for cik_dir in glob.glob(r'S:\directEDGAR_ACCOUNTING\10KMASTER\Y2025-Y2025\FILINGS\*'):
    for rdate_dir in glob.glob(cik_dir + os.sep + 'R*'):
        for htm_f in glob.glob(rdate_dir + os.sep + '*.htm'):
            with open(htm_f, 'r', encoding='utf-8') as htm_fh:
                htm_text = htm_fh.read()
            htm_tree = html.fromstring(htm_text)
            body = htm_tree.find('body')
            # now you can operate on the body and directly access the
            # structural content, text and other features
            # let's pretend you want the meta elements
            my_dict = {m.attrib['name']: m.attrib['content'] for m in
                       htm_tree.findall('.//head/meta')
                       if 'name' in m.attrib and 'content' in m.attrib}
            # we often use the cik-rdate stuff - this is our deid
            cik = cik_dir.split(os.sep)[-1]
            rdate = rdate_dir.split(os.sep)[-1]
            my_dict['deid'] = cik + '-' + rdate

As I am writing this, the 2025 Proxy and 10-K archives have been updated. I hope that before Monday (7/21) the 2025 10-Q and 8-K archives will be completed, and then we will just be moving back in time. How long that will take is anyone’s guess. We will have to download the filings from 2022 back to 1994, and that will just take time. I know there are various cheats to get around the SEC rate limits, but I have never felt they were worth the effort. We set a 1/2 second pause after every header/accession.txt file we download. This keeps us well within the 10-requests-per-second limitation that the SEC has imposed. It makes things slower, but EDGAR is an amazing resource and I like having access to it.
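The throttling is nothing fancy; a sketch of the pattern (the fetch callable stands in for whatever function performs one download):

```python
import time

def fetch_all(urls, fetch, pause=0.5):
    """Download each URL, sleeping half a second between requests to stay
    well under the SEC's 10-requests-per-second guideline."""
    results = []
    for url in urls:
        results.append(fetch(url))
        time.sleep(pause)
    return results

# usage sketch: fetch_all(accession_urls, my_download_function)
```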

CEO and CFO Fields Added to EC Data

I think the headline tells the story – we have added two new fields to the Executive Compensation data: CEO and CFO. Each takes the value 1 or 0, where 1 indicates that we have identified that person in that role and 0 otherwise.

We had to make some data collection decisions that we might or might not revisit later. One is that if there is a transition during the year we identify the CEO/CFO as the person in that role on the last day of the fiscal year. For example, Brian Niccol was appointed CEO of Starbucks on September 9, 2024. Their FYE was 9/30. Prior to Mr. Niccol’s appointment Rachel Ruggeri served as the interim CEO from August 12 until September 9 and with respect to the 2024 fiscal year Laxman Narasimhan was the CEO from 9/30/2023 until August 11, 2024. Our compensation data has Ms. Ruggeri’s title for 2024 as EXECUTIVE VICE PRESIDENT, CHIEF FINANCIAL OFFICER AND FORMER INTERIM CHIEF EXECUTIVE OFFICER. Mr. Narasimhan’s title for 2024 is reported as FORMER CHIEF EXECUTIVE OFFICER.

Here is a screenshot after using the Query feature to select all CFOs with a salary greater than 650,000.

As a reminder – our EC data is generally updated each week.

Comprehensive Data Collection – Chair and Lead Director with their Tenure

Data collection can be tedious – but there are ways to leverage our platform to really speed up even the most mundane task. I am working on a project with some colleagues and I was tasked with identifying the Lead Independent Director and/or Chair of the board, as well as their tenure. This data needed to reflect these positions in calendar year 2024, and the sample is those companies that self-reported as Large Accelerated Filers with a 12/31 (approximate) FYE. The particular challenge is identifying the tenure as Lead Independent Director: I am finding that this is not directly disclosed in over 30% of the filings. There is a really easy way to address that – I describe below how to run a second search in another instance of directEDGAR to identify the tenure.

Who wants to type names as well as the critical metadata that is required? Not me – so my first step was to pull the executive and director compensation data for my sample firms. Since these were 12/31 firms I would need to pull 2024 EC and DC data. Below is a screenshot of my pull of the EC data using our Query tool. If you look closely, the CIK Filter button is blue. I have a file that has the 1,297 CIKs in my sample and that file was selected. You can also see that I have the Criteria set as YEAR=2024. I saved the results as a CSV file and then repeated the same steps to collect the DC data.

I then combined the two files. To do so I did have to account for the fact that there are some minor differences in the column headings. For example, the EC data has PERSON-TITLE, SALARY and BONUS whereas the DC data has none of those and instead has CASH. Despite these differences it took just a few minutes to arrange the columns. I also added a new column defining the nature of the data (EC/DC), because I wanted to sort the results and keep the EC and DC data separate.
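If you would rather script this step than arrange the columns by hand, the merge can be sketched with the standard library. The file contents and headings below are illustrative stand-ins for the actual Query exports:

```python
import csv, io

# Tiny stand-ins for the two Query exports (headings differ by design).
ec_csv = "CIK,PERSON-TITLE,SALARY,BONUS\n1234567,CHIEF EXECUTIVE OFFICER,900000,100000\n"
dc_csv = "CIK,CASH\n1234567,85000\n"

def load(text, source):
    rows = list(csv.DictReader(io.StringIO(text)))
    for r in rows:
        r["SOURCE"] = source  # tag rows so EC/DC stay separable
    return rows

rows = load(ec_csv, "EC") + load(dc_csv, "DC")

# Union of all headings; columns absent from a file are left blank.
fields = ["CIK", "SOURCE", "PERSON-TITLE", "SALARY", "BONUS", "CASH"]
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=fields, restval="")
writer.writeheader()
writer.writerows(rows)
combined = out.getvalue()
```

The `restval=""` argument is what handles the mismatched headings: DC rows get empty PERSON-TITLE/SALARY/BONUS cells and EC rows get an empty CASH cell.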

There is a little bit of a challenge here. The order of the search results will not match the data order unless we force it. That problem is easily solvable. What I am going to do is run a search in the 2025 Proxy archive for (DOCTYPE contains(DEF*)) filtered on my CIK list. I am then going to use the SummaryExtraction file to sort my EC/DC data to MATCH the search result order. Here is something important – the results of any subsequent search against the archive will always follow this same order (when there is a matching document).

After downloading the SummaryExtraction file I inserted a new column – ORDER – to the left of the CIK column and just numbered the results sequentially.

I then copied those two columns to my combined compensation file and used Excel’s VLOOKUP function to copy the value of ORDER into the compensation data. Here is a screenshot of some of the rows in that file after I added the ORDER value.
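For anyone who prefers a script to VLOOKUP, the same join can be sketched in a few lines of Python. The CIKs, names and ORDER values below are made up; in practice ORDER comes from the SummaryExtraction file:

```python
# Made-up CIK/ORDER pairs standing in for the SummaryExtraction file.
summary = [("1234567", 1), ("7653421", 2), ("9999999", 3)]
order_by_cik = dict(summary)

# Made-up combined EC/DC rows.
comp_rows = [
    {"CIK": "7653421", "NAME": "DOE JANE", "SOURCE": "DC"},
    {"CIK": "1234567", "NAME": "SMITH JOHN", "SOURCE": "EC"},
]

# The VLOOKUP step: attach ORDER to each row, then sort the
# compensation data into the search-result order.
for row in comp_rows:
    row["ORDER"] = order_by_cik[row["CIK"]]
comp_rows.sort(key=lambda r: r["ORDER"])
```

Every executive and director for a given CIK inherits the same ORDER value, which is exactly what the VLOOKUP does in the spreadsheet.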

This took me about ten minutes but it will save me significant time as I go forward. My next step is to identify those companies that have a lead independent director so I can then identify the name of that person. Not every company will have a lead independent director – particularly those with an independent chair. So I am not going to look at the initial search results. The first search was just to define the order I want to use for my data collection. I am going to run a new search. I am still limiting the results by my CIK list but now I am searching for (DOCTYPE contains(DEF*)) and (lead director or lead independent director) andany since. Here is a screenshot of the result of that search.

I want to explain this search. It only returns documents that

  1. Are DEF 14A
  2. Have either
    1. The phrase Lead Director, or
    2. The phrase Lead Independent Director
  3. If the word Since is present in a document that meets both 1 & 2 above then all instances of the word Since will be highlighted as well.

If you compare these results with the image above from the search of (DOCTYPE contains(DEF*)) you will notice we have fewer results. That just means that those companies not listed in this search do not have either phrase we were searching for. That is fine – I am focused on identifying whether or not there is a person with the title LEAD DIRECTOR or LEAD INDEPENDENT DIRECTOR and I want their tenure.

My goal at this stage is to identify the Lead Independent Director and their tenure in that position. Occasionally I am going to get lucky because there will be cases where the chair is also identified – if you scan that result you can see that Ellen Gordon is the chair and has been since 2015. That is a lucky accident for me. Since I included both EC and DC data I will go ahead and record that information.

The filing expressly reports that they (Tootsie Roll) do not have a lead independent director – so I very quickly completed my data collection for Tootsie Roll.

In a world optimized for my data collection needs the disclosures would all fall into a pattern like this one for BOSTON BEER – very clear indication that Mr. Nemeth is the Lead Director and has served in that position since May 2024.

As many of you know, the world is not optimized for our data collection. Sometimes we need a boost. Below is a screenshot of the location where I identified the name of the Lead Independent Director in Seacoast Banking Corp of Florida’s proxy.

After careful review I could not find the year Mr. Fogal was first elected to that position. So here is an important trick. I suspected that the company had treated that information as a boilerplate disclosure, so I started another instance of the directEDGAR application in my Appstream session and used the clues in the disclosure above to search prior filings to see if I could identify when he started as Lead Independent Director. In the next screenshot you can see I have selected proxy filings back to 2005.

Based on the disclosure above the search I ran was elect* w/5 fogal w/10 lead independent. No need to filter on DOCTYPE or do anything more complicated. Here is a screenshot of that search and the evidence I need to record 2018 as the beginning of his tenure as Lead Independent Director.
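Because these secondary searches all follow the same template, they can be generated mechanically for every name on the list. A trivial sketch (the helper function is mine, not a directEDGAR feature):

```python
# Hypothetical helper – builds the proximity query used above for any
# director's last name.
def tenure_query(last_name: str) -> str:
    return f"elect* w/5 {last_name.lower()} w/10 lead independent"

# One query string per director identified in the first pass.
queries = [tenure_query(n) for n in ("Fogal", "Nemeth")]
```

Generating the strings up front means each secondary search is a paste-and-run rather than something composed from scratch.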

You can see that my original search is partially visible in the background. When the filing does not mention the start date, I use the second instance to identify the tenure once I have identified the Lead Independent Director. This keeps me focused and moving forward without having to agonize over how to collect this data. Names are fairly unique, especially when we limit on phrases like Lead Independent Director, so I did not run into much noise when I constructed these secondary searches.

In the Weeds – Another Interesting Example Why We Use the RDATE as our Proxy for the Filing Date

We are going to reconstruct our filing archive. I was testing the code that will process the accession.txt files today on 2024 proxy filings. Rather than pull new filings I was working with the 2024 files we pulled during our production run in 2024. It is a bit tedious, but necessary, to identify all of the issues we had not thought of while writing the code.

Here is the link to the index page for a DEF 14A filing made by Biora Therapeutics – notice that the filing date is reported as 2024-09-16. If you look carefully at the index page you will not see the date 9/20/2024. However, the filing ultimately ends up in the directEDGAR archive with a DEID of 1580063-R20240920-C20241009-F36. The reason for that is kind of interesting if you are an SEC wonk.

Here is a little extract from the accession.txt file that was in our archive and originally pulled at 3:45 PM (CDT) on 9/16:

</SEC-HEADER>
<DOCUMENT>
<TYPE>DEF 14A
<SEQUENCE>1
<FILENAME>d48307ddef14a.htm
<DESCRIPTION>DEF 14A
<TEXT>
</TEXT>
</DOCUMENT>
<DOCUMENT>

Notice that there is no content between the SGML open and close TEXT tags, despite other clues that there should have been an HTML file with the filename d48307ddef14a.htm between those tags. Based on these clues I concluded that either the EDGAR processor ripped out the original document during the submission process or the filer somehow failed to map the HTML file in when constructing the accession.txt file.
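A check for this condition can be sketched with the standard library. The function name and parsing approach are mine, not our production code; it simply flags DOCUMENT blocks whose TEXT section is empty even though a FILENAME was declared:

```python
import re

def empty_text_documents(sgml: str) -> list[str]:
    """Return filenames of DOCUMENT blocks whose TEXT section is empty."""
    flagged = []
    for doc in re.findall(r"<DOCUMENT>(.*?)</DOCUMENT>", sgml, re.DOTALL):
        name = re.search(r"<FILENAME>(\S+)", doc)
        body = re.search(r"<TEXT>(.*?)</TEXT>", doc, re.DOTALL)
        if name and body and not body.group(1).strip():
            flagged.append(name.group(1))
    return flagged

# The fragment from the Biora accession.txt file shown above.
sample = (
    "<DOCUMENT>\n<TYPE>DEF 14A\n<SEQUENCE>1\n"
    "<FILENAME>d48307ddef14a.htm\n<DESCRIPTION>DEF 14A\n"
    "<TEXT>\n</TEXT>\n</DOCUMENT>\n"
)
result = empty_text_documents(sample)  # ['d48307ddef14a.htm']
```

Running something like this across an archive would surface filings where the dissemination feed shipped a shell without the actual document.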

If we look at the associated header file from EDGAR – the current version differs from the one we originally pulled. I have pasted the first few lines from each of the header files below:

The date next to the header ‘declaration’ differs between the two versions. However, what I find really interesting is that a DATE-OF-FILING-DATE-CHANGE tag was added to the updated header, carrying the date of the original submission. This tag is not found in every filing. Per the EDGAR filer manual it is supposed to indicate the “Date when the last Post Acceptance occurred.” It is an optional tag and we generally see it when the SEC-HEADER associated date does not equal the filing date. I should clarify though – we see it much more frequently than expected. We see it even when the complete filing does not appear to be changed and the date next to the header tag matches the FILING-DATE. So the existence of that tag, on its own, is not a clear signal that something might be amiss.

Chasing these kinds of things is probably a waste of time but I am hugely curious about all things EDGAR so I decided to pull the feed archive for 9/16/2024. I drilled through it and found the original filing and it matched the one we pulled on 9/16/2024 – specifically the HTML content was missing. Based on my experience with these I then pulled the feed archive for 9/20/2024 and found another version of the accession.txt file (saved as an accession.nc file) and it matched the current version that is available from the index page.

What can we conclude? I do not believe the DEF 14A was available to the public on 9/16. If you were running an event study and this filing were in your sample, you would be introducing error by using the Filing Date reported on the landing page, in the EDGAR indexes and in the filing itself.

What I find particularly interesting is that it is not clear to me that we can rely on the file dates as reported in the EDGAR archive for the filings. Here is a screenshot of the accession archive directory for this filing.

Based on everything above, I doubt that the LAST MODIFIED date shown for this filing is correct. The htm file was not included in the dissemination feed on 9/16 but was present on 9/20, so I suspect the filing was simply not available until 9/20. I find it interesting that the EDGAR code altered the LAST MODIFIED date to match the acceptance date-time.

I did warn you that this was a bit wonky. More on the archive rebuild later.