AI Automation Does Not Make the World Better

Is it an age thing – I really dislike all of the ‘AI’ junk that invades my work! How about when you are trying to compose an email and your provider happily types in front of you. Today I am normalizing column heading extracted from tables that report the status of shares to be issued and available in Equity Compensation Plans (often reported in ITEM 12 in the 10-K). Step 1 is tedious because of slight but potentially meaningful differences in column headings even though the actual column headings are specified by CFR 17 SEC 229.201.

I want to map the standard column headings into TO_BE_ISSUED, AVERAGE EXERCISE PRICE and AVAILABLE. So I am carefully reading the original column headings and making the determination. I accidentally hit enter when Excel analyzed and determined that I was typing AVAILABLE and it generously populated the cells below – with nonsense no less. Here are some of the 64 Automatic Mappings that were auto-populated

ORIG_LABEL	NEW_LABEL (from Excel)
'NUMBER OF SECURITIES AVAILABLE FOR FUTURE ISSUANCE UNDER EQUITY COMPENSATION PLAN (EXCLUDING SECURITIES REFLECTED IN COLUMN) (# OF SHARES)	AVAIABLE
'WEIGHTED AVERAGE EXERCISE PRICE OF OUTSTANDING OPTIONS AND RIGHTS 1 EQUITY COMPENSATION PLANS APPROVED BY SHAREHOLDERS:	AVERAGE
'PLAN CATEGORY NUMBER OF SECURITIES TO BE ISSUED UPON EXERCISE OF OUTSTANDING OPTIONS, WARRANTS AND RIGHTS	CATEGORY
'EQUITY PLAN SUMMARY COLUMN NUMBER OF SECURITIES TO BE ISSUED UPON EXERCISE OF OUTSTANDING OPTIONS, WARRANTS, AND RIGHTS	COLUMN
'NUMBER OF SECURITIES TO BE ISSUED UPON EXERCISE OF OUTSTANDING OPTIONS, WARRANTS AND RIGHTS AND THE VESTING OF UNVESTED RESTRICTED STOCK UNITS	EXERCISE
'EQUITY COMPENSATION PLAN INFORMATION NUMBER OF SECURITIES TO BE ISSUED UPON EXERCISE OF OUTSTANDING OPTIONS WARRANTS AND RIGHTS	INFORMATION
'NUMBER OF SECURITIES TO BE ISSUED UPON EXERCISE OF OUTSTANDING OPTIONS, WARRANTS AND RIGHTS OR SETTLEMENT OF RESTRICTED STOCK UNITS	ISSUED
'NUMBER OF SECURITIES TO BE ISSUED UPON EXERCISE OR SETTLEMENT OF OUTSTANDING OPTIONS, WARRANTS AND RIGHTS	OF
'EQUITY COMPENSATION PLAN INFORMATION AT DECEMBER 31, 2024 WEIGHTED-AVERAGE EXERCISE PRICE OF OUTSTANDING OPTIONS, WARRANTS AND RIGHTS	PLAN
'NUMBER OF DEFINED SECURITIES TO BE ISSUED UPON EXERCISE OF OUTSTANDING OPTIONS, WARRANTS AND RIGHTS DECEMBER 31, 2024	SECURITIES
NUMBER OF SECURITIES THAT MAY BE ISSUED UPON EXERCISE OR VESTING OF OUTSTANDING OPTIONS, WARRANTS AND RIGHTS AT TARGET	THATMAY
'NUMBER OF SECURITIES TO BE ISSUED UPON EXERCISE OF OUTSTANDING OPTIONS OR SETTLEMENT OF LTIP PSUS AND RSUS	TO
'NUMBER OF SECURITIES TO BE ISSUED UPON EXERCISE OF OUTSTANDING OPTION AWARDS, RESTRICTED STOCK UNITS, AND PERFORMANCE STOCK UNITS	TO B ISS
NUMBER OF SECURITIES AVAILABLE FOR FUTURE ISSUANCE UNDER EQUITY COMPENSATION PLANS (EXCLUDING SECURITIES TO BE ISSUED UPON EXERCISE OF OUTSTANDING SHARE OPTIONS)	UNDER
NUMBER OF SECURITIES AVAILABLE FOR FUTURE ISSUANCE UNDER EQUITY COMPENSATION PLANS (EXCLUDING SECURITIES TO BE ISSUED UPON EXERCISE OF OUTSTANDING SHARE OPTIONS)	UPON

Those of you that are avid readers know that I can substitute their for there. I have never lost sleep over those errors – I only have one brain cell. This interference is frustrating and unproductive.

I am not sure how Excel does this – but I think it could border on industrial espionage. It might be that Excel is using a C library similar to FuzzyWuzzy and the analysis and substitution decisions are happening locally. But what if they are communicating with a thread in the cloud? There are two concerns I have about this.

First, what if I am working with confidential data – does #Microsoft see all of our data and are they able to act on it? Second, I had to correct these errors, am I serving as an uncompensated trainer for Microsoft? The initial predictions were delivered, I corrected them. Did that info get delivered back to Microsoft and is that going to feed the monster? Two years ago I would have thought that possibility was idiotic – I am not so sure now.

I published this originally about an hour ago. It took me twenty minutes to clean up the mess that was caused by the original automation. I am back on track, happily making my mappings and it happened again. I was read for it this time – When Excel stared volunteering to do my work I froze and captured it.

If you can see above – I had typed REM and Excel volunteered AINDING – and look below – if I had not caught that we would have had the ANCEMAINCURITIES – tell me this is not a problem.

What is the truth?

We are desperately trying to finish up a new comprehensive audit fee database. One of the fields is the AUDIT REPORT DATE. But then we run into issues like this one. Here is a screenshot of the audit signature following the audit report included in New Concept Energy’s 10-K for the FYE 12/31/2024. Here is a link to the filing CIK 105744 10-K.

The dissemination date of this 10-K was 3/24/2025. Approximately one year and one week after the signature date. We discovered this while stress testing our data collection strategies. We have a fairly robust regular expression to use to find these dates – but the discrepancy between the dissemination date and the audit report date warranted further research (we don’t know how we make mistakes until we find the mistakes).

So now the question is what to do about this? I wonder if this is a clue (gosh I hate giving away this research idea) but anyway, is this a clue that the “Company’s internal control over financial reporting was ~~effective~~ ineffective as of December 31, 2024.” Note, the quote is from the 10-K but the date issue, while perhaps trivial might be evidence that the company does not have effective controls. I started this paragraph wondering what we should do about this. We have a couple of choices:

Leave as is.
Leave as is and flag.
Make an educated guess and change the date to 3/17/2025.
Leave the field blank.

I think what we need to do is confirm that we correctly pulled the value that is being asserted, audit report date (in this case) – and leave the value if we have collected it correctly but flag the issue.

I teased in the middle of the prior paragraph – it is kind of interesting the number and nature of errors that we find. Could this be evidence of ineffective controls in that the company does not have enough staff to carefully review and evaluate their disclosures?

Another New Dehydrator Feature

As illustrated in the screenshot below – I updated the Dehydrator to include what I have cleverly named the hydrator-{n}-grid-data.csv.

The n represents the count of dehydration artifacts in the directory that was analyzed. This new output reports the results of dehydration on a row by column basis. To keep the focus on the data I have hidden most of the metadata fields in the file – but here is a screenshot of the datavalues:

The RID represents the index of the data rows the data was pulled from. The ROW-LABEL reports the actual content from that row in the first column from the left. The CID represents the column index where 1 represents the first column to the right of the ROW-LABEL column. The value I have blocked above (818019) can be seen in this table which was the source of the data.

While it may look like the ROW-LABEL is just JOHN M. HOLMES, the actual html tells a different story. In this case both the name and the title are marked in the html as data belonging to each of the first three rows.

The entire Dehydration/Rehydration process is designed to allow you to more easily normalize the column headings (and/or row labels) to create a more compact matrix of data when working with hundreds or thousands of snipped tables from the filings. For example, while 3,111 DEF 14A filed in 2024 and so far in 2025 used OPTION AWARDS as a column heading there were 83 other variants used (OPTION-BASED AWARDS, OPTIONS/SAR AWARDS, . . . AWARDED OPTIONS). With the column-list.csv file you are able to identify one label that is to be applied to all of these phrases that carry the same semantic meaning. This is great when you want the entire result set normalized.

If you are not looking to normalize the whole table, instead if you only want particular data values then this feature is designed to help you get faster access to the data you are looking for. In this case, if all you wanted were option like awards then you can easily sort on the relevant words for the COLUMN-LABEL and that data is quickly available in a very compact form. I had two different PhD students who described this issue (not with compensation data) over the summer so I wrote the code to address their particular problems and have finally been able to integrate it into dehydration. This does not add much overhead to the Dehydration process so after agonizing a bit about offering a switch for it I just decided to make it a default feature.

I am going into the weeds next with this. There is another very important reason for this artifact. We have been trying to identify a reliable, but auditable way to reduce the column headings and row labels that are written to the column-list.csv file. We can’t really do this until we have access to all of the column headings and row labels in a particular collection of tables. This output gives us that access. If I am analyzing every single column heading at one time then I can apply another set of rules to reduce them as long as I can also keep track of where each particular column heading was used and if I have a way to share that information with you when complete. Seeing a label in a list of labels is not quite as useful as seeing it in the actual context adjacent to the other labels and data values.

In addition to the creation of this new artifact we did some more tweaking to the parser rules to improve the header and data row separation process. The new version incorporates these additional rules.

New Features for Dehydrator & Severance was Inspirational!

A PhD student reached out a couple of weeks ago because the dehydrator stopped during execution and she wasn’t sure why and how to get the data she needed. As always, I am going to obscure the data that our users are asking about – so in the example below I am going to use Executive Compensation since that is part of our normal processing.

The root cause of the failure was fascinating – there were html comments as children of td elements in the source tables she was trying to normalize. I had never even thought of that and so our parser crashed on her. The standard procedure has been to read the log, find the last file parsed, then use that to identify the next one and remove it as the crash would have been caused by that file. Since this was parallel to some work I have been focused on for a while I decided to use her problem as a test bed for some improvements that have been needed.

First, I changed our sanitizer to anticipate comments as children of any elements not just the table element (which is all I had ever seen before). So those would not be a problem. But I also decided that we could better handle ‘failed’ cases. Previously if the error was related to something that we anticipated the hydrator wrote to the log file that the processing failed for some unspecified error. Below is a screenshot with an example of this prior treatment highlighted.

When this happened we just passed the file and you had to identify these and manually do something to capture the data. Now we are parsing these files without any attempt to identify the column headings. When we have a failed file we are now writing the rows of data from the html into a csv file cleverly named hydrator-{n}-non-rectangular-summary-data.csv where the n represents the hydration attempt in that particular folder. All we are doing is stacking the content. We are actually writing the content one failed file at a time and this was deliberate. I was mulling this over and was initially going to write everything at the end with uniform column headings but we cannot necessarily expect the data in column 3 for CIK 1234567 to be the same for CIK 7653421. So we write in a new set of column headings for each individual file that gets diverted to this part of the processor. Here is the new log entry for the table that was flagged above:

Below is an image of the content from that file after it has been stacked in the non-rectangular-summary-data.csv file. When I took the screenshot I also included the next file to hopefully add additional context to my observation above. That it is a reach to try to predict what values belong in which column when considering the entire universe of tables and data types that you might be extracting. This is why we stack them in this manner.

For testing, I pulled 10,364 executive compensation snips. There were 186 that failed on the first pass and the data from those files was accurately deposited in the new csv file. I should be careful here, I did not inspect each of the rows of the 186 source files to compare those to the corresponding rows in the csv file. I checked this one to confirm the data was accurately deposited and about five others. I hope I am not caught later with inaccuracies. If so I will try to address.

I did some additional re-engineering of the table parsers in the dehydrator. When I received the email from the PhD student I had been spending quite a bit of time looking at tables and code. To identify the column headings in a file we start with trying to use deterministic characteristics of the html. There are two that we have historically relied on. One is the presence of comments that define the column heading rows and the other is the presence of th elements in the rows. However, historically these have been infrequently used. We process thousands of tables each month and we do look closely at their construction and structure. We have noticed additional class attributes that have become more common that are also very deterministic identifiers of column heading boundaries. I included a new parser step that evaluates the existence of these attributes and assigns them a higher priority for helping set those boundaries. With this new parser step approximately 20% of the table column headings are parsed deterministically as opposed to less than 3% before when the only deterministic identifiers were comments and th elements.

If a table does not have deterministic indicators of column headings we then use a series of heuristics. I am not going to share much about those as I consider those to be our ‘secret-sauce’. But I was watching an Apple TV show called Severance one night after wrestling with how to implement more refined heuristics. If you have watched the show, the employees look at numbers for patterns. I ended up deciding to build my own Severance tool to integrate into the dehydrator to test for patterns that would help us deliver better results to you. It is working and it has made it easier to tweak some of our heuristic separators. I have some more on the way. Prior to this update we had four classes of markers, we now have six and I am working on three more. These will slow down dehydration – this new build added four minutes to parsing my sample folder but the results are great – it reduced the number of column headings from over 2,000 in this sample to 1,600 and I hope in the next build to reduce it even more significantly.

This update should be fully available by Monday 10/27. I will post about the next update when it is ready.

DB Document Results Filter is More Versatile than It Seems

Our directEDGAR platform offers at least ten ways to filter search results. The effect of most of these is clearly obvious (DATE, CIK, CIK/DATE, 8-K FILING REASON, DOCTYPE, SICCODE, CNAME, . . .). I used the DB DOCUMENT RESULTS filter today and realized that I have not done a good enough job of describing at least one important use case.

I had an email from a PhD student who was trying to collect a table from some specific filings. They needed some guidance. I want to obscure their data so the following example is based on snipping the Executive Compensation (EC) table. I encourage our users to start table snipping with a really narrow and explicit focus. The idea is that the more narrow the focus the less noise in the output. For example, our process for identifying EC tables starts by requiring the table to have PRINCIPAL, NAME YEAR, POSITION, SALARY, EQUITY, AWARDS, OTHER and TOTAL. I can’t remember from our testing but my recollection is that less than 1% of the tables found with this set of requirements are TYPE 1 errors. And that is the heart of what I want to explain. It is absolutely true that I will not find an EC table from every DEF 14A with those requirements but my results will be very clean. Now the problem is to collect tables from those filings that we missed on the first pass, for example a reasonable proportion of filers do not use NAME, PRINCIPAL POSITION or YEAR in the column headings.

When our table parser finishes one of the outputs is a csv file named MISSING. The file contains a single column labeled FILENAME. This column contains a list of files from our search results where the application did not return/identify a table. Here is a screenshot of the file.

If we rename the FILENAME column to DE_PATH we can then use that file to filter our previous search on just those documents. That is important – presumably our original search set out some important parameters and we want to focus our data collection on just those firms. So I renamed the FILENAME column to DE_PATH and saved the missing file. I then selected the DB Document Results Filter by activating the Use DB check box and then clicking the Set DB File button.

Once I hit the Okay button and then Perform Search the results will be filtered to only those filings/documents that were listed in the CSV file. Here is the result of that search. Now I can review these presumably relevant filings to identify how to identify the EC table using less restrictive criteria. I brought the focus to a filing where the EC table did not have a label for the YEAR column.

I would have to dig through our system to say this with absolute authority but my recollection is that there are some tables where the only way we can grab them is by using the words SALARY and TOTAL. However, if we were to use only those words to identify tables on the first pass then we would get a huge amount of noise.

We have some obligations to deliver normalized Executive and Director compensation data by 7:00 PM each filing day so we absolutely cannot waste time having a person trudge through noisy data. While all of this is automated our process started with what I described above. We are getting ready to add two new tables to our system. The collection of these table will be automated. But the way we learn to automate it is to begin by manually snipping and then reviewing tables as they exist and identify the characteristics that will provide the least amount of noise in the output on the first pass. For me to have this coded I have to be able to describe how to filter the tables to settle on the exact criteria to apply. The only way I can do that is to review filings, snip some tables and then review what was missed to understand why it was missed. I use the DB Document Results Filter repeatedly as I am iterating through the process.

If the world was perfect we could use science alone to do this work. Filings are created by humans and thus they are not perfect so we have to bring some tenacity to this process. Features like the DB Document Results Filter reduce at least some of the frictions associated with being persistent.

Remember the Include Historical CIKs Option

I received an email from a new user at one of our client schools. They were asking about the checkbox control in the CIK filter option. This is an amazing feature. We were the first to offer CIK filtering, the first to offer CIK/date filtering and I think still the only search application that has this Historical CIK feature though we have had it for almost 8 years. It was hard to implement but I think it important. Let’s take a look at it. The check-box can be seen in the image below.

If you are running a search and are filtering on CIK or CIK/DATE this button reviews your CIK and or CIK/DATE file and looks to see if you have any CIKs that we have identified as a successor or predecessor to another filer. If so then your file is augmented with the successor or predecessor (whichever is missing).

To illustrate the value – suppose you intend to identify 8-K filings made in the 2016-2020 for your research. The firms in your sample were identified using some other data source and you have a list of their CIKs. For the purposes of this demo I have four (4) CIKs listed as you can see in the image below.

Oh-oh – I had no results at all.

If I rerun the search and this time check the Include Historical CIKs button the results will be different. Here is a screenshot after submitting the same search but selecting the Include Historical CIKs.

Notice that the application reports that 8 CIK(s) were submitted. But we only had 4 in our list. That is because of the mapping process that occurs when you start the search. The application reads your list of CIKs and checks for the existence of related CIKs. We add additional rows to the filtering list based on our mapping. In the case of CIK/Date filtering for each CIK-MATCHDATE pair you have included if the CIK has one or more related CIKs we augment the file with a new row with the additional CIK-MATCHDATE pairs.

The next question of course, how to see the mapping that was made. There are two ways to do this. If you have a large number of CIKs it might be easiest to review the CIK_MAPPING database that is accessible from the Query Databases viewer. Just select the database from the Query List and hit execute to access the entire list. Scroll through the list to find your CIKs and identify the values for RELATED~

In the next image I pointed arrows to the CIK_MAPPING database and the Execute button. I then boxed around the BASE_CIK for 912752 and 1971213 – you can see that we are asserting that one of these is the successor entity to the other.

The second way to identify the augmented list is to access the filter file that was generated by your selection of CIKs. If you click on the Set CIK File button after you have completed a search the Select Existing CIK File box will be populated with the name of the file that was actually used in the search. Even if you used your own file, the application will read it and then store a copy (augmented or not) in the D:\PhotonUser\AppData\Local\Temp\somenumber directory. Note this is a session directory not a permanent directory.

While it is a bit painful to navigate to that directory you can and open the file to examine it. One of the problems though is it does not map the relationship as you can see below. Further, you can’t predict the relationship based on order.

One of the sanity checks we perform when we make these links it we compare the last balance sheet in the last 10-K for the predecessor to the same balance sheet in the first 10-K in the successor. In this case the issues are even more complicated because CIK 912725 has a continued filing obligation because of some debit securities. Nonetheless, based on my review of all of the facts available, I believe the time series of data related to CIK 1971213 (Sinclair Inc) extends to the filings made by the entity that has the current name Sinclair Broadcast Group LLC with the CIK 912752.

Names are not always predictors of the predecessor-successor relationships. For example, CIK 1999001 Six Flags Entertainment Corporation/NEW is mapped as the successor to CIK 811532 (CEDAR FAIR L P) despite the fact that the name of the filer with the CIK 701374 is Six Flags Entertainment Corp/OLD.

The last 10-K filed by CIK 701374 for the year ended 12/31/2023 reported total assets of 2,711,474,000. The first 10-K filed by CIK 1999001 for the year ended 12/31/2024 listed comparative values for 12/31/2023 with total assets of 2,240,533,000. While these don’t match, CEDAR FAIR’s last 10-K reports total assets of 2,240,533,000 for the fiscal year ended 12/31/2023 which matches the amount reported by Six Flags Entertainment Corporation/NEW.

We created this because early on clients and potential clients would lament the fact that they were losing observations – the most common example they would cite was Oracle (777676 -> 1341439). These are often hard to identify. The best check is when we can compare the balance sheets. However, knowing that these events even happen is hard. These entities do have some reporting obligations but my estimate is that only about 60% of the filers do everything correct. The good thing is that they are relatively rare.

Last note, we don’t automatically update your mapping dictionary. You have to update it. To do so from the File menu on the application select Options. From Options select Update Historical CIK and then press the Perform Update button.

I cannot make a guarantee that we have identified all mappings. The next time I get some bandwidth I am going to see if I can use our compensation databases to identify others. My hypothesis is that if I have a filer that has executive (or director) compensation in 2023 and none in 2024 and then another that has compensation in 2024 but not 2023 and there is overlap in the PERSON-CIKs of the executives then we should research those to see if the 2024 filer is the successor. However overlap alone will not be sufficient. There are a number of executives and directors that had reporting obligations because of their roles at World Wrestling Entertainment Inc (WWE) and have similar reporting obligations with TKO Group Holdings, Inc but TKO acquired WWE, it is not a successor entity.

As an aside – the reason I selected the 2016-2020 span is that none of the successor filers had filings in that span. Both the successor and predecessors had filings in the 2021-2024 span.

Interesting Problem – new code added to directEDGAR code repository.

I was contacted by a research assistant this morning who was trying to collect some interesting data. They had a list of CIKs and with a date variable. What they wanted was to determine/identify any 8-K filed by those companies in a ten day window around the date variable. I was going to point them to our CIK/Date Filter capability (CIK_DATE_FILTER) except I noticed that their list covered the entirety of EDGAR. Further, we have not finished standardizing the metadata at the document level and they wanted the SEC FILING date as well as the acceptance date time. The first constraint meant that the post-query filtering was going to be a drag – much better to search one index at a time. The second issue meant that the inconsistent metadata would mean that some of their time would be wasted.

I ended up writing two bits of code, the first took in their original list and added additional rows and a new field. The code checked each CIK and determined whether or not the CIK was in our CIK_MAPPING database. This is where we try to archive CIK changes. The student had CIK 1532063 with an event date of 6/30/1996. The problem was that CIK 1532063 is the successor to CIK 885721. CIK 1532063 made their first filing 10/6/2011. Therefore without augmenting their CIK list to add an entry for CIK 885721 with the date 6/30/1996 the entry for 1532063 for 6/30/1996 would be missing.

The first bit of code is called augment_cik_data.py and it takes an input csv file with a column labeled ‘cik’ and ‘last_date’ and compares the CIK values to determine if they are in the CIK_MAPPING database. If so, the entry is duplicated with the value from the RELATED_CIK_1 field in the database as the value in the ‘cik’ column and the original value for ‘cik’ added to a new column named ‘BASE_CIK’. Below is a screenshot of the csv file. The entries in red text are the augmented entries.

I then used this file as the input to a second script named pull_matching_from_8K_meta.py. This reads the newly augmented csv file and then finds matching entries based on the CIK date criteria.

To run the first script – log into the platform and upload a csv file with your sample. To follow this example exactly you need a column labeled ‘cik’ and a column labeled ‘last_date’. The cik needs to be an integer and the last_date values need to be in the form YYYYMMDD. from the Catalog icon at the top of the frame select IDLE – see the image below.

When Python opens from the File menu select Open as illustrated in the next screenshot:

Navigate to the platform share and the folder PythonCode to select the augment_cik_data.py script.

When the file opens you will need to modify the input and output paths as illustrated in the next image.

You need to save the file. However, you don’t have write privileges on the share so you will want to save to the same folder as your csv file. Once you have saved it with the edited paths (and other edits you might make) use the Run Module command to run the file.

When the code finishes (very quick) there will be a completion message in Idle. Now that you have supplemented the original sample with any matched CIKs you want to use that output file as the input to pull_matching_from_8K_meta.py. Follow the same steps as above to find the file on the Share, modify the path and save it to your Temporary Files folder and then run it.

The first time I ran this it took about six minutes so I added an index for SEC_FILING_DATE and it took less than a minute.

Are 10-Qs Going to Go Away?

The Wall Street Journal just reported that President Trump is considering eliminating quarterly reporting for US issuers (WSJ Article). Wow! It is an interesting argument – will companies somehow perform better if they were not having to report earnings four times a year? What about investor uncertainty? I will try later to post links to some academic research that has established that information in 10-Qs has significant price implications. I spent much of the weekend trying to sort out how we can quickly get the new data that I expect to see when companies start filing under ASU-2024-03 but frankly I was wondering in the back of my mind if the SEC would somehow lessen the impact of that standard. I did not imagine this.

Why Redact After the Fact?

There is a joke in my household – how many trips to Lowe’s will it take Burch (dad) to fix something? It takes multiple trips because I often don’t know the full scope of what I am dealing with until I am in the middle of it. That same concept applies to the rebuild of our indexes with the new metadata – it seemed like an easy enough task until we hit issues that we were not expecting and I feel like I need to understand them.

As you can imagine we have various logging processes. One of those records details about the filings we download in to our EdgarIndex database. This is based on the RSS feed which we inspect about every 10 minutes. I was using our EdgarIndex to collect fresh copies of 2021 10-K filings and there was a logging message that this filing was not available (www.sec.gov/Archives/edgar/data/1108524/000110852421000014/0001108524-21-000014.txt) today (8/8/2025). This seemed weird because our system is pretty robust, I was pretty sure that entry was not the result of a system hallucination. I went to EDGAR and confirmed, that filing is no longer accessible. The CIK (1108524) is Salesforce.com’s

Time to do some research – I verified that the log entry was valid in the sense that there was nothing unexpected about it. I then went to check to see if we had a 10-K filing that matched the expected RDATE-CDATE-FVALUE based on the log entry. Since the log entry showed the filing was their 12/31/2021 10-K which was accepted at 4:51 PM on 3/17/2021 and we downloaded it at 5:05 PM I expected to see a 10-K in an RDATE-CDATE-FVALUE folder labeled R20210317-C20201231-F14. The 10-K was present.

However, there is no filing on EDGAR – so I was curious and went to their landing page on EDGAR and I saw instead an entry for a filing with a calendar year 2022 accession number.

We have that one as well. This is another example of why we don’t use the Filing Date as a primary indicator in our archive. That particular version was submitted on 2/24/2022 at 4:13 PM. So it turns out that we have two copies of this filing, the original one and the replacement. I pulled the feed from 2/24/2022 and confirmed that this was one of two filings (the other was a 10-Q submitted on 6/1/2020). Deleted on 2/24/2022 by Salesforce.com.

Because we have the original (3/17/2021) filing as well as the replacement submitted on 2/24/2022 I was more than curious about what changed. I compared the documents between the two filings. The only content changes were additional redactions in the NEWER version of an exhibit related to the aircraft timeshare agreement between Mr. Benioff and Salesforce.com. I was lazy I just used the DocumentExtractionTextOnly feature to pull the text from the filing and each of the exhibits. I then read in each line as a list and also created a set (to address blank lines and then compared each of the documents (from the old and new filings). I looked for lines in one that were not in the other using set operations.

The newer version had the model/type of the plane redacted as well as the home base and contact information for someone who I suspect is Mr. Benioff’s CPA/Financial Advisor.

Finally – to my point – why redact once the filing has been released? Given the length of time between the original filing and the deletion I idly searched for some of the lines that were newly redacted. I found four hits when searching Google for “certain Gulfstream GVI (G650ER)“. Here is a link to one of those four searches (Justia Hit) That content was redacted in the replacement filing. Here is the result of that search in directEDGAR in the 2021-2024 10-K index collection.

I reported above that there was a 10-Q filing that was also updated on the same date. The original 10-Q was filed on 6/1/2020. It had even more details about the aircraft that was subject to the time sharing agreement than the 10-K. The agreement in that 10-Q is also publicly available through other sources.

Given the persistence of information disseminated on the internet I just don’t understand why Salesforce.com devoted the money and especially the time to do these redactions. It is not automatic. They had to contact the SEC, there were probably two attorneys who reviewed the content and maybe 19 letters (that is an exaggeration) to the SEC to convince them there was a valid reason to pull and replace the filings. Sure, the filings are not directly available on EDGAR any longer but they were disseminated to both users who then redistribute them to other platforms and users who retain them for internal use only. And then there are folks like us who collect them almost immediately and archive them and make the content available. I just can’t figure out the benefit of the additional redactions.

For our clients, I hope this helps you understand why this transformation is taking so darn long. It is more than just re-tagging the filings. As we encounter these hurdles/unexpected issues, we have to stop and consider our options to make sure we handle these in an optimal way.

As a total aside – the tail number of the plane was available in one of the earlier (and since deleted) filings. I clearly had too much time on my hands. I was curious what a search for it would reveal. The tail number is N650HA – and it is for sale. This link was active as of 8/9/2025 (Gulfstream for sale).

New Metadata (Fields) is Starting to Appear

I stated in my last client update that we would start adding some new fields to our filings in about two weeks. Unfortunately, and as is normal, we ran into unexpected problems that needed to be addressed before we could start the process. I think all of the problems have been addressed and so this process has started. The screenshot below contains a small section from a search of the 2025 Proxy filings with the new metadata field names and content highlighted in red.

ACCEPTANCE is the EDGAR system acceptance time. COFILERS is a pipe-delimited list of all CIKs associated with the filing. FILINGDATE is the SEC determined date for the filing based on EDGAR system rules. This will generally match the RDATE – when they don’t the RDATE represents the date this version of the filing was disseminated through EDGAR. SECPATH is the path to the document that was returned in the search. This is the path to the actual document from the filing, not the landing page (accession.index.htm) for the filing except in those cases where the filing exists on EDGAR as an ACCESSION.txt file without separation of each of the constituent documents.

Most of this, except for the COFILERS was/is available by using one or another of the db. I wanted this particular information to just be immediately available.

The new metadata serves two purposes. First, to give you immediate access to data that might be relevant for your research. For example, if you are going to run an event study using tick data you might want the value for ACCEPTANCE. If you are trying to match this data with data from other sources that provide the accession number that value is a component of the SECPATH. I think this is one of the most common requests we get from clients – asking for help to match the document to an accession number so the results can more reliably be matched to something else. The other reason is to improve filtering (delivering only the results that are relevant to your search needs).

As part of this process I wanted to improve your ability to write code against the individual documents so we restructured the layout of the documents. If you are writing Python code against the archive all of the content should be a child of the body element if the document is an html file. Thus a very quick and direct way to access the body element to do more substantive work is:

# presuming you are just walking the 10-K Y2025 Archive
import glob
from lxml import html
import os
for cik_dir in glob.glob(r'S:\directEDGAR_ACCOUNTING\10KMASTER\Y2025-Y2025\FILINGS\*'):
    for rdate_dir in glob.glob(cik_dir + os.sep + 'R*'):
        for htm_f in glob.glob(rdate_dir + os.sep + '*.htm'):
            with open(htm_f, 'r', encoding = 'utf-8') as htm_fh:
                htm_text = htm_fh.read()
           htm_tree = html.fromstring(htm_text)
           body = tree.find('body')
           # now you can operate on the body and directly access the structural content, text and other features
           #  let's pretend you want the meta elements
           my_dict = {m.attrib['name']: m.attrib['content'] for m in 
                      tree.findall('.//head/meta')
                      if 'name' in m.attrib and 'content' in m.attrib
                      }
           #  we often use the cik-rdate stuff - this is our deid
           cik = cik_dir.split(os.sep)[-1]
           rdate = rdate_dir.split(os.sep)[-1]
           deid = cik + '-' + rdate
           my_dict['deid'] = deid

As I am writing this the 2025 Proxy and 10-K archives have been updated. I hope that before Monday (7/21) the 2025 10-Q and 8-K will be completed and then we will just be moving back in time. How long it will take is anyone’s guess. We will have to download the filings from 2022 to 1994. Unfortunately this will just take time. I know there are various cheats to get around the SEC rate limits but I have never felt they were worth the effort. We set a 1/2 second pause after every header/accession.txt file we download. This keeps us well within the 10 records per-second limitation that the SEC has imposed. It makes things slower but EDGAR is an amazing resource and I like having access to it.