CEO and CFO Fields Added to EC Data

I think the headline tells the story – we have added two new fields to the Executive Compensation data – CEO and CFO. Each takes the value of 1 or 0. 1 indicates that we have identified that person in that role and 0 otherwise.

We had to make some data collection decisions that we might or might not revisit later. One is that if there is a transition during the year we identify the CEO/CFO as the person in that role on the last day of the fiscal year. For example, Brian Niccol was appointed CEO of Starbucks on September 9, 2024. Their FYE was 9/30. Prior to Mr. Niccol’s appointment Rachel Ruggeri served as the interim CEO from August 12 until September 9 and with respect to the 2024 fiscal year Laxman Narasimhan was the CEO from 9/30/2023 until August 11, 2024. Our compensation data has Ms. Ruggeri’s title for 2024 as EXECUTIVE VICE PRESIDENT, CHIEF FINANCIAL OFFICER AND FORMER INTERIM CHIEF EXECUTIVE OFFICER. Mr. Narasimhan’s title for 2024 is reported as FORMER CHIEF EXECUTIVE OFFICER.

Here is a screenshot after using the Query feature to select all CFOs with a salary greater than 650,000.

As a reminder – our EC data is generally updated each week.

Comprehensive Data Collection – Chair and Lead Director with their Tenure

Data collection can be tedious – but there are ways to leverage our platform to really speed up even the most mundane task. I am working on a project with some colleagues and I was tasked with identifying the Lead Independent Director and/or Chair of the board as well as their tenure. This data needed to reflect these positions in calendar year 2024 and the sample is those companies that self-reported as Large Accelerated Filers with a 12/31 (approximate) FYE. The particular challenge for this is to identify the tenure as Lead Independent Director. I am finding that this is not directly disclosed in over 30% of the filings. There is a really easy way to address that – I describe below how to run a second search in another instance of directEDGAR to identify the tenure.

Who wants to type names as well as the critical metadata that is required? Not me – so my first step was to pull the executive and director compensation data for my sample firms. Since these were 12/31 firms I would need to pull 2024 EC and DC data. Below is a screenshot of my pull of the EC data using our Query tool. If you look closely, the CIK Filter button is blue. I have a file that has the 1,297 CIKs in my sample and that file was selected. You can also see that I have the Criteria set as YEAR=2024. I saved the results as a CSV file and then repeated the same steps to collect the DC data.

I then combined the two files. To do so I did have to account for the fact that there are some minor differences in the column headings. For example the EC data has PERSON-TITLE, SALARY and BONUS whereas the DC data does not have PERSON-TITLE, SALARY or BONUS and instead has CASH. Despite these differences it took just a few minutes to arrange the columns – I did add a new column defining the nature of the data (EC/DC). I did that because I wanted to sort these and keep the EC and DC data separate.

There is a little bit of a challenge here. The order of the search results will not match the data order unless we force it. That problem is easily solvable. What I am going to do is to run a search in the 2025 Proxy archive for (DOCTYPE contains(DEF*)) filtered on my CIK list. I am then going to use the SummaryExtraction file to sort my EC/DC data to MATCH the search result order. Here is something important – the search result order will always match this order (if there is a matching document).

After downloading the SummaryExtraction file I inserted a new column to the left of the CIK – ORDER and just numbered the results in order.

I then copied those two columns to my combined compensation file and used EXCEL’s VLOOKUP function to copy the value for ORDER into the compensation data. Here is a screenshot of some of the rows in that file after I added the ORDER value.

This took me about ten minutes but it will save me significant time as I go forward. My next step is to identify those companies that have a lead independent director so I can then identify the name of that person. Not every company will have a lead independent director – particularly those with an independent chair. So I am not going to look at the initial search results. The first search was just to define the order I want to use for my data collection. I am going to run a new search. I am still limiting the results by my CIK list but now I am searching for (DOCTYPE contains(DEF*)) and (lead director or lead independent director) andany since. Here is a screenshot of the result of that search.

I want to explain this search. It only returns documents that

  1. Are DEF 14A
  2. Have either
    1. The phrase Lead Director, or
    2. The phrase Lead Independent Director
  3. If the word Since is present in a document that meets both 1 & 2 above then all instances of the word Since will be highlighted as well.

If you compare these results with the image above from the search of (DOCTYPE contains (DEF*)) you will notice we have fewer results. That just means that those companies not listed in this search do not have either phrase we were searching for. That is fine – I am focused on identifying whether or not there is a person with the title LEAD DIRECTOR or LEAD INDEPENDENT director and I want their tenure.

My goal at this stage is to identify the Lead Independent Director and their tenure in that position. I am clearly going to be lucky because there will be cases where the chair is also identified – if you scan that you can see that Ellen Gordon is the chair and has been since 2015. That is a lucky accident for me. Since I included both EC and DC data I will go ahead and record that information.

The filing expressly reports that they (Tootsie Roll) do not have a lead independent director – so I very quickly completed my data collection for Tootsie Roll.

In a world optimized for my data collection needs the disclosures would all fall into a pattern like this one for BOSTON BEER – very clear indication that Mr. Nemeth is the Lead Director and has served in that position since May 2024.

As many of you know, the world is not optimized for our data collection. Sometimes we need a boost. Below is a screenshot of the location where I identified the name of the Lead Independent Director in Seacoast Banking Corp of Florida’s proxy.

After careful review I could not find the year Mr. Fogal was first elected to that position. So here is an important trick. I suspected that the company has probably treated that information as a boilerplate disclosure. So I started another instance of the directEDGAR application in my Appstream session and used the clues in the disclosure above to search for prior disclosures to see if I could identify when he started as Lead Independent Director. In the next screenshot you can see I have selected proxy filings back to 2005.

Based on the disclosure above the search I ran was elect* w/5 fogal w/10 lead independent. No need to filter on DOCTYPE or do anything more complicated. Here is a screenshot of that search and the evidence I need to record 2018 as the beginning of his tenure as Lead Independent Director.

You can see that my original search is partially visible in the background. I am using the second instance to help identify the tenure once I have identified the Lead Independent Director if their start date is not mentioned in the filing. This keeps me focused and moving forward without having to agonize over how to collect this data. Names are fairly unique, especially when we limit on phrases like Lead Independent Director. I did not run into much noise when I constructed these secondary searches.

In the Weeds – Another Interesting Example Why We Use the RDATE as our Proxy for the Filing Date

We are going to reconstruct our filing archive. I was testing the code that will process the accession.txt files today on 2024 proxy filings. Rather than pull new filings I was working with the 2024 files we pulled during our production run in 2024. It is a bit tedious but necessary to identify all of the issues that we had not thought of when we are working on the code.

Here is the link to the index page for a DEF 14A filing made by Biora Therapeutics – notice that the filing date is reported at 2024-09-16. If you look carefully at the index page you will not see the date 9/20/2024. However, ultimately the filing will end up in the directEDGAR archive with a DEID of 1580063-R20240920-C20241009-F36. The reason for that is kind of interesting if you are an SEC wonk.

Here is a little extract from the accession.txt file that was in our archive and originally pulled on 3:45 PM (CDT) on 9/16:

</SEC-HEADER>
<DOCUMENT>
<TYPE>DEF 14A
<SEQUENCE>1
<FILENAME>d48307ddef14a.htm
<DESCRIPTION>DEF 14A
<TEXT>
</TEXT>
</DOCUMENT>
<DOCUMENT>

Notice that there is no content between the SGML open and close TEXT tags despite other clues present that there should have been an HTML file with the filename d48307ddef14a.htm between those tags. Based on these clues I concluded that either the EDGAR processor ripped out the original filing during the submission process or the filer somehow failed to map in the HTML file when constructing the accession.txt file.

If we look at the associated header file from EDGAR – the current version differs from the one we originally pulled. I have pasted the first few lines from each of the header files below:

The date next to the header ‘declaration’ differs between the two filings. However, what I find really interesting is that when this filing was originally submitted a DATE-OF-FILING-DATE-CHANGE tag was added with the date of the original submission. This tag is not found in every filing. Per the EDGAR filer manual it is supposed to indicate the “Date when the last Post Acceptance occurred.” It is an optional tag and we generally see it when the SEC-HEADER associated date does not equal the filing date. I should clarify though – we see it much more frequently than expected. We see it even when the complete filing does not appear to be changed and the date next to the header tag matches the FILING-DATE. The existence of that tag is not a clear signal that something might be amiss.

Chasing these kinds of things is probably a waste of time but I am hugely curious about all things EDGAR so I decided to pull the feed archive for 9/16/2024. I drilled through it and found the original filing and it matched the one we pulled on 9/16/2024 – specifically the HTML content was missing. Based on my experience with these I then pulled the feed archive for 9/20/2024 and found another version of the accession.txt file (saved as an accession.nc file) and it matched the current version that is available from the index page.

What can we conclude – I do not believe the DEF 14A was available to the public on 9/16. If you were running an event study and this filing was in your sample you would be introducing error by using the Filing Date that is reported on the landing page, in the EDGAR indexes and in the filing itself.

What I find particularly interesting is that it is not clear to me that we can rely on the file dates as reported in the EDGAR archive for the filings. Here is a screenshot of the accession archive directory for this filing.

Based on everything above, I am suspicious that the LAST MODIFIED date for this filing is correct. The htm file was not included in the dissemination feed on 9/16 but was present on 9/20. Thus, I suspect that the filing was just not available until 9/20. I find it interesting that the EDGAR code altered the LAST MODIFIED date to match the acceptance date-time.

I did warn you that this was a bit wonky. More on the archive rebuild later.

Record Dates as Reported in the Proxy

I had a client reach out recently and asked for some help developing a strategy to collect the record dates for annual meeting. They were imagining that they were going to have to run a search for the phrase “Record Date” and then spend significant time looking at the results to transfer the date to their data collection worksheet. This client had no significant experience with Python. I wanted to demonstrate how careful use of AI tools could really accelerate this work. I also wanted to help them dip their toes into using Python.

We ran a search on the 2025 proxy archive for ((DOCTYPE contains(DEF)) or (DOCTYPE contains(PRE)) ) and (date(January 1 2025 to december 31 2025) w/10 (record date) ) When we did this the only proxies available were those filed through 4/11/2025 – but the record date is expected to be after the proxy filing and they needed to do this for all years back to 2014. I was helping them create a pattern that could be used again. Here is the result of that search:

I showed them how to set the span of the context extraction – for something like this a tighter span is generally better. I set the span to 3 by using the File/Options selection from the application menu.

We then used the Extraction\ContextExtraction feature to save to their Temporary Files folder on their instance. Here is a screenshot of the file contents centered on the disclosure from Roblox’s proxy as seen in the image above.

We then brainstormed with ChatGPT – I created a small sample of the file and – remember this is taking place in a conversation – I was not trying to be formal but wanted the client to see how simple the process could be.

Here is a sample CSV file that was created pulling the relevant context from a search using the directEDGAR platform.  (note I passed a sample of the csv file)

The CONTEXT column is the text from the document. 
The FILENAME defines the document.  
Notice that one document can have multiple context outputs and they might or might not overlap.  
I want to use Python to analyze the context to determine and have included in a separate column the date, in this example the date is Month, dd, YYYY but there can be variations - I have seen m/d/yyyy and sometimes the comma is missing after the date in the Month dd, YYYY format.  
If there is an explicit reference to either record or record date in a particular single context extraction I want that noted, if there is not an explicit reference to a record date I want that noted.  My actual sample is more than 3,000 rows so I thought it best to offer just a small sample

Note - my preference is to use the csv module to read and write

We looked at the initial results and ChatGPT pulled the dates but the formats varied so we asked:

Lets normalize the dates in m/d/y format

Here is the resulting code:

import csv
import re
from datetime import datetime

def extract_date(text):
date_patterns = [
r’\b(?:January|February|March|April|May|June|July|August|September|October|November|December)\s+\d{1,2},?\s+\d{4}’,
r’\b\d{1,2}/\d{1,2}/\d{4}’
]
for pattern in date_patterns:
match = re.search(pattern, text, re.IGNORECASE)
if match:
return match.group(0)
return “”

def normalize_date(date_str):
date_formats = [
“%B %d, %Y”, # March 27, 2025
“%B %d %Y”, # March 27 2025
“%m/%d/%Y”,
]
for fmt in date_formats:
try:
parsed_date = datetime.strptime(date_str.strip(), fmt)
return parsed_date.strftime(“%-m/%-d/%Y”) # Unix-style no leading 0
except ValueError:
continue
return date_str

def has_record_mention(text):
keywords = [‘record date’, ‘record’]
text_lower = text.lower()
return any(keyword in text_lower for keyword in keywords)

def process_file(input_path, output_path):
with open(input_path, mode=’r’, newline=”, encoding=’utf-8′) as infile, \
open(output_path, mode=’w’, newline=”, encoding=’utf-8′) as outfile:

reader = csv.DictReader(infile)
fieldnames = reader.fieldnames + [‘EXTRACTED_DATE’, ‘RECORD_DATE_MENTIONED’]
writer = csv.DictWriter(outfile, fieldnames=fieldnames)
writer.writeheader()

for row in reader:
context = row.get(‘CONTEXT’, ”)
date_found = extract_date(context)
row[‘EXTRACTED_DATE’] = normalize_date(date_found) if date_found else “”
row[‘RECORD_DATE_MENTIONED’] = “Yes” if has_record_mention(context) else “No”
writer.writerow(row)

input_csv = “record_dates_chat_gpt_sample.csv” # Change to your input file
output_csv = “record_dates_processed_normalized.csv” # Desired output file
process_file(input_csv, output_csv)
print(f”Processing complete. Output saved to ‘{output_csv}'”)

Here is the result of running their input file through that code:

We are not done yet – the client astutely noticed that this was not the whole collection of PROXY filings. So we needed to see what we missed. I ran a new search with them xfirstword and not (date(January 1 2025 to December 31 2025) w/10 record date) and ( (DOCTYPE contains(def)) or (DOCTYPE contains(PRE)) ). This identified the filings that did not match the criteria. We used the infamous CTRL+F to iterate through the documents and we learned two things. First there is additional variety – we saw language such as stockholders as of record on someDate. But for the most part the record date was not set so we saw many of these examples.

Our client had to get to another meeting but they did report that they were confident about the next steps. Actually, the were pretty ecstatic – we collected dates from 2,500 filings in 30 minutes and that included a little bit of wrangling with ChatGPT.

Unexpected Benefit of Newly Structured Compensation Data

I was reading a recent paper in Accounting Horizons that showed up in my Google Scholar feed with a directEDGAR citation. Professors Bright and Kim from Lehigh University and Professor Koo from Kyungpook National University collaborated on CEO outside Board Service and Managerial Short-Termism. They report finding that suggest CEO’s limited outside board service is associated with the CEO seeming to be more willing to focus on a longer performance horizon. Congratulations to all for having their research recognized and published. I have reached out to Professors Bright and Kim they earned some ice-cream from eCreamery and the shirt off my back!

As I was reading the paper I noted that they needed to identify CEOs with outside directorships. They used BoardEx to identify their sample. I read further that they used our platform to collect stock compensation for their CEOs. This got me wondering – since we use the PERSON-CIK as a key identifier when processing compensation data could I create a sample to identify all executives with outside directorships using directEDGAR? I think I can.

Here is one pass on the code to do so – the results are saved to a CSV file. Because I agonize over every little detail there are too many lines in the output file. I decided to match if the officer served as an outside director in the year before, the year of, or the year after the compensation year.

import sqlite3
import csv


# Paths to the databases
exec_db_path = "S:/directEDGAR_DATA/EXECUTIVE_COMP.db"
dir_db_path = "S:/directEDGAR_DATA/DIRECTOR_COMP.db"

# Establish database connections
with sqlite3.connect(exec_db_path) as exec_conn, sqlite3.connect(dir_db_path) as dir_conn:
    # Cursor to retrieve distinct years in EXECUTIVE_COMP table
    exec_cursor = exec_conn.cursor()
    exec_cursor.execute("SELECT DISTINCT YEAR FROM EXECUTIVE_COMP")
    years = exec_cursor.fetchall()  # List of years in EXECUTIVE_COMP

    output = []  # Collect results here

    for year_tuple in years:
        year = year_tuple[0] # Extract year value

        # Fetch all executives for the given year, filtering for numeric PERSON_CIK
        exec_cursor.execute("""
            SELECT CIK, YEAR, PERSON_TITLE, PERSON_NAME, PERSON_CIK 
            FROM EXECUTIVE_COMP
            WHERE YEAR = ? AND PERSON_CIK NOT LIKE '%AVAI%' 
              AND CAST(PERSON_CIK AS INTEGER) IS NOT NULL
        """, (year,))
        exec_records = exec_cursor.fetchall()


        # Fetch all director records for YEAR-1, YEAR, and YEAR+1, with numeric PERSON_CIK
        for yr in [year - 1, year, year + 1]:   #this checks the year before and after
            dir_cursor = dir_conn.cursor()
            dir_cursor.execute("""
                SELECT CIK AS CIK_DIRECTORSHIP, YEAR AS DIRECTORSHIP_YEAR, PERSON_CIK, AGE, SINCE 
                FROM DIRECTORCOMP
                WHERE YEAR = ? AND PERSON_CIK NOT LIKE '%AVA%' 
                  AND CAST(PERSON_CIK AS INTEGER) IS NOT NULL
            """, (yr,))
            dir_records = dir_cursor.fetchall()

            # Build a dictionary for fast lookup of directors by PERSON_CIK
            dir_dict = {int(record[2]): record for record in dir_records if record[2].isdigit()}

            # Match executives to directors based on PERSON_CIK
            for exec_record in exec_records:
                exec_cik, exec_year, person_title, person_name, person_cik = exec_record
                if person_cik.isdigit():  # Only process if PERSON_CIK is numeric
                    person_cik = int(person_cik)  # Convert to integer for matching
                    # Check if the executive PERSON_CIK matches a director entry
                    if person_cik in dir_dict:
                        dir_cik, dir_year, _, dir_age, dir_since = dir_dict[person_cik]
                        if dir_cik == exec_cik:
                            continue
                        output.append((exec_cik, exec_year, person_title, person_name, person_cik, dir_cik, dir_year,
                                       dir_age, dir_since))
            print(year, len(output))

# Display or save the output as needed
print(len(output))

# Specify the output CSV file path


# Column headers for the CSV
headers = ["CIK", "YEAR", "PERSON_TITLE", "PERSON_NAME", "PERSON_CIK", "CIK_DIRECTORSHIP", "DIRECTORSHIP_YEAR",
           "DIRECTOR_AGE", "DIRECTOR_SINCE"]

# Write the output list to a CSV file
output_csv_path = r"D:\PhotonUser\My Files\Temporary Files\MATCHED_EXECS.csv"
with open(output_csv_path, mode="w", newline="") as file:
    writer = csv.writer(file)
    writer.writerow(headers)  # Write the headers
    writer.writerows(output)  # Write the data rows

print(f"Results have been saved to {output_csv_path}")

Tim Cook has been an outside director at Nike (CIK: 320187) since 2005. Here is a cut of the output file (that has more than 100,000 rows) with some of Mr. Cooks service reported:

CIKYEARPERSON_TITLEPERSON_NAMEPERSON_CIKCIK_DIRECTORSHIPDIRECTORSHIP_YEARDIRECTOR_AGEDIRECTOR_SINCE
3201932011CHIEF EXECUTIVE OFFICERTIMOTHY D. COOK12141563201872011502005
3201932012CHIEF EXECUTIVE OFFICERTIMOTHY COOK12141563201872012512005
3201932013CHIEF EXECUTIVE OFFICERTIMOTHY COOK12141563201872013522005
3201932014CHIEF EXECUTIVE OFFICERTIM COOK12141563201872014532005
3201932015CHIEF EXECUTIVE OFFICERTIM COOK12141563201872015542005
3201932016CHIEF EXECUTIVE OFFICERTIM COOK12141563201872016552005
3201932017 CHIEF EXECUTIVE OFFICERTIM COOK12141563201872017562005
3201932018CHIEF EXECUTIVE OFFICERTIM COOK12141563201872018572005
3201932019CHIEF EXECUTIVE OFFICERTIM COOK12141563201872019582005
3201932020CHIEF EXECUTIVE OFFICERTIM COOK12141563201872020592005
3201932021CHIEF EXECUTIVE OFFICERTIM COOK12141563201872021602005
3201932022CHIEF EXECUTIVE OFFICERTIM COOK12141563201872022612005
3201932023CHIEF EXECUTIVE OFFICERTIM COOK12141563201872023622005

Mr. Cook’s PERSON-CIK is 1214156 – so by checking for his CIK in the DC data by year we have him in our output.

Of course – we are not perfect – in the Bright, Kim and Koo paper they focused on CEOs. We have yet to normalize the titles of the officers. Now that we finally have a complete pipeline to make the compensation data more easily available I think I am going to have one of our interns start working on an officer title mapping dictionary. This should not be challenging, especially when considering key positions like the Chief Executive Officer and the Chief Financial Officer.

As I was writing the last paragraph I started wondering – well suppose I wanted that now – how hard would it be to start cleaning that up. Excel has a great feature – remove duplicates and so I copied the titles to a new sheet and ran that function – out of the more than 106,000 total matches there were only 7,471 unique titles

In this reduced set there were 1,521 titles with the phrase CHIEF EXECUTIVE OFFICER. Of course it is not that simple – if you look at the next screenshot you will see that title is part of the title of an officer who is the CEO of a business unit or subsidiary of the reporting company.

Based on this quick review I am convinced of two things. First, if I wanted to recreate their sample and test another hypothesis I feel confident I could do so starting with that matching code from above. The one thing I have not yet accounted for is a merge with say Compustat. Limiting my sample to those firms covered in Compustat would further reduce the number of titles I would have to analyze. I suspect 15-20% of the executive we have from the first pass would not match to companies available in Compustat. Further, others would be excluded once we started controlling for the usual (for example they excluded all financial firms). Frankly, I think if I wanted to do this for a project today I am confident I could identify all of the CEOs from this file in relatively short order – perhaps a couple of hours. And while this may seem daunting – I began this journey by collecting acquisition purchase price allocations from micro-fiche 10-Ks and annual reports. This is nothing!

A final thought as I wrap this up – so initially it might seem a bit tedious to identify the CEOs by analyzing the titles. I was sharing this discovery/directEDGAR extension with one of my colleagues and she reminded me that she was a co-author on another paper 6/7 years ago. They also needed to match data from BoardEx to our compensation data. She described the process I helped her set-up to match by names across the two samples. That process took much more time and was hugely more tedious then it would take to normalize these titles.

Some Vindication on N-PX Analysis

On 10/26 I posted about the challenges associated with analyzing the data quality of N-PX filings (N-PX Update). Today I was making a plan for the weekend, which included working on the code we are using to process the N-PX filings and saw that SRH Total Return Fund (CIK: 102426) filed an amended N-PX (N-PX/A) filing on 10/28. Of course I was curious so I opened it and discovered it was a restatement of their 8/30 filing (this filing was the subject of the prior post).

Sure enough – it appears that we made the right decision to deflate each of the reported values by 1,000,000 as a quick comparison of the data in the two filings suggests that the values in the amended filing match the values after scaling.

Because this filing is a restatement – when this N-PX/A is processed we will actually remove from the database all of the rows from the prior filing and then insert the rows from this new filing. In this case the values for sharesVoted and sharesVoted_ORG will match. Until that happens the sharesVoted and sharesVoted_ORG will differ by a factor of 1,000,000.

I would like to claim that my earlier post caused them to review their filing but I have no way of knowing that.

More Explanation of N-PX Issues

After collecting all of the data from the N-PX filings that were available in the new form we did some evaluation of the data quality. I wrote an earlier post about some of the cases – but I want to expand on that because of some new findings.

One of the problems we have when we collect data is the challenge associated with evaluating the quality of the data. Our first goal is always answering the question – did we capture what was reported? This is not difficult to answer affirmatively with the new xml format of the N-PX filing. An equally important question is to try to evaluate for our client’s benefit – do we think the data is reasonably accurate? That takes some imagination and we do not always expect to be right. It is easy enough to identify data at the extremes in magnitude – tougher to establish whether or not data that is not seemingly extreme is accurate.

We have identified reported holdings and votes that don’t make sense. For some of the filers we have been able to reasonably (we think) decide to deflate all of the reported values by 1,000,000 shares. For example the N-PX filing made by SRH Total Return Fund (CIK: 102426) reports that they voted 1,028,000,000 shares on issues relating to Berkshire Hathaway.

Warren Buffett is almost my neighbor so I knew that number was suspicious. We pulled up their (SRH’s) NPORT-P filing made on 1/29/24 to check the number of BRK-A and BRK-B shares they reported holding as of 2/29/2024. According to that filing they held 1,028 shares of the Class A stock and 345,000 shares of the Class B.

We matched across another couple of holdings and based on that review we are reasonably confident that they inflated each of the numeric share counts in their filing by 1,000,000. Thus, when you access the database and if you pull those holdings you will see that we report 1,028 for shares voted (sharesVoted) and 1,028,000,000 for shares voted original (sharesVoted_ORG).

Unfortunately there are some cases where we cannot reasonably infer the value that should be reported. An example of this situation can be found by reviewing this filing made by Kodai Capital Management LP (CIK: 1999612). Here is a screenshot of the first three rows from the filing.

They are reporting holding more than 600,000,000 in each of 3M, Adobe and Amazon. This is a huge concentration of wealth. However, when we look at their 13-F holding reports prior to the listed meeting dates – there is no reasonable deflator we can use to systematically get back to a value close to what is reported in the 13-F filing.

We also looked at the VOTING AUTHORITY columns from the 13-F and they don’t really provide much help, I could guess that we could argue to deflate the values from the N-PX by 2,000,000 based on the values reported in the SOLE VOTING AUTHORITY column but that seems much harder to defend then the more clear-cut case from the SRH filing.

I am reluctant to toss the data (what do I know) and I am also reluctant to leave the data in the original form. Therefore, I made a decision to report None for sharesVoted, sharesOnLoan and voteRecord_sharesVoted. However, we will leave unchanged the as-reported values for sharesVoted_ORG, sharesOnLoan_ORG and voteRecord_sharesVoted_ORG.

Issues with the Updated N-PX Filing: Management Recommendation and a Peculiar Decimal Error

Today, we spent some time reviewing the new N-PX filings and uncovered a few complexities, particularly around the “voterecord_managementRecommendation” column. Initially, I misunderstood what this variable represented (we’ve since updated the N-PX field guide accordingly). As it turns out, it’s more nuanced than before.

An important change in this form relative to the prior versions of the FORM N-PX is that this form does not require the filer to directly report the recommendation of management on the voting issue. Instead of simply reflecting management’s position, the managementRecommendation actually indicates how the filer’s vote compares to the management’s recommendation. For example, if the filer voted AGAINST a shareholder resolution that management also recommended voting AGAINST, the managementRecommendation field would report FOR.

To clarify this, here’s a simple matrix:

voteRecord_howVotedmanagementRecommendationManagement Position
FORAGAINSTAGAINST
AGAINSTAGAINSTFOR
FORFORFOR
AGAINSTFORAGAINST

It is a bit convoluted. Reviewing the first line in the table above – the filer voted FOR, which was AGAINST management’s recommendation – therefore the proxy should report that management recommended a vote AGAINST. But the second row demonstrates that the filer cast shares AGAINST the proposal and this was AGAINST management’s recommendation so the proxy should report that the management position was FOR the proposal. The last two rows describe the cases where the filer’s votes matched management’s position (recommendation) as reported in the proxy. The first two rows describe cases where the filer voted in opposition to management.

On to a New Issue: Distorted Vote Counts

While investigating the managementRecommendation variable, we encountered another oddity—absurdly large vote counts. Here’s an example from Amazon’s say-on-pay vote from a few institutional investors’ N-PX filings:

We believe we have identified a pattern across roughly 25,000 individual cases from 153 N-PX filings. The issue seems to stem from a decimal error made by the Filer. We confirmed that this was likely an error by comparing the reported holdings in 13-F and N-PORT filings. We believe that it might be possible to develop a check for this issue and adjust the sharesVoted and voteRecord_sharesVoted values. We have additional testing to do to confirm that our check and fix will perform as expected.

After correcting for this error error, the data relating to Amazon showed 956,903,357 votes AGAINST the say-on-pay proposal by institutional managers. This seems reasonable given that Amazon’s 5.07 8-K filing from their annual meeting reports almost 1.7 billion votes cast against the Say-on-Pay proposal.

Before this discovery, the sum of votes exceeded the total number of shares cast by a factor greater than 10, which understandably raised some eyebrows.

A Cautionary Note for N-PX Data Users

This discovery underscores that the N-PX data, as received from the SEC, may not always be reliable for vote counts. If you’re working with this data, be sure to sanity-check the share numbers to avoid dealing with unrealistic or erroneous figures. Frankly it makes us a bit nervous to make changes in the data because there is always the chance we will introduce an error. Our first and most important goal is to confirm that we collected the data as-reported. Once we have done that it is then possible to investigate for anomalies. The problem with that is we have to switch to heuristics in our coding and analysis. We try to separate coding done deterministically from coding done heuristically.

12,522,509 Institutional Vote Actions Extracted into Database

A client recently contacted me about making data from form N-PX more available. Form N-PX is used by institutional managers to report how they voted their shares in proposals put to shareholders. These reports have been filed/mandated by the SEC since 2004. Until this year they have been filed as txt or htm documents with significant variation in the structure of the document. We have periodically looked at them but since there is always something to do we have not invested significant resources into normalizing these filings.

This year the SEC required these filings to be filed using XML to report the voting records for each security and so it became much easier to parse them. In addition to revamping the form of the disclosure the SEC significantly increased the number of filers that are now required to report at least their vote on SAY-ON-PAY issues. Because of the increase in reporting obligations there were almost 9,500 N-PX filings filed by the deadline (8/30) this year as compared to roughly 4,000 per-year in prior years.

If you want to jump to see the Field Guide for this database – use this link to take you to the Knowledge Base (KB LINK)

I have parsed these and loaded them into a new database (npxdata) that is now available on the platform. Remember, I take constructive criticism well. I decided to omit all reporting of securities that do not have a CUSIP. I did this because the database is large and I am not certain of the value of having a compilation of the reports of non-US entities when there is not an underlying CUSIP available. I made the point about constructive criticism because we can make the data from these other securities available – I will wait for a prompt and discussion to make sure this will be valuable. The data has been parsed already to putting on the platform will not take significant effort. If there is enough interest we may just make both available but separate based on the available identifiers.

Here is a screenshot of the results of a search for reported votes against the Say-On-Pay proposals.

Forgive the repetition here – I am having too much fun exploring this. Here are the results of a search to identify all proposals related to environmentOrClimate:

The initial release of this data relies on the CUSIP. I need to take another step to merge with the CIK. That might take a bit of time and since CUSIP is available in other databases you might merge with I decided to do this initial release without the CIK.

Some of the way the data is displayed/delivered is not very efficient. Here is a screenshot of the original data that CAPTRUST FINANCIAL ADVISORS reported as it is displayed in the rendered html table on EDGAR. This data relates to their votes on the Advisory vote to approve executive compensation at the 2024 annual meeting for The 3M Company.

There are 26 entries reporting votes as they relate to different tranches of shares but for the life of me I cannot sort out why they were reported in this manner. The total number of votes against is 68,847 – so this matches the number of shares that they report as having voted – but I can’t understand why it was useful in their mind to report those individually in this manner. At the present time each of those individual votes have been preserved in our database as a separate line. We will include all of the other metadata about this vote for each line. Once I am confident we are not losing any useful data it is likely that we will consolidate all of these into one row. (I will observe that there are no included managers associated with this filing and their is no series information). It would be interesting to understand the process by which these votes were case – did they in-fact fill out 27 individual proxy cards because these shares are held in 27 individual accounts?

One really nice feature of having this data is that it is now trivial to identify specific proposals and the nature of their sponsorship. The SEC had curated a list of vote categories and these are a required disclosure for each item that is reported. Here is the current list (there are 14 items).

N-PX filers are required to assign at least one and as many as 14 codes to each vote that they report. Yes, there are some cases where 14 codes were assigned to one vote issue. From my perspective – this database can be used to identify those filers that had a voting issue that you might want to explore. This makes it easier to identify for example all voting issues that related to ENVIRONMENT OR CLIMATE. What we did was to remove the spaces and converted the text to camel-case. DIRECTOR ELECTIONS is now directorElections. We did this so we could convert these to 1/Null values. The screenshot above that has the results

The SEC mandated that the N-PX filer report on the sponsor of the proposals. The field in the N-PX filing is voteSource – we preserved that field name. The valid values are ISSUER and SECURITY HOLDER. Thus any proposal that has SECURITY HOLDER as a voteSource was sponsored/initiated by a share holder.

Filers are also supposed to report on the number of shares they have control over that are on-loan. My phrase construction may leave a bit of confusion because control is a bit nebulous term in this context. Nonetheless there is a field for SHARES ON LOAN (sharesOnLoan in the db). Institutional investors may lend shares and also direct the votes for those shares so those cases do exist. Shares may be on loan but if the investor directed the voting then those votes will be recorded.

One challenge for our initial users is going to be to sort out how to aggregate these by proposal (voteDescription in the database). The SEC’s language is fairly explicit that the filer is to “If a form of proxy in connection with a matter is subject to rule 14a-4 under the Exchange Act [17 CFR 240.14a-4], the description in Item 1(f) must: (i) use the same language that is on the form of proxy to identify the matter; . . .”

Here is the exact language of a proposal in Apple’s proxy:

I pulled all of the details of votes relating to proposals detailed in Apple’s proxy for their annual meeting that took place on 2/28/2024. My expectation was to find the language EEO Policy Risk Report. However I found the following variations of that language

  • A shareholder proposal entitled “EEO Policy Risk Report”
  • A shareholder proposal entitled  EEO Policy Risk Report .
  • A shareholder proposal entitled ”EEO Policy Risk Report”
  • A shareholder proposal entitled EEO Policy Risk Report .
  • Report on Risks of Omitting Viewpoint and Ideological Diversity from EEO Policy
  • A SHAREHOLDER PROPOSAL ENTITLED EEO POLICY RISK REPORT&QUOT. &QUOT

Clearly we can clean a lot of this up and we will. Now that this data is in a database we can view it and start developing rules to help normalize these labels. But that will take some time, thought and extensive testing. However, the paraphrase/translation used by the investor as displayed in the fifth row (highlighted in red) will be challenging to normalize when done at scale. We could not normalize these until we could see them – so that step will happen – I am going to presume that it is more valuable for you to have this data now rather than wait for the arduous process of confirming the editing that we need to do.

Here is a code sample that you can use to pull all rows from the database where section12ASayOnPayVotes = 1 and voteRecord_howVoted = ‘Against’

import sqlite3
import csv

# Define the database path and output CSV file
DB_PATH = r"S:\directEDGAR_DATA\npxdata.db"
output_csv = r"D:\PhotonUser\My Files\Temporary Files\filtered_data.csv"


try:
    # Connect to the SQLite database
    conn = sqlite3.connect(DB_PATH)
    cursor = conn.cursor()

    # Execute the query to select rows with the specified conditions
    query = """
    SELECT * FROM npxdata 
    WHERE section14ASayOnPayVotes = 1 
    AND voteRecord_howVoted = 'AGAINST'
    """
    cursor.execute(query)

    # Fetch all matching rows
    rows = cursor.fetchall()

    # Get the column names from the cursor description
    column_names = [description[0] for description in cursor.description]

    # Open the CSV file for writing
    with open(output_csv, mode='w', newline='', encoding='utf-8') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=column_names)
        writer.writeheader()

        for row in rows:
            try:
                # Create a dictionary mapping column names to row values
                row_dict = dict(zip(column_names, row))
                # Write the row to the CSV file
                writer.writerow(row_dict)
            except Exception as e:
                print(f"Error writing row: {row_dict} - {e}")

    print(f"Data successfully written to {output_csv}")

except sqlite3.Error as e:
    print(f"Database error: {e}")
except Exception as e:
    print(f"Error: {e}")
finally:
    # Close the database connection
    if conn:
        conn.close()

For your information – I was in a hurry so I asked ChatGPT for the code above. The prompt I provided was:

You said:
1. I have an sqlite database located at "S:\directEDGAR_DATA\npxdata.db" 
2. There is a table in the database named  npxdata 
3. There is a field named section14ASayOnPayVotes that has a value of 1 or Null 
4. There is another field named voteRecord_howVoted that has a value of FOR or AGAINST.
5. I want some python code to select all rows from the database where section14ASayOnPayVotes= 1 and voteRecord_howVoted = AGAINST - 
6. I want to save them to a csv file named r"D:\PhotonUser\My Files\Temporary Files\filtered_data.csv"
7. I want to use Pythons CSV library - not pandas 
8. I do not know the names of the fields so I want the code to find the columns in-order from the database 
9. I want the rows to be written using the DictWriter method.  
10. I want to write each line individually have some graceful error handling in case there is a problem writing any particular row.

The new form makes access to this data more efficient and enhances the nature of the disclosures. We hope pushing it into the database helps accelerate your research. Our next step is to add the CIK and then we will work on normalizing the voteDescription. Look in the Knowledge Base for a description of all of the fields.

XBRL Data – Interesting Assumption in the Research

For a number of reasons I have been reading some of the academic research that has used XBRL data. There is a fascinating underlying assumption – this is not a quote but the sense of this following statement pervades a fair number of these papers. Before the introduction of tagged financial data, users were constrained to either wait until the commercial data services normalized the data and added into their data bases or they were stuck consuming 10-K filings singly.

I don’t think that is true at all. It took me about 10 minutes to write 28 lines of Python code to parse the income statement for Apple from all years from 2005 – 2023. Here is a screenshot of one of those tables:

Here is the code – it relies on a directEDGAR generated summary file, I searched for the 10-K filings of Apple and saved the summary file to get the paths.

from lxml import html, etree
import csv

xpath_expr = (
    "//table["
    "contains(translate(., 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'basic') and "
    "contains(translate(., 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'diluted') and "
    "contains(translate(., 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'operating') and "
    "contains(translate(., 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'research') and "
    "contains(translate(., 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'years') and "
    "not(contains(translate(., 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'quarter') or "
    "contains(translate(., 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'months') or "
    "contains(translate(., 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'reported'))"
    "]"
)

with open(r"H:\apple\summary.csv") as csv_fh:
    csvReader = csv.DictReader(csv_fh)
    for row in csvReader:
        file_path = row['FILENAME']
        cik, rdate = file_path.split('\\')[5:7]
        with open(file_path,'r', encoding="utf-8") as doc_fh:
            filing = doc_fh.read()
        apple_tree = html.fromstring(filing)
        tables = apple_tree.xpath(xpath_expr)
        for indx, table in enumerate(tables):
            new_html = etree.Element("html")
            body = etree.SubElement(new_html, "body")
            body.append(table)
            new_html_str = etree.tostring(new_html, pretty_print=True, method="html", encoding="unicode")
            new_html_str = new_html_str.replace('\xa0', ' ')
            new_name = '-'.join([cik,rdate,str(indx + 1)]) + '.htm'
            with open('H:\\apple\\newtables\\' + new_name, 'w', encoding='utf-8') as out_fh:
                out_fh.write(new_html_str)

It took me another couple of minutes to normalize the row labels – I did that with directEDGAR’s Dehydrator/Rehydrator tools.

When our Dehydrator/Rehydrator features were added to directEDGAR I had four calls from some of the largest hedge funds in the US. My sense after the second call was that they were wondering if we had something better than what they were using. Think about this, if you had one billion plus to invest in say 2007 – wouldn’t you have the team that could parse out the critical data you wanted for your trading strategy? I was pretty sure that the folks I spoke with were way ahead of where we were when I had those conversations.

My point is that I don’t think XBRL changed anything for these folks. And in my opinion, especially until the introduction of iXBRL it was a lot easier to use the html to parse the financial statements rather than using the XBRL to attempt to construct the financial statements. One reason for that is that you can test every td element in a table to determine whether or not it has a value. If something is missing from the XBRL there is no way to know that it is missing. We discovered that when attempting to build the Effective Tax Rate table from the XBRL. The table would not add up and when we inspected it turns out there was data that was just missing. The tags are nice – and they can be used to speed up normalization. To do that you take the row label from the table and go find the associated tag.

To understand what I mean by that – here is CA’s income statement that was filed in 2010.

What exactly does Product development and enhancements mean? If you have parsed the income statement from the 10-K and then collected the row labels – you can take that row label back into the associated label file and find the tag (ResearchAndDevelopmentExpenseSoftwareExcludingAcquiredInProcessCost). In this case the availability of the XBRL would have helped clean/normalize this data faster – if you had never encountered that row label before.

The ability to do that is a huge benefit. But if I were running technical operations at a hedge fund – I would not attempt to use XBRL to collect fundamental data. I would integrate the tags but I would still be parsing the html tables as my authoritative source.

As an aside – continue reading this is a bit off topic though.

I took the code above and then I ran it against all 10-K filings that were filed from 1/1/2021 to 12/31/2023. There were 25,263 10-K in our archive for that span. Don’t laugh, I only collected data from 1,565 filings. I have two really strong constraints up there – one is that the table must have the word years and the other is it must have the word research. I looked at Microsoft’s 10-K and discovered that they used the word year. I also confirmed that Meta Platforms and Google used the same word. So I changed the code require year instead of years. After that change I was able to increase the number of tables pulled to 2,335. Again, don’t laugh – keep reading.

The constraint that the income statement must have the word research is a really strong constraint as well. I did a search over those 10-K filings and identified 7,650 10-K filings that did not have the word research anywhere in the filing.

If the word research is not in the 10-K, well it can’t be in the income statement. And as academic researchers we do know that even if that word is in the filing it does not have to be in the income statement. I need a sample of ten of those and I suspect I can modify the code above to capture another large segment of the filers.

Here is why I don’t think you should be dismissive of my observations and results. The knowledge to do what I describe above is not a significant hurdle. And while I only have 2,335 tables, I spent very little time. If I were smart enough to have a trading strategy that I wanted to implement in 2007 I am very confident that I could have had the data from the 10-K (and 10-Q and 8-K) available within milliseconds after having the filing delivered. So I am just not convinced that XBRL moved the dial a huge amount.