ChatGPT & directEDGAR

I know from talking to my own colleagues in the College of Business at the University of Nebraska at Omaha that people fall along an interesting spectrum in their comfort with using different tools to collect data. In a perfect world we would just imagine the data and it would appear – I alluded to that in the intro to one of our user guides about a decade ago.

About a month ago I asked one of our team (Siva) to plan a demo of how a novice might use ChatGPT to help them – for instance, to write Python code to pull data from one of our databases. Siva gave us the demonstration this morning. My brain has been on fire since then. It was a fascinating introduction, and before the call was finished I had agreed to buy everyone on the team an upgraded account ($20.00/month) for access to ChatGPT, and we were starting to talk about the ways it could help us improve our productivity.

But back to the original use case. After learning from watching Siva, I decided to test it myself. So first, I dropped in the code I used when creating the DC database and asked ChatGPT to provide a user guide explaining the fields:

import sqlite3

conn = sqlite3.connect(r'E:\MP_TEMP\DIRECTOR_COMP_weekly.db')
cur = conn.cursor()

cur.execute("""CREATE TABLE IF NOT EXISTS directorCOMP(
              ROW_ID INTEGER PRIMARY KEY,
              CIK TEXT,
              DEID TEXT,
              RDATE TEXT,
              RID TEXT,
              NAME TEXT,
              YEAR INT,
              CASH INT,
              BONUS INT,
              NONEQUITY INT,
              NQDEFCOMP INT,
              STOCK INT,
              OPTION INT,
              OTHER INT,
              TOTAL INT,
              SEC_NAME TEXT,
              PERSON_CIK TEXT,
              GENDER TEXT,
              SINCE INT,
              AGE INT
              );
              """)

After a few iterations – and let me say, ChatGPT knew that CIK was Central Index Key from the get-go, which impressed me – we had the following:

User Guide for the Director Compensation Database

Introduction: This database contains information about director compensation for various companies. The database table is named “directorCOMP”. The following is a guide to explain the fields in this database.

  1. “ROW_ID” – This field contains a unique identifier for each record in the table and is defined as an “INTEGER” type. The “PRIMARY KEY” constraint ensures that no two records can have the same value in this field.
  2. “CIK” – This field contains the Central Index Key (CIK) for a company and is defined as a “TEXT” type.
  3. “DEID” – This field contains a unique identifier for a director that helps to identify the source document from which the compensation information was pulled. The “DEID” field is defined as a “TEXT” type.
  4. “RDATE” – This field contains the date that compensation information was received and is defined as a “TEXT” type.
  5. “RID” – This field contains a unique identifier for compensation information and is defined as a “TEXT” type.
  6. “NAME” – This field contains the name of the director and is defined as a “TEXT” type.
  7. “YEAR” – This field contains the year for which compensation information is being reported and is defined as an “INT” type.
  8. “CASH” – This field contains the amount of cash compensation reported and is defined as an “INT” type.
  9. “BONUS” – This field contains the amount of bonus compensation reported and is defined as an “INT” type.
  10. “NONEQUITY” – This field contains the amount of non-equity incentive plan compensation reported and is defined as an “INT” type.
  11. “NQDEFCOMP” – This field contains the amount of nonqualified deferred compensation earnings reported and is defined as an “INT” type.
  12. “STOCK” – This field contains the amount of stock award compensation reported and is defined as an “INT” type.
  13. “OPTION” – This field contains the amount of option award compensation reported and is defined as an “INT” type.
  14. “OTHER” – This field contains the amount of other compensation reported and is defined as an “INT” type.
  15. “TOTAL” – This field contains the total amount of compensation reported and is defined as an “INT” type.
  16. “SEC_NAME” – This field contains the name of the SEC filing and is defined as a “TEXT” type.
  17. “PERSON_CIK” – This field contains the CIK number for the director and is defined as a “TEXT” type.
  18. “GENDER” – This field contains the gender of the director and is defined as a “TEXT” type. The allowed values for this field are “M” for male and “F” for female.
  19. “SINCE” – This field contains the first year of the director’s service with the company and is defined as an “INT” type.
  20. “AGE” – This field contains the age of the director and is defined as an “INT” type.
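With the fields documented, even a simple ad hoc query against the table becomes approachable for a novice. Here is a minimal sketch of my own (the function name and the example values are illustrative, not part of the platform):

```python
import sqlite3

def total_comp_by_year(db_path, cik):
    """Return (NAME, YEAR, TOTAL) rows for one company's directors, newest year first."""
    conn = sqlite3.connect(db_path)
    try:
        cur = conn.execute(
            "SELECT NAME, YEAR, TOTAL FROM directorCOMP "
            "WHERE CIK = ? ORDER BY YEAR DESC",
            (cik,),  # parameter binding keeps the CIK out of the SQL string
        )
        return cur.fetchall()
    finally:
        conn.close()
```

The `?` placeholder is the sqlite3 module's standard parameter binding, which matters once you start feeding values in from a file rather than typing them by hand.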

My goal with this was to reduce fear when interacting with our platform using code. So I asked the following:

Suppose I have a CSV file that has a column labeled CIK and a column labeled YEAR. The file is stored on an APPSTREAM user instance “D:\PhotonUser\My Files\Temporary Files\mysample.csv”. The path to the Director Compensation database is “S:\directEDGAR_DATA\DIRECTOR_COMP.db”. Write a Python program to save the data from the Director Compensation database that relates to the CIKs and Years in my file, and I want the results saved to a CSV file named director_data.csv.

I learned something new – that I know will improve my productivity:

import csv
import sqlite3

# Connect to the Director Compensation database
conn = sqlite3.connect(r'S:\directEDGAR_DATA\DIRECTOR_COMP.db')
cur = conn.cursor()

# Read the data from the CSV file into a list
cik_years = []
with open(r'D:\PhotonUser\My Files\Temporary Files\mysample.csv', 'r') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        cik_years.append((row['CIK'], row['YEAR']))

# Execute a query to retrieve the data from the Director Compensation database that matches the CIKs and Years in the CSV file
cur.execute("""
    SELECT *
    FROM directorCOMP
    WHERE (CIK, YEAR) IN ({})
""".format(','.join(['(?,?)'] * len(cik_years))), cik_years)

# Write the results to a CSV file
with open('director_data.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    
    # Write the header row
    writer.writerow([column[0] for column in cur.description])   # I learned about the description attribute of the cursor object
    
    # Write the data rows
    for row in cur:
        writer.writerow(row)

# Commit the changes and close the connection
conn.commit()
conn.close()

Anyway – there is some code – will it work? I was eager to find out, so I created a sample with more than 9,000 CIK-YEAR pairs, saved it per the details above, started an IDLE (Python) session as if I were a directEDGAR user, pasted the code, and ran it.

I understand the issue, and this is not the way I would have written the code. My goal, though, was to see whether a novice Python user who doesn’t necessarily want to learn Python – but wants to collect the data – could use ChatGPT to adjust the code. So I pasted the error message into ChatGPT.
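For what it is worth, the way I would have approached it avoids the bound-variable limit altogether: load the CIK-YEAR pairs into a temporary table and join against it, so no giant `IN (...)` list is ever built. A rough sketch (the function and variable names are mine, purely illustrative):

```python
import csv
import sqlite3

def fetch_matching_rows(db_path, pairs_csv):
    """Join a CIK/YEAR sample file against directorCOMP via a temp table,
    sidestepping the 'too many SQL variables' limit on IN (...) lists."""
    conn = sqlite3.connect(db_path)
    try:
        # The temp table lives only for this connection's lifetime
        conn.execute("CREATE TEMP TABLE sample(CIK TEXT, YEAR INT)")
        with open(pairs_csv, newline='') as f:
            reader = csv.DictReader(f)
            conn.executemany(
                "INSERT INTO sample VALUES (?, ?)",
                ((row['CIK'], row['YEAR']) for row in reader),
            )
        cur = conn.execute(
            "SELECT d.* FROM directorCOMP d "
            "JOIN sample s ON d.CIK = s.CIK AND d.YEAR = s.YEAR"
        )
        header = [col[0] for col in cur.description]
        return header, cur.fetchall()
    finally:
        conn.close()
```

With the join approach the sample size is irrelevant – 9,000 pairs or 900,000 – because the pairs are rows in a table, not bound variables in a statement.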

I looked at the revised code and was fairly certain I would see the error again – this time with 2,000 bindings against the 1,000 limit. But remember, I was curious whether I could power through this and get some output. I was right: the same binding-limit error came back.

I pasted that message into ChatGPT, hit the Regenerate response button, and saw some code I thought would work!

import csv
import sqlite3

# Connect to the Director Compensation database
conn = sqlite3.connect(r'S:\directEDGAR_DATA\DIRECTOR_COMP.db')
cur = conn.cursor()

# Read the data from the CSV file into a list
cik_years = []
with open(r'D:\PhotonUser\My Files\Temporary Files\mysample.csv', 'r') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        cik_years.append((row['CIK'], row['YEAR']))

# Split the list of values into smaller chunks to avoid exceeding the maximum number of bindings allowed
chunk_size = 1000
for i in range(0, len(cik_years), chunk_size):
    chunk = cik_years[i:i + chunk_size]
    
    # Execute a query to retrieve the data from the Director Compensation database that matches the CIKs and Years in the CSV file
    cur.execute("""
        SELECT *
        FROM directorCOMP
        WHERE (CIK, YEAR) IN ({})
    """.format(','.join(['(?,?)'] * len(chunk))), [item for sublist in chunk for item in sublist])
    
    # Write the results to a CSV file
    with open(r'D:\PhotonUser\My Files\Temporary Files\director_data.csv', 'a', newline='') as csvfile:
        writer = csv.writer(csvfile)

        # Write the header row
        if i == 0:
            writer.writerow([column[0] for column in cur.description])

        # Write the data rows
        for row in cur:
            writer.writerow(row)

# Commit the changes and close the connection
conn.commit()
conn.close()

I am going to tell you that the comment # Commit the changes scared me – a SELECT makes no changes, so there is nothing to commit, and our users can’t commit changes to the shared database in any case. The OS safeguards and privilege management help with that issue.
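If that comment worries you too, one belt-and-suspenders option is to open the database read-only using sqlite3's URI mode, so any write attempt simply raises an error rather than relying on OS permissions. A minimal sketch:

```python
import sqlite3

def open_read_only(db_path):
    """Open a SQLite database read-only; any write raises sqlite3.OperationalError."""
    # mode=ro in the URI tells SQLite to refuse all modifications
    return sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
```

A stray `conn.commit()` on such a connection is harmless, and an accidental INSERT or UPDATE fails loudly instead of silently altering the data.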

This was a successful experiment – ChatGPT wrote the code for me to pull the director compensation data. I want to emphasize that I made no alterations to the code ChatGPT delivered. All I did was describe my goal, define the fields, and specify the path to the db, the path to my sample, and the destination. So we need to get some field definition files up there so you can do the same. This was exciting. I have to stop playing with this, but you can expect more posts describing how to develop code to process and analyze our artifacts.

Updated Structure for Full-Text Searchable Comment Letters (UPLOAD filing type) in Process and Partially Available

For quite some time I have wanted to update the way we make the Comment Letters available, and because of this I have delayed updating the collection. I received an email from a researcher last week who needed our comment letters updated, and since I have been struggling with another project (and she needed these updated), I turned my focus to the Comment Letters.

I had two primary goals with this project. First, I discovered some time ago that the SEC sometimes includes a field in the header associated with a Comment Letter, <PUBLIC-REFERENCE-ACC>, that points to the filing that was the subject of the letter. I wanted to add these to the metadata we include with the filings. Second, I wanted to systematically replace all of the pdf comment letters with a txt version, since the txt version is friendlier for finer-grained work in Python. Sometimes the SEC includes a TEXTEXTRACT version of the comment letter, and my intention was to use that version when available. Otherwise we will run the pdf form of the letter through one of our tools and use the result rather than the pdf version.
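The selection rule – prefer the SEC's text version, otherwise route the pdf through conversion – amounts to a few lines of logic. A rough sketch (the file layout and function name here are illustrative, not our actual pipeline):

```python
from pathlib import Path

def pick_text_source(documents):
    """Given a filing's document paths, prefer an SEC-provided .txt
    over a .pdf; return (chosen_path, needs_conversion)."""
    txts = [d for d in documents if d.suffix.lower() == '.txt']
    if txts:
        return txts[0], False   # SEC TEXTEXTRACT version: use as-is
    pdfs = [d for d in documents if d.suffix.lower() == '.pdf']
    if pdfs:
        return pdfs[0], True    # pdf only: route through our conversion tool
    return None, False
```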

I will have to plead incompetence for not knowing earlier that the <PUBLIC-REFERENCE-ACC> was available and we were not incorporating it into the platform for your use. When we develop features we take what we think is a reasonable sample of the population and based on what we discover as we are working with that sample we make the decisions about how to move forward. You have read my comments in the past where I acknowledge – I don’t know what I don’t know.

Now that I know of this tag, I set to work this weekend to begin incorporating it into the platform. Guess what: the SEC stopped using this tag before January 2021. From what I can tell, the last comment letter that included this field was uploaded on 11/12/2020. Frankly, the frequency of this field appearing in comment letters dropped off significantly after September 2020.

I am building a database of all of the metadata for these filings, and so when the code first finished running on Saturday I was sure I had made some Burch error (this is an error that only I can make). After I spot-checked a few of the later header files and did not see the field, I decided to check all of the header files for filings made since 2020, by year, to see if maybe (I was hopeful) there was a new field. So I ran code to identify each unique field in every header file by year. Guess what: I did not make a Burch error. There were no filings with either the <PUBLIC-REFERENCE-ACC> field or a new field that provided this info. There were 29 unique fields across all of the filings made in 2020 and only 28 unique fields across all of the filings made in 2021 and 2022. Serious bummer.

{'acceptance-datetime', 'filing-values', 'private-to-public', 'type', 'conformed-name', 'business-address', 'state', 'street1', 'cik', 'fiscal-year-end', 'assigned-sic', 'filed-for', 'form-type', 'date-changed', 'sec-header', 'phone', 'public-document-count', 'city', 'zip', 'former-company', 'filing-date', 'company-data', 'mail-address', 'accession-number', 'state-of-incorporation', 'former-conformed-name', 'street2', 'irs-number'}
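For anyone who wants to replicate the check, a field inventory like the set above can be produced with a simple scan of the header text. A rough sketch, assuming the headers are SGML-style text with <TAG> markers (the function name is mine):

```python
import re

# Match opening tags like <ACCESSION-NUMBER> or <PUBLIC-REFERENCE-ACC>;
# the leading-letter requirement skips closing tags like </SEC-HEADER>
TAG_RE = re.compile(r'<([A-Za-z][A-Za-z0-9-]*)>')

def unique_fields(header_texts):
    """Collect the set of unique header tags (lowercased) across header files."""
    fields = set()
    for text in header_texts:
        for tag in TAG_RE.findall(text):
            fields.add(tag.lower())
    return fields
```

Run once per year of header files and compare the resulting sets, and a dropped field like <PUBLIC-REFERENCE-ACC> shows up immediately.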

I did some searches on the SEC website and could not find any announcement of this change. The latest version of the PDS Technical Specification document still contains a reference to this field, which is interesting because that version is dated 12/14/2020 – after the field stopped showing up.

It is really frustrating: that was a very useful feature, and it is gone. I have not found anyone who is happy with the redesign of the EDGAR website/interface. The project I have been stuck on is an attempt to improve metadata collection from 10-K filings, and I have finally had to accept that we can’t capture everything we want to capture with the confidence needed to distribute it through the platform. I will post more on that later. I am just curious, though: who at the SEC makes these decisions, and why?

Enough ranting. Let me address what has been accomplished. First, there is a new index, UPLOAD Y2016-Y2025. The label is a little misleading – today it only has filings from 2020 through the end of 2022. The balance will be moved over from the UPLOAD Y2016-Y2020 index within the next week. That process is running now, but I wanted the filings our client needed to be available immediately.

Second, there is a new database – COMMENT_LETTER_DETAILS:

This contains a number of fields that I think will be useful. The SEC’s decision to stop providing the tag is going to make this more painful to complete than I hoped. However, for all the 2020 filings with the tag present, we added a link to the EDGAR filing that was the subject of the Comment Letter, as well as the type of that filing. We will do this for every filing we can, back to 2005 (when Comment Letters first became available).

When you do a search, remember that RDATE/CDATE have a different meaning for Comment Letters. The SEC does its research and sends a Comment Letter to a registrant. The day the letter is sent is considered the ‘Filed Date’ for purposes of securities law, and on EDGAR you will find the Comment Letter’s filed date matches the letter date. However, Comment Letters are not visible on EDGAR for review until sometime later. That window has shortened considerably since the early days, but it can still extend several months. The RDATE value we use is the date the header indicates the filing status changed from PRIVATE-TO-PUBLIC; the CDATE is the SEC EDGAR reported filing date. Thus, if you expect a market reaction, presumably it would come around the RDATE, when the letter actually became visible.
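To make the distinction concrete: the gap between CDATE (the letter date) and RDATE (the public-release date) is the disclosure lag, and it is trivial to compute once you have both dates. A small sketch of mine, assuming ISO-formatted date strings (the example dates are invented):

```python
from datetime import date

def disclosure_lag_days(cdate, rdate):
    """Days between a Comment Letter's filed date (CDATE) and the date it
    became publicly visible on EDGAR (RDATE). Dates as 'YYYY-MM-DD' strings."""
    filed = date.fromisoformat(cdate)
    public = date.fromisoformat(rdate)
    return (public - filed).days
```

A letter dated 2020-03-02 that surfaced on 2020-05-15, for instance, carries a lag of 74 days – exactly the kind of window an event study needs to account for.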