ChatGPT & directEDGAR

I know from talking to my own colleagues in the College of Business at the University of Nebraska at Omaha that people fall along an interesting spectrum with respect to their comfort level with using different tools to collect data. In a perfect world we would just imagine the data and it would appear – I alluded to that in the intro to one of our user guides about a decade ago.

About a month ago I asked one of our team (Siva) to plan a demo of how a novice might use ChatGPT to help them for instance – write Python code to pull data from one of our databases. Siva gave us the demonstration this morning. My brain has been on fire since then. It was a fascinating introduction and before the call was finished I had agreed to buy everyone on the team an upgraded account ($20.00/month) for access to ChatGPT and we were starting to talk about the ways it could help us improve our productivity.

But back to the original use case. After learning from watching Siva, I decided to test it myself. So first, I dropped in the code I used when creating the DC database and asked ChatGPT to provide a user guide to explain the fields.:

conn = sqlite3.connect(r'E:\MP_TEMP\DIRECTOR_COMP_weekly.db')
cur = conn.cursor()

cur.execute("""CREATE TABLE IF NOT EXISTS directorCOMP(
              ROW_ID INTEGER PRIMARY KEY,
              CIK TEXT,
              DEID TEXT,
              RDATE TEXT,
              RID TEXT,
              NAME TEXT,
              YEAR INT,
              CASH INT,
              BONUS INT,
              NONEQUITY INT,
              NQDEFCOMP INT,
              STOCK INT,
              OPTION INT,
              OTHER INT,
              TOTAL INT,
              SEC_NAME TEXT,
              PERSON_CIK TEXT,
              GENDER TEXT,
              SINCE INT,
              AGE INT
              );
			   """)

After a few iterations – and let me say, ChatGPT knew that CIK was Central Index Key from the get go – that was impressive to me – we had the following:

User Guide for the Director Compensation Database

Introduction: This database contains information about director compensation for various companies. The database table is named “directorCOMP”. The following is a guide to explain the fields in this database.

“ROW_ID” – This field contains a unique identifier for each record in the table and is defined as an “INTEGER” type. The “PRIMARY KEY” constraint ensures that no two records can have the same value in this field.
“CIK” – This field contains the Central Index Key (CIK) for a company and is defined as a “TEXT” type.
“DEID” – This field contains a unique identifier for a director that helps to identify the source document from which the compensation information was pulled. The “DEID” field is defined as a “TEXT” type.
“RDATE” – This field contains the date that compensation information was received and is defined as a “TEXT” type.
“RID” – This field contains a unique identifier for compensation information and is defined as a “TEXT” type.
“NAME” – This field contains the name of the director and is defined as a “TEXT” type.
“YEAR” – This field contains the year for which compensation information is being reported and is defined as an “INT” type.
“CASH” – This field contains the amount of cash compensation reported and is defined as an “INT” type.
“BONUS” – This field contains the amount of bonus compensation reported and is defined as an “INT” type.
“NONEQUITY” – This field contains the amount of non-equity incentive plan compensation reported and is defined as an “INT” type.
“NQDEFCOMP” – This field contains the amount of non-equity incentive plan deferred compensation reported and is defined as an “INT” type.
“STOCK” – This field contains the amount of stock award compensation reported and is defined as an “INT” type.
“OPTION” – This field contains the amount of option award compensation reported and is defined as an “INT” type.
“OTHER” – This field contains the amount of other compensation reported and is defined as an “INT” type.
“TOTAL” – This field contains the total amount of compensation reported and is defined as an “INT” type.
“SEC_NAME” – This field contains the name of the SEC filing and is defined as a “TEXT” type.
“PERSON_CIK” – This field contains the CIK number for the director and is defined as a “TEXT” type.
“GENDER” – This field contains the gender of the director and is defined as a “TEXT” type. The allowed values for this field are “M” for male and “F” for female.
“SINCE” – This field contains the year that the director has been with the company and is defined as an “INT” type.
“AGE” – This field contains the age of the director and is defined as an “INT” type.

My goal with this was to reduce fear when interacting with our platform using code. So I asked the following:

Suppose I have a CSV file that has a column labeled CIK and a column labeled YEAR. The file is stored on an APPSTREAM user instance “D:\PhotonUser\My Files\Temporary Files\mysample.csv”. The path to the Director Compensation database is “S:\directEDGAR_DATA\DIRECTOR_COMP.db”. Write a Python program to save the data from the Director Compensation database that relates the CIKs and Years in my file and I want the results saved to a CSV file named director_data.csv.

I learned something new – that I know will improve my productivity:

import csv
import sqlite3

# Connect to the Director Compensation database
conn = sqlite3.connect(r'S:\directEDGAR_DATA\DIRECTOR_COMP.db')
cur = conn.cursor()

# Read the data from the CSV file into a list
cik_years = []
with open(r'D:\PhotonUser\My Files\Temporary Files\mysample.csv', 'r') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        cik_years.append((row['CIK'], row['YEAR']))

# Execute a query to retrieve the data from the Director Compensation database that matches the CIKs and Years in the CSV file
cur.execute("""
    SELECT *
    FROM directorCOMP
    WHERE (CIK, YEAR) IN ({})
""".format(','.join(['(?,?)'] * len(cik_years))), cik_years)

# Write the results to a CSV file
with open('director_data.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    
    # Write the header row
    writer.writerow([column[0] for column in cur.description])   # I learned about the description method of the cur obj
    
    # Write the data rows
    for row in cur:
        writer.writerow(row)

# Commit the changes and close the connection
conn.commit()
conn.close()

Anyway – there is some code – will it work? I was eager to find out so I created a sample with more than 9,000 CIK-YEAR pairs saved it per the details above, started an Idle (Python) session as if I was a directEDGAR user, pasted the code and ran it.

I understand the issue and this is not the way I would have written the code. My goal though was to see if a novice Python user who doesn’t necessarily want to learn Python – but wants to collect the data could use ChatGPT to adjust the code – so I posted the error message into ChatGPT

I looked at the code and was fairly certain I would see the error again but this time 2,000/1,000. But remember, I was curious about whether I could power through this and get some output. I was right, here is the error message:

I pasted that message into ChatGPT and hit the Regenerate response message and saw some code I thought would work!

import csv
import sqlite3

# Connect to the Director Compensation database
conn = sqlite3.connect(r'S:\directEDGAR_DATA\DIRECTOR_COMP.db')
cur = conn.cursor()

# Read the data from the CSV file into a list
cik_years = []
with open(r'D:\PhotonUser\My Files\Temporary Files\mysample.csv', 'r') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        cik_years.append((row['CIK'], row['YEAR']))

# Split the list of values into smaller chunks to avoid exceeding the maximum number of bindings allowed
chunk_size = 1000
for i in range(0, len(cik_years), chunk_size):
    chunk = cik_years[i:i + chunk_size]
    
    # Execute a query to retrieve the data from the Director Compensation database that matches the CIKs and Years in the CSV file
    cur.execute("""
        SELECT *
        FROM directorCOMP
        WHERE (CIK, YEAR) IN ({})
    """.format(','.join(['(?,?)'] * len(chunk))), [item for sublist in chunk for item in sublist])
    
    # Write the results to a CSV file
    with open(r'D:\PhotonUser\My Files\Temporary Files\director_data.csv', 'a', newline='') as csvfile:
        writer = csv.writer(csvfile)

        # Write the header row
        if i == 0:
            writer.writerow([column[0] for column in cur.description])

        # Write the data rows
        for row in cur:
            writer.writerow(row)

# Commit the changes and close the connection
conn.commit()
conn.close()

I am going to tell you that statement # Commit the changes scared me – you can’t commit changes and there are no changes. But the OS safeguards and privilege management help with that issue.

This was a successful experiment – ChatGPT wrote the code for me to pull the director compensation data. I want to emphasize, I made no alterations to the code ChatGPT delivered. All I did was describe my goal, define the fields, specify the path to the db, the path to my sample and the destination. So we need to get some field definition files up there so you can do the same. This was exciting. I have to stop playing with this but you can expect some more posts to describe how to develop code to process and analyze our artifacts.

directEDGAR

Search, Extraction & Normalization Engine

ChatGPT & directEDGAR

Like this:

Related

Leave a ReplyCancel reply

Share this:

Like this:

Related

Leave a ReplyCancel reply

Discover more from directEDGAR