Auditor Change Disclosure Text – Another ChatGPT & directEDGAR Example

One of our clients sent an email this morning asking for help pulling from 8-K filings the specific text used to explain the reason for the filing. I want to protect their research idea, so I am going to modify the example and pretend I want to pull the explanation associated with auditor changes. So here is the scenario – I want to do sentiment analysis on all explanations of auditor changes included in 8-K filings. The client has used Python in the past but I could sense some real apprehension as I was explaining how to do this. So I decided to use ChatGPT to generate the code and then I shared the chat results with them. I decided to blog about this experience because I think it should help reduce some apprehension across our user base for this and similar problems – and their response was kind of cool: “This may even work for a techno imbecile like me.” Let me assure you, this person is not an imbecile. But I understand the apprehension.

Step 1 – Identify all 8-K filings that disclose an auditor change. Easy enough – run a search in one or more 8-K collections for (DOCTYPE contains(8K*)) and (ITEM_4_01 contains(YES)). For this real example I selected Y2005-Y2015, Y2016-Y2020, Y2021-Y2025 and Y2023-Y2023. I hit the perform search button at 2:43 pm. I am sharing that because I hope to be finished with this in less than an hour.

Here is an image of the search results as displayed in our software:

This makes me a bit nervous – we found 26,940 documents and I am a little bit concerned about moving that many files at once. I am going to do this by subset of years – so I am going to rerun by first selecting Y2016-Y2020, Y2021-Y2025 and Y2023-Y2023. This time the search returned 6,162 documents – this feels like a more appropriate size.

Step 2 – We need to extract the text content of those documents – From the Extraction menu select DocumentExtractionTextOnly – use the folder selector tool to create a folder in your Temporary Files folder and select the folder.

This will take a few minutes – the application has to pull the document out of the html and clean it up and then save it in the directory. I generally expect the process to create 300-500 files a minute. So while that is running I am going to go to ChatGPT and ask for some help. Here was my initial request to ChatGPT.

I have some txt files at D:\PhotonUser\My Files\Temporary Files\Group1  

The files have a line of text that begins with 
Item 4.01. Changes in Registrant’s Certifying Accountant
However, I cannot be sure about the white space or the capitalization.

after that line is additional text in most cases I want to capture all text from the Item 4.01 until the next line that begins with a pattern of ITEM #.# or ITEM #.##  (I put this in caps to emphasize but again I am not certain about capitalization) or spaces.  In some cases though the end will be indicated with a line that begins with the word SIGNATURE or SIGNATURES

I want the text between the begin and end locations saved to a new directory named   
D:\PhotonUser\My Files\Temporary Files\PARSED_GROUP1

And any files that did not yield results should be moved to a new directory called

D:\PhotonUser\My Files\Temporary Files\PROBLEM_GROUP1

It took 21 minutes to get the 6,162 8-K filings that were in my initial search – so that was on the low end of my expectation (a bit less than 300/minute). Here is the code that ChatGPT generated for me:

import os
import re
import shutil

# Define the source directory
source_dir = r"D:\PhotonUser\My Files\Temporary Files\Group1"

# Define the destination directories
parsed_dir = r"D:\PhotonUser\My Files\Temporary Files\PARSED_GROUP1"
problem_dir = r"D:\PhotonUser\My Files\Temporary Files\PROBLEM_GROUP1"

# Create the destination directories if they don't exist
os.makedirs(parsed_dir, exist_ok=True)
os.makedirs(problem_dir, exist_ok=True)

# Define the patterns for identifying the start and end of the desired text
start_pattern = re.compile(r'(?i)item\s*4\.\d+')
end_pattern = re.compile(r'(?i)item\s*\d+\.\d+\.?(?!\d)|^signature[s]?')

# Iterate over the files in the source directory
for file_name in os.listdir(source_dir):
    file_path = os.path.join(source_dir, file_name)
    
    # Check if the item is a file
    if os.path.isfile(file_path):
        
        # Read the file contents
        with open(file_path, 'r') as file:
            content = file.read()
        
        # Find the start and end indices of the desired text
        start_match = start_pattern.search(content)
        
        if start_match:
            start_index = start_match.end()
            
            # Find the end index
            end_match = end_pattern.search(content[start_index:])
            
            if end_match:
                end_index = start_index + end_match.start()
            else:
                # If the end pattern is not found, check for "SIGNATURE" or "SIGNATURES"
                signature_match = re.search(r'(?i)^signature[s]?', content[start_index:])
                if signature_match:
                    end_index = start_index + signature_match.start()
                else:
                    end_index = len(content)
            
            # Extract the desired text
            parsed_text = content[start_index:end_index]
            
            # Save the extracted text to a new file
            parsed_file_path = os.path.join(parsed_dir, file_name)
            with open(parsed_file_path, 'w', encoding='utf-8') as parsed_file:
                parsed_file.write(parsed_text)
            
            print(f"File '{file_name}' processed and saved to '{parsed_file_path}'")
        else:
            # Move files that did not yield results to the problem directory
            problem_file_path = os.path.join(problem_dir, file_name)
            shutil.move(file_path, problem_file_path)
            print(f"File '{file_name}' moved to '{problem_file_path}'")

As soon as I ran the code I had an error message:

========== RESTART: D:/PhotonUser/My Files/Temporary Files/parse8k.py ==========
Traceback (most recent call last):
  File "D:/PhotonUser/My Files/Temporary Files/parse8k.py", line 32, in <module>
    content = file.read()
  File "C:\Program Files\Python310\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 1173: character maps to <undefined>

So I asked ChatGPT for help:

The error message suggests that the file you're trying to read has a character that cannot be decoded using the 'cp1252' encoding (also known as Windows-1252). To resolve this issue, you can modify the script to handle the file using a different encoding, such as 'utf-8'.

Please replace the following line in the script:
with open(file_path, 'r') as file:

with:
with open(file_path, 'r', encoding='utf-8', errors='ignore') as file:

This change allows the script to read the file using the 'utf-8' encoding while ignoring any characters that cannot be decoded. The 'errors='ignore'' argument ensures that any problematic characters are skipped without raising an exception.

Save the script after making the modification and run it again. It should now be able to read the file without encountering the UnicodeDecodeError.

I made that change and we went to town. There were 321 files in the problem folder. I shared the following with ChatGPT:

one of the files in the PROBLEM_FOLDER had this after the text I wanted:

SIGNATURE 
Pursuant to the requirements of the Securities Exchange Act of 1934, the Registrant has duly caused this report to be signed on its behalf by the undersigned hereunto duly authorized. 
Dated: July 10, 2020 


 
    BARNWELL INDUSTRIES, INC. 
	
    	
    By: 
	  
/s/ Russell M. Gifford  
	
    	  Name: 
	  Russell M. Gifford 

The response was:

In this updated version, the end pattern has been modified to use a positive lookahead assertion (?=.*?signature[s]?\b) to match lines that contain the word "SIGNATURE" or "SIGNATURES" regardless of their position within the line.

Please give this modified script a try, and let me know if you encounter any further issues.

It is 4:02, so I was at this longer than I hoped – there was some other back and forth with ChatGPT – but I am pretty pleased with the results. Here is a screenshot of a review of these in the SmartBrowser:

I tried to be as naive as possible but I was also persistent. Because directEDGAR allows you to first identify the relevant 8-Ks and then extract just their text, we saved significant time. With that time savings we can afford to be persistent with ChatGPT to make sure we get it right.

Here is the last iteration of the code:

import os
import re
import shutil

# Define the source directory
source_dir = r"D:\PhotonUser\My Files\Temporary Files\Group1"

# Define the destination directories
parsed_dir = r"D:\PhotonUser\My Files\Temporary Files\PARSED_GROUP1"
problem_dir = r"D:\PhotonUser\My Files\Temporary Files\PROBLEM_GROUP1"

# Create the destination directories if they don't exist
os.makedirs(parsed_dir, exist_ok=True)
os.makedirs(problem_dir, exist_ok=True)

# Define the patterns for identifying the start and end of the desired text
start_pattern = re.compile(r'(?i)item\s*4\.\d+')
end_pattern = re.compile(r'(?i)item\s*\d+\.\d+\.?(?!\d)|^(?=.*?signature[s]?\b)')

# Iterate over the files in the source directory
for file_name in os.listdir(source_dir):
    file_path = os.path.join(source_dir, file_name)
    
    # Check if the item is a file
    if os.path.isfile(file_path):
        
        # Read the file contents
        with open(file_path, 'r', encoding='cp1252', errors='ignore') as file:
            content = file.read()
        
        # Find the start and end indices of the desired text
        start_match = start_pattern.search(content)
        
        if start_match:
            start_index = start_match.end()
            
            # Find the end index
            end_match = end_pattern.search(content[start_index:])
            
            if end_match:
                end_index = start_index + end_match.start()
            else:
                # If the end pattern is not found, check for "SIGNATURE" or "SIGNATURES"
                signature_match = re.search(r'(?i)^(?=.*?signature[s]?\b)', content[start_index:], re.MULTILINE)
                if signature_match:
                    end_index = start_index + signature_match.start()
                else:
                    end_index = len(content)
            
            # Extract the desired text
            parsed_text = content[start_index:end_index]
            
            # Save the extracted text to a new file
            parsed_file_path = os.path.join(parsed_dir, file_name)
            with open(parsed_file_path, 'w', encoding='utf-8') as parsed_file:
                parsed_file.write(parsed_text)
            
            print(f"File '{file_name}' processed and saved to '{parsed_file_path}'")
        else:
            # Move files that did not yield results to the problem directory
            problem_file_path = os.path.join(problem_dir, file_name)
            shutil.move(file_path, problem_file_path)
            print(f"File '{file_name}' moved to '{problem_file_path}'")

I still have another 20,778 to complete. But the hard work is finished. I started extracting the next group of 8-K filings at 4:05 – this time I am going to just extract all found in the Y2005-Y2015 archive.

Okay, finished at 5:38 PM. It took about 70 minutes to pull the relevant filings, and then I had to address some new errors that came up because of structural differences in the 8-K filings. I did more back and forth with ChatGPT as I noticed filings that ended up in the problem group.

Minor But Important Changes Coming to HTML Files

I have been reworking our code to parse sections of 10-K filings. As I was testing code and reviewing the documents the code was working on, I became more and more annoyed at the way the html documents are constructed. If you have not noticed – many filings now consist of a very limited number of lines. For instance – here is a screen shot of the source of a 10-K:

All of the critical elements of the document that I want to work on are in line 11. While there are ways to inspect elements using some of the browser features, I find it much easier and more efficient to open the document in Notepad++ when I need to inspect the way elements are organized so I can write the appropriate XPATH to identify parts of the document to parse. But if I open that document in Notepad++ I discover line number 11 is 2,217,885 characters wide. It really is brutal to review and browse this.

We are going to transition every htm file so that it is pretty_printed. Here is an image of part of the same file after it has been pretty_printed:
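
If you are curious what the transformation involves, it is nearly a one-liner with lxml; a minimal sketch, with hypothetical file names:

from lxml import etree, html

# Parse the single-line htm file and re-serialize it with pretty_print
tree = html.parse("example_10k.htm")
pretty = etree.tostring(tree, pretty_print=True, encoding="unicode")
with open("example_10k_pretty.htm", "w", encoding="utf-8") as f:
    f.write(pretty)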

Overall this change should improve your ability to identify elements that you want to analyze in your code. All of the elements should be the same except for two changes – first the sourceline property will change. That should be a very useful feature as it adds another way to differentiate elements. Second, almost all of the elements will have a newline character (‘\n’) added to the tail. Those can be stripped out easily if you are working with text. To understand this compare the next two images. The first one is before the file was pretty_printed. The sourceline is 11 (recall that in the first image I showed that all of the document was on line 11). Also there is no content associated with the tail.

The next image is the same element after the document was pretty_printed. Now the sourceline is 844 and the tail has a new line appended to it (‘\n’). For those of you looking at these images real closely – the memory reference is different because these are different runs on the same document.
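
If you are working with text and those added newlines get in your way, stripping them is simple; a sketch, assuming an lxml tree and a hypothetical file name:

from lxml import html

tree = html.parse("example_10k_pretty.htm")
for el in tree.iter():
    # el.sourceline now reflects the pretty-printed line numbers
    if el.tail is not None:
        el.tail = el.tail.rstrip("\n")  # drop the newline added to the tail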

We will begin the transition this weekend, starting with the most recent filings and then working back in time. I want to wait until the weekend because I need to modify the code we use to capture and first process filings, and I don’t want some unexpected issue (one of those things that I don’t know that I don’t know) to delay or impede our daily update process.

Stay tuned as we have something more substantial and interesting coming up in the next couple of weeks.

Minor Database Changes

We just replaced the initial Tax related databases and got rid of the _NEW tag that was appended to the name. To learn about the content please read this (or this) from our knowledge base.

Manish has been working diligently to update these and I think he has done a tremendous job even though sometimes from the outside it might look like we are not making progress. We now have data from 2015 filings available in the TAX_RECON file. The DEFERRED_TAX file is still limited to filings back to 2019. For both files we are adding data as it is filed with EDGAR weekly.

Remember – we are trying to merge the XBRL data with the data that was originally reported in the HTML tables in the 10-K/EX-13s. When we conceived of this as a way to add more robustness and utility to the data in the ExtractionPreprocessed system as we move it to the database, it seemed like an easy proposition. Unfortunately that has not been the case. Frankly, all of our work makes us very suspicious of any data pulled directly from the XBRL, as it is just not very complete and/or has other issues.

I am going to try to document some of these issues in a later post.

Minor Database Changes Coming & Another ChatGPT Experiment – All 8-K filings with an Acceptance Time after 4:00 pm on Fridays.

We are in the midst of doing some rationalization of our databases. All the changes will be reported in the Knowledge Base in the section that describes the databases and their fields. Our goal is to bring more discipline to our field names and the organization of the databases. Most of the changes will be minor (for example, in every table that has a field CIK_ACCESSION, that field will be renamed to CIK_ACC; we have some tables that have ACCESSION_NUMBER and others that have ACCESSION – all of those will be standardized to ACCESSION). There are two reasons for these changes. First, we are working on an update to the Extraction Engine – one of the features we want is to dynamically accept fields to filter on from a CSV file. Right now the only field you can filter on using a CSV file is CIK, but at some point we want you to be able to filter on any field based on the contents of a file that holds the criteria. To make this work we need to be more disciplined with the names we assign to fields.

Another change is that all ACCEPTANCE_DATETIME fields in all databases are going to be converted to ISO 8601 format (YYYY-MM-DDTHH:MM:SS). Currently the fields are stored as reported in the header file, YYYYMMDDHHMMSS. But that format is not really easy to work with – especially for a novice, as I discovered today.
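
To see the difference, here is the conversion itself; a minimal sketch (the sample value is the one from my prompt below):

from datetime import datetime

raw = "20221216081933"  # header-style ACCEPTANCE_DATETIME
iso = datetime.strptime(raw, "%Y%m%d%H%M%S").isoformat()
print(iso)  # 2022-12-16T08:19:33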

I am trying to coach my interns on using ChatGPT to develop more content and skills, and in preparation for one of my coaching ideas I decided to work through creating Python code to select all 8-K filings made after 4:00 PM on Fridays. This is a project some of our clients have used in their classrooms in the past and I wanted to approach it as an interested novice. However, as I was working through the process, the query to select – especially the after 4:00 PM constraint – got more and more complex. I finally wondered if it would be simpler if the field value was changed. To test this I created another copy of the db and changed the field. The resulting query was so much simpler after that change. So, if you can imagine that you wanted to replicate this – here is the exchange I had with ChatGPT after the modification (note the platform database has been modified as of now):

I have a database that is at S:\directEDGAR_DATA\8KMETA.db there is one table named EIGHTKMETA the primary key is CIK_ACC there is a field named ACCEPTANCE_DATETIME – here is a sample value from that field 2022-12-16T08:19:33 is there a way to write Python code to select all observations where the ACCEPTANCE_DATETIME is after 4:00 pm on Fridays. I want to save the results in a csv file and the directory where the results should be saved is “D:\PhotonUser\My Files\Temporary Files\”

Here is the code – and guess what – it worked – but let’s walk through the process of getting it on the platform and ready to run.

import sqlite3
import csv

# Connect to the database
conn = sqlite3.connect(r"S:\directEDGAR_DATA\8KMETA.db")
cursor = conn.cursor()

# Select all observations where ACCEPTANCE_DATETIME is after 4:00 pm on Fridays
query = """
    SELECT *
    FROM EIGHTKMETA
    WHERE strftime('%w', ACCEPTANCE_DATETIME) = '5'  -- Friday
        AND strftime('%H:%M', ACCEPTANCE_DATETIME) > '16:00' -- After 4:00 pm
"""

cursor.execute(query)
results = cursor.fetchall()

# Write the results (with a header row) to a CSV file
# (the file name below is illustrative - the prompt only specified the directory)
output_path = r"D:\PhotonUser\My Files\Temporary Files\friday_after_4pm_8k.csv"
with open(output_path, "w", newline="", encoding="utf-8") as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow([column[0] for column in cursor.description])
    writer.writerows(results)

# Close the database connection
cursor.close()
conn.close()

To use this code we need to get it up to the platform – assuming I am already in a session – I just need to open IDLE – the most direct way is to use the tiles on the browser frame.

I need to open a New File so I can save the code and use or modify as needed. From the File menu in IDLE select New File.

Presumably I am working with ChatGPT on my local computer and the code was generated in a session so I need to transfer the code to the New File I opened – so using the Copy Code feature provided by ChatGPT copy the code to the clipboard.

To transfer the contents of the clipboard to the remote session use the control in the browser frame (a fuller description can be found in the Knowledge Base here) to select Paste to Remote Session and follow the prompt to paste into the container that appears.

After hitting Ctrl+V you should see your content pasted into the container and the container will close. Put your mouse in the top right corner of the file you opened from within IDLE and hit CTRL+V again to then paste the code into IDLE.

Before we can run this we need to save the code – while you have a home folder, presumably you are like me and want to keep things close and safe, so I am going to save to the Temporary Files folder and then move it to my local computer when I am finished.

In this case I have saved the file with the name identify_8k_after_4.py.

Now I am ready to run this file – To run, select Run Module from the Run menu.

It took about a minute to run this code and generate the results. A screenshot does not do the result file justice as there are too many fields. I have hidden some to make the results a bit clearer. If you are a real SEC guru and are wondering why we have those particular ITEM codes – we recoded all earlier 8-K item codes to match the current ones adopted after SOX, as sketched below.
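
For the curious, that recoding is essentially a lookup from the pre-SOX item numbers to the post-SOX ones. A few illustrative entries (my own reading of the old and new item numbers – not our actual table):

# A few pre-SOX 8-K item numbers and their post-SOX equivalents (illustrative)
OLD_TO_NEW_ITEM = {
    "4": "4.01",  # Changes in Registrant's Certifying Accountant
    "5": "8.01",  # Other Events
    "7": "9.01",  # Financial Statements and Exhibits
}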

Forgive me for bragging – notice that we have the time listed in those early filings – most cannot be found in the header file. For example, notice the 8-K filing for Mattel on 2/26/1996. The header file reports a time of midnight:

In these cases, if we could find a dissemination time – we replaced the header ACCEPTANCE_DATETIME with the dissemination time.

Remember – to transfer the results file (and probably the code file) back to your local computer follow the steps outlined here. Note – the 8KMETA.db now has the field ACCEPTANCE_DATETIME in the ISO 8601 format. Others will follow soon.

Minor News – Knowledge Base, Tax Recon Table and new field added to 8-K Meta (8KMETA) database.

For those of you here to explore the capabilities of directEDGAR – we have created a Knowledge Base. To access the Knowledge Base – either use the menu icon above or use this link (Knowledge Base). We will be continuing to fill it out.

Almost a month ago I described the release of two new databases as well as the process we were taking to move our extensive collection of existing TAXRECON tables into a new format. If you have communicated with me at all I might have told you “I don’t know what I don’t know!” This is proving true once again. I had hoped that we could use the tables we had pulled from the html filings and, with the existing row labels and the data values, have some certainty identifying the XBRL tags and labels so that we could completely reorganize our existing data into the same format as the new database. This is proving to be partially impossible. Manish has been beating his head against the monitor for the last month and last night we had to decide to scale back our expectations.

To sum it up, when we start with the html table and use the row labels as reported in the html to find the tags used in the XBRL, we simply cannot make matches for all row labels. This is because the label used in the XBRL differs from the label used in the html, and it is true for roughly 20% of the tables. For the tables with a failed match, the number of unmatched row labels can range from one to five. The reasons we can’t match are either that the label is ultimately too different from what is in the html or that the label was just not included in the associated label file. We know this because for some cases (not anywhere near all) we have opened the label file and read each individual label after doing a number of searches for variations of the expected language. We have tried a number of ways to finesse this problem, including using the data value we found in the html table to find the value in the instance document (the .xml file) and work back through a trail – this has not worked, for reasons that range from the data value being missing in the xml file to cases where there is no rational way to link the two because of what I am just going to call shoddy construction of the XBRL.

The decision we came to is to just move forward – if we can’t find the label (and thus the tag) then we will only report the label that was in the html and not the tag (the tag field for those particular rows will be blank).

With respect to the database we have for the 8-K META (8KMETA) – I have delayed updating it because I wanted to make two changes. First, I wanted to change the dE_PATH field value from the directory that holds all of the files that were parsed from each 8-K to the path for the actual 8-K (or 8-K/A), so that the path can be used in a search using the DB Document search filter. The second thing I wanted to do was add the path to the 8-K (or 8-K/A) on EDGAR. This is being added because some of our clients have told me there are cases where they will collect data from one platform and then need additional values or some validation of the data, and the SEC hyperlink is useful for those cases. These fields are being added on our testing platform right now. The existing database will be switched out over the weekend and be updated through Friday. Another piece of good news for us is that we have automated the update process like we have done with the DC data, so this will stay current.

I wanted to have some image here on this post so I decided to see how long it would take me to identify all 8-K (and 8-K/A) filings that were filed to report on auditor changes (Item 4.01 – Change in Registrant’s Certifying Accountant). It took less than five seconds.

Don’t anchor on that number because the database is still under construction. If my math is right, there are still another 1,135 to be added as well as any that have been filed in the last couple of days. (A quick scan of the EDGAR latest filings page suggests there are another six that have been filed recently.) Remember, if you hit the Save Results button above, all of the data from that query will be dumped to a CSV file for you to filter and research to your heart’s content.

Be careful when saving the file after you have opened it, though. Excel can in some cases modify/convert the values in the ACCEPTANCE_TIME field to a rounded exponential value, and if you save the file after it has made that conversion you will lose some precision.

The easiest way I know to manage that is to open the csv file, right-click on the Excel column label to select the column, open the context menu and select Format Cells . . .

Select the Number format and change the Decimal places to 0.

Once you have done that the number will fully display.

If you continue to save the file as a csv file you will have to work that magic each time you open it. If you save it as an XLSX file the format will be preserved, as illustrated in the next (and final) image!
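
If you would rather sidestep Excel entirely, reading the CSV with pandas and forcing the column to text avoids the conversion altogether; a sketch, with a hypothetical file name:

import pandas as pd

# Read ACCEPTANCE_TIME as a string so Excel-style rounding never happens
df = pd.read_csv("saved_results.csv", dtype={"ACCEPTANCE_TIME": str})
print(df["ACCEPTANCE_TIME"].head())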

Two New Databases Added to the Platform

As I have mentioned before, we are trying to port all of the data from the existing ExtractionPreprocessed distribution system to our new environment. Of course this is moving much slower than we would like. However, while some of the challenges relate to discovering things we didn’t know before, others have been due to our attempts to improve the process of collecting the data from top to bottom.

I am delighted to tell you that while we are not finished porting the old TAXRECON data over to the new system, we have made significant progress and have introduced two new databases for you to use. We now have the TAXRECON and DEFERRED_TAX databases available for your research enjoyment!

You can see them in the image above. I appended the word TEST to each of these just in case one of the tax experts tells me that I have made a mistake (because I didn’t know something I didn’t know I needed to know!).

To start with, we identified all 10-K filings made from 2019 through April 7th or so, 2023 (hint – these will be updated monthly). We eliminated all 10-Ks that were filed by a non-US filer in this first iteration, and those that were associated with multiple CIKs (more later on these choices). We then found the tax recon and deferred tax asset/liability tables in the 10-Ks (and Exhibit-13s), parsed and normalized them. We then matched the data from the html to the tagging in the iXBRL instance documents to collect the tag used in the XBRL. This tag is in the database in a name field. There is also a field called orig_row_label. This field has the label that was displayed in the html.

We were struggling with the best way to present this data. After consulting with a number of clients we decided to present it in a manner as close to the original organization as possible. Rather than using the original column labels, which are generally representative of a balance sheet date (deferred tax data) or fiscal period (tax recon data), we analyzed the column headings and labeled the column with the data for the fiscal year ended as CURRENTYEAR. We found as many as four periods presented, so the prior year data columns were labeled PRIORYEAR, TWOPRIOR and THREEPRIOR.

Adjacent to each data column we have columns that report the date values that would have been in the data value column heading. Because the taxrecon data summarizes data for a fiscal year we report CURRENTYEAR_START_DATE and CURRENTYEAR_END_DATE (and appropriate labels for the additional columns that exist). The deferred tax data represents balances as of a particular date so we label the date related columns as CURRENTYEAR_FYE, PRIORYEAR_FYE, . . .
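
If you want to see how that column scheme plays out in a query, here is a minimal sketch; the db path and table name are borrowed from later in this post, so treat them as assumptions:

import sqlite3

conn = sqlite3.connect(r"S:\directEDGAR_DATA\TAX_RECON_TEST.db")
cur = conn.cursor()
# Pull the current-year recon values along with the period the column covers
cur.execute("""
    SELECT orig_row_label, CURRENTYEAR, CURRENTYEAR_START_DATE, CURRENTYEAR_END_DATE
    FROM TAXRECON
    LIMIT 5
""")
for row in cur.fetchall():
    print(row)
conn.close()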

In those cases where the registrant reports their taxrecon data in both dollars and percentages we captured both and stacked them in the database. Boeing is one example. When Boeing filed their 10-K in 2020 for the fiscal year ended 12/31/2019 they began reporting their tax recon data in both percentages and dollars (prior years were reported in percentages).

Here is a partial view of how we organized this data – there is a field in the database that indicates the nature of the presentation (DOLLAR/PERCENT) at both the table level and the row level.

The rows above are not really adjacent to one another (not that it matters); I am just trying to illustrate that we captured all of the data and organized it, at least initially, by table type. I cut off the name column because I wanted the focus to be on the DOLLARS/PERCENT presentation – but you can see we have the label that was used in the html as well as the tag.

The next step in this process is to consolidate all of our existing taxrecon data and reorganize it for this format. While we do that we will also use the XBRL to pull the tags so we can add the name field for this data as well. We don’t have prior year deferred tax data so we will also be working to parse and add that.

One of the reasons I started mucking around with this stuff again in November was because of some assertions I had heard from the SEC and others about the quality of the data available through the iXBRL. I was actually pretty eager and imagined that we could make short work of this by identifying the taxrecon table in the iXBRL instance and speeding up our collection and normalization process. Once again, real-life XBRL fails to live up to its potential because of how it is created and vetted. One of the most important things we learned is that we absolutely cannot rely on indications in the XBRL about the sign (+/-) of the reported numeric values. We abandoned that idea fairly early and instead use the presentation information in the html to assign a sign to the data in our database. I don’t have the exact stats, but as I recall, when we compared signed values pulled from the html with the values from the XBRL instances, the sign discrepancy was greater than about 10%. This is brutal and very boring work. We persisted because we were hoping there was some clue somewhere that would allow us to rely on the iXBRL data. It was just too uncertain. The fact is that we (outsiders) don’t know for sure if a number should be reported as a subtraction or an addition, so we have to rely on the fact that presumably the totals were audited and the disclosures in the html were screened for accuracy.

To finish off this post I want to share my experience using ChatGPT to access this data. I had a sense from working in the middle of the data that the addition of the name field would save some effort on your part. I believe it does but not as much as I imagined. Anyway, once Manish had the final databases available I wanted to know how many unique values were present in the orig_row_label column. I was in a hurry and am still a real amateur with SQLITE so I asked for some help:

I asked ChatGPT the following – I have an sqlite3 db at “S:\directEDGAR_DATA\TAX_RECON_TEST.db” the db has a table named TAXRECON there is a field named name – I want to know how many distinct/unique values there are in this field using Python 3.9.

Below is the code that was generated. I connected to directEDGAR on AWS (the same way you would connect), opened Idle and pasted the code. This time it ran immediately.

import sqlite3

# Connect to the database
conn = sqlite3.connect("S:\directEDGAR_DATA\TAX_RECON_TEST.db")

# Create a cursor object to execute queries
cursor = conn.cursor()

# Execute a query to count the number of distinct values in the name field
query = "SELECT COUNT(DISTINCT orig_row_label) FROM TAX_RECON"
cursor.execute(query)

# Fetch the result and print it
result = cursor.fetchone()[0]
conn.close()
print(f"There are {result} distinct values in the name field.")

I have to admit – I was really disappointed – there are 15,626 unique values for name. That is clearly another significant problem with XBRL. That compares to the fact that there are 22,436 unique values for orig_row_label. So name reduces the noise but I was really hoping for a more significant decrease as compared to the row labels.

I asked for a code modification to generate a list of all unique values for name and their frequency. The query changed to:

# Execute a query to get the frequency of each unique value in the name field
query = "SELECT name, COUNT(*) FROM TAXRECON GROUP BY name"
cursor.execute(query)
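
I then iterated through the results to print all cases where the frequency was greater than 900 – a minimal sketch of that loop, assuming the cursor from the code above:

for tag, count in cursor.fetchall():
    if count > 900:
        print((tag, count))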

Here are those results:

('us-gaap:EffectiveIncomeTaxRateContinuingOperations', 8443)
('us-gaap:EffectiveIncomeTaxRateReconciliationAtFederalStatutoryIncomeTaxRate', 8509)
('us-gaap:EffectiveIncomeTaxRateReconciliationChangeInDeferredTaxAssetsValuationAllowance', 5494)
('us-gaap:EffectiveIncomeTaxRateReconciliationChangeInEnactedTaxRate', 2116)
('us-gaap:EffectiveIncomeTaxRateReconciliationForeignIncomeTaxRateDifferential', 3472)
('us-gaap:EffectiveIncomeTaxRateReconciliationNondeductibleExpense', 1572)
('us-gaap:EffectiveIncomeTaxRateReconciliationNondeductibleExpenseShareBasedCompensationCost', 2359)
('us-gaap:EffectiveIncomeTaxRateReconciliationOtherAdjustments', 5573)
('us-gaap:EffectiveIncomeTaxRateReconciliationOtherReconcilingItemsPercent', 1146)
('us-gaap:EffectiveIncomeTaxRateReconciliationShareBasedCompensationExcessTaxBenefitAmount', 950)
('us-gaap:EffectiveIncomeTaxRateReconciliationShareBasedCompensationExcessTaxBenefitPercent', 1297)
('us-gaap:EffectiveIncomeTaxRateReconciliationStateAndLocalIncomeTaxes', 7077)
('us-gaap:EffectiveIncomeTaxRateReconciliationTaxCredits', 1688)
('us-gaap:EffectiveIncomeTaxRateReconciliationTaxCreditsResearch', 1577)
('us-gaap:IncomeTaxExpenseBenefit', 6812)
('us-gaap:IncomeTaxReconciliationChangeInDeferredTaxAssetsValuationAllowance', 4761)
('us-gaap:IncomeTaxReconciliationChangeInEnactedTaxRate', 2166)
('us-gaap:IncomeTaxReconciliationForeignIncomeTaxRateDifferential', 3106)
('us-gaap:IncomeTaxReconciliationIncomeTaxExpenseBenefitAtFederalStatutoryIncomeTaxRate', 7124)
('us-gaap:IncomeTaxReconciliationNondeductibleExpense', 1628)
('us-gaap:IncomeTaxReconciliationNondeductibleExpenseOther', 924)
('us-gaap:IncomeTaxReconciliationNondeductibleExpenseShareBasedCompensationCost', 2414)
('us-gaap:IncomeTaxReconciliationOtherAdjustments', 4892)
('us-gaap:IncomeTaxReconciliationOtherReconcilingItems', 1775)
('us-gaap:IncomeTaxReconciliationStateAndLocalIncomeTaxes', 6213)
('us-gaap:IncomeTaxReconciliationTaxContingencies', 927)
('us-gaap:IncomeTaxReconciliationTaxCredits', 1602)
('us-gaap:IncomeTaxReconciliationTaxCreditsResearch', 1322)
('us-gaap:IncomeTaxReconciliationTaxExemptIncome', 1246)

ChatGPT & directEDGAR

I know from talking to my own colleagues in the College of Business at the University of Nebraska at Omaha that people fall along an interesting spectrum with respect to their comfort level with using different tools to collect data. In a perfect world we would just imagine the data and it would appear – I alluded to that in the intro to one of our user guides about a decade ago.

About a month ago I asked one of our team (Siva) to plan a demo of how a novice might use ChatGPT to help them for instance – write Python code to pull data from one of our databases. Siva gave us the demonstration this morning. My brain has been on fire since then. It was a fascinating introduction and before the call was finished I had agreed to buy everyone on the team an upgraded account ($20.00/month) for access to ChatGPT and we were starting to talk about the ways it could help us improve our productivity.

But back to the original use case. After learning from watching Siva, I decided to test it myself. So first, I dropped in the code I used when creating the DC database and asked ChatGPT to provide a user guide to explain the fields:

import sqlite3

conn = sqlite3.connect(r'E:\MP_TEMP\DIRECTOR_COMP_weekly.db')
cur = conn.cursor()

cur.execute("""CREATE TABLE IF NOT EXISTS directorCOMP(
              ROW_ID INTEGER PRIMARY KEY,
              CIK TEXT,
              DEID TEXT,
              RDATE TEXT,
              RID TEXT,
              NAME TEXT,
              YEAR INT,
              CASH INT,
              BONUS INT,
              NONEQUITY INT,
              NQDEFCOMP INT,
              STOCK INT,
              OPTION INT,
              OTHER INT,
              TOTAL INT,
              SEC_NAME TEXT,
              PERSON_CIK TEXT,
              GENDER TEXT,
              SINCE INT,
              AGE INT
              );
			   """)

After a few iterations – and let me say, ChatGPT knew that CIK was Central Index Key from the get-go, which was impressive to me – we had the following:

User Guide for the Director Compensation Database

Introduction: This database contains information about director compensation for various companies. The database table is named “directorCOMP”. The following is a guide to explain the fields in this database.

  1. “ROW_ID” – This field contains a unique identifier for each record in the table and is defined as an “INTEGER” type. The “PRIMARY KEY” constraint ensures that no two records can have the same value in this field.
  2. “CIK” – This field contains the Central Index Key (CIK) for a company and is defined as a “TEXT” type.
  3. “DEID” – This field contains a unique identifier for a director that helps to identify the source document from which the compensation information was pulled. The “DEID” field is defined as a “TEXT” type.
  4. “RDATE” – This field contains the date that compensation information was received and is defined as a “TEXT” type.
  5. “RID” – This field contains a unique identifier for compensation information and is defined as a “TEXT” type.
  6. “NAME” – This field contains the name of the director and is defined as a “TEXT” type.
  7. “YEAR” – This field contains the year for which compensation information is being reported and is defined as an “INT” type.
  8. “CASH” – This field contains the amount of cash compensation reported and is defined as an “INT” type.
  9. “BONUS” – This field contains the amount of bonus compensation reported and is defined as an “INT” type.
  10. “NONEQUITY” – This field contains the amount of non-equity incentive plan compensation reported and is defined as an “INT” type.
  11. “NQDEFCOMP” – This field contains the amount of non-equity incentive plan deferred compensation reported and is defined as an “INT” type.
  12. “STOCK” – This field contains the amount of stock award compensation reported and is defined as an “INT” type.
  13. “OPTION” – This field contains the amount of option award compensation reported and is defined as an “INT” type.
  14. “OTHER” – This field contains the amount of other compensation reported and is defined as an “INT” type.
  15. “TOTAL” – This field contains the total amount of compensation reported and is defined as an “INT” type.
  16. “SEC_NAME” – This field contains the name of the SEC filing and is defined as a “TEXT” type.
  17. “PERSON_CIK” – This field contains the CIK number for the director and is defined as a “TEXT” type.
  18. “GENDER” – This field contains the gender of the director and is defined as a “TEXT” type. The allowed values for this field are “M” for male and “F” for female.
  19. “SINCE” – This field contains the year that the director has been with the company and is defined as an “INT” type.
  20. “AGE” – This field contains the age of the director and is defined as an “INT” type.

My goal with this was to reduce fear when interacting with our platform using code. So I asked the following:

Suppose I have a CSV file that has a column labeled CIK and a column labeled YEAR. The file is stored on an APPSTREAM user instance “D:\PhotonUser\My Files\Temporary Files\mysample.csv”. The path to the Director Compensation database is “S:\directEDGAR_DATA\DIRECTOR_COMP.db”. Write a Python program to save the data from the Director Compensation database that relates the CIKs and Years in my file and I want the results saved to a CSV file named director_data.csv.

I learned something new – that I know will improve my productivity:

import csv
import sqlite3

# Connect to the Director Compensation database
conn = sqlite3.connect(r'S:\directEDGAR_DATA\DIRECTOR_COMP.db')
cur = conn.cursor()

# Read the data from the CSV file into a list
cik_years = []
with open(r'D:\PhotonUser\My Files\Temporary Files\mysample.csv', 'r') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        cik_years.append((row['CIK'], row['YEAR']))

# Execute a query to retrieve the data from the Director Compensation database that matches the CIKs and Years in the CSV file
cur.execute("""
    SELECT *
    FROM directorCOMP
    WHERE (CIK, YEAR) IN ({})
""".format(','.join(['(?,?)'] * len(cik_years))), cik_years)

# Write the results to a CSV file
with open('director_data.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    
    # Write the header row
    writer.writerow([column[0] for column in cur.description])   # I learned about the description method of the cur obj
    
    # Write the data rows
    for row in cur:
        writer.writerow(row)

# Commit the changes and close the connection
conn.commit()
conn.close()

Anyway – there is some code – will it work? I was eager to find out, so I created a sample with more than 9,000 CIK-YEAR pairs, saved it per the details above, started an Idle (Python) session as if I were a directEDGAR user, pasted the code and ran it.

I understand the issue, and this is not the way I would have written the code. My goal though was to see if a novice Python user who doesn’t necessarily want to learn Python – but wants to collect the data – could use ChatGPT to adjust the code. So I posted the error message into ChatGPT.

I looked at the code and was fairly certain I would see the error again – but this time 2,000/1,000. But remember, I was curious about whether I could power through this and get some output. I was right; here is the error message:

I pasted that message into ChatGPT, hit the Regenerate response button, and saw some code I thought would work!

import csv
import sqlite3

# Connect to the Director Compensation database
conn = sqlite3.connect(r'S:\directEDGAR_DATA\DIRECTOR_COMP.db')
cur = conn.cursor()

# Read the data from the CSV file into a list
cik_years = []
with open(r'D:\PhotonUser\My Files\Temporary Files\mysample.csv', 'r') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        cik_years.append((row['CIK'], row['YEAR']))

# Split the list of values into smaller chunks to avoid exceeding the maximum number of bindings allowed
chunk_size = 1000
for i in range(0, len(cik_years), chunk_size):
    chunk = cik_years[i:i + chunk_size]
    
    # Execute a query to retrieve the data from the Director Compensation database that matches the CIKs and Years in the CSV file
    cur.execute("""
        SELECT *
        FROM directorCOMP
        WHERE (CIK, YEAR) IN ({})
    """.format(','.join(['(?,?)'] * len(chunk))), [item for sublist in chunk for item in sublist])
    
    # Write the results to a CSV file
    with open(r'D:\PhotonUser\My Files\Temporary Files\director_data.csv', 'a', newline='') as csvfile:
        writer = csv.writer(csvfile)

        # Write the header row
        if i == 0:
            writer.writerow([column[0] for column in cur.description])

        # Write the data rows
        for row in cur:
            writer.writerow(row)

# Commit the changes and close the connection
conn.commit()
conn.close()

I am going to tell you that the statement # Commit the changes scared me – you can’t commit changes here, and there are no changes to commit. But the OS safeguards and privilege management help with that issue.

This was a successful experiment – ChatGPT wrote the code for me to pull the director compensation data. I want to emphasize: I made no alterations to the code ChatGPT delivered. All I did was describe my goal, define the fields, and specify the path to the db, the path to my sample, and the destination. So we need to get some field definition files up there so you can do the same. This was exciting. I have to stop playing with this, but you can expect some more posts describing how to develop code to process and analyze our artifacts.

Updated Structure for Full-Text Searchable Comment Letters (UPLOAD filing type) in Process and Partially Available

For quite some time I have wanted to update the way we make the Comment Letters available, and because of this I have delayed updating the collection. I received an email from a researcher last week who needed our comment letters to be updated, and since I have been struggling with another project (and she needed these updated) I turned my focus to the Comment Letters.

I had two primary goals for this project. First, I discovered some time ago that the SEC sometimes included a field in the header associated with the Comment Letters that indicated (pointed to) the filing that was the subject of the letter – <PUBLIC-REFERENCE-ACC>. I wanted to add these to the metadata we include with the filings. The second thing I wanted to do was to systematically replace all of the pdf comment letters with a txt version, since the txt version is friendlier to your use of Python to do finer work with these filings. Sometimes the SEC includes a TEXTEXTRACT version of the comment letter, and my intention was to use that version if available. Otherwise we will run the pdf form of the letters through one of our tools and use that rather than the pdf version.

I will have to plead incompetence for not knowing earlier that the <PUBLIC-REFERENCE-ACC> field was available and that we were not incorporating it into the platform for your use. When we develop features we take what we think is a reasonable sample of the population and, based on what we discover as we work with that sample, we make decisions about how to move forward. You have read my comments in the past where I acknowledge – I don’t know what I don’t know.

Now that I know of this tag, I set to work this weekend to begin incorporating it into the platform. Guess what – the SEC stopped using this tag before January 2021. From what I can tell, the last comment letter that included this field was uploaded on 11/12/2020. Frankly, the frequency of this field being included in the comment letters dropped off significantly after September 2020.

I am building a database of all of the metadata for these filings, so when the code first finished running on Saturday I was sure I had made some Burch error (this is an error that only I can make). After I spot-checked a few of the later header files and did not see the field, I decided to check all of the header files for filings made since 2020, by year, to see if maybe (I was hopeful) there was a new field. So I ran code to identify each unique field in every header file by year. Guess what – I did not make a Burch error. There were no filings with either the <PUBLIC-REFERENCE-ACC> field or a new field that provided this info. There were 29 unique fields across all of the filings made in 2020 and only 28 unique fields in all of the filings made in 2021 and 2022. Serious bummer.
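
The check itself was a simple scan; a minimal sketch of the idea (the paths and file naming here are hypothetical):

import re
from pathlib import Path

fields = set()
for header in Path(r"D:\headers\2021").glob("*.hdr"):
    text = header.read_text(encoding="utf-8", errors="ignore")
    # Collect every distinct SGML field name that appears in the header
    fields.update(tag.lower() for tag in re.findall(r"<([A-Za-z-]+)>", text))
print(sorted(fields))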

{'acceptance-datetime', 'filing-values', 'private-to-public', 'type', 'conformed-name', 'business-address', 'state', 'street1', 'cik', 'fiscal-year-end', 'assigned-sic', 'filed-for', 'form-type', 'date-changed', 'sec-header', 'phone', 'public-document-count', 'city', 'zip', 'former-company', 'filing-date', 'company-data', 'mail-address', 'accession-number', 'state-of-incorporation', 'former-conformed-name', 'street2', 'irs-number'}

I did some searches on the SEC website and could not find any indication about this. The latest version of the PDS Technical Specification document still contains a reference to this field. This is interesting because the latest version is dated 12/14/2020, which is after the field stopped showing up.

It is really interesting – that was a very useful feature and it is gone. I have not found anyone who is happy with the redesign of the EDGAR website/interface. The project I have been stuck on is trying to improve metadata collection from 10-K filings, and I have finally had to decide that we can’t capture everything we want to capture with the confidence needed to distribute it through the platform. I will post more on that later. I am just curious, though, who at the SEC makes these decisions and why.

Enough ranting. Let me address what has been accomplished. First, there is a new index, UPLOAD Y2016-Y2025. The label is a little bit misleading – today it only has filings from 2020 through the end of 2022. The balance will be moved from the UPLOAD Y2016-Y2020 index within the next week. That process is running now, but I wanted to have the filings our client needed available immediately.

Second, there is a new database – COMMENT_LETTER_DETAILS:

This contains a number of fields that I think might be useful. The SEC’s decision to stop providing the tag is going to make this more painful to complete than I hoped. However, for all the 2020 filings with the tag present we added a link to the EDGAR filing that was the subject of the Comment Letter. We also added the type of filing that was the subject. We will do this for every filing we can back to 2005 (which is when Comment Letters first became available).

When you do a search, remember that the RDATE/CDATE have a different meaning for Comment Letters. The SEC does their research and sends a Comment Letter to a registrant. The day the letter is sent is considered the ‘Filed Date’ for purposes of securities law, and on EDGAR you will find the Comment Letter’s filed date matches the letter date. However, the Comment Letters are not visible on EDGAR for review until sometime later. The window has shortened considerably since the early days; however, it can still extend several months. The RDATE value we use is the date the header indicates the filing status changed from PRIVATE-TO-PUBLIC. The CDATE is the SEC EDGAR reported filing date. Thus, if you expect a market reaction, presumably that would come around the RDATE.

DC Database Updates to Occur Weekly

Our team hit the commit button on a new ‘job’ early this morning that will update the DC database available through the application every Sunday morning at 1:14 AM CT. Getting to this stage was a bigger project than we initially imagined, but we learned a lot, so I hope what we learned translates well to getting the EC data ready for the same scheduling.

To see the benefit of this consider that Apple Inc (CIK:320193) filed their proxy last Thursday. As you can see from the image below – their Director Compensation data is ready for use:

To be absolutely clear – that image came from using the DB Query feature on our application as I was writing this post.

If you are curious as to why we are only updating weekly – read on!

We process the data in two stages. First, we have the processes that extract and normalize the data as it was reported in the table in the filing. During this phase we also populate the SEC_NAME, PERSON_CIK and GENDER fields. This is what gets immediately distributed to the old platform and to our API customers. Overnight we then attempt to automatically populate the YEAR and SINCE fields. If we can’t populate those fields using data that exists in our databases, we queue them up for a human to populate after reviewing the source document. It can get really challenging to decide whether or not to pass on these fields. So by waiting until Sunday we are hoping it is more likely that we will have finished whatever review is required and either populated those fields or signaled NOT AVAILABLE. I will warn, though, that during busy periods (peak proxy filing ‘season’ runs from late March until the end of April) we will get further behind. And we do the easy ones first and set the hard ones aside until we can devote the right expertise to reviewing the source documents. An easy one is where the disclosure of age and tenure is included in the source document. However, there are more than 400 cases each year where this data is not disclosed for an individual director. Whether we have populated those fields or not, the data will be available – if we are still hopeful that we can get a measure of AGE or SINCE then the field will be blank. If we have determined that we are not likely to find evidence to populate it, we will indicate that with a NOT AVAILABLE – MONTH YEAR message.

Let me bore you with an example. OneWater Marine Inc. filed a proxy on 1/13/2023. A Mr. Greg A Shell was listed as a director but there were no details about his age or when his tenure began. It turns out he resigned in November because of a change in his primary employment. We found the announcement of his appointment to the board in March. To find his age we had to cross our fingers and hope to find another disclosure of age with a date, to see if we could extrapolate to the proxy filing date. We were lucky to find an S-1 filing for an unrelated entity that reported his age (44) as of 12/31/2019, so the field could be populated (47). I think it took 20 minutes to find this fact. While it may seem obvious to populate the SINCE value with the year the director first shows up, we have determined that introduces error.

Another issue to be sensitive to is that this data can change over time. For example, AGE and SINCE values will be constantly updated. But there are even bigger issues that can cause change. Items included in Part III of the 10-K are often omitted because the issuer indicates they are going to take advantage of the grace period that allows them to incorporate the information from the proxy if it is filed within 120 days after the fiscal year-end. However, if something causes the company to delay the filing of the proxy (DEF 14A), they are obligated to include the information from Items 10 – 14 in an amended 10-K (10-K/A). Further, there are issuers who will make the initial disclosure in a 10-K and then still file a proxy by the deadline (or later). We parse and normalize the data when it is filed and only respond to multiples when a second filing is made that includes the same data. It is a little complicated and tedious to fully describe how we handle these. The impact is that once a duplicate is filed we have some tests that run, sketched below. If the first disclosure was in a 10-K or 10-K/A and the second is a DEF 14A, we pull the first disclosure and rotate in the second disclosure. If both disclosures were from a DEF 14A we will remove both from the database until someone can verify the reason for the second. If the second is because the original meeting date was canceled, then we use the data from the second disclosure. If the second disclosure is because of a special meeting, we confirm the originally scheduled meeting took place, delete the second disclosure and push the first disclosure back into rotation. There are even other cases. I told you this was tedious!
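
A minimal sketch of those tests – the function, arguments and return values here are illustrative (a restatement of the rules just described, not our production code):

def resolve_duplicate(first_form, second_form, reason_for_second=None):
    # First disclosure in a 10-K or 10-K/A, second in a DEF 14A:
    # pull the first and rotate in the second.
    if first_form in ("10-K", "10-K/A") and second_form == "DEF 14A":
        return "use second disclosure"
    # Both disclosures from a DEF 14A: hold both until a human verifies why.
    if first_form == "DEF 14A" and second_form == "DEF 14A":
        if reason_for_second == "original meeting canceled":
            return "use second disclosure"
        if reason_for_second == "special meeting":
            return "confirm original meeting; restore first disclosure"
        return "remove both pending review"
    return "other case - route to review"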

There are more exciting things coming in the next weeks. Stay posted.

More Errors in SEC Filings

I have been pounding on the XBRL data as we intend to make some subset of it available. One of the issues that is particularly important to me is to improve the way we make metadata available to you to organize your search. I don’t want to get bogged down in a deep discussion about our plans at this moment; I would just like to observe that we for sure want the Document and Entity Information (DEI) at your fingertips so you can better manage your search.

I created two datasets, one with all of the DEI numeric data and another with all of the text data. We knew some of the numeric values had errors. For example, did you know that EBay once reported that their public float was $31,354,367,947,000,000,000! Anyway, I was playing with the dataset in an SQLITE browser tool and decided to test a filter setting the ICFR flag to false and the EntityFilerCategory to LARGE ACCELERATED FILER. I expected none or just a few. Instead, the database returned 97 entities where the DEI table indicated that the entity was a LARGE ACCELERATED FILER and the IcfrAuditorAttestationFlag is set to false. Because it is a requirement that large accelerated filers have an audit of their internal controls, I thought this was curious. Here is an image of part of that query.
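
If you want to replicate that filter, here is a sketch in Python/sqlite3 – the db path, table name and exact field spellings are my assumptions about the datasets described above:

import sqlite3

conn = sqlite3.connect(r"S:\directEDGAR_DATA\DEI_TEXT.db")  # hypothetical path
cur = conn.cursor()
cur.execute("""
    SELECT *
    FROM DEI_TEXT
    WHERE EntityFilerCategory = 'LARGE ACCELERATED FILER'
      AND IcfrAuditorAttestationFlag = 'false'
""")
rows = cur.fetchall()
print(len(rows))  # the post reports 97 such entities
conn.close()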

I of course am hugely curious about that result. Frankly, my first thought was that this was another unexpected issue in the source files that I would have to pore through to sort out. The code to manage this process was really tricky because of filer-specific idiosyncratic choices. Guess what – most of these were coded wrong, either by an employee of the filer making an error or by some hiccup between the filer’s form creation and their EDGAR-izing software. For example, you can see clearly in this audit report that the auditor issued an unqualified opinion on the Company’s internal control over financial reporting.

In some of the cases the auditor concluded that there were one or more material weaknesses in the internal controls, but my read of the EDGAR Filer Manual suggests to me that the flag is to indicate whether or not there was an audit of internal controls, not the results of the audit.

My quick scan (using directEDGAR of course) indicates that 91 of the 97 cases were miscoded by the filers. My present thought is that we need to push this data out as-reported. I am going to muck around with it a bit more before we push it into the platform. Once again, because of the updates, once it is available it will be visible in the Database to Query area of the database tool.

I will observe that DraftKings is the result of a SPAC deal and I think, based on the timing etc. of their acquisition of the old DraftKings business, that they were exempt from the requirement to have an audit of internal controls for their 2021 financials.

I hope you have a Happy New Year.