One of our clients sent an email this morning asking for help pulling from 8-K filings the specific text used to explain the reason for the filing. To protect their research idea, I am going to modify the example and pretend I want to pull the explanation associated with auditor changes. So here is the scenario: I want to do sentiment analysis on all explanations of auditor changes included in 8-K filings. The client has used Python in the past, but I could sense some real apprehension as I was explaining how to do this. So I decided to use ChatGPT to generate the code and then share the chat results with them. I decided to blog about this experience because I think it should help reduce some apprehension across our user base for this and similar problems. Their response was kind of cool: "This may even work for a techno imbecile like me." Let me assure you, this person is not an imbecile. But I understand the apprehension.
Step 1 – Identify all 8-K filings that disclose an auditor change. Easy enough – run a search in one or more 8-K collections for (DOCTYPE contains(8K*)) and (ITEM_4_01 contains(YES)). For this real example I selected Y2005-Y2015, Y2016-Y2020, Y2021-Y2025 and Y2023-Y2023. I hit the Perform Search button at 2:43 PM. I am sharing that because I hope to be finished with this in less than an hour.
Here is an image of the search results as displayed in our software:

This makes me a bit nervous – we found 26,940 documents, and I am a little concerned about moving that many files at once. I am going to do this by subsets of years, so I am going to rerun the search with only Y2016-Y2020, Y2021-Y2025 and Y2023-Y2023 selected. This time the search returned 6,162 documents – that feels like a more appropriate size.

Step 2 – We need to extract the text content of those documents. From the Extraction menu select DocumentExtractionTextOnly, then use the folder selector tool to create a folder in your Temporary Files folder and select it.

This will take a few minutes – the application has to pull each document out of the HTML, clean it up, and then save it to the directory. I generally expect the process to create 300-500 files a minute. So while that is running I am going to go to ChatGPT and ask for some help. Here was my initial request to ChatGPT:
I have some txt files at D:\PhotonUser\My Files\Temporary Files\Group1
The files have a line of text that begins with
Item 4.01. Changes in Registrant’s Certifying Accountant
However, I cannot be sure about the white space or the capitalization.
After that line is additional text. In most cases I want to capture all text from the Item 4.01 line until the next line that begins with a pattern of ITEM #.# or ITEM #.## (I put this in caps for emphasis but again I am not certain about capitalization) or spaces. In some cases, though, the end will be indicated by a line that begins with the word SIGNATURE or SIGNATURES
I want the text between the begin and end locations saved to a new directory named
D:\PhotonUser\My Files\Temporary Files\PARSED_GROUP1
And any files that did not yield results should be moved to a new directory called
D:\PhotonUser\My Files\Temporary Files\PROBLEM_GROUP1
It took 21 minutes to extract the 6,162 8-K filings that were in my initial search – so that was on the low end of my expectation (a bit less than 300/minute). Here is the code that ChatGPT generated for me:
import os
import re
import shutil

# Define the source directory
source_dir = r"D:\PhotonUser\My Files\Temporary Files\Group1"

# Define the destination directories
parsed_dir = r"D:\PhotonUser\My Files\Temporary Files\PARSED_GROUP1"
problem_dir = r"D:\PhotonUser\My Files\Temporary Files\PROBLEM_GROUP1"

# Create the destination directories if they don't exist
os.makedirs(parsed_dir, exist_ok=True)
os.makedirs(problem_dir, exist_ok=True)

# Define the patterns for identifying the start and end of the desired text
start_pattern = re.compile(r'(?i)item\s*4\.\d+')
end_pattern = re.compile(r'(?i)item\s*\d+\.\d+\.?(?!\d)|^signature[s]?')

# Iterate over the files in the source directory
for file_name in os.listdir(source_dir):
    file_path = os.path.join(source_dir, file_name)
    # Check if the item is a file
    if os.path.isfile(file_path):
        # Read the file contents
        with open(file_path, 'r') as file:
            content = file.read()
        # Find the start and end indices of the desired text
        start_match = start_pattern.search(content)
        if start_match:
            start_index = start_match.end()
            # Find the end index
            end_match = end_pattern.search(content[start_index:])
            if end_match:
                end_index = start_index + end_match.start()
            else:
                # If the end pattern is not found, check for "SIGNATURE" or "SIGNATURES"
                signature_match = re.search(r'(?i)^signature[s]?', content[start_index:])
                if signature_match:
                    end_index = start_index + signature_match.start()
                else:
                    end_index = len(content)
            # Extract the desired text
            parsed_text = content[start_index:end_index]
            # Save the extracted text to a new file
            parsed_file_path = os.path.join(parsed_dir, file_name)
            with open(parsed_file_path, 'w', encoding='utf-8') as parsed_file:
                parsed_file.write(parsed_text)
            print(f"File '{file_name}' processed and saved to '{parsed_file_path}'")
        else:
            # Move files that did not yield results to the problem directory
            problem_file_path = os.path.join(problem_dir, file_name)
            shutil.move(file_path, problem_file_path)
            print(f"File '{file_name}' moved to '{problem_file_path}'")
As soon as I ran the code I got an error message:
========== RESTART: D:/PhotonUser/My Files/Temporary Files/parse8k.py ==========
Traceback (most recent call last):
  File "D:/PhotonUser/My Files/Temporary Files/parse8k.py", line 32, in <module>
    content = file.read()
  File "C:\Program Files\Python310\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 1173: character maps to <undefined>
So I asked ChatGPT for help and got this response:
The error message suggests that the file you're trying to read has a character that cannot be decoded using the 'cp1252' encoding (also known as Windows-1252). To resolve this issue, you can modify the script to handle the file using a different encoding, such as 'utf-8'.
Please replace the following line in the script:
with open(file_path, 'r') as file:
with:
with open(file_path, 'r', encoding='utf-8', errors='ignore') as file:
This change allows the script to read the file using the 'utf-8' encoding while ignoring any characters that cannot be decoded. The errors='ignore' argument ensures that any problematic characters are skipped without raising an exception.
Save the script after making the modification and run it again. It should now be able to read the file without encountering the UnicodeDecodeError.
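If you are curious what was actually happening, here is a tiny sketch of the failure mode. The byte value comes straight from the traceback above; 0x90 has no character assigned in cp1252, so a plain decode raises, while errors='ignore' simply drops the undecodable byte (the sample bytes here are just for illustration):

# Byte 0x90 is undefined in cp1252, so decoding raises an error,
# while errors='ignore' drops the undecodable byte instead.
data = b"Item 4.01 \x90 Changes"
try:
    data.decode('cp1252')
except UnicodeDecodeError as exc:
    print(exc)  # 'charmap' codec can't decode byte 0x90 ...
print(data.decode('cp1252', errors='ignore'))  # Item 4.01  Changes

The same errors='ignore' escape hatch works with 'utf-8' as well; the tradeoff is that a few stray characters get dropped silently.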
I made that change and we went to town. There were 321 files in the problem folder, so I shared the following with ChatGPT:
one of the files in the PROBLEM_FOLDER had this after the text I wanted:
SIGNATURE
Pursuant to the requirements of the Securities Exchange Act of 1934, the Registrant has duly caused this report to be signed on its behalf by the undersigned hereunto duly authorized.
Dated: July 10, 2020
BARNWELL INDUSTRIES, INC.
By:
/s/ Russell M. Gifford
Name:
Russell M. Gifford
The response was:
In this updated version, the end pattern has been modified to use a positive lookahead assertion (?=.*?signature[s]?\b) to match lines that contain the word "SIGNATURE" or "SIGNATURES" regardless of their position within the line.
Please give this modified script a try, and let me know if you encounter any further issues.
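If you want to convince yourself the lookahead does what ChatGPT claims, a quick test along these lines works. This is a minimal sketch; note that I am adding re.MULTILINE here so the ^ anchor applies to every line, which is how the final script uses the pattern in its fallback search, and the sample text is a made-up stand-in for a filing:

import re

# The revised end pattern: either the next ITEM #.# heading, or a line
# that contains "signature"/"signatures" anywhere within it.
end_pattern = re.compile(r'(?i)item\s*\d+\.\d+\.?(?!\d)|^(?=.*?signature[s]?\b)',
                         re.MULTILINE)

sample = ("the reasons for the change in certifying accountant...\n"
          "SIGNATURE\n"
          "Pursuant to the requirements of the Securities Exchange Act of 1934...\n")

match = end_pattern.search(sample)
if match:
    print(sample[:match.start()])  # prints only the text before the SIGNATURE line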
It is 4:02, so I was at this longer than I hoped. There was some other back and forth with ChatGPT, but I am pretty pleased with the results. Here is a screenshot of a review of the parsed files in the SmartBrowser:

I tried to be as naive as possible, but I was also persistent. Because directEDGAR allows you to first identify the relevant 8-K filings and then extract the text from them, we saved significant time. With that time savings we can afford to be persistent with ChatGPT to make sure we get it right.
Here is the last iteration of the code:
import os
import re
import shutil

# Define the source directory
source_dir = r"D:\PhotonUser\My Files\Temporary Files\Group1"

# Define the destination directories
parsed_dir = r"D:\PhotonUser\My Files\Temporary Files\PARSED_GROUP1"
problem_dir = r"D:\PhotonUser\My Files\Temporary Files\PROBLEM_GROUP1"

# Create the destination directories if they don't exist
os.makedirs(parsed_dir, exist_ok=True)
os.makedirs(problem_dir, exist_ok=True)

# Define the patterns for identifying the start and end of the desired text
start_pattern = re.compile(r'(?i)item\s*4\.\d+')
end_pattern = re.compile(r'(?i)item\s*\d+\.\d+\.?(?!\d)|^(?=.*?signature[s]?\b)')

# Iterate over the files in the source directory
for file_name in os.listdir(source_dir):
    file_path = os.path.join(source_dir, file_name)
    # Check if the item is a file
    if os.path.isfile(file_path):
        # Read the file contents
        with open(file_path, 'r', encoding='cp1252', errors='ignore') as file:
            content = file.read()
        # Find the start and end indices of the desired text
        start_match = start_pattern.search(content)
        if start_match:
            start_index = start_match.end()
            # Find the end index
            end_match = end_pattern.search(content[start_index:])
            if end_match:
                end_index = start_index + end_match.start()
            else:
                # If the end pattern is not found, check for "SIGNATURE" or "SIGNATURES"
                signature_match = re.search(r'(?i)^(?=.*?signature[s]?\b)', content[start_index:], re.MULTILINE)
                if signature_match:
                    end_index = start_index + signature_match.start()
                else:
                    end_index = len(content)
            # Extract the desired text
            parsed_text = content[start_index:end_index]
            # Save the extracted text to a new file
            parsed_file_path = os.path.join(parsed_dir, file_name)
            with open(parsed_file_path, 'w', encoding='utf-8') as parsed_file:
                parsed_file.write(parsed_text)
            print(f"File '{file_name}' processed and saved to '{parsed_file_path}'")
        else:
            # Move files that did not yield results to the problem directory
            problem_file_path = os.path.join(problem_dir, file_name)
            shutil.move(file_path, problem_file_path)
            print(f"File '{file_name}' moved to '{problem_file_path}'")
I still have another 20,778 to complete. But the hard work is finished. I started extracting the next group of 8-K filings at 4:05 – this time I am just going to extract everything found in the Y2005-Y2015 archive.
Okay, finished at 5:38 PM. It took about 70 minutes to pull the relevant filings, and then I had to address some new errors that came up because of structural differences in those 8-K filings. I did more back and forth with ChatGPT as I noticed filings that ended up in the problem group.