A client reached out recently asking for help developing a strategy to collect the record dates for annual meetings. They expected to run a search for the phrase “Record Date” and then spend significant time reading the results and transferring each date to their data collection worksheet. This client had no significant experience with Python. I wanted to demonstrate how careful use of AI tools could really accelerate this work. I also wanted to help them dip their toes into using Python.
We ran a search on the 2025 proxy archive for ((DOCTYPE contains(DEF)) or (DOCTYPE contains(PRE)) ) and (date(January 1 2025 to december 31 2025) w/10 (record date) ). When we did this, the only proxies available were those filed through 4/11/2025, but the record date is expected to be after the proxy filing, and they needed to do this for all years back to 2014. I was helping them create a pattern that could be used again. Here is the result of that search:
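Because the same search has to be repeated for every year back to 2014, the query text can be generated rather than retyped. The sketch below only builds the search strings, modeled on the query quoted above; the helper name build_record_date_query is hypothetical, and the strings would still need to be pasted into the application by hand.

```python
def build_record_date_query(year):
    """Return the proxy search string for one calendar year (hypothetical helper)."""
    # Restrict the search to DEF and PRE proxy document types.
    doc_filter = "((DOCTYPE contains(DEF)) or (DOCTYPE contains(PRE)))"
    # Require a date from that year within 10 words of "record date".
    date_filter = (
        f"(date(January 1 {year} to December 31 {year}) w/10 (record date))"
    )
    return f"{doc_filter} and {date_filter}"


# One query per year, 2014 through 2025.
queries = {year: build_record_date_query(year) for year in range(2014, 2026)}
for year, query in queries.items():
    print(year, query)
```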
I showed them how to set the span of the context extraction – for something like this a tighter span is generally better. I set the span to 3 by using the File/Options selection from the application menu.
We then used the Extraction\ContextExtraction feature to save the results to the Temporary Files folder on their instance. Here is a screenshot of the file contents centered on the disclosure from Roblox’s proxy as seen in the image above.
We then brainstormed with ChatGPT. I created a small sample of the file and, since this was taking place in a conversation, I was not trying to be formal. I wanted the client to see how simple the process could be. Here is the prompt I gave it:
Here is a sample CSV file that was created pulling the relevant context from a search using the directEDGAR platform. (note I passed a sample of the csv file)
The CONTEXT column is the text from the document.
The FILENAME defines the document.
Notice that one document can have multiple context outputs and they might or might not overlap.
I want to use Python to analyze the context to determine and have included in a separate column the date, in this example the date is Month, dd, YYYY but there can be variations - I have seen m/d/yyyy and sometimes the comma is missing after the date in the Month dd, YYYY format.
If there is an explicit reference to either record or record date in a particular single context extraction I want that noted, if there is not an explicit reference to a record date I want that noted. My actual sample is more than 3,000 rows so I thought it best to offer just a small sample
Note - my preference is to use the csv module to read and write
We looked at the initial results. ChatGPT pulled the dates, but the formats varied, so we asked:
Lets normalize the dates in m/d/y format
Here is the resulting code:
import csv
import re
from datetime import datetime


def extract_date(text):
    # Return the first date found in the text, or "" if there is none.
    date_patterns = [
        r'\b(?:January|February|March|April|May|June|July|August|September|October|November|December)\s+\d{1,2},?\s+\d{4}',
        r'\b\d{1,2}/\d{1,2}/\d{4}'
    ]
    for pattern in date_patterns:
        match = re.search(pattern, text, re.IGNORECASE)
        if match:
            return match.group(0)
    return ""


def normalize_date(date_str):
    # Convert a recognized date string to m/d/yyyy with no leading zeros.
    date_formats = [
        "%B %d, %Y",  # March 27, 2025
        "%B %d %Y",   # March 27 2025
        "%m/%d/%Y",
    ]
    for fmt in date_formats:
        try:
            parsed_date = datetime.strptime(date_str.strip(), fmt)
        except ValueError:
            continue
        # Build the string by hand: the "%-m" style codes only work on
        # Unix-like systems and raise ValueError on Windows.
        return f"{parsed_date.month}/{parsed_date.day}/{parsed_date.year}"
    return date_str  # Unrecognized format: leave unchanged


def has_record_mention(text):
    # 'record' also covers 'record date'; both are kept for readability.
    keywords = ['record date', 'record']
    text_lower = text.lower()
    return any(keyword in text_lower for keyword in keywords)


def process_file(input_path, output_path):
    with open(input_path, mode='r', newline='', encoding='utf-8') as infile, \
         open(output_path, mode='w', newline='', encoding='utf-8') as outfile:
        reader = csv.DictReader(infile)
        fieldnames = reader.fieldnames + ['EXTRACTED_DATE', 'RECORD_DATE_MENTIONED']
        writer = csv.DictWriter(outfile, fieldnames=fieldnames)
        writer.writeheader()
        for row in reader:
            context = row.get('CONTEXT', '')
            date_found = extract_date(context)
            row['EXTRACTED_DATE'] = normalize_date(date_found) if date_found else ""
            row['RECORD_DATE_MENTIONED'] = "Yes" if has_record_mention(context) else "No"
            writer.writerow(row)


input_csv = "record_dates_chat_gpt_sample.csv"  # Change to your input file
output_csv = "record_dates_processed_normalized.csv"  # Desired output file
process_file(input_csv, output_csv)
print(f"Processing complete. Output saved to '{output_csv}'")
Here is the result of running their input file through that code:
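Since one filing can produce several context rows, and those rows may overlap, a possible follow-up step is to collapse the processed output to one row per FILENAME. This is a minimal sketch, not something we ran in the session. The function summarize_by_filename and the RECORD_DATES and NEEDS_REVIEW columns are hypothetical names. It assumes a date counts only when its row also mentions "record", and that a filing with anything other than exactly one distinct date should be checked by hand.

```python
from collections import defaultdict


def summarize_by_filename(rows):
    """Collapse context rows into one summary dict per FILENAME."""
    dates_by_file = defaultdict(list)
    for row in rows:
        # Touch the key so filings with no usable date still appear.
        dates = dates_by_file[row["FILENAME"]]
        date = row.get("EXTRACTED_DATE", "")
        mentioned = row.get("RECORD_DATE_MENTIONED") == "Yes"
        # Keep each distinct date once, in the order it was first seen.
        if date and mentioned and date not in dates:
            dates.append(date)
    summary = []
    for filename, dates in dates_by_file.items():
        summary.append({
            "FILENAME": filename,
            "RECORD_DATES": "; ".join(dates),
            # Zero dates or conflicting dates both call for a manual look.
            "NEEDS_REVIEW": "No" if len(dates) == 1 else "Yes",
        })
    return summary


# Small in-memory demonstration with made-up rows.
sample_rows = [
    {"FILENAME": "a.htm", "EXTRACTED_DATE": "4/2/2025", "RECORD_DATE_MENTIONED": "Yes"},
    {"FILENAME": "a.htm", "EXTRACTED_DATE": "4/2/2025", "RECORD_DATE_MENTIONED": "Yes"},
    {"FILENAME": "b.htm", "EXTRACTED_DATE": "5/29/2025", "RECORD_DATE_MENTIONED": "No"},
]
summary = summarize_by_filename(sample_rows)
for item in summary:
    print(item)
```

In practice the rows would come from csv.DictReader on the processed file, and the summary could be written back out with csv.DictWriter.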
We are not done yet. The client astutely noticed that this was not the whole collection of proxy filings, so we needed to see what we missed. I ran a new search with them: xfirstword and not (date(January 1 2025 to December 31 2025) w/10 record date) and ( (DOCTYPE contains(def)) or (DOCTYPE contains(PRE)) ). This identified the filings that did not match the criteria. We used the infamous CTRL+F to iterate through the documents and learned two things. First, there is additional variety: we saw language such as “stockholders as of record on someDate”. Second, for the most part the record date was not set, so we saw many of these examples.
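The extract_date function in the script takes the first date it finds in a context, which may be the meeting date rather than the record date when both appear. One way to handle that, and phrasing such as "of record on", is to pick the date closest to the word "record". The sketch below is an assumption about how that could be done, not part of the code ChatGPT produced. The name date_nearest_record is hypothetical, and distance is measured roughly, from the start of each date to the start of each "record" match.

```python
import re

# Same two date shapes as the script: "Month d, yyyy" (comma optional) and m/d/yyyy.
DATE_RE = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|September|"
    r"October|November|December)\s+\d{1,2},?\s+\d{4}\b"
    r"|\b\d{1,2}/\d{1,2}/\d{4}\b",
    re.IGNORECASE,
)
RECORD_RE = re.compile(r"\brecord\b", re.IGNORECASE)


def date_nearest_record(text):
    """Return the date string closest to a 'record' mention, or '' if none."""
    record_positions = [m.start() for m in RECORD_RE.finditer(text)]
    dates = list(DATE_RE.finditer(text))
    if not record_positions or not dates:
        return ""
    # Choose the date whose start is the fewest characters from any 'record'.
    best = min(
        dates,
        key=lambda d: min(abs(d.start() - pos) for pos in record_positions),
    )
    return best.group(0)


example = (
    "The annual meeting will be held on May 29, 2025. "
    "Stockholders of record on April 2, 2025 are entitled to vote."
)
print(date_nearest_record(example))
```

The returned string can be passed to normalize_date from the script in the same way as the output of extract_date.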
Our client had to get to another meeting, but they did report that they were confident about the next steps. Actually, they were pretty ecstatic: we collected dates from 2,500 filings in 30 minutes, and that included a little bit of wrangling with ChatGPT.




