Extracting Text with Python

A client recently asked how to parse, from 8-K filings, the description associated with a particular item code. I wrote a blog post about using ChatGPT to write the code, and in that post I tried very hard to act as a naive user of Python while interacting with ChatGPT. However, since this question has come up in the past, I wanted to create some more robust code and make it available. The code can be retrieved from “S:\PythonCode\PARSE_8K_ITEMS.py”; the S: drive is where all of the directEDGAR material is located.

The first step is, of course, to identify the filings that you need to parse and extract their text. This process is described here. I generally limit my extraction to groups of about 5,000 documents per cycle. I do this for two reasons. One is fairly technical. The other is that I generally prefer to run code on my local computer, which means I need to transfer the extracted text from my platform instance to my local computer, and my feeling is that smaller groups are less prone to bandwidth and connection issues.

The basic logic of the code is to read the filing/document line by line to find the line that has the content that marks the beginning of the section you want to parse. Here is an image of one 8-K filing in the SmartBrowser after extraction with the line I want to use to identify the beginning of the section.

I want to emphasize that the entire 8-K is in the SmartBrowser. So now I need to scroll down to find out how the section ends. The red line marks the end of the section I want to parse.
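The line-by-line logic described above can be sketched roughly as follows. This is a minimal illustration, not the code in PARSE_8K_ITEMS.py: the function name extract_section, the sample lines, and the specific end patterns passed in are all made up for the example.

```python
import re

def extract_section(lines, start_pattern, end_patterns):
    """Collect lines from the first start match up to, but not including, the first end match."""
    section = []
    in_section = False
    for line in lines:
        if not in_section:
            # Still looking for the line that opens the section
            if re.match(start_pattern, line, re.IGNORECASE):
                in_section = True
                section.append(line)
            continue
        # Inside the section: stop at the first line that matches any end pattern
        if any(re.match(p, line, re.IGNORECASE) for p in end_patterns):
            break
        section.append(line)
    return section

# Made-up lines standing in for an extracted 8-K
sample = [
    "FORM 8-K",
    "Item 4.01 Changes in Registrant's Certifying Accountant",
    "On March 1, the Company dismissed its auditor.",
    "Item 9.01 Financial Statements and Exhibits",
    "SIGNATURE",
]

section = extract_section(
    sample,
    r'^\s*Item\s*4\.01',
    [r'^\s*Item\s*9\.01', r'^\s*SIGNATURE'],
)
print("\n".join(section))
```

With a real filing you would pass in the lines read from the extracted text file rather than a hard-coded list.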

Regular Expression used to identify beginning of section

As I noted above, the code reads each line, so my regular expression for identifying the start line assumes that the item caption begins the line.

start_pattern = r'^\s*Item\s*4\.01'
^ - indicates that the pattern that follows must be at the beginning of the string being analyzed. We are iterating through the document line by line, so each line is a string.
\s* - indicates that there may be ZERO or more white space characters
Item - after the white space we are looking for the characters I-t-e-m in sequence (note that I handle capitalization issues later)
\s* - again, ZERO or more white space characters
4 - we are requiring the number 4
\. - because the period/dot has a special meaning in a regular expression (it matches any character), we escape it so that the search matches only a literal period/dot
01 - and then the sequence 01
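To see how each piece of the pattern behaves, here is a small check against a few made-up lines. The test lines are hypothetical, and I am assuming that capitalization is handled by passing re.IGNORECASE, as is done for the end patterns.

```python
import re

start_pattern = r'^\s*Item\s*4\.01'

# Hypothetical lines, chosen to exercise each part of the pattern
test_lines = [
    "Item 4.01 Changes in Registrant's Certifying Accountant",  # plain caption
    "   ITEM 4.01. CHANGES IN CERTIFYING ACCOUNTANT",           # leading spaces, upper case
    "Item4.01",                                                 # zero spaces after Item
    "As described in Item 4.01 above",                          # caption is not at the start
    "Item 4x01",                                                # escaped dot must be a literal period
    "Item 5.02 Departure of Directors",                         # a different item
]

results = [bool(re.match(start_pattern, line, re.IGNORECASE)) for line in test_lines]
for line, matched in zip(test_lines, results):
    print(matched, repr(line))
```

The first three lines match and the last three do not. Note that re.match only ever matches at the beginning of the string, so the ^ is not strictly required here, but it keeps the pattern correct if it is later used with re.search.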

Identify the end of the section

When you review the code you will note that the actual test for the end of the section relies on three different regular expressions.

if re.match(end_pattern_401, line, re.IGNORECASE) or re.match(signature_pattern, line, re.IGNORECASE) or re.match(exhibit_pattern, line, re.IGNORECASE):

The first one is unique to the task of parsing 4.01-related filings. Some registrants include 4.01 (a) and/or 4.01 (b) section labels in their 8-K. A general search for the next Item # would therefore stop at such a label and return a snip that has only the first line, so we need a special pattern that ignores a following line that begins with Item 4.01. Further, some auditor change 8-Ks do not include any additional items, so the next boundary to consider is the caption that begins the exhibit section. If there are no exhibits, then we need to search for the caption that indicates that the signature section follows.
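Here is one way the three end patterns could be written. The variable names come from the line of code above, but the pattern definitions themselves are my illustration for this post and are assumptions; the definitions in PARSE_8K_ITEMS.py may differ. The helper is_section_end is also hypothetical.

```python
import re

# Assumed definitions, for illustration only.
# Any "Item N.NN" caption ends the section unless it is another Item 4.01 label,
# which the negative lookahead (?!4\.01) rules out.
end_pattern_401 = r'^\s*Item\s*(?!4\.01)\d+\.\d+'
# Caption that opens the signature block: SIGNATURE or SIGNATURES
signature_pattern = r'^\s*SIGNATURES?\b'
# Caption that opens the exhibit section when it has no Item label
exhibit_pattern = r'^\s*Exhibits?\b'

def is_section_end(line):
    """Return True if the line matches any of the three end-of-section patterns."""
    patterns = (end_pattern_401, signature_pattern, exhibit_pattern)
    return any(re.match(p, line, re.IGNORECASE) for p in patterns)

print(is_section_end("Item 4.01 (b) Engagement of new accountant"))   # stays inside the section
print(is_section_end("Item 9.01 Financial Statements and Exhibits"))  # next item ends it
print(is_section_end("SIGNATURES"))                                   # signature caption ends it
```

One caveat with this sketch: a body line that happens to begin with the word Exhibit would also end the section, so the exhibit pattern may need tightening for your filings.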

Older (Pre-2004) Filings

As noted elsewhere, we use the post-2004 item codes to identify the reasons that triggered the filing of each 8-K. This means that when you search for (ITEM_4.01 contains (YES)) you will be able to identify all auditor change 8-K filings, even those filed prior to the post-SOX changes in the 8-K form. However, you can’t use the regular expression that I described above to identify the beginning of the section in those older filings, since the caption was ITEM 4. CHANGE IN REGISTRANT’S CERTIFYING ACCOUNTANTS. This is addressed in some comments at the bottom of the code sample.
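As a rough idea of what a start pattern for the older caption might look like, here is a sketch. The name start_pattern_pre2004 and the pattern itself are my assumptions for this example and are not necessarily what the comments in the code sample suggest.

```python
import re

# Assumed pattern: "Item 4." NOT followed by a digit, so that the older
# "ITEM 4." caption matches but the newer "Item 4.01" caption does not.
start_pattern_pre2004 = r'^\s*Item\s*4\.(?!\d)'

old_caption = "ITEM 4. CHANGE IN REGISTRANT'S CERTIFYING ACCOUNTANTS"
new_caption = "Item 4.01 Changes in Registrant's Certifying Accountant"

old_matches = bool(re.match(start_pattern_pre2004, old_caption, re.IGNORECASE))
new_matches = bool(re.match(start_pattern_pre2004, new_caption, re.IGNORECASE))
print(old_matches, new_matches)
```

The end-of-section test would need a similar adjustment for these filings, since the items that follow use the older numbering as well.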

Developing your own regular expression

You can modify the code to parse other documents into specific sections once you have identified a pattern that will systematically identify the beginning and the end of the section you want to parse. The tool I often use when trying to develop or refine a regular expression is the website Pythex. To use it, paste the text you want the regular expression to run against in the "Your test string" box and start constructing your regular expression in the "Your regular expression" box.

