CUSIP-CIK Mapping

I am making available a new database on the platform that has three fields: BASE_CUSIP, CUSIP and CIK. As you have probably seen, we have parsed the 13F filings and have been trying to link each CUSIP to the issuer CIK. CUSIPs are hard to get because they are issued by the American Bankers Association and managed by Standard & Poor's. To identify the mapping between CUSIPs and CIKs we parsed the SC 13G and SC 13D filings, as they contain CUSIP values.

First Page of SC 13G/A Filed to Disclose Holdings in BioScrip by Heartland Advisors in 2007

Based on the metadata associated with this filing we could determine that it related to the company named BioScrip, whose CIK is 1014739. Because of the number of SC 13 filings made with this CIK, we feel pretty comfortable that this mapping is correct.

Of course it is never that easy, as you well know. BioScrip was acquired in a reverse-merger-type arrangement in 2019, and while the legal entity filing with the SEC remained the same, the shares were replaced with new shares and a new CUSIP (68404L201) was issued with them, as can be seen in the image from an SC 13 filing made in 2022.

Since our interest is in making it easier for you to search for SEC filing content, we believe this mapping will help you identify filings made by a security issuer when you have the CUSIP but not the CIK. From our perspective, the fact that the CUSIP changes is more or less irrelevant. If you have data indexed on CUSIP for this company that relates to security information prior to August 2019, you will have the CUSIP value 09069N108 – that maps to CIK 1014739. If your data is more recent (in 2020, for example), you would have CUSIP 68404L201 – that also maps to CIK 1014739. And if you have data prior to early 2005, you might have CUSIP 553044108 (the company was known as MIM CORP then). So we have one CIK that maps to three CUSIPs. We are pretty confident that we have enough evidence to conclude that the mapping is appropriate/reasonable. Specifically, there were more than 40 confirming SC 13 filings with CUSIP 553044108, more than 100 SC 13 filings with CUSIP 09069N108 and more than 15 with CUSIP 68404L201. I would argue that the evidence is reasonable. However, you need to understand that we just analyzed the filings. We tested the validity of the CUSIPs we found and created tests relating to characteristics of the filing, the filer and the issuer. Still, we don't know what we don't know. I have been trying to find errors, and while we have done extensive testing, I suspect some remain.

Remember, the first six characters of the CUSIP are issuer specific, so we also added the BASE_CUSIP value(s) (09069N, 553044 and 68404L) as an additional field, just in case. We clearly do not have every CIK mapped to a CUSIP. Some issuers do not have any SC 13 filings. For others, the evidence is not strong enough (5 filings total, 3 with one CUSIP and 2 with another). There are those that report the SEDOL of the underlying security rather than the CUSIP of the ADR. We are also working with another data source that will yield some additional mappings.

Finally, you can filter on CIK – to access the entire listing just hit the Execute button without setting any operators/criteria. And of course you can export the results using the Save Results button.
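If you prefer to work with the mapping in code rather than through the interface, here is a minimal sketch of how an exported copy could be used to attach CIKs to a CUSIP-keyed dataset. The file name and column layout below are assumptions for illustration only (I am assuming you exported the three fields to a CSV with the Save Results button).

import csv

# hypothetical export of the mapping - assumes columns named BASE_CUSIP, CUSIP and CIK
with open(r"D:\PhotonUser\My Files\Temporary Files\cusip_cik_mapping.csv", newline='') as fh:
    mapping_rows = list(csv.DictReader(fh))

# full nine-character CUSIP to CIK
cusip_to_cik = {row['CUSIP']: row['CIK'] for row in mapping_rows}

# issuer-level lookup using the first six characters (BASE_CUSIP)
base_to_cik = {row['BASE_CUSIP']: row['CIK'] for row in mapping_rows}

my_cusip = '09069N108'
cik = cusip_to_cik.get(my_cusip) or base_to_cik.get(my_cusip[:6])
print(my_cusip, cik)   # both lookups should return 1014739 for this example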

End of Year Index Changes Coming – no action needed by you!

We are starting to prep for the end of the calendar year. We will consolidate all Y2021/Y2022 indexes into a Y2021 – Y2025 index and then create new Y2023 indexes for all filings made after 12/31/2022. Because of the software update we did earlier this year, you will not have to do anything. If you start the application after we complete the update around 1/1/2023, the old indexes will be merged and renamed. It does mean that if you are using one or more of our artifacts that include the FILENAME field, you might have to either manually change the path value or rerun your search on the new consolidated index.

To be clear – suppose you used our platform to identify the 8-K filings made so far in 2022 that related to an auditor change.

If you ran the summary extraction from this search the FILENAME field will have the path to the folder Y2022-Y2022 as you can see in the next image:

Once we consolidate and re-index, that path will no longer be valid. However, the file will be in a very predictable location – we can replace the Y2022-Y2022 component of the path with Y2021-Y2025 and the file will be reachable using Python or one of the filepath filtering features.
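As a concrete illustration, here is a minimal sketch of that fix in Python. The FILENAME value shown is made up for illustration – the only real work is a string replacement on the path component.

# hypothetical FILENAME value from a summary extraction run before the consolidation
old_path = r"...\8-K\Y2022-Y2022\0000100885-22-000030\example.htm"

# after the year-end consolidation the same file will live under the Y2021-Y2025 index folder
new_path = old_path.replace("Y2022-Y2022", "Y2021-Y2025")
print(new_path)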

I am particularly excited that when you open the application after this process is complete you will not have to do anything to access the new indexes – I can't screenshot it right now because it has not happened yet – but the new index will be listed for you to select. Before the last update you would have had to run an Index Update through the options menu.

Significant Update to 13F Share Data – Read About Consequences of Chunking versus not Chunking

We completed a fairly intense update to the 13F Share Data available on our platform. First, unbeknownst to us, there were CUSIPs that sometimes had lower-case characters. Note to self: remember that filers make mistakes. We fixed that issue. More importantly, we have been working to identify more CUSIP-CIK mappings. We succeeded, so we updated the 13F share data with the CIK whenever we were able to identify the CIK of the underlying issuer. The first problem only affected about 100K rows; the new mapping adds about 22 million new rows with the CIK as a value. This was significant – we almost doubled the number of observations that have a CUSIP-CIK mapping. We now have over 45 million observations with a CUSIP-CIK match. We still need to identify the CIK-CUSIP mapping for about 17 million rows.

This gets dense now. One challenge with the next group is that some CUSIPs have been associated with two or more CIKs, and we have to sort out how to make sure we map these correctly. As an example, CUSIP 00206R102 belongs to the entity known today as AT&T (CIK 732717), which was formerly known as Southwestern Bell Corporation. However, we found evidence that it was attached (wrongly) to securities issued by the prior entity known as AT&T (CIK 5907) that was acquired by Southwestern Bell Corporation. I am confused as I am writing this because, well, it is confusing. We are still trying to develop the correct algorithm to assign the correct CIK to these cases. Another challenge is that some of the CUSIPs are missing one or more leading or trailing zeros (0). We actually think this group will be the next one we update because we have parsed all of the SEC's List(s) of 13F Securities and believe we can use these to address this group. Our plan is this: for those reported CUSIPs that appear to be missing leading zeros, we will check every existing, known CUSIP to see whether the non-zero characters form a substring of some other CUSIP value. For example, APPLE's CUSIP is 037833100. We will try to confirm that there is no other CUSIP that contains the sequence 378331. With that evidence and some fuzzy name-matching we hopefully can conclude that when the reported CUSIP is 378331 and the name is close to APPLE, the proper CIK to map to in those cases is 320193. But all of this takes time.
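Here is a minimal sketch of that uniqueness check, assuming you already have the set of known, valid CUSIPs in memory. The variable names (and the tiny stand-in set) are mine for illustration, not part of our production code.

# stand-in for the full list of known, valid CUSIPs
known_cusips = {'037833100', '09069N108', '68404L201'}

def candidate_matches(reported, known_cusips):
    """Return known CUSIPs that contain the reported value once leading/trailing zeros are stripped."""
    core = reported.strip().strip('0')
    return [cusip for cusip in known_cusips if core and core in cusip]

print(candidate_matches('378331', known_cusips))   # -> ['037833100'] if the sequence is unique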

I will observe that this is not quite as grim as it sounds. Some of the securities listed in the 13F-HR filings where we are missing the CIK are derivative securities (ETFs, trust shares and the like). We are trying to carefully identify these. I am going out on a limb and will say, from reviewing some of this data, that at least 60% of the missing CIK values are from these types of derivative securities. ISHARES has more than 600 derivative securities listed in the Q3 2022 list of 13F Securities that are well represented in the holdings data. We may end up adding a field to flag these. As I am writing this I am waiting on some code to pull all unmatched pairs and their frequency so we can sort out our steps going forward.

In the meantime, here is another code example of working with the as-reported 13FSHAREDATA. I prepared this because of some encouragement to add additional explanatory comments. It is available on S:\PythonCode, and I am going to prep a short video that demonstrates using this code. This example presumes that you have a list of CIKs that you want to use to identify relevant holdings. You can trivially modify the code below to pull based on CUSIP if you supply a list of CUSIPs and change the keyword cik to cusip wherever it appears in this code example.

In the code below I demonstrate querying the database by quarter. I have written/spoken about my practice of trying to profile the optimal balance between chunk size and time. In this example I had 1,149 CIKs in my list. The output files contained a total of 10,657,575 observations. It took me 16 minutes to pull by quarter using one of our client instances. I modified the code to try pulling the same sample without conditioning on quarter. After 90 minutes I finally killed the job. It is not a CPU issue, it is a memory issue. I am having to stop myself from droning on here, but it is kind of cool – I have 1,149 CIKs, each is defined/exists in one place in the memory stack and then there are references to their location in the data . . .! I will stop! Anyway – here is the code – and as noted above, it is available on S:\PythonCode.

import sqlite3
import csv

db_path = 'S:\\directEDGAR_DATA\\13FSHAREDATA.db'

# I highly recommend that you pull by year/quarter - it ultimately will be faster
# than even pulling by year because you are not having to use the page file memory as much.
# I am always messing around with trying to minimize total time for some operation and
# it is just true that this is a delicate balance


years = [str(year) for year in range(2013, 2022)]
qtrs = ['-03-31', '-06-30', '-09-30', '-12-31']

# suppose you have a list of CIKs and you want the individual reported transactions relating
# to the filers in that list.  I am assuming below that your list is in a text file with no header
# and is in the Temporary Files folder on the instance you are working from.
# I am also assuming that your CIKs are not left-padded with 0 - if they are, open the list
# in EXCEL and then save as a text file



with open(r"D:\PhotonUser\My Files\Temporary Files\sample_cik.txt") as fh:
    my_cik_list = fh.readlines()

# A folder that you created to contain the results
DEST_FOLDER = "C:\\PhotonUser\\My Files\\Temporary Files\\13Fdata\\"

with open(r"D:\PhotonUser\My Files\Temporary Files\sample_cik.txt", 'r') as fh:
    my_cik_list = fh.readlines()

# there is going to be a carriage return/line feed after each observation - this will remove those
my_cik_list = [cik.strip() for cik in my_cik_list]

# we turn the list into a tuple because its string representation matches the SQL IN ( ... ) syntax used in the query below
my_ciks = tuple(my_cik_list)

for year in years:
    for qtr in qtrs:
        # we are going to save after each cycle - modify if you want to save less frequently
        # the problem you will face is that about one year of data will exceed the 'capacity' of
        # EXCEL if you have a large number of CIKs or CUSIPS
        results = []
        period = year + qtr
        conn = sqlite3.connect(db_path)
        # useful feature - preserves the mapping between the COLUMN and VALUE - so the COLUMN NAME is persistent
        conn.row_factory = sqlite3.Row
        cur = conn.cursor()
        cur.execute(
            f"""SELECT * from THIRTEENFSHAREDATA where periodofreport_DATE = '{period}' and cik in {my_ciks} """)
        rows = cur.fetchall()
        conn.close()
        # this just lets us see progress
        print(len(rows), period)
        if len(rows) == 0:
            print("no observations", period)
            continue
        for row in rows:
            d_row = dict(row)
            results.append(d_row)

        column_names = [k for k in results[0].keys()]
        header_dict = dict((ch, ch) for ch in column_names)

        dest_file = DEST_FOLDER + '13FHOLDINGS_' + period + '.csv'
        # I am assuming you have created a folder to contain the results - see above

        outref = open(dest_file, 'w', newline='')
        my_writer = csv.DictWriter(outref, fieldnames=column_names)
        my_writer.writerow(header_dict)
        my_writer.writerows(results)
        outref.close()
            
        


Off Topic (Somewhat) but Great Friday Reading

I say off-topic because this post is not about a new feature or update to our platform. Jack Ciesielski was the creator/founder of the Analyst’s Accounting Observer. Jack and his team read financial statements and unwound practices they thought were dubious to create what he believed was a better representation of key metrics. Jack ‘retired’ a while ago, though he still serves on the EITF, the Investor Advisory Group of the PCAOB and the CFA Institute’s Corporate Disclosure Policy Council. I was fortunate to meet Jack as they were early directEDGAR customers (they were with us when we had to mail CDs with updates and they were a pilot customer when we toyed with providing NAS devices to our customers).

I reached out to Jack a bit ago to see how he was doing in retirement. I guess he can’t stop thinking about accounting/markets and business, because he shared that he transitioned from the Analyst’s Accounting Observer to a weekly newsletter, “The Weekly Reader”. In it Jack curates some interesting reads and offers a brief take on why they made it into the newsletter. He also offers a couple of bonus rounds. Here is a link to the latest version (Jack Ciesielski WR 11/11/2022) – if you would like to be added to Jack’s mailing list just send him an email (jciesielski [some important symbol here] accountingobserver.com). At least with Jack we can be assured that he is not trying to make money off of your email address!

Quick Update

Earlier this week I explained that we would move all of our archive of ITEM_1A to the platform and index them. This was completed, and soon after I had a message wondering if we could do the same with ITEM_7 (affectionately known as MD&A). That was completed overnight. I did change the index names – originally I named the Risk Factors archive RISKFACTORS. When I finished the MD&A and looked at the placement in the index list, I decided I was making this unnecessarily complicated. So the names were changed to ITEM_RISKFACTORS and ITEM_MDA.

New Items for search.

And yes, you can use all of the existing features with these indexes. Specifically, if you want to run a CIK-DATE limited search, you need to have a file with the column headings CIK and MATCHDATE; there can be other columns but those two must exist.

CIK DATE file for Search

Notice in the file above, I have multiple instances of the same CIK but different dates. That is because for my research example I want to establish whether or not an MDA exists in a window around each of those unique dates with the search phrase(s) I will use in my search. Once the file is created and available, select the CIK/DATE Filter checkbox and hit the Set CIK/Date File button to activate the file selector tool.
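For reference, here is roughly what a minimal version of that file could look like. The CIKs, dates and extra column below are made up for illustration, and I am assuming a yyyymmdd-style MATCHDATE purely as an example – the only hard requirement stated above is that the CIK and MATCHDATE columns exist (other columns are ignored).

CIK,MATCHDATE,NOTE
320193,20150215,event window 1
320193,20180215,event window 2
100885,20160601,single event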

Selecting CIK/Matchdate file

After you select the file, remember to specify the range and the specific date you want to use. The RDATE represents the dissemination date (which is often but not always the filed date) and the CDATE represents the balance sheet date AS REPORTED in the filing header. Once you have set the parameters, hit the Okay button and enter your search term(s)/phrases into the search box before you hit the Perform Search button.

Expert tip – if you only want to establish the existence of the document rather than identify those with specific words/phrases, use the search operator XFIRSTWORD (or XLASTWORD if you prefer).

Remember as well that the platform will generate a list of CIK-DATE pairs that did not match any criteria for the search. It could be either that they did not have the relevant search terms or that, even if they did, the MATCHDATE and window you specified did not match what is available.

Risk Factors Separate Index Available Now

Let me get to the punchline first – and then I will add some detail. If you log into our platform today, use the Zoom button next to the Library listing tool and then scroll down almost to the bottom, you will see the new RISKFACTORS index:

Accessing Risk Factors Index

This index has the collection of the ITEM 1A Risk Factors we have previously parsed from 10-K filings. While these are available from our ExtractionPreprocessed feature, and the application has a built-in Custom-Indexing feature, two things happened in the past week that caused me to move these to the application drive.

First, I had an interesting and long discussion with Professor David Lont at Otago University. David has been a directEDGAR user from almost the beginning. Honestly, some of the research goals he has shared with me in the past were critical factors that caused us to figure out how to add the ability to filter searches by CIK and unique date pairs. In our conversation last week David was not wishing for the ability to search the Risk Factors section; instead he was sharing a story about one of his junior colleagues' general frustration with using directEDGAR. It was interesting, and painful, to listen to these observations. When I think of hard/easy I am probably always measuring the difference between where we are now versus what I had to do to collect purchase price allocations to establish the amount of goodwill recognized in acquisitions by reading 10-Ks on microfiche. That is probably not the right measuring stick!

Okay, so I had that discussion with David (and it was great to talk to him), and then two days ago a new user at another client school asked about searching within the Risk Factors section of the 10-K. Initially my thought was to direct him to the sections in the documentation on ExtractionPreprocessed and then the section on Custom Indexing. But my conversation with David had been running around in my head and I thought to myself – that is adding unnecessary work to the process. What I needed to do was just take the time to move a copy of the Risk Factors snips to the application and create an index just for those documents. ExtractionPreprocessed was developed when we distributed our content to you and left it to your IT experts to update the indexes etc. We knew that there was no way they could be expected to manage the more than one million files that have accumulated over time. So we set up the system for you to pull the files and added the indexing feature so you could build your own index and search the snips.

I guess I get that might seem convoluted (remember my starting place).

However, there is a caveat: we have some other projects that are fairly significant in scope and I just don't have the resources right now to do everything I wish I could do with these snips. At the moment we have not injected any metadata into the snips like we normally do with our documents. Any search results will have blank fields for CNAME, DOCTYPE, FYEND and SIC. However, the search results will have the word count as well as the CIK, RDATE, CDATE and FVAL to provide a way to match back to the original 10-K filings that the content was pulled from. I would actually like to put in the path to the original 10-K as a metadata field, with the word count from the 10-K as another field.

One immediately nice thing about these is that you can get just the text by completing a search and then hitting the DocumentExtraction TextOnly item from the Extraction item on the menu bar. Thus if you want to use the raw text as an input for some process – that is immediately available.

If I were going to parse these to identify individual listed items I would prefer the htm version with the formatting preserved, so as to use the tags to help identify the listed risk factors. As always, the DocumentExtraction feature will dump an original copy of the source file into the directory you specify.

FYI – Risk Factors were only required after 12/1/2005. We are capturing the ITEM 1A section as it exists – and so this can be complicated for those entities that reference another document or another location in the 10-K. We are missing some, and it is part of our workflow to determine if those can be captured. Further, while entity classification has changed across time, there are entity classes that are not obligated to provide separate disclosures about risk factors. Sometimes they simply remove ITEM 1A from their 10-K; other times it is left in and they include language indicating that they are exempt.

Before I sat down to write this post I started the process of moving the MDA archive we have – there will be an MDA index soon (probably by Sunday 10/23/22). I also expect to move over the Business section as well.

Totally Random – But Somewhat Bothersome Finding

We started parsing Executive and Director compensation from filings a number of years ago. We did this because when I visited our early clients I found that many were using our tools to collect this data to augment data they had available through an S&P product (which I won’t name). The decision to collect and make available this data was an easy one because it tied directly into the main reason I started directEDGAR – to reduce the time it took to collect data so our clients could focus on their research, not data collection.

Being a small company with limited resources, we had to make a very cost-focused choice about how to distribute the data, and so we hit upon the idea of storing the data from a table as a JSON object in a CIK/YEAR directory, because it was not hard to write Python code to deliver the data from a request file. And the infrastructure to make this happen was not too complicated.

Here is what one row of the data looks like as a JSON object:

    {
        "CIK": "100885", 
        "RDATE": "R20180328", 
        "CDATE": "C20180510", 
        "FNAME": "F59", 
        "TID": "171", 
        "NAME-LABEL": "Name and Principal Position", 
        "RID": "11", 
        "PERSON-NAME": "RHONDA S. FERGUSON", 
        "PERSON-TITLE": "EVP CHIEF LEGAL OFFICER", 
        "YEAR": "2016", 
        "SALARY": "200000", 
        "BONUS": "720000", 
        "STOCK": "400017", 
        "OPTION": "", 
        "NQDEFCOMP": "", 
        "OTHER": "59746", 
        "TOTAL": "1379763", 
        "SEC-NAME": "FERGUSON RHONDA S", 
        "PERSON-CIK": "1677193", 
        "GENDER": "F"
    }, 

This turned out to be a good choice in many ways, one of which is that it allowed us to push this data immediately to those who want to integrate it into their offerings – this works well for an API.

A downside of this choice is that we have never really had an easy way to aggregate all of the data and start poking at it at scale. That JSON object you are looking at is stored as a text string, so you have to pull the data to test values in fields. Our shift to the new platform and our decision to move all of this data into an SQL-enabled database has made it easier to start looking at the data at scale. While we are still a bit away from releasing the full Executive Compensation database to the platform, we are getting closer. As I work with it at this intermediate stage, I was curious about the questions I could ask.

Preliminary View of Data

There are over 470K PERSON-YEARs in this data. It is very comprehensive going back to 2006. There was a big change in the SEC-mandated disclosure of EC and DC data that took effect in early 2007 (and thus affected the disclosure of 2006 data). One key difference is that prior to the new disclosure regime most registrants did not report the value of securities that were part of the compensation package – the number of securities (shares/options . . .) was disclosed in the table or in a footnote, and it was fairly uncommon for a total compensation value to be reported.

The JSON example posted above was pulled from our archive of data filed in 2017 – that is one row for Union Pacific Railroad. Here is their EC table filed in 2004: while they report a value for Restricted Stock Awards, they do not report a value for the Options/SARs – the value you see below is a number of units. Later in the proxy they do provide one estimate of the value of those securities, but not all registrants did this. Further, notice that there is no total.

UNP 2004 Summary Compensation Table

So what is this somewhat bothersome finding? Well, it blew my mind when I decided to ask the simple question – what is the gender distribution in this archive? How many FEMALE-YEARS of data have we collected?

Querying Full EC Data for Frequency of Women as Named Executive Officers

We have 45,188 person years of women (awkward phrasing)

We have 418,876 person years of males. Remember that the bulk of this data comes from 2006 to the present.
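For those curious, the query behind those counts is about as simple as it gets once the data sits in an SQL-enabled database. This is only a sketch of the kind of query involved – the database file name and the table name EXECCOMP below are placeholders, since the full Executive Compensation database has not been released to the platform yet.

import sqlite3

# placeholder path and table name - the production names may differ
conn = sqlite3.connect(r"S:\directEDGAR_DATA\EXECCOMP.db")
cur = conn.cursor()
cur.execute("SELECT GENDER, COUNT(*) FROM EXECCOMP GROUP BY GENDER")
for gender, person_years in cur.fetchall():
    print(gender, person_years)   # expect F, M and a blank value for the roughly 9,000 unassigned rows
conn.close()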

I do not mean to be a social commentator – I am not very good at it. However, we are at an inflection point in society. The world is confronted with some very complex problems and it seems to me that these problems are not going to go away unless we engage everyone who is capable of bringing parts of the solutions to the table. The difference in the gender distribution in the data is bothersome to me because I have never believed that because I am male I should move to the front of the line. But it is hard to imagine that we could end up with a distribution like this unless others believed that only men should be able to do certain things.

If you line up all of the people in the world based on some attribute, I would just struggle to imagine any meaningful attribute where the ‘best’ was dominated by men. Of course, I am the guy who was knocked out by a girl in sixth grade (let me tell you, I never crossed her again!). But seriously, my first boss was Mrs. Ittenbach, who owned a Dairy Queen in Odenton, Maryland – she taught me how to mop floors so you could eat off of them and gave me insights on how to treat people with fairness. There has never been a time in my life when I did not know a female who was smarter than me. There were very important women in my Ph.D. program, both faculty and fellow students. There have been women leaders at every college where I have worked. Most importantly, in every single class I have ever taught there are as many women who are clearly capable of great things as there are men. My neighbor (Dr. E.) and I trade Wordle and Quordle results each morning – Dr. E. takes great pleasure in crushing me. (If you haven’t figured it out, Dr. E. is a woman and I am trying to say she is smarter than me.)

On the one hand, as the father of an amazing son, I certainly don’t want him denied opportunities because of his gender. But Dr. E. is the mother of two amazing daughters and I know she has the same feeling (one of her daughters is an intern for us so I know how amazing they are). But this data makes it hard to deny that there are still likely systemic issues that affect opportunities for an important part of our population.

Those of you who are math geniuses will probably have already recognized that the total F + total M does not equal the total observations. We have roughly 9,000 observations without a value for GENDER in our database. This is because we only started adding GENDER to the compensation data sometime after we started collecting it.

We are going to have to first establish whether an engineering solution is available (how many of these can we code for), but then there will be plenty we can't. The problem is that most of these are going to be from the early years, just as the SEC mandated ownership reporting through EDGAR (which is how we get the PERSON-CIK), and they will be people who did not remain Named Executive Officers long enough, so we won't have them in our PERSON-CIK archive. We made a bad decision early on not to collect GENDER if we did not have a PERSON-CIK. Fortunately we have been working through the consequences of that decision for a while. It may seem easy, but there are too many names where the name does not provide any GENDER clue. We will try an engineering solution first, and then the balance will be parsed out to the interns to work on when they are at a waiting point for another task.

I have had two questions about access to this data in the past week. One person was able to get what they wanted from the old request system, and we ran a query and provided the results for the other. If you have a pressing issue and need to use this data, let me know and we can work something out. We just need more testing before we are quite ready to push it out for all. I will say we will release it before we have fully addressed the missing GENDER issue – we can and will update it periodically.

I would be remiss if I did not address the other diversity issue that is problematic – racial and ethnic diversity, where many groups of people are even more underrepresented in the population of leaders than women. The problem is that the data is much harder to access. While NASDAQ has developed new board diversity disclosures, these are being phased in, and the disclosure requirements do not specify a mapping from a person to a diversity attribute. Further, they do not require disclosures about the diversity of company leadership. While some companies provide a mapping between board members and various diversity attributes, many rely instead on a schedule like the following:

Diversity Table

I used our TableSnipper to pull the available tables based on the presence of the words White, Female, and Asian. I did it again using the terms Caucasian and Gender. That seemed to cover the gamut of available tables. There are a large number of significant public companies that do not have any racial or ethnic diversity on their board. I looked at this particular company’s executive leadership and it reflected the (lack of) diversity of the board (more or less).

Director Diversity Matrix – No Diversity!

I have probably gone off the deep end here. My initial goal was to provide an update on our EC transition. My update is a little nebulous, but the important point is – if you need access before we are ready to port it out, we can probably work something out. Another important point is that it really is kind of cool to access the data so directly. I did get lost a bit there in the lack of FEMALE representation in the highest levels of public companies, and then the issue of evidence of a lack of racial/ethnic diversity on boards. It is going to be interesting to take advantage of some of these new disclosures to explore/test more granular hypotheses about how diversity adds meaningful value.

Back to Basics – Image Files in Search Results

I had an interesting question this morning from a user – they were reviewing some search results and they came to this:

Image Heavy Document

Their search returned some documents and a number of those documents would not display in the viewer. They were concerned about two issues: first, how could they inspect the document, and second, what if they wanted to use the search results from the document?

When we push the search results to you, we are not actually pushing the document – we are pushing a cached version of the document. However, the original document is almost immediately available for cases like the one above. Just hit the Open Document button. All of the documents are stored in directories below the indexes and they load fast.

After hitting the Open Document Button

This is an htm file with embedded images that contain text. The text was extracted using OCR and indexed. We do not attempt to create a text version of the document – OCR technology is not there yet. The search was for Organic Growth and after opening the document above I found the following:

Organic Growth

So now – how to get that text out of the document? Well, the ContextExtraction feature works with the text in the indexer rather than the text from the document, so I set a limited context span as you can see below:

Setting Context to Extract 5 Words Around Search Phrase
Context Extraction

The OCR processor did not put a break between paydown and Deliver – image processing is HARD.

The bottom line is that we have the document, and the search results are available – I will admit it is annoying at times to have to go through these steps.

Python Example – Using the File Path to Find Documents and Using Director Compensation Data to Find Committee Assignment Tables in Proxy Filings

I posted some new code to our PythonCode examples. One of the code examples is one approach to finding committee assignment tables. I did not mention it in the example, but it is important to emphasize that I first scanned 150 or so proxy statements to identify the different ways these tables are displayed. I was able to do that by just running a quick search for (DOCTYPE (contains DEF14A)). We always have to start data collection by viewing a range of documents to learn how the filers express the concept we are trying to find. That review helped me create the list of words that are in the code. Since this is an example, I do not mean it to be exhaustive, but it is a starting point.

However, in the example code I demonstrate how we can start with a list of CIKs and find the related documents. It will generally be the case that we have to tune our KEY_WORDS etc. to find the right data we are trying to collect. To help with that we will probably have to inspect documents where results were not available. One of the new features we built into the latest iteration of the platform is the ability to match on the file path of a document in the repository.

The example also illustrates something I personally was excited to demonstrate. We have a pretty deep archive of director compensation, and much of our director compensation data is available in an SQLite db that you can interact with in code. Since the table that reports committee assignments usually has the names of the directors, I pulled names from the director compensation data by CIK and some specific years to use in my attempts to identify the tables (there is a sketch of that kind of pull a bit further below).

Pulling DC Data to get NAMES to then use to Find Tables!

Based on my visual review of the initial documents I did not see any tables that did not have either the first and last names or just the last names. So I thought a good step one is to first find tables that had names in them.
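For context, here is a minimal sketch of that kind of name pull. The database path, table name and column names are placeholders that assume the director compensation table follows the same layout as the JSON example earlier in this post (CIK, YEAR and PERSON-NAME) – see S:\PythonCode for the actual example.

import sqlite3

# placeholder path and table name
conn = sqlite3.connect(r"S:\directEDGAR_DATA\DIRECTORCOMP.db")
conn.row_factory = sqlite3.Row
cur = conn.cursor()
cur.execute("""SELECT DISTINCT "PERSON-NAME" FROM DIRECTORCOMP WHERE cik = ? AND year IN (?, ?)""",
            ('100885', '2020', '2021'))
director_names = [row["PERSON-NAME"] for row in cur.fetchall()]
conn.close()

# a rough way to get last names for matching against table text (assumes first-name-first order)
last_names = [name.split()[-1] for name in director_names]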

I did set up the code to save the table so we can use the SmartBrowser to review. Here is a screenshot of one of the tables that I wanted to capture:

Awesome Table as Reviewed in the SmartBrowser.

Of course there were tables that I did not want – this means I need to tinker with my collection of words. Or, I can just delete the table using the Delete Current File button. This is always a balancing act and it might take multiple iterations to find the right set of appropriate and not so appropriate words.

Wrong Table

One test I did not implement was to set minimum dimensions of the table to be snipped, nor did I require the names to be in the same column. These can be added to the code with a bit of poking around.

In the example I created, if you followed along, the summary CSV file that reports on the results includes a variable named DE_PATH. The file also includes a stop_reason value. In my example, if there is a value in stop_reason, the current iteration of the code was not able to find a table that met the criteria. This could be because the table we are looking for does not exist or because it exists in a form we did not expect. The only way to establish which of those is the correct explanation is to inspect those documents.

Summary Sorted on stop_reason.

I want to inspect those documents, so I delete the rows that do not have a stop_reason listed and save the file. I then start the application, select the index (in my example I am using PROXY Y2022-Y2022), click the Use DB checkbox, and then select Set DB File.

Prepping to find specific documents.

Once I have selected Use DB and hit the Set DB File button, the application provides an interface to select the file – remember, we need the DE_PATH column in the selected CSV file.

Selecting the File with a Column DE_PATH

We still need to specify some search term. Since I want to scan these documents I intend to use the search operator XFIRSTWORD.

Results of Using DE_Path column to find specific documents

One of the things you will discover is that some filers insert page images into their proxy, and so there is nothing we can parse with a text/html parsing strategy.

NRG’s Proxy Page Reporting on Director Committee Assignments – it is an Image!

We will add more code examples. However, that code example also demonstrates how to accomplish some other tasks. In the code example I also provided some information about resources for learning Python as well as for finding the appropriate disclosure regs as they relate to the filings.

Remarks from Forms 3/4/5 and Amended Now Available

We are starting to move all of the Forms 3/4/5 data to a database format. The first (and easiest) step of this was to pull the REMARKS field from these filings and insert them into their own database.

One potential use is to identify those filings that describe transactions made pursuant to a 10b plan. While many of the filings include information about the reason for particular transactions in a footnote attached to the transaction, some use a global indicator in the REMARKS section.

We include the CIK field as the CIK of the issuer, and the rptownercik field has the CIK of the person reporting the transaction. You can merge with issuer-related data or person-related data using the appropriate column. The image below shows the results of a search to identify all REMARKS fields that mention either 10B or 10-B (I noticed some seem to reference 10-B rather than 10B).
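If you want to replicate that search in code rather than through the interface, here is a minimal sketch. The database file name and the table/column names are assumptions that simply mirror the description above (REMARKS, CIK, rptownercik), not the final production names.

import sqlite3

# placeholder path and table name for the new REMARKS database
conn = sqlite3.connect(r"S:\directEDGAR_DATA\FORM345_REMARKS.db")
cur = conn.cursor()
# match 10B or 10-B anywhere in the remarks text (LIKE is case-insensitive for ASCII in SQLite)
cur.execute("""SELECT cik, rptownercik, remarks FROM REMARKS
               WHERE remarks LIKE '%10B%' OR remarks LIKE '%10-B%'""")
rows = cur.fetchall()
print(len(rows), "remarks mention 10B or 10-B")
conn.close()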

Search to find remarks relating to the possible disclosure of transactions that could be motivated/explained by a trading plan.

We will be working to port over separate tables for derivative and non-derivative transactions. Because the accession number will be included in all of the related tables, it should be trivial to merge across the various tables. I will note that the ACCEPTANCE_TIME value was pulled from the FILING_TIMES data set by ACCESSION_NUMBER.