CUSIP-CIK Mapping

I am making available a new database on the platform with three fields: BASE_CUSIP, CUSIP and CIK. As you have probably seen, we have parsed the 13F filings and have been trying to link each CUSIP to the issuer's CIK. CUSIPs are hard to get because they are issued by the American Bankers Association and managed by Standard & Poor's. To identify the mapping between CUSIPs and CIKs, we parsed the SC 13G and SC 13D filings, as they contain CUSIP values.

First Page of SC 13G/A Filed to Disclose Holdings in BioScrip by Heartland Advisors in 2007

Based on the metadata associated with this filing we could determine that it related to the company named BioScrip, whose CIK is 1014739. Given the number of SC 13 filings made with this CIK, we feel pretty comfortable that this mapping is correct.

Of course, it is never that easy, as you well know. BioScrip was acquired in a reverse-merger type arrangement in 2019, and while the legal entity filing with the SEC remained fixed, the shares were replaced with new shares, and with those new shares a new CUSIP was issued (68404L201), as can be seen in the image from an SC 13 filing made in 2022.

Since our interest is in making it easier for you to search for SEC filing content, we believe this mapping will help when you want to identify filings made by some security issuer and you have the CUSIP but not the CIK. From our perspective, the fact that the CUSIP changes is more or less irrelevant. If you have data indexed on CUSIP for this company that relates to security information prior to August 2019, you will have the CUSIP value 09069N108 – that maps to CIK 1014739. If your data is more recent (in 2020, for example) you would have CUSIP 68404L201 – that also maps to CIK 1014739. And if you have data prior to early 2005 you might have CUSIP 553044108 (the company was known as MIM CORP then). So we have one CIK that maps to three CUSIPs.

We are pretty confident that we have enough evidence to conclude that the mapping is appropriate/reasonable. Specifically, there were more than 40 confirming SC 13 filings with CUSIP 553044108, more than 100 SC 13 filings with CUSIP 09069N108 and more than 15 with CUSIP 68404L201. I would argue that the evidence is reasonable. However, you need to understand that we just analyzed the filings. We tested the validity of the CUSIPs we found and created tests relating to characteristics of the filing, the filer and the issuer. Still, we don't know what we don't know. I have been trying to find errors, and while I suspect there may be some, we have done extensive testing.
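One basic way to test the validity of a reported CUSIP (one check among the several we run) is the standard CUSIP check-digit calculation: the ninth character is a checksum over the first eight. Here is a minimal sketch of that public algorithm; the function names are mine, not anything from our platform:

```python
def cusip_check_digit(base8):
    """Compute the CUSIP check digit for the first eight characters.

    Per the CUSIP convention, letters map to 10-35 and the special
    characters *, @ and # map to 36-38. Every second character value
    is doubled, and the digits of each value are summed.
    """
    total = 0
    for i, ch in enumerate(base8.upper()):
        if ch.isdigit():
            v = int(ch)
        elif ch.isalpha():
            v = ord(ch) - ord('A') + 10
        else:
            v = {'*': 36, '@': 37, '#': 38}[ch]
        if i % 2 == 1:          # double every second character value
            v *= 2
        total += v // 10 + v % 10
    return (10 - total % 10) % 10


def is_valid_cusip(cusip):
    """True when the ninth character matches the computed check digit."""
    return len(cusip) == 9 and str(cusip_check_digit(cusip[:8])) == cusip[8]
```

As a sanity check, all three BioScrip-related CUSIPs discussed above (553044108, 09069N108 and 68404L201) pass this test, as does APPLE's 037833100.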

Remember, the first six characters of the CUSIP are issuer specific, so we also added the BASE_CUSIP value(s) (09069N, 553044 and 68404L) as an additional field, just in case. We clearly do not have every CIK mapped to a CUSIP. Some issuers do not have any SC 13 filings. For others, the evidence is not strong enough (say, 5 filings total, 3 with one CUSIP and 2 with another). There are also those that report the SEDOL of the underlying security rather than the CUSIP of the ADR. We are working with another data source that should yield some additional mappings.
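Because those first six characters identify the issuer, the BASE_CUSIP field lets you recover the CIK even when the issue-specific suffix has changed. A minimal sketch, using an illustrative dictionary built from the BioScrip example above (in practice you would export the BASE_CUSIP/CIK pairs from the new database):

```python
# illustrative mapping built from the BioScrip example in this post;
# in practice, export BASE_CUSIP/CIK pairs from the database instead
BASE_CUSIP_TO_CIK = {
    "553044": 1014739,  # MIM CORP era (pre-2005)
    "09069N": 1014739,  # BioScrip era
    "68404L": 1014739,  # post-2019 reverse-merger era
}


def cik_for_cusip(cusip):
    """Look up the issuer CIK using only the six-character issuer prefix."""
    return BASE_CUSIP_TO_CIK.get(cusip[:6].upper())
```

The lookup returns the same CIK for 09069N108 and 68404L201 even though the full CUSIPs differ.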

Finally, you can filter on CIK – to access the entire listing just hit the Execute button without setting any operators/criteria. And of course you can export the results using the Save Results button.

End of Year Index Changes Coming – no action needed by you!

We are starting to prep for the end of the calendar year. We will consolidate all Y2021/Y2022 indexes into a Y2021 – Y2025 index and then create new Y2023 indexes for all filings made after 12/31/2022. Because of the software update we did earlier this year, you will not have to do anything. If you start the application after we complete the update around 1/1/2023, the old indexes will be merged and renamed. It does mean that if you are using one or more of our artifacts that include the FILENAME field, you might have to either manually change the path value or rerun your search on the new consolidated index.

To be clear – suppose you used our platform to identify the 8-K filings made so far in 2022 that related to an auditor change.

If you ran the summary extraction from this search, the FILENAME field will have the path to the folder Y2022-Y2022, as you can see in the next image:

Once we consolidate and re-index, that path will no longer be valid. However, the file will be in a very predictable location – we can replace the Y2022-Y2022 component of the path with Y2021-Y2025 and the file will be reachable using Python or one of the filepath filtering features.
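That repair is a single string replacement. A quick sketch, using an illustrative path (your FILENAME values will have whatever folder structure your instance uses):

```python
# illustrative FILENAME value from a pre-consolidation extraction
old_path = "S:\\directEDGAR\\Y2022-Y2022\\8-K\\example_filing.txt"

# after consolidation, swap the old index folder for the new one
new_path = old_path.replace("Y2022-Y2022", "Y2021-Y2025")
```

If you have a whole column of FILENAME values, the same `.replace()` call applied to each row fixes them all.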

I am particularly excited that when you open the application after this process is complete you will not have to do anything to access the new indexes – I can't screenshot it right now because it has to happen first – but the new index will be listed for you to select. Before the last update you would have had to run an Index Update through the options menu.

Significant Update to 13F Share Data – Read About Consequences of Chunking versus not Chunking

We completed a fairly intense update to the 13F Share Data available on our platform. First, unbeknownst to us, there were CUSIPs that sometimes had lower-case characters. Note to self: remember that filers make mistakes. We fixed that issue. More importantly, we have been working to identify more CUSIP-CIK mappings. We succeeded, and so we updated the 13F share data with the CIK wherever we were able to identify the CIK of the underlying issuer. The first problem only affected about 100K rows; the new mapping adds about 22 million new rows with the CIK as a value. This was significant – we almost doubled the number of observations that have a CUSIP-CIK mapping. We now have over 45 million observations with a CUSIP-CIK match. We still need to identify the CIK-CUSIP mapping for about 17 million rows.
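The lower-case fix amounts to a simple normalization step before any matching is attempted. A minimal sketch (the function name is mine, for illustration):

```python
def normalize_cusip(raw):
    """Upper-case and trim a reported CUSIP before attempting a CIK match.

    Filers make mistakes - lower-case characters and stray whitespace
    both show up in as-reported 13F data.
    """
    return raw.strip().upper()
```

Running this over the reported values before joining against the mapping table is what recovers rows like the roughly 100K affected by the lower-case issue.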

This gets dense now. One challenge with the next group is that some CUSIPs have been associated with two or more CIKs, and we have to sort out how to make sure we map these correctly. As an example of one of these cases, CUSIP 00206R102 belongs to the entity known today as AT&T (CIK 732717), which was formerly known as Southwestern Bell Corporation. However, we found evidence that it was attached (wrongly) to securities issued by the prior entity known as AT&T (CIK 5907), which was acquired by Southwestern Bell Corporation. I am confused as I am writing this because, well, it is confusing. We are still trying to develop the correct algorithm to assign the correct CIK in these cases.

Another challenge is that some of the CUSIPs are missing one or more leading or trailing zeros (0). We actually think this group will be the next one we update, because we have parsed all of the SEC's List(s) of 13F Securities and believe we can use these to address it. Our plan is to confirm, for those CUSIPs with one or more leading 0s, that the non-zero characters do not form a substring of some other existing, known CUSIP. For example, APPLE's CUSIP is 037833100. We will try to confirm that there is no other CUSIP that contains the sequence 378331. With that evidence and some fuzzy name-matching, we hopefully can conclude that when the reported CUSIP is 378331 and the name is close to APPLE, the proper CIK to map to in those cases is 320193. But all of this takes time.
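The leading-zero plan can be sketched in a few lines. This is an illustration of the idea, not our production algorithm, and the known-CUSIP set below is just the handful of CUSIPs mentioned in this post:

```python
def restore_candidates(reported, known_cusips):
    """Find known CUSIPs that contain the zero-stripped reported value.

    The mapping is only safe when exactly one known CUSIP matches;
    if several match, the reported value is ambiguous and we fall
    back to fuzzy name-matching against the issuer name.
    """
    return [c for c in known_cusips if reported in c]


# illustrative known-CUSIP universe drawn from this post
known = {"037833100", "09069N108", "68404L201", "553044108"}

# the APPLE example from the text: 378331 with the leading zero dropped
candidates = restore_candidates("378331", known)
```

Here `candidates` contains only 037833100, which is the evidence (plus a name close to APPLE) needed to map the truncated value to CIK 320193.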

I will observe that this is not quite as grim as it sounds. Some of the securities listed in the 13F-HR filings where we are missing the CIK are derivative securities (ETFs, trust shares and the like). We are trying to carefully identify these. I will go out on a limb and say, from reviewing some of this data, that at least 60% of the missing CIK values are from these types of derivative securities. ISHARES alone has more than 600 derivative securities listed in the Q3 2022 List of 13F Securities that are well represented in the holdings data. We may end up adding a field to flag these. As I am writing this I am waiting on some code to pull all unmatched pairs and their frequency to sort out our steps going forward.
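If we do add such a flag, a crude first cut would be a keyword screen on the issuer name. The keyword list below is purely illustrative (and deliberately over-broad; "TRUST" alone would catch some operating companies):

```python
# illustrative keywords only - a production flag would need a curated list
DERIVATIVE_HINTS = ("ISHARES", "ETF", "SPDR", "TRUST", "FUND")


def looks_like_derivative(issuer_name):
    """Crude first-pass screen for derivative securities by issuer name."""
    name = issuer_name.upper()
    return any(hint in name for hint in DERIVATIVE_HINTS)
```

A screen like this is only a triage step; the frequency pull mentioned above is what would tell us how much of the unmatched population it actually covers.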

In the meantime, here is another code example of working with the as-reported 13FSHAREDATA. I prepared this because of some encouragement to add additional explanatory comments. This is available on S:\PythonCode. I am going to prep a short video that demonstrates using this code. This example presumes that you have a list of CIKs that you want to use to identify relevant holdings. You can trivially modify the code below to pull based on CUSIP by supplying a list of CUSIPs and changing the keyword cik to cusip wherever it appears in this code example.

In the code below I demonstrate querying the database by quarter. I have written/spoken about my practice of trying to profile the optimal balance between chunk size and time. In this example I had 1,149 CIKs in my list. The output files contained a total of 10,657,575 observations. It took me 16 minutes to pull by quarter using one of our client instances. I modified the code to try pulling the same sample without conditioning on quarter. After 90 minutes I finally killed the job. It is not a CPU issue, it is a memory issue. I am having to stop myself from droning on here, but it is kind of cool – I have 1,149 CIKs, each is defined/exists in one place in memory and then there are references to their location in the data . . .! I will stop! Anyway – here is the code, and as noted above, it is available on S:\PythonCode.

import csv
import sqlite3

db_path = 'S:\\directEDGAR_DATA\\13FSHAREDATA.db'

# I highly recommend that you pull by year/quarter - it ultimately will be faster
# than even pulling by year because you are not having to use the page file memory as much.
# I am always messing around with trying to minimize total time for some operation and
# it is just true that this is a delicate balance.

years = [str(year) for year in range(2013, 2022)]
qtrs = ['-03-31', '-06-30', '-09-30', '-12-31']

# Suppose you have a list of CIKs and you want the individual reported transactions
# relating to the filers in that list. I am assuming below that your list is in a text
# file with no header and is in the Temporary Files folder on the instance you are
# working from. I am also assuming that your CIKs are not left-padded with 0 - if they
# are, open the list in EXCEL and then save as a text file.
with open(r"D:\PhotonUser\My Files\Temporary Files\sample_cik.txt") as fh:
    my_cik_list = fh.readlines()

# there is going to be a carriage return/line feed after each observation - this removes those
my_cik_list = [cik.strip() for cik in my_cik_list]

# we need to turn the list into a tuple as that is the object type an sqlite query prefers
# note: if your list contains exactly one CIK, the trailing comma in the tuple's string
# form will break the SQL - handle that case separately
my_ciks = tuple(my_cik_list)

# a folder that you created to contain the results
DEST_FOLDER = "C:\\PhotonUser\\My Files\\Temporary Files\\13Fdata\\"

for year in years:
    for qtr in qtrs:
        # we are going to save after each cycle - modify if you want to save less frequently
        # the problem you will face is that about one year of data will exceed the
        # 'capacity' of EXCEL if you have a large number of CIKs or CUSIPs
        results = []
        period = year + qtr
        conn = sqlite3.connect(db_path)
        # useful feature - preserves the mapping between the COLUMN and VALUE,
        # so the COLUMN NAME is persistent
        conn.row_factory = sqlite3.Row
        cur = conn.cursor()
        cur.execute(
            f"""SELECT * FROM THIRTEENFSHAREDATA WHERE periodofreport_DATE = "{period}" AND cik IN {my_ciks}""")
        rows = cur.fetchall()
        conn.close()
        # this just lets us see progress
        print(len(rows), period)
        if len(rows) == 0:
            print("no observations", period)
            continue
        for row in rows:
            results.append(dict(row))

        column_names = list(results[0].keys())

        # I am assuming you have created the destination folder - see above
        dest_file = DEST_FOLDER + '13FHOLDINGS_' + period + '.csv'
        with open(dest_file, 'w', newline='') as outref:
            my_writer = csv.DictWriter(outref, fieldnames=column_names)
            my_writer.writeheader()
            my_writer.writerows(results)


Off Topic (Somewhat) but Great Friday Reading

I say off-topic because this post is not about a new feature or update to our platform. Jack Ciesielski was the creator/founder of the Analyst's Accounting Observer. Jack and his team read financial statements and unwound practices they thought were dubious to create what he believed was a better representation of key metrics. Jack 'retired' a while ago, though he still serves on the EITF, the Investor Advisory Group of the PCAOB and the CFA Institute's Corporate Disclosure Policy Council. I was fortunate to meet Jack as he and his team were early directEDGAR customers (they were with us when we had to mail CDs with updates, and they were a pilot customer when we toyed with providing NAS devices to our customers).

I reached out to Jack a bit ago to see how he was doing in retirement. I guess he can’t stop thinking about accounting/markets and business because he shared that he transitioned from the Analyst’s Accounting Observer to a weekly newsletter “The Weekly Reader”. In this Jack has curated some interesting reads as well as a brief take on why they made it into the newsletter. He also offers a couple of bonus rounds. Here is a link to the latest version (Jack Ciesielski WR 11/11/2022) – if you would like to be added to Jack’s mailing list just send him an email (jciesielski [some important symbol here] accountingobserver.com). At least with Jack we can be assured that he is not trying to make money off of your email address!