We completed a fairly intense update to the 13F Share Data available on our platform. First, unbeknownst to us, some CUSIPs contained lower-case characters. Note to self: remember that filers make mistakes. We fixed that issue. More importantly, we have been working to identify more CUSIP-CIK mappings. We succeeded, and so we updated the 13F share data with the CIK wherever we were able to identify the CIK of the underlying issuer. The first problem affected only about 100K rows; the new mapping adds about 22 million rows with the CIK as a value. This was significant: we almost doubled the number of observations that have a CUSIP-CIK mapping. We now have over 45 million observations with a CUSIP-CIK match. We still need to identify the CIK-CUSIP mapping for about 17 million rows.
This gets dense now. One challenge with the next group is that some of these CUSIPs have been associated with two or more CIKs, and we have to sort out how to map them correctly. As an example, CUSIP 00206R102 belongs to the entity known today as AT&T (CIK 732717), which was formerly known as Southwestern Bell Corporation. However, we found evidence that it was wrongly attached to securities issued by the prior entity known as AT&T (CIK 5907), which was acquired by Southwestern Bell Corporation. I am confused as I am writing this because, well, it is confusing. We are still trying to develop the correct algorithm to assign the correct CIK in these cases. Another challenge is that some of the CUSIPs are missing one or more leading or trailing zeros (0). We actually think this group will be the next one we update, because we have parsed all of the SEC's List(s) of 13F Securities and believe we can use these to address it. Our plan: for those CUSIPs where one or more leading 0s may be missing, we will check every other existing, known CUSIP to see whether the non-zero characters form a substring of some other CUSIP value. For example, APPLE's CUSIP is 037833100. We will try to confirm that there is no other CUSIP that contains the sequence 378331. With that evidence, and some fuzzy name-matching, we hopefully can conclude that when the reported CUSIP is 378331 and the name is close to APPLE, the proper CIK to map to in those cases is 320193. But all of this takes time.
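The substring-uniqueness check described above can be sketched in a few lines. This is a minimal illustration, not our production code; the function name is mine, and the non-Apple CUSIPs in the sample set are just stand-ins.

```python
def unique_expansions(candidate, known_cusips):
    """Return every known CUSIP that contains the (possibly truncated)
    reported value as a substring. If exactly one match comes back, the
    truncated CUSIP can plausibly be repaired to that full CUSIP."""
    return [c for c in known_cusips if candidate in c]

# Apple's CUSIP is 037833100; the other two entries are illustrative stand-ins.
known_cusips = {'037833100', '594918104', '88160R101'}

# '378331' is Apple's CUSIP with the leading 0 and trailing zeros dropped.
matches = unique_expansions('378331', known_cusips)
print(matches)  # ['037833100'] - a unique match, so the repair looks safe
```

In practice the uniqueness test alone is not enough, which is why we pair it with fuzzy name-matching before committing to a CIK assignment.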
I will observe that this is not quite as grim as it sounds. Some of the securities listed in the 13F-HR where we are missing the CIK are derivative securities (ETFs, trust shares, and the like). We are trying to carefully identify these. Going out on a limb after reviewing some of this data, I would say at least 60% of the missing CIK values come from these types of derivative securities. ISHARES alone has more than 600 derivative securities listed in the Q3 2022 List of 13F Securities that are well represented in the holdings data. We may end up adding a field to flag these. As I write this, I am waiting on some code to pull all unmatched pairs and their frequency so we can sort out our steps going forward.
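The frequency tally I am waiting on amounts to counting (CUSIP, name) pairs that lack a CIK. A hypothetical sketch with `collections.Counter` (the pairs below are dummy values, not real CUSIPs):

```python
from collections import Counter

# Stand-in for however you iterate over holdings rows that have no CIK match.
unmatched = [
    ('AAA111222', 'SOME ETF TRUST'),
    ('AAA111222', 'SOME ETF TRUST'),
    ('BBB333444', 'ANOTHER UNIT TR'),
]

# Tally each (cusip, name) pair and list them from most to least frequent,
# so the highest-impact unmatched securities surface first.
freq = Counter(unmatched)
for (cusip, name), n in freq.most_common():
    print(cusip, name, n)
```

Sorting by frequency like this is what lets us prioritize: fixing the most commonly held unmatched securities first recovers the most rows per unit of effort.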
In the meantime, here is another code example of working with the as-reported 13FSHAREDATA. I prepared this because of some encouragement to add additional explanatory comments. It is available on S:\PythonCode, and I am going to prep a short video that demonstrates using it. The example presumes that you have a list of CIKs that you want to use to identify relevant holdings. You can trivially modify the code below to pull based on CUSIP instead: supply a list of CUSIPs and change the keyword cik to cusip wherever it appears.
In the code below I demonstrate querying the database by quarter. I have written/spoken about my practice of trying to profile the optimal balance between chunk size and time. In this example I had 1,149 CIKs in my list, and the output files contained a total of 10,657,575 observations. It took me 16 minutes to pull by quarter using one of our client instances. I modified the code to try pulling the same sample without conditioning on quarter; after 90 minutes I finally killed the job. It is not a CPU issue, it is a memory issue. I am having to stop myself from droning on here, but it is kind of cool: I have 1,149 CIKs, each is defined/exists in one place in the memory stack, and then there are references to their location in the data . . .! I will stop! Anyway, here is some code, and as noted above, it is available on S:\PythonCode.
import sqlite3
import csv

db_path = 'S:\\directEDGAR_DATA\\13FSHAREDATA.db'

# I highly recommend that you pull by year/quarter; it ultimately will be faster
# than even pulling by year because you are not having to use the page file memory as much.
# I am always messing around with trying to minimize total time for some operation and
# it is just true that this is a delicate balance.
years = [str(year) for year in range(2013, 2022)]
qtrs = ['-03-31', '-06-30', '-09-30', '-12-31']

# Suppose you have a list of CIKs and you want the individual reported holdings relating
# to the filers in that list. I am assuming below that your list is in a text file with
# no header and is in the Temporary Files folder on the instance you are working from.
# I am also assuming that your CIKs are not left-padded with 0 - if they are, open the
# list in EXCEL and then save it as a text file.
with open(r"D:\PhotonUser\My Files\Temporary Files\sample_cik.txt") as fh:
    my_cik_list = fh.readlines()

# There is going to be a carriage return/line feed after each observation - this removes those.
my_cik_list = [cik.strip() for cik in my_cik_list]

# We need to turn the list into a tuple as that is the object type an sqlite query prefers.
my_ciks = tuple(my_cik_list)

# Build one "?" placeholder per CIK so sqlite binds the values itself - this also
# avoids the trailing-comma problem a one-element Python tuple creates in raw SQL.
placeholders = ','.join('?' for _ in my_ciks)

# A folder that you created to contain the results.
DEST_FOLDER = "C:\\PhotonUser\\My Files\\Temporary Files\\13Fdata\\"

for year in years:
    for qtr in qtrs:
        # We are going to save after each cycle - modify if you want to save less frequently.
        # The problem you will face is that about one year of data will exceed the
        # 'capacity' of EXCEL if you have a large number of CIKs or CUSIPS.
        period = year + qtr
        conn = sqlite3.connect(db_path)
        # Useful feature - preserves the mapping between the COLUMN and VALUE,
        # so the COLUMN NAME is persistent.
        conn.row_factory = sqlite3.Row
        cur = conn.cursor()
        cur.execute(
            f"SELECT * FROM THIRTEENFSHAREDATA "
            f"WHERE periodofreport_DATE = ? AND cik IN ({placeholders})",
            (period,) + my_ciks)
        rows = cur.fetchall()
        conn.close()
        # This just lets us see progress.
        print(len(rows), period)
        if len(rows) == 0:
            print("no observations", period)
            continue
        results = [dict(row) for row in rows]
        column_names = list(results[0].keys())
        dest_file = DEST_FOLDER + '13FHOLDINGS_' + period + '.csv'
        # I am assuming you have created a folder to contain the results - see above.
        with open(dest_file, 'w', newline='') as outref:
            my_writer = csv.DictWriter(outref, fieldnames=column_names)
            my_writer.writeheader()
            my_writer.writerows(results)
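Since the loop above writes one CSV per quarter, you may eventually want them stacked back into a single file. A minimal sketch, assuming every per-quarter file shares the same header (the function name is mine, not part of the example above):

```python
import csv

def stack_csvs(part_paths, combined_path):
    """Stack several CSVs that share a header into one file,
    writing the header only once."""
    with open(combined_path, 'w', newline='') as out:
        writer = csv.writer(out)
        wrote_header = False
        for part in part_paths:
            with open(part, newline='') as fh:
                reader = csv.reader(fh)
                header = next(reader)  # consume this file's header row
                if not wrote_header:
                    writer.writerow(header)
                    wrote_header = True
                writer.writerows(reader)

# For example, stack every per-quarter file produced by the loop above:
#   import glob
#   parts = sorted(glob.glob(DEST_FOLDER + '13FHOLDINGS_*.csv'))
#   stack_csvs(parts, DEST_FOLDER + 'ALL_13FHOLDINGS.csv')
```

Note the combined output is named so it does not match the `13FHOLDINGS_*` glob pattern; otherwise a second run would try to stack the combined file into itself.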