Our previous work to establish a proxy for the timestamp for 8-K filings prompted one of our clients to ask whether it was possible to do the same for 10-K filings. We used the same strategy and were able to add a good proxy for this field to about 97,000 10-K filings where it was not available in the header. With this addition we have 318,304 10-Ks with an ACCEPTANCE-DATETIME value and 7,610 missing a value. While working on this it seemed reasonable to collect as much metadata from the filings as possible. For example, I saw a recent paper that reported that most of the other databases we use for research have only the most recent address. They described having to go to EDGAR to collect this field. There does not seem like a lot of value for you to have to jump through that hoop. Therefore we added most of the fields included in the header or on the INDEX page of the filings.
There is a new compressed folder in the EXTRAS (see here how to access) folder that has a copy of the sqlite3 database (and some code etc). In addition to the information that was reported in the header files we added the DE_PATH – this is not the path to the 10-K but the path to the folder where the 10-K is stored. This will allow you to match search results from directEDGAR to a query from the database with just a little Excel magic on the FILENAME column from a Summary or Context extraction from directEDGAR.
I really thought this was going to take two days once I had sorted out this goal. We had already pulled the ACCEPTANCE-DATETIME measure for all of the filings. We have the headers archived. I thought it was going to be a trivial mapping exercise. It was much more complicated because there are some variations in the fields included in the headers (I learned that there is a field called CONFIRMING-COPY) and then there was a special challenge with the addresses.
As I was looking at the data I was questioning the value of using the codes that the SEC requires filers to use for the STATE field in the BUSINESS-ADDRESS and the MAILING-ADDRESS sections of the header. Who wants to pull this data and then go find the meaning of the value M2 (Jordan) to organize the results for your model? So I decided that we needed to map the values the filers use when preparing their filings to their descriptive values. This was interesting. The only source I could find for the standardization of these fields from the SEC was limited. First, Lexis-Nexis managed the dissemination platform early and I found a page that listed the codes that were to be used (Archived Country Codes). Notice that the list of codes is incomplete – we had to find cases where the code was not listed on the page but included in the header, read the header and assign the code value based on the information we could glean from the filing (and of course Professor Google). The SEC made some changes so that beginning in February 2010 these code were to be used (New State/Country Codes). There is no specific date for the changeover and no easy way to tell so we mapped to 2/1/2010 to the old and subsequent filings were mapped to the new. While there is consistency for the US states the codes for most other jurisdictions changed. For example, West Germany was I8. Under the new regime Germany was assigned the value 2M.
While my primary reason to assemble this is to facilitate your research, a second and still important reason is to make stuff available that might be interesting to muck around with in class or to use for student projects. Therefore I thought it was important to do the STATE field transformation. Note, I did this transformation to both the BA_STATE (Business Address) and the MA_STATE (Mailing Address) fields.
I want to close this by observing we simply organized the data as reported. There is operator (filer) error. I was showing Manish the finished product and thought it would be fun to identify all filers from FINLAND. There were only three observations. However, one of those is an error. China Natural Resources (2000 10-K Link) entered H9 in the STATE field. In prior filings they used K3 (Hong Kong). This of-course brings up the question – Should we correct these? Maybe later, that would be a significantly different project.
I included an html file that has the country codes (both old and new) in the folder. The queryDB.py file includes the list of fields and there are two csv files. One is the input file used in the query that restricts the results to a specific list of CIKs and the other is the output from that query. I have not yet had a chance to play with Pandas so my code still reads the input and writes the output using the CSV module. I thought about modifying Antonis’s code but since I can’t test it – I will leave it to those of you who are used to the Pandas library.
As a final note the data is current as of 12/22/2021. We will add this to our monthly or quarterly update flow once we get through the transition to 2022 in our search platform. I considered waiting until after 12/31 but there are too many tasks that need to be addressed with the beginning of a new calendar year. Thanks for our patience and I hope you find this useful.