A couple of weeks ago I posted that we would be disseminating a new artifact through our platform that reports the standardized name of the auditor. We are not quite finished, so the artifact will be delayed. However, I am implementing an interim solution. In an earlier post I noted that we would distribute some Python code that might be useful for specific problems (Available Code). At the time the intent was to store the code in a folder on the shared APPSTREAM drive called AvailableCode. I have since renamed the folder to EXTRAS, and it will contain more than just code snippets.
In this case I am adding a new zip file named TEMP_AUDITOR.zip that contains a csv file with the data we currently have available for auditor details. One of the fields/columns in the csv file is named deid. This is an abbreviation for directEDGAR ID and can be used to match the name of the auditor to the 10-K filing that the name was pulled from. The deid is created by concatenating the CIK, RDATE, CDATE and F## (where ## represents the last two digits of the accession number for the filing) with dashes (-).
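Building the deid in code is straightforward. A minimal sketch, assuming you already have the four components as strings (the values below are just the example key that appears later in this post):

```python
# Join the four deid components with dashes.
# The component values below are illustrative.
cik = "20"
rdate = "R20050331"
cdate = "C20050101"
f_suffix = "F28"  # "F" + last two digits of the filing's accession number

deid = "-".join([cik, rdate, cdate, f_suffix])
print(deid)  # 20-R20050331-C20050101-F28
```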
To match these results to search results, run your search and complete the extraction (Summary or perhaps Context) that you need. All search results include the path to the source document. The path includes the CIK and the RDATE-CDATE-F## separated by a folder slash (/). For example, my search results included a document with the following path:

Use the Excel Text-to-Columns tool to separate the path components and then concatenate the CIK and the RDATE-CDATE-F## value – so your new variable will be 20-R20050331-C20050101-F28. That value is in the auditor_details.csv file as shown in the cropped image below:

You can match between your file and the source file using Excel’s VLOOKUP function or write some Python code to do this dynamically.
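If you prefer Python over Text-to-Columns and VLOOKUP, the same match can be sketched with pandas. The path layout (CIK folder followed by the RDATE-CDATE-F## folder just above the document) and the auditor column name below are assumptions for illustration; adjust them to your actual extraction output and the columns in auditor_details.csv:

```python
import pandas as pd

def path_to_deid(path: str) -> str:
    """Rebuild the deid from a document path whose last two folders
    are the CIK and the RDATE-CDATE-F## value (an assumed layout)."""
    parts = path.replace("\\", "/").rstrip("/").split("/")
    cik, rdate_cdate_f = parts[-3], parts[-2]
    return f"{cik}-{rdate_cdate_f}"

# Toy stand-ins for real data; in practice load your extraction results
# and auditor_details.csv with pd.read_csv.
results = pd.DataFrame({"path": ["X:/20/R20050331-C20050101-F28/doc.htm"]})
auditors = pd.DataFrame({
    "deid": ["20-R20050331-C20050101-F28"],
    "auditor_name": ["Example Auditor LLP"],  # hypothetical column name
})

results["deid"] = results["path"].map(path_to_deid)
merged = results.merge(auditors, on="deid", how="left")  # pandas' VLOOKUP
print(merged[["deid", "auditor_name"]])
```

The `how="left"` keeps every search result even when no auditor row matches, which makes the gaps easy to spot.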
In addition to the csv file I am also including a SQLITE3 database file in the folder as well as some Python code that has a sample query of the database. With all of this pressure to increase the data literacy of our students, I wondered if one or more of you might want to poke at the database and perhaps use it in class and/or share with students what it looks like when you are working on your research. If you are not familiar with SQLITE3, the maintainers of SQLITE3 have some great (accessible to humans who were not born as coders) documentation (SQLITE3 Python Docs). There is also an open-source desktop interface (SQLITE3 GUI) that is documented for humans. The desktop application supports SQL queries as well as changes to the database. We use databases under the hood in our platform, but I have little experience working with them. Others have handled those facets of our development. As a novice I was very comfortable using the documentation available at those links to begin the process of creating the database and testing queries to identify the gaps. I am trying to say it was not a steep learning curve.
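To give a feel for the workflow, here is a minimal sqlite3 sketch. The table and column names are invented for illustration (the actual schema is in the distributed queryDB.py), and it uses an in-memory database so it runs anywhere; in practice you would point connect() at the .db file from the EXTRAS folder:

```python
import sqlite3

# In-memory database so this sketch is self-contained; replace ":memory:"
# with the path to the distributed .db file. Table/columns are placeholders.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE auditors (deid TEXT, auditor_name TEXT)")
con.execute(
    "INSERT INTO auditors VALUES (?, ?)",
    ("20-R20050331-C20050101-F28", "Example Auditor LLP"),
)

# Parameterized query: look up the auditor for one deid.
row = con.execute(
    "SELECT auditor_name FROM auditors WHERE deid = ?",
    ("20-R20050331-C20050101-F28",),
).fetchone()
print(row[0])  # Example Auditor LLP
con.close()
```

The `?` placeholders are worth teaching early: they keep student code safe from quoting mistakes and SQL injection.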
To download this extra to your computer please follow the steps outlined near the bottom of this post (Python). I also prepared a video that can be accessed using this link (YouTube) that demonstrates the process (believe it or not, only 2 minutes long!). The folder will have a csv file with all of the current data, the db file, and one Python file named queryDB.py. The Python file has some sample queries and also has code to extract the data and save it as a CSV file.
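For readers curious what "extract the data and save it as a CSV file" looks like in code, here is one common pattern that takes the header row from the cursor's metadata. The table and columns are again placeholders rather than the real schema, and the in-memory database stands in for the distributed .db file:

```python
import csv
import sqlite3

# Placeholder schema; substitute the real .db file and table name.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE auditors (deid TEXT, auditor_name TEXT)")
con.execute(
    "INSERT INTO auditors VALUES ('20-R20050331-C20050101-F28', 'Example Auditor LLP')"
)

cur = con.execute("SELECT deid, auditor_name FROM auditors")
with open("auditor_export.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow([col[0] for col in cur.description])  # header from metadata
    writer.writerows(cur)  # stream the remaining rows
con.close()
```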
FYI – The delay is because some of the quality checking is taking more time than we expected. The file has 109,508 rows (10-K/auditor matches). We are missing approximately 1,800 (1.64%) for the collection of CIKs we are working with in this window. These have to be addressed manually, and that takes time. For example, we have 282 CIKs that filed a 10-K with a 2012 and a 2014 CDATE (balance sheet date). However, we failed to identify the auditor for their filing with a 2013 balance sheet date. So far we have confirmed that 33 of these did not file a 10-K for a fiscal year ended in 2013 (yes, this happens). For others the audit report is an image file, or there were other issues that caused our algorithm to reject the initial results. We are researching these and will update the files and disseminate the artifact when we get the missing rate a bit lower. Note we do have data for periods before 2005 – that will be added in the near future.