Introduction to our Database Query Tool

This is long overdue – I hope to make you comfortable with learning how to use the Query feature we built into the platform. To access the Query interface – select the Query Databases menu item on the top menu bar.

The platform updates dynamically now so the list of databases will reflect whatever databases are currently available, when we add a new database between your sessions with our platform the names of any new databases will be included the next time you launch a session.

The Query Databases tool with current databases listed.

When you select a particular database the Display Columns and Criteria Columns section of the interface will populate based on the definition of the table (database) that is selected. Currently, all of our databases are single table databases.

Display and Criteria populated after selecting the 13FSHARESDATA database.

Display and Criteria populated after selecting the DIRECTOR_COMP database.

If you are familiar with SQL variations think of the Display Columns selector as more or less equivalent to the SELECT statement in SQL. Display Columns determines what will be displayed and available in the output. I think by default the first eight fields are selected in all cases. To select more fields click on them. To remove a selected field from the list select it.

Removed RID from DISPLAY Columns and added TOTAL and SEC_NAME

If you want to browse the data you can just hit the Execute button without specifying any selection criteria – however, some of the databases are huge (8Kmeta, FILING_TIMES and 13FSHAREDATA from this view) and the loading time could be significant. With others though, like auditors and the compensation data – the loading time is trivial. It took me approximately 10 seconds to load all of the 567,948 records from the DIRECTOR_COMP data displayed in this image.

Selection of all records from the DIRECTOR_COMP database

You can add fields after Execute and they will load into the display (and be included in the output). Sometimes the scrollbar at the bottom of the viewer might not be immediately visible, if you want to browse the data just pull at the right bottom corner to get access to the horizontal scroll bar.

Full display of DIRECTOR_COMP after adding in the RID field and activating the horizontal scroll bar.

To save the results from the viewer use the Save Results button to activate the Save Results dialog. The default filename is DBSaveResults.csv. You can change that to whatever is appropriate.

The application will report when the file has been saved. Since we are saving these as csv files it is important to remember that when we open them with Excel – Excel can wreak havoc on some fields by transforming them and persisting that transformation after you save the file using Excels Save As dialog. For instance a CUSIP like 1234E5671 might be saved as an exponential value. I have gotten into the habit of saving with a new name and using the original file as my input to my Python code. It is also important to remember that Excel has row and column limitations – you can create csv files with this tool that will not fully open and this will be reported by Excel.

View of initial QUERY results of DIRECTOR_COMP results.

The Criteria Columns feature allow us to specify the basis by which the application selects the rows to display (the WHERE in a SQL statement). The Comparison Operators change based on the nature of the column selected. If the column is an integer the options include Equals, Between, Greater Than or Less Than.

When a field that is defined as an integer is selected the operators allow us to specify a range, =, >, or < criteria.

Once you have specified the column, operator and values and hit the Add Criteria button the application will construct the appropriate statement and load it into the Current Criteria view. I have selected ROW_ID between 150000 and 250000.

To add additional criteria we need to use one of the displayed Boolean operators (AND/OR). We did not build in an AND NOT button but you can type it (just as you can directly type your criteria in the Current Criteria box without using the guided selection tool). In the image below I added AND NOT and then specified GENDER Equals F. Notice that the interface adds quotes as needed.

Additional criteria added to select rows from the DIRECTOR_COMP data.

After execution of the selection criteria, 86,676 rows selected.

Random note, we actually have some corporate entities listed in some Director and Executive compensation tables. In these cases GENDER is blank. This means that the number of rows of M + number of rows labeled as F is usually less than the total number of rows.

The Contains criteria is equivalent to the SQL LIKE operator (and that is how a query using CONTAINS is constructed). By default the application adds the % wildcard to both ends of the string you specify. I wondered if there were directors with my name or something similar so I typed URCH in the Value to Compare box and hit the Add Criteria – notice the wildcards added by default.

57 Results remain after adding the PERSON_NAME LIKE %urch% criteria.

While the application adds the wildcards by default you can edit, and say remove one to better specify your particular requirement. By removing the ending wildcard I am limiting my search to those rows with a name ending in URCH.

Modifying the LIKE clause to require names ending in URCH rather than names with URCH anywhere in the name. Notice the 14 results compared to the 57 in the previous image.

To use a list of CIKs rather than having to type many or (CIK = “CIK_1”) or (CIK = “CIK_2”) we can select a csv file that has a column named CIK (no leading or trailing spaced) with a list of CIKs (as integers not left padded with 0s). Just select the Use CIK button above, and then hit the CIK Filter button to activate a file selection dialogue:

Set CIK Filter dialogue to select a CIK file to filter results.

I had a list of 2,600 unique CIKs and I did not specify any additional criteria. Pulling the list of over 130,000 observations took about 16 seconds.

Filtering just on CIK – sample list of CIKs had 2,600 unique CIKs.

Probably a more common basis for selection would be by a list of CIKs and YEAR values. By selecting on YEAR the Comparison operators change to reflect those for integer values so I select between and specify the range 2011 to 2020. This query took quite a bit longer, the addition of the YEAR value caused the total execution time to jump to about 95 seconds.

Filtering on CIK and specifying a year range. Total execution time about 95 seconds.

I think the last tip I need to share is the use of the text IN Comparison Operator. This is useful when we need to select multiple values for the same field. To illustrate this supposed I want to identify all audit data relating to audits where the auditor location in the signature was in the states of North Dakota, Nebraska, Colorado, Wyoming, Pennsylvania, Oregon, South Carolina, Alabama, Michigan, Texas and Arizona. We could construct a query using or clauses (location_state = “NORTH DAKOTA”) or (location_state = “NEBRASKA”) . . . or (location_state = “ARIZONA”).

A less tedious strategy is to create a comma delimited list using some other application of the values you want to be selected and then paste that list into the Value to Compare box after selecting the IN Comparison Operator. In the image below I have prepared a comma delimited list of relevant state names.

State names in a comma delimited list.

If you paste this into the Value to Compare box and hit the Add Criteria button the application will quote each separate value and prepare the appropriate query expression.

After adding a list of states and allowing the application to prepare the list for a query.

Query results after filtering/selecting based on multiple location_state values.

There are times you can get the dreaded Not Responding message when you are trying to pull a large number of records. Not Responding means the application does not have any threads available to respond to user input. In the image below I wanted to get that message so I decided to retrieve all records from the FILING_TIMES database.

Not Responding while trying to retrieve all records from the FILING_TIMES database.

Frankly, I think it is cool – we have this hard drive like object in Oregon and a desktop interface that is physically separate from the hard drive like object and physically separate from my home in Omaha where I am working on this post. All of this stuff has to be pulled and matched up and it can take a while. The whole database is 4.5 gb with over 21.5 millions entries and we think it is unlikely that our users are going to need all of those values at one time. I mentioned in an earlier post that by limiting to a set of particular filings and a list of about 1,800 specific CIKs the query took about 2 minutes.

I learn so much from questions I end up getting from clients and I fully expect someone to really test the limits of this tool as others have tested the limits of other features (and thus driven improvements). I was not sure of the limits of operators in a query expression so I decided to test this out. I used Excel to create a series of CIK = expressions (alluded to above) using the list of 2,600 CIKs I worked with to create the DIRECTOR_COMP example.

Construction of a series of cik= expressions using Excel.

The application returned a dreaded Unhandled exception message that basically reports that we are limited to 1,000 operators in an expression.

An Unhandled exception is an exception that we did not imagine could happen when we designed the code. If you get one of these messages – since we have a live connection to a database at that time it is best to hit the Quit button and restart the application as we probably need to figure out how to close and cleanly reopen the connection when these errors occur. If you don’t you might get another error because the previous error left the database in an untenable position.

I closed the application and restarted it then tried again with 995 CIKs (with 994 or statements) and it worked like a charm. Interestingly, it was much faster than using a CIK list of 995 unique CIKs.

Using 994 OR statements to select director compensation data.

I hope you find this useful. We are still working on data porting. I have another extensive post related to the 13F data coming but I am still waiting for some processing on the database to complete to I can complete the post.

directEDGAR

Search, Extraction & Normalization Engine

Introduction to our Database Query Tool

Like this:

Related

Leave a ReplyCancel reply

Share this:

Like this:

Related

Leave a ReplyCancel reply

Discover more from directEDGAR