Collecting Real Data – Use all of our tools and keep it focused!

In early June I spent some time with a client helping them think through a strategy to collect some data from 10-K filings. I heard back that the session was productive – I have wanted to recount the session but they are a PhD student working on a summer paper that they hope to develop into their dissertation. So I have been waiting until I could identify something analogous but was pretty far from the data they were trying to collect – something came up so here goes.

Suppose we want to collect the advertising/marketing costs that are reported anywhere in the 10-K.

Step 1 – Identify the Sample to Develop a CIK List

The the first question I asked the client was – did they have a CIK list. Their response was that they would match back to CRSP (in their case) after they collected this data. My argument was that the direction should be from the constraint to EDGAR – not the other way around. There are two reasons for this. First, the last step of the data collection is likely going to involve some manual processing. Why manually process some data that is not going to be used because the filer is not included in the constraining database? The second reason is that it is unlikely that we can anticipate all of the ways that a filer will express some concept. This example is about collecting the amounts reported as advertising costs. When I started this I was thinking marketing and advertising. I did not think about the fact that some filers might use the phrase promotional costs as a synonym for advertising costs. I discovered this by looking at filings made by my sample for which I did not find any results using advertising costs. Specifically, my sample included Macy’s. I ran a search for “advertising costs” and Macy’s CIK was in the list of missing. Can you imagine Macy’s not reporting advertising expense or costs? I had to investigate that.

Below is the result of searching Macy’s 10-K for advertising while trying to understand why there were no results for advertising costs from a Macy’s filing in the initial search.

If you can’t make out the search above, it was for advertising and (DOCTYPE contains(10K)) and (CNAME contains(macy*)). So Macy’s reports advertising and promotional costs. Knowing that is helpful.

In summary – by focusing on a specific list of CIKs we can limit the amount of unnecessary work to do the final cleaning of our data AND we can scan filings for those with no data and learn how they make the disclosure. Basically we are putting a really tight fence around our problem to make solving it more direct.

Step 2 – Check to See if your Disclosure is in the iXBRL Data

We are trying to provide you a comprehensive platform to let you focus on your research and not get bogged down when the data can be pulled directly in the form you need. I noticed in one of my first searches that this data does sometimes appear either directly in the income statement or in a schedule in the notes. If we have processed the iXBRL data from that filer then it is likely available. Below is a search result from (advertising w/2 (expense or cost)). Note that the search was filtered on CIK. Look at that table in the notes.

Because the data might be available from the normalize iXBRL I am going to switch to the Query Database tool to see if I can find all rows of data that have the word root advert in the Original Row Label (orig_row_label). I am also going to include any rows where words rooted on advert are in the tag and I to CIK limit the search. Here is a screenshot of that effort.

That result count represents rows of data. And these are not perfect yet. But we can dump them into Excel (using the Save Results button) and determine their relevance pretty easily. In my case, I ended up with 456 CIKs where the data was relevant for this example. I sorted on the attributes and eliminated those rows that had an instant value – as these are balance sheet items. I found rows where the label was describing advertising revenue or the name described the receipt of franchise contributions for advertising and many other activities that were not relevant to this particular case. While there was a lot of noise, the noise was easy to clean out because of the ability to filter on the various fields in Excel. Now I can take those CIKs out of my CIK list. I don’t need them any longer. Clearly we can use the same strategy to find example/disclosures of promotional expense(s) as well as marketing expense(s).

Step 3 Identify those in the Sample that Do Not have the Disclosure

This is actually a really important step and why you should be working with a sample. You have a sample of firms that are relevant but it is not necessarily true that all of them will have the disclosure you are looking for. You can search for those filings made by your sample that do not have any of the expected key words. This will also allow you to scan those filings to see if you need to expand your key word search. I ran a search for (DOCTYPE contains(10K)) and not (marketing or advertising or promotional). Here are the results:

I have each one of those filings immediately available to review. All I know about these is that they are filings made by companies that have some characteristic that caused them to be in my sample and they do not have any instance of any of the words marketing, advertising or promotional. It does not mean that there is not relevant information – but my expectations about how the information will be disclosed might need to be refined. Or it might be that there is no relevant disclosure. If after review I determine these filings are not going to have my required disclosure I can log that and Remove these CIKs from my CIK list.

Step 4 Start Doing Really Focused Searches

For the rest of our sample we have to go back and run searches. If you don’t remember – you can search for the presence/proximity of numbers in your search. I am going to start with really narrow searches. I am doing that because I am trying to collect data and I want to make the next step very directed. I am going to also set the context extraction to 20 words.

The search I am running is (advertising w/2 (expense* or cost*)) pre/10 (1~~99) and(DOCTYPE contains(10K)). The wildcard is to anticipate expense or expenses & costs or cost. Here is a screenshot of the results:

I am not at all disappointed in those results. To me they are perfect because they are so focused. I have highlighted in red the relevant context from the filings of 3 different companies. As a faculty member at a non-PhD university – if I actually wanted this data I have something I can pass to an undergraduate assistant for help normalizing. I could also write some Python code to normalize this. I have lots of options because it is so focused.

This iteration allowed me to identify the disclosure with the data I needed for 907 companies from my sample. I could have increased that count if I had a broader search phrase – but it would have added noise to the output that I would have had to think about how I would filter.

Probably the first way I would broaden the search is to increase the space between my (advertising w/2 (expense* or cost*)) and my numbers (1~~99). Another alternative is to require the numbers before my words (1~~99) pre/10 (advertising w/2 (expense* or cost*)). Clearly I need to allow for marketing and promotional. If you keep the search focused, each iteration will yield results that should be relatively easy to convert into the data you need. I certainly could have just run a search for advertising expense or marketing cost – but that would return a huge amount of noise. It is well known that I only have one brain cell. I much prefer to work with very focused results. We certainly could run a search like (promotion* or marketing or advertising) w/20 (expense* or cost*) w/20 (1~~99). The problem with that is there is just going to be more noise. Here is a screenshot of one of the results from such a search:

There is nothing useful in that particular result. From what I could tell – more than 40% of the results from that search were just noise.

If you do run such a search since you are allowing 60 words between your first and last word you would probably need to set the span for context extraction to something like 100. I am just guessing but that would be where I would start.

Again, in summary – start with a CIK list. If possible check our iXBRL data. Look for filings from your CIK list that will not have any matches and then do repeated focused searches. My experience is that ultimately this strategy will save you significant time.

I would like to add – I have been working on a project with the XBRL data. If you have a project and believe that for some part of your sample the data might be available from the XBRL let me know and we can easily pull it for you. I have done that for two people this summer. Right now we are struggling to identify ways to make it more accessible so you can do this on your own but we are not there yet. I know from the two cases I worked on this summer that this saved our users significant time. The comprehensive data goes back to 2013 but we have it back to 2009- for some filers. Send me an email if that seems likely to help your case.

Comparing our iXBRL Data with SEC Data Store

I have had several questions about using our iXBRL data versus the freely available data available from the SEC. I think the simplest explanation is that our data begins with the 10-K (and soon other filings) whereas the SEC data is somehow sourced from the R# files and/or the Financial_Report.xlsx files that are packaged in the filing. So when I say our data begins with the 10-K – I mean to share that we process the 10-K document/form. Our processing does not consider the R# files or the xlsx files because we learned (through lots of frustration) that the presentation in those files can vary from the presentation in the 10-K.

We have discovered a reasonably large number of anomalies/discrepancies when we have attempted to compare our data (we want to be sure we did it right) with the SEC data when pulled and processed from the archive they provide. The discrepancies involve both labels, values and signs. Most often it is a difference in sign – we will report a value as negative/positive but when the matching value is pulled from the SEC archive the value will be positive/negative. Again – there are differences in values as well as the labels but it is mostly sign differences.

Here is a concrete example. – this link (this one) will take you to the R59.htm file from Show Carnival Inc.’s 10-K filed on 3/22/2024. Here is a screenshot of that table from the Notes Details page:

I placed a red box around data that we report as positive numbers in our database. Here is the data for the period ending 2/3/2024 from the directEDGAR platform:

Notice that we report the 846000 as a positive number and if you process the SEC data using the tsv files and follow the rules you will end up with a negative value for that data value. The actual data value in the SEC archive is -846000 (negative) and there is not a negating indicator in the pre.tsv for that data. Note the other values (-562000 & -992000) are also negative in the SEC archive; we report these as positive numbers in our database

Who is right, well we both are – the screenshot above from the Interactive Data illustrates the data value as generated from the SEC archive. The SEC archive accurately and precisely allows you to reconstruct the Rfiles. Our data allows you to access the data as it was reported within the 10-K. Here is a screenshot of the data from our database as it was presented in the 10-K.

It is interesting (perhaps only to an SEC wonk) that these files differ. We have known they differed for a long time because they did not prove to be a reliable way to construct the tax recon and deferred tax tables. As you can see above, the order differs as well as the sign.

Here is an example of a label difference:

This is from the segment adjustment of Reconciliation of Adjusted Net Income to Net Income in the details in the Rfiles. We have the Acquisition and transaction expenses labeled as Less: Equity-based compensation expenses. You can also see that the SEC label for the 4,140 value is Less: Losses on the modification or extinguishment of debt and capital lease obligations. We label that value as Less: Acquisition and transaction expenses.

Random Advertisement

I was finishing up an email to send our clients when I received an email asking about identifying the date of director resignations. I got a bit lost in that because I had never really thought about that issue. I know that they are supposed to be reported in the 8-K as Item 5.02. The problem with that reporting is that code also is used to identify “Departure of Directors or Certain Officers; Election of Directors; Appointment of Certain Officers; Compensatory Arrangements of Certain Officers” So it was relatively easy to provide some direction on how to build a search limiting the results to those 8-K filings with an ITEM 5.02 and additionally conditioning the results to those that have some word rooted on resign within a short span of words rooted on director. Because this is a rarer disclosure I selected all 8-K filings archives rather than doing them one by one. Here is the search I ran:

 (ITEM_5.02 contains(YES))  and (resign* w/10 director*)

Here is a screenshot of the results – it is pretty cool – I found 97,661 documents in a bit under 2 minutes

One reasonable questions is why did I find the first one listed if my ITEM code constraint was set to 5.02 but that filing was coded as ITEM 6. That is because we re-coded all past (archived) filings when the new coding requirements were mandated by the SEC sometime in 2004 or 2005. Thus, any past filing that had a filing reason code of 6 – we tagged in our metadata as a 5.02. Now that is slick!

However, as I was scanning through the results I was thinking about some rather interesting resignation letters that I had seen in the past that were not showing up in these results. The current guidance requires that resignation letters from officers and or directors be filed as Exhibit 17s and normally I would expect them to be associated with an 8-K but given my knowledge of the variation in the way the rules are followed I did another search. This time I searched all filings other than 8-K for Exhibit 17s again conditioned on having word rooted on resign within 10 words of words rooted on director. I set that criteria because in my initial search I found documents that were wrongly coded – this is a registrant/filer issue – we act on the code embedded in the filing.

Here is a screenshot of my results – I brought the filing made by Blackboard visible because that was the one I remembered and could not find in the earlier results.

To me this one was particularly interesting for two reasons. First, Blackboard did not actually file an 8-K reporting this resignation. Second the strong language about resisting compensation demands of the CEO was fascinating when I first read this. It is not often that we can get these opportunities to ‘see’ what is going on with the board.

I noted at the top that this was a random advertisement. The facility with which I could grab/identify these filings is the advertisement. If you want unprecedented access to searching EDGAR filings directEDGAR should be your first choice. If you want to accelerate your research please email me (bkealey [some weird symbol] directedgar.com)- our platform is unaffiliated with the SEC.

Verifying Headquarters Location

I have had some messages with a client about some concerns regarding the accuracy of the Business Address that we include in the 10KMETA database. That data is parsed from the header file associated with the 10-K filings that we capture for our full-text search engine. The concern arose because of a recent paper by Jennings et. al, published in Review of Accounting Studies, Measurement error, fixed effects and false positives in accounting research.”

Jennings, Kim, Lee and Taylor reported that in about 2% of the sample they reviewed the address reported in the header file does not match the headquarters location as reported on the face of the 10-K.

They make an reasonable, but untested assumption – the address as reported on the face of the 10-K is the correct address. It seems to me that if in-fact the state location is critical to a study then making an assumption about data accuracy is a bit dicey. But the fact is we have to make assumptions that the data we use is reasonably accurate. But imagine the case where the filer moved their headquarters in the last week of the year and report their new address on the 10-K. If the new address is in a different state than the old address, which address should be used. I wonder if the best solution is to leave out those CIK-YEAR observations when the address changes. I think their results would be more interesting if that analysis was run.

Nonetheless, let’s make an assumption that we can no longer use the BUSINESS ADDRESS state location as reported in the header file (and included in the metadata database). The problem now is to collect the address as reported on the face of the 10-K. That is not a daunting task. Except for those filings associated with multiple registrants. But these are usually excluded from research because they tend to be utilities and thus are usually eliminated by using the right sample selection criteria to identify your sample using say Compustat (or Capital-IQ).

It is not necessary to pull the filings to get at the address. As a matter of fact, using directEDGAR we can get enough of the business address into an Excel file to then extract either the name of the state/country or the zip code.

I ran a search for “address of principal” and (DOCTYPE contains(10K*)) over the universe of 10-K filings from 2005-2015. I selected that search because I wanted to focus on the face of the 10-K and a quick review of some 10-K filings made me believe that most would have the line Address of Principal Executive Office/s. Here is a screenshot of the results of that search:

If you are knowledgeable enough about EDGAR filing frequency etc you will quickly recognize that the 123,784 documents returned in that search is not the complete universe – that is because the rest of the 10-K forms (and their variants) do not use the same cover page. That is not a problem – we can address those by doing a search for those that are missing – I am not going to do so now because I want to stay focused on this task. I will observe that approximately 200 of those that are missing used the language address of principle. Most of the rest used the phrase Address of Registrant’s principal executive offices.

Once my search is complete – I am going to use the Context Extraction feature. First though, I am going to set the parameters to 5 words – so I will pull five words on either side of my search phrase. The results are displayed in the next screenshot:

You can see two patterns (there are others but these represent a significant proportion of the results). The next question is how to parse out the zip code? Frankly, my instinct is to keep working in Excel. The first iteration of our platform relied extensively on macros to accomplish lots of tasks and I am very comfortable using Excel features. Python is a reasonable choice as well. I just believe that with Excel I can get the job done faster.

I copied the context column to the right and I will do my work on this column so I have the original undisturbed if I make a mistake. My first step is going to be to split on the left parentheses using Excel’s Text to Columns Feature.

I am lazy so I am also going to discard everything that was split to the right. I understand that in some cases I need that but I can access that content in a bit – let me just get the first group finished. Once I have split on the left parenthesis I will find the rightmost space. (I could split on spaces but that would be pretty messy). Once I find the rightmost space I will get all of the text from the rightmost space to the end. However, since we split on the left parenthesis we do need to use the trim function to remove the last space as when we split it on the character a space remained. There are a number of ways to do this (find the last space) I used this function – =LOOKUP(2^15,FIND(” “,N2,ROW(INDIRECT(“1:”&LEN(N2))))) – I found it on SuperUser. Once I used that function I then used the RFIND function in Excel to parse the text from the last space to the end. On this first pass through I found zip codes from 91,391 10-Ks. There is more to do – but I have been at this for maybe 30 minutes. In the population I looked at there seem to be 3,875 zip code changes from one filing to the next. That number may overstate the true zip code changes because when I focused on BEMIS CO (CIK 11199) (since they were mentioned in the Jennings et. al. paper) I discovered what is probably a typo in their zip code as you can see in the screenshot.

I want to be careful, I am not trying to dissuade anyone from using Python and crafting some regular expressions. But I would be remiss if I did not point out that starting from the Excel file provides a huge advantage time wise – so having access to directEDGAR – just makes life better! I had a brief exchange about this with Professor Taylor today – he confirmed that they used some scripting focused on the location. I didn’t try too hard to get from him the amount of time it took for them to get this data – but I bet it was easily 10 times the amount of time I would spend.