Collecting Real Data – Use all of our tools and keep it focused!

In early June I spent some time with a client helping them think through a strategy for collecting data from 10-K filings. I heard back that the session was productive. I have wanted to recount it, but the client is a PhD student working on a summer paper that they hope to develop into their dissertation, so I have been waiting until I could identify an analogous example that was far enough from the data they were trying to collect. Something came up, so here goes.

Suppose we want to collect the advertising/marketing costs that are reported anywhere in the 10-K.

Step 1 – Identify the Sample to Develop a CIK List

The first question I asked the client was whether they had a CIK list. Their response was that they would match back to CRSP (in their case) after they collected the data. My argument was that the direction should be from the constraint to EDGAR, not the other way around. There are two reasons for this. First, the last step of the data collection is likely going to involve some manual processing. Why manually process data that is not going to be used because the filer is not included in the constraining database? Second, it is unlikely that we can anticipate all of the ways that a filer will express some concept.

This example is about collecting the amounts reported as advertising costs. When I started I was thinking marketing and advertising. I did not think about the fact that some filers might use the phrase promotional costs as a synonym for advertising costs. I discovered this by looking at filings made by firms in my sample for which I did not find any results using advertising costs. Specifically, my sample included Macy’s. I ran a search for “advertising costs” and Macy’s CIK was in the list of missing CIKs. Can you imagine Macy’s not reporting advertising expense or costs? I had to investigate that.

Below is the result of searching Macy’s 10-K for advertising while trying to understand why there were no results for advertising costs from a Macy’s filing in the initial search.

If you can’t make out the search above, it was for advertising and (DOCTYPE contains(10K)) and (CNAME contains(macy*)). So Macy’s reports advertising and promotional costs. Knowing that is helpful.

In summary – by focusing on a specific list of CIKs we can limit the amount of unnecessary work to do the final cleaning of our data AND we can scan filings for those with no data and learn how they make the disclosure. Basically we are putting a really tight fence around our problem to make solving it more direct.
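The fence idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `normalize_cik` and `split_by_sample` are hypothetical, the CIK values are made up, and it assumes you have exported the CIKs from a search result into a plain list. It is not part of the platform.

```python
def normalize_cik(cik):
    """Return a CIK as a 10-digit, zero-padded string so formats compare equal."""
    return str(int(str(cik).strip())).zfill(10)


def split_by_sample(sample_ciks, hit_ciks):
    """Compare the CIKs that produced search hits against the sample.

    Returns three sorted lists:
      found   - sample CIKs with at least one hit
      missing - sample CIKs with no hit (filings worth reading by hand)
      outside - hit CIKs that are not in the sample (work we can skip)
    """
    sample = {normalize_cik(c) for c in sample_ciks}
    hits = {normalize_cik(c) for c in hit_ciks}
    return sorted(sample & hits), sorted(sample - hits), sorted(hits - sample)


# Made-up CIKs, for illustration only.
found, missing, outside = split_by_sample(
    ["1001", 1002, "0000001003"],   # the sample (the constraint)
    ["1002", "9999"],               # CIKs returned by a search
)
print(missing)
```

The `missing` list is the useful one here. Those are the filers, like Macy’s in this example, whose filings are worth opening to see how they word the disclosure.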

Step 2 – Check to See if your Disclosure is in the iXBRL Data

We are trying to provide you a comprehensive platform to let you focus on your research and not get bogged down when the data can be pulled directly in the form you need. I noticed in one of my first searches that this data does sometimes appear either directly in the income statement or in a schedule in the notes. If we have processed the iXBRL data from that filer then it is likely available. Below is a search result from (advertising w/2 (expense or cost)). Note that the search was filtered on CIK. Look at that table in the notes.

Because the data might be available from the normalized iXBRL, I am going to switch to the Query Database tool to see if I can find all rows of data that have the word root advert in the Original Row Label (orig_row_label). I am also going to include any rows where words rooted on advert are in the tag, and I am going to limit the search by CIK. Here is a screenshot of that effort.

That result count represents rows of data, and these are not perfect yet. But we can dump them into Excel (using the Save Results button) and determine their relevance pretty easily. In my case, I ended up with 456 CIKs where the data was relevant for this example. I sorted on the attributes and eliminated the rows that had an instant value, as these are balance sheet items. I also found rows where the label described advertising revenue, or where the name described the receipt of franchise contributions for advertising, along with many other activities that were not relevant to this particular case. While there was a lot of noise, it was easy to clean out because of the ability to filter on the various fields in Excel. Now I can take those CIKs out of my CIK list. I don’t need them any longer. Clearly we can use the same strategy to find examples/disclosures of promotional expense(s) as well as marketing expense(s).
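If you would rather do that first pass in code instead of Excel, here is a minimal sketch. Only `orig_row_label` comes from the tool as described above. The other field names (`cik`, `tag`, `period_type`), the function names, and the sample rows are assumptions for illustration, so adjust them to whatever your saved results actually contain. The `exclude_terms` filter is a crude stand-in for the manual review, not a replacement for it.

```python
def is_candidate(row, exclude_terms=("revenue",)):
    """Keep rows rooted on 'advert' that are not balance sheet (instant) items."""
    label = row["orig_row_label"].lower()
    tag = row["tag"].lower()
    if "advert" not in label and "advert" not in tag:
        return False
    if row["period_type"] == "instant":   # assumed field: instant vs. duration
        return False
    # Drop obviously unrelated rows such as advertising revenue.
    return not any(term in label or term in tag for term in exclude_terms)


def remaining_ciks(cik_list, relevant_rows):
    """Remove CIKs whose data was already found in the iXBRL rows."""
    resolved = {row["cik"] for row in relevant_rows}
    return [cik for cik in cik_list if cik not in resolved]


# Made-up rows, for illustration only.
rows = [
    {"cik": "1001", "orig_row_label": "Advertising expense",
     "tag": "AdvertisingExpense", "period_type": "duration"},
    {"cik": "1002", "orig_row_label": "Prepaid advertising",
     "tag": "PrepaidAdvertising", "period_type": "instant"},
    {"cik": "1003", "orig_row_label": "Advertising revenue",
     "tag": "AdvertisingRevenue", "period_type": "duration"},
    {"cik": "1004", "orig_row_label": "Selling and promotion",
     "tag": "MarketingAndAdvertisingExpense", "period_type": "duration"},
]
relevant = [row for row in rows if is_candidate(row)]
still_to_search = remaining_ciks(["1001", "1002", "1003", "1004", "1005"], relevant)
print(still_to_search)
```

Rows that survive the filter still deserve a look before you trust them, but the CIK list you carry into the next step gets shorter right away.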

Step 3 – Identify those in the Sample that Do Not have the Disclosure

This is actually a really important step and why you should be working with a sample. You have a sample of firms that are relevant but it is not necessarily true that all of them will have the disclosure you are looking for. You can search for those filings made by your sample that do not have any of the expected key words. This will also allow you to scan those filings to see if you need to expand your key word search. I ran a search for (DOCTYPE contains(10K)) and not (marketing or advertising or promotional). Here are the results:

I have each one of those filings immediately available to review. All I know about these is that they are filings made by companies that have some characteristic that caused them to be in my sample, and they do not have any instance of the words marketing, advertising or promotional. It does not mean that there is no relevant information, but my expectations about how the information will be disclosed might need to be refined. Or it might be that there is no relevant disclosure. If after review I determine these filings are not going to have my required disclosure, I can log that and remove these CIKs from my CIK list.
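The same negative search and the log-and-remove bookkeeping can be sketched in Python, assuming you have the filing text on hand keyed by CIK. The function names, the `"no disclosure"` label, and the sample text are all hypothetical; the real search above runs inside the platform.

```python
import re

# Whole-word, case-insensitive match on the three expected key words.
KEYWORDS = re.compile(r"\b(?:marketing|advertising|promotional)\b", re.IGNORECASE)


def filings_without_keywords(filings):
    """Return the CIKs whose filing text contains none of the key words.

    `filings` maps a CIK to the text of that filer's 10-K.
    """
    return sorted(cik for cik, text in filings.items() if not KEYWORDS.search(text))


def drop_reviewed(cik_list, review_log):
    """Remove CIKs that a manual review marked as having no disclosure."""
    no_disclosure = {cik for cik, decision in review_log.items()
                     if decision == "no disclosure"}
    return [cik for cik in cik_list if cik not in no_disclosure]


# Made-up filings, for illustration only.
filings = {
    "1001": "Advertising costs are expensed as incurred.",
    "1002": "We incurred brand awareness spending during the year.",
    "1003": "The Company has no operations.",
}
to_review = filings_without_keywords(filings)

# After reading the filings: 1002 needs a new key word, 1003 has nothing.
review_log = {"1002": "expand key words", "1003": "no disclosure"}
cik_list = drop_reviewed(["1001", "1002", "1003"], review_log)
print(to_review, cik_list)
```

Keeping the decision in a log, rather than just deleting the CIK, means you can later explain why a firm dropped out of the sample.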

Step 4 – Start Doing Really Focused Searches

For the rest of our sample we have to go back and run searches. If you don’t remember – you can search for the presence/proximity of numbers in your search. I am going to start with really narrow searches. I am doing that because I am trying to collect data and I want to make the next step very directed. I am going to also set the context extraction to 20 words.

The search I am running is (advertising w/2 (expense* or cost*)) pre/10 (1~~99) and (DOCTYPE contains(10K)). The wildcards are there to anticipate expense or expenses and cost or costs. Here is a screenshot of the results:

I am not at all disappointed in those results. To me they are perfect because they are so focused. I have highlighted in red the relevant context from the filings of 3 different companies. As a faculty member at a non-PhD university – if I actually wanted this data I have something I can pass to an undergraduate assistant for help normalizing. I could also write some Python code to normalize this. I have lots of options because it is so focused.
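To give a sense of what that Python normalization might look like, here is a minimal sketch that pulls a dollar amount out of an extracted context snippet. It makes several assumptions that are mine, not the platform's: it only handles advertising appearing before expense/cost (the real w/2 operator is not that restrictive), it only accepts amounts prefixed with a dollar sign so that years are not mistaken for values, and the function name and sample snippets are invented.

```python
import re
from decimal import Decimal

# "advertising", up to two intervening words, then expense(s) or cost(s).
PHRASE = re.compile(
    r"\badvertising\s+(?:\w+\s+){0,2}?(?:expense|cost)s?\b", re.IGNORECASE
)
# A dollar amount such as $1,234 or $45.2, optionally followed by a scale word.
AMOUNT = re.compile(
    r"\$\s*(\d+(?:,\d{3})*(?:\.\d+)?)(?:\s+(thousand|million|billion))?",
    re.IGNORECASE,
)
SCALE = {"thousand": 10**3, "million": 10**6, "billion": 10**9}


def extract_amounts(context, window=10):
    """Return dollar amounts found within `window` words after the phrase."""
    results = []
    for phrase in PHRASE.finditer(context):
        tail = context[phrase.end():]
        amount = AMOUNT.search(tail)
        if amount is None:
            continue
        # Count the words between the phrase and the amount.
        if len(tail[:amount.start()].split()) > window:
            continue
        value = Decimal(amount.group(1).replace(",", ""))
        unit = (amount.group(2) or "").lower()
        results.append(value * SCALE.get(unit, 1))
    return results


# Made-up snippets, for illustration only.
print(extract_amounts("Advertising costs were $1,234 million in fiscal 2019."))
print(extract_amounts("Advertising expense for fiscal 2019 was $45.2 million."))
print(extract_amounts("Advertising costs are expensed as incurred."))
```

Because the search results are so focused, a small pattern like this covers a useful share of them, and whatever it misses is a short list that a person can handle.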

This iteration allowed me to identify the disclosure with the data I needed for 907 companies from my sample. I could have increased that count if I had a broader search phrase – but it would have added noise to the output that I would have had to think about how I would filter.

Probably the first way I would broaden the search is to increase the space between my (advertising w/2 (expense* or cost*)) and my numbers (1~~99). Another alternative is to require the numbers before my words: (1~~99) pre/10 (advertising w/2 (expense* or cost*)). Clearly I also need to allow for marketing and promotional. If you keep the search focused, each iteration will yield results that should be relatively easy to convert into the data you need. I certainly could have just run a search for advertising expense or marketing cost, but that would return a huge amount of noise. It is well known that I only have one brain cell. I much prefer to work with very focused results. We certainly could run a search like (promotion* or marketing or advertising) w/20 (expense* or cost*) w/20 (1~~99). The problem with that is there is just going to be more noise. Here is a screenshot of one of the results from such a search:

There is nothing useful in that particular result. From what I could tell – more than 40% of the results from that search were just noise.

If you do run such a search, since you are allowing 60 words between your first and last word, you would probably need to set the span for context extraction to something like 100. I am just guessing, but that is where I would start.

Again, in summary – start with a CIK list. If possible check our iXBRL data. Look for filings from your CIK list that will not have any matches and then do repeated focused searches. My experience is that ultimately this strategy will save you significant time.

I would like to add that I have been working on a project with the XBRL data. If you have a project and believe that for some part of your sample the data might be available from the XBRL, let me know and we can easily pull it for you. I have done that for two people this summer. Right now we are struggling to identify ways to make it more accessible so you can do this on your own, but we are not there yet. I know from the two cases I worked on this summer that this saved our users significant time. The comprehensive data goes back to 2013, but we have it back to 2009 for some filers. Send me an email if that seems likely to help your case.
