At least twice a month I get an email from someone starting a new project looking for some direction. My first questions back to them are almost always about the work they have done to identify their sample and the disclosure requirements that their sample is subject to. These are critical first steps that often get overlooked. Our experience is that spending some time at the early stages of a project addressing these questions can significantly reduce the stress and uncertainty of data collection because it allows you to get away from the nagging question – why did I not find [some data item]?
While I described sample selection and disclosure requirements as separate steps, they are hard to separate from one another because they are so intertwined. One place where they do come apart, though, is coverage in the most common databases used with EDGAR data. For competitive reasons I probably can’t name the two elephants in the room, but most of you know who they are. Their products do not cover all SEC registrants. While their populations may represent more than 98% of the capitalization of the US equity markets, in sheer numbers that is just a subset of all SEC registrants (best guess, less than 1/2).
Thus I think a critical first step is always to use their filtering tools to identify companies in their population that have the data items you need and meet your sample criteria. One important filtering decision is whether you need sample companies that have publicly traded equity. Commercial databases include entities that are SEC registrants but do not have any publicly traded equity, so it is not enough to confirm that an entity has a Central Index Key (CIK). A related complication arises when the commercial database includes data for the entity that has publicly traded equity as well as data for its subsidiaries that have their own filing obligations (usually because of publicly traded debt). One example of this I like to share is Entergy. Here is a shot of the bottom of the landing page for their 2015 FYE 10-K.

While I can only display five, there are seven registrants who simultaneously filed that same 10-K. Only CIK 65984 (ENTERGY CORP /DE/ – the second listed) has publicly traded equity. The other entities have various other securities that require public disclosure of their financial performance but do not have any proxy reporting requirements.
It took a long time to get here, but the point is that without careful filtering users can get frustrated because they will not find compensation or director information for any of the other entities. Some compensation data for subsidiary officers will be listed in the DEF 14A of Entergy Corp simply because their compensation flows to the parent and they are among the five highest compensated officers of the combined entity.
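To make the co-registrant filter concrete, here is a minimal Python sketch. It assumes you can export, for each filing, the list of co-registrants together with some indicator of publicly traded equity. The `Registrant` class, the `has_public_equity` flag, and the placeholder subsidiary rows are all hypothetical; only the Entergy Corp CIK comes from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Registrant:
    cik: int
    name: str
    has_public_equity: bool  # hypothetical flag taken from your database filter


def proxy_candidates(co_registrants):
    """Keep only co-registrants that can be expected to file a proxy (DEF 14A)."""
    return [r for r in co_registrants if r.has_public_equity]


# One combined 10-K with several co-registrants. Only the Entergy Corp CIK is
# from the post; the subsidiary rows are made-up placeholders.
filing = [
    Registrant(65984, "ENTERGY CORP /DE/", True),
    Registrant(9000001, "Placeholder Subsidiary A", False),
    Registrant(9000002, "Placeholder Subsidiary B", False),
]

keep = proxy_candidates(filing)
```

Running the filter before collecting compensation or director data means the subsidiaries never enter the sample, so nobody goes looking for proxy filings that do not exist.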
Another tricky area has to do with regulatory disclosure changes. Until SOX, Director Compensation disclosures were generally about the schedule used to compensate directors rather than precise details about compensation to individual directors. The rules governing Director Compensation can be reviewed here. The rules became effective for fiscal years that ended after 12/15/2006. When trying to build a sample of Director Compensation data, keep in mind that companies like Apple (late-September FYE, 52/53 week FYs) did not provide a compensation table until their proxy filing on January 23, 2008, because their fiscal year ending in September 2006 fell before the effective date. Without investing time in understanding these disclosure requirements before developing their sample, users can expend considerable effort fruitlessly searching proxy and then 10-K filings for data that is not available.
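The effective-date check is easy to apply up front if your dataset carries fiscal year-end dates. The sketch below is one way to do it in Python. The function name is hypothetical, the two late-September dates are illustrative stand-ins for a company with Apple's fiscal calendar, and treating the cutoff date itself as covered is an assumption on my part.

```python
from datetime import date

# Cutoff taken from the post: fiscal years ending after 12/15/2006.
RULE_CUTOFF = date(2006, 12, 15)


def expect_director_comp_table(fiscal_year_end: date) -> bool:
    """True if the fiscal year falls under the newer Director Compensation rules.

    Assumption: a fiscal year ending exactly on the cutoff date is treated as
    covered. Adjust the comparison if your reading of the rule differs.
    """
    return fiscal_year_end >= RULE_CUTOFF


# Illustrative late-September fiscal year ends (not pulled from any filing).
fye_2006 = date(2006, 9, 30)
fye_2007 = date(2007, 9, 29)

print(expect_director_comp_table(fye_2006))  # False
print(expect_director_comp_table(fye_2007))  # True
```

Dropping firm-years where this returns False before you start collecting avoids searching filings for a table that was never required.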
One more example has to do with the filing status of issuers. The SEC has a graduated filing schedule and disclosure requirements based on the filing status of the registrants. Companies that meet the definition of a Smaller Reporting Company have a choice to meet the full requirements of Regulation S-K or scaled disclosure requirements (described here). In total there are 12 differences in the disclosure requirements for companies that qualify for and elect the relief available under these regulations. These scaled disclosure requirements include the opportunity to omit disclosures for Risk Factors (Item 1A) as well as reduce content in many other areas of a filing. The SEC noted in 2008 that approximately 1/2 of 10-K filers would be eligible for this relief.
If you are trying to collect data from an area of the 10-K covered by these scaled disclosure opportunities, your sample is going to be greatly affected by decisions made by the registrants. We know that at least 1/3 of the 10-K filers at any one time do not provide disclosure about their Risk Factors because of the relief offered under the scaled disclosure requirements for smaller reporting companies. Here is a screenshot of a 10-K form from one registrant in 2016 that chose that disclosure regime.

The frustrating part about the scaled disclosure is that companies implement their choices in a number of ways. The above screenshot makes it clear. Other registrants might use the word omitted, which is also clear. Our experience, though, is that more than half will just leave the space blank with no words to indicate the reason for the omission.
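If you have already extracted the text of Item 1A, a rough triage of these cases might look like the Python sketch below. The phrase patterns, the 400-character threshold, and the category labels are assumptions chosen for illustration, not rules taken from any filing, so expect to tune them against your own sample.

```python
import re

# Assumed wording for an explicit statement that the item is not provided.
STATED = re.compile(
    r"\bomitted\b|not\s+(?:required|applicable)|smaller\s+reporting\s+compan",
    re.IGNORECASE,
)
MAX_STUB_CHARS = 400  # assumed length above which the section is real disclosure


def classify_item_1a(text: str) -> str:
    """Sort extracted Item 1A text into a rough disclosure category."""
    body = text.strip()
    if not body:
        return "blank"            # section left empty, no reason given
    if len(body) > MAX_STUB_CHARS:
        return "disclosed"        # long enough to be actual risk factors
    if STATED.search(body):
        return "stated_omission"  # short note saying the item is not provided
    return "review"               # short text we cannot classify; check by hand
```

The "review" bucket is deliberate: short sections that match none of the patterns are better inspected manually than forced into a category.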
There is a lot of detail in this post. The point I am trying to make is simple though – it is really in your best interest to start research that requires the collection of data from EDGAR filings with a careful filtering of your commercial dataset. Apply as many filters as you can so that you do not spend any time looking for (or cleaning) data that is not going to ultimately be used.