Basic Extraction Features

We offer a number of ways to extract content from search results. This section focuses on the extractions you can run directly from a search.

SummaryExtraction

Our SummaryExtraction feature serves two key purposes. First, it provides a record of your search results. This is useful when you simply want to identify a set of documents with particular characteristics. For example, you might run a CIK/Date search filter on a collection of 8-K filings to identify the 8-K filings that the companies in your sample filed in a particular window, and then use the results for an event study. The second purpose is to serve as a data collection worksheet. The output of the SummaryExtraction matches the order of the search results, so it is useful when the data collection involves recording some characteristic or data value that you identify while reviewing the results. The SummaryExtraction can be generated in one of two ways, both of which are visible in the image below:

There is a SummaryExtraction button right below the File control on the top menu bar, and there is another listed in the Extraction section of the main menu. These are active when the search results pane of the application has a loaded search. Selecting either one activates a control that allows you to specify whether you want to save All Items or Selected Items. Once you have made your selection and pressed the Save as CSV button, a control opens that allows you to select a folder and specify the name of the CSV file. There is no need to specify an extension; just name the file and the application will save it as a CSV file by default. In the image below my file is being saved as demo_summary.

The SummaryExtraction file includes some fields that should help simplify your work. The first seven columns will be identical no matter which index you search or the characteristics of your search.

Column Heading	Description
CIK	Central Index Key
CNAME	Name of company as reported in the header file the day the filing was pulled from EDGAR
FILENAME	Actual path to the document on our platform - useful for matching and running Python code later on the source document
RDATE	The dissemination date of the particular document as reported in the header file on the day we captured it from EDGAR
CDATE	The CONFORMED date as reported in the header file. For financial reports this is generally the balance sheet date. For 8-K filings this is generally the date of the first underlying reportable event. Historically this has been a good proxy for the annual meeting date for Proxy filings. If there is not a CONFORMED date in the header we use the RDATE
WORDCOUNT	The number of words in this particular document
HITS	The sum of the instances of all of the individual words that met the search criteria.

The next columns report the frequency of the search terms and phrases in your search. The search I ran was (ITEM_4.01 contains(YES)) and (DOCTYPE contains(8k*)). The hit counts are reported in alphabetical order, so my summary reports the frequency of the search terms DOCTYPE and ITEM_4.01. Both are fields, but they are embedded in the search index and the documents were identified based on these terms, so they are reported in the results. After the search terms come the values of the metadata included in the search documents. In the image below I have reorganized and truncated the metadata columns because there are more than 30 additional fields. When 8-K filings are searched, the fields include a column for each of the 30+ ITEM variants. We add all ITEM codes associated with each 8-K filing and then report those back to you in the SummaryExtraction.
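Because the first seven columns are always the same, a SummaryExtraction file is straightforward to load in Python. The sketch below is one possible approach, not something the application provides. It assumes the header row uses the column headings listed above and that RDATE is written as YYYYMMDD; the date format is an assumption, so check your own file and adjust DATE_FORMAT. The function names read_summary and filings_in_window are hypothetical, and the sample rows are made up.

```python
import csv
import io
from datetime import datetime

# Assumption: RDATE and CDATE are written as YYYYMMDD. Check your file and adjust.
DATE_FORMAT = "%Y%m%d"

# The seven columns the documentation says are always present.
FIXED_COLUMNS = ["CIK", "CNAME", "FILENAME", "RDATE", "CDATE", "WORDCOUNT", "HITS"]


def read_summary(handle):
    """Return the rows of a SummaryExtraction CSV as a list of dicts."""
    reader = csv.DictReader(handle)
    missing = [c for c in FIXED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing expected columns: {missing}")
    return list(reader)


def filings_in_window(rows, start, end, date_format=DATE_FORMAT):
    """Keep rows whose RDATE falls between start and end, inclusive."""
    first = datetime.strptime(start, date_format)
    last = datetime.strptime(end, date_format)
    return [
        row for row in rows
        if first <= datetime.strptime(row["RDATE"], date_format) <= last
    ]


# Small in-memory stand-in for a real file; every value here is made up.
sample = io.StringIO(
    "CIK,CNAME,FILENAME,RDATE,CDATE,WORDCOUNT,HITS,DOCTYPE,ITEM_4.01\n"
    "1111,EXAMPLE CO,path-a,20150302,20150227,850,2,1,1\n"
    "2222,SAMPLE INC,path-b,20160715,20160712,430,2,1,1\n"
)
rows = read_summary(sample)
window = filings_in_window(rows, "20150101", "20151231")
print([row["CIK"] for row in window])
```

To use it on a real file, open the file you saved (for example with open("demo_summary.csv", newline="") as handle) and pass the handle to read_summary in place of the in-memory sample.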

ContextExtraction

Our ContextExtraction allows you to extract the text around your search results. For this example I am going to search for

(DOCTYPE contains(10K*))  and( union* w/20 ( employee* w/10 1~~999))

I am trying to find all 10-K filings that mention words rooted on employee within ten words of a number between 1 and 999, where that result is in turn within twenty words of words rooted on union. I picked this search because it is an example of the cases where the variation in disclosure is likely large enough that the only way to collect the data is to review the disclosure rather than write some code to parse the number of unionized employees from the filing. Below is an image of the search results from a Levi’s 10-K filing.
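To make the logic of that search concrete, here is a rough Python approximation of the proximity test. It is only an illustration and not how the search engine evaluates the expression. It assumes distances are counted in simple word tokens and that the union term must fall within twenty tokens of either side of the employee/number pair. All function names are hypothetical.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens and number tokens."""
    return re.findall(r"[a-z]+|\d[\d,]*", text.lower())


def is_small_number(token):
    """True for whole numbers from 1 to 999 (so 12,000 does not count)."""
    digits = token.replace(",", "")
    return digits.isdigit() and 1 <= int(digits) <= 999


def positions(tokens, test):
    """Return the token positions for which test(token) is true."""
    return [i for i, token in enumerate(tokens) if test(token)]


def matches(text, inner=10, outer=20):
    """Approximate: union* w/20 (employee* w/10 1~~999)."""
    tokens = tokenize(text)
    unions = positions(tokens, lambda t: t.startswith("union"))
    employees = positions(tokens, lambda t: t.startswith("employee"))
    numbers = positions(tokens, is_small_number)
    for e in employees:
        for n in numbers:
            if abs(e - n) > inner:
                continue
            # Assumption: the union term may be near either end of the pair.
            if any(min(abs(u - e), abs(u - n)) <= outer for u in unions):
                return True
    return False


print(matches("Approximately 450 of our employees are represented by a labor union."))
print(matches("We had 12,000 employees and none belong to a union."))
```

The first sentence satisfies the test and the second does not, because 12,000 falls outside the 1 to 999 range. Real disclosures vary far more than this, which is why reviewing the results by hand is still the practical route.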

One option for managing the data collection is to use a SummaryExtraction, add a column for the data values, and then iterate through the result list. However, to preserve the results for documentation, and perhaps to work on this data collection while out fishing, you can run a ContextExtraction, which creates a CSV file with all of the metadata as well as your specified context around your search results.

You can specify the context span by first going to the File control on the menu and selecting the Context item at the top. There you can specify the number of words or paragraphs around your search results. It is a little more

The decision about the size of the span is as much art as it is science. Remember that you are outputting this into a CSV file, so you lose a lot of the formatting that makes the text easy to read in the document, and the larger the span, the harder it is to quickly identify your data. Frankly, if I think I need more than one paragraph I will not export the context other than to archive it for later questions. I would argue it is easier to review the results in our application.

We have to do some math after you set the context span and analyze your search. It is a bit complicated. The simple explanation is that if your search terms in one document are separated by more than the span you specify, you will get one row in the output for each instance of your search terms found. If there is overlap, you will get fewer rows because more than one term might be included in the span you set.
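The effect of the span on the number of output rows can be illustrated with a short Python sketch. This is not the application's actual calculation, which the text describes only in outline. It assumes a hit joins the current row when it lies within the span of the first hit in that row, and otherwise starts a new row. The function name group_hits and the hit positions are hypothetical.

```python
def group_hits(hit_positions, span):
    """Group word positions of hits so that each group becomes one output row.

    Assumption: a hit joins the current row when it lies within `span`
    words of the first hit in that row; otherwise it starts a new row.
    """
    rows = []
    for pos in sorted(hit_positions):
        if rows and pos - rows[-1][0] <= span:
            rows[-1].append(pos)
        else:
            rows.append([pos])
    return rows


# Five hits in one document, with a span of 25 words.
print(group_hits([100, 105, 400, 900, 910], span=25))
# → [[100, 105], [400], [900, 910]]
```

Here five hits collapse into three rows because two pairs of hits sit close enough to share a span. With a smaller span, or hits that are farther apart, each hit would get its own row.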

Once you have set the span you can hit the ContextExtraction control on the Extraction menu, which is right below the SummaryExtraction:

After you select ContextExtraction a control very similar to the SummaryExtraction will appear that lets you specify whether to select All Items or Selected Items. Then you select the Save as CSV button to specify the destination for the file.

Since the controls are very similar to those for the SummaryExtraction pictured above, we have not included images of the actual file saving process. However, ContextExtraction is resource intensive and will take longer than the SummaryExtraction because the application has to do the math, find the locations in the documents, pull the context, add the metadata related to the document and search, and then save to the CSV file. It is not uncommon to see the application report Not Responding, as in the following image.

Not Responding simply means that the application cannot process any messages while it works; it will complete the process. Once it has finished, the CSV file you specified will hold the search results in the same order as they are listed in the results section of the application. The fields that are present in the SummaryExtraction will also be in the file, and a new field called CONTEXT will be inserted between the FILENAME field and the RDATE field. In the image below I have highlighted part of the context that was extracted from the same document that was the subject of the image above (the search results from a Levi’s 10-K).
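Since the ContextExtraction file carries the CONTEXT field alongside the usual columns, it can be turned into a review worksheet with a few lines of Python. The sketch below is only a suggestion. It assumes the header row uses the field names described above, and the added column names CANDIDATES and UNION_EMPLOYEES are hypothetical. The candidate list is a review aid that flags standalone numbers from 1 to 999; it does not decide which number is the right one. The sample row is made up.

```python
import csv
import io
import re

# Standalone integers of one to three digits; pieces of larger or decimal
# numbers such as 12,000, 1999 or 3.5 are skipped.
SMALL_NUMBER = re.compile(r"(?<![\d,.])[1-9]\d{0,2}(?!\d|[,.]\d)")


def add_worksheet_columns(rows):
    """Add a candidate list and an empty data column to each context row."""
    for row in rows:
        row["CANDIDATES"] = " ".join(SMALL_NUMBER.findall(row["CONTEXT"]))
        row["UNION_EMPLOYEES"] = ""  # filled in by hand while reviewing
    return rows


# In-memory stand-in for a real ContextExtraction file; values are made up.
sample = io.StringIO(
    "CIK,CNAME,FILENAME,CONTEXT,RDATE,CDATE,WORDCOUNT,HITS\n"
    '1111,EXAMPLE CO,path-a,"Of our 12,000 employees, approximately 450 '
    'are covered by union contracts.",20150302,20141130,52000,3\n'
)
reader = csv.DictReader(sample)
rows = add_worksheet_columns(list(reader))

# Write the worksheet back out with the two extra columns at the end.
output = io.StringIO()
writer = csv.DictWriter(
    output, fieldnames=reader.fieldnames + ["CANDIDATES", "UNION_EMPLOYEES"]
)
writer.writeheader()
writer.writerows(rows)
print(rows[0]["CANDIDATES"])
```

For a real file, replace the in-memory sample and output with files opened using newline="" so the csv module handles line endings correctly.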

DocumentExtraction

The DocumentExtraction feature allows you to create a local copy of the original document for those cases where you want to do more sophisticated work or want to create an index of just those documents in your sample. This option pulls the document in the same format as when it was delivered to the SEC (htm or txt) and saves it to your specified directory. We name the document using the filepath/directory structure as well as the index position of the document in the directory it was pulled from. The control that opens after selecting the DocumentExtraction option requires that you select the destination directory.

The image below was taken after selecting DocumentExtraction and hitting the browse button. There are provisions in the tool for you to create a New Folder. Once you have created and selected your folder hit the Select Folder button.

After hitting the Select Folder button the application focus moves back to the primary DocumentExtraction control. Hit the Okay button and the application will start pulling your files, renaming them and saving a copy in your destination folder.

When the process is complete the application will report back; just hit the OK button to close the notification message.
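Once the documents are in a local folder, you can process them with your own code. The Python sketch below shows one way to walk a destination folder and get a rough word count for each htm or txt file using only the standard library. It is an illustration under stated assumptions: the helper names are hypothetical, the tag stripping is deliberately simple, and the demonstration files a.htm and b.txt are made up and do not follow the application's naming scheme.

```python
import tempfile
from html.parser import HTMLParser
from pathlib import Path


class TextCollector(HTMLParser):
    """Collect the text between tags, ignoring script and style blocks."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def document_text(path):
    """Return the readable text of one extracted htm or txt document."""
    path = Path(path)
    raw = path.read_text(encoding="utf-8", errors="replace")
    if path.suffix.lower() in (".htm", ".html"):
        parser = TextCollector()
        parser.feed(raw)
        parser.close()
        return " ".join(parser.parts)
    return raw


def word_counts(folder):
    """Map each extracted file name to a rough word count."""
    return {
        p.name: len(document_text(p).split())
        for p in sorted(Path(folder).iterdir())
        if p.suffix.lower() in (".htm", ".html", ".txt")
    }


# Demonstration with two made-up files in a temporary folder.
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "a.htm").write_text(
        "<html><body><p>Item 4.01 <b>Changes</b></p></body></html>",
        encoding="utf-8",
    )
    Path(folder, "b.txt").write_text("plain text filing", encoding="utf-8")
    counts = word_counts(folder)
print(counts)
```

Point word_counts at the folder you selected in the DocumentExtraction control to run it on your own sample.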

DocumentExtractionTextOnly

We actually maintain two copies of every document in our archive: the original document that was parsed from the parent SEC filing, and a text-only version that is stored in the related index. When you run a search, the document you review in the application is the version from the index. This allows us to also deliver a plain text version of the full document when needed.
