Teaser – 1 Version 4.6

We are finishing testing and working on the documentation for the next version of the ExtractionEngine.  We have added some new features that we hope will help with your research.  I will share these as we finish the testing and the documentation for each of the features.

The first item we have completed work on is that we added the ability for you to extract the text only from your search results.  While there is an increased interest in using text mining tools and performing sentiment analysis one blocker has been that most of the text mining software assumes/requires that the input text be in the form of plain text content.  Since most SEC filings are in html our clients have observed that it is painful to extract the text from a set of documents.

We added a new item to the Extraction Menu DocumentExtraction TextOnly.

text_extraction_1

If you select that option a new UI will start and give you the option to specify the destination directory where you would like the text form of the documents to be saved.  Once you have specified the destination directory and hit the Okay button text only copies of the original documents that were returned from the search will be placed in the specified directory.  Each document will be named in the CIK-RDATE-CDATE-~ form so you have an audit trail back to the source document.  Here is an image of a directory with a collection of filings that have been converted to text.

text_extraction_4

 

Here is a partial image of an 8-K filing in html format:

text_extraction_2

Here is an image of the same document after conversion to text:

text_extraction_3

Since we presume that these txt files will be used as inputs to some additional processing system we have not added line breaks.  The only line breaks in the file are those indicated by the html code so they are a bit painful to read.  However, all of the text is present.

 

 

Leave a Reply