Back to Basics – Image Files in Search Results

I had an interesting question this morning from a user. They were reviewing some search results and came to this:

Their search returned some documents, and a number of those documents would not display in the viewer. They had two concerns: first, how could they inspect the document, and second, what if they wanted to use the search results from the document?

When we push the search results to you, we are not actually pushing the document; we are pushing a cached version of the document. However, the document itself is almost immediately available for cases like the one above. Just hit the Open Document button. All of the documents are stored in directories below the indexes, and they load fast.

This is an htm file with embedded images that contain text. The text was extracted using OCR and indexed. We do not attempt to create a text version of the document; OCR technology is not there yet. The search was for Organic Growth, and after opening the document above I found the following:

So now, how do we get that text out of the document? The ContextExtraction feature works with the text in the indexer rather than the text from the document, so I set a limited context span, as you can see below:

Setting Context to Extract 5 Words Around Search Phrase

Context Extraction

The OCR processor did not break "paydown" and "Deliver" apart. Image processing is HARD.
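For readers who want a feel for what a context span of five words means, here is a minimal Python sketch of the idea. It is not how directEDGAR's ContextExtraction feature is implemented; the function name `extract_context`, its parameters, and the sample sentence are all made up for illustration.

```python
import re

def extract_context(text, phrase, span=5):
    """Hypothetical helper: return each hit of `phrase` with up to
    `span` words on either side of it."""
    words = text.split()
    target = phrase.lower().split()
    n = len(target)
    # Normalize tokens for matching: lowercase, strip surrounding punctuation.
    normalized = [re.sub(r"^\W+|\W+$", "", w).lower() for w in words]
    hits = []
    for i in range(len(words) - n + 1):
        if normalized[i:i + n] == target:
            start = max(0, i - span)           # do not run off the front
            end = min(len(words), i + n + span)  # or off the end
            hits.append(" ".join(words[start:end]))
    return hits

# Invented sample text, not taken from any filing.
sample = ("The company expects to deliver organic growth in every region "
          "over the next three fiscal years")
print(extract_context(sample, "Organic Growth", span=5))
```

Because the sketch splits on whitespace, two words that an OCR step failed to separate would count as a single token, so they would come through in the extracted context fused together.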

The bottom line is that we have the document and the search results are available. I will admit it is annoying at times to have to go through these steps.

directEDGAR

Search, Extraction & Normalization Engine
