Back to Basics – Image Files in Search Results

I had an interesting question this morning from a user – they were reviewing some search results and they came to this:

Image Heavy Document

Their search returned some documents, and a number of them would not display in the viewer. They were concerned about two issues: first, how could they inspect the document, and second, what if they wanted to use the search results from the document?

When we push the search results to you, we are not actually pushing the document – we are pushing a cached version of the document. However, the document itself is almost immediately available for cases like the one above. Just hit the Open Document button. All of the documents are stored in directories below the indexes, and they load fast.

After hitting the Open Document Button

This is an htm file with embedded images that contain text. The text was extracted using OCR and indexed. We do not attempt to create a text version of the document – OCR technology is not there yet. The search was for Organic Growth, and after opening the document above I found the following:

Organic Growth

So now, how do we get that text out of the document? The ContextExtraction feature works with the indexed text (the OCR output) rather than the text from the document, so I set a limited context span as you can see below:

Setting Context to Extract 5 Words Around Search Phrase
Context Extraction

The OCR processor did not put a break between paydown and Deliver, so the two words run together in the extracted context – image processing is HARD.
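For readers who want to see the idea behind a context span, here is a minimal sketch in Python. It is not the ContextExtraction implementation; the function name `extract_context`, the `span` parameter, and the sample text are all made up for illustration. It only shows what "N words around the search phrase" means when you work from extracted text, and why a token the OCR failed to split stays fused in the output.

```python
import re


def extract_context(text, phrase, span=5):
    """Return each occurrence of `phrase` with up to `span` words on either side.

    Hypothetical helper for illustration only; not the product's own code.
    """
    words = text.split()
    # Normalize for matching: lowercase and strip punctuation at the word edges.
    norm = [re.sub(r"^\W+|\W+$", "", w).lower() for w in words]
    target = phrase.lower().split()
    if not target:
        return []
    n = len(target)
    hits = []
    for i in range(len(norm) - n + 1):
        if norm[i:i + n] == target:
            start = max(0, i - span)
            end = min(len(words), i + n + span)
            # Return the original words, so OCR quirks are preserved as-is.
            hits.append(" ".join(words[start:end]))
    return hits


# Made-up sample standing in for OCR output; "paydownDeliver" mimics two
# words that the OCR step ran together.
sample = ("we expect debt paydownDeliver organic growth of three "
          "to five percent over the next fiscal year")

for hit in extract_context(sample, "Organic Growth", span=5):
    print(hit)
```

Because the words come straight from the extracted text, a fused token counts as a single word toward the span and shows up in the result exactly as the OCR produced it.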

The bottom line is that we have the document, and the search results are available – I will admit it is annoying at times to have to go through these steps.
