I had an interesting email from a hard-working PhD student who was using the ContextNormalization feature of our platform to normalize some data. Because they are collecting a piece of data I have not seen used in research before, I am going to describe their problem using AUDITOR TENURE data collection instead. The nature of their problem manifests itself in the same way in almost every ContextNormalization case.
As a result of a PCAOB rule change, auditors are supposed to disclose their tenure with the client. The most common expression of that tends to be “We have served as the Company’s auditor since YYYY.” Below is an image from running a search for “auditor* since” on 2021 10-K filings.

I ran the search with the wildcard (auditor*) because I also want to catch the expression “auditors since”.
I set a really tight span for the context since this is one of those binary cases – it will be concisely expressed or it is not likely to be expressed. Remember – to set the span for Context – use Options/Context and specify the span you need.

Once we’ve done that we are ready to set the parameters for the ContextNormalization. Notice I did not include “auditor” in the Extraction Pattern. This ensures that the processor will not discard those cases where the phrase is “auditors since”. Because the processor is working on the active search results, I have no concern about phrases like “we have been making amazing products since XXXX”. Our search was for “auditor* since”.

This is one of the ‘spelling matters’ issues: if I specify “auditor since”, the engine will not normalize “auditors since”. The word auditor was critical to get the right context, but using it in the Extraction Pattern will reduce the yield because there is no exact match to “auditor since” when the filing uses “auditors since”.
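To make the spelling point concrete, here is a minimal Python sketch. It is not how ContextNormalization works internally; the sample snippets, the two regular expressions, and the helper name extract_year are all assumptions made up for illustration.

```python
import re

# Hypothetical context snippets, similar to what a search for
# "auditor* since" might return.
snippets = [
    "We have served as the Company's auditor since 2004.",
    "We have served as the Company's auditors since 1998.",
]

# Pattern anchored on "since" only: singular or plural no longer matters.
since_year = re.compile(r"\bsince\s+(\d{4})\b", re.IGNORECASE)

# Pattern that also requires the literal phrase "auditor since".
auditor_since_year = re.compile(r"\bauditor\s+since\s+(\d{4})\b", re.IGNORECASE)


def extract_year(pattern, text):
    """Return the captured year as an int, or None when there is no match."""
    match = pattern.search(text)
    return int(match.group(1)) if match else None


loose = [extract_year(since_year, s) for s in snippets]
strict = [extract_year(auditor_since_year, s) for s in snippets]

print(loose)   # both snippets yield a year
print(strict)  # the plural "auditors since" snippet yields None
```

The stricter pattern loses the plural case, which is the yield reduction described above.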
The second spelling issue occurs because of formatting errors or typos. When I sorted the results by the value of tenure, you can see I had some results that didn’t make sense.

Either someone accidentally inserted an extra space as they were typing in the year values, or the underlying HTML has a tag separating parts of the number.
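If you export the contexts and want to repair these cases yourself, something along the following lines might work. This is only a sketch under stated assumptions: the function name year_after_since, the 1800 to 2100 plausibility window, and the idea that a broken year arrives in exactly two pieces are my own choices, not part of the platform.

```python
import re

TAG = re.compile(r"<[^>]+>")  # any HTML tag, e.g. <span> or </b>
WHOLE_YEAR = re.compile(r"\bsince\s+(\d{4})\b", re.IGNORECASE)
SPLIT_YEAR = re.compile(r"\bsince\s+(\d{1,3})\s+(\d{1,3})\b", re.IGNORECASE)


def year_after_since(text, earliest=1800, latest=2100):
    """Return the year following 'since', rejoining a year split in two pieces."""
    # Replace tags with a space so neighbouring words are never glued together.
    cleaned = TAG.sub(" ", text)
    match = WHOLE_YEAR.search(cleaned)
    if match:
        candidate = match.group(1)
    else:
        match = SPLIT_YEAR.search(cleaned)
        if not match:
            return None
        candidate = match.group(1) + match.group(2)
    # Keep only four-digit values inside a plausible (assumed) range.
    if len(candidate) == 4 and earliest <= int(candidate) <= latest:
        return int(candidate)
    return None


print(year_after_since("auditor since 20 07."))
print(year_after_since("auditor since 20<span>07</span>."))
```

Anything the function returns None for still deserves a manual look, since a repair rule this simple cannot tell a typo from a genuinely different sentence.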
Then of course we have these cases:

In the cases shown above, the search correctly identified the context, but there are words intervening between the word since and the value we want for year.
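One way to pick up these cases outside the platform is to tolerate a few words between since and the year. The sketch below assumes a limit of three intervening words and returns those words alongside the year so they can be reviewed; the function name year_with_gap and the limit are hypothetical choices, not platform settings.

```python
import re


def year_with_gap(text, max_words=3):
    """Return (year, intervening_words) or None if no year follows 'since'.

    Up to max_words purely alphabetic words may sit between 'since'
    and the four-digit year.
    """
    pattern = re.compile(
        r"\bsince\s+((?:[A-Za-z]+\s+){0,%d})(\d{4})\b" % max_words,
        re.IGNORECASE,
    )
    match = pattern.search(text)
    if not match:
        return None
    return int(match.group(2)), match.group(1).split()


print(year_with_gap("We have served as the Company's auditor since at least 2005."))
print(year_with_gap("We have served as the Company's auditor since fiscal 2010."))
```

Returning the intervening words matters because “since at least 2005” and “since 2005” are not the same claim about tenure, and you may want to flag the difference in your data.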
We collected the year value from 6,745 documents based strictly on the existence of a valid number following the word since. There were 115 documents with language “since at least YYYY” or “since fiscal YYYY” and other permutations and there were a total of 4 typos.
I keep trying to play around with a Python library called FuzzyWuzzy to improve this yield. While we can make significant improvements for specific use cases, the problem is that I can’t seem to anticipate all use cases in a way that makes me comfortable implementing the library inside one of our functions. However, if you do a ContextExtraction and have some time on your hands I would encourage you to poke at the normalization with that library.
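If you want a feel for the idea before installing anything, the sketch below uses difflib from the Python standard library as a stand-in for FuzzyWuzzy. It ranks leftover contexts by how closely they resemble a template phrase so the most promising ones can be reviewed first. The template, the sample contexts, and the function name similarity are assumptions for illustration, and FuzzyWuzzy's own scoring will differ.

```python
import re
from difflib import SequenceMatcher

# Assumed canonical phrase; YYYY stands in for any four-digit year.
TEMPLATE = "auditor since YYYY"


def similarity(context):
    """Score a context from 0 to 1 against TEMPLATE, ignoring case and the year."""
    masked = re.sub(r"\d{4}", "YYYY", context).lower()
    return SequenceMatcher(None, TEMPLATE.lower(), masked).ratio()


# Hypothetical contexts that a strict extraction pattern left behind.
leftovers = [
    "making amazing products since 1985",
    "auditor since at least 2005",
    "auditors since fiscal 2010",
]

# Most template-like contexts first.
ranked = sorted(leftovers, key=similarity, reverse=True)
for context in ranked:
    print(round(similarity(context), 2), context)
```

A score like this only orders the leftovers for review. Deciding on a cutoff that is safe across all use cases is exactly the part I have not been able to settle.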