We have been working to add some new data to the platform: specifically, the breakdown of the Equity Compensation Plan Information (EQCPI) disclosure. My part has been evaluating the tests we need to identify the cases that should be inspected. As good as our parser is, we still get some bad parses, and we need to be able to systematically identify those and shunt them to a review queue for a person to look at. For context, with Executive or Director Compensation, if the sum of the reported compensation components differs from the reported total by more than $10,000, a human looks to see whether the issue can be addressed (maybe a value is reported as 123.456 and, on closer examination, it appears that it should have been reported as 123,456).
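To make the review rule concrete, here is a minimal sketch of that tolerance check. The function and field names are my own illustration, not our actual code; only the $10,000 threshold comes from the rule described above.

```python
# Hypothetical sketch of the review-queue check: a row is flagged when the
# sum of its reported components differs from the reported total by more
# than the threshold. Names are illustrative, not the production code.
REVIEW_THRESHOLD = 10_000  # dollars

def needs_review(components, reported_total, threshold=REVIEW_THRESHOLD):
    """Return True when the row should be routed to a human reviewer."""
    return abs(sum(components) - reported_total) > threshold

# The 123.456-vs-123,456 case: a misparsed decimal throws the sum off by
# far more than the threshold, so the row is flagged.
needs_review([100_000, 123.456, 50_000], 273_456)   # flagged
needs_review([100_000, 123_456, 50_000], 273_456)   # not flagged
```

The check is deliberately symmetric: it does not care whether the components overshoot or undershoot the total, only that the gap is large enough to be worth a person's time.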
What we are seeing in the EQCPI disclosures are errors whose source we cannot initially be certain of. Here is what I saw in a summary file after I ran the code we are working on to identify potential errors.
This was reported in a proxy filing made by OXFORD INDUSTRIES (CIK: 75288). If I were collecting this data for research, I would be concerned about the reliability of these values, since there is no total for TO_BE_ISSUED. However, this is what was reported (other than the normalization choices we made), as you can see in the next image.
So while this merits review, the captured data accurately represents the disclosure that was made. For internal purposes, though, I need to be more confident that the data was correctly captured. I have decided that we should stop ignoring dashes and dash-like characters and instead replace them with something consistent. Currently, we replace dashes with nothing. I am leaning toward replacing them with something clever like wasDASH. I am also thinking about replacing blanks (the value for total under Number of Securities . . .) with wasBLANK.
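A minimal sketch of what that normalization could look like. The sentinel values wasDASH and wasBLANK are the ones proposed above; the particular set of dash-like Unicode characters is my assumption, since filings use everything from hyphens to em dashes.

```python
import re

# Hypothetical sketch: instead of deleting dash-like characters, replace
# them with an explicit sentinel so downstream checks can distinguish
# "disclosed as a dash" from "cell was blank" from "value was dropped".
# The character class below (hyphen, Unicode hyphens, figure dash,
# en/em/horizontal-bar dashes, minus sign) is an assumption.
DASH_LIKE = re.compile(r"^[\-\u2010\u2011\u2012\u2013\u2014\u2015\u2212]+$")

def normalize_cell(raw):
    """Map a raw table cell to its value or to a sentinel."""
    value = raw.strip() if raw is not None else ""
    if value == "":
        return "wasBLANK"
    if DASH_LIKE.match(value):
        return "wasDASH"
    return value
```

With sentinels in place, a missing TO_BE_ISSUED total reads as wasBLANK rather than silently vanishing, so the review code can tell an intentional non-disclosure apart from a parse failure.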
The advantage of doing so is that we can significantly reduce the number of cases we have to review. If we do this internally, no big deal. However, I was thinking about whether this would benefit you when you are normalizing tabular data. Here is a screenshot of what I am considering.
In some ways this might seem excessive, but I don't know what I don't know. My point is that to get it right, we have to consider that something happened between extraction and dehydration. My lovely wife will quickly tell you that I am not perfect; the point I am trying to make is that some of these HTML source files can be in a shape and structure we have never seen before, so the error could be that one of the assumptions embedded in our code is wrong for this case.
The next time I send out an update, I am going to survey you on whether we should make this part of the normalization, so that you see these values when the tables are rehydrated.


