As a followup to eDSG’s list of the Top Five eDiscovery Technologies to Watch in 2012, today’s blog post is a more detailed overview of BeyondRecognition, the image processing technology that I ranked as the number one eDiscovery technology to watch in 2012.
This past week, I had the opportunity to spend some time with John Martin, founder of BeyondRecognition (“BR”) and a long-time document conversion and litigation technology expert. To put BR’s new technology into perspective, not much has changed in Optical Character Recognition (OCR) for over 30 years. Once you begin to understand what John has created, you will realize that it is much more than just OCR, but OCR is a good place to start the comparison.
OCR software electronically translates scanned images of handwritten, typewritten or printed text into machine-encoded text. This software is used to convert books and documents into electronic files, to computerize record-keeping systems in offices and to publish text onto websites. And, with the accelerating rush to Electronically Stored Information (ESI), it would be easy to think that there just isn’t that much paper to convert. However, there are literally trillions of existing documents that will someday need to be converted with billions more yet to be created. The legal, healthcare, mortgage and government markets are currently the prime offenders for creating more paper. In fact, in the five years to 2012, revenue for the Optical Character-Recognition Software industry is expected to increase at an annualized rate of 1.6% to $386.9 million.
OCR software developers have not substantially upgraded their products for a very long time. State-of-the-art today is not much different than it was 5-10 years ago. That’s why John and BR have the opportunity to disrupt the market with a completely new approach to the challenge of converting non-digital documents to searchable ESI.
BR’s core technology includes image-based (NOT text-based) document clustering, individual glyph (i.e. character) clustering for highly effective cascading text conversion, error correction, and document-type-specific data extraction functionality with accuracy rivaling or exceeding human coding.
Traditional Optical Character Recognition (“OCR”) analyzes each glyph or character in a linear fashion, treating each new glyph as a new issue, and optimizes images for conversion purposes at the page level. By contrast, BR clusters similar glyphs before trying to convert them to textual characters, optimizes the portion of the image around each glyph, and then converts the glyphs to characters using the most complete glyph from each glyph cluster. BR then provides cascading or persistent error correction, in which characters with low-confidence conversion scores are edited in words that failed spell checking. Correcting a single word not only corrects all the instances of that sequence of glyphs but also corrects other words where the same glyph was used, so long as the correction results in a word that is in the spelling dictionary. This cascading effect permits editors to correct hundreds of thousands of words with a single keystroke or mouse click, and the error correction is persistent because future occurrences of that glyph will also be converted correctly.
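To make the cascading idea concrete, here is a minimal Python sketch of the behavior described above. It is not BR’s implementation: the `GlyphStore` class, the integer cluster IDs and the tiny dictionary are all hypothetical, and it assumes each word has already been reduced to a sequence of glyph-cluster IDs. It also assumes that the editor’s corrected word should be trusted and added to the dictionary.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

Word = Tuple[int, ...]  # a word as a sequence of glyph-cluster IDs (hypothetical)


@dataclass
class GlyphStore:
    cluster_to_char: Dict[int, str]   # initial conversion, one character per cluster
    dictionary: Set[str]              # spelling dictionary (lowercase words)
    corrections: Dict[int, str] = field(default_factory=dict)

    def render(self, word: Word) -> str:
        """Convert a word, applying stored corrections only when they yield a dictionary word."""
        base = "".join(self.cluster_to_char[c] for c in word)
        fixed = "".join(self.corrections.get(c, self.cluster_to_char[c]) for c in word)
        if fixed != base and fixed.lower() in self.dictionary:
            return fixed
        return base

    def correct(self, word: Word, correct_text: str) -> None:
        """Record an editor's correction of one word at the glyph-cluster level."""
        if len(word) != len(correct_text):
            raise ValueError("correction must have one character per glyph")
        # Assumption: the editor's word is trusted, so it joins the dictionary.
        self.dictionary.add(correct_text.lower())
        for cluster, char in zip(word, correct_text):
            if self.cluster_to_char[cluster] != char:
                self.corrections[cluster] = char  # persists for future words


# Made-up data: cluster 3 is a degraded "e" that was converted as "c".
store = GlyphStore(
    cluster_to_char={1: "t", 2: "h", 3: "c", 4: "c", 5: "a", 6: "o", 7: "r", 8: "x"},
    dictionary={"the", "other", "cat"},
)
words: List[Word] = [(1, 2, 3), (6, 1, 2, 3, 7), (4, 5, 1), (8, 3)]

before = [store.render(w) for w in words]
store.correct((1, 2, 3), "the")       # one edit by the human editor
after = [store.render(w) for w in words]
```

In this toy run a single correction of “thc” also repairs “othcr”, leaves “cat” alone because it uses a different glyph cluster, and leaves “xc” unchanged because “xe” is not in the dictionary.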
In one example from a mortgage loan file project, correcting the word “thc” with one keystroke fixed 142,121 instances of the word “the” and, through the cascading effect, other words as well, for a total of 149,520 corrected instances of misspelled words.
Here are just a few of the words impacted by the cascading effect in that example:
Following is an example of how BR is able to optimize individual glyphs and essentially reconstitute a page image of old court opinions using the “best” glyph from individual glyph clusters to produce the most accurate text conversion.
Potential to Compete with Off-shore Coding
By clustering like documents based on image similarity and then enabling users to rapidly build data extraction rules for each type or class of record, including location-based rules or rules based on non-textual elements, BR can create data extraction fields or metadata elements for each of those document types. The resulting index of specific types of data elements rivals or exceeds human coding. And, that may be the real disruptive feature of BR as it is going to provide a much more accurate and financially compelling alternative to off-shore coding.
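As a rough illustration of what a location-based extraction rule per document type might look like, here is a small Python sketch. It does not reflect BR’s actual rule format: the `Token` and `ZoneRule` types, the `promissory_note` document type, the zone coordinates and the sample page are all made up, and it assumes documents have already been clustered into types and converted to positioned text tokens.

```python
from typing import Dict, List, NamedTuple


class Token(NamedTuple):
    text: str
    x: float  # left edge, as a fraction of page width
    y: float  # top edge, as a fraction of page height


class ZoneRule(NamedTuple):
    name: str   # metadata field to populate
    x0: float
    y0: float
    x1: float
    y1: float


# Hypothetical rules: one list of page zones per document type (cluster).
RULES_BY_DOC_TYPE: Dict[str, List[ZoneRule]] = {
    "promissory_note": [
        ZoneRule("loan_number", 0.60, 0.05, 0.95, 0.10),
        ZoneRule("borrower", 0.10, 0.20, 0.60, 0.25),
    ],
}


def extract_fields(doc_type: str, tokens: List[Token]) -> Dict[str, str]:
    """Fill each field with the text found inside its zone, in reading order."""
    fields: Dict[str, str] = {}
    for rule in RULES_BY_DOC_TYPE.get(doc_type, []):
        inside = [
            t for t in tokens
            if rule.x0 <= t.x <= rule.x1 and rule.y0 <= t.y <= rule.y1
        ]
        inside.sort(key=lambda t: (t.y, t.x))  # top-to-bottom, then left-to-right
        fields[rule.name] = " ".join(t.text for t in inside)
    return fields


# Made-up page content for illustration only.
page = [
    Token("NOTE", 0.45, 0.02),
    Token("Loan", 0.62, 0.07), Token("No.", 0.68, 0.07), Token("12345", 0.74, 0.07),
    Token("Jane", 0.12, 0.22), Token("Doe", 0.20, 0.22),
]
result = extract_fields("promissory_note", page)
```

Because every document in a cluster shares the same layout, a rule written once for the cluster can be applied to every page in it, which is where the comparison with manual coding comes from.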