Friday, February 01, 2008

White Paper: Optimizing OCR Accuracy on Older Documents

I received a question about OCR, which led me to find this document revised in 2006 and published by the U.S. Government Printing Office (GPO):
Optimizing OCR Accuracy on Older Documents: A Study of Scan Mode, File Enhancement, and Software Products, by Jon M. Booth and Jeremy Gelb, Revised June 2006 (v.2)
This document is technical and not an overview of OCR, so it is not for everyone. The conclusions, though, are interesting:
  1. Older and discolored documents must be scanned in RGB mode to capture all the image data, and to maximize OCR accuracy.
  2. The character accuracy produced by scanning older documents in RGB mode meets the 99% OCR accuracy requirement (set at GPO’s meeting of the experts), even without applying file enhancement.
  3. No single type of file enhancement, applied individually, improves character recognition rates for OCR.
  4. Specifically, the Downsampling enhancement type does not improve character recognition rates, despite OCR software manufacturers’ claims that 300 dpi is optimal for recognition rates.
In conclusion, the combination of these facts demonstrates that file enhancement is not needed, because the recognition rates are already at an acceptable level and, more importantly, because it does not improve the character recognition rates for OCR.
What is amazing is that they achieved 98-99% character accuracy without -- seemingly -- much fuss.
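
For anyone wondering what a figure like 99% character accuracy means in practice, here is a minimal Python sketch of one common way to score OCR output against a hand-corrected transcription. The paper's own scoring procedure is not described in this post, so the edit-distance approach and the function names (edit_distance, character_accuracy) are my assumptions for illustration, not the GPO method.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # delete ch_a
                            curr[j - 1] + 1,      # insert ch_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_accuracy(ground_truth: str, ocr_output: str) -> float:
    """Assumed definition: 1 - (character errors / ground-truth length),
    floored at zero."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    errors = edit_distance(ground_truth, ocr_output)
    return max(0.0, 1.0 - errors / len(ground_truth))


if __name__ == "__main__":
    truth = "Government Printing Office"
    ocr = "Govemment Printing 0ffice"  # typical confusions: rn -> m, O -> 0
    print(f"{character_accuracy(truth, ocr):.1%}")
```

By this measure, a 99% requirement allows roughly one wrong character in every hundred, which on a dense page of older print still adds up to a fair number of errors.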
