Digitization 101: Blog Post: Announcing Tesseract OCR

Tuesday, September 12, 2006

Blog Post: Announcing Tesseract OCR

In August, Google released Tesseract OCR as open source. As the Google Code Blog says:

This particular OCR engine, called Tesseract, was in fact not originally developed at Google! It was developed at Hewlett Packard Laboratories between 1985 and 1995. In 1995 it was one of the top 3 performers at the OCR accuracy contest organized by University of Nevada in Las Vegas. However, shortly thereafter, HP decided to get out of the OCR business and Tesseract has been collecting dust in an HP warehouse ever since. Fortunately some of our esteemed HP colleagues realized a year or two ago that rather than sit on this engine, it would be better for the world if they brought it back to life by open sourcing it, with the help of the Information Science Research Institute at UNLV. UNLV was happy to oblige, but they in turn asked for our help in fixing a few bugs that had crept in since 1995 (ever heard of bit rot?)... We tracked down the most obvious ones and decided a couple of months ago that Tesseract OCR was stable enough to be re-released as open source.

Many projects could use good OCR software. Although Tesseract is not as good as a commercial product (by Google's own admission), it may be very useful to projects that cannot afford commercial software. Tesseract OCR can be downloaded here.

Since the announcement, an update has been released to fix a couple of problems. Judging from the web site, there may be other problems that still need to be fixed. As with other open source products, I'm sure it will be the community of users who supports it. Will Google provide any ongoing support? From what I can see, the answer is "no," so let's hope that a community does form around this product.

Technorati tag: Google
