Thursday, September 06, 2007

Article: Using “captchas” to digitize books

Captchas are "those strings of distorted characters that websites force you to recognize and type in order to establish that you are a person and not a malevolent computer." Now the pioneer of Captchas has found a way to put all of those -- and us -- to use doing something productive: helping to decipher words that can't be read by OCR (optical character recognition) in old books.

According to the article, Luis von Ahn...
created a tool, called "recaptcha," that pairs an unknown word with a known one. He distorts them both and puts a line through them--standard techniques for creating captchas. A user must decipher both captchas to access a site. The accurate typing of the known word serves the security purpose of captchas and adds a measure of confidence that the unknown word was identified correctly and can be used in place of the OCR's gibberish. Volunteers have begun deploying recaptchas, and the technique has been used to decipher two million words for the Internet Archive's book digitization effort.
The article does not say where this technology is being used. It would be cool to know where. I'd actually be interested in doing them just for fun (and to help out).
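
For the curious, here is a rough sketch in Python of the pairing idea the article describes: check the known word, and only if the user gets it right, count their guess for the unknown word. This is my own illustration, not von Ahn's actual code. The function names, the `VOTES_NEEDED` threshold, and the idea of waiting for several matching answers before accepting a reading are all assumptions; the article only says that a correct known word "adds a measure of confidence."

```python
from collections import Counter, defaultdict

# Hypothetical threshold: how many matching answers we want before
# trusting a reading of an unknown word. The article gives no number.
VOTES_NEEDED = 3

# unknown-word id -> tally of readings from users who passed the known word
votes = defaultdict(Counter)


def normalize(text):
    """Compare answers case-insensitively and ignore surrounding spaces."""
    return text.strip().lower()


def submit(control_answer, typed_control, unknown_id, typed_unknown):
    """Check the known word; if it matches, count the guess for the unknown word.

    Returns True when the user passes the captcha.
    """
    if normalize(typed_control) != normalize(control_answer):
        return False  # failed the security check, so the guess is discarded
    votes[unknown_id][normalize(typed_unknown)] += 1
    return True


def accepted_reading(unknown_id):
    """Return the most common reading once it has enough votes, else None."""
    if not votes[unknown_id]:
        return None
    reading, count = votes[unknown_id].most_common(1)[0]
    return reading if count >= VOTES_NEEDED else None
```

In this sketch a guess from someone who mistypes the known word never reaches the tally, so the unknown word is only ever "voted on" by people who have already shown they can read distorted text.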

Addendum (2 p.m.): Thanks to Kathleen for checking out the web site and adding some info in a comment below. Yes...we can all help with this effort, although there is no plug-in for Blogger yet.

Kathleen said...

The reCAPTCHA site has a list of plugins and an API that folks can install on their own blogs and www-sites to assist in the project. Here's a direct link to the "Resources" page:

OldTasty said...

And the Learn More page actually allows you to play around with the app itself. Have fun!