Digitization 101: Article: Linux to help the Library of Congress save American history

Tuesday, April 03, 2007

Okay, what makes this article so interesting to me is that it describes the Scribe book-scanning system developed by the Internet Archive. Here are the details:

Combination of hardware and free software
"takes high-quality images of books and then does a set of manipulations, gets them in optical character recognition and compressed, so you can get beautiful, printable versions of the book that are also searchable," according to Brewster Kahle
Works with Linux (Ubuntu); support for Windows has been dropped
Books are held in a V-shaped cradle
Uses two digital cameras
"Free software is used almost every step of the way"
Books are scanned at the Library of Congress and the files are processed at the Internet Archive (ah, the Internet) using a cluster of 1,000 machines
"Image processing for an average book takes about 10 hours on the cluster, and while the project still uses proprietary optical character recognition (OCR) software, Kahle says that many open source applications come into play, including the netpbm utilities and ImageMagick, and the software performs 'a lot of image manipulation, cropping, deskewing, correcting color to normalize it -- [it] does compression, optical character recognition, and packaging into a searchable, downloadable PDF; searchable, downloadable DjVu files; and an on-screen representation we call the Flip Book.'"

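To picture the cropping, deskewing, and color-normalizing steps Kahle lists, here is a minimal sketch built around ImageMagick, one of the tools he names. It is not the Internet Archive's actual pipeline: the helper name, the file paths, and the 40% deskew threshold are my own assumptions. It only builds the command line for one page and prints it, so nothing has to be installed to try it.

```python
from pathlib import Path


def build_cleanup_command(page_image: str, out_dir: str = "cleaned") -> list[str]:
    """Return an ImageMagick command line for one scanned page.

    Hypothetical helper for illustration; not part of Scribe.
    """
    source = Path(page_image)
    target = Path(out_dir) / (source.stem + ".png")
    return [
        "convert",            # ImageMagick 6 command name; ImageMagick 7 uses "magick"
        str(source),
        "-deskew", "40%",     # straighten a slightly rotated page (threshold is an assumption)
        "-trim", "+repage",   # crop away the uniform border, then reset the canvas offset
        "-normalize",         # stretch contrast so pages look more consistent
        str(target),
    ]


if __name__ == "__main__":
    # Print the command rather than running it.
    print(" ".join(build_cleanup_command("scans/page_0001.jpg")))
```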
The Open Content Alliance has 40 members and is currently digitizing 12,000 books per month across five locations.

And what is being digitized at the Library of Congress? Well, it includes "Civil War and genealogical documents, technical and artistic works concerning photography, scores of books, and the 850 titles written, printed, edited, or published by Benjamin Franklin." Very cool!

Thanks to the digitizationblog for finding this article. Mark, you truly made my day!