Thursday, January 18, 2007

Hurst-Wahl & Dames: Article & podcast

K.M. Dames and I wrote an article for the spring Library Journal netConnect supplement. Our article is entitled "Digitizing 101." Since Kevin and I did several workshops together on digitization, we were asked to write an article version of our workshop. That was a tall order given the word count limit, so the article aims to "detail how to bring collections online, from copyright issues to outsourcing scanning," with a heavy emphasis on material selection and copyright.

Kevin and I were also fortunate to participate in a podcast for Library Journal that discusses the article and other digitization-related topics. Also on the podcast was Lotfi Belkhir, CEO of Kirtas, and the podcast was moderated by Jay Datema, the Technology Editor for Library Journal. One cool thing Jay did was to create a time index for the podcast on the web site, with links to the topics we discussed at those times. For example, when the iPRES conference was mentioned at 45:33, there is a link to the iPRES conference page.

1 comment:

Anonymous said...

It was interesting to hear in the podcast that Cornell has the right to go back to Kirtas and have them redo anything they feel doesn't meet their (high) standards. But what is clear is that both the Microsoft and Google projects rely on "dirty OCR": their PDFs show the page image in front, with a hidden layer of OCR'd text behind it. No human is checking the OCR'd text to see if it has actually been recognized correctly, so searches will still miss words and phrases.

I mention this because for projects where information integrity is considered vital (for example the digitization of land ownership records in the Alsace-Moselle regions of France done by Infotechnique), they are using multiple checks to see that the digitized information is indeed valid - I believe the term for this is double input validation, where each page/book is scanned twice, by different operators, and the results compared. Only if they match does the page get approved for "publishing".
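The comparison step in double input validation, as described above, might be sketched roughly like this. It is only an illustration, not how Infotechnique or any other vendor actually does it: it assumes each operator's pass produces a dictionary of field names to text, and the function names, field names, and sample values are all hypothetical.

```python
from __future__ import annotations


def normalize(value: str) -> str:
    """Collapse runs of whitespace so spacing differences are not flagged."""
    return " ".join(value.split())


def find_mismatches(
    first_pass: dict[str, str], second_pass: dict[str, str]
) -> dict[str, tuple[str | None, str | None]]:
    """Return every field where the two passes disagree.

    A field missing from one pass counts as a disagreement.
    """
    mismatches = {}
    for field in sorted(first_pass.keys() | second_pass.keys()):
        a = first_pass.get(field)
        b = second_pass.get(field)
        if a is None or b is None or normalize(a) != normalize(b):
            mismatches[field] = (a, b)
    return mismatches


def approve_for_publishing(
    first_pass: dict[str, str], second_pass: dict[str, str]
) -> bool:
    """Approve a page only when both passes agree on every field."""
    return not find_mismatches(first_pass, second_pass)


# Hypothetical sample data: two operators capture the same page.
operator_a = {"parcel": "123-B", "owner": "Marie  Dupont"}
operator_b = {"parcel": "128-B", "owner": "Marie Dupont"}

print(find_mismatches(operator_a, operator_b))
print(approve_for_publishing(operator_a, operator_b))
```

Here the page would be held back because the two operators disagree on the parcel number, while the extra space in the owner's name is ignored. A real workflow would send the mismatched fields to a third person to resolve rather than just rejecting the page.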