Digitization 101: Will mass digitization projects need to be re-done?

Wednesday, December 27, 2006

Will mass digitization projects need to be re-done?

A colleague told me about a discussion list post by Joseph J. Esposito, president of Portable CEO, in which he sets out requirements for mass digitization projects. According to Esposito, the big-name mass digitization projects (e.g., Google) are not paying attention to four specific requirements:

...the first requirement of such a project is that it adopt an archival approach.
Archives of digital facsimiles are important, but we also need readers' editions...
...scanned and edited material must be placed into a technical environment that enables ongoing annotation and commentary.
The fourth requirement is that mass digitization projects should yield file structures and tools that allow for machine process to work with the content.

You can read the full text of Esposito's remarks here. It would be interesting to hear from those who are "close to" these mass digitization projects about whether or not they agree. If you have a comment, why not leave it here?

BTW I was unfamiliar with Esposito's name (obviously my fault). One ~~A German language~~ blog re-posted Esposito's words and included a short biography ~~(in English)~~.

Thanks to the commenter yesterday (12/27) who corrected the information I had on the Archivalia blog. The blog is not just in German.

3 comments:

Anonymous said...: What vendors provide as "samples" to the companies running the book projects can, and often does, vary from what is generated during actual production. When the samples are done, minutes or hours can be spent perfecting each page, whereas in production you are talking a few seconds of viewage (if any).

At least one vendor charges one rate for "Standard" QC, and another for "Premium" QC. There is more time spent on each image for "Premium".

In the end, extra time spent capturing and processing the image correctly the first time will impact the bottom line. Given the tremendous volumes of books that need to be digitized, I cannot see "Premium" standards being applied.

I wonder if there is a clause in these digitization contracts where the vendor will have to rescan anything that doesn't make the specs (and what happens if the book is already back on the library shelves).; 3:30 PM
kg said...: Archivalia http://archiv.twoday.net is not a German-only weblog. It's a collaborative weblog with an "English corner" (separate RSS feed).; 5:17 PM
Anonymous said...: Joe Esposito's characterization of his desiderata for mass digitization in terms of what additional tasks "must" be done to make the whole enterprise worthwhile seems a bit foolhardy, given the complexity of the issues involved and hard economic realities. Instead of criticizing an enterprise like Google's for 'not doing more,' it might be more useful to think ahead ten years from now when this phase of mass digitization (here and in Europe) will be a fait accompli and consider how we will build on what Google and others have accomplished.

The good news about Google-type mass-digitization, in my view, is that almost nothing these projects accomplish will in any way raise the cost of enhancing or re-doing their output in the future -- in fact just the opposite. Keep in mind that the cost to Google partners is pretty marginal, and most are getting back what are essentially free copies of their scanned materials for other kinds of uses. Google is in reality giving us all an enormous (corporate) gift, one that we can decide to build on, ignore, redo or more likely some combination of the above.

The one area of concern in which present activity might compromise future options is in handling of fragile originals -- an issue not mentioned by Esposito, but one that Google partners are currently addressing by holding back such materials from scanning.

Esposito's specific desiderata are arguable to say the least. Some comments follow:

1) "Archival Approach": Mass digitization by its nature precludes taking an "archival approach". Still, the original materials will continue to exist and can continue to be consulted. Selective archival-quality re-scanning can be done as needs and funding options evolve.

2) "Readers' Editions": The need for "readers' editions" (depending on what Esposito means by the term) has always been addressed -- if at all -- by the scholarly and publishing industry, not libraries, as he notes. Economics, demand and perhaps volunteerism will determine whether such editions become generally available. But insisting on them as part of digitizing, say, 500,000 volumes of old and often outdated (aka "historical") scientific, technical, industrial, mathematical, serial, ephemeral, and commercial tomes -- in addition to the novels and poetry Esposito is probably thinking of -- is quixotic at best.

3) Typography & Orthography: Esposito's assertion that a tender high school student should not have to struggle with "Victorian typography" strikes me as condescending to say the least. Maybe I'm mistaken; does anyone really think that Google's 1862 New York edition of Great Expectations @ http://books.google.com/books?vid=OCLC00177842&id=4VVZdkGyUSMC (Harvard's copy) is too difficult to read?

4) Ongoing Annotation & Commentary: [Sigh] There is certainly not widespread agreement that there is value in the notion that all knowledge needs to be embedded in 'social software.' But it is of course absolutely possible to add wiki-type functionality to each scanned document in the hope of attracting an "intellectually engaged community" with extra time on their hands. I'm sure Google would be happy to add something like this if they thought a) anyone would use it and b) they could get more eyeball-time for their advertising. (And pity the poor student who is someday asked to analyze all the casual, random, anonymous and/or spamniferous verbiage for his or her course in Rezeptionsgeschichte.)

5) Use of standard formats and markup: Here one can only agree. But the devil is in the expensive details. Yes, everything should be discoverable and accessible by web protocols, services, etc. Once found, though, raw (aka dirty), unmarked-up OCR will be the rule for mass digitization -- and even less than that in difficult-to-OCR languages and scripts. Practically speaking, higher standards will have to be applied selectively to specific bodies of material being used for scholarly or pedagogical purposes.

Sorry to run on.

Stephen Paul Davis
Director, Libraries Digital Program
Columbia University; 7:04 PM