Wednesday, November 30, 2005

How is Google's digitization quality?

When Google announced its project nearly a year ago, I was anxious to hear how they were going to digitize the materials. I soon realized that confidentiality agreements and the air of secrecy were going to keep me (and you) from learning from this project. I know that we'll learn more about copyright because of Google's work, but wouldn't it be wonderful to learn more about how they are going about this effort? Even just some tidbits?

We can learn a bit from looking at the books that Google has digitized. And what we learn is that the quality is uneven. If you search through the materials, you'll find items where the images are very crisp and clear, and others that are blurry and (perhaps) sloppily done.

For example, if you flip through this book (from 1908 and in the public domain), you'll see a fingernail, book clamps, obscured pages, missing pages (p. 61), and pages that are crooked. And nearly every page is hard to read. Is this an anomaly? No. Look at this book (from 1916 and in the public domain) and you'll see brown pages (p. 22). What's up with that?!

Without signing in, you can see only a few pages of the newer books. Even so, one quickly senses that those pages are clearer and much easier to read. (Look at this example from 2004.) Is Google doing something different with these so that they are scanned better?

BTW, Google will display only snippets of a book when it has not received permission to digitize and display more pages from it. Here's an example of that. Useless, right?!

Of course, Google would say that they want you to find the books online, not read them online. To read the full text, it is hoped that you'll purchase a copy of the book. Fine. But can I purchase a copy of a book published in 1908? Likely I would have to get a copy from my library through interlibrary loan (ILL), if it is available. Even if I have to get a book through ILL, Google has done its job because it has made me aware of a book that I might not have known about otherwise.

So can we overlook the errors and problems because Google is helping us find books? Part of me says "yes", but then I remember that we don't want to be digitizing old books more than once. We want to do it correctly the first time. If these books have to be digitized again to improve the quality of the images, then time and money have been wasted. In addition, the books will have to be handled once more, which I hope is not once too many.

Google needs to do better. The company is leading us down an important path. It needs to do so the right way.

Finally, I found that if you page through a public domain book too quickly, Google senses that, decides that you may be a robot or virus, and stops you. You must then type in a code to continue. (This also occurs if you look at a book more than once.)

Anonymous said...

You might want to take a look at the public domain books scanned by the Internet Archive in conjunction with the Open Content Alliance. A preliminary set of 14 books is available at (best viewed with Firefox). You can view them online and also download PDFs to your computer to read later.

Anonymous said...

Thank you for posting this, Jill! I think it would be cool if such a review could become an ongoing project (perhaps for a group of LIS students) ...