Showing posts with label Google. Show all posts

Thursday, June 26, 2008

Blog post: U.S. copyright renewal records available for download

This is wonderful news from Google for people researching U.S. copyrights:
How do you find out whether a book was renewed? You have to check the U.S. Copyright Office records. Records from 1978 onward are online (see http://www.copyright.gov/records) but not downloadable in bulk. The Copyright Office hasn't digitized their earlier records, but Carnegie Mellon scanned them as part of their Universal Library Project, and the tireless folks at Project Gutenberg and the Distributed Proofreaders painstakingly corrected the OCR.

Thanks to the efforts of Google software engineer Jarkko Hietaniemi, we've gathered the records from both sources, massaged them a bit for easier parsing, and combined them into a single XML file available for download here.
Commenting on this news, Siva Vaidhyanathan wrote:
This is great news for historians, journalists, researchers, publishers, and librarians. It's also great for the Open Content Alliance and other book digitization projects.

Of course, this does not help much with books published and copyrighted outside of the United States. But that's always a complication.

However, I wonder if Google itself is going to use these records to change the format of many of the scanned books published between 1923 and 1963. Currently, these are only available in "snippet" form. Will Google Book Search change significantly now that this file is available?

That last paragraph is very interesting. Could this allow Google and others to display more, now that they can easily check these records? I hope the answer is "yes."



Wednesday, June 25, 2008

Thomson Reuters to sell Dialog to ProQuest

Most of you will not be interested in this news that broke on June 12 -- Thomson Reuters has agreed to sell Dialog to ProQuest. I wrote a thoughtful blog post on it last week on the Special Libraries Association blog. At the end of that post, I wondered if Google would ever buy Dialog for its content repositories. Roger Summit, the father of Dialog, posted a comment and wondered what would have happened if Dialog had been purchased by Google now instead of ProQuest. ProQuest and Google are going after two different markets. Google is about making information available to as many people as possible. In its letter to subscribers dated June 12, 2008, the executive vice president of Thomson Reuters said:
Together, Dialog and ProQuest will be able to provide the authoritative content and precise search tools essential for the information professional market in the 21st Century.
Later he wrote:
ProQuest intends to invest aggressively in Dialog, refreshing the Dialog and DataStar platforms and meeting the needs of next-generation users.
Here's the problem...the information professionals of today want tools that give them flexible search options, that integrate with other tools (including federated search), and that can be used by a broad group of people with a variety of search skills. Search is not just for professionals anymore; it is something everyone does. For Dialog and DataStar to be successful into the future, they need to retool themselves not for the information professional market, but for the broader market that includes white collar workers, knowledge workers, academics and students...if not even ordinary people. Dialog has the world's largest digital newspaper archive -- wouldn't you want to be able to search that easily, too? (The letter does not qualify their newspaper archive as being the largest of current materials, but I think that qualifier should be placed on it, since there are massive archives of historic newspapers.)



Sunday, May 25, 2008

Likely you've heard the news -- Microsoft ends its book digitization project

It is likely that you have heard this news already, especially if you have searches focused on digitization. Microsoft will no longer be digitizing books and creating a book search engine. (Full announcement below.) As one writer put it, it is conceding to Google.

In its blog post on the topic, Microsoft said (emphasis added):
We have learned a tremendous amount from our experience and believe this decision, while a hard one, can serve as a catalyst for more sustainable strategies. To that end, we intend to provide publishers with digital copies of their scanned books. We are also removing our contractual restrictions placed on the digitized library content and making the scanning equipment available to our digitization partners and libraries to continue digitization programs. We hope that our investments will help increase the discoverability of all the valuable content that resides in the world of books and scholarly publications.
I'm glad to see that the materials that have been digitized will live on. I am sorry, though, to see Microsoft leave this market. Having several companies (or initiatives) involved in book digitization on a large scale -- and thinking about access, etc. -- is beneficial. While one initiative can succeed without competition, having more than one pushes them all to be innovative and market-focused. With one less large-scale book digitization program, is there enough competition to force Google to be innovative in how it meets end-user needs as well as the needs of its partners?

By the way, I find it curious that Microsoft made the announcement on a Friday before a long holiday weekend in the U.S. I'm sure that was on purpose (with the assumption that most people wouldn't see the news right away).


The Washington Post carried the full text of Microsoft's email to its partners announcing its decision. Quoting the email:

Dear Live Search Books Publisher Program Partner,

We are writing today to inform you that we are ending the Live Search Books Publisher Program, including our digitization initiative, and closing the Live Search Books site. We recognize that this is disappointing news to you and to the users of the Live Search Books service. Ending the Live Search Books program is the result of a strategic decision on our part to focus our investments in new vertical search areas where we believe we can more effectively differentiate Live Search.

Given the evolution of the web and our strategy, we believe the next generation of search is about the development of an underlying, sustainable business model for search engines, consumers, and content partners. For example, this past Wednesday, we announced our strategy to focus on verticals with high commercial intent, such as travel, and offer users cash back on their purchases from our advertisers.

With Live Search Books and Live Search Academic, we digitized 750,000 books and indexed 80 million journal articles. Based on our experience, we foresee that the best way for a search engine to make book content available will be by crawling content repositories created by book publishers and libraries. With our investments, the technology to create these repositories is now available at lower costs for those with the commercial interest or public mandate to digitize book content. We will continue to track the evolution of the industry and evaluate future opportunities.

As we wind down Live Search Books we will be reaching out to you in partnership with Ingram Digital Group with information on new marketing and sales opportunities designed to help you derive ongoing benefits from your participation in the Live Search Books Publisher Program. As part of this initiative, we will be making the scan files we created from your print book submissions available to you for free. We will follow-up next week with more information on these offers.

The Live Search Books Publisher Program site (http://publisher.live.com) will be taken down immediately. The Live Search Books site (http://books.live.com) will be taken down next week.

We sincerely appreciate your support and regret any inconvenience that this decision has caused. You can read more about this announcement on The Live Search blog (http://blogs.msdn.com/livesearch).

Sincerely,

The Live Search Books Team

books@microsoft.com




Tuesday, December 04, 2007

Blog by the University Librarian at the University of Michigan

In case you haven't found it, Paul Courant, University Librarian at the University of Michigan, has begun a blog called "Au Courant." Yes, he's talking about Google and answering questions that others have posed about the contract. He provides interesting thoughts on the subject and hopefully is helping to allay fears that working with Google is bad.



Wednesday, November 21, 2007

Yale, Microsoft & Kirtas...and a short rant

Yale University has signed an agreement with Microsoft for the company to digitize 100,000 out-of-copyright books over the next year. University Librarian Alice Prochaska said in an interview that the books Microsoft "scans will be available only on Microsoft’s search engine, the University will receive digital files of all the books that are put online, and the entire digital collection will be linked through the Yale Library Web site and Orbis catalog listings."

The article states that there will be (or is) a non-disclosure agreement, so the financial details will remain unknown. Generally, however, Microsoft and Google subsidize the cost of the digitization, either in its entirety or in part.

And who is actually doing the digitization? Kirtas, the creator/manufacturer of a high-speed automated book scanner. Kirtas has an "in-house service bureau that employs more than 75 image technicians and operates three shifts- has mastered a proprietary digitization process that guarantees an overall error rate lower than one per 10,000 pages, ensuring quality mass digitization that will meet the highest standards and endure the test of time."

[rant] I continue to find these (Google, Microsoft, OCA) projects to be fascinating to watch for a variety of reasons. However, I also find it sad to think of the non-book content that should be digitized that is not. There are many cultural heritage organizations that need to begin to digitize, but that can't find funding to get them started. Yes, they should collaborate, but do they have what other collaborators would want? They have content, but not money and maybe not manpower. I also know of libraries in the U.S. that have not yet automated their catalogues. I know that digitization is different than retrospective conversion, but...well...I guess retrospective conversions aren't sexy at this point. Okay...I'll get off my soapbox. [/rant]



Monday, October 22, 2007

Thank goodness for Google's cache

I wanted to point someone to the 1999 article that Steve Puglia wrote for RLG DigiNews, but the URL no longer works and the article doesn't seem to be on the OCLC web site (remember OCLC and RLG have combined). However, I was able to pull up a cached version of the article in Google. We can argue that the cache version should not exist because it violates copyright...but at the moment, I'm just glad it exists because it provided access to something that I needed.

The lesson -- we can't avoid Google.


Addendum (1:45 p.m.): An anonymous reader pointed me to this RLG DigiNews archive on the OCLC web site and the Puglia article is here. Unfortunately, the old URL doesn't point to the new URL, or even to the archive in general. As a point of trivia...While the old URL ranks #1 in Google, the new URL (or pointers to it) ranks below #60.



Tuesday, October 16, 2007

Article: Google Book Search Libraries and Their Digital Copies

In the April 2007 issue of Searcher magazine, Jill Grogg and Beth Ashmore wrote an excellent article entitled "Google Book Search Libraries and Their Digital Copies." Of course, given how quickly Google has brought on new partners, parts of the article were outdated before it was published, yet it provides wonderful details on the partners that existed when the article was written.

Grogg and Ashmore pointed out that the libraries involved with Google (as of early 2007) had all been involved in digitization programs before Google. Univ. of Michigan (UM), for example, had been digitizing materials since the late 1980s. Prior to Google:
  • UM had digitized "141 text collections with 25 million page images online, plus 3 million pages of encoded text and 89 image collections containing approximately 200,000 images."
  • UM and Cornell had collaborated on the Making of America project that had provided "access to hundreds of volumes of American primary sources from 1850 to 1876."
  • New York Public Library had created the Digital Gallery with more than 520,000 images from its four research libraries.
  • Univ. of Wisconsin-Madison had "made available...close to 2 million pages of content with full range of subjects..."
  • University of California Libraries and the California Digital Library have provided "access to over 170,000 digital images and 50,000 pages of documents about California."
But what are the libraries doing with their digital copies received from Google? The authors wrote that "some library administrators are still weighing option about how to use their library digital copies." The sheer number of library digital copies requires thinking and planning...and perhaps partnering...in order to ensure that access is provided in a way that works now and for the long-term. It could be that organizations such as OCLC will help provide access to these digital copies. The article noted that OCLC was planning a pilot program to link to digitized book titles from WorldCat. It is safe to say that the digitization work will go on for years and that it may take years to figure out how all of these texts will be made available to people not only at the original institutions but elsewhere in the world.

"Google Book Search Libraries and Their Digital Copies" is a long and well-written article. If you are interested in this project, and its issues, I would encourage you to read the full text. There is definitely more in the article than I can quote/discuss here.



Monday, October 01, 2007

Trends mentioned at talks at Pratt Institute, Sept. 29

Last Thursday, I guest lectured at two introduction to library science classes at the Pratt Institute campus in Manhattan. Both classes are taught by Susan DiMattia, a past president of the Special Libraries Association and a friend.

Susan said that many of the things I'm involved with would be of interest, so the agenda I set for my one hour lecture was:
  • Career
  • Focus
  • Trends
I started with a quick outline of my career -- from my start working in a library in fifth grade to starting my own business. Then I talked about the focus of my business and how that focus has changed since 1998, when Hurst Associates was founded. I talked about what competitive intelligence is (which was the original focus of my business). I talked -- of course -- about digitization and the work I did in my corporate life with setting up "scanning" facilities and what I do now. But we all got sidetracked when I spoke about social networking tools, especially Second Life. Libraries and real librarians in a three-dimensional virtual world?! It took many of the students by surprise! I showed both classes the video about the Ohio University campus in Second Life and that helped them understand it a bit more.

I had thought we'd spend a lot of time talking about trends, and in a sense we did -- the trend of librarians reaching out, using new tools, and finding ways of expanding their services. Under the "trends" section of my lecture, I was able to list eight trends for each class, which we were not able to discuss in depth because we ran out of time. These were trends that came to me quickly as I wrote my outline:
  • Library Trends:
    • Library users are pulling information towards them through RSS, search engines and other means. They do not need to ask librarians to help them find information.
    • However, library users need librarians to help them find the best sources and to teach them how to understand the information that they are receiving. For example, is the source reputable? Was the information gathered in a way that makes the information accurate? (Mediation and information literacy)
  • Digitization Trends:
    • Mass digitization is what is capturing our attention (e.g., Google, Microsoft, OCA). There are many mass digitization programs around the world and there will be more of them. The number of resources they can call upon will help to ensure their success. The good news, BTW, is that these programs have made many more people aware of what digitization is and how it can benefit all of us.
    • As we create all of these digital surrogates, we need to be aware of digital preservation. Thankfully there are groups that are learning about digital preservation, and who are creating and implementing standards/guidelines for the rest of us.
    • Institutions are looking beyond their libraries and creating systems that house digitized materials, published/unpublished materials by their employees (e.g., professors), presentations, office documents and more. The number of institutional repositories continues to grow as institutions see them as a way of managing their knowledge.
  • Social Networking Trends:
    • We are becoming hyper-linked as we connect to colleagues through a wide variety of social networking tools. Being linked like this helps us stay "in the know", share information, and find partners/collaborations/opportunities.
    • We can use these tools to collaborate across time and space. Who we can work with (and how effective that working relationship can be) has dramatically increased with these tools. We also know that collaborative efforts can be more successful.
    • Our users will help us create information (crowdsourcing). Many projects (e.g., PictureAustralia) have found ways to successfully use information from users/volunteers to bolster their work. Blogs, wikis and other tools are examples of crowdsourcing.
Are these "the top" trends? Maybe not, but they are what came to mind as I prepared and I'm sure they got the students thinking. It is likely that if I gave the same lecture this week, I would select a different set of trends to discuss.

There were many good questions and lots of notes were taken. The graduate students seemed to appreciate the time I spent with them, and I definitely enjoyed giving them a peek into my world. [Susan, thanks for the opportunity!]

Addendum (9:45 a.m.): I should mention that I pointed the students to this article that I wrote for Information Outlook on Second Life. If they didn't copy down the URL correctly, this post will point them to the article. (Hurst-Wahl, Jill. "Librarians and Second Life." Information Outlook, June 2007, v. 11, n. 6, pp. 44 - 53. Used with permission from SLA.)



Wednesday, September 26, 2007

Article: How we funneled searchers from Google to our collections by catering to Web crawlers

In 2006, Marshall Breeding wrote an article entitled "How we funneled searchers from Google to our collections by catering to Web crawlers." As we know, not all software/databases can easily be crawled by Internet search engines. Some databases require extra ($) components, while you must kluge a solution for others. In Marshall's article, he talks about the solution they implemented for the Vanderbilt Television News Archive.

Question -- If your repository is not automatically being crawled by the Internet search engines, what solutions have you put in place to expose your content so that it is crawled? Please let us know. This is a topic that projects are talking about...and an area where we could all benefit from what others have done.
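
One option that is not taken from Marshall's article, but is a common way to expose database-backed records to crawlers, is to publish a sitemap file that lists a stable URL for every record. Below is a minimal sketch in Python, assuming each record already has a public URL. The `repository.example.org` addresses and the `build_sitemap` function name are hypothetical and used only for illustration.

```python
from xml.etree import ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(record_urls):
    """Return sitemap XML (as a string) with one <url> entry per record URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in record_urls:
        entry = ET.SubElement(urlset, "url")
        # ElementTree escapes characters such as "&" in the URL text.
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical record URLs; a real repository would pull these from its database.
example_urls = [
    f"https://repository.example.org/record/{n}" for n in (1, 2, 3)
]
sitemap_xml = build_sitemap(example_urls)
print(sitemap_xml)
```

The resulting file would typically be saved as sitemap.xml at the site root and referenced from robots.txt. Whether this is enough depends on the repository software; some systems still need crawlable record pages behind those URLs.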



Friday, September 21, 2007

Siva Vaidhyanathan on Google (podcasts)

First Monday has three podcasts where Siva Vaidhyanathan talks about Google and Google Book Search -- problems, concerns, etc. -- in ways that everyone can understand.
  1. Siva's lecture “The Googlization of Everything: Digitization and the Future of Books.” MP3 (1:14)
  2. Siva discusses how the Google Book Project threatens copyright. MP3 | Transcript (12 minutes)
  3. Siva describes his experience on “The Daily Show with Jon Stewart,” talks about what makes Google a success and more. MP3 (15 min.)
If you have only 15 minutes, listen to the second podcast. You will quickly understand the problems he sees with Google Book Search. What really stood out to me was the impact this project could have on the concept of Fair Use in the United States. Could this project result in a stricter interpretation of Fair Use? Could Fair Use be based on the use not on the copy?

Updated 9/22/2007.


Thursday, August 30, 2007

Interview with Brewster Kahle

Published on Aug. 15, this interview with Brewster Kahle contains some great quotes -- all classic Brewster Kahle. For example:

Are you surprised to see libraries signing up with Google under restrictive terms?

I'm not surprised that a corporation wants to be the only place someone can get information, and I was not terribly surprised that some libraries went forward with this before they understood how they could do it on their own and how much it would cost to do it for themselves, not only to do the digitization but also to create services around these collections. I was surprised to see more libraries jumping on the Google bandwagon after demonstrating how libraries can do this and after actually doing it with the Open Content Alliance.

And in talking about how the Open Content Alliance can compete with Google, Kahle said:
Revolutions aren't started by majorities.
He does provide some cost information on having materials digitized by the OCA:
At an OCA regional scanning center, we'll scan your materials for 10¢ a page. Audio recordings we can do for about $10 a disc, and videos about $15 per hour. And we'll do all of the hosting for free; you can do the interfaces.
Definitely an article worth reading.



Wednesday, August 15, 2007

Google Book Search Tips: A University of Michigan University Library Handout

©ollectanea wrote a nice blog post about this five-page handout noting that her eyes glazed over after a while. (And if that happens to a librarian, what will happen when a user reads it?) Looking at the positive aspects of the handout, Georgia Harper wrote:

...the document is really helpful as it shows in detail what features the book search provides, how to use it to best advantage, and if you're at UMich, how to double-check your results against Michigan's catalog, Mirlyn. I want to say right now that I think this is a really good thing. I've heard so many people say things that indicate that there's a lot of misunderstanding about what Google Book Search does and how it works. So clearly, this is needed and kudos to UMich for doing it...

Although this handout was created specifically for UMich, it would be useful to others who are using Google Book Search, which seems to need a lot of explanation for something that seems so simple.


Also posted in the SLA IT Division blog.


Tuesday, August 14, 2007

Article: Inheritance and loss? A brief survey of Google Books

Abstract:
The Google Books Project has drawn a great deal of attention, offering the prospect of the library of the future and rendering many other library and digitizing projects apparently superfluous. To grasp the value of Google’s endeavor, we need among other things, to assess its quality. On such a vast and undocumented project, the task is challenging. In this essay, I attempt an initial assessment in two steps. First, I argue that most quality assurance on the Web is provided either through innovation or through “inheritance.” In the later case, Web sites rely heavily on institutional authority and quality assurance techniques that antedate the Web, assuming that they will carry across unproblematically into the digital world. I suggest that quality assurance in the Google’s Book Search and Google Books Library Project primarily comes through inheritance, drawing on the reputation of the libraries, and before them publishers involved. Then I chose one book to sample the Google’s Project, Lawrence Sterne’s Tristram Shandy. This book proved a difficult challenge for Project Gutenberg, but more surprisingly, it evidently challenged Google’s approach, suggesting that quality is not automatically inherited. In conclusion, I suggest that a strain of romanticism may limit Google’s ability to deal with that very awkward object, the book.
The findings outlined in the entire article are interesting and some are not a surprise. As he wraps things up, the author (Paul Duguid) states what we wish wasn't true:
The Google Books Project is no doubt an important, in many ways invaluable, project. It is also, on the brief evidence given here, a highly problematic one.
There are now 27 libraries that are part of this project. It would be interesting to hear from them either how they are working with Google to improve the quality of what Google is doing or why they feel this quality is acceptable. Perhaps they are looking past these problems and seeing something grander than what we see.



Monday, August 13, 2007

Article: Cornell University Library becomes newest partner in Google Book Search Library Project

In this region, Cornell remains at the forefront with regard to digitization. Last week, Cornell issued a press release with major digitization news: it is joining the Google Book Search project. The press release said:

“In its quest to be the world’s land-grant university, Cornell strives to serve the scholarly and research needs of those beyond the campus. This project advances Cornell’s ability to provide global access to our library resources and to build human capacity across the globe,” said Cornell President David J. Skorton.

Google will digitize up to 500,000 works from Cornell University Library and make them available online using Google Book Search. As a result, materials from the library’s exceptional collections will be easily accessible to students, scholars and people worldwide, supporting the library’s long-standing commitment to make its collections broadly available.

Google will digitize both public domain and copyrighted materials at Cornell. Those materials will be selected in order to complement the other work that Google is doing.

500,000 works is a small part of Cornell's collections, which include "close to 8 million volumes in print and more than 60,000 journals, 300,000 e-books and 39,000 e-journals." Even so, Janet A. McCue, director of Mann Library, said “Having Google index our collections is like having a massive concordance to the information in our books.”

How many libraries is Google now working with? 27.

In 2006, Cornell announced that it would participate in Microsoft's Windows Live Book Search. Does this now mean that Cornell is working with Google and Microsoft at the same time? Undoubtedly Cornell is working on other digitization programs too. It would be interesting to hear how all of this work is (or is not) impacting them. Perhaps due to its size, the impact is minimal.



Friday, July 27, 2007

Digitizing college textbooks for disabled students

Although ebooks have not truly caught on, the ability to create an ebook can be the first step in making that book more accessible to disabled students. In "The Next Textbook? Finding—or Creating—Alternative Instructional Materials for College Students," Robert Martinengo wrote:
The most common use of e-text is the conversion of text to speech. This can be synchronized with the pages of a book, using a program such as Kurzweil 3000. Or the audio can be saved as wav or mp3 files, and played back on inexpensive consumer equipment. The quality of synthetic speech has improved dramatically in the last few years, and the proliferation of MP3 players makes this an attractive option for many students. New software such as eClipseReader and eClipseWriter lets users create their own navigable audio books, with links to page numbers and chapter headings.

Another use of electronic text is conversion to Braille...There are also methods to convert graphics in to raised line drawings.

Another quality of electronic text is its ability to be searched, highlighted, annotated, excerpted, etc. Vendors of assistive technology promote these features as powerful tools to aid students with disabilities with their studies.
As you can see, having the book in digital format allows other technologies to serve it to someone with a sight disability.

A recent article at the Daisy Consortium notes that Google is making its books more accessible.
The very special hidden link that is available from the full view now allows people who use access technology with their computers to read the text. Prior to this change, it was not possible, the views were images, not text. At the National Federation of the Blind's Annual Conference held on July 5, Dr. T.V. Raman, who is himself blind and who works for Google, said, "Consider this to be step zero of many steps that will benefit blind and print-disabled persons throughout the world." Indeed this is a significant step; having hundreds of thousands, perhaps millions of books available to a population that thirsts for information, but which is blocked from using traditional mechanisms for reading, is without precedent and of extreme importance.
The author notes, however, that more needs to be done to make Google Books even more accessible to those who are blind or have a visual impairment.

Digitization is all about access. We tend to think of access to fragile materials or access to materials that are elsewhere in the world. We shouldn't forget that digitization allows people with disabilities to access materials that may have been in their community all along.



Thursday, July 26, 2007

Sivacracy.net

When a blog moves to a new URL, it can be difficult to know where it went. Sivacracy.net moved this month from its home at NYU to its new home at the Institute for the Future of the Book (IF:Book). Its chief blogger, Siva Vaidhyanathan, is now on the faculty of the University of Virginia and has been appointed the first fellow of IF:Book. (Hence the change in URL.) IF:Book noted that:

Siva is one of just a handful of writers to have leveled a consistent and coherent critique of Google's expansionist policies, arguing not from the usual kneejerk copyright conservatism that has dominated the debate but from a broader cultural and historical perspective: what does it mean for one company to control so much of the world's knowledge?

mmm...criticizing Google?!

At any rate, Sivacracy (which is a team blog) often touches on topics that may be of interest to those curious about some of the broader issues that can impact digitization. If you're unfamiliar with the blog, feel free to take a peek at it.



Wednesday, June 20, 2007

Follow-up to "The Google Project continues to grow"

A couple days ago I wrote about the latest partners to the Google book digitization project. Yesterday, Roy Tennant also blogged about this. He noted:
To this point, the only Google partner library to aggressively mount the digitized books in its own repository has been the University of Michigan. Therefore, it surprises no one that the University of Michigan, which had already developed their MBooks platform for its own digitized books, will serve as the central repository for the CIC project.
That was a connection in the story that I had not seen, so thanks, Roy, for pointing it out.

Later he wrote:
This project raises the bar for the other libraries participating in mass digitization projects. Most of the libraries cooperating with Google are making no effort to mount the resulting files themselves. Some may not even be keeping a copy of the files. I think it is disturbing that we don't even know how true that statement might be.
It is disturbing that these libraries are relying so heavily on Google to digitize the materials and make them available. Libraries have gotten burned in the past by companies/vendors that made bold promises and then didn't keep them. I'm not saying that Google won't be around forever, but is their future really guaranteed? And will the digitized materials always be maintained as these libraries hope they will be? I hope someone at every Google partner institution has considered those questions.



Monday, June 18, 2007

The Google Digitization Project continues to grow

In some ways, the larger this project gets, the less newsworthy it becomes. Yes, more institutions have joined the project. Yes, this is wonderful. No, there are no new details about how they are doing it, and nothing is trickling out of this project that will help smaller projects with technology, metadata, processing, etc. (When that occurs, that will be news.)

Here, though, are a few interesting quotes from the Penn State press release about the agreement between Google and the 12-institution consortium called the Committee on Institutional Cooperation (CIC):
"We haven't identified the specific works to be included yet," said Nancy Eaton, dean of the Penn State University Libraries. "However, the aggregation of large collections is more important than any specific title, as it is the 'critical mass' of large collections that will make Google the place for users to go to search first."
And:
As a part of the agreement, the consortium also will create a first-of-its-kind shared digital repository to collectively archive and manage the full content of public domain works digitized by Google that are held across the CIC libraries.
I would think that as Google enters into more agreements, selecting books to digitize could become more of a headache. They must consider what they have already digitized, what is already in the pipeline, what they have promised to digitize for their existing partners, etc. New partners must bring to the table, I suspect, something unique that can easily be identified upfront.

BTW there must be a massive database within Google that tracks all of this stuff. Wouldn't that be interesting to look at?

The second quote notes that the CIC will create a shared digital repository of public domain works digitized by Google that these 12 libraries already hold. Notice that the wording does not say that these books will necessarily be digitized from these 12 institutions, but that they "hold" them. So it could be -- if I read this correctly -- that they will build this repository using books already digitized by Google elsewhere that are in the public domain and that already exist in their collections. That could be quite nice and very valuable to students and faculty.



Tuesday, May 15, 2007

Competing with Google & others

In their final assignment, my students had to decide if a mythical library should digitize part of its holdings or not. What was important to me were the premises they based their decision on and their thought processes. Had they learned enough to argue for or against?

Several decided that they would not digitize any of the old first edition books by famous authors because they felt there was a good chance that Google (or Microsoft or OCA or...) might digitize those books. Why spend money digitizing what you might be able to get through another source? I found that to be a very interesting argument and, I'm sure, there are real libraries with the same idea.

Assuming that the books are in the public domain, and held by institutions that are cooperating with one of the mass digitization programs, then they should be digitized and made available in full-text for anyone to use. So I searched Google Book Search for "Huckleberry Finn," knowing that an early edition should be in the public domain. What did I find? I found several later editions of The Adventures of Huckleberry Finn that are available with a limited preview. I then looked for only books tagged as "full view books" and found ONE edition that is totally available. The edition was published by Plain Label Books (and who they are is a good yet unanswered question). Yes, this is the content of the book, so the content has been preserved, but you get the sense that the layout is not from an early edition; in fact, it looks sterile, as if it has been retyped.

mmm...so using this limited test, I don't see a first edition copy of Huckleberry Finn in Google. Now the question becomes -- as a casual researcher -- do I care that a first edition is not available? No. Someone who is interested in the book -- the artifact -- itself might want to see the typeface, etc., online but likely would go to the institution for a better view. Using Google's search option to search library catalogues, I can easily find those libraries that have early editions of the book, although it would take some time to figure out who had the earliest edition according to this search (which uses WorldCat). And I would think an early edition would be at Elmira College, but none pop out as being located there. (I'm guessing an early edition would be there, because of Mark Twain's association with that area.)

If a library wanted the content and did not care about the edition, then relying on Google (and others) may be possible. However, if that specific edition that the library has is important (perhaps because of notes in the margins), then the library should digitize it. If the library wants to make its materials known without digitizing them, then getting them catalogued in WorldCat would be quite helpful.

Deciding whether or not to digitize books is not a simple "yes" or "no." You need to think about "why" you want to digitize the books and consider what others are doing (so you don't perhaps duplicate effort). I would not only search to see what has been digitized, but I would also search those partner libraries to see what they own. And if possible, I might contact them to ask specifically if they were going to digitize the books that I had in mind.

The decisions are never as simple as we would hope...



Wednesday, April 11, 2007

OCRopus

Google is helping to develop OCRopus. The Google press release about OCRopus is here. The web site describes it as:

...a state-of-the-art document analysis and OCR system, featuring pluggable layout analysis, pluggable character recognition, statistical natural language modeling, and multi-lingual capabilities.

The web site goes on to say:

The OCRopus engine is based on two research projects: a high-performance handwriting recognizer developed in the mid-90's and deployed by the US Census bureau, and novel high-performance layout analysis methods.

OCRopus development is sponsored by Google and is initially intended for high-throughput, high-volume document conversion efforts. We expect that it will also be an excellent OCR system for many other applications.

An alpha release of the product is scheduled for the third quarter of this year, so it looks like our benefiting from this may be a "ways off." However, it is good to see a major company working on this open source product.

