Monday, July 13, 2009

Problems and resolution with the Paper of Record (Google)

Back in Dec. 2008, I noted that Google had purchased the Paper of Record. At that time, the Paper of Record had 20 million digitized historical newspaper pages. This came a few months after Google announced a newspaper digitization project with ProQuest and Heritage Microfilm (post, post). The April 2009 Google Book Search newsletter said:
Try a search for "Americans walk on moon" on Google News Archive Search, and you'll be able to find and read an original article from a 1969 edition of the Pittsburgh Post-Gazette. Not only will you be able to search these newspapers, you'll also be able to browse through them exactly as they were printed -- photographs, headlines, articles, advertisements and all.
While this alerted (or re-alerted) people to the fact that Google was adding newspaper content, at least one email discussion list had begun talking in January about the acquisition and the effect it was having on research. At some point, the Paper of Record web site was redirected to a Google site. Although that seems minor, researchers from around the world noted that content once available through the Paper of Record was missing from the Google site.

In February, a Google employee said in email (as part of the discussion):
We're currently working on the most effective way to search and browse this valuable content. We're doing our best to find a solution to include as much of the acquired content as possible.

While a lot of this content has been made available through Archive search, we're still refining processes to include incompatible newspaper images in our index. We're also working with certain publishers to acquire the rights to display their content. All of this takes time, and we appreciate your patience. We're constantly making improvements to ensure the best user experience.
Researchers wondered why Google had not left the old site available while it went through this transition. Google's blindness to Paper of Record users made matters worse. Several things happened between February and June, when things seemed to get resolved (article). In his article on the topic, Robert B. Townsend said:
Regrettably, this proves yet again Roy Rosenzweig’s warning to the profession six years ago about the “the fragility of evidence in the digital era.” While it may be beyond our capacity to adjust copyright laws and the behavior of large corporations (however well meaning), as a profession we can and perhaps should develop new habits for working with digital materials—by copying down information when we see it online, and not becoming overly dependent on any one data source or having illusions about its permanence.
In early June, a Google employee provided this information on the content from the Paper of Record:
  • 4.91M articles representing 522 titles obtained from Paper of Record are now live on Google News Archive search. This includes previously live content as well as content added as of this week from Paper of Record, all free of charge. Please note that all articles from these titles may not be comprehensively available, but will otherwise be made available in browse-only mode within 3 months. The full list is here [2].
  • ~0.5M pages representing 381 titles obtained from Paper of Record will be made available in browse-only mode within 3 months, also free of charge. The full title list is here [3]. Many of the images we obtained were of low quality, and we were therefore unable to get quality text after following the OCR process. We are working to put up content from these titles so that they can be browsed.
  • Finally, for these 10 titles here [4], we don't have the rights to display these newspapers. We've reached out to the publishers who hold rights to these papers, but not all want to participate in Google's programs. To access these, you may need to travel to a library if you can't find an online source, or contact the publisher directly.
So, nine months after announcing the acquisition of the Paper of Record (and actually three years after it had secretly acquired the database), Google was finally able to provide the information that users needed. In between, Google frustrated researchers, who wrote blog posts, articles, and letters of protest. Google's inability to be customer-focused left a bad taste in many people's mouths.

I heard today that there is one remaining question: will the Paper of Record (or another site that seems to have access to the same content) make institutional access available to historical and genealogical societies? Evidently societies have inquired about this, but have not received a response. I believe (and please correct me if I'm wrong) that part of the issue is that the Google search interface is not robust enough.

Finally, while some people saw the acquisition as moving Google one step closer to world domination, what it really showed was:
  1. Google can be sneaky in its dealings.
  2. Google doesn't have the users' best interests in mind.
  3. We cannot have illusions about the permanence of any content.
Sadly, every day we all become more reliant on Google. Google, however, is not some government agency that receives public oversight. Google is a large for-profit company. If it becomes the center of all of our universes (whether we like it or not), it will make a profit.

BTW on the gossipy side of things, this blog, Gawker, carries news tidbits about Google that some might find interesting (e.g., which executives are leaving the company like Doug Bowman).

Thanks to Rod Nelson for alerting me to the Paper of Record story. Rod, sorry that it took me so long to dig into it.



Eric Rumsey said...

Jill, Thanks for the good article - typo in last sentence ... "living the company"

Jill Hurst-Wahl said...

Eric, thanks!

Alison said...

Not all of the former PoR content is there. It is not in any of the lists. Went to World Vital Whatever and paid my money to get my data and it is almost unusable there - can't search as you searched PoR - if you browse, it gives you machine-meaningful names for the page numbers, out of order ... you get to guess which one is page 1 if that's what you want ....


Finally, we've just completed another go-around with Google, and Paper of Record will return in its normal form at World Vital Records. Institutional subscribers will access POR in the format most academics and students have grown accustomed to. A press release will follow in the next few days.


Bob Huggins

Anonymous said...

This was a month ago. Where is the press release?

Jill Hurst-Wahl said...

Anonymous, your guess is as good as mine. This is disappointing.

Alison again said...

They've resurrected the main page for Paper of Record, with links to World Vital Records and Google News Archive. But nothing is any different. Got so excited to see the PoR main page again, only to be let down again. I'm still paying for WVR, but it's almost unusable. Don't quite understand why the OCR seems so much worse there than it was on Paper of Record. And I miss the Boolean search ability. And the ability to find Page 1 of each release (which is not really what I want but it would be at least slightly useful).