Friday, July 10, 2009

Newspaper digitization

I'm going through notes from various conferences I attended this spring and came across those from a session at the Society of Ohio Archivists Annual Conference, where members of the Ohio Historical Society talked about newspaper digitization. This past winter they began a two-year newspaper digitization program under the auspices of "Chronicling America". Here are my notes:
  • Newspapers have not had a standard format over the years, which makes them more difficult to digitize.
  • Chronicling America is using a standard set of practices that were outlined by the National Digital Newspaper Program (NDNP).
  • Ohio Historical Society is selecting one newspaper from each of its 10 regions.
  • Difficulties have included copyright on the microfilm as well as some technology concerns.
  • They are doing three levels of quality control.
  • Scanning at 300-400 dpi, grayscale. They are creating a TIFF file (master), then derivative files (PDF and JPEG 2000) as well as OCR'd text.
  • Metadata is being embedded into the files themselves so that the metadata can travel with the files. (As much metadata is embedded as possible.)
  • They are using descriptive, structural, administrative, technical and preservation metadata.
  • Rather than plain OCR, they are doing optical word recognition (OWR), which tries to predict what the word is, not just what the characters are.
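
To make the difference between character-level and word-level recognition a bit more concrete, here is a rough Python sketch. It is only an illustration of the general idea, not the software the project uses: the small word list, the `recognize_word` function, and the 0.8 similarity cutoff are all assumptions of mine. It takes a token as a character recognizer might have produced it and snaps it to the closest known word, leaving it alone when nothing is close enough.

```python
from difflib import get_close_matches

# Hypothetical lexicon for illustration; a real system would use a far larger one.
LEXICON = ["archives", "county", "digitization", "newspaper", "society"]


def recognize_word(ocr_token, lexicon=LEXICON, cutoff=0.8):
    """Return the closest lexicon word to an OCR token, or the token unchanged.

    The cutoff (an assumed value) is the minimum similarity ratio a lexicon
    word must reach before it replaces the raw character-level reading.
    """
    matches = get_close_matches(ocr_token.lower(), lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else ocr_token


# Tokens as a character-level OCR pass might emit them (made-up examples).
raw_tokens = ["newspapcr", "digitizaton", "Ohio"]
corrected = [recognize_word(token) for token in raw_tokens]
print(corrected)
```

Words that are not in the lexicon, such as proper names, pass through untouched, which is one reason a word-level approach depends heavily on the quality of its word list.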
If this topic interests you, the Documents section of the project wiki contains links to both presentations the team did at the SOA Annual Conference.
