- Newspapers have not had a standard format over the years, which makes them more difficult to digitize.
- Chronicling America follows a standard set of practices outlined by the National Digital Newspaper Program (NDNP).
- The Ohio Historical Society is selecting one newspaper from each of its 10 regions.
- Difficulties have included copyright on the microfilm as well as some technology concerns.
- They are doing three levels of quality control.
- Scanning is done at 300-400 dpi in grayscale. They create a TIFF master file, then derivative files (PDF and JPEG 2000) as well as OCR'd text.
- Metadata is being embedded into the files themselves so that the metadata can travel with the files. (As much metadata is embedded as possible.)
- They are using descriptive, structural, administrative, technical and preservation metadata.
- Rather than plain OCR, they are using optical word recognition (OWR), which tries to predict what the word is, not just what the characters are.
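
To make the word-versus-character distinction concrete, here is a minimal Python sketch of the general idea: take the raw character-level output and snap each token to the closest word in a known word list. This is only an illustration and not the method the project or its vendor actually uses, which the session did not describe. The word list, the 0.75 similarity cutoff, and the function names are all assumptions made up for the example.

```python
from difflib import get_close_matches

# Hypothetical word list; a real system would use a far larger lexicon.
LEXICON = ["newspaper", "digitization", "microfilm", "metadata", "ohio"]


def correct_word(raw, lexicon=LEXICON, cutoff=0.75):
    """Return the lexicon word closest to the raw OCR token.

    Matching is done on the lowercased token. If nothing in the lexicon
    is similar enough (per the assumed cutoff), the raw token is returned
    unchanged.
    """
    matches = get_close_matches(raw.lower(), lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else raw


def correct_line(line):
    """Apply word-level correction to each whitespace-separated token."""
    return " ".join(correct_word(token) for token in line.split())


if __name__ == "__main__":
    # "newspapcr" and "rnicrofilm" mimic typical character-level OCR slips.
    print(correct_line("newspapcr rnicrofilm xyz"))
```

Character-level OCR would leave a misread such as "newspapcr" as it is, while the word-level step uses knowledge of likely words to recover the intended term. Tokens with no close match are passed through untouched rather than guessed at.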