Wednesday, March 23, 2011

CIL2011: Value through Longevity: File Format Migration Using Open Source Tools

Donna Scheeder, the track moderator, said that one area Lee Rainie (the morning's keynote speaker) missed was that we can add value by preserving information.

Lisa Gregory and Jennifer Ricker - State Library of North Carolina

Tasked with preserving state digital publications forever.

Strategies
  • Emulation
  • Migration - transferring the files to a stable format.
Small library, with a small staff & budget, and almost no IT support.  Have started doing some migration testing.  Focusing on using open source tools (free and well supported).

Currently using Archive-It for web harvesting, CONTENTdm and OCLC's digital archive.

Approach to Migration Testing
  • What file formats do they have?  ~20 different file formats, including older file formats. Mostly text files.  Digitized and born digital files.  Included some corrupted files (they corrupted them).
  • Tools - FFmpeg, Inkscape, PLANETS testbed, XENA, ArcMap/TerraGo.  Not all transformations were successful.  Criteria for tools: free, open source, documented, supported, audit trail/reporting, easy to use, and versatile.
  • Expectations - no visual/auditory loss of content, no loss of metadata, minimal degradation in quality, etc.
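The first step above (finding out which file formats you actually have) can be roughed out with a few lines of Python. This is only a sketch and not the presenters' method: the function name `inventory_formats` is made up for illustration, and it counts files by extension, which is a weak signal. A proper identification tool such as FITS (mentioned later in the session) looks inside the file instead.

```python
from collections import Counter
from pathlib import Path


def inventory_formats(root):
    """Count the files under `root` by lowercased file extension.

    Hypothetical helper for illustration. Extensions can be missing or
    wrong, so treat the result as a first look, not as identification.
    """
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            # Files without an extension are grouped under "(none)".
            counts[path.suffix.lower() or "(none)"] += 1
    return counts


# Example use (the folder name is an assumption):
#   for ext, n in inventory_formats("publications").most_common():
#       print(ext, n)
```

Sorting by count makes it easy to see which formats dominate the collection and which are rare outliers that may need special handling.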
Findings: (more details/tables in their presentation)
  • FFmpeg - not so successful with one file format
  • Inkscape - some font changes, but acceptable
  • PLANETS Testbed - many document file types.  Most worked beautifully.  Word 95 didn't work.  Converting to PDF/A did not work from some specific software packages. 
  • Xena - Many of the same tests that they did with the PLANETS testbed.  Similar results.
  • ArcMap and TerraGo are both proprietary software tools.  Worked.
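For a sense of what a single command-line migration test looks like, here is a minimal sketch that wraps FFmpeg from Python. The helper names (`build_ffmpeg_command`, `migrate`) and the `.flac` target are assumptions chosen for illustration; the session did not say which target formats or options were used. It relies only on basic FFmpeg behavior: `-i` names the input, `-n` refuses to overwrite an existing output, and the output format is inferred from the output file's extension.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(source, target):
    """Build an FFmpeg command line for one conversion (hypothetical helper)."""
    # -n: do not overwrite an existing output file; -i: the input file.
    return ["ffmpeg", "-n", "-i", str(source), str(target)]


def migrate(source, target_suffix=".flac", runner=subprocess.run):
    """Convert `source` to a sibling file with `target_suffix`.

    The default suffix is an assumption for illustration. `runner` can be
    swapped out so the function can be exercised without FFmpeg installed.
    """
    source = Path(source)
    target = source.with_suffix(target_suffix)
    result = runner(build_ffmpeg_command(source, target),
                    capture_output=True, text=True)
    return {
        "source": str(source),
        "target": str(target),
        "ok": result.returncode == 0,
        # FFmpeg writes its log to stderr; keep the tail for the test notes.
        "log": result.stderr[-500:],
    }
```

A zero exit code only says that FFmpeg finished. Whether the result meets the expectations listed earlier (no visual or auditory loss, no loss of metadata) still has to be checked separately.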
Where are we now?
  • File format observations - Challenges that they expected and found
    • Complex and related files - No open source tool that could migrate these and keep the file relationships
    • Had trouble with files that had layers (e.g., Adobe Illustrator)
    • Proprietary formats that are not widely used (e.g., Microsoft Publisher)
  • Surprises
    • Audio-video formats have their own complexities
    • The files are huge
    • Frame rates, compression and codec, oh my! May want to find someone who already knows this stuff, rather than coming up to speed yourself.
    • PDF/A (argh!) -
      • 1A - 1B restrictions plus lower level of performance.  Better accessibility.
      • 1B - self contained, no external references, lower level of compliance, digitized materials, metadata required.  Could achieve 1B compliance with Adobe Acrobat, but not with open source tools.
  • Tools to have
    • FFmpeg
    • FITS
    • FLAC Frontend
    • Ghostscript
    • Inkscape
    • MPEG Streamclip
    • PLANETS Testbed (RIP?)
    • XENA
  • More helpful knowledge
    • Free and open source has downsides
      • "Free" in upfront costs
      • Might be developed by a single person or by hundreds
      • Learning curve can be steep
    • Documentation can be confusing or nonexistent
    • Can you rock the command line?
    • Build in time for stops along the road
      • Tool installation
      • Troubleshooting
      • General Googling for assistance
    • There are still unknowns
      • QA- what should we use / rely on?
      • How can we facilitate batch processing?
      • On the fly or scheduled bulk migration?
      • QA - how much should we do?
      • ARC to WARC?
    • Overcoming challenges to production implementation
      • Usual culprits - staff time, resources, IT restrictions, programming skills
      • Testing Archivematica by Artefactual Systems - OAIS compliant
      • Formal workflow descriptions - striving to be OAIS compliant
      • Tackling at-riskier files
        • Older files
        • Older formats
        • Obsolete formats
        • Databases
        • More work on A/V formats
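On the open QA and batch-processing questions above, one small piece that is easy to automate is an audit record for each migrated file. The sketch below is an assumption about what such a record might hold, not something shown in the session: the names `sha256_of` and `audit_record` are hypothetical. It stores SHA-256 checksums of the source and the migrated file and flags missing or empty outputs for manual review.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit_record(source, migrated):
    """Build one audit-trail entry for a migration (hypothetical layout)."""
    migrated = Path(migrated)
    exists = migrated.is_file()
    return {
        "source": str(source),
        "source_sha256": sha256_of(source),
        "migrated": str(migrated),
        "migrated_sha256": sha256_of(migrated) if exists else None,
        # A missing or zero-byte output always needs a human to look at it.
        "needs_review": (not exists) or migrated.stat().st_size == 0,
    }
```

The checksums only document which files went in and came out, so later changes can be detected. They say nothing about whether the migrated file still looks or sounds like the original, which is the harder QA question raised in the list.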
They don't see emulation as being sustainable.
