Saturday, March 01, 2008

Backing up your digital images?

A woman at yesterday's workshop (Promotion & Use of Digital Projects) asked a question during break about digital preservation that I would like to toss out to you. Please leave comments to tell us how you have handled this situation. Thanks!

At her institution, they are creating digital assets that are stored on the organization's servers. The IT department backs up the servers nightly, and those backups are stored in a secure location under the correct environmental conditions. However, not all backups are kept indefinitely; the backup tapes are regularly reused. In order to ensure that they have the files necessary for any future migration efforts (or for restoring the production system), I suggested that a backup be made and stored off-site indefinitely (using a company that provides this type of service). In addition, I recommended that they create a new backup yearly and store that one off-site indefinitely as well. My assumption is that each backup would contain the high-quality master files as well as the access files and metadata (or other associated content). My other assumption is that tapes -- stored properly -- should last for extended lengths of time (10 - 30 years) and that having yearly backups would allow the organization to go back to specific snapshots in time.

My answer above is based on my years working in IT (a previous life), but it may not be what large projects are doing (or recommending) currently. Therefore, please let us know: what have you put into practice? What are the holes in my logic?



8 comments:

Anonymous said...

Hi, Jill,

This is a complex issue, but the short answer is that data tapes need to be recopied more often than once every 30 years.

The most common data tape today is LTO, and the LTO consortium publishes a format roadmap:

http://www.lto.org/newsite/html/format_roadmap.html

The plan is that the new drives will be able to read two generations back and write one generation back.

Depending on the precise timing of generations, this limits the physical tape life to approximately ten years if you are being careful.
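To put rough numbers on that (a back-of-the-envelope sketch in Python; the 2.5-year generation cadence is my assumption, not a figure from the roadmap):

    # Rough arithmetic only; the 2.5-year cadence is an assumption, not an LTO figure.
    years_per_generation = 2.5
    readable_generations_back = 2   # per the roadmap: new drives read two generations back
    # A tape written on today's generation stops being readable by brand-new
    # drives once drives are three generations ahead of it:
    years_until_new_drives_cannot_read_it = years_per_generation * (readable_generations_back + 1)
    print(years_until_new_drives_cannot_read_it)   # ~7.5 years; call it ten if you keep an older drive around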

With each generation doubling the capacity of the previous generation, there are economic (space-based) considerations about migrating to the newer versions.

I'm not sure that, with software such as DSpace that verifies the MD5 hashes of each file on a regular basis, you need long-term storage of backups, as long as you protect against willful deletion. Archiving to tape, of course, is a different thing.
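That kind of fixity checking is easy to do even outside DSpace. Here is a minimal sketch in Python, assuming you keep a manifest of expected MD5 values (the manifest shape here is just an illustration):

    import hashlib
    import os

    def md5_of(path, chunk_size=1 << 20):
        """Compute the MD5 of a file without loading it all into memory."""
        digest = hashlib.md5()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(chunk_size), b""):
                digest.update(block)
        return digest.hexdigest()

    def audit(manifest):
        """manifest: dict of {path: expected_md5}. Report anything missing or changed."""
        for path, expected in manifest.items():
            if not os.path.exists(path):
                print("MISSING:", path)
            elif md5_of(path) != expected:
                print("CORRUPT:", path)

Run something like that on a schedule and you find bit rot while you still have good copies to restore from.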

I am currently maintaining about 3000 GB (3TB) of spinning disk storage. The protection plan is that one copy is live, off-site at the end of a fibre optic link.

The disk fail-safe plan is that if the storage module is a RAID array (RAID 5), I have two duplicate storage modules. If the storage module is a stand-alone (unprotected) single disk or disk array, then I have three duplicate storage modules.

With 1 TB storage modules with Gigabit Ethernet interfaces becoming very inexpensive, I'm thinking that this is a good way to go rather than a several-TB RAID NAS array.

As to my avoiding the word "mirror": it is because I do not propagate deletes among the various copies. Rather, when it comes time to delete something, all copies must be deleted individually -- and on the backup storage locations, the users do not have access to do that.
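A minimal sketch of what I mean, in Python (a one-way copy that never removes anything from the backup side; the directory names are just placeholders):

    import os
    import shutil

    def copy_without_deletes(src, dst):
        """Copy new and changed files from src to dst, but never delete
        anything from dst -- deletions on src do not propagate."""
        for root, dirs, files in os.walk(src):
            rel = os.path.relpath(root, src)
            target_dir = os.path.join(dst, rel)
            os.makedirs(target_dir, exist_ok=True)
            for name in files:
                s = os.path.join(root, name)
                d = os.path.join(target_dir, name)
                if (not os.path.exists(d)
                        or os.path.getmtime(s) > os.path.getmtime(d)):
                    shutil.copy2(s, d)

    # e.g. copy_without_deletes("/archive/live", "/backup/copy1")

Tools like rsync can behave the same way, as long as you never turn on delete propagation.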

I'm an audio tape digitization provider, and I'm not attempting to be an archive, but I still need storage for both ongoing client projects and long-term storage for my projects.

There is no longer (and perhaps never really was) a put-it-on-the-shelf-and-forget-about-it audio/video medium.

Ben W. Brumfield said...

I don't have any particular expertise on this issue, but it seems to me that you're conflating backups for migration with backups for catastrophic data recovery.

Were I in her shoes, I'd rely on the IT system for keeping the bits from disappearing, and rely on her own devices to keep the formats readable and migratable. Thus I'd do something akin to "branching" a software version whenever formats change: archive the old data not by moving it offline to a storage facility, but by making a full copy of it on the disks that remain online. The IT department continues to back the data up, but the old formats are always accessible for migration.
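Something like this, as a rough sketch (Python, with made-up paths; the point is the "branch" stays online and keeps getting backed up with everything else):

    import datetime
    import os
    import shutil

    def branch_collection(live_dir, branch_root, label):
        """Freeze a full, online copy of the collection in its current format.
        The frozen copy stays on disk under IT's normal backup regime and is
        always available as source material for the next migration."""
        stamp = datetime.date.today().isoformat()
        dest = os.path.join(branch_root, f"{label}-{stamp}")
        shutil.copytree(live_dir, dest)
        return dest

    # e.g. branch_collection("/data/images", "/data/branches", "tiff-masters")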

Jill Hurst-Wahl said...

Ben, I guess I didn't include one important piece of information. The woman does not believe that the IT department is retaining backups long enough. If a file problem is not found quickly, she fears that an older backup (with the correct files on it) will be unavailable. By retaining some backups long-term, that problem should be mitigated.

I agree that the ideal would be to have "copies" available for data restoration and "copies" that are used specifically for migration. Using online storage (like Richard does) would be wonderful and is what some organizations are able to do. Refreshing files is done by moving files from server to server (automatically). Files are instantly accessible when needed.

Anonymous said...

Hi Jill, All,

I'm a UNIX hacker with a work life around IT and internet services. I'm also an amateur book digitizer, and have worked as a contractor for libraries on internet-based engineering projects in the past.

With that, storage and backups are something I understand in depth and take very seriously.
This Sunday I sat down to write this comment and ended up writing all afternoon, so what follows is a sort of guerrilla guide to digital archive backup. I'll re-post much of this on my blog. I hope the information is useful for someone!

--
The Social Angle:

Reading into this a bit, the woman's problem seems to be a lack of an internal economy, or of direct communication, at her institution that would give her IT department the means to pay proper attention to this problem. That is, it seems to me there's nobody paying the IT department to address this issue proactively with her.

Additionally, I'd like to make the point that the woman's problem seems to be complicated by a lack of clearly defined backup requirements. Giving her IT department base requirements they can meet makes their job clear. Otherwise, the IT department is left in the untenable position of providing a very abstract notion of 'backup' while being expected to magically understand the context.

--
On Content:

Different institutions have different requirements for digital archives; I have never seen any real standard practice in place across organizations. That stated, here are some metrics that make backup policy requirements simpler to recognize (and to communicate to an IT department or vendor):

- How much data
- Estimated rate of data growth
  (growth obviously affects storage size planning)
- Rate of changes to the existing data set
  (changes to the data set affect storage needs)
- General type of content
  (e.g. plain text compresses with great savings;
  JPEGs are already compressed and do not shrink much further)

- Data availability needs
  (speed to recover from a complete disaster)
- Geo-redundancy needs
  (off-site backups are important)

- File state needs
  *most important to actively edited archives*
  + minute/hour/day/week/month/year snapshots

Once the previous requirements are met, the features for file recovery become a simple matter of budget for storage media and for the personnel to manage the ensuing complexity.
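To put rough numbers on the first few metrics, a planning calculation might look something like this (Python; the figures are invented purely for illustration):

    def storage_needed_tb(current_tb, yearly_growth, years, copies=3):
        """Very rough planning figure: current size, compound growth rate
        (0.25 means 25%/year), planning horizon in years, number of copies kept."""
        future_size = current_tb * (1 + yearly_growth) ** years
        return future_size * copies

    # e.g. 2 TB today, growing 25% a year, planned over 5 years, 3 copies kept:
    print(round(storage_needed_tb(2, 0.25, 5), 1), "TB")   # ~18.3 TB

Numbers like that are something an IT department (or a vendor) can actually budget against.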

Even more simply: conceptually draw a line between 'catastrophe backup' and 'undo backup', as they are different problems entirely, and different technologies and methodologies are currently available for each.


--
Against Tape Backup

I'm of the new school with regard to storage media, and have happily abandoned tapes as a means for *any* kind of backup. I feel tape is an antiquated and unreliable technology, which only *feels* solid because humans can actually *see* proof of the raw tape's physical existence.

This is of course just my opinion (from years of experience): tape backups are a dead idea from a previously disk-constrained era, and tape was never an ideal backup medium to begin with!

Put simply: harddrives are cheap, fast, and their capacity (scaling with Moore's law) has now surpassed many application storage needs.

Conceptually, it is important to remember that archives are about long-term storage. With that in mind, the technical details of choosing a medium come down to mitigating the risks of:
- human error
- machine failure

With that, tape is currently the *worst* medium for reducing the risk of failure. Human beings must replace, carry, store, and transport the tapes. Human hands can deposit surprisingly destructive compounds on the tape. When was the last time anyone saw an IT person changing tapes while wearing white gloves? (And if they did, wouldn't the fibers from the white gloves pose a risk to the tape heads?)

Coffee spills. Vans filled with tapes destined for off-site storage facilities sit in sub-freezing or baking outdoor temperatures while the drivers stop for a cup of coffee.

Worth noting: a server room is never to be confused with a clean room, and even the best thriving, active data centers are not ideal environments for tape storage hardware.

The tape machines themselves are complicated mechanical devices. Computers are complicated as well, but they have few moving parts. Solid state components (computer chips and boards) fail *far* less frequently than any mechanical component.

Then there are the tapes themselves: even when storing digital information, magnetic media fades over time (measured in years for tape).
Optical media (CDs/DVDs) seem to be the most archival digital media yet known (an estimated shelf life of 600 years), but their capacity and the time required to read and write them make them less than ideal for many operations, introducing a huge (and literal) surface area for human error in the archiving process.

Last, if you have been using digital tape backup for several years, it's worth checking whether your particular tape device is still being manufactured and available for purchase. If it is, has the model number changed or been upgraded? Is it tested as compatible with your existing tape archive? Do any changes to the hardware require new software drivers, and can you even hook the drive up to a contemporary computer?
If you can't answer these questions with absolute certainty, you should now officially be losing sleep over the issue... However, I've seen many IT people and decision-makers simply bury their heads in the sand when they get to this point.

--
The State of the Art in Harddrives, 2007

Hard drive capacity continues to scale above many storage needs (size scaling with Moore's law): 1000-gigabyte single harddrives, nearly a full terabyte, are available today. My hardware vendor in Midtown tells me he'll likely have 1.5 TB harddrives by late summer, and roughly 2 TB single harddrives by the end of the year (2007).

With that, I now raise my coffee cup toward an Apple 40SC hard disk (40 MB capacity) sitting on the shelf nearby, from 1987 -- 20 years old; it obviously no longer works, but it makes a cool geeky book-end. This 40SC replaced the 20SC, which came out only a year earlier, in 1986.

It has been replaced by drives physically half its size (case and drive) that are 25,600 times larger, and a similar order of magnitude faster, sitting on the same shelf in my living room. The drives today are less than a third the price.

Consider this: the bandwidth of the internet is steadily increasing (a 20 Mbps symmetric FiOS connection will be at my Brooklyn apartment in less than a year). In Japan, a government-sponsored satellite internet service is being launched which will provide 1.2 Gbps connectivity to select locations by summer's end, 2007.

A typical PC laptop's disk transfers data at roughly 15-20 MBps.

(However, in the high-end server market, disk I/O speeds of 200-800 MBps are available, made much faster in practice by memory-caching technologies...)

Solid-state harddrives are coming en masse: no moving parts at all. They are already rolling out in many new laptops, with capacities around 60 GB, and vendors sell drop-in laptop drive replacements if you *just gotta have one*. Not only are there no moving parts to break, but the speed increases are HUGE; it's all solid-state electronics, and storage I/O is no longer bound by the constraints of mechanical physics.
(This is part of why IBM sold its harddrive business to Hitachi several years back; it shifted R&D toward solid-state memory technologies such as MRAM.)

So consider this: at the current rate of speed and size increase in harddrives, based on a 20 MB drive in 1986 and a 1000 GB drive in 2007, single harddrives in the multi-terabyte range should be commercially available by 2010.

A quick note regarding SCSI:
Capacity is outpacing technology, and I'm not a fan of high-end disks and disk-controller hardware. In my experience, that hardware is eclipsed in scale and speed by 'cheap disks' (SATA and the like), and in most circumstances I've encountered, new 'cheap' disk systems can be completely replaced several times over for the cost of the 'high end' disks, amortized over the production life of the hardware.
This is obviously a hot topic of debate in many environments, and a sore spot for many IT departments that have already justified considerable budgets for high-end disks.
Additionally, many IT departments that use the 'cheap disks' do not have the experience on hand to maintain enterprise-class storage systems built from them, so the debate rages on...

The big point: things change, and with data, things change in fantastic ways.

Synchronizing large data archives between physical locations is becoming simpler, faster.


--
Long-Term Storage using an Online Systems strategy

With the above-mentioned changes in technology (2007: tapes bad, harddrives good), an adaptive strategy is the best thing going for digital archiving. This is cheerfully the discourse brought on by the rise of Google (arguably the largest single data repository in history), as well as by projects like the Internet Archive. (Sadly, decades of research on digital archiving funded by the Library of Congress has not proven as successful and wide-reaching with digital media as all this new-blood private enterprise...)

With that, I encourage any organization or individual to focus less on particular technologies for solving digital archive problems and more on timeless methodologies that enable fluid traversal of data across media over time.

Keep it fluid: plan and maintain data archives as though they'll be on a totally different computer system in the near future (different hardware, OS, databases, everything).
This is the key to designing a digital archive that lasts 'forever' in a world where nothing else will.


With that, there are several 'gotchas' which never seem to disappear from filesystem and database designs over the years (and similar problems existed before contemporary filesystems).

- File Name Lengths and Types
Not every filesystem treats file names the same way, and every decade or so new filesystem designs revert to somewhat primitive restrictions in order to accommodate some new technical advancement under the hood.
A *really cool* discussion can be found in this article about filesystem metadata fundamentals:
http://arstechnica.com/reviews/os/metadata.ars/4
And even though that article is very Apple-positive, years after it was written Apple brings the very same problem back yet again:
http://www.jms1.net/osx-case-sensitive-fs.shtml

- Database Index Types
Just like the filesystem naming problem, common database index systems revert to primitive restrictions every decade or so. Any senior/veteran DBA can tell amazing and gruesome stories about the reality of this problem.

- Reliance on Filesystem or Database metadata
Keep it simple, and keep archive metadata in its own archive-specific application or database!!!
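A minimal sketch of what I mean (Python with SQLite; the fields and file paths are just illustrations, not a recommended schema):

    import sqlite3

    # A plain, self-contained metadata store that travels with the files,
    # rather than relying on filesystem attributes or one DBMS's quirks.
    conn = sqlite3.connect("archive-metadata.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS items (
                        file_path TEXT PRIMARY KEY,
                        title     TEXT,
                        creator   TEXT,
                        checksum  TEXT,
                        format    TEXT)""")
    conn.execute("INSERT OR REPLACE INTO items VALUES (?, ?, ?, ?, ?)",
                 ("masters/box01/img0001.tif", "Main Street, 1923",
                  "Unknown", "d41d8cd98f00b204e9800998ecf8427e", "TIFF 6.0"))
    conn.commit()
    conn.close()

When the archive moves to a new system, a table this simple exports cleanly to CSV or XML and comes along for the ride.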


--
On the Ground, 2007 Storage Solutions
(the cutting edge, from my vantage point)

Insomuch as I recommend above that one never be tied to any particular technology when considering digital archives, here are the tools I'm using these days (and that I'm excited about!):

Please note: I'm a UNIX hacker who spends a great deal of time using FreeBSD and OpenBSD, but based on the features I describe here, and on the 30+ year history of BSD UNIX, you'll see why it's relevant to storage systems.

-
Cool Hardware:

Intel-based (cheap) storage servers:
16 SATA harddrives can fit into 3U of rackspace these days (if you can get enough power for this kind of server density!!!). Chenbro/Tyan and Supermicro are great, reliable commodity server manufacturers. General cost looks like this TODAY (this changes weekly!):

Without harddrives, $3,500 will get you a server with:
+ 16 SATA harddrive bays, hot-swap backplane
+ 2x quad-core Intel CPUs (8 logical CPUs)
(multiple CPUs are very important; contemporary operating systems leverage SMP for filesystem index traversal performance)
+ SATA cards vary; depending on your setup needs, expect to spend approximately $1,000
(Areca cards are the only RAID cards I know of on the market which can handle over 2 TB of contiguous logical disk [64-bit LBAs]!!!! I hate to recommend any vendor, but this is a VERY serious issue.)

Add to that:
+ 16x 1000 GB bare harddrives at $330 each: $5,280 total

$3,500 server + $5,280 in harddrives = $8,780

So, ballpark this at around $9k for 16,000 GB (about 14.5 TiB) of fast, raw storage in a single server.

-
Now, getting on to using this much storage can be a challenge. 32-bit LBA (logical block addressing) is a very serious issue these days. It affects everything: RAID cards and hardware disk controllers, filesystems from various operating systems, and the actual software which operates on the files themselves. A surprising amount of software in the world cannot deal with files larger than 2 GB, or with filesystems larger than 2 TB -- from web servers to desktop applications to network server software.
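For anyone wondering where those particular walls come from, the arithmetic is simple (a quick Python illustration):

    # Why "2 GB" and "2 TB" keep showing up as limits:
    max_file_signed_32bit = 2 ** 31              # bytes, when file offsets are signed 32-bit
    max_volume_32bit_lba  = 2 ** 32 * 512        # 2^32 addressable 512-byte sectors
    print(max_file_signed_32bit / 2 ** 30, "GiB")   # 2.0 GiB file-size wall
    print(max_volume_32bit_lba  / 2 ** 40, "TiB")   # 2.0 TiB volume-size wall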

The UFS filesystem.
I'll put an exclamation point on my *BSD bias right here. There's a saying among system administrators who deal with lots of disks: 'Linux has 40 or so filesystem implementations which are not UFS; too bad none of them are stable.'
ReiserFS, ext2, etc... I'm sure there are people out there who have had great experiences with these filesystems, but I've seen nothing but failures from them, from show-stopping implementation shortcomings to design flaws that let junior sysadmins snap sizable organizations' necks. And many of these filesystems simply are not stable. Disk stability is critical to *any* computer above all else -- not just because file content is important, but because those files are also the instructions that run the computer!

The UFS2 filesystem is the natural 64-bit LBA successor to the classic UFS/FFS filesystem, which is arguably the most influential filesystem design since the advent of UNIX in 1969.
UFS2 was first implemented under FreeBSD and, put quite simply, is a benchmark for reliability and performance in filesystem design.

The problem with UFS2 and FFS, historically and now more than ever, is fsck (the file system consistency check). If the disk is interrupted during a write (by losing power or otherwise), the filesystem may be left in an inconsistent (possibly corrupted) state. To repair that, the fsck utility has been a fixture of UFS for 30 years.
The problem with fsck is speed, especially with big disks. Recovering from a power failure or other disk problem requiring a fsck can take hours. For some problems, fsck can run as a background process, keeping the system online but slowing it down. For most problems, however, fsck needs to run at system boot, rendering the computer (and its data) inaccessible for hours.
Recent fsck runs I've had to make, on servers with around 5 TB of storage, have taken about 3 hours on average. Hence the quaint sysadmin colloquialism of 'oh fsck' is on the tip of all our tongues as disks get this big; fsck is leaving the realm of sanity for large disks until a massive leap in disk I/O happens (as it has in the past, giving UFS a long history and a bright future).

Journaling came along to address this, and people like me rejoiced.
I'm not sure which filesystem was first to implement journaling, but my first experience with it was when it hit Mac OS X, over 5 years ago.
Journaling filesystems, as opposed to designs that require fsck or similar mechanisms (soft updates, in UFS's case), keep a journal of recent write operations, effectively recording each write before it is committed. In the event of a power outage (or another event that would corrupt data being written), the journal is 'replayed'. A small performance hit is taken as files are written, since they are effectively written twice (though the journal is itself an optimized subsystem in most implementations, so it's quite fast).
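The idea is easier to see in a toy sketch than in prose (Python; this is a caricature of a journal, not how any real filesystem stores one):

    import json
    import os

    JOURNAL = "journal.log"

    def journaled_write(path, data):
        # 1. Record the intent in the journal and force it to disk first...
        with open(JOURNAL, "a") as j:
            j.write(json.dumps({"path": path, "data": data}) + "\n")
            j.flush()
            os.fsync(j.fileno())
        # 2. ...only then perform the real write.
        with open(path, "w") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())

    def replay_journal():
        """After a crash, re-apply every journaled write; re-applying a write
        that already completed is harmless in this toy example."""
        if not os.path.exists(JOURNAL):
            return
        with open(JOURNAL) as j:
            for line in j:
                entry = json.loads(line)
                with open(entry["path"], "w") as f:
                    f.write(entry["data"])
        os.remove(JOURNAL)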

With that, however, all of these filesystems still suffer from silent data corruption, which can only be found by checking an actual file (or blocks of a file) and, if it is corrupted, replacing the data from a backup.
This is partially where various hardware and software RAID schemes become important, because data can be replicated at a block level, reducing the risk of silent corruption -- but it still doesn't eliminate it.

-
Enter The ZFS filesystem:
(the coolest thing since sliced bread)
http://en.wikipedia.org/wiki/ZFS

As an aside, if you are in NYC next month, a colleague and I are presenting a lecture about the ZFS filesystem to a fairly technical UNIX crowd at the NYC BSD Users group:
http://www.nycbug.org/index.php?NAV=Home;SUBM=10153

Sun has created one of the most interesting filesystem implementations in years, with one of the most practical applications of advanced research in filesystem design since UFS.
I say that because I've had the pleasure of speaking with Kirk McKusick (creator of the original UFS filesystem at the Berkeley CSRG), and when I asked him what he found most interesting in computing lately, he said: 'ZFS is truly exciting, of course!'
(As an aside, that's a *real* hacker -- someone who is actually tickled pink when a chunk of their life's work is threatened with obsolescence by a new technology!)

I've been using ZFS in near-production and development systems on the FreeBSD operating system (7.0-RC); Apple has also been slowly implementing ZFS in Mac OS X as a developmental feature. ZFS has been in the main trunk of Solaris since 2005, and in FreeBSD since early in the 7.0 branch. It's young, but based on its design it represents one of the most mature filesystem designs in the history of computing.

With that, here are a few of the design decisions and features of ZFS which are TREMENDOUSLY appealing to anyone working with large data archives:

Massive Capacity:
- 128-bit filesystem LBA
"If a billion computers each filled a billion individual file systems per second, the time required to reach the limit of the overall system would be almost 1,000 times the estimated age of the universe." -Sun tech propaganda :)
- Maximum contiguous filesystem size: 16 EiB (exbibytes)
- Maximum size of a single file: 16 EiB
- Number of snapshots of any filesystem: 2^64

Dynamic Striping:
- Grow disks essentially without limit from pools of cheap physical drives, in real time.
- A 'pool' of ZFS disk can be created by adding (and removing?!) disks at will, growing and shrinking what the operating system sees as 'disk'. This is VERY flexible, and lends itself to upgrading disk capacity while systems stay online and available.


Filesystem Snapshots (granular backup strategies?):
- 2^64 possible snapshots in any ZFS system
- snapshots can become r/w clones, only clone changes get written to disk

Copy-on-write transactional model (NO FSCK!):
- All block pointers within the filesystem contain a 256-bit checksum of the target block, which is verified when the block is read. This catches (and, with redundancy, can repair) silent data corruption at the block level while the disks are in use!
- Periodic disk scrubbing can be performed to check checksums and verify actual data integrity (not just filesystem integrity!).
- Replaces fsck (and soft updates) and journaling altogether.

Fast:
Comparable in speed to UFS2 in practical use; it relies on host computer memory (get lots of RAM, it's cheap these days!).

RAIDZ and RAIDZ2:
- Blazing-fast RAIDZ, much like RAID 5 in practice (single parity)
- RAIDZ2 is double parity
- Additional redundancy can also be set per volume
(e.g. so that blocks for that volume are written to multiple physical disks, where available)

With that, managing the actual disks for massive storage systems just got WAY CHEAPER AND EASIER.
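The checksum-on-read idea is the part I find most exciting, and it's simple enough to caricature in a few lines (Python; a toy illustration of the concept, emphatically NOT ZFS code -- ZFS keeps the checksum in the parent block pointer rather than next to the data):

    import hashlib

    def write_block(store, block_id, data):
        # Store a checksum alongside every block as it is written.
        store[block_id] = (hashlib.sha256(data).hexdigest(), data)

    def read_block(store, block_id):
        # Refuse to return a block that no longer matches its checksum:
        # silent corruption becomes loud corruption.
        checksum, data = store[block_id]
        if hashlib.sha256(data).hexdigest() != checksum:
            raise IOError("block %d failed its checksum" % block_id)
        return data

    store = {}
    write_block(store, 0, b"archival master, page 1")
    assert read_block(store, 0) == b"archival master, page 1"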

-
For network storage access, I'm terribly fond of the geom gate facility of FreeBSD.
http://www.freebsd.org/doc/en_US.ISO8859-1/books/handbook/geom-ggate.html

It's an open-source answer to expensive, complicated hardware SAN/NAS solutions, and a great replacement for NFS in many applications.

Geom gate is simple to use:
One can export a raw disk device over an Ethernet NIC, so raw disks can easily be shared across a network. Gigabit Ethernet offers more bandwidth than the actual raw I/O of a single SATA disk, and multi-port gigabit NICs are inexpensive (under $400 for really nice ones, last time I checked).

With that, Geom gate can be used to share disks for use with ZFS pool servers, or, Geom Gate can be used to export disk mounts FROM ZFS pools.

As you can see from these technical notes, as a big storage user (for work and personal uses), I'm terribly fond of FreeBSD.

--
Jill: your blog is fantastic, keep up the great work!

Jill Hurst-Wahl said...

.ike

Thanks for the compliment and THANKS for the long and detailed comment! I think you've given many people food for thought.

Anonymous said...

As I am the woman mentioned in the post, I just would like to make clear that my quandary has nothing to do with any lack of confidence in or respect for the IT department. On the contrary, IT here is nothing but helpful and cooperative. Digital preservation is simply new to everyone, and we haven't quite found our footing yet.

Reading the above comments, I am getting the sense that the best method of preservation is using hard drives rather than tapes. Should I understand that the preservation drives should be kept offsite in archival storage conditions, and routinely refreshed, migrated, etc, like we would with optical media?

I am envisioning our live server holding our resources in current formats, and a set of hard drives collected from previous years that store previous file formats. That way, in case we notice data loss or some other issue with our live files, we have the older versions to which to return and try conversion again. Am I understanding everyone properly?

Mr. Mitja Decman said...

Hi all,

It looks like this debate went in a technical and hardware direction, and I appreciate all the posts.
From reading the original post, I wanted to make just one simple, basic point: a backup is not an archive, and it is not digital preservation. So in the case mentioned, one thing is the server backups the IT department has to make to guarantee that, in the case of a disaster, everything is back in working order as soon as possible. The other thing is long-term preservation of the data, meaning that a separate archiving server, storage system, or repository has to be used for that. It can be tape or hard drives. It can be in one location or two, bringing with it all the problems of synchronization or mirroring, if you will. So you need people for that, whether by changing existing staff's tasks or employing someone new. Digital preservation is something new, and therefore an addition to the working environment in all possible ways.
