Sunday, July 09, 2006

Backup Strategies

Since my recent conversion to running Linux (Ubuntu, to be specific) as my desktop OS on my notebook full time, I've been playing with various options for backing up data.

Since I have separate Linux desktops both at home and work, rsync is a natural first choice. It comes as a standard part of the OS and lets you easily mirror a copy of your data. One of the best features is that when you run it to mirror your latest changes to the backup copy, it doesn't have to actually copy your entire dataset - only the differences. This makes the process much faster, and also makes it more likely that you'll make backups frequently.

All this requires only a single command (assuming that you have previously set up SSH to use public key authentication):

rsync -avz --delete -e ssh ~/e/work/ eric@myworkbox:~/backups/laptop/e/work

In my case, I have a directory under my home directory called e which contains all the data I am concerned with backing up. Under e is another level in the hierarchy, a directory called work, which contains all of the work-related data I wish to back up (source code, documents, etc.). This makes it very easy to make a backup mirror at work that contains only work-related stuff, and none of my personal data. When I get home, I can run a similar command on the e directory which backs up both my work and personal data to my other personal Linux box.
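
For the sake of illustration, the command I run at home looks something like this (the hostname homebox is just a stand-in for my home machine; the --delete flag is what keeps the mirror exact by removing files from the backup copy that I've deleted locally):

rsync -avz --delete -e ssh ~/e/ eric@homebox:~/backups/laptop/e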

OK, now we're cookin'. But one shortcoming of this solution is that the only thing you can restore is whatever is in the last backup copy you've made. For example, let's say I have a source file foo.cpp that I'm working on. I haven't checked it into the revision control system because it's still a work in progress. Yesterday I had a bad day and I deleted the file, but I didn't realize it until today. Unfortunately, I ran my backup at the end of that day, and it dutifully deleted my backup copy of the file to keep the mirror in sync. Bummer.

I stumbled across this nifty little script which adds a level of versioning to your backups. It's just a bit more complicated than the one-liner above, but it's worth it. It still keeps an exact copy of your data on the target, but it also keeps a version of your files which changed for each day of the week. So in my scenario above, when I ran my backups the previous day, it would have removed the file I deleted from the full mirror copy, but also saved a copy of the file in the special daily backup subdirectory.
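
I won't reproduce the script here, but the basic idea boils down to something like this sketch (my own simplification, not the actual script; the paths and hostname are placeholders):

#!/bin/sh
# Rough sketch of the day-of-week versioning idea, not the actual script.
# "current" holds the full mirror; anything that changed or was deleted since
# the last run gets moved into a directory named for the current day of the week.
DAY=`date +%A`
ssh eric@myworkbox "rm -rf ~/backups/laptop/$DAY"   # clear out last week's copy of this weekday
rsync -avz --delete --backup --backup-dir=../$DAY \
      -e ssh ~/e/work eric@myworkbox:~/backups/laptop/current
# A relative --backup-dir is interpreted relative to the destination directory,
# so the day-of-week directories end up alongside "current".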

This solution is an improvement, but still not perfect. For one thing, it assumes that you are only going to run your backups once a day. If you run it on a Tuesday, and then run it a second time the same day, it will see the "Tuesday" directory, assume it was from last Tuesday, delete its contents, and archive off only the differences between the first and second backup from the same day. Also, seven days is the farthest back in history you can ever go.

So the solution I'm currently using is a utility called rdiff-backup, which I read about in Sys Admin magazine. This solution required a bit more legwork, as I had to satisfy some dependencies (Python, librsync and zlib), but it was worth it. It still uses rsync at its core (well, technically, the rsync libraries, not the standalone utility), so you aren't copying all the data each time. But the wonderful additional benefit it offers is that it keeps diffs of all your files, including the addition and removal of files. So now, I have a small script that runs the following commands:

cd /home/eric/e
rdiff-backup -v5 work workbox.ericasberry.net::backups/laptop.work
rdiff-backup -v5 --remove-older-than 60D workbox.ericasberry.net::backups/laptop.work


The first rdiff-backup command actually backs up the data. The second tells rdiff-backup to remove from its repository any file revisions older than 60 days. It's a design limitation of rdiff-backup that the removal of older versions has to run as a separate step. Fortunately, it executes very quickly.

I have to admit that while the backup process is super-simple once you create the script, the restore process can be a little cumbersome. This is especially true if you're not exactly sure of the time and date of the specific revision you want to retrieve. It's also completely a command-line proposition, so if working at the shell level intimidates you, it's probably not the solution for you. I think at some point I'm going to try cobbling together some kind of GUI front-end to make this process a little easier, unless I find someone else has already beaten me to it.
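
For the curious, a restore looks roughly like this. Say I want back the copy of foo.cpp as it existed a day ago (the paths follow my setup above; the first command just lists the available increments so you can figure out which date you actually want):

rdiff-backup --list-increments workbox.ericasberry.net::backups/laptop.work
rdiff-backup -r 1D workbox.ericasberry.net::backups/laptop.work/foo.cpp /home/eric/e/work/foo.cpp

The -r option ("restore as of") takes a time spec like 1D for one day ago, or an actual date.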

So, overall, I'm pretty happy with this solution. The only thing that worries me is the case of a disaster - say a fire or robbery where I end up losing all my computers. Replacing the hardware would be enough of a headache, but my backup efforts (at least for my personal, non-work-related data) would be in vain, because I wouldn't have any surviving copies of those backups. Obviously, the ultimate, secure solution would involve off-site backups.

There are a few possible alternatives here. First, I could just mirror my personal data in addition to my work data to my machine at work. There's plenty of disk space available, but I just don't feel comfortable doing that. There's probably a bunch of smack talk in the employee handbook about using work resources for personal use anyway. ;) So, scratch that.

Another solution would be to archive my data to a CD-R or DVD-R and keep that disc in a fireproof safe or a safe-deposit box. Well, there are a couple of problems here. First of all, I have so much data (when you include things like photos, digital music, etc.) that it wouldn't all fit on removable media. So I'd have to back up only a small subset - my most critical data. Second, it would simply be too much of a hassle. I know I'd probably do it once in a while, but I suspect I wouldn't be disciplined enough to do it weekly, much less daily.

The final solution is to use one of the many online backup solutions that are available. I've spent some time researching this but a few things have made me hesitate.

First, most of them require that you use some kind of proprietary backup software which they provide, nearly all of which run only on Windows (though I've found a few that will also run under OSX). I want something that will work in Linux, preferably with rdiff-backup. I could probably work around this by setting up Samba shares, etc, but I'd really like to keep the process all contained on my server box.

Second, in general you either have to trust that nobody is going to snoop on your data on the third party's server, or you have to encrypt the data yourself before you make the backup copies. (Maybe I'm just paranoid). I could set up something with gpg to automatically encrypt the data, but I believe this would wreak havoc with rsync/librsync, and would most likely add a lot of time to the process.
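
Just to illustrate, the brute-force approach would be something like encrypting a tarball with a passphrase before shipping it off (gpg -c does symmetric encryption). The trouble is that the encrypted output changes completely even when only a few files change, so rsync can't find matching blocks to skip:

tar czf - /home/eric/e | gpg -c -o e-backup.tar.gz.gpg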

Finally, and probably most importantly, most of these services are just way too expensive to use with large amounts of data. (Not that they are cheap even with relatively small amounts of data!) I believe, however, that I've found a pretty good solution that I'm going to begin experimenting with over the next few days: Amazon's S3 in conjunction with a backup utility called JungleDisk.

This looks like a great solution because Amazon's pricing for data storage/transfer is just pennies per gigabyte. JungleDisk is available for Windows, Mac and most importantly, Linux. By installing another package called DAVfs, you can actually treat your remote backups as part of your regular filesystem. (This feature apparently comes for free if you're running under Windows or OSX). Not only that, all of your data is automatically encrypted. Nice!
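
As I understand it, JungleDisk exposes the S3 storage through a local WebDAV server, so once DAVfs is installed, mounting it should be something along these lines (the port number and mount point here are guesses on my part; I haven't actually set this up yet):

mount -t davfs http://localhost:2667/ /mnt/jungledisk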

The only drawback I've found so far is that JungleDisk can't be run as a daemon; there is a GUI that you must launch to start it up. That's not really much of a problem, assuming the software is stable: I can just start an X session with a VNC server, connect with a VNC client, launch the GUI, then disconnect the client and leave JungleDisk running merrily away in that headless Xvnc session. I can always reconnect to that Xvnc instance later if I need to restart or tweak JungleDisk.
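
In practice, the whole song and dance should be just a couple of commands (display :1 is arbitrary; any free display number works):

vncserver :1            # start a headless X session on display :1
vncviewer localhost:1   # connect, launch the JungleDisk GUI from a terminal in
                        #   that session, then close the viewer; Xvnc and
                        #   JungleDisk keep running
# Reconnecting later with the same vncviewer command lets me restart or tweak it.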

Well, I hope this post helps someone out there looking to devise their own backup strategy. If nothing else, maybe it will provide some ideas to explore further.


2 Comments:

Anonymous said...

Thanks eDog! I'm now using rsync to back up the wiki at work. I was already doing something very similar where I was zipping the src content and copying it into daily folders. I like the incremental backup of rsync better.

You can do the daily-version backup in 1 line:
rsync -avz -e ssh user@server:/var/www/twiki ~/backups/twiki/`date +%A`

3:45 PM  
eric said...

Your one-liner is certainly a valid approach. But I would like to point out that the script I linked to does offer some advantages over it, IMHO:

The one-liner has to copy each file to the destination directory every time, whereas the script only has to copy the differences.

The one-liner involves some additional work, either another script or manual steps, to clean up older versions of the backups. The script automatically keeps one week's worth of history.

The script allows you to identify what files were removed/modified/etc. on a particular date, just by peeking in the day-of-the-week subdirectory.

It just depends on what you need, I guess. For me, I prefer the script over the one-liner, but I prefer rdiff-backup over both of them.

9:46 AM  
