Techniques for backing up huge data sets

Summary: Use a backup system based on "intelligent differentials", where each backup results in a new "full backup" but runs at the speed of a differential. There are three known solutions: the Microsoft Block Level Backup Engine in Server 2008, BackupAssist’s File Replication Engine, and Rsync. All three methods can be scheduled from within BackupAssist.

It’s becoming increasingly difficult to back up huge sets of data, and traditional backup methods no longer cope adequately. Old school backup methods suffer from two main problems:

Data capacity of the backup device
Time taken to do the backup - also known as the "backup window"

Let’s look at how to overcome these two issues.

Firstly - the capacity of the backup device is an issue because the growth in capacity of many backup devices is simply not keeping up with the growth in data storage requirements. This is particularly true for tape drives. And to further compound the problem, many businesses experiencing data bloat are businesses that produce data that cannot be further compressed, such as medical images, digital photographs, music, and video. Because the data is not compressible, tape drive manufacturers’ claimed data capacities (based on 2:1 compression) are simply inaccurate and irrelevant.

The best solution currently is to switch to disk based backup, such as eSATA/USB connected hard drives or NAS devices. Disk sizes are growing, with the capacity of individual 3.5" hard drives increasing from about 250 GB to 1.5 TB in around 4 years. That works out to a doubling roughly every 18 months, a pace similar to the one popularly associated with Moore’s law, and the trend looks likely to continue.
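As a rough check on that growth rate, the short Python sketch below estimates how often drive capacity doubled, using only the 250 GB and 1.5 TB figures quoted above. It assumes steady exponential growth over the 4 years, which is a simplification, and the function name is purely illustrative.

```python
import math


def doubling_time_months(start_gb: float, end_gb: float, years: float) -> float:
    """Months per capacity doubling, assuming steady exponential growth."""
    doublings = math.log2(end_gb / start_gb)  # 250 GB -> 1500 GB is about 2.58 doublings
    return years * 12 / doublings


months = doubling_time_months(250, 1500, 4)
print(f"Capacity doubled about every {months:.1f} months")  # about 18.6 months
```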

Secondly - the time taken to do the backup is increasing. For instance, in our recent tests, the fastest USB HDDs currently run at 30 MB/sec, or 108 GB/hr. So for a business that has 1 TB of data, a full backup takes roughly 9 to 10 hours - and that’s assuming a sustained transfer rate, which is a rather optimistic assumption.
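The backup window arithmetic can be sketched in a few lines of Python. The 30 MB/sec and 1 TB figures come from the text above. The 20 MB/sec figure is a made-up example of a drive that cannot sustain its peak rate, not a measured result, and decimal units (1 TB = 1000 GB) are assumed.

```python
def backup_hours(data_gb: float, rate_mb_per_sec: float) -> float:
    """Hours needed to copy data_gb at a sustained rate, using decimal units."""
    gb_per_hour = rate_mb_per_sec * 3600 / 1000  # 30 MB/sec -> 108 GB/hr
    return data_gb / gb_per_hour


# Figures from the article: 1 TB at a sustained 30 MB/sec.
print(f"{backup_hours(1000, 30):.1f} hours")  # 9.3 hours

# Hypothetical slower sustained rate, to show how quickly the window grows.
print(f"{backup_hours(1000, 20):.1f} hours")  # 13.9 hours
```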

The easy way of "fixing" this situation is to perform incremental or differential backups every day instead of full backups. With "old school" backups, however, a differential backup requires a full backup to be done first, and then only the differences between that full backup and the current state are backed up. This is flawed on two levels. Firstly, the restore becomes a multi-step process: restore the full backup, then the differential (or, in the case of incrementals, each incremental in turn). Secondly, the restored system is not identical to the original, because items deleted after the full backup but before the last differential are restored as well. This can leave large numbers of unwanted files on the file system.
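The second problem is easy to reproduce in a toy model. The Python sketch below treats a backup as a simple mapping from file path to content and assumes, as described above, that the differential records new and changed files but not deletions. The file names are invented for illustration.

```python
from typing import Dict

Files = Dict[str, str]  # file path -> content


def take_differential(full: Files, current: Files) -> Files:
    """Files that are new or changed since the full backup (deletions are not recorded)."""
    return {path: data for path, data in current.items() if full.get(path) != data}


def restore(full: Files, differential: Files) -> Files:
    """Two-step restore: lay down the full backup, then overlay the differential."""
    restored = dict(full)
    restored.update(differential)
    return restored


full_backup = {"report.doc": "v1", "old_scan.img": "x-ray"}
# Since the full backup: report.doc was edited, old_scan.img was deleted,
# and new_scan.img was added.
current = {"report.doc": "v2", "new_scan.img": "mri"}

differential = take_differential(full_backup, current)
restored = restore(full_backup, differential)

# The deleted file comes back, so the restore does not match the live system.
leftovers = sorted(set(restored) - set(current))
print(leftovers)  # ['old_scan.img']
```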

However, the new intelligent way of performing differential backups totally avoids these problems. What the new, "intelligent differential" technologies do is take an existing full backup as the base version, and then merge in the changes so that it becomes the new full backup, and the differences that were replaced in the merge are retained as past versions of the backup. This means that each backup is a full backup, and can be restored in one operation. It also means that you can store large amounts of backup history on each device, as only the differences between successive backups are stored.
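To make the idea concrete, here is a minimal in-memory Python sketch of such a scheme. It is not how any of the products discussed here are actually implemented. The class name, the method names and the use of "reverse deltas" to hold past versions are all assumptions made for illustration.

```python
from typing import Dict, List, Optional

Snapshot = Dict[str, bytes]                # file path -> content
ReverseDelta = Dict[str, Optional[bytes]]  # path -> previous content (None = did not exist)


class BackupStore:
    """Toy backup device: always holds a current full backup plus past differences."""

    def __init__(self) -> None:
        self.full: Snapshot = {}
        self.history: List[ReverseDelta] = []

    def backup(self, current: Snapshot) -> int:
        """Merge changes into the full backup; return the bytes that had to be copied."""
        delta: ReverseDelta = {}
        transferred = 0
        for path, data in current.items():
            old = self.full.get(path)
            if old != data:                # new or changed file
                delta[path] = old          # keep what was replaced as a past version
                self.full[path] = data
                transferred += len(data)
        for path in [p for p in self.full if p not in current]:
            delta[path] = self.full.pop(path)  # deleted file: retire it into history
        self.history.append(delta)
        return transferred

    def restore(self, versions_back: int = 0) -> Snapshot:
        """Return the latest backup, or an earlier one by undoing recent merges."""
        if not 0 <= versions_back <= len(self.history):
            raise ValueError("no such backup version")
        snapshot = dict(self.full)
        start = len(self.history) - versions_back
        for delta in reversed(self.history[start:]):
            for path, old in delta.items():
                if old is None:
                    snapshot.pop(path, None)
                else:
                    snapshot[path] = old
        return snapshot


store = BackupStore()
first = store.backup({"a.txt": b"one", "b.txt": b"two"})
second = store.backup({"a.txt": b"ONE!", "c.txt": b"three"})
print(first, second)     # 6 9
print(store.restore())   # latest state, restored in one operation
print(store.restore(1))  # previous state, rebuilt from the retained differences
```

Note that the deleted b.txt is absent from the latest restore but still recoverable from the previous version, which is the behaviour the conventional differential lacks.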

This means that "intelligent differential" backups give you the advantages of a full backup, without the drawbacks of conventional differentials!

This new method of backing up is well suited to large data sets. For example, if you have a working set of 1 TB and another 20 MB is added during the day, then only 20 MB needs to be transferred to the backup device - making it extremely fast.

There are three different backup technologies that use this new "intelligent differential" method:

At a drive image level: Microsoft’s Block Level Backup Engine in Server 2008
At a file level: BackupAssist’s File Replication Engine, in BackupAssist v5
Over the Internet: Rsync, which features in-file delta transfer technology

All three of these technologies are available in BackupAssist.
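The idea behind in-file delta transfer can be illustrated with a simplified Python sketch: split a file into blocks, compare block checksums, and send only the blocks that differ. This is not Rsync's actual algorithm, which uses rolling checksums so that matching data can be found at any offset. The fixed-offset comparison, the tiny block size and the function names below are assumptions chosen to keep the example short.

```python
import hashlib
from typing import Dict, List

BLOCK_SIZE = 4  # unrealistically small, so the example is easy to follow


def block_hashes(data: bytes, block_size: int = BLOCK_SIZE) -> List[str]:
    """Checksum of every fixed-size block in the receiver's existing copy."""
    return [hashlib.sha256(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)]


def compute_delta(old_hashes: List[str], new_data: bytes,
                  block_size: int = BLOCK_SIZE) -> Dict[int, bytes]:
    """Sender side: keep only the blocks whose checksum differs from the old copy."""
    delta: Dict[int, bytes] = {}
    for index, offset in enumerate(range(0, len(new_data), block_size)):
        block = new_data[offset:offset + block_size]
        digest = hashlib.sha256(block).hexdigest()
        if index >= len(old_hashes) or old_hashes[index] != digest:
            delta[index] = block
    return delta


def apply_delta(old_data: bytes, delta: Dict[int, bytes], new_length: int,
                block_size: int = BLOCK_SIZE) -> bytes:
    """Receiver side: rebuild the new file from unchanged old blocks plus the delta."""
    blocks = []
    for index, offset in enumerate(range(0, new_length, block_size)):
        if index in delta:
            blocks.append(delta[index])
        else:
            blocks.append(old_data[offset:offset + block_size])
    return b"".join(blocks)


old = b"AAAABBBBCCCCDDDD"
new = b"AAAAXXXXCCCCDDDDEE"  # one block changed, two bytes appended

delta = compute_delta(block_hashes(old), new)
sent = sum(len(block) for block in delta.values())
rebuilt = apply_delta(old, delta, len(new))
print(sent, "of", len(new), "bytes transferred")  # 6 of 18 bytes transferred
```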

So in order to back up a large data set, these are the steps to follow:

Choose a backup device that suits your needs. If you’re after a solution that involves the user swapping media onsite/offsite (like a traditional tape backup system) then look at eSATA or USB connected drives. At the time of writing, individual drives can be as big as 1.5 TB and this will undoubtedly increase as time goes by. If you’re after a fully automated backup system, then use a NAS device, which can go up to several terabytes in size.
Perform your backups using one of the "intelligent" differential backups in BackupAssist. Choose either the Drive Imaging engine for Server 2008, the File Replication Engine or Rsync.
Perform your first backup - this will be slow, as all the data needs to be copied to the device for the very first time. If you’re using multiple devices, such as USB HDDs, then the first backup to each device will take considerable time.
Perform daily backups - automatically,