Backup and disaster recovery in the age of virtualisation

We do backups because we know we have to – in case we lose the primary versions of data and/or the systems that create and manage that data.

It could just be that the original gets accidentally deleted or changed; however, for many, the possibility of system failure will be the top concern. That could be anything from a disk crash on a user’s device to a datacentre crushed by a meteorite.

When such a failure happens, it is not just data that needs restoring, but the full working environment; in other words, disaster recovery.

Backup and disaster recovery are not interchangeable terms, but disaster recovery is not possible without backup in the first place. Disaster recovery is having the tested wherewithal to get systems restored and running as quickly as possible, including the associated data.

The increasing use of virtualisation has changed the way disaster recovery is carried out because, in a virtual world, a system can be recovered by duplicating images of virtual machines (VMs) and recreating them elsewhere.

VM replication, disaster recovery and the way the market has adapted to virtualisation are critical topics to consider.

In the old days, if a server crashed then you would probably go through the following steps:

  • Get a new server. Hopefully you would have a spare to hand – probably an out-of-date model, if it had not been needed for some time;
  • Then, either install all the systems and applications software, attempting to get all the settings as they were before – unless you had done that in advance, which would not have been possible if you had invested in only one or two standby servers to cover many more live ones, not knowing which would fail;
  • Or, for a really critical application, you may have had a “hot” standby, all fired up and ready to go. However, that would have doubled the cost of application ownership, with all the hardware and software costs paid twice;
  • Restore the most recent data backup. For a database, that might be almost up to date, but for a file server an overnight backup may be all that is available, so data can be restored only as far as the end of the last working day. Anything that was in memory at the time of the failure is likely to have been lost. How far back you can afford to go is defined in a backup plan as the recovery point objective (RPO).
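
To make the RPO idea concrete, here is a minimal Python sketch that measures the data-loss window between the last backup and a failure and checks it against an RPO. The function names, timestamps and RPO values are hypothetical, chosen only to mirror the overnight file server backup described above.

```python
from datetime import datetime, timedelta


def data_loss_window(last_backup: datetime, failure: datetime) -> timedelta:
    """Return how much work is lost when restoring from last_backup."""
    if failure < last_backup:
        raise ValueError("failure time precedes the last backup")
    return failure - last_backup


def meets_rpo(last_backup: datetime, failure: datetime, rpo: timedelta) -> bool:
    """True if the data-loss window is within the recovery point objective."""
    return data_loss_window(last_backup, failure) <= rpo


# Hypothetical example: an overnight backup, then a failure the next afternoon.
overnight_backup = datetime(2014, 6, 2, 23, 0)
failure_time = datetime(2014, 6, 3, 15, 30)

loss = data_loss_window(overnight_backup, failure_time)
print(loss)  # 16:30:00
print(meets_rpo(overnight_backup, failure_time, timedelta(hours=24)))  # True
print(meets_rpo(overnight_backup, failure_time, timedelta(hours=1)))   # False
```

With a 24-hour RPO the overnight backup is acceptable; with a one-hour RPO it is not, which is the kind of gap that pushes organisations towards more frequent copies.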

Virtualisation changes everything and increases the number of options. First, data can be easily backed up as part of an image of a given virtual machine (VM), including application software, local data, settings and memory. Second, there is no need for a physical server rebuild; the VM can be recreated in any other compatible virtual environment. This may be spare in-house capacity or capacity acquired from a third-party cloud service provider. This means most of the costs of redundant systems disappear.

Disaster recovery is cheaper, quicker, easier and more complete in a virtual world. In the idiom of backup, faster recovery time objectives (RTOs) are easier to achieve. At least, that is the theory, but it can get more complicated with the need to co-ordinate different VMs that rely on each other – for example an application VM and a database VM – so testing recovery is still paramount and can forestall problems in live systems.
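
One way to think about co-ordinating VMs that rely on each other is as a dependency graph: a VM should only be started once everything it depends on is running. The sketch below uses Python's standard-library graphlib module (Python 3.9 or later) to derive a start-up order. The VM names are hypothetical, and the web-vm entry is an added assumption used only to show a longer chain than the application and database pair mentioned above.

```python
from graphlib import TopologicalSorter


def recovery_order(depends_on: dict[str, set[str]]) -> list[str]:
    """Return an order in which VMs can be started.

    depends_on maps each VM to the set of VMs that must be running first.
    """
    return list(TopologicalSorter(depends_on).static_order())


# Hypothetical environment: the application needs the database,
# and a web front end (an assumption for illustration) needs the application.
dependencies = {
    "database-vm": set(),
    "application-vm": {"database-vm"},
    "web-vm": {"application-vm"},
}

order = recovery_order(dependencies)
print(order)  # ['database-vm', 'application-vm', 'web-vm']
```

If the dependencies contain a loop, static_order() raises graphlib.CycleError, which is a useful signal that the recovery plan itself needs attention before any live failure exposes it.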

There are a number of different approaches, from tightly integrated hypervisor-level VM replication through to disaster recovery as a service (DRaaS).

The leading virtualisation platform suppliers – including VMware, Microsoft (Hyper-V) and Citrix (Xen) – offer varying levels of VM replication services embedded in their products. These services are tightly integrated into the hypervisor itself and so are limited to a given virtual environment. However, this does give them the potential to achieve the performance needed for continuous data protection (CDP) using shadow VMs as virtual hot standbys, minimising both RPOs and RTOs.

There are other products that tightly integrate VM replication at the hypervisor level, for example EMC’s RecoverPoint, which supports the co-ordinated replication and recovery of multiple VMs, so it can ensure a VM running an application is consistent with an associated database VM. Currently this is only for VMware, but support for Hyper-V and cloud management stacks such as OpenStack is on the horizon.

Another is Zerto, which says it has built in better automation and orchestration than the virtualisation platform suppliers, further minimising the impact on the run-time environment. Zerto currently supports just VMware but plans to extend support to Hyper-V and Amazon Web Services (AWS), which means that, in the future, it will support failover from an in-house VMware system to, say, AWS or another non-VMware-based system. Its product could also be used for pre-planned migration of workloads.

Many other virtual-aware tools work by taking snapshots of VMs at given intervals. This involves pausing the VM for long enough to copy its data, settings and memory before returning it to its previous state. The snapshot can be used to recreate the VM over and over again. The RPO depends on how often snapshots are taken (which could be often enough to be close to CDP, but that would affect overall performance). The RTO depends on little more than how quickly access can be gained to an alternative virtual resource which, with the right preparation, should be almost immediate.
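
The trade-off between snapshot frequency and performance can be illustrated with some simple arithmetic. This Python sketch assumes a fixed pause per snapshot (the two-second figure is purely hypothetical) and treats the worst-case RPO as the full interval between snapshots, that is, a failure just before the next snapshot is due. Real products will behave differently; this only shows the shape of the trade-off.

```python
def snapshot_tradeoff(interval_s: float, pause_s: float) -> dict[str, float]:
    """Estimate worst-case RPO and pause overhead for periodic snapshots.

    interval_s: seconds between the start of consecutive snapshots.
    pause_s:    seconds the VM is paused per snapshot (an assumed constant).
    """
    if interval_s <= 0 or pause_s < 0 or pause_s >= interval_s:
        raise ValueError("need 0 <= pause_s < interval_s")
    return {
        # A failure just before the next snapshot loses a full interval of work.
        "worst_case_rpo_s": interval_s,
        # Share of wall-clock time the VM spends paused for snapshots.
        "paused_fraction": pause_s / interval_s,
    }


# Hypothetical intervals: hourly, every five minutes, every ten seconds.
for interval in (3600, 300, 10):
    result = snapshot_tradeoff(interval, pause_s=2)
    print(
        f"interval={interval}s "
        f"worst-case RPO={result['worst_case_rpo_s']}s "
        f"paused={result['paused_fraction']:.1%}"
    )
```

As the interval shrinks towards CDP-like behaviour, the worst-case RPO falls but the share of time the VM spends paused rises sharply, which is the performance impact noted above.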

A number of new suppliers specialise in virtual environment backup. Swiss-based Veeam launched its product in 2008 and supports VMware and Microsoft Hyper-V. Nakivo (founded 2012) only supports VMware. As these products have been built for a virtual world, they have many of the required adaptations built in from the start, for example VM snapshotting and network acceleration to make off-site replication more efficient.

The traditional backup suppliers have adapted their products. For example, Symantec has just released Backup Exec 2014, which it believes matches the capability and performance of the new arrivals. Dell claims that its AppAssure mimics CDP by using a “smart agent” that avoids freezing the VM and takes a snapshot at least once every five minutes. CommVault’s Simpana and Arcserve have also faced the challenge of catching up.

One difference with many of the traditional suppliers is their capability to support older physical environments alongside virtual ones, a mix that remains the situation in many organisations. It also means their products are often used for migration, that is, for backing up a physical server and restoring it as a VM.

Many cloud infrastructure service providers, for example Rackspace and Amazon, provide VM replication, enabling customers to put their own failover in place, but generally this is limited to their own platforms.

Disaster recovery as a service (DRaaS) providers

The widespread use of virtualisation and the availability of cloud platforms for recovering workloads have led to a proliferation of DRaaS offerings. Here the replication of VMs is embedded in the service, so the customer has little to do other than carry out due diligence and sign on the dotted line.

Some are offered by cloud/hosting service providers; for example NTT Communications has a European offering in partnership with US-based DRaaS provider Geminare. Broader disaster recovery specialists such as SunGard and IBM include DRaaS in their portfolios.

DRaaS providers need to offer distinctive value to make it worth their customers’ while. Some take this to a new level, for example UK-based Plan B Disaster Recovery says its Microsoft Windows Server DRaaS offering can guarantee recovery, because it includes nightly testing of the recoverability of the images it takes of its customers’ server environments. The company says this not only ensures recoverability but often pre-empts problems the customer has yet to notice. Plan B operates at the application level so is hypervisor-neutral, supporting VMware, Hyper-V and Xen. Plan B’s service can image physical servers as well as virtual ones.

Quorum offers a service called onQ that was originally developed for the US Navy to enable the rapid movement of processing from one part of a ship to another in times of battle damage. As a result, Quorum says, it is very fast and very resilient, and it supports physical or virtual Linux and Windows servers. The onQ service is also hypervisor-agnostic. In the UK it uses a local datacentre partner to recover the customer server images as VMs, which it claims allows RTOs as quick as a server reboot.

Interestingly, Plan B says that, whenever its service has been invoked to recover a physical server in a virtual environment, the customer does not go back. In other words, disaster recovery services can be used to migrate to virtual environments, but can also provide the motivation to do so in the first place. And that may have got you thinking – if cloud is good enough as a secondary backup for even our most critical applications, could it not actually also become our primary platform in the longer term?