Disaster Recovery Platforms

This section briefly introduces several cloud-based DR systems and discusses the benefits and weaknesses of each.


SecondSite

SecondSite is a cloud-based disaster-tolerance-as-a-service system. The platform is intended to address three challenges: (1) reducing the RPO, (2) failure detection, and (3) service restoration. To this end, it uses three techniques:

  • Using storage to keep writes between two checkpoints: Checkpoints move between sites at fixed intervals, so a failure within an interval would lose data. A Distributed Replicated Block Device (DRBD) is therefore used to store replicas in both synchronous and asynchronous modes.
  • Using a quorum node to detect and distinguish a real failure: A quorum node monitors the primary and backup servers. If the backup site does not receive replications within the waiting time, it sends a message to the quorum node. If the quorum node then receives a heartbeat from the primary node, the primary server is still active and the replication link is at fault; otherwise, the backup site becomes active.
  • Using a backup site: A geographically separated backup site allows groups of virtual machines to be replicated over wide-area Internet links. SecondSite thus enables fast failure detection and can differentiate network failures from host failures. Using DRBD, storage can be resynchronized to recover the primary site without interrupting the VMs at the backup site.

Although SecondSite is not suitable for stateless services, it increases availability for small and medium businesses.
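The quorum node's decision logic described above can be sketched as follows (a minimal illustration; the function and return names are assumptions, not taken from the SecondSite implementation):

```python
def quorum_decision(replication_received: bool, heartbeat_received: bool) -> str:
    """Decide the failover action at the quorum node (illustrative sketch).

    The backup contacts the quorum only after missing replication traffic
    for its waiting time; the quorum then checks the primary's heartbeat.
    """
    if replication_received:
        return "no_action"                 # replication healthy
    if heartbeat_received:
        return "replication_link_failure"  # primary alive: only the link failed
    return "activate_backup"               # primary presumed dead: fail over
```

The key point is that the quorum's heartbeat check is what separates a network partition from a genuine host failure, preventing a split-brain activation of the backup.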


Remus

Remus - based on the Xen hypervisor - is a high-availability cloud service that tolerates disasters using storage replication combined with live VM migration. In this system, protected software is encapsulated in virtual machines whose whole-system checkpoints are asynchronously replicated to a backup site at high frequency. Both replicas are assumed to be on the same local area network (LAN). Remus pursues three main goals: (1) providing a low-level service to gain generality, (2) transparency, and (3) seamless failure recovery.

Remus uses an active primary host and a passive backup host to replicate checkpoints. All writes are held in the backup's RAM until a checkpoint completes, and the migrated virtual machines execute on the backup only if a failure is detected. Each Remus epoch consists of four stages:

  • Stop the running VMs and propagate only the changed state into a buffer
  • Transmit the buffered state to the backup's RAM
  • Send an ACK to the primary host after the checkpoint completes
  • Release the network buffer to external clients.

This system integrates simple failure detection into the checkpointing process: if the backup does not receive a checkpoint within an epoch, it becomes active; conversely, if the primary receives no response from the backup within a specific period, it assumes the backup host has failed. However, Remus adds performance overhead, and therefore latency, because it must ensure consistent replication; it also requires significant bandwidth.
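The four-stage epoch above can be sketched as follows (a simplified illustration; the class and method names are assumptions, and VM pausing and page copying are abstracted into the `dirty_pages` argument):

```python
class BackupHost:
    """Passive backup that holds checkpointed state in RAM (illustrative)."""
    def __init__(self):
        self.ram = []

    def receive(self, pages):
        self.ram.append(pages)

    def ack(self):
        return True  # a real backup would ACK only a complete checkpoint


class NetBuffer:
    """Holds the VM's egress packets until the checkpoint is acknowledged."""
    def __init__(self):
        self.pending = []

    def hold(self, packet):
        self.pending.append(packet)

    def release(self):
        out, self.pending = self.pending, []
        return out


def remus_epoch(dirty_pages, backup, net_buffer):
    # Stage 1 happened before this call: the VM was paused and only its
    # changed state was copied into dirty_pages.
    backup.receive(dirty_pages)       # stage 2: transmit to backup RAM
    if backup.ack():                  # stage 3: ACK after checkpoint completes
        return net_buffer.release()   # stage 4: egress traffic becomes visible
    return []                         # no ACK: keep outputs buffered
```

Holding egress packets until the ACK is what guarantees external clients never observe state that the backup could not reproduce after a failover.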


Romulus

Romulus is a disaster-tolerant system based on the KVM hypervisor and designed as an extension of Remus. It provides a detailed disaster-tolerance algorithm in seven stages:

  • Disk replication and network protection
  • VM checkpoint
  • Checkpoint synchronization
  • Additional disk replication and network protection
  • VM replication
  • Replication synchronization
  • Failure detection and failover.

The first flaw of Remus is that it uses a single buffer to replicate writes between the primary host and the backup. A failure in this buffer before the checkpoint is transferred causes an inconsistency between the disk and the VM state, which can break Remus's fault tolerance. Romulus therefore opens a new buffer for disk writes after every checkpoint. The second flaw is that network egress traffic cannot be released until the checkpoint has been completely transferred to the backup storage host, which degrades performance; Romulus adds a separate egress-traffic buffer to solve this. Romulus can tolerate failures in two situations:

On the fly: disk and VM state are replicated into a new write buffer while the VM is running.

Failover: the ability to recover the service after a disaster.
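Romulus's fix for the single-buffer flaw can be sketched as follows (an illustrative structure, not the actual KVM-level implementation): after each checkpoint a fresh disk-write buffer is opened, so a failure while flushing the old buffer cannot mix pre- and post-checkpoint writes.

```python
class RomulusBuffers:
    """Illustrative sketch of the per-checkpoint write buffers."""
    def __init__(self):
        self.disk_buffers = [[]]  # newest buffer collects current writes

    def write(self, block):
        self.disk_buffers[-1].append(block)

    def checkpoint(self):
        # Seal the current buffer and open a new one for post-checkpoint
        # writes; the sealed buffer is flushed to the backup independently,
        # so a failure during the flush leaves disk and VM state consistent.
        sealed = self.disk_buffers[-1]
        self.disk_buffers.append([])
        return sealed
```

A separate egress buffer (omitted here) plays the analogous role for network traffic, as the Remus sketch already showed.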


DT Enabled Cloud Architecture

This is an extended architecture based on the seven-stage Romulus algorithm. It uses a hierarchical tree architecture based on the Eucalyptus IaaS architecture and provides a disaster-tolerant service that addresses resource allocation, a key challenge in DT services. Host and backup clusters are monitored by high-availability controllers, and each cluster has three different controllers:

  • Storage controller: To control and manage the cluster storage.
  • Cluster controller: To manage IPs, centralized memory and CPU availability.
  • Node controller: To load, start and stop the VMs.

Different nodes, and also different clusters, can communicate with each other for better resource allocation. For this purpose, the backup cluster controller allocates a VM to a node; the node controller then loads and starts the VM and allocates it to the primary host; finally, the primary node controller loads and starts the VM.

In this system, VM failover covers two scenarios. The first is cluster failure, in which the backup cluster is activated. The second is node failure, in which the cluster controller releases the VMs' IPs and allocates a backup node to recompose the required VMs. Because of its low-latency requirements, this system is most useful for extended-distance and metropolitan clusters.
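The two failover scenarios can be sketched as follows (the event names, dictionary keys, and return structure are illustrative assumptions, not the Eucalyptus-based controller API):

```python
def handle_failure(event, failed_node_vms=None):
    """Dispatch the two failover scenarios (illustrative sketch)."""
    if event == "cluster_failure":
        # The whole cluster is lost: activate the backup cluster.
        return {"action": "activate_backup_cluster"}
    if event == "node_failure":
        # A single node is lost: the cluster controller releases the
        # VMs' IPs and recomposes the VMs on a backup node.
        released = [vm["ip"] for vm in failed_node_vms]
        return {"action": "recompose_on_backup_node",
                "released_ips": released}
    return {"action": "none"}
```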


Kemari

Kemari is a cluster system that keeps VMs running transparently in the event of hardware failures. It uses a primary-backup approach: any storage or network event that changes the state of the primary VM must be synchronized to the backup VM. The system combines the benefits of lock-stepping and checkpointing - the two main approaches to synchronizing VM state - which are:

  • Less complexity than the lock-stepping approach.
  • No need for external buffering mechanisms, which would otherwise add output latency.
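Kemari's event-driven synchronization can be sketched as follows (an illustrative model with assumed names: the backup is synchronized whenever a storage or network event is about to change the primary's state, which is why no external output buffer is needed):

```python
class BackupVM:
    """Passive replica kept in sync on every state-changing event."""
    def __init__(self):
        self.state = {}

    def sync(self, snapshot):
        self.state = snapshot


def on_io_event(primary_state, backup, key, value):
    # A storage or network event changes the primary's state; Kemari
    # synchronizes the backup before the event's effects become
    # externally visible, so outputs need not be held in a buffer.
    primary_state[key] = value
    backup.sync(dict(primary_state))
```

This contrasts with Remus, which synchronizes on a fixed epoch and therefore must buffer egress traffic between checkpoints.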


RUBiS

RUBiS is a cloud architecture that aims at DR while minimizing costs with respect to the Service Level Agreement. As shown in Figure 5, during ordinary operation a primary data center comprising several servers and a database handles normal traffic. A cloud is in charge of disaster recovery with two types of resources: replication-mode resources, which are active and take backups before a disaster; and failover-mode resources, which are activated only after a disaster. Notably, service providers can rent the inactive resources to other customers to maximize revenue; in the case of a disaster, the leased resources must be released and allocated to the failover procedure.

Figure 5. Overview of the RUBiS system architecture
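The handling of the two resource modes on a disaster can be sketched as follows (the field names and structure are illustrative assumptions, not from the RUBiS description):

```python
def on_disaster(resources):
    """Reclaim leased failover-mode resources when a disaster strikes
    (illustrative sketch)."""
    for r in resources:
        if r["mode"] == "failover":
            r["leased"] = False   # revoke any lease to other customers
            r["active"] = True    # bring the resource up for failover
    # Replication-mode resources were already active and are untouched.
    return resources
```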


Taiji

Taiji is a Hypervisor-Based Fault Tolerance (HBFT) prototype that uses a mechanism similar to Remus. However, whereas Remus uses separate local disks for replication, Taiji uses Network Attached Storage (NAS). The shared storage may become a single point of failure and thus a weakness of this method, so a RAID (Patterson et al., 1988) or commercial NAS (Synology, online) solution should be deployed. On the other hand, because shared storage is used, less synchronization is needed and the file system state is preserved in the event of a disaster.


HS-DRT System

The goal of the HS-DRT system is to protect important data from natural or man-made disasters. The system combines an HS-DRT processor - which we described as SDDB (Section 6, Part 5) - with a cloud computing system. Clients act as terminals that request web applications. The HS-DRT processor functions as a web application and also performs encryption, spatial scrambling, and fragmentation of data; the data is then sent to and stored in a private or public cloud. The system architecture is shown in Figure 6. This system greatly increases the security of data before and after a disaster in cloud environments. However, it has two weaknesses:

  • The performance of the web application decreases as the number of duplicated copies increases.
  • The system cannot guarantee consistency between different copies of file data.

Figure 6. The architecture of the HS-DRT system
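The scramble-and-fragment preparation step can be sketched as follows (a loose illustration only: a keyed shuffle stands in for spatial scrambling and a simple split for fragmentation, and encryption is omitted; none of this is the actual HS-DRT algorithm):

```python
import random

def hsdrt_prepare(data: bytes, n_fragments: int, seed: int):
    """Scramble data with a keyed permutation, then split it into
    fragments for dispersal across clouds (illustrative sketch)."""
    scrambled = bytearray(data)
    rng = random.Random(seed)   # the seed plays the role of a key here
    rng.shuffle(scrambled)      # stand-in for spatial scrambling
    size = max(1, len(scrambled) // n_fragments)
    return [bytes(scrambled[i:i + size])
            for i in range(0, len(scrambled), size)]
```

No single fragment reveals the original byte order, which is the intuition behind dispersing scrambled fragments over multiple clouds; recovery requires the key (seed) and all fragments, which is also why consistency across copies matters.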


PipeCloud

This cloud-based multi-tier application system uses the pipelined replication technique (described in the previous section) as a DR solution. The PipeCloud architecture comprises a cloud backup site and a primary data center. Its goal is to mirror storage to the backup site while minimizing the RPO. PipeCloud's main tasks are:

  • Replicating all disk writes to a backup site by the replication technique
  • Tracking the order and dependencies of the disk writes
  • Releasing network packets only after the corresponding disk writes are stored at the backup site.

By overlapping replication with application processing, the system reduces the impact of WAN latency on performance, yielding higher throughput and lower response time, and it guarantees zero-data-loss consistency. However, unlike Remus, PipeCloud cannot protect memory state, because doing so would impose a large overhead on the WAN.
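The pipelining idea can be sketched as follows (the class and method names are illustrative assumptions): each outgoing packet is tagged with the number of disk writes it depends on and is released only once the backup has acknowledged that many writes, while the application keeps processing in the meantime.

```python
class PipelinedReplication:
    """Track disk-write dependencies of outgoing packets (illustrative)."""
    def __init__(self):
        self.writes_sent = 0    # writes mirrored to the backup so far
        self.writes_acked = 0   # writes the backup has confirmed durable
        self.held = []          # (required_acks, packet) awaiting release

    def disk_write(self):
        # The write is sent to the backup asynchronously; the application
        # continues immediately instead of blocking on the WAN round trip.
        self.writes_sent += 1

    def send_packet(self, packet):
        # The packet may depend on every write issued before it.
        self.held.append((self.writes_sent, packet))

    def backup_ack(self, n):
        # Backup confirms n more writes; release packets whose
        # dependencies are now durable at the backup.
        self.writes_acked += n
        released = [p for req, p in self.held if req <= self.writes_acked]
        self.held = [(req, p) for req, p in self.held
                     if req > self.writes_acked]
        return released
```

This ordering guarantee is what yields zero data loss as seen by clients: no client ever observes a reply whose underlying writes could vanish in a disaster.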


Disaster-CDM

Huge amounts of disaster-related data are generated by governments, organizations, automation systems, and even social media. Disaster-CDM aims to provide a Knowledge as a Service (KaaS) framework for disaster cloud data management, which can lead to better disaster preparation, response, and recovery.

As shown in Figure 7, this system uses both cloud storage and NoSQL databases to store data. Disaster-CDM consists of two parts:

  • Knowledge acquisition: Obtaining knowledge from a variety of sources, processing it, and storing it in data centers.
  • Knowledge delivery service: Merging information from diverse databases and delivering knowledge to users.

Figure 7. Disaster-CDM Framework


Distributed Cloud System Architecture

In Silva et al., the authors introduce a cloud system that provides high dependability through extensive redundancy. The system has multiple geographically separated data centers, each including both hot and warm physical nodes. VMs are placed on both warm and hot physical nodes but run only on the hot nodes. For DR, a backup server stores a copy of each VM.

When a physical node fails, its VMs migrate to a warm physical node. In the case of a disaster that makes a data center unavailable, the backup site transmits the VM copies to another data center. Although this system architecture is expensive, it greatly increases dependability, which can be adequate for Infrastructure as a Service (IaaS) clouds. In addition, the paper introduces a hierarchical approach to modeling cloud systems in terms of dependability metrics and disaster occurrence using the Stochastic Petri Net approach. Figure 8 shows the architecture of this DR system.

Figure 8. Distributed cloud system architecture
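The two recovery paths can be sketched as follows (the data-center layout and field names are illustrative assumptions, not from Silva et al.): a node failure migrates VMs from the hot node to a warm node within the same data center, while a data-center disaster makes the backup server ship its VM copies to another data center.

```python
def recover(failure, datacenter, backup_server, other_dc):
    """Dispatch the two recovery paths (illustrative sketch)."""
    if failure == "node":
        # The hot node is lost: its VMs migrate to a warm node, which
        # already holds placed (but not running) VM instances.
        datacenter["warm"]["vms"] = datacenter["hot"].pop("vms")
        return datacenter
    if failure == "datacenter":
        # The whole data center is lost: the backup server transmits
        # its stored VM copies to another data center's hot nodes.
        other_dc["hot"]["vms"] = list(backup_server["copies"])
        return other_dc
```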


Table 7 shows an overall comparison of different cloud-based DR platforms in terms of 10 key properties.

Table 7. Comparing cloud-based DR platforms in terms of different properties