By failing to prepare, you are preparing to fail. - Benjamin Franklin
In the aftermath of Amazon's S3 storage service interruption, several articles have detailed the incident, including Amazon's own very candid summary and admission. Since Amazon is not the first cloud provider to suffer a significant service outage, many of the companies affected by the interruption need to ask themselves an important question:
What is our business continuity/disaster recovery plan if our data center goes offline?
What is the plan?
Regardless of whether a company uses a cloud provider or hosts applications in its own data center, business continuity and disaster recovery planning has been a core activity undertaken by enterprises for decades. A locally hosted application can be knocked out by a faulty disk array, a power outage or a network disruption within an internet service provider. Without a plan in place to work around each of those potential issues, the application would fail. One of the benefits of using a cloud provider for hosting applications is the availability of built-in functionality to assist in implementing a business continuity plan, but as we saw with the S3 outage, being in the cloud by no means guarantees a higher level of service. The cloud providers have given us the tools to make our applications resilient, but it is up to us to make use of them. If a company has no disaster recovery plan, or has a plan that is not regularly tested to ensure it is adequate, Benjamin Franklin's quote will ring true every time: by failing to prepare, you are preparing to fail.
Establish recovery objectives
When performing a business impact analysis, two metrics must be determined before coming up with a disaster recovery plan: the recovery time objective (RTO) and the recovery point objective (RPO).
A business's RTO is the target time set for resumption of product, service, or activity delivery after an incident. In simple terms, it is the maximum time the business can take to be completely up and running again after an incident occurs. The shorter the RTO (2 hours versus 2 days, for example), the larger the investment required in a DR strategy.
The RPO is different: it defines the maximum tolerable period in which data may be lost. Defining the RPO immediately determines how frequently backups must be performed. If the RPO is 8 hours, then a backup every 8 hours is sufficient. If it is 1 hour, then backups need to occur every hour and the investment will be higher.
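To make the two metrics concrete, here is a minimal sketch in Python of how a backup schedule and a measured restore time could be checked against them. The names RecoveryObjectives, backup_interval_meets_rpo and restore_time_meets_rto are hypothetical and exist only for this illustration, and the 2 hour RTO and 1 hour RPO are example values rather than recommendations.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    """Hypothetical container for the two metrics discussed above."""
    rto: timedelta  # maximum acceptable time to be up and running again
    rpo: timedelta  # maximum tolerable period in which data may be lost


def backup_interval_meets_rpo(interval: timedelta, objectives: RecoveryObjectives) -> bool:
    # If backups run every `interval`, up to one full interval of data can be lost,
    # so the interval must not exceed the RPO.
    return interval <= objectives.rpo


def restore_time_meets_rto(measured_restore: timedelta, objectives: RecoveryObjectives) -> bool:
    # A restore time measured during a DR test must fit inside the RTO.
    return measured_restore <= objectives.rto


# Example values only: a 2 hour RTO and a 1 hour RPO.
objectives = RecoveryObjectives(rto=timedelta(hours=2), rpo=timedelta(hours=1))

print(backup_interval_meets_rpo(timedelta(hours=8), objectives))  # False
print(backup_interval_meets_rpo(timedelta(hours=1), objectives))  # True
print(restore_time_meets_rto(timedelta(minutes=90), objectives))  # True
```

A check like this is only as good as its inputs: the restore time should come from an actual test of the plan, not from an estimate.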
Determining the RTO and RPO will help answer the following questions:
- How much risk are we willing to absorb?
- How much does a disruption impact our bottom line?
- How much does a disruption impact our reputation?
- What are the short- and long-term effects of a disruption?
- How much are we willing to spend to mitigate those risks?
Once those questions are answered and the RTO and RPO have been determined, the process of mapping specific requirements to the available options for meeting those objectives can commence. Cloud providers often publish whitepapers and other literature on how to architect for a DR scenario. Whether the proposed DR solution is to use multiple regions within one cloud provider or to use multiple cloud providers, the options to design for a DR scenario are seemingly limitless. The constraints of the solution will be determined by budget and the requirements set forth by the RTO and RPO metrics.
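As a rough illustration of that mapping exercise, the sketch below filters a list of candidate DR options down to those that satisfy the RTO, the RPO and the budget. Everything in it is an assumption made for the example: the DrOption type, the viable_options function, the option names, and especially the recovery times and monthly costs, which are placeholder figures and not real provider data.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class DrOption:
    """Hypothetical description of one candidate DR design."""
    name: str
    achievable_rto: timedelta  # how quickly this design can restore service
    achievable_rpo: timedelta  # how much data this design can lose at worst
    monthly_cost: float        # placeholder cost figure


def viable_options(options, rto, rpo, budget):
    """Return the options that meet both objectives within budget, cheapest first."""
    fits = [
        o for o in options
        if o.achievable_rto <= rto
        and o.achievable_rpo <= rpo
        and o.monthly_cost <= budget
    ]
    return sorted(fits, key=lambda o: o.monthly_cost)


# Placeholder candidates; the numbers are invented for illustration only.
candidates = [
    DrOption("nightly-backup-restore", timedelta(hours=24), timedelta(hours=24), 200.0),
    DrOption("warm-standby-second-region", timedelta(hours=1), timedelta(minutes=15), 2500.0),
    DrOption("active-active-multi-provider", timedelta(minutes=5), timedelta(minutes=1), 9000.0),
]

for option in viable_options(candidates, rto=timedelta(hours=2), rpo=timedelta(hours=1), budget=3000.0):
    print(option.name)
```

With these made-up inputs only the warm standby option survives: the nightly backup misses both objectives and the active-active design exceeds the budget. Tightening the RTO or RPO, or lowering the budget, shrinks the list further, which is the trade-off the questions above are meant to surface.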