Will online backup services become obsolete as computing moves to the cloud? The short answer? We’re not shaking in our boots, and you need not be either: managed service providers become even more relevant as they build solutions that leverage the power of the cloud while also mitigating new and subtle risks. Below, we examine the benefits and hidden risks of cloud IaaS services.
So you’ve just migrated all of your servers to run in the cloud… now you’ll never have to worry about things like power supply failures, hard drive deaths, data corruption, networking issues, and a myriad of other types of IT failures, right? Now you can rely on the likes of Amazon AWS (and friends) to do all the heavy lifting, provision IT servers with a mouse click, and then live happily ever after? Isn’t that the promise of cloud infrastructure as a service (IaaS)?
To be blunt, no.
To draw an analogy between driving cars and cloud computing, cloud computing provides:
- Scalability (drive as many cars as you want)
- Elasticity (change how many cars you’re driving at any time, based on demand)
- Multi-tenancy (everyone draws from one big fleet of cars purchased with massive economies of scale)
- Virtualization (switch what kind of car you drive at any time)
- Utilization-based billing (pay for miles driven, no up-front costs to buy cars)
When it comes to IaaS, no promises are made that individual “cars” (instances/VMs) will never crash or stop working. Just the opposite: public clouds are transparent about the fact that failures can and will (and do) happen, and thus provide tools and APIs to be alerted of and respond to instance failures. You won’t be driving to the data center to physically swap hardware, but you still have to know what the risks of failure are and how to diagnose and respond to them.
Beyond individual instance failures, much larger failures can and do happen, so cloud providers divide their clouds into availability zones where any failure in one availability zone should not affect other services in another availability zone (in theory, but not always in practice). So now with the power of the cloud, you can build next-gen applications distributed across tens, hundreds, or thousands of instances across several availability zones, and have 100% uptime, right?
It turns out getting this right (both for cloud providers and companies using the cloud) is extremely complex, as evidenced by recent outages that took out Netflix, Heroku, and others, despite the fact that these services are built for redundancy across multiple availability zones already.
And what if you’re attempting to run traditional IT workloads that weren’t originally designed for the cloud? It then becomes even more important to be aware of the practical operating constraints of cloud IaaS, to properly educate your customers on the risks and benefits, and prepare in advance to remediate known failure scenarios.
To summarize, in the cloud, expect and plan for downtime of instances, availability zones, and even multi-zone outages.
So what could be worse than downtime? Data loss (or data corruption).
Cloud IaaS providers offer (at least) three types of storage:
- Object storage: This provides infinitely scalable “buckets” that allow arbitrarily sized, uniquely named “objects” to be uploaded, accessed, listed, and deleted. Data is typically replicated across several places. Data is accessed through custom APIs, so traditional filesystems (with full POSIX semantics) cannot run directly on top of object storage.
- Ephemeral block storage: This provides virtual hard disks attached to running instances that are wiped clean each time an instance reboots (or has a failure). You can (and must) run some type of filesystem on top of the block storage.
- Persistent block storage: This is similar to ephemeral block storage, except that the data is stored independently from any instance, so it is preserved until explicitly deleted. Data is also stored redundantly within the same availability zone so that any single hardware failure will not cause data loss. Examples of this kind of service include Amazon EBS, OpenStack Cinder volumes, and Ceph RBD.
Effectively utilizing object storage and ephemeral block storage requires apps that are designed from the ground up to run on the cloud. If you’re running traditional IT workloads in the cloud, you’ll be working with persistent block storage services, so that’s what I’ll focus on here.
What are the considerations and risks of using persistent block storage?
- Data loss is expected: Say what? Yes, that’s right: it’s not if, but when, these volumes will fail, just like physical hard drives. Cloud providers are trying to be forthright about this reality. For example, Amazon publicly gives an expected annual failure rate (AFR) of around 0.5% for an EBS volume. The developer docs for Google Compute Engine warn, “To protect against data loss, always back up your data and have data recovery policies in place.” To put this in perspective: if you have 100 customers running in the cloud, each with 2 EBS volumes (one OS and one data), you can expect about 1 EBS volume to fail each year.
- Silent data corruption is possible: Unless systems are engineered from the ground up to address silent data corruption end-to-end, both previously stored data and new data are at risk of being corrupted. Public clouds are no different. Amazon engineers have stated in forums that silent data corruption with EBS volumes is possible. eFolder is one of only a small handful of storage cloud providers that openly discusses and guards against silent data corruption. Certainly none of the big players are addressing this. (If you know of examples where they are, please send me details!)
- Write caching issues: Another problem is caused by how a cloud provider caches and flushes writes to stable storage. In a serious outage event (e.g., the recent June 29th AWS outage), data writes that were in flight may not be properly flushed to disk. Cloud providers do not fully disclose how they handle commit, cache flush, or “fsync” operations. For example, Amazon’s comments here and here indicate that volume corruption was possible because of the recent outage, and that Amazon was placing any EBS volumes that had in-flight writes at the time of the failure in an “impaired” state until users manually checked (e.g., chkdsk’d) their volumes and manually resumed I/O. Even if the filesystem was OK, this could have potentially catastrophic consequences for critical applications (e.g., databases) that depend on storage properly honoring cache flushes and write-ordering guarantees. It’s the same problem as running a database server on a RAID controller with write caching turned on but without a battery-backed (or capacitor-backed) write cache when the power goes out: you’d have to run application-specific integrity checks to be sure your data was still good. Guaranteeing proper write ordering and flush semantics is difficult and must be dealt with at every layer in the cloud: inside the instance OS and paravirtualized device drivers, in the hypervisor, in the hypervisor “host” OS (if relevant), in the network, in the storage processing nodes, in the storage HBAs, in the enclosure storage controllers, and in the hard drives themselves. Rather than engineer cloud services to fully guarantee data integrity, public cloud services have been engineered for performance and scale, with the stated expectation that data loss and corruption can and will happen.
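To make the “expect failures” math from the first bullet concrete, here is a minimal Python sketch. It uses the article’s ~0.5% AFR figure; the volume counts are illustrative:

```python
# Back-of-the-envelope failure math for a fleet of cloud volumes,
# using the ~0.5% annual failure rate (AFR) the article cites for EBS.

def expected_annual_failures(num_volumes: int, afr: float = 0.005) -> float:
    """Expected number of volume failures per year (linearity of expectation)."""
    return num_volumes * afr

def prob_at_least_one_failure(num_volumes: int, afr: float = 0.005) -> float:
    """Probability that at least one volume fails within a year,
    assuming volume failures are independent."""
    return 1 - (1 - afr) ** num_volumes

# 100 customers, each with 2 EBS volumes (one OS, one data) = 200 volumes
volumes = 100 * 2
print(expected_annual_failures(volumes))              # -> 1.0
print(round(prob_at_least_one_failure(volumes), 2))   # -> 0.63
```

Note that the expectation of one failure per year understates the point: at these numbers, the odds that at least one customer loses a volume in a given year are better than even.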
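The flush semantics discussed above are visible even at the application level. As a sketch of what “properly honoring cache flushes” means (this is the standard POSIX technique, not any cloud provider’s API), an application that needs durability must explicitly force its writes down the stack:

```python
import os

def durable_write(path: str, data: bytes) -> None:
    """Write data and force it toward stable storage before returning.
    flush() alone only empties the userspace buffer into the kernel page
    cache; os.fsync() asks the OS and the storage stack beneath it to
    commit the data to disk."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # userspace buffer -> kernel page cache
        os.fsync(f.fileno())  # kernel page cache -> stable storage (we hope)
    # On POSIX systems, also fsync the containing directory so the new
    # file's directory entry is durable (matters for newly created files).
    dir_fd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Databases do exactly this for their write-ahead logs. The article’s point is the “we hope” in the comment: on cloud block storage, every layer beneath `os.fsync()` must also honor the flush, and providers do not fully disclose whether they do.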
Back to the topic at hand… if data loss and data corruption in public clouds are expected (and many incidents of data loss and corruption have been documented already), what do cloud providers expect customers to do? The larger cloud providers offer volume snapshot capabilities that allow point-in-time snapshots of persistent block storage volumes to be copied into cloud object storage, where data is replicated across geographic regions; the presence of multiple snapshots reduces (but does not eliminate) the risk of data corruption.
Problem solved? If only it were that simple.
Here are some of the key reasons why snapshots are not a suitable replacement for backups:
- Human error: Human error is estimated to be the cause of data loss in 3 out of 4 cases. If an administrator accidentally deletes a cloud volume, how are you going to get the data back? What are the risks of keeping all of your data in one ecosystem versus having secondary copies that are authenticated separately? What are the unknown risks of human error in the organization running the cloud? (read more on hidden risks here [PDF])
- Snapshot automation, management, and monitoring: Often snapshot services are provided via low-level cloud APIs — what tools will be used to automate and monitor the snapshot process? How will you coordinate taking the snapshots with running applications to ensure application data is in a consistent state when the snapshot is taken? If you’ve striped multiple volumes together into one logical volume, is it possible to have the cloud provider take snapshots of all of the relevant volumes atomically (all at once)? If you have hundreds or thousands of clients, how will you be sure that snapshots are working (or not) for all of your customers? Is there a centralized management and monitoring interface?
- Verification of the integrity of snapshots: Snapshots allow you to go backwards in time on the volume, but do not guarantee by themselves that they contain a good copy of your data. How do you know that the snapshots are indeed accessible and the filesystems contained within them are not damaged? How do you know that application data (e.g., SQL) within the filesystem is intact and ready for use?
- Long-term data retention: Taking snapshots is only half the battle — old snapshots need to be pruned automatically according to business requirements. How will you automatically enforce data retention policies? Is a tiered retention policy supported? (e.g., retain hourly snapshots for X days, daily snapshots for Y days, weekly snapshots for Z days, etc.) What functions are provided to efficiently export one or more snapshots? Can retention policies easily be customized on a per-volume basis?
- Frequency of snapshots: In order to meet your customers’ recovery point objectives (RPOs), how often will snapshots need to be taken? Can the snapshot service take snapshots as frequently as every 5 minutes? If so, will it be able to efficiently implement your desired data retention policies?
- Time to restore snapshots: In order to meet your customers’ recovery time objectives (RTOs), what guarantees (or even estimates) does your cloud provider make on the time that it takes to restore a snapshot into a new volume? (Note: I haven’t seen any cloud providers make guarantees here — if you have, please let me know!)
- Replication of snapshots: Some cloud providers will automatically replicate volume snapshots across availability regions to provide additional geographical redundancy. However, what visibility do you have into this replication process and how it relates to your RPOs? If an availability zone goes down, and you have to restore from a replicated snapshot in another region, what guarantees do you have on how far back that replicated snapshot is? Perhaps you’ll get lucky and your last snapshot replicated before the failure occurred, or you might get unlucky and your cloud provider’s tech support will inform you that they discovered (after the fact) that replication was back-logged and your last replicated snapshot is over 1 week (or 1 month!) old… Will you leave it to luck? If not, how will you monitor replication for all of your customers to ensure that you are meeting their required RPOs?
- Restoring individual files: Volume snapshots are effective for restoring entire volumes, but what tools are provided to mount and browse individual files in snapshots? If your customer says they want a file that got deleted 60 days ago, how much labor will it cost you to get the data back? Hopefully it does not involve using low-level cloud APIs to re-populate a new volume from a snapshot, attach it to a new temporary instance, login and mount the volume, find the desired file(s), and attempt to copy it back to the production system. This becomes even more complex when multiple volumes are being combined by an instance into a larger logical volume through software RAID.
- Software bugs: Bugs have the potential to cause data loss at many different layers in the storage stack (filesystem, device driver, firmware, etc.); the cloud is no different and introduces yet another layer. Bugs in a cloud provider’s infrastructure have already publicly caused data loss (e.g., the 2011 incident). How will you mitigate the risk of volume or snapshot loss caused by buggy cloud code?
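One simple, general way to attack the snapshot-verification problem above is to record a checksum manifest of a volume’s files just before the snapshot is taken, then re-run it against a restored copy of the snapshot. This is a sketch of the idea, not any provider’s tooling, and it only verifies file contents; application-level consistency (e.g., SQL) still needs application-specific checks:

```python
import hashlib
import os

def checksum_manifest(root: str) -> dict[str, str]:
    """Walk a directory tree and record the SHA-256 of every file.
    Run this against the live filesystem just before a snapshot, and
    again against the restored/mounted snapshot; any difference means
    the snapshot is not a faithful copy."""
    manifest = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            h = hashlib.sha256()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            manifest[os.path.relpath(path, root)] = h.hexdigest()
    return manifest

def verify(manifest: dict[str, str], restored_root: str) -> list[str]:
    """Return the relative paths that are missing or corrupted
    in the restored copy."""
    restored = checksum_manifest(restored_root)
    return [p for p, digest in manifest.items()
            if restored.get(p) != digest]
```

In practice the expensive part is not the hashing but the plumbing around it: restoring the snapshot into a volume, attaching it to a scratch instance, and mounting it so there is a filesystem to walk.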
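Several of the gaps above (tiered retention, replication monitoring) come down to automation you would have to build yourself on top of low-level cloud APIs. As an illustration of the logic involved, here is a sketch of a tiered retention policy and a replication-lag check; the function and parameter names are hypothetical, and in practice the timestamps would have to come from your own snapshot inventory, since providers expose little visibility into replication lag:

```python
from datetime import datetime, timedelta

def snapshots_to_keep(snapshots, now,
                      hourly_days=2, daily_days=14, weekly_days=90):
    """Tiered retention: keep one snapshot per hour for `hourly_days`,
    one per day for `daily_days`, and one per week for `weekly_days`.
    Returns the set of timestamps to retain; all others may be pruned."""
    buckets = {}  # (tier, bucket key) -> newest snapshot in that bucket
    for ts in snapshots:
        age = now - ts
        if age <= timedelta(days=hourly_days):
            key = ("hourly", ts.strftime("%Y-%m-%d %H"))
        elif age <= timedelta(days=daily_days):
            key = ("daily", ts.strftime("%Y-%m-%d"))
        elif age <= timedelta(days=weekly_days):
            key = ("weekly", ts.strftime("%G-W%V"))  # ISO year-week
        else:
            continue  # older than every tier: always a pruning candidate
        if key not in buckets or ts > buckets[key]:
            buckets[key] = ts
    return set(buckets.values())

def replication_rpo_violations(last_replicated, rpo, now):
    """Flag volumes whose newest cross-region replica is older than the
    recovery point objective. `last_replicated` maps volume IDs to the
    timestamp of the last successfully replicated snapshot (None if no
    replica exists yet)."""
    return {vol: (now - ts if ts is not None else None)
            for vol, ts in last_replicated.items()
            if ts is None or now - ts > rpo}
```

The point of the sketch is that none of this is exotic, but all of it has to be written, scheduled, monitored, and alerted on, per customer and per volume, before snapshots start to resemble a managed backup service.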
An old adage says that “RAID isn’t backup,” and snapshots aren’t either. Cloud snapshots may be suitable as the only backup solution in some special cases (especially for apps built from scratch for the cloud), but they are not suitable for most IaaS customer scenarios. Make sure you have a good answer and a prepared plan when (not if) Murphy’s Law hits your customers in the cloud.
Don’t get us wrong — snapshots are very powerful for cloning data volumes and having another layer of protection on your data. We recommend (and so does Amazon) doing both volume snapshots and volume backups (using cloud-aware backup and replication technology), but if you have to choose, our assertion is that cloud-aware cross-cloud backups will provide much better protection against the real risks to your data, and will also drive down your overall operational costs.
When it comes to evaluating data loss risks, there is more to consider than just technology risks:
- Vendor lockin: Once critical data gets stored with one cloud provider, how easily will you be able to switch cloud providers in the future if business requirements and the competitive landscape change?
- Security incidents: If a large public cloud gets hacked, what is the risk of data loss if all of your data (including volume snapshots) lives under the same technical umbrella?
- Billing disputes: This goes back to human error — what if someone in accounting makes a mistake or a check gets lost in the mail, and they think your account is delinquent, and subsequently delete all of your data stored on their cloud? Sadly, there are stories of this already happening (e.g., one unconfirmed story here [see end of thread]). It’s important to mitigate this risk by having your data live across multiple organizations.
- Service shutdowns: It’s not unheard of for services once offered to suddenly be shut down without much notice (e.g., HP discontinued Upline after acquiring it). As unlikely as it seems now, what would you do if you were given 30 days’ notice to get all of your data out of a discontinued cloud (or even no notice at all)? If you’re relying only on snapshot-style backups, how would you efficiently export all of your snapshots out of the cloud?
The cloud sounds easy and carefree to use, but the devil is in the details: it solves some historically thorny challenges and is full of promise for allowing companies to be more agile and cash-efficient, but it also brings a host of new risk factors and operational complexities.
Give us a call at 847 329 8600 and we will be more than happy to help you answer these questions for your organization!