Over the years, Netflix has mastered the art of keeping its Amazon Web Services infrastructure online, but even it was worried when it learned that AWS would be rebooting a significant number of its physical servers to fix a bug in the Xen hypervisor they run.
As they explained in a blog post Thursday night, Netflix engineers were concerned mainly about their massive Cassandra database cluster. That database is one of the most critical pieces of the infrastructure behind Netflix’s video streaming service, and it was one of the last to be programmed for automatic failover. But the work to make Cassandra resilient paid off during the AWS reboot:
“Out of our 2700+ production Cassandra nodes, 218 were rebooted. 22 Cassandra nodes were on hardware that did not reboot successfully. This led to those Cassandra nodes not coming back online. Our automation detected the failed nodes and replaced them all, with minimal human intervention. Netflix experienced 0 downtime that weekend.”
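To put those figures in perspective, here is a minimal Python sketch of a detect-and-replace pass over a cluster of the size Netflix describes. It is not Netflix's actual tooling: the `Node` class, its `token` field and the `replace_failed` function are hypothetical names invented for illustration, and the cluster is assumed to hold exactly 2,700 nodes where the blog post says "2700+".

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Node:
    node_id: int
    token: int            # hypothetical: the slice of data this node owns
    healthy: bool = True


def replace_failed(nodes, next_id):
    """Swap every unhealthy node for a fresh one that takes over the same token."""
    replaced = 0
    for i, node in enumerate(nodes):
        if not node.healthy:
            nodes[i] = Node(node_id=next(next_id), token=node.token)
            replaced += 1
    return replaced


# Figures from the quote; TOTAL is an assumption standing in for "2700+".
TOTAL, REBOOTED, FAILED = 2700, 218, 22

nodes = [Node(node_id=i, token=i) for i in range(TOTAL)]
for node in nodes[:FAILED]:
    node.healthy = False  # simulate hardware that never came back

replaced = replace_failed(nodes, count(TOTAL))

print(f"rebooted: {REBOOTED / TOTAL:.1%} of the cluster")
print(f"failed to return: {FAILED / REBOOTED:.1%} of rebooted nodes")
print(f"replaced: {replaced}, still unhealthy: {sum(not n.healthy for n in nodes)}")
```

Under those assumptions, roughly 8 percent of the cluster was rebooted and about one in ten of the rebooted nodes needed replacing. The replacement keeps the failed node's token so that the cluster's layout is unchanged, which is one plausible way to avoid reshuffling data during recovery.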
Downtime, scheduled or not, is one of the unfortunate realities of cloud computing and probably one of the areas where cloud providers will seek to distinguish themselves in the coming years. Thus far, the various open source tools Netflix has released are some of the best methods for failure-proofing AWS instances, but I suspect AWS will have to automate some of this for its users in an attempt to keep up with what Google is offering. Third-party software such as the increasingly popular Apache Mesos could also help mitigate downtime by rebalancing the workloads of failed nodes across the rest of a cluster.
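The idea of spreading a failed node's work across the rest of a cluster can be sketched in a few lines of Python. This is only an illustration of the general technique, a greedy least-loaded assignment, and not how Mesos itself schedules work; the `rebalance` function, the node names and the task names are all hypothetical.

```python
import heapq


def rebalance(assignments, failed):
    """Move tasks off failed nodes onto the least-loaded surviving nodes."""
    survivors = {n: list(t) for n, t in assignments.items() if n not in failed}
    if not survivors:
        raise RuntimeError("no healthy nodes left to take over the work")

    # Tasks that lost their host when a node failed.
    orphaned = [t for n in sorted(failed) for t in assignments.get(n, [])]

    # Min-heap keyed on current task count, so the emptiest node is picked first.
    heap = [(len(tasks), node) for node, tasks in survivors.items()]
    heapq.heapify(heap)

    for task in orphaned:
        load, node = heapq.heappop(heap)
        survivors[node].append(task)
        heapq.heappush(heap, (load + 1, node))
    return survivors


cluster = {
    "node-a": ["t1", "t2"],
    "node-b": ["t3"],
    "node-c": ["t4", "t5", "t6"],
}
after = rebalance(cluster, failed={"node-c"})
print(after)
```

Counting tasks is a deliberately crude measure of load. A real scheduler would weigh CPU, memory and other resources, but the principle of handing orphaned work to whichever healthy node has the most room is the same.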