Netflix lost 218 database servers during AWS reboot and stayed online

SUMMARY: While some websites were forced offline by Xen hypervisor updates affecting multiple cloud providers, Netflix once again stayed fully online. Its biggest fear last weekend was for the nearly one-tenth of its Cassandra nodes that had to be rebooted.

Netflix has mastered the art of keeping its Amazon Web Services infrastructure online over the years, but even it was afraid when it learned AWS would be rebooting a significant number of physical servers to fix a bug in the Xen hypervisor they run.

As explained in a blog post Thursday night, Netflix engineers were mainly concerned about their massive Cassandra database cluster. That database is one of the most critical pieces of Netflix’s infrastructure for its video streaming service, and was one of the last to be programmed for automatic failover. But the work to make Cassandra resilient paid off during the AWS reboot:

Out of our 2700+ production Cassandra nodes, 218 were rebooted. 22 Cassandra nodes were on hardware that did not reboot successfully. This led to those Cassandra nodes not coming back online. Our automation detected the failed nodes and replaced them all, with minimal human intervention. Netflix experienced 0 downtime that weekend.
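
The detect-and-replace behavior described in that quote can be pictured with a minimal Python sketch. Nothing here is Netflix's actual tooling: the `Node` class, `replace_failed_nodes`, and `fresh_ids` are hypothetical names, and a real system would probe nodes over the network and launch new instances rather than flip a flag in memory. The sketch only shows the general idea of swapping each failed node for a fresh one that takes over the same position in the cluster.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Node:
    node_id: str
    token: int            # hypothetical stand-in for the data range a node owns
    healthy: bool = True


def fresh_ids(prefix="replacement"):
    """Yield an endless series of IDs for newly launched nodes."""
    for n in count(1):
        yield f"{prefix}-{n}"


def replace_failed_nodes(cluster, new_ids):
    """Return (new_cluster, replaced_ids).

    Healthy nodes are kept as they are. Each unhealthy node is swapped for a
    new node that inherits the same token, so no data range is left unowned.
    """
    result, replaced = [], []
    for node in cluster:
        if node.healthy:
            result.append(node)
        else:
            result.append(Node(next(new_ids), node.token))
            replaced.append(node.node_id)
    return result, replaced


# Toy cluster of 10 nodes in which two failed to come back after a reboot.
cluster = [Node(f"cass-{i}", token=i * 100) for i in range(10)]
cluster[3].healthy = False
cluster[7].healthy = False

cluster, replaced = replace_failed_nodes(cluster, fresh_ids())
print("replaced:", replaced)
print("all healthy:", all(n.healthy for n in cluster))
```

In this toy version the cluster keeps the same size and the same token assignments after the swap, which is the property that lets the rest of the system carry on without manual intervention.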

Downtime, scheduled or not, is one of the unfortunate realities of cloud computing and probably one of the areas where cloud providers will seek to distinguish themselves in the coming years. Thus far, the various open source tools Netflix has released are some of the best methods for failure-proofing AWS instances, but I suspect AWS will have to automate some of this for its users in an attempt to keep up with what Google is offering. Third-party software such as the increasingly popular Apache Mesos could also help mitigate downtime by redistributing the workloads of failed nodes across the rest of a cluster.

From: https://gigaom.com/


Rajesh Paul

WEB-DESIGNING, TECH NEWS, UPDATES
