A Ferrari on the Wide Open Web



Designing and building a highly scalable web site in the cloud is quite a challenge, even though the tools and infrastructure are there to make it possible. Being able to summon up servers out of thin air to meet increasing demand would have been impossible without cloud computing, but that power brings its own set of problems, as the team at Lockerz recently discovered.

If you haven't heard about Lockerz yet, it's about time you did. Started by an early Amazon employee and funded by Amazon and a group of blue-chip investors, Lockerz is a revolution in social media and e-commerce. Their web site was designed from the ground up to take full advantage of the Amazon cloud and all the latest high-scalability innovations, from distributed memcached to a master-slave MySQL architecture running on high-CPU Linux Apache servers. Quite simply, this is a Ferrari on the wide open Web, where crashes and slowdowns happen all the time. But just like a Ferrari, it is possible to go too fast, and the consequences can be as severe as they are on the open road!
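
To make the architecture a bit more concrete, here is a minimal sketch of the kind of cache-aside read that a memcached-plus-MySQL-replica stack typically performs. The post doesn't show Lockerz's actual code, so everything here is illustrative: the Python language choice, the pymemcache and mysql-connector clients, the host names, and the "members" table are all assumptions, not their implementation.

    # Minimal cache-aside read sketch: memcached in front of a MySQL read replica.
    # Host names, the "members" table, and the key layout are illustrative only.
    import json
    from pymemcache.client.base import Client as MemcacheClient
    import mysql.connector

    cache = MemcacheClient(("cache1.internal", 11211))
    replica = mysql.connector.connect(
        host="mysql-replica.internal", user="app", password="secret", database="lockerz")

    CACHE_TTL = 300  # seconds

    def get_member(member_id):
        key = f"member:{member_id}"
        cached = cache.get(key)
        if cached is not None:
            return json.loads(cached)          # cache hit: no database round trip
        cur = replica.cursor(dictionary=True)  # cache miss: read from the replica...
        cur.execute("SELECT id, name, points FROM members WHERE id = %s", (member_id,))
        row = cur.fetchone()
        cur.close()
        if row is not None:
            cache.set(key, json.dumps(row), expire=CACHE_TTL)  # ...and repopulate the cache
        return row

The point of the pattern is simply that the web tier answers most requests from memory and only falls through to the master-slave database when the cache misses, which is what lets a farm of Apache servers absorb thousands of hits per second.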

Actually, designing a highly scalable web site is a lot like designing a race car: to operate at full capacity, there's a whole lot of testing, tweaking and tuning to be done. The web server, the load balancer, the database, the memory cache and the network all come into play, and problems can arise with everything from default configuration settings to race conditions that are extremely difficult to reproduce, much less diagnose and fix. In a two-hour meeting between their development and operations teams, Lockerz was able to accomplish what might easily have taken days, if not weeks.

This particular case involved a memcached race condition in which one thread requested a tag that was still being written by another, resulting in an invalid object exception. As you might expect, the problem only surfaced at higher loads and was first uncovered by an eight-hour burn-in test at 50,000 users. At that level, the servers began returning internal server errors (HTTP 500) on random pages, with no apparent rhyme or reason to them. At an average of 3,500 hits per second, as many as 6% of responses were 500 errors. Even though the error page shows a really cute video of a dog licking your screen, this was clearly unacceptable to a company aiming for 99.9999% availability!
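
The post doesn't include the offending code, but the general shape of such a race is easy to sketch. In the stdlib-only Python demo below (entirely hypothetical, not the Lockerz code or their eventual fix), a writer assembles a cached entry in more than one step, so a reader that lands in between sees an incomplete "invalid" object; the defensive reader validates what it gets and rebuilds from the source of truth instead of throwing a 500.

    # Hypothetical sketch of the race: a partially written cache entry is visible
    # to readers before the writer finishes assembling it.
    import threading, time

    cache = {}                                  # stands in for memcached

    def rebuild_tag(tag_id):
        """Authoritative rebuild (e.g. from the database)."""
        return {"id": tag_id, "items": [1, 2, 3], "complete": True}

    def racy_writer(tag_id):
        # Non-atomic write: the entry is visible before it is fully populated.
        cache[tag_id] = {"id": tag_id}          # partial object hits the cache...
        time.sleep(0.01)                        # ...window where readers can see it
        cache[tag_id] = rebuild_tag(tag_id)     # ...before the complete version lands

    def safe_reader(tag_id):
        entry = cache.get(tag_id)
        if entry is None or not entry.get("complete"):
            # Invalid or missing entry: rebuild instead of blowing up with a 500.
            entry = rebuild_tag(tag_id)
            cache[tag_id] = entry
        return entry

    threading.Thread(target=racy_writer, args=("tag:42",)).start()
    print(safe_reader("tag:42")["complete"])    # always True, even mid-write

Because the window between the partial and the complete write is tiny, the bug only shows up when enough concurrent readers are hammering the cache, which is exactly why it hid until the 50,000-user burn-in.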

To diagnose and fix the problem, each member of the team floated his or her best guess about what might be causing it, and as a group they decided on a test case that would prove or disprove each theory. Using CapCal tests that quickly ramped up to the level where the errors began to occur, the team removed variables and then re-added them one by one to analyze their effect. For example, they reduced the load-balanced web farm of 85 Apache servers to a single Apache instance and determined that the errors did not occur. From there they added another one to the mix, then two, and were finally able to reproduce the problem. The fact that the error occurred more readily on a high-CPU instance with two virtual cores was a very useful hint that helped them zero in on the cause.
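
CapCal is the tool that actually drove the ramps, and its scripts aren't reproduced here. The rough Python stand-in below just illustrates the method: fire an increasing number of concurrent requests at a test endpoint and tally the HTTP 500s, so that each change to the environment (one Apache, then two, a different instance type) can be compared by error rate. The URL and ramp steps are placeholders, not Lockerz's actual setup.

    # Rough stand-in for a ramping error-rate test; not CapCal and not the real farm.
    import urllib.request, urllib.error
    from concurrent.futures import ThreadPoolExecutor

    TEST_URL = "http://test-farm.example.com/home"   # hypothetical endpoint

    def hit(url):
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                return resp.status
        except urllib.error.HTTPError as e:
            return e.code                            # e.g. 500
        except Exception:
            return None                              # connection reset, timeout, etc.

    def ramp(concurrency_steps=(50, 100, 200, 400), requests_per_step=1000):
        for workers in concurrency_steps:
            with ThreadPoolExecutor(max_workers=workers) as pool:
                codes = list(pool.map(hit, [TEST_URL] * requests_per_step))
            errors = sum(1 for c in codes if c == 500)
            print(f"{workers:4d} concurrent: {errors}/{requests_per_step} HTTP 500s")

    if __name__ == "__main__":
        ramp()

Running the same ramp before and after each configuration change is what turns "it fails randomly under load" into a controlled experiment where one variable at a time gets ruled in or out.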

By means of this disciplined and focused approach they were able to diagnose and fix the bug in a couple of hours, so that when the next 50,000-user burn-in test was run they got a single "connection reset" error out of 17.5 million page hits, well within their stated SLA goal!
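
For context (and assuming availability here is measured simply as the fraction of successful page hits, which is my assumption, not a statement from the team), one failure in 17.5 million works out comfortably better than the 99.9999% target:

    # Quick sanity check on the final numbers.
    errors, hits = 1, 17_500_000
    availability = 100 * (1 - errors / hits)
    print(f"{availability:.6f}%")   # 99.999994%, above the 99.9999% goal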
