Walmart's Site Down on Black Friday Again


This appeared on the Marketing Vox site today (along with several others) about the Walmart Black Friday website crash that occurred on the busiest shopping day of the year. Faithful readers of this blog will remember this posting about another Black Friday meltdown at Walmart a few years back. Since a visitor from Walmart showed up on the blog today, I do hope he or she will return because we are offering them a free CapCal Crash Test with up to 200,000 users if they are up to it! Unfortunately I don't know anybody there but if anyone out there does, please pass this along!

But hurry because this offer is only good until December 22, 2012 (and that's only because the world will end that day according to the ancient Mayan prophecies)!

Stanford Uses CapCal with EC2 for Student Portal


Coalition Networks is a consulting firm in the Bay Area that was contracted by Stanford University to calibrate the performance of their Student Housing Portal, a web-based application that all students use to select their residence options and apply for a residence. As might be expected, this can often lead to a usage peak in the hours leading up to the deadline, and Stanford RD&E IT wanted to make sure their servers were adequate for supporting up to a couple thousand simultaneous users.

According to Akin Ajiboye with CNI:

Our test plan called for generating a load of up to 2,000 users, which even at 100 users per computer would require 20 machines. Fortunately, Coalition Networks partnered with Aligned Technology and CapCal to get the job. With CapCal running on the Amazon EC2 cloud we were able to fire up as many servers as we needed in less than a minute. While the test was running we gathered all the relevant network and database statistics and were able to form a complete picture of the application's performance. We recommend CapCal with EC2 as a great way to get excellent results quickly.

As a CapCal partner, Hiroaki Ajari of Aligned Technology had this to say:

We’ve worked with CapCal since the earliest versions so we were very excited to see it become available on the Amazon cloud. Of course it was a great honor and a privilege to work with one of the finest universities in the world, right here at the center of the technology universe. This was our first time to see and use EC2 and experience scalability as required. It's such a luxury to utilize servers on demand, minimizing waste - resources, cost, total footprint. Amazon's EC2 is perfect for the cyclical nature of testing; especially to handle performance testing's environmental needs for generating and distributing load from the test servers.

Our joint efforts with Amazon, Aligned Technology and Coalition Networks provided Stanford a quantitative way to measure risk, triggering mitigation strategies that allow them to maintain first class service to their students and administrators.

CapCal Demo Video Now Available Online


We recently completed an 8 minute Camtasia video showing how CapCal works with Amazon EC2. Please have a look and give us your feedback - we still want to make a few tweaks but it's 90% there!

Interview in Software Test & Performance Magazine!

It was a great honor and a pleasure to be interviewed by Andrew Muns for the October issue of Software Test and Performance magazine. I am planning to be at the STPCon 2009 conference in Cambridge this month, which I highly recommend as one of the best ways to stay ahead of the curve in our industry. I would love to meet up with any of our customers, partners and friends in the Boston area (you know who you are and will be getting a call)!

Hope to see you there, and thanks again to Andrew Muns with ST&P for the honor of being in print in such a classy publication!

A Very Delicate Load Balancing Act


Everyone knows what a load balancer is but exactly what it does and how it goes about doing it are often mysterious. Since every single page request goes through the load balancer, how it is configured and what its capacity is can have everything to do with how well an application performs under load. For example, the maximum number of connections is a configurable parameter that is often very low in its default setting.

A current project for a cruise ship booking engine is a case in point. This particular application is being moved to the cloud for all the right reasons, namely, to have the extra capacity when needed to handle sudden usage spikes without bogging down or crashing. Our initial tests showed that the bottleneck was the database server, which is often the case. So we beefed up the database server quite a bit and saw dramatically better results. But we also saw reams of errors start to appear once we crossed a magic threshold that had to do either with the configuration or the capacity of the load balancer.

In the first case, it was exactly what I referred to earlier - if the load balancer is limited artificially to 100 connections and you try to open 200 or 2000, every request that exceeds the limit will receive an error of some kind, depending on the load balancer. In this case it was a 502 (bad gateway), but I've also seen 404 (page not found) or just plain timeouts while making a connection.
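To make that failure mode concrete, here is a minimal Python sketch. It is a toy model and not the behavior of any particular product: the `ToyLoadBalancer` class, its `max_connections` parameter and the choice of 502 as the rejection code are all assumptions made for illustration.

```python
from collections import Counter


class ToyLoadBalancer:
    """Toy model of a load balancer with a hard cap on open connections."""

    def __init__(self, max_connections):
        self.max_connections = max_connections
        self.open_connections = 0

    def connect(self):
        """Return an HTTP-style status code for one connection attempt."""
        if self.open_connections >= self.max_connections:
            return 502  # assumed rejection code; real load balancers differ
        self.open_connections += 1
        return 200


def simulate(max_connections, concurrent_users):
    """Open one connection per user, all held open, and tally status codes."""
    lb = ToyLoadBalancer(max_connections)
    return Counter(lb.connect() for _ in range(concurrent_users))


if __name__ == "__main__":
    for users in (100, 200, 2000):
        print(users, "users:", dict(simulate(100, users)))
```

With the cap at 100, every attempt past the hundredth is rejected no matter how healthy the servers behind the load balancer are, which is why this kind of limit shows up as a wall of errors rather than as a gradual slowdown.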

Once we solved that, we tried again and hit a limit at about 1,000 users. This time we determined that the load balancer itself was being maxed out so we beefed up its CPU and memory (which is a snap on the cloud) and tried again. Now we got to 2,500 users before we started seeing any errors or delays. Our customer had this to say about the experience in a review on Amazon.com:

With the CapCal CloudBurst delivery system we were able to effectively simulate thousands of instantaneous/continuous users hitting our development environment. The after-test reports provided us with the guidance we needed to streamline our code. I highly recommend CapCal to anyone looking for a fast, affordable performance testing solution. John Hill, President HIL-TEC

Thank you, John!

We'll be publishing some actual results from these tests in future blog posts. But load balancing is such a fundamental part of the scalability and performance picture that it deserves to be studied and analyzed on its own, so that's what we'll be doing!

Blastoff Leaves the Launch Pad!!

If you haven't heard about it yet I'm sure you will soon enough. But Blastoff Networks left the launch pad on Sunday and CapCal has been working closely with them to ensure that they are ready.

Blastoff belongs in the category of "Wow, why didn't I think of that?" They've teamed up with hundreds of retailers to offer discounts to all their members, and to me it makes perfect sense. Unlike companies that push their members to sell for them, Blastoff just asks you to buy the things you would normally buy but for less money. Oh, and you get paid every time one of your friends buys something!

Pay less for what you buy and get paid when your friends buy for less - that's hard to beat! If this takes off the way I think it will, there will be another household name in the online retailing space before long!

And don't forget you heard it here first, on the CapCal Blog!

Barclays Online Banking Takes a Dive (Again)

This just in from my daily Google alert for website crashes - apparently after a new version of the site was launched, along with a marketing push, Barclays has had a number of issues according to John Oates of The Register:

Either the bank has been plagued with problems or a disproportionate number of Reg readers use the bank's online services. In July the service went down after a new version of the site, and associated marketing push, was launched. In June it lost its ATM network and watched its website crash twice. It had similar problems in October of last year too.

Of course, this kind of thing happens all the time; there's no use in singling out Barclays. A marketing push is usually preceded by a new version of the site, which ends up being a double whammy: if the marketing push is successful, it can often overwhelm the servers and thereby nullify the entire campaign.

First of all, they should know that a successful marketing campaign can result in way more activity than normal, and in fact that's the whole point! This is where EC2 comes in, with load balancing and autoscaling that will handle whatever comes their way.

Secondly, even with the full power of Amazon's data centers behind them, if the app itself isn't scalable they will still have problems. This blog beats these two points like a dead horse, and yet they can't be overemphasized - cloud computing + cloud testing = success. Forgo either one and you are asking for trouble!

Finally, this is a BANK, not just a place to buy the cute little coats worn by the Obama girls or to check out the latest swimsuit fashions. When people can't access their money they tend to get very nervous (or at least I do).

So I hope the CIO of Barclays is entering the right phrases into his or her search engine of choice. If so, this page will likely turn up along with an offer for a free CapCal Crash Test of up to 10,000 users on Amazon EC2!

Sounds like a jolly good idea to me!

Your Servers Autoscale But What About The Rest?

It has traditionally been assumed that load testing is done for the sake of fine tuning performance and that's true. But it's also done to fine tune scalability, and that's a whole different ball game - the next CapCal customer or prospect I meet whose site is actually able to handle the kinds of loads they expect will be the first. Not because they lack a first class, multi-tiered infrastructure with all the latest hardware (or even better, a load-balanced, autoscaling cloud deployment), but because there are umpteen million "gotchas" lying in wait to surprise you at the worst possible moment - everything from load balancer settings to database, web server, OS or network settings and application configuration parameters. The list is endless. So instead of showing a performance drop at a certain load, a server will begin spewing out errors indicating that an invisible boundary was crossed somewhere.

Have a look at the 2-minute CapCal test above that attempts to reach 1,000 virtual users but at about 700 begins generating thousands of 503 (out of resources) errors. The green bars that show the ever-increasing bandwidth also drop precipitously when the errors start because only error headers are being returned instead of content.

Is this a site that expects to have more than 700 people online at any given time? Try 7,000 or even 70,000! Could this test be run in the lab using a tool like JMeter? Maybe at 1,000 virtual users, but at higher loads it just isn't practical or feasible because of the number of computers that would be required.

This is an example of a scalability test done against a single static page that doesn't even begin to stress the servers and yet it shows a scalability limit that required all of 2 minutes to uncover. Proof once again that performance rocks but scalability rules!

RightScale - Cloud Computing Delivered (Really)!


On Friday afternoon I saw the future of cloud computing and from what I can tell it is not only available right now but actually works as advertised! A 50 minute online demo of the RightScale platform made a convert out of Yours Truly in more ways than one. For starters, I saw a lot of the work I do every day being done far more easily and quickly in terms of managing and deploying CapCal controllers and agents on the Amazon cloud. That in itself would have been a godsend. But it goes way beyond that - as you will see if you read earlier posts below, our quest for the past month or so has been to move CapCal itself to the cloud and lay our tired, power-hungry servers to rest.

RightScale makes this whole process very easy and intuitive while solving many of the problems I've encountered on EC2 (like, how do I deal with the fact that the IP address changes every time, how do I handle autoscaling, load balancing, etc). They have default configurations for what CapCal needs, which is a basic four server cluster with load balancing and failover redundancy that can scale up automatically on demand. To me it is a dream come true, not only for CapCal but for any company that wants to minimize expenses while maximizing scalability and performance. The way I see it, that would include any company whose website generates revenue.

One thing that really blew my mind was how they can clone the live production environment and use it for a staging or QA environment with a click of the mouse! For someone like me, with my 20-plus years in the automated testing field, to see what is normally hours if not days of work being done in minutes was dumbfounding to say the least.

But the climax of the demo was when I was able to show CapCal running with RightScale towards the end of the GoToMeeting - I recorded a test session with the CapCal client against the RightScale demo site, uploaded it to the CapCal server, built a test plan of 500 users over 3 minutes, and ran it, all in a space of about 10 minutes. We could not only watch the results in real time on the CapCal graph, but on RightScale's Dashboard as well! We saw the very usage spike that would have triggered autoscaling had the server been configured as part of an autoscaling cluster (it wasn't). We did, however, get the "full spectrum view" of performance, from the delays that users were seeing all the way down to the CPU activity on the server.

I urge you to have a look yourself since I can't do it justice in such a small space. But we will be blogging, as promised, on the steps involved in moving CapCal to the Amazon cloud on the RightScale platform. So keep coming back!

CapCal's Debut in the Amazon Solution Catalog!

CapCal made its debut on the Amazon Web Services Solutions Catalog today, the first of many announcements we'll be making about our partnership in the weeks to come! We are now working on a Case Study to be eligible for a spot as Featured Solution, which we hope to have available shortly. We are proud to display the Amazon logo on our website and to be part of the fast-growing community of developers and businesses taking advantage of this amazing, world-changing technology!

Handling the Dirty Work of Dynamic Content

One reason performance and scalability testing can be so challenging these days is that the content to be tested is so dynamic and ever-changing. It's a dirty job, but somebody's gotta do it - for a test to be realistic and effective it can't be done the "old fashioned" way, just by recording a single path through the application. It has to be automatically generated from the content itself, and that's what we're working on right now.

There are two types of dynamic content, server-side and client-side. An example of server-side content is online news, which changes constantly but is stored on the server and thus can be used for creating performance tests. Client-side dynamic content comes from AJAX or web services calls and shows up only in the browser, like a list of tickets available for a sporting event or concert. This particular site "sports" a rich client interface on the front end and uses Amazon EC2 on the back end. When you click on an event, the list of available seats is totally dynamic - it isn't stored anywhere except in the browser at that moment. Like an airline flight, it can disappear in an instant if the event gets sold out.

One of the companies we are working with is dealing with server-side dynamic content (online news) and another is dealing with the client-side content described above (online tickets). In the first case, what we envision is fairly straightforward - a small script will query the server database and generate a comma delimited text file to fill in the parameters of "template sessions" on the CapCal server, then call another script that automatically uploads it and kicks off the test. This will generate thousands of unique test cases that can be run by thousands of virtual users against a staging site before moving new content or application changes into production. Not only will the most current and dynamic content be tested, it will be tested in enough combinations and at high enough load levels to flush out any bugs or bottlenecks that might be lurking undetected.
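The first half of that pipeline might look something like the Python sketch below. Everything specific in it is an assumption made for illustration: the `articles` table, its columns and the in-memory SQLite database stand in for whatever the real news server uses, and the step that uploads the file and kicks off the test is left out because it depends on the CapCal server.

```python
import csv
import io
import sqlite3


def write_template_parameters(conn, out_file):
    """Write one comma-delimited row per article; return the row count."""
    rows = conn.execute(
        "SELECT id, section, headline FROM articles ORDER BY id"
    ).fetchall()
    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerow(["article_id", "section", "headline"])  # header row
    writer.writerows(rows)
    return len(rows)


def build_demo_database():
    """Create a throwaway in-memory database standing in for the news server."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE articles "
        "(id INTEGER PRIMARY KEY, section TEXT, headline TEXT)"
    )
    conn.executemany(
        "INSERT INTO articles (id, section, headline) VALUES (?, ?, ?)",
        [(1, "world", "Markets rally"), (2, "sports", "Home team wins, again")],
    )
    return conn


if __name__ == "__main__":
    buffer = io.StringIO()
    count = write_template_parameters(build_demo_database(), buffer)
    print(count, "rows written")
    print(buffer.getvalue())
```

Using the `csv` module instead of joining strings by hand matters here, because headlines routinely contain commas and quotes that would otherwise corrupt the parameter file.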

In the case of the online ticket site and the countless other similar applications, we are at work on a solution that will be quick and easy for users while not so trivial for us developers (as it should be, of course)! Basically it will be an extension to the browser add-on in the CapCal client that allows fields or links in the browser to be identified and grouped so they can be dynamically accessed at runtime by each virtual user and manipulated by a new command called "Click-On". I can't divulge the magical powers of this new command until they fully exist and have been confirmed by users. But suffice it to say that it will easily solve the seat selection problem and that's what matters most right now!

I'm especially excited about this because it is true automation, with no manual intervention at all other than writing the extraction query or identifying the dynamic fields in the browser and recording the session template. Add to that the enormous economies of scale made possible by the cloud and all of a sudden something that has never been done effectively (if at all, as in the case of dynamic client content) can be done very thoroughly and cost effectively.

Simple, powerful and clean - like soap and AJAX!

How Much Did Michael Jackson Rock the Web?


This story in today's New York Times gave some meat to the rumor that Michael Jackson's death wreaked considerable havoc on the Wild Wild Web. One of the metrics that stood out for me was that Yahoo experienced 800,000 clicks in the first 10 minutes, breaking their previous record!

This is copied from the Yahoo corporate blog:

The passing of the King of Pop set multiple records across Yahoo!. On our front page, the story “Michael Jackson rushed to hospital” was the highest clicking story in our history. It generated a whopping 800,000 clicks within 10 minutes and news of his death saw 560,000 clicks in 10 minutes. Also, the news area on our front page experienced five times the amount of traffic it normally receives.

Yahoo! News set an all-time record in unique visitors with 16.4 million people, surpassing our previous record of 15.1 million visitors on election day. Four million people visited the site between 3-4pm Pacific time, setting an hourly record. We also recorded 175 million page views yesterday, our fourth highest after Inauguration Day, the day after the Inauguration, and Hurricane Ike.

Michael Jackson's hospital visit generated more traffic than the election of the first black president in history! (Or should that be HIStory?) Wow.

I'm thinking of offering an "800,000 clicks in 10 minutes" test and naming it after the King of Pop, what do you think? There would be the Michael Jackson test, the Inauguration Day test and the Hurricane Ike test. Take your pick.

Why Self-Service and On-Demand are Critical

Last week I got a call from the senior developer at an online retailer with an urgent request - they were seeing errors in production at load levels higher than they could generate in the lab and needed to do some very large load tests immediately. I instantly put myself in his shoes and imagined how I would feel if it were my server and the word "immediately" took on a very visceral meaning!

Within an hour and a half we had negotiated the business terms and he had been trained on CapCal and was building and running his own tests! To make sure each user and session was unique, he created a few thousand test accounts and generated a CapCal replacement data set with user id, password and so forth. That took a couple of hours, and then to make a short story even shorter a series of 5,000 user load tests using 10 Amazon instances uncovered a bug in their PHP framework that was causing the errors. Problem solved! (After hours of re-coding and unit testing, that is.)
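For anyone wondering what generating such a data set involves, here is a rough Python sketch. The column names, the `loadtest` prefix and the plain CSV layout are assumptions for illustration only, not the actual CapCal replacement data format, and the accounts would still have to be created on the site under test.

```python
import csv
import io
import secrets


def generate_accounts(count, prefix="loadtest"):
    """Yield (user_id, password) pairs with unique ids and random passwords."""
    for n in range(1, count + 1):
        # 9 random bytes encode to a 12-character URL-safe password
        yield f"{prefix}{n:05d}", secrets.token_urlsafe(9)


def write_replacement_data(count, out_file):
    """Write a header plus one row per test account to a file-like object."""
    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerow(["user_id", "password"])
    for row in generate_accounts(count):
        writer.writerow(row)


if __name__ == "__main__":
    buffer = io.StringIO()
    write_replacement_data(3, buffer)
    print(buffer.getvalue())
```

Zero-padded sequential ids keep every virtual user distinct and make it easy to clean the test accounts out of the database afterwards.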

They'll be spending a week or two doing more testing now that they've solved their immediate problem. They want to reach 150,000 simultaneous users while maintaining sub-second response times with no errors, which is pretty awesome in itself. What's even more awesome is how quickly and easily that kind of load is generated by CapCal using the Amazon cloud!

The chart above shows how quickly it ramps from 1 to 5,000 in the Users column with excellent response times and zero errors. But at about 3 minutes into the test, server errors begin showing up and rapidly climb. Response times hardly degrade at all, which once again reminds us that scalability and performance go hand in hand but scalability is king!

I hope to get permission to publish their name and get a quote for the blog but I have to say that this particular developer is extremely bright and caught on to CapCal faster than practically anyone I've ever seen. But even mere mortals are capable of getting the same results in the same amount of time since there's no programming involved.

The buzz around here now is about moving CapCal to EC2 and running load tests on itself to test the autoscaling and load balancing features. Stay tuned!

How to Crash an Airline Site in 3 Minutes or Less


This is the second time on this blog where the words "airline" and "crash" are used together but fortunately we are referring to websites, not airplanes (see Pet Airways Crashes on Opening Day). The above graph shows the results of a CapCal Crash Test on a new promotional site for one of the major airlines (sorry, I can't use their name without legal approval).

There are only two pages in this test, the first one going to the main page and the next one submitting a registration form. The objective of a crash test is not actually to crash the site but to find the point at which performance starts taking a nosedive (there's that airplane thing again). "Good" response time is less than a second, where "bad" starts at about 2 seconds and goes up. (These aren't subjective measures as much as the result of having run thousands of tests and seeing the same pattern over and over - usually after it reaches 2 seconds it goes up rapidly from there).

This one was able to retain good response time until it reached around 400 users. At that point you can see the "hockey stick" effect in the red bars on the right that reach 6 seconds at about 1,340 users. At 2,000 users, most people would think the site was offline and go somewhere else. For all intents and purposes it has crashed, even though it is still handling requests. Since the average response time is 6 seconds, some of the pages are probably taking more than 10. Life is much too short for slow web apps.

However, have a look at the Site Errors column to the left of the green bars - at a little over 1,000 users the server starts generating errors, and those can be even worse than slow response times. Don't you just hate it when you get a "500 server error" after you've patiently filled in all the fields in a form? Wouldn't you hate it even more if you were the airline or the company that put together the web site? That's why scalability and performance go hand in hand but scalability is king!
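The way I read a crash test can be boiled down to a few lines of Python. This is only a sketch: the one-second and two-second thresholds come from the rule of thumb described earlier, while the `analyze_ramp` function and the sample numbers are invented for illustration (they loosely follow the shape of the test described here and are not the actual results).

```python
GOOD_SECONDS = 1.0  # "good" response time is under a second
BAD_SECONDS = 2.0   # "bad" starts at about two seconds


def analyze_ramp(samples):
    """Summarize a ramp test.

    samples: (users, avg_response_seconds, errors) tuples, sorted by users.
    Returns the last load with good response time before things turn bad,
    the first load with bad response time, and the first load with errors.
    """
    last_good = first_bad = first_errors = None
    for users, seconds, errors in samples:
        if first_bad is None and seconds >= BAD_SECONDS:
            first_bad = users
        if first_bad is None and seconds < GOOD_SECONDS:
            last_good = users
        if first_errors is None and errors > 0:
            first_errors = users
    return {
        "last_good_users": last_good,
        "first_bad_users": first_bad,
        "first_error_users": first_errors,
    }


# Invented sample data, for illustration only.
EXAMPLE_SAMPLES = [
    (100, 0.4, 0),
    (400, 0.9, 0),
    (700, 1.6, 0),
    (1050, 3.1, 12),
    (1340, 6.0, 250),
]

if __name__ == "__main__":
    print(analyze_ramp(EXAMPLE_SAMPLES))
```

Keeping the error threshold separate from the response-time thresholds is the point: a site can hit its scalability limit (errors) before, after or at the same load where performance falls apart.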

The site we tested was a staging site that mirrors the production site so this is what they can realistically expect to see in production. This test used five small Amazon instances so the cost was negligible. Not too shabby for something that could prevent a million dollar disaster if the promotional campaign itself flopped because of a glitch in the web site!

Integrated Performance Testing in Action


Automated functional testing, for someone who is watching it for the first time, is quite amazing. What would take a person hours to do by hand flashes before you on the screen within minutes. It's that phenomenal time savings and increased test coverage that make it worthwhile, but the "while" part is the rub - if you end up spending all your time maintaining the test scripts as the application changes, the time savings start to degrade to the point of ever diminishing returns.

So obviously making the tests easier to create and maintain is a huge leap forward, but if you have to turn around and create performance tests using a DIFFERENT tool you are faced with the same dilemma if not worse. Take the HP Mercury suite as an example - QuickTest Professional uses a scripting engine based on VBScript, while LoadRunner uses the C language as its basis. In other words, maintaining a load testing script can very often require more technical expertise than it took to write the application itself!

One of the major differences between functional and performance testing is that functional testing can be done manually while it is extremely difficult and unwieldy to do performance testing that way. That doesn't mean that people haven't tried it, and the thought of coordinating dozens of people to perform a load test would be funny if it weren't so absurd - in fact it was only last year that I heard of a case where this was being done at a major corporation for lack of a suitable load testing tool.

But even with a load testing tool at your disposal, if you've already invested in functional test automation doesn't it seem bizarre that you would have to turn around and duplicate all that cost and effort just to do performance testing? Sure, there are some major differences between functional and performance testing but that doesn't mean that an entirely different tool with its own language is necessary. It's evolved that way for historical reasons that have a lot more to do with the search for more revenue by test tool vendors than the search for more efficient testing methodologies.

If you could use a subset of your functional tests for performance testing not only would it save time and money but you could be many times more productive and efficient in your testing cycle. That's what we call "integrated performance testing" and it involves capturing an automated functional test while it is running and then immediately using it to create and execute a performance test. Not only have you saved the time and money involved in writing and maintaining a separate set of tests but you've also made sure that both kinds of testing get done at the same time.

Here you can see a screen recording of an automated Worksoft Certify test being run with CapCal Integrated Performance Testing. The demo shows batch command execution because that is how these tests will be integrated into the nightly build cycle. However, we've also tested with HP QuickTest Pro, Compuware QA Tester and AutomatedQA TestComplete. For a proof of concept in your own environment with whatever tool you are using, just write to info@capcal.com and someone will get right back to you!

What the Cloud Really Looks Like

This article today in the New York Times online was an interesting read. Called "Data Center Overload", it gives a good overview of data centers in general and talks about cloud computing too. Besides the huge amount of space and power these things consume, the subject of latency (or the time it takes to get a packet from Point A to Point B) was also covered.

As it turns out, nowhere is latency more critical than in financial trading, where seconds can make the difference between a million dollar trade won or lost. One of the data points that intrigued me the most was this one:

Latency concerns are not limited to Wall Street; it is estimated that a 100-millisecond delay reduces Amazon’s sales by 1 percent.

One percent of total sales for 100 milliseconds?

Let that sink in for a moment.

That just boggles the mind. Do you suppose that performance testing is practiced as it should be at Amazon? I certainly hope so!

Another interesting tidbit was that Amazon Web Services now uses more bandwidth than Amazon's massive retailing operations. If that doesn't point to where the cloud is heading, I don't know what does!

On Finding a Home in the Cloud


Load testing is obviously a perfect fit for the Amazon Cloud - need 1,000 computers for a big test? Just spin them up, use them, and tear them down. To deliver that same capability, CapCal once had over 10,000 people around the world running our agent for $0.30 per hour of testing. A small Linux instance on Amazon EC2 costs $0.12 per hour, less than half the cost! We didn't pay for bandwidth, though, and that was the main attraction. But those agents weren't as secure and reliable as the Amazon agents, and that makes all the difference!

Load testing is also called stress testing, performance testing, volume testing, scalability testing and so forth, all of which refer more to the objective than the means. But what about other kinds of testing, like functional and regression testing? Do they have a Home in the Cloud as well?

Of course they do! Actually anything but unit testing and manual testing can be done on the cloud, and should be, as the post below this one argues. Running functional and regression tests takes time, and time can be squeezed out of the cycle if you can throw more computers at it. Generally people do unit testing, then functional and regression testing, and then performance testing. But the ugly truth is that performance testing either a) doesn't get done at all or b) gets done at a point where it's too late to make changes.

Why is that? We'll be answering that question and showing some examples of a different approach in the next few posts so y'all come back now!

The Test Lab is Dead - Long Live the Test Lab!


A test lab, as I define it, is a room with nothing but computers which may or may not include desk space for testers. I’ve seen more than I can count and you have too if you have been doing test automation for any length of time.

Earlier test labs had a person at each computer, which was even more wasteful – not only are you taking up space and filling it with expensive hardware, but you’ve got people doing mind-numbing, repetitive work. Nowadays test labs are used only for automation if they are used correctly – manual and acceptance testing can be done at the user desktop instead of the lab.

Cloud computing is on track to replace ALL those machines and free up ALL the space they occupy, a forward leap just as huge as the leap of automation; first we replaced people, now we are replacing machines. It makes perfect sense if you think about it, and yet it is just starting to dawn on people that we can do this.

The cost and space savings combined with the dramatic increase in productivity and throughput are astounding. So if nothing else, software testing is a “killer app” or “poster child” for cloud computing; nowhere else are the benefits so obvious and immediate.

But the tools haven’t caught up yet for the most part, with a couple notable exceptions. CapCal is one, of course, and we've seen SOASTA CloudTest. I'm sure other companies have projects in the works and announcements will probably be forthcoming. But what kinds of things can you do right now that take advantage of the cloud to increase your testing coverage and reduce your costs at the same time?

For one thing, virtualization already provides a huge reduction in the amount of hardware needed for a test lab and that’s good. So maybe instead of an entire room, just a corner or a wall might be used. Then it’s simply a matter of doing the math to see if the cloud is cheaper and usually it turns out to be. In general terms, if a computer is not being operated by a human being it should also not take up space.

The cloud will bring up some interesting licensing challenges once people realize they can install software on one instance and duplicate that instance as many times as they want. Windows itself is covered, of course, since Amazon pays for the license. Everyone else has to trust their users to abide by the same restrictions that apply to physical copies, even though these copies are virtual.

In this series, we’re going to explore some real life cases of testing as a service in the cloud. So keep on coming back!

Amazon Meet SAP - SAP Meet Amazon!

Only the biggest companies with the biggest IT budgets use SAP and they use it to run their businesses, both internal and customer facing. SAP supports a wide range of client protocols, from HTML to Flex to Adobe Interactive Forms, but the core UI is still the venerable SAPGUI for Windows, now at version 7.10. Until now, CapCal has only supported HTML and Flex, but going into beta this week is CapCal for SAPGUI, which of course means CapCal for SAPGUI on the Amazon Cloud (another first in case anyone is counting).

Many SAP customers have thousands of employees and hundreds of thousands if not millions of customers themselves. So let's say a company has 30,000 employees and all of them have access to one or more SAP applications (HR, sales orders, invoices, inventory, etc). How do they know when they implement a new patch or upgrade that their performance or scalability won't be impacted?

They don't, really, except for running relatively small load tests in the lab with a tool like HP LoadRunner. Ask HP what it would cost to do a single 30,000 user load test with the SAP protocol and let me know what they say, will you?

If the answer is less than $100,000 I will be very surprised, plus the number of computers that the LoadRunner agent would have to be installed on would be somewhere between 200 and 300. So just the prep work and setup could take a week or two if not more. As far as I know, nothing above 2,000 simultaneous users has ever been done - someone correct me if I'm wrong! That in itself could take dozens of computers at least.

With CapCal for SAPGUI, just record a session in the GUI and press the Run button - whether the test runs on the local machine, on other machines in the network, or on the Amazon Cloud it will begin executing within seconds. How many agents are needed is a function of the total number of virtual users required divided by the maximum number of SAP sessions allowed per machine, usually about 100. So 30,000 users would require 300 agents, which will spawn automatically for the duration of the test and terminate at the end.
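For anyone who wants to check that arithmetic, here is a minimal sketch in Python. The function name and the default of 100 sessions per machine are assumptions for illustration (the default comes from the "usually about 100" figure above); this is not CapCal code.

```python
import math


def agents_needed(virtual_users: int, sessions_per_agent: int = 100) -> int:
    """Return how many load agents are needed, rounding up so that
    a partially filled agent still counts as a whole machine."""
    if virtual_users < 0 or sessions_per_agent <= 0:
        raise ValueError("users must be >= 0 and sessions per agent must be > 0")
    return math.ceil(virtual_users / sessions_per_agent)


print(agents_needed(30_000))  # 300
print(agents_needed(2_000))   # 20
```

Rounding up matters at the edges: 30,001 users would need a 301st agent even though it would carry only one session.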

The CapCal Blog now has a sister edition on the SAP Developer Network and we'll be posting the links here on the main blog as well. So please check back for the latest on CapCal Cloudburst for SAP!

SOASTA Declines CapCal Challenge!


If you followed the comments on the Geeks Are Sexy blog post, you saw that someone named Brad from SOASTA, "the Cloud Testing Authority", left a rather odd comment about how a million users happens all the time, just not all at once. It's the "all at once" part that's the point - give me enough time and I'll do a trillion user load test! Just ask the Election Commission in India, which got 8.6 billion hits over 8 hours and it flatlined the server yesterday!

Anyway, I challenged Brad to a contest to see who could reach the million user mark the fastest in a way that could be fairly judged by a third party and he declined, calling it a "scientific experiment". Where I come from it's called "competition", and in this case the competition is really not either one of us, it's the 800 pound gorilla in the load testing space, HP, who is not going to watch idly as their LoadRunner franchise evaporates. SOASTA may think of themselves as the 800 pound gorilla in the cloud testing space, but I say the cloud testing space is too new for a gorilla. A baby gorilla or two, maybe, but no 800 pound gorillas.

CapCal has the decided advantage of having been around longer and having designed for a much more challenging "cloud", the Internet, and a lot of those earlier decisions and trade-offs are starting to pay off in a big way on the Amazon cloud. I don't know if SOASTA uses a Windows agent or a Linux agent but if it's Windows we could easily have a 10 to 1 advantage! Windows takes too much network and disk time to copy, takes too long to boot up, requires too much memory, and is more restrictive than Linux. And even if your agent runs on Linux, how you scale and control your agents has a lot to do with how many virtual users you can handle on a given instance, for example.

It may seem either obvious or counter-intuitive, but an application that tests the performance and scalability of other apps can itself have performance and scalability issues, and these kinds of things only come out at really high load levels. This is where CapCal has had lots of attention focused and a lot of work done - the scalability of the controller-agent design has proven itself over and over again to be virtually limitless. Even so, the very first million user test - a record that will stand until someone else beats it - will be a challenge, and it will be reported here "live", so come back soon!

Election Commission Website in India Crashes


India Times reported today that the Election Commission website had crashed because they "had not anticipated so many hits". Apparently they were only prepared for 2,800 hits per second and got 3 lakh hits per second resulting in "complete breakdown of the server," according to this article.

I didn't know what a lakh was, did you? It is 100,000, which makes 3 lakh hits per second some 8.64 billion hits if it kept up for eight hours, so go figure it for yourself. Deputy election commissioner Alok Shukla said "I will not leave office till it is fixed", but the servers "could not be fixed till late into the night". Sound familiar to anyone?
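If you do want to figure it for yourself, here is a quick sketch in Python. A lakh is 100,000 in the Indian numbering system; the eight-hour window is my own assumption about how long the load lasted, not a figure from the article, and the variable names are made up for illustration.

```python
LAKH = 100_000  # one lakh in the Indian numbering system

hits_per_second = 3 * LAKH          # reported load: 3 lakh hits per second
expected_hits_per_second = 2_800    # what the site was prepared for
duration_seconds = 8 * 60 * 60      # assumed eight-hour window

total_hits = hits_per_second * duration_seconds
overload_factor = hits_per_second / expected_hits_per_second

print(f"{total_hits:,} hits")                      # 8,640,000,000 hits
print(f"about {round(overload_factor)}x planned")  # about 107x planned
```

So the total lands in the billions, and the load was roughly a hundred times what they had planned for.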

How funny that I was just here rattling on about government websites a few minutes ago and now I find this little nugget nestled in my inbox!

So governments and militaries, like celebrities and businesses, are also exposed to the "common crash". Just how common it is can be easily seen if you check back here every day or two.

Life, Liberty and the Pursuit of Mega Loads


If I ran a million user load test on somebody's site without permission I could be sued, so I would never do that. But what about public web sites, the ones my tax dollars are paying for? Shouldn't I have the right to know about their scalability and responsiveness?

It seems to me that I do, but that doesn't mean I'll be testing the IRS site anytime soon. Is there some federal department that isn't so important and wouldn't object to being the subject of some scientific research? Or does "unimportant federal department" sound redundant to you?

Speaking of scientific research, don't you think the military should have the most solid web infrastructure on the planet? Not that they are prone to sudden usage spikes caused by young men wanting to sign up for the war or anything. In fact I've never seen www.army.gov so I'll have to pause this blog post for now and come back after I do.

Well, for one thing it's a .mil extension instead of a .gov but you can even sign up online from the looks of it! That would be a great trick to play on your college roommate, wouldn't it? "Guess what? You're shipping out to Afghanistan at 04:00 hours tomorrow, soldier"!

Seriously, the Army has not adopted much Web 2.0 yet so they are lagging way behind. "Tweets from the Trenches" haven't appeared, nor has "GreasePaint Facebook". I didn't dare read the blog, though, so if you do you are welcome to post about it here.

So what is the verdict? Do I have the right, if not the duty, as a tax paying American citizen to ensure that the US Army web site is up to government standards? And what, pray tell, are those?

Idle HANDS are the Devil's Tools, not Servers!


If "peak provisioning" refers to the capacity necessary to handle peak work loads, then I suppose "idle provisioning" is what is necessary to handle a very small load. We already know what peak performance is, and that is what a single user sees on a single server with a 1GB connection. The only thing standing in the way of delivering that to every user is capacity, and as long as capacity is elastic like it is on EC2 we are pretty much there, right?

Well the devil is in the details, of course, and there are tons of details to take into consideration. Web servers scale far easier than database servers, for example. And Amazon has yet to announce support for load balancing although there are third party alternatives. Finally, autoscaling servers is one thing, bandwidth is another - slowdowns can occur at either the server or the network layer, so scalability must take both into account in order to deliver "sub second response time regardless of load".

Fortunately, this is one of the many benefits of EC2 cloud computing, knowing that your app will never be starved for bandwidth and you won't have to shell out thousands per month for a dedicated OC3 or anything. You only pay for the extra bandwidth when you need it, just like the extra servers.

Haven't heard back yet from Todd at High Scalability but I think some of the techniques being discussed on his blog are perfect for cloud computing, with things like "database shards" that allow databases to be far more scalable. This will be a critical factor in the success of cloud computing, just as it is for bare metal.

Bare Metal? Bear Medal? See how easy it is to confuse these things? But cloud computing is not confusing at all - extra capacity on demand. What's so hard about that?

Sub-second Responses Regardless of Load

Back to this again, which I believe sums up the benefit of cloud computing to end users pretty well - sub second response times regardless of load. It was only by going virtual that we could achieve this, of course, since before that every computer required its own box. Virtualization was the big breakthrough that cloud computing so adroitly exploits. To me, cloud computing has every bit of the same excitement as the Web itself - possibilities have suddenly multiplied while the number of physical resources required will actually shrink.

If you ask me it means that computers are finally going to have to WORK for a living and I say it's about time! The average server has a lot more horsepower than the mainframes of old and is almost always doing nothing but listening to port 80 and waiting for something to do. Not so with cloud computing - those computers are doing real work, since they wouldn't have even been summoned into being had there not been a demand for extra capacity.

I am still confused about what Microsoft and Google are doing with their cloud computing initiatives. It seems to be all about application hosting somehow, and we've had that in some form or other for eons. What we haven't had is the ability to clone another instance on demand to share in the workload, which is a much more powerful concept.

The net result of this will be that a lot fewer computers requiring a lot less space and power will be doing far more work than all the ones that presently exist are doing now. All the "workhorse" apps that many people rely on will run on the cloud instead of company-owned data centers or mainframes. Anything that is tailored to the individual will run on the desktop, notebook, smart phone, etc.

Makes perfect sense to me!

Grandpa Says the Web Used to CRASH!

I'm a grandpa, and that means I'll be able to tell my grandchildren and hopefully their children about the days when websites would just crash when too many people went to them! Can you believe it? Like having your cell phone cut off without warning. How strange that must have been!

It is pretty absurd if you think about it - the definition of success is system failure - but that's just the way it is with a non-elastic infrastructure. To borrow from cooking, it would be like a pressure cooker without a release valve - think of the experiments that must have gone awry in the early pressure cooker labs!

Well, the Web is exactly like that - without an "overflow valve" that draws on extra capacity when needed, how can we EVER get past the problem of website crashes? And forget about "peak provisioning", or having enough capacity on hand to meet peak loads - no matter what you imagine your "peak" to be, there is always more.

I refer somewhere in this blog, at least a couple times, to "eating your own dog food" and how CapCal is being used not only to test its own capacity but to autoscale on demand. To test this we will have to run some pretty big loads in different patterns to make sure we maintain sub second response times regardless of the load factor.

Sub second response times regardless of the load factor. It almost sounds like a Perpetual Motion Machine if you think about it - a way to minimize your hardware resources while maximizing your capacity and performance. To be crash-proof as well as small and nimble is the best of all worlds!

Google Traffic Routing Issue Causes Global Outage!

This just in off the AP wire about a massive outage of Google's email and search engine services today due to "the routing of too much traffic through computers in Asia". Those computers crashed, of course, because they got overloaded. But if you read the article you understand that it affected the entire Web!

That's right, so many sites have links to their analytics and ad services that any glitch or slowdown at Google will cause a slowdown on every site that uses them! Long page delays were rampant across the Internet during that time. Even I noticed it when my gmail icon kept turning into a question mark as if scratching its head!

This "spidering effect" shows how truly interconnected the Web is, and how a few overloaded computers in Asia somewhere can waste the time of hundreds of millions of people for an hour and cut untold numbers off from their email! Check out this article for what is described as a "fail whale" (that huge bulge you see in the graph above).

Let's do the math - a hundred million is a small estimate of the total online population during that hour and let's say that each one wasted five minutes because of those poor computers in Asia. That's 500 million minutes, or 8,333,333 hours, which is 347,222 days or 951 years!
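Here is the same back-of-the-envelope calculation as a few lines of Python, in case you want to plug in your own guesses. The hundred million users and five minutes each are the estimates from the paragraph above, not measured figures, and the variable names are just for illustration.

```python
users_online = 100_000_000   # conservative guess at who was online that hour
minutes_wasted_each = 5      # assumed time lost per person

total_minutes = users_online * minutes_wasted_each
total_hours = total_minutes / 60
total_days = total_hours / 24
total_years = total_days / 365

print(f"{total_minutes:,} minutes")   # 500,000,000 minutes
print(f"{int(total_hours):,} hours")  # 8,333,333 hours
print(f"{int(total_days):,} days")    # 347,222 days
print(f"{int(total_years):,} years")  # 951 years
```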

When overloaded computers can waste almost a thousand years of human time over the course of an hour long "fail whale" it's probably time to devise a way to keep that from happening! Or is it just me?

Hmmmm...the term is kinda funny if you think about it. "Artificially induced fail whales" anyone?

Geeks Really ARE Sexy and Here is the Proof!

Any time you have a story about being the first to do something, the first one to publish the story becomes part of the story, and in this case the award goes to Geeks are Sexy. Dubbed as "tech, science, news and social issues for geeks", we think it's the ideal place for the First Million User load test to appear!

The first ten readers who leave a response will receive a free "find your limits" test, details of which will be posted here on the CapCal blog. Within an hour there were 8 so we will be starting on those right away. With the permission of the participants we will post the test results here also.

As we mentioned before, on the Web size really DOES matter when you are talking about what it takes to handle large numbers of visitors. Maybe we can rename the promotion "Show Us Your Size", or would that be overly suggestive?

Anyway, congrats to the folks at Geeks are Sexy for being the first to publish the story and thank you for a well written article! Check back here for updates as they become available.

Announcing the First Million User Load Test!


There are lots of reasons nobody has ever done a load test with a million simultaneous virtual users and computer capacity is just one of them. But with capacity on the Amazon Elastic Cloud as virtually limitless as it is, CapCal is aiming to be the first ever to accomplish this feat. There is, however, the slight problem of finding a company or individual with a sufficiently scalable website willing to be the first to actually prove it!

Do we have any takers?

Here are the details:

You will record a "happy path" through your site using the CapCal client, at most 10 or 12 pages, and upload that to the CapCal server. You will schedule a time you want the test to begin and it will ramp from 1 to a million users over 4 minutes and stay at one million for one minute. That means that several million page hits will be registered over a 5 minute period - enough to flat line most sites! The number of instances of the CapCal EC2 load agent necessary to pull this off will be calculated based on bandwidth but will be many hundreds for sure.
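To get a feel for where "several million page hits" comes from, here is a rough sketch of that load profile in Python. The linear shape of the ramp and the pace of one page per virtual user per minute are my assumptions for illustration; the real pacing depends on the recorded session, and none of these names come from CapCal.

```python
PEAK_USERS = 1_000_000
RAMP_SECONDS = 4 * 60   # ramp from 1 user to the peak over 4 minutes
HOLD_SECONDS = 1 * 60   # then hold at the peak for 1 minute


def users_at(t: int) -> int:
    """Target number of virtual users t seconds into the test,
    assuming a straight-line ramp followed by a flat hold."""
    if t < 0 or t > RAMP_SECONDS + HOLD_SECONDS:
        return 0
    if t >= RAMP_SECONDS:
        return PEAK_USERS
    return round(1 + (PEAK_USERS - 1) * t / RAMP_SECONDS)


def estimated_page_hits(seconds_per_page: float) -> float:
    """Sum the active users second by second, then divide by how long
    each user spends per page (an assumed pacing, not a measured one)."""
    user_seconds = sum(users_at(t) for t in range(RAMP_SECONDS + HOLD_SECONDS))
    return user_seconds / seconds_per_page


print(users_at(0), users_at(RAMP_SECONDS))   # 1 1000000
print(f"{estimated_page_hits(60):,.0f} page hits at one page per minute")
```

At one page per user per minute the total comes out just under 3 million hits; faster pacing pushes it higher in direct proportion.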

We'll want to make a couple trial runs beforehand, say at 50,000 and 100,000. For all I know it could take a few contenders before we find a site that can handle the load! Lots of sites handle more than a million visitors a day but not all at once!

One side benefit of doing this test is that we can see at what point either the servers start to falter or the pipe starts to clog. My guess is that for most sites that number will be 100,000 or less but that's just a guess - we won't know without actually trying! We'll do it in the wee hours, of course, so as not to disturb anything.

If you are interested, please leave a comment here or send an email to info@capcal.com.

When you think heavy loads, think Diapers.com!



As I glanced at the stats of the blog a while ago my jaw nearly hit the floor when I noticed a visit from 1800Diapers.com! I came up with the idea of comparing the two types of "load testing" a long time ago as a joke, and now it's right here staring me in the face! It's like when the priest, the nun and the rabbi walked into a bar and the bartender said "What is this, some kind of a joke"?

There really IS a God, of course, and I believe God wants people to laugh - how can you not laugh at a record-breaking load test on diapers.com? Like the lady whose baby was emitting a foul odor and when someone suggested a change she replied "But it says on the box that it's good for up to 10 pounds"!

Ba da bing!

But seriously, folks, an online diaper store has got to be online and responsive at all times if it wants to sell diapers. Why, in such uncertain times as these we can just as easily expect a rush on diapers as on any other essential item. So knowing that they have the capacity, or if they don't they can easily add it on Amazon EC2, is essential to their business, unlike the supermarkets who stock them on shelves. So it's no laughing matter - even though it is patently hilarious - to say that CapCal and Amazon provide the "other" kind of load testing that is equally important in the diaper industry!

If one company could ask another for its hand in marriage (publicity-wise, that is), I would be on my knee at the diapers.com company headquarters in New York, genuflecting to the V.P. of Marketing and extolling the many hilarious possibilities of a joint campaign! Just getting into the Guinness Book of Records would be a load of fun by itself!

Let's hear some possible ad campaign slogans, shall we? Anyone dare to be first or must I always be the one to get the ball rolling?

A Breakthrough in Diapers? Say It Ain't So!


Diapers are pretty much a recession-proof industry, thank God, or I might not feel much like going to a restaurant or shopping mall these days. What the disposable ones do to the environment worries me, however, so I expect our friends at Diapers.com to start offering a biodegradable diaper Really Soon Now. Maybe they already have one but not enough people are buying it, who knows. As long as it doesn't disintegrate too soon is the main thing, but the sooner the better.

Being pretty much a newbie in the whole online diaper space, I imagine diapers.com must be the 800 pound gorilla, don't you? Surely there are others, maybe Diapers R Us, but when my baby needs somewhere to poop I will not have time to remember that wacky spelling. No, diapers.com has mindshare, pal, from top to bottom. They wipe out their competition in what is sometimes a messy fight for market dominance as you can imagine.

At the big Diaper Convention in Vegas and Orlando, at least as I imagine it to be, there's rarely any innovations or breakthroughs to announce, and in fact "breakthrough" is a bad word to a diaper designer. So I would think a million user load test on diapers.com would be something to shoot the you-know-what about at the hotel bar after the show closes.

"It's a fact, Jack - moving diapers through a brick and mortar channel is a load of crap! Online is the ONLY way to go!"

But don't take our word for it - go to www.diapers.com for the real poop.

Yesterday's News - KFC is Load Tested By Oprah!

I'm late with this one but that's only because I didn't realize there was a website crash along with the hordes of people camping out in stores demanding their free food! Since you had to download the coupon to print it out you can bet their site's bandwidth got as clogged as the pipes in a diaper testing lab!

Who am I to question the wisdom of having the Queen Herself give away free chicken and then be totally unprepared for the response, both at the website and the counter? (Actually, if you look at all the publicity it has stirred up maybe it was pretty smart after all)!

On the cyber side of things, I wonder if KFC might be willing to take a CapCal Crash Test in order to restore at least some of the pride they lost in this debacle?

Hmmmmmm..

New KFC Coupon - Not for Corporate Chickens!

KFC president Roger Eaton posted a web video apologizing for the ruckus that was caused by the Oprah free dinner stampede but I refuse to accept his apology until he redeems this coupon for a free CapCal Crash Test! Once they discover what their real limits are, they can go about expanding them on demand with an EC2 cloudbursting strategy. If the fast food industry is really fast, they'll beat the crowds who are stampeding in this direction. If not, they will be trampled time and time again by the effects of their own success.

Please note that this is not the same thing as the record-breaking million user load test we are hoping to do on diapers.com. Rather it is the first of many coupons that will be issued to companies who prove they need it by making headlines due to a website fiasco.

We thought about exchanging coupons with them but I'm on a slow food diet so we opted not to. On the theory that there are only 6 degrees of separation between me and Roger Eaton, if you happen to know somebody who knows somebody who knows him, would you please let him know about this incredible offer? I don't want to resort to posting my own web video but if it comes to that I will. I see this as a matter of their fiduciary duty to the KFC shareholders and not only that, Colonel Sanders would have wanted it this way!

A Free Coupon for the Queen of Everything!

Rather than do the investigative reporting that would have been necessary to discover where the crash actually happened, we decided to do the right thing and issue the same coupon to Oprah as we did to Roger Eaton at KFC! It doesn't hurt that she's probably the richest woman in the world and hobnobs with the First Couple, so wherever her legions of fans go a-clickin' you can expect her website to really take a-lickin' (and I don't mean finger-lickin' good either).

One of the most interesting articles I've read recently is by Nati Shalom from GigaSpaces talking about autoscaling to avoid peak load provisioning. Peak load provisioning is what Oprah would have to do to handle the heaviest possible load on her website but how on earth will she know what that is? Could it be a million, like the soon-to-be CapCal World Record, or could it be higher?

It could be much higher, that's what, and that's why the kind of autoscaling Mr. Shalom is talking about holds the key to the success of the "elastic cloud" model. As I mentioned, I am going to be doing a number of scientific experiments in this area and posting them here on the CapCal Blog. Specifically I want CapCal itself to autoscale so I don't have to own and operate an entire server farm to keep it going! And don't get me started on all the funny things that come to mind when I hear the term "server farm" or I will exceed the allotted space for this posting.

But come back soon, y'all hear?

How Crash Tests Make the Web a Safer Place

If I had the time and energy of a real investigative reporter I could tell you how many crash tests are performed by automakers each year but it has to be a big number. Computer simulations might do just as well for designers but I'm sure the insurance companies and customers demand more.

The CapCal Crash Test is being offered to those whose websites have already crashed as a way to repeat the worst case scenario in a controlled fashion so that internal limits can be defined to trigger extra capacity when needed. While absolutely no harm is done to the website being crash tested, nor are any users inconvenienced or animals tortured, it will nonetheless discover server and bandwidth limits wherever they exist.

A corollary to the statement "if you can't measure something you can't manage it" is "if you don't know your limits you can never ensure against reaching them". What it takes to find the limit of each site or application will vary from a couple dozen to tens of millions of virtual users, as in the case of a Yahoo or an eBay. Whatever that number is becomes the peak capacity from which all kinds of other useful values can be derived.
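One example of a value you could derive from that number is the point at which to call for extra capacity. The sketch below is purely illustrative: the 70 percent trigger and the function names are assumptions of mine, not anything built into CapCal or EC2.

```python
def scale_out_threshold(measured_limit_users: int, trigger_percent: int = 70) -> int:
    """User count at which extra capacity should be requested, expressed as a
    percentage of the limit found by the crash test (integer math, rounds down)."""
    if not 0 < trigger_percent <= 100:
        raise ValueError("trigger_percent must be between 1 and 100")
    return measured_limit_users * trigger_percent // 100


def needs_extra_capacity(current_users: int, measured_limit_users: int,
                         trigger_percent: int = 70) -> bool:
    """True once the current load reaches the trigger point, which sits
    well before the measured limit itself."""
    return current_users >= scale_out_threshold(measured_limit_users, trigger_percent)


# Suppose a crash test found the site falters at 100,000 users.
print(scale_out_threshold(100_000))           # 70000
print(needs_extra_capacity(50_000, 100_000))  # False
print(needs_extra_capacity(85_000, 100_000))  # True
```

Triggering below the measured limit leaves time for new instances to start before users feel the slowdown; where to set that margin is a judgment call for each site.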

In case there aren't enough technical buzzwords floating around out there, I chose the name "crash-proofing" a while back because the object of a crash test is to make something crash-proof, to make it able to withstand a crash without buckling or caving in like the car above. So in addition to the crash test there has to be the equivalent of airbags for users, and I suppose that would be the Amazon Cloud - airbags as fluffy as a cloud!

Or am I taking this analogy too far? If so I apologize, but I think it serves to make the somewhat obvious point that crash tests are necessary to make something crash-proof. So when I refer to crash-proofing as a methodology, I'm talking about both things - the crash testing and the cloud bursting working together.

How to Tell When a Load Test Fails

It's important to stress the fact (ahem) that being subjected to a load test and actually passing it are two different things! In this case, the donkey did not pass the load test as you can see.

Here I show what a website crash looks like in graphic form, but not nearly as graphic as this photo! This is a good visual for what most websites would do if they received a million visitors within a 5 minute period. So we aren't looking for companies that want to make an ass of themselves, we're looking for the likes of Mr. John Evans, who can balance a bloody car on his head (he's British, don't forget)!

Sure, sites like Google, Amazon, Yahoo and eBay eat million user loads for breakfast, many times that number in fact. But if you aren't one of these companies it behooves you to see what would happen to your ass in the event of a million user load! You don't want it to be all exposed and dangling in the air like that do you?

I can't show an image of a failed load test in a diaper lab because Google doesn't have one, plus it would be tasteless and offensive. But it's still funny and you can't tell me it isn't - even my mother says it is, so there!

Early Attempts at Breaking the Load Barrier


Unlike Britain's John Evans, who holds the Guinness World Record for Head Balancing, breaking the web site load barrier is not a death-defying feat! In fact, CapCal helped NASA get its website ready for live streaming of the shuttle launch in 2001 by testing up to 10,000 users from a globally distributed network like seti@home. Before cloud computing came along, this was really the only way to deliver load tests of virtually unlimited size on demand. I call it cloud computing 1.0 but don't know if anyone else does. Ask me if I care!

As far as our friend John goes, making a living as a World Class Head Balancer has got to be a major headache, don't you think? Do you think it's easy for him to get health and life insurance? Are girls impressed by his card with this photo on it? How many would-be head balancers never made it this far for having assumed the shape of a pancake by trying?

No, we'll settle for the World Record Load Test and leave the Head Balancing to Mr. Evans. However, we are in awe of all the other record holders and will be greatly honored to be listed among them!

How to Fake a Website Crash and Make News!


I mentioned earlier how a website crash can be seen as a positive thing since it indicates an overwhelming interest in something. Well, today I discovered how such a thing can be manipulated as a way to influence public opinion and to me it is downright sneaky if true!

I'm sure you are tired of hearing about the senate race in Minnesota that seems never to end, even after "final" court rulings are handed down! Apparently this came from the website of the Republican contender Norm Coleman:

ST PAUL – Information recently added to the Coleman for Senate website, whereby people can find out which Minnesota voters the Franken campaign is trying to disenfranchise, has resulted in the website being inundated by tens of thousands of hits today – temporarily crashing the website.

This story does a bit of cyber sleuthing to make a case for the report being an outright lie but I didn't even have to look that deeply to know it probably was. For one thing, websites that do nothing but serve up static HTML pages can usually handle lots of users because everything is cached in memory and there is nothing to do but shove it out the pipe. The pipe could get clogged if there's lots of graphics or multimedia and that's a lot more likely to happen than a server overload. Server overloads usually happen when databases and dynamic pages are involved, like with e-commerce sites.

With the crash of the KFC site (or was it Oprah's site? It's not clear in the article) it was almost certainly a bandwidth issue because the coupons had to be downloaded. Let's just say that the entire State of Minnesota represents a fraction of the Oprah fan base and assume that the reported crash was just that - reported but not real.

With cloud computing and the kind of testing that I am preaching there will someday be one less dirty trick in the politician's bag. If they want to post the contents of a weblog that shows the real number of hits, that's fine. But to claim a crash that didn't actually happen is adding fuel to a fire that doesn't really need any more (although I am grateful for the Daily Blog Story).

High Scalability - How High is REALLY High?


If you check out the links on the right you'll see one for High Scalability - Building bigger, faster, more reliable websites, which I follow closely for a number of reasons. It's easily the most highly concentrated source of information on the issue of website scalability there is, but before you can really appreciate that you should understand what is meant by the word scalability. Wikipedia defines it as a system or network's ability to either handle growing amounts of work in a graceful manner, or to be readily enlarged in order to do so. While the term "readily enlarged" is unfortunate as it triggers every spam filter in existence, think instead of a rubber band that can stretch without breaking and you have the picture.

High Scalability is a treasure trove of information about what is "under the hood" at some of the most popular and well known websites in the world and what makes them capable of achieving such amazing feats of high scalability. For example, I found it extremely interesting that "PlentyOfFish", a free dating site with millions of users, ran on only a single server managed by one guy until recently. And reading the history of the eBay site gave me a completely different picture than the one I had in my head - it evolved over quite some time and took a number of strange turns along the way. Not at all what I imagined, but it works and that's what counts.

On the subject of being able to handle growing amounts of work in a graceful manner, what better proof would there be than a million users showing up in a 4 minute period of time? First of all, it would blow right past whatever built in limits are normally defined in the operating system, web server and database - one of the many useful things about High Scalability is how it uncovers those things and shows you how to fix them before they bite you in the arsenal. Windows is especially notorious for setting internal limits at arbitrarily low values and waiting for you to discover them the hard way (like when all of your visitors are treated to an ugly "500 server error" message, for example - definitely not graceful).

What High Scalability consistently reminds us is that you can have all the capacity in the world but not be scalable if you don't pay attention to all the relevant details. That's why I think of scalability testing as a cross between performance testing and functional testing. You should never deploy a new server without subjecting it to a scalability test designed to uncover whatever artificial limits may be lurking underneath and to discover what real and practical limits there are on throughput and so forth.

I've asked Todd Hoff at High Scalability to help me find a suitable candidate for the first one million user load test but I'm sure he won't find anything as good as diapers.com. Nonetheless, being first means being the first of many and I hope to be doing these around the clock before long. In fact I'm at work on an icon that companies can display on their web sites indicating that they have passed a CapCal Crash Test of whatever number of users they desire. Of course, with Amazon EC2 there to allow capacity to be r e a d i l y e n l a r g e d, that number could be very, very big indeed - once we've done a million users we'll be aiming for 10 million and so on.

CapCal Architecture and Background


Todd at High Scalability asked some really good questions about the CapCal architecture so I thought I would conserve space in his blog by answering his questions here.

CapCal Host
Windows Server 2003, C#, ASP.NET, SQL Server 2005. By default a hosted (shared) resource, dedicated CapCal Hosts can also be spawned on Amazon EC2. The shared (hosted) version has an account for each customer, which in turn has its own user accounts, tests and test results invisible to all others. In the dedicated (cloud) version, the customer has complete control of the server.

CapCal Controller
Small Java app optimized for Linux. Controllers are spawned by the host for the lifetime of a test. Any number of controllers can be managed by a given host. Controllers, like agents, can be deployed anywhere, inside or outside the firewall, on "bare metal" or the Cloud.

CapCal Agent
Small Java app optimized for Linux. Agents can be deployed on customer hardware, CapCal hardware, the CapCal distributed network or on the Amazon EC2 Cloud with a 20 to 30 second bootup time. Each agent can spawn any number of virtual users, up to whatever bandwidth limit there is in terms of download speed. Tests have been run with up to 100,000 virtual users although higher numbers are possible. Each virtual user maintains session state via cookies, 302 redirects and whatever mechanism the app uses.

CapCal Capture
A small Windows client that captures HTTP and HTTPS traffic to be used in creating test sessions. Test sessions are combined into test plans, which include all the parameters of test execution, like ramp time, hold time, virtual users, etc. Test data can be used to parameterize everything from logins to product codes, account numbers and so forth. Think time is built in but can also be controlled on a page by page basis. Tests can be created manually or captured via execution of a functional test tool like the ones from HP, Compuware, Worksoft, etc.
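To make the test plan parameters concrete, here is a minimal sketch of how ramp time, hold time and virtual users might fit together. It is only an illustration: the names LoadPlan and active_users are hypothetical, the linear ramp is an assumption, and none of this is taken from the actual CapCal code.

```python
from dataclasses import dataclass


@dataclass
class LoadPlan:
    """Hypothetical test plan; field names are illustrative, not CapCal's."""
    virtual_users: int          # peak number of simulated users
    ramp_seconds: float         # time taken to climb from 0 to the peak
    hold_seconds: float         # time spent at the peak before stopping
    think_seconds: float = 0.0  # pause each user takes between pages


def active_users(plan: LoadPlan, t: float) -> int:
    """Users that should be running t seconds into the test.

    Assumes a linear ramp followed by a flat hold period.
    """
    if t < 0 or t > plan.ramp_seconds + plan.hold_seconds:
        return 0  # test has not started yet, or is already over
    if plan.ramp_seconds <= 0 or t >= plan.ramp_seconds:
        return plan.virtual_users  # in the hold period
    return int(plan.virtual_users * t / plan.ramp_seconds)


if __name__ == "__main__":
    plan = LoadPlan(virtual_users=1000, ramp_seconds=60, hold_seconds=120)
    for t in (0, 30, 60, 180):
        print(t, active_users(plan, t))
```

A ramp like this is what lets you see the point where delay starts to climb, rather than hitting the server with the full load all at once.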

How It Works
From an icon in the Windows system tray, the user starts and stops the capture of browser sessions to be used in the test. Running a test is a matter of filling in a small form and submitting it. The test is uploaded to the host, scheduled and run as soon as a controller and agent pool is available, which is immediate unless scheduled to run later. The test can be watched in real time and canceled in the event of errors, excessive delays, etc. The page hit detail in the test results shows everything that happened, which can be extremely useful in tracking down bottlenecks and scalability traps.

CapCal History
Capacity Calibration was conceived in March of 2000 for cloud computing 1.0, otherwise known as distributed or grid computing. The CapCal agent has been downloaded and installed on over 15,000 computers all around the world and the network has been up and running 24 X 7 since August of 2001. In November 2001, CapCal ran load tests for NASA (see above) to help it prepare for its first live video streaming of the space shuttle launch. The largest test ever run was 10,000 virtual users, even though the network was capable of much larger capacity. The CapCal distributed network is now being phased out in the transition to Amazon EC2, aka cloud computing 2.0, although it can always be reactivated if needed.

The Scene of a Crash on the Wild, Wild Web

If you ever wondered what a Web site looks like when it crashes, first look at the donkey photo and then have a look at the chart above. This one pretty much flat lined at 365 users - notice how the red bars (delay) get bigger and bigger and the green bars (bandwidth) get smaller and smaller. The Web Speedometer would be pegged all the way to the right at the end of 5 minutes. (By the way, this chart shows the sign of a server overload, not bandwidth congestion - if that had been the case the green bars would be maxed out while the red bars got bigger and bigger).
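The rule of thumb in that last parenthesis can be written down as a small sketch. It assumes you have per-interval samples of delay and bandwidth from a test plus a known link capacity; the function name diagnose and the 90% "maxed out" threshold are assumptions made for illustration.

```python
def diagnose(delays, bandwidths, link_capacity, maxed_fraction=0.9):
    """Classify a load test from per-interval delay and bandwidth samples.

    delays         -- average delay per interval (seconds), oldest first
    bandwidths     -- throughput per interval, same units as link_capacity
    link_capacity  -- the most the connection can carry
    maxed_fraction -- assumed cutoff for calling the link "maxed out"
    """
    if len(delays) < 2 or len(delays) != len(bandwidths):
        raise ValueError("need at least two matching samples")

    if delays[-1] <= delays[0]:
        return "healthy"  # delay is not growing, nothing to diagnose

    if bandwidths[-1] >= maxed_fraction * link_capacity:
        # Delay grows while the pipe is full: the link is the bottleneck.
        return "bandwidth congestion"

    if bandwidths[-1] < max(bandwidths):
        # Delay grows while throughput falls off its peak: the server is choking.
        return "server overload"

    return "inconclusive"


if __name__ == "__main__":
    print(diagnose([0.1, 0.5, 2.0, 8.0], [100, 300, 200, 50], 1000))
    print(diagnose([0.1, 0.5, 2.0, 8.0], [100, 600, 950, 1000], 1000))
```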

This was a Web site of a friend of mine so I had permission to bring it to its knees. It recovered quickly, but had I continued for much longer it would have required a system reboot. You might even think of this as a controlled denial of service attack and if so you would be at least partially correct. But if this particular Web site expected more than 300 users they would be spending all their time gazing at the hourglass!

There are lots of good stories about famous web site crashes and I will entertain you with as many as possible here in this blog. One of the most famous of all was when Microsoft first launched their Web site in 1995 - it had been awaited for so long by so many that it immediately crashed upon startup! After doubling or tripling the number of servers it came up again and crashed in two days! I can't remember how many times it took them to get it right but you can only imagine the size of their data centers nowadays.

A service that CapCal is now offering is called CloudBurst, and it's basically the ability to do what I did to my friend's web site to find out exactly the point where capacity and performance are no longer sufficient. That could be millions of users in some cases and nobody has ever done a million user load test, or even half a million. I would love it if someone could prove me wrong but I know how many computers it would take and I doubt anyone has that many machines sitting around doing nothing - not physical machines anyway!

I'm hoping (with Amazon's help) to break the 1 million user barrier someday and with CapCal's ultralight agent that can be done with just a few hundred instances. I'll be posting on a regular basis about CapCal CloudBurst and its race for a spot in the Guinness Book of World Records so please check back frequently!

Elastic Server - a Factory for the Cloud!

To appreciate the power and flexibility of Elastic Server you have to have built at least one server from scratch in your life. If you've built as many as I have it makes this fully automated server assembly factory even more amazing - building a brand new, customized virtual server and deploying it to the cloud can be done literally in minutes. There's a package for everything, and what there isn't you can easily add yourself. This is exactly what cloud computing needs in order to really take off in a big way (and the signs are everywhere that this is happening).

CapCal uses Elastic Server to assemble our ultra light weight load agent, which was built to spec by Elastic Server from a Debian Linux with the Java Runtime and nothing else. This little beauty boots up on Amazon in 20 to 30 seconds as opposed to 3 to 5 minutes for Windows Server 2003! And the amount and responsiveness of the support I've received (as a total Linux idiot) says a whole lot in itself.

With CapCal's new CloudBurst service, dozens or even hundreds of these can be spawned in a flash, and with each of them handling up to 1,000 virtual users or more I think you can see how hundreds of thousands of simultaneous sessions becomes a reality - something that was never possible before cloud computing!
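The arithmetic behind that claim is simple enough to sketch. The function name agents_needed is hypothetical; the per-agent figures are just the ones quoted in these posts (100 users per computer in the Stanford test, 1,000 or more per lightweight agent), and the 3,000 figure is an assumption chosen only to show what "a few hundred instances" would require.

```python
def agents_needed(total_users: int, users_per_agent: int) -> int:
    """Smallest number of agents that can carry total_users (rounds up)."""
    if total_users < 0 or users_per_agent <= 0:
        raise ValueError("users must be >= 0 and users_per_agent must be > 0")
    return (total_users + users_per_agent - 1) // users_per_agent


if __name__ == "__main__":
    print(agents_needed(2_000, 100))        # the Stanford test plan
    print(agents_needed(1_000_000, 1_000))  # a million users at 1,000 per agent
    print(agents_needed(1_000_000, 3_000))  # assumed heavier per-agent load
```

At 1,000 users per agent a million-user test needs 1,000 instances, so getting there with only a few hundred depends on each agent handling several thousand users.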

Next I am going to blather on about the word "elastic" and why it is such a great metaphor. But make sure to go here and check out Elastic Server - it deserves a place in your Favorites list like it has in mine!

Why Elastic is So Fantastic


When Amazon chose the name "Elastic Compute Cloud" it was a stroke of genius because elastic is something that expands and contracts at will, which is the ultimate breakthrough of cloud computing. Theoretically (since nobody has done this as far as I know) it should be possible not only to insure against a crash but to guarantee top performance at all times - if one server starts to drag, just spin up another one and add it to the load balancer - problem solved! Extra capacity can be delivered on demand when needed and released when not.

I'd like to throw down my gauntlet here (and if I had one I would) about how NO static or non-elastic web site is immune from crashes and slow downs, a claim I can easily prove with any site. To me it seems obvious that only through the elasticity the cloud offers can we have a consistent, high performance Internet, not like the one we have today, which is prone to overloads and wide variability.

In the noble tradition of eating one's own dog food I have devised a way to scale the CapCal server itself, to "elasticize" it on EC2, so that it will grow and shrink on demand based on moment by moment performance measurements (i.e., the CapCal Web Speedometer). I'll be reporting my experience in undertaking this adventure so check back for an update!

CapCal Goes on a Phishing Expedition


Last night I got an email from "AOL Member Services" urging me to "revalidate my account information" or my account would be suspended. I don't have an AOL account and even if I did I wouldn't be so stupid as to comply. Unfortunately there are enough naive people in the world (who may in fact be very intelligent) that will fall for it to make it worthwhile.

But since it was ASKING me to fill in a form with all my personal data (including SS#, credit card details, etc.), I decided to comply a thousandfold!

That's right, I used CapCal to capture the POST of the form data and turned that into a single page session. Then I got busy running it in user volumes ranging from 10 to 1,000 and learned a lot in the process!

Here are the results of the last test I ran - 32 users for 20 minutes. You can see that the AOL phishermen (who hail from Denmark according to the .dk domain) received 4,051 responses during this time! With all the testing done earlier I'm sure the number exceeded 20,000 so I was glad to see the next morning the URL was down. One down, oh so many to go!


Welcome to my Koolaid Stand!


It was actually Amazon Web Services evangelist Jeff Barr who coined the term "cloudbursting" to describe the ability for the Amazon Cloud to deliver capacity on demand. But I discovered this after coming up with the name CapCal CloudBurst for our tsunami-style load testing service, so please don't call me a copycat!

Had the CEO of J Crew been drinking the EC2 koolaid like I am they might have made millions on the Obama girls' choice of coatware instead of having their website take a dirt nap like that. I'm sure they made a lot as it is, but you know they lost out on untold numbers of "impulse buyers" who forgot all about it by the end of the day.

I'm rather tipsy on the EC2 koolaid myself and it makes me wonder what they are spiking it with. To me it seems like the only way to keep things like this from happening because when you come right down to it you can NEVER know in advance how your app will respond to huge traffic without testing it, and you can never know how big of a number "huge" is until you exceed your greatest expectations!

Both CloudBurst-style testing and EC2-style cloudbursting are necessary elements of any long term solution to the spectacular website crashes we see every day (and all the ones we don't see). And it's not really the crashes that are the major problem; it's that the precious time of countless people is being wasted every day by uncalibrated, untested and totally insufficient web capacity. There, I said it!

It seems to me that a Web Speedometer that tells you when extra capacity is needed could be a perfect trigger point for when cloudbursting and its opposite (cloudshrinking?) should occur. And CloudBurst is a Poster Child application of EC2 the way I see it.
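Here is a rough sketch of what such a trigger might look like, assuming the speedometer reading is an average page delay in seconds. The function name scaling_decision and the two thresholds are made up for illustration; real values would come from testing the app in question.

```python
def scaling_decision(avg_delay, servers, high=2.0, low=0.5, min_servers=1):
    """Return the server count to run next, given the current delay reading.

    avg_delay   -- current average page delay in seconds (assumed metric)
    servers     -- how many servers are behind the load balancer now
    high, low   -- assumed thresholds; the gap between them keeps the pool
                   from flapping up and down on small changes
    min_servers -- never shrink below this many
    """
    if avg_delay > high:
        return servers + 1  # cloudburst: add capacity
    if avg_delay < low and servers > min_servers:
        return servers - 1  # "cloudshrink": release capacity
    return servers          # in the comfortable band, leave it alone


if __name__ == "__main__":
    print(scaling_decision(3.0, 2))  # too slow, so grow
    print(scaling_decision(0.1, 2))  # plenty of headroom, so shrink
    print(scaling_decision(1.0, 4))  # in between, so hold
```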

Hey, it tastes good to me, wanna try some?

Pet Airways Crashes on Opening Day!


I have a Google alert for "website crash", which means I get lots of news stories about other kinds of crashes that I have to sift through. Normally the words "airline" and "crash" only combine in the most tragic of stories, but today I was treated to a rare and delightful exception!

That's right, the website of Pet Airways, which caters strictly to the travel needs of canines and felines, crashed after going live because of all the Marmadukes and Garfields in the world wanting to book a flight! I thought an airline catering strictly to pets was a joke until I found out otherwise, and there must be considerable demand for the website to go down on the very first day! But it makes sense if you think about it - for many people, their pets ARE their children, and who wants their kids riding in cargo class?

I suspect that news about website crashes tends to be viewed in a positive light by many people because it implies a kind of overwhelming interest in something, like the pale pink and royal blue coats worn by the Obama girls at the Inauguration. But I can assure you they did not receive one million simultaneous visitors like the record-setting test CapCal is aiming to do. The embarrassing truth is that many web apps aren't equipped to handle several thousand, much less a million users all at once. But they could easily have lost a million dollars in tickets that went unpurchased, and that is tragic indeed.

Perhaps Pet Airways might be interested in seeing their name in the Guinness Book of World Records as the first company to break the million user load barrier. I admit that it's not the comic motherlode (ahem) of Diapers.com but it certainly has potential, doggone it!

Meanwhile I am hoping that the visitor from 1800diapers.com will come back and revisit my open challenge to boost their image - can they handle the really heavy loads or not? And if not, why have they not signed up with Amazon Elastic Cloud Services?

These are the questions that I need answered.

Kaiser Chiefs Website Crash Fiasco!

I confess my musical ignorance for not knowing who the Kaiser Chiefs are but the crash of a website selling tickets to their show had an especially nasty twist to it. This happened yesterday, so the news doesn't get much fresher than this, folks!

You can read the article in its entirety here, but the gist of it is that when they brought the site back up it had been restored to a point prior to when the tickets were sold! That means many people bought tickets to a sold-out show, only to discover later that their tickets were worthless!

In this case there was no money actually lost (the show was sold out), but the embarrassment and PR fiasco made up for it. Were it something besides a concert there could have been lawsuits to deal with!

Since I have now signed up for Google Alerts on the topic of website crashes I'm sure I'll be busy as a beaver with these posts, so check back often!

Liza Minnelli Crashes Sydney Opera House Website!

This happened Thursday, the day before yesterday, when fans eager to buy tickets for her show descended on the Sydney Opera House website in droves and brought it to its knees.

The Daily Telegraph has this to say about it:

WE'RE sure it has absolutely nothing to do with Sydney being the gay capital of the nation but interest in pre-sales for Liza Minnelli's Sydney concert tour crashed the Sydney Opera House's website yesterday.

The site went down at midday, with interest in the two Sydney shows - on October 16 and 17 - so great it took tech staff hours to get the site up and running again.

Minnelli, the daughter of legendary singer and movie star Judy Garland, will be accompanied by her 12-piece orchestra when she tours.

I'm not so sure the gay community had nothing to do with it nor am I sure this gratuitous reference had any place in the article. However, this pattern of websites crashing when tickets go on sale seems to be evident all over (see the Jonas Brothers incident from the week before).

Obviously the Sydney Opera House website could have used a CapCal CloudBurst test to tell them at what level it would begin to strain, at which point the Amazon EC2 cloud could come to the rescue by providing the extra capacity on demand (i.e., "cloudbursting").

We're going to be trying this with a large e-commerce site in Sao Paulo, Brazil and I'll be faithfully reporting the results. So please stay tuned!

The Perils of Driving Too Fast


Yes that is exactly who you think it is, the future Richest Man in the World, Richie Rich++, the One and Only Mr. Bill Gates when he was booked into the Albuquerque City Jail on December 13, 1977! Was he caught robbing an old lady on the street? No, he was busted for speeding of all things!

I was living in Albuquerque at the time and might have seen him burning rubber on Central Ave for all I know. In 1979 I went to work for MITS, at the very same place where he and Paul Allen worked, but he and Paul had moved to Washington to start Microsoft by then. What did I do for MITS on the old Altair 8800, you ask? Why, I wrote diagnostic test code in 8080 machine language, that's what! Once a tester always a tester I suppose.

But I digress, back to the topic of driving too fast. If you drive your Web app too fast you will end up exactly like you will if you drive your Ferrari too fast - you'll either crash or run out of gas. From our point of view the app has become very, very slow, but from the server's point of view it is going as fast as it possibly can!

And what do you think it would take to bring a high end server to that point, or even an entire cluster or farm of high end servers? It could take lots of computers, that's what. How many depends on the size of the infrastructure, a case where size definitely DOES matter.

That's where cloud computing comes into play, and right now that means the Amazon EC2 Elastic Cloud. Don't get me started on what a wonderful thing it is or I'll never get back to the subject at hand. But at the click of a mouse I can now spawn as many virtual computers as I like, use them in a large scale performance test, and shut them down when I'm through with them. If that doesn't sound very interesting to you, then you are at the wrong blog - try one of the links on the right!

Coming next - a view into the Cloud!

News Flash - a Bookstore Owns the Cloud!


I think it's marvelous to see that Amazon had it together enough to be the first and the best in the cloud computing arena, something that Google, IBM, Microsoft, Oracle and HP are all salivating over. Take that, ya big technological bullies!

But the EC2 Cloud is only one of a whole string of innovations that Amazon has cooked up, starting from the Web services that allow every Mom and Pop shop in the known universe to sell their books on Amazon. First it was "software as a service" and now it is "computing as a service". Computing as a service is actually a radical idea if you think about it. No longer are we constrained by space or time when it comes to how many computers we can own and use, and just think of the number of warranty forms we won't have to fill out and send in! And don't even get me started about factory rebates, that's a rant topic all on its own - if the Cloud saves us from that hell it is ROI enough if you ask me!

Of course, all the CIOs of the world are asking the same questions, namely will the total cost of ownership be less, will their applications and data be secure enough, will the service be reliable enough, and will they get equal or better performance. Once all those questions are answered in the affirmative we can expect our landfills to fill up quickly with discarded servers and server racks. Either that or Obama can ship them off to Fidel as a gesture of good will and as compensation for not having Amazon EC2 down there. Ouch.

Microsoft and Google have both launched cloud computing initiatives but if you ask me they have completely missed the boat. They are more into application hosting, which to me is not any different from existing web hosting providers - for now their services are free (or at least Microsoft's Azure is) but if they aren't doing cloud computing like Amazon is doing it they've been left behind in the pit on the very first lap.

On the subject of being left behind in the pit, it seems to me that the Amazon Cloud, or cloud computing in general, is where Linux will finally come into its own. Windows has evolved into massive bloatware as you know, and waiting for Windows Server 2003 to boot up is like waiting for water to boil. But the CapCal Agent runs on a stripped down version of Linux that weighs only 125MB and boots up in 20 to 30 seconds rather than the 5 minutes Windows Server takes!

The Tortoise and the Hare, the race car and the VW bug - the analogies are abundant and will be exploited with great relish in posts to come. Suffice it to say that if Windows were a car and Bill Gates was driving it he would never have gotten a speeding ticket, that's for sure!

In the News - Jonas Brothers Website Crashes


This just in from www.topsocialite.com on March 23:

Kevin Jonas, Nick Jonas and Joe Jonas

The Jonas Brothers’ brand new fan club website crashed this morning as fans tried to grab tickets for the upcoming tour.

Several thousand fans hit the website in hopes of landing pre-sale passes to the North American concerts, causing the website to crash.

A statement from the Jonas Brothers reads,

“Due to overwhelming demand among Team Jonas members for pre-sale tickets for the Jonas Brothers World Tour 2009, the official Team Jonas fan club website experienced intermittent problems this morning, causing delays among members attempting to connect to purchase tickets.

“For those of you still needing access codes, we are in the process of contacting fan club members individually with a solution so that you can purchase tickets as soon as possible. Rest assured that only Team Jonas fan club members will continue to have exclusive access to purchase tickets for the next 36 hours.”

Looks to me like the Jonas Brothers are in dire need of CapCal services, doesn't it? They obviously tried to go way faster than their infrastructure allowed and ended up crashing. I'm embarrassed to say that I don't know who the Jonas Brothers are, but thousands of people apparently do and were trying to buy tickets when their web site hit the wall!

Who knows how much money may have been lost or delayed but this kind of thing happens every day. I notice it myself at least once a day when I click on a CNN news item that points to the website of some TV station in Nebraska - it just can't keep up with the number of clicks it's getting so I give up after waiting 20 or 30 seconds, don't you?

I hope the Jonas Brothers are reading this but what are the chances of that, eh? If they are I'm sure they will insist that their website be outfitted with a CapCal Web Speedometer and subjected to a CloudBurst or two before selling tickets. Speed laws on the Web are self-enforcing as you can see, so at least they didn't get any tickets (but then neither did their fans)!

The Great Superbowl Trainwreck of 1999


Please pardon what may appear to be a gratuitous sex shot because I think you will immediately see the relevance and importance of it.

As I mentioned earlier, Victoria's Secret ran an ad on the Superbowl that invited tens of millions of drunken guys to their website to see the likes of this beauty, Stephanie Seymour, modeling their latest lingerie fashions. I won't publish the entire CNN article but you can read it here. Here are the first two paragraphs:

The media blitz that Victoria's Secret used to hype its first-ever live online fashion show paid off, creating high demand at its Web site.

But that success also posed a problem because the high traffic volume meant some viewers were unable to see supermodels such as Stephanie Seymour and Tyra Banks strut down the catwalk in the latest lingerie from the retailer.

You're probably thinking what I'm thinking - forget about the impact to Victoria's Secret's bottom line, what about those millions upon millions of guys who were deprived of the chance to see Stephanie and Tyra strut down the catwalk?

If the seriousness of the problem hasn't penetrated your gray matter by now I'm sure it finally has. The crash of the site might have been embarrassing but it brought them even more publicity and that's good. But many of those guys had to wait a day or two before they could gaze at these beauties, and that's just not right!

Wal-Mart Website Crashes on Black Friday!

This newsflash is not as fresh as the Jonas Brothers meltdown, but it's certainly not prehistoric like the Victoria's Secret Superbowl event - that was 1999, folks, not just a decade ago but the end of the last Millennium! If I dedicated this blog ONLY to website crashes I would never run out of things to write about! If you don't believe me, do a google on "website crashes" and you will get 7,810,000 hits!

Let's face it, if the Web were a racetrack there would be spectacular wrecks happening far more often than in the Indy 500! Watching the flame-outs is one of the things I enjoy most about the Indy or the Formula One, but I doubt the drivers would agree unless it was some other guy!

You may laugh (I know I do), but you can bet that the shareholders of Wal-Mart were not laughing about missing out on THE biggest sales day of the year. How on earth can a company as big and powerful and RICH as Wal-Mart allow such a thing to happen? Need I remind you that they are THE biggest company in the WORLD?

I'm sure heads were rolling at whatever company manages their IT operations (or formerly managed them that is). That's the kind of news we usually don't read about, maybe because people getting fired is such a routine occurrence. But when the biggest company in the world misses out on the biggest sales day of the year you would think SOMEBODY would say something about the emperor's lack of clothes!

Well, that "somebody" would be Yours Truly and I just may appoint myself Web Traffic Cop if you're not careful!

A Traffic Cop on the Wild, Wild Web

Yep, that's me - can't you tell? I'm going to be issuing speeding tickets to companies that exceed the speed limit and cause their web site to crash. There has to be a crash, of course, or I would never know about it. Reckless handling of Web capacity is rampant these days, as if people were driving Ferraris on the Autobahn without speedometers.

When I figure out a way to turn the Web Speedometer into a Web Radar Speed Detector, THEN the big IT services organizations might finally listen to me and get the point. Until then, I'll continue issuing citations at the crash scene, gathering witness testimony and so forth.

Step back, please, nothing to see here.

Spectacular J Crew Crash on Inauguration Day!


Here is an interesting article about how the J Crew website crashed on Inauguration Day when umpteen thousand women and girls decided they MUST have the pale pink and royal blue wool coats worn by Obama's daughters!

According to the NY Times:

By Tuesday afternoon, the Web page on J. Crew’s site that features women’s gloves had crashed. By Wednesday morning, the whole women’s section of the site had crashed. Later in the day, the entire site was down, with a note that said, “Stay tuned…Sorry, we’re experiencing some technical difficulties right now (even the best sites aren’t perfect). Check back with us in a little while.”

J. Crew’s Web site joins the many media sites that could not keep up with the surge of inauguration Web traffic. Many of the sites that promised to stream live video of the inauguration struggled or failed to provide a steady stream because of the influx of traffic.

I think you can see why I chose this particular photo - it wasn't just J Crew, their suppliers, and their suppliers' suppliers who felt the pain, it was the many media sites that could not keep up with the traffic. Hence the spectacular collision depicted above.

It will take me a while to write up this many tickets so please check back later.

Meanwhile, Off to the Races!


Waiting for diapers.com to get back to me about this once-in-a-lifetime opportunity has given me the chance to return to my other obsession besides load and that is speed. How fast is your server? Is it faster than mine? Is it faster than it was before the last patch or memory upgrade? How much slower is it running in a virtual machine? What about on the Cloud? Or, one of my favorites, which one is faster, Linux or Windows?

I know, those are trick questions because there's no way to know if you don't have a way to easily and effectively measure such things. Well, the CapCal Performance Rating Test is exactly that and you're welcome to use it yourself if you wish - just write to info@capcal.com and we'll send you a form to fill out with a download link. But if you just follow along here you'll see what I'm talking about.

I'm talking about a performance test that takes the CPU, the OS, the database, the network stack and overall throughput into account, and a way of doing the same thing on both major platforms. The Java Pet Store is one of Sun's benchmark reference applications and Microsoft thought so highly of it they developed the .NET Pet Shop equivalent. So this is what we use in the CapCal Performance Rating Test and we think it's the ideal choice, especially since it's been blessed by the powers that be.

Oh yes, and did I mention that I am a total Linux idiot? I know I did at least once. I tried to install the Java Pet Store on Debian Linux and got it copied to the hard drive. Then I realized I didn't have the Apache web server installed, and the word Apache reminded me of how Custer must have felt surrounded by all those Indians. I didn't feel that bad, but I can tell you I felt a similar desperation. If someone will help me, or if I can find the time to blunder through it, I will begin the Linux vs. Windows Speed Races, right here at the CapCal Blog!

Gentlemen, start your engines!

The Need for Standard Performance Ratings

You can't manage something you can't measure, which means the performance of distributed web apps, whether they are running on "bare metal", local virtual servers or remote ones on the cloud, is totally unmanageable. There is no, and I mean NO standard way of comparing the performance of two servers, or even two versions of the same server, much less with bandwidth and connectivity taken into account. We can measure all the bits and pieces, from CPU speed to upload and download speeds, but we can't measure the whole and make valid and useful conclusions of any kind.

That's because the "whole" includes so many things, from the web server to the database to the OS to the network stack. How on earth can you measure all of that and say that it is faster or slower than anything else?

You do it the same way they do it in other industries, by defining and adhering to standard benchmarks that are reliable, repeatable, easy to use, and most of all, done in the same way each and every time. This has been a challenge in the past because competing operating systems, databases and so forth make it very difficult (though not impossible) to compare apples to apples.

The CapCal Server Performance Rating Test is a packaged performance test that is quickly and easily installed on any server and contains everything necessary to run a 2 minute performance test against a standard Web app, the Java Pet Store (which has a .NET equivalent, the .NET Pet Shop). The total number of pets purchased by 20 users over 2 minutes will be your rating number, which is large enough to be granular (in the hundreds or thousands) and small enough to be easily categorized. Since it exercises the entire stack (including the storage subsystem), this test is not only a good performance indicator but an excellent health indicator as well - all the important subsystems have to be up and running for the test to even start.

Following posts will show the CapCal Rating Test being run in a variety of instances so we can dissect them and see what they are telling us.

A Busy Customer at the Pet Shop

The test above shows how many pets a single user can buy from the Pet Shop in one minute if they don't pause between pages. Obviously it's not a "real world" simulation since people DO pause between pages, but it's exactly that kind of variability that we have to cut out if we're going to compare apples to apples.

At the top right you see that the total number completed was 55, so we were almost able to buy one Persian cat per second. There are 12 pages in this test, which means we are processing about 11 pages per second, and you can see that the average site delay (the time between when the request was first made and the last byte of the response was received) is only 0.001 seconds, which is 1 millisecond or 1,000 microseconds. I'm so old that I remember when certain machine instruction cycles lasted that long!
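The arithmetic behind those figures is easy to check. Here is a minimal sketch in Python using the numbers quoted above (55 completed purchases, 12 pages per purchase, a 60 second run). The function name is made up for illustration and is not part of CapCal.

```python
def throughput(purchases, pages_per_purchase, duration_seconds):
    """Return (purchases per second, pages per second) for one test run."""
    purchases_per_second = purchases / duration_seconds
    pages_per_second = purchases * pages_per_purchase / duration_seconds
    return purchases_per_second, pages_per_second


# Figures from the single-user run described above.
cats_per_second, pages_per_second = throughput(55, 12, 60)
print(f"{cats_per_second:.2f} cats/s, {pages_per_second:.1f} pages/s")
```

Plugging in the results of any other run gives the same two rates, which makes runs of different lengths easier to compare.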

OK, so what does this tell us about server performance? To test our accuracy, if we double the users we should double the number of cats purchased, right?

Twins at the Pet Shop

Yep, you got it - 2 users can buy 110 cats in one minute. Now we are processing about 22 pages a second and buying close to two cats per second instead of one.

But look at the average site delay - it too has doubled, from 1 millisecond to 2 milliseconds!

That's to be expected, of course - you don't get something for nothing. So can I safely make the assumption that I can buy 10 cats a second with 10 users? 100 per second with 100? And if so, where will I put them all?

Ten Customers at the Virtual Pet Shop

As we see, 10 customers can purchase 526 cats in a minute, or 8.8 cats per second. That's a lot of cats, especially for someone like me who is not especially a cat person. If you don't mind, I think we'll start buying dogs instead - it shouldn't affect the test results since both are 3 letter words, and it will help keep all the cats under control!
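One way to put a number on how close we came to the straight-line guess is to divide what we actually bought by what perfect scaling would have given us. The sketch below does that with the three totals reported so far (55, 110 and 526). The function name and the "percent of linear" label are my own for illustration, not something the CapCal report prints.

```python
def scaling_efficiency(baseline_purchases, users, observed_purchases):
    """Observed total divided by the total that perfect linear scaling predicts."""
    expected = baseline_purchases * users
    return observed_purchases / expected


# users -> cats purchased in one minute, as reported in the runs above
runs = {1: 55, 2: 110, 10: 526}

for users, purchases in runs.items():
    efficiency = scaling_efficiency(runs[1], users, purchases)
    print(f"{users:>2} users: {purchases} cats, {efficiency:.1%} of linear")
```

Two users come in at 100% of linear, while ten users come in a little under 96%, so the server is starting to give something back even at this small load.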

Perhaps a 2 minute test with 20 users is overkill - maybe 10 users over 1 minute is just as good if not better. It's definitely better for being shorter, and the total is sufficiently large as to save us from having to think in fractions. After all, we're testing our servers, not our brains!

The Buck Starts Here


For those of you non-Americans, there is an old expression called "passing the buck", which means to place the responsibility on someone else's shoulders instead of taking it yourself. President Harry Truman had the above sign on his desk, meaning that he had ultimate responsibility since there was nobody else he could "pass the buck" to.

As you know from earlier chapters, the performance of a distributed, Web-based application is determined by many factors, from database queries on the back end to the speed of the client connection on the front end. Many of these are beyond our control as developers and yet we still have more control over performance than anyone - a poorly written database query is slow regardless of the speed of the network, for example. And not just the network itself, but firewalls, domains, SSL encryption and a host of other things can impact performance in the production environment.

So even if the buck doesn’t STOP with development, it certainly does START with development. And yet developers by and large have abdicated that responsibility to QA and IT Operations, expecting that whatever performance testing is done when everything is in place will uncover any issues that might be lurking. “It worked great in the lab” is a familiar refrain because it’s true – in the lab there are gigabit Ethernet connections to dedicated servers being accessed by developers who have the fastest computers in the company!

No Longer Just a Hardware Problem


In the olden days we were obsessed with performance because it was a necessity – if you had a 64K memory limit it made sense to sweat over such things as the number of CPU cycles a given instruction used up. Performance is just as important now as it was then, but memory is abundant and hardware is cheap so what’s the problem?

The problem is that you can’t manage something you can’t measure, and the kinds of measurements that we developers have at our disposal are sometimes not all that helpful or relevant. We can profile our code and our database queries and look for performance hogs that way, but if our profiling doesn’t include the things end users will be doing it can be of little use in the end. It might be something as simple as a new Flash animation that clogs the pipe and makes performance come to a crawl. How will the developer know that with their gigabit connection and lightning fast client machine?

Agile Performance Testing - an Oxymoron?


Today’s agile development methodologies are based on the notion that quality has to be built in from the beginning, not hoped for at the end. The same applies to performance, which is just as important as functionality – after all, who cares if the ordering system works if it’s too slow for anyone to tolerate?

One of the basic tenets of agile methodology is test-driven development, starting with unit tests that are written explicitly to fail until they can be proven to work. This is exactly the time to start measuring and managing performance, so why do most companies wait until the last minute and hope for the best?

One reason is that measuring performance requires subjecting the application to a level of load that will at least approximate, if not duplicate, the production environment. Even though there are all kinds of tools available to do this, it is a costly and time-consuming effort even if the tool is free. If I’m a developer working on a deadline, what’s in it for me to spend hours if not days creating and running load testing scripts if I know that in the end the issue of performance is out of my hands? If I’m especially diligent I can do some code and database profiling to soothe my own conscience but I’m not likely to waste much time with a load testing tool unless I’m forced to. Are you?

Writing Code to Test Code?


All of this, of course, flies in the face of the Agile Manifesto, wherein quality (and thus performance) is built in from the beginning by repeatable tests that are run automatically whenever changes are made. If I am writing code already it doesn’t make sense for me to be writing even more code to test the performance of the code I’m writing – that becomes maddeningly recursive once your load tests themselves need tweaking!

No, it’s hard enough to write unit tests that have to be maintained along with the code they are testing. Instead, performance testing needs to be a by-product of something else we developers do every day, day in and day out, and that is manual functional testing. If I am writing an order entry system like in the earlier example, I will be entering dozens or even hundreds of orders just to check my work. So why not “repurpose” some number of those as performance tests?

CapCal Agile Performance Testing


CapCal Agile Performance Testing (APT) is a free online performance testing service designed to turn manual tests into performance tests that can be easily integrated into any agile development framework. Along with the pass/fail status of the unit and functional tests, developers and QA analysts can now track overall performance trends that take everything into account, including bandwidth and throughput. The impact of that new Flash animation on the Sales Order page will be immediately obvious instead of waiting until a few days before Go Live to be discovered. And since we’ve gone to the trouble of building performance tests into the development cycle, it only makes sense for those in QA and Operations to leverage those tests instead of “rolling their own” and duplicating all that work.

With CapCal APT, you can capture your manual tests and use them to create performance tests, which are automatically uploaded to the CapCal host server and used to drive the controller and load agent on your desktop (or whichever machine is doing the work). A 2 minute performance test using 40 virtual users will yield a wealth of information about how the application performs under load, even if it doesn’t begin to tax the server. If today’s performance results are 30% worse than yesterday’s, it’s easy to pinpoint the cause and much easier to fix it than it would be later.
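To make the "30% worse than yesterday" idea concrete, here is a small sketch of the kind of day-over-day check a build script could run. It is an illustration only: the function names, the choice of average site delay as the metric, and the sample values are all assumptions of mine, not the CapCal APT interface.

```python
def percent_change(previous, current):
    """Percentage change from previous to current (positive means slower)."""
    return (current - previous) / previous * 100.0


def is_regression(previous_delay, current_delay, threshold_percent=30.0):
    """True when the delay grew by at least threshold_percent since the last run."""
    return percent_change(previous_delay, current_delay) >= threshold_percent


# Hypothetical average site delays in seconds from two nightly runs.
yesterday, today = 0.200, 0.265

change = percent_change(yesterday, today)
print(f"average site delay changed by {change:.1f}%")
if is_regression(yesterday, today):
    print("performance regression - look at today's check-ins first")
```

Because the comparison is made after every build, the list of suspects is only as long as one day's worth of changes.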

Of course, CapCal APT is just the tip of the CapCal iceberg, which covers the performance testing gamut from development to production, including the cloud and the Internet itself. But for the teeming masses that simply want to measure performance in an agile development environment, it’s just what the doctor ordered!

Performance Testing in the Dinosaur Age


Can you imagine a full Broadway production that only runs for a single night? It would have to be a really bad show! But that’s exactly what load testing used to be like, and still is if you automate the old-fashioned way. Writing test scripts, setting up computers to generate the load, running the test scripts by hand, and evaluating the results is like a really awful Broadway production. Problem is, the next time you make changes to your application you have to do the whole thing all over again. At least a bad show never comes back—this one does, but not often enough to keep the whole cast and crew hanging around!

That reminds me of a story. Back in 1981 (the dinosaur age in Internet time), I worked as a software test engineer for a company called Basic Four in Irvine, California. Basic Four manufactured a “minicomputer” that was the size of a small refrigerator and about half as fast as that old 386 machine in your closet. As many as 100 dumb terminals could be attached to a single machine and used for data entry tasks (or so the marketing guys claimed). My job was to prove that the computer really could support 100 terminals.

Proving that it couldn’t did not seem to be an option. So I dutifully wheeled in several racks of CRT terminals, exactly 100, from the manufacturing floor and painstakingly connected all of the cables and cords. My boss cheerfully suggested that we post an announcement on the cafeteria bulletin board, asking for volunteers from the factory floor to help us with the test. We could even offer free pizza to entice them!

“Forget it,” I said, “most of the people on the floor don’t speak English and have better taste in food. And it seems like a silly thing to do if we have to repeat it every time we change the software—can we really afford that much pizza? If we offered free beer we would draw more people for sure, but it might mess up the test results if they throw up on the keyboard or get tipsy. So I’ll just throw together a little program to write to the disk drive and display characters on the CRT—at least it will be similar to what happens in real life.”

So that’s what I did. One after the other, I logged in at each terminal and started up my test program. The screen began to fill up with dots and then I moved on to the next one. I began to notice that the dots on all the screens moved just a little bit slower with each one I added. By the time I got to about 30, they were barely even moving at all. “Oops!” I thought, “My test program must be too demanding on the CPU—back to the drawing board.”

So I rewrote the test program to be a bit more realistic. I started with what a fast typist might do (say, eighty words a minute) and calculated the average delay between keystrokes. Then I wrote a routine that put the program to sleep for that period of time and woke it back up long enough to send a dot to the screen. After eighty such iterations, I told it to send eighty bytes of data to the disk to simulate saving a record. Then, of course, I had to go back to all the terminals, one by one, to kill the previous version of the program that was running and start up the new one.
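For the curious, here is roughly what that second version did, sketched in modern Python. It is not the original RandyTest, which is long gone. The five characters per word figure is the usual typing-test convention and an assumption on my part, and the function names, the temporary file standing in for the disk drive, and the injectable sleep and screen arguments are all made up for illustration.

```python
import sys
import tempfile
import time

CHARS_PER_WORD = 5  # assumption: common typing-test convention, not from the story


def keystroke_delay(words_per_minute):
    """Average number of seconds between keystrokes at the given typing speed."""
    return 60.0 / (words_per_minute * CHARS_PER_WORD)


def virtual_typist(words_per_minute, records, record_size=80,
                   sleep=time.sleep, screen=sys.stdout):
    """Sleep between keystrokes, echo a dot for each one, and after
    record_size keystrokes write record_size bytes to simulate saving a record.
    Returns the total number of bytes written."""
    delay = keystroke_delay(words_per_minute)
    written = 0
    with tempfile.TemporaryFile() as disk:  # stand-in for the real disk drive
        for _ in range(records):
            for _ in range(record_size):
                sleep(delay)       # the typist "thinks" between keystrokes
                screen.write(".")  # one dot on the CRT per keystroke
                screen.flush()
            written += disk.write(b"x" * record_size)
            disk.flush()
    return written


print(f"{keystroke_delay(80):.2f} s between keystrokes at 80 wpm")
```

Calling virtual_typist(80, records=1) plays one eighty-keystroke record at that pace; one such call per terminal is the modern equivalent of walking down the rack and starting the program on each screen.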

By then it was nighttime, and my boss showed up with pizza. (Was it the great philosopher Dilbert who said, “Pizza is the opiate of the bosses”?) “How’s it going?” he asked. I explained what I was doing, and he seemed very excited about the whole thing. “That’s great! Just think of all the money we’ll save on beer and pizza!”

“Now that you mention it, there’s that little matter of the bonus you promised me,” I laughed. “Here, make yourself useful. You start on that end and I’ll start on this end. Just log in to each one, start up RandyTest in the main directory, and then go on to the next one.”

Thrilled to be doing something “technical,” he complied. This time the speed of the dots didn’t degrade as much every time a new terminal was activated. But when we got to around 50, something weird happened—the “refrigerator’s” disk light suddenly started flashing furiously and all the dots slowed to a crawl! Apparently, I had overwhelmed the hard disk by having so many high-speed “virtual typists” save so many records so quickly. For these many years it has been my job to torture computers, but in this case, I needed a realistic load test that focused on the terminals.

I went back and adjusted my routine to simulate a range of slower typists. The adjustment allowed me to get almost 70 terminals up and running before the disk began to thrash and the screens froze up. By then my boss had finished his pizza and gone home, after congratulating me on a job well done. Well done? Depends on how you look at it, I guess. I couldn’t see any way I could possibly back up our marketing claims without assuming that the average data entry clerk can’t type faster than a nine-year-old! I had to set the typing speed to about ten words a minute before 100 terminals could run at once, and even then the dots were barely moving. But hey, that’s what exaggerated marketing claims are all about, right?

The art and science of load testing has evolved quite a bit since those days, but the basics are still the same. Can a system or network of a given capacity deliver adequate performance under real-world usage conditions? There is only one way to find out for sure, and that means putting the application “on stage” to see how it really performs. Fortunately, we’ve at least taken most of the legwork out of the process these days, with the sophisticated load testing tools that are available. And we don’t even have to haul 100 (or 10,000) terminals into the lab to do it. Only the best Broadway shows run night after night—no messy setup and teardown with each production. If you can do your load testing this way, you are way ahead of the curve.

(This article was first published by StickyMinds).

Free Scalability Test - Find Your Limits!


Are you about to roll out a new web site or application? Making significant changes to an existing one? Worried that something might go wrong?

Or maybe you are deploying on EC2 with load balancing and autoscaling and need a fast and easy way to configure and fine tune according to the requirements of your application.

Even if the answer to both of these is NO, do you ever wonder what your actual limits are in terms of users browsing your site, choosing products and so forth? It's often not nearly as high as you imagine!

Find out for sure with a free "crash test" from Capacity Calibration using EC2. Just send an email to info@capcal.com with a time slot from 8 AM to 6 PM CST that works for you and we'll send you a GoToMeeting invitation. In 30 minutes or less, we'll create the test, run it and analyze the results! Or, if you prefer to go it alone, you're welcome to - just spin up a few instances of the CapCal load agent and go for it!

A number of people have already taken advantage of this offer and we'll be posting the results and their comments on the CapCal Blog.

Please note - this is a special introductory offer that is only available to the readers of High Scalability so don't miss out. Depending on the interest level we may decide to make a free community version available long term.