Why "Statistical Regression Testing" Matters

There is a class of bugs that has very little chance of being detected by manual testing or conventional regression testing, simply because they occur only when certain conditions are met. Take as an example a web server farm in which one of the servers is misconfigured or didn’t boot correctly for some reason. A single manual request has roughly a 1 in X chance of hitting the bad server, where X is the number of servers in the farm. These are exactly the kinds of issues that bedevil us the most, because we may have to repeat a test many times before the failure shows up, and there seems to be no rhyme or reason to it. In this example, the “rhyme and reason” is the algorithm used by the load balancer to distribute incoming requests.
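To put a number on the "1 in X" intuition, here is a minimal sketch in Python. It assumes exactly one bad server and a load balancer that picks a server uniformly at random for each request, independently of earlier requests. That is an assumption made for illustration; a strict round-robin balancer, for instance, would hit the bad server once in every X consecutive requests. The function name is hypothetical.

```python
def detection_probability(num_servers: int, num_requests: int) -> float:
    """Chance that at least one of num_requests lands on the single bad server.

    Assumes each request goes to a uniformly random server, independently
    of the others (an illustrative assumption, not a property of any
    particular load balancer).
    """
    if num_servers < 1 or num_requests < 0:
        raise ValueError("need at least one server and a non-negative request count")
    # Probability that one request misses the bad server.
    miss_one = 1 - 1 / num_servers
    # All requests miss with probability miss_one ** num_requests.
    return 1 - miss_one ** num_requests


if __name__ == "__main__":
    # A 10-server farm: compare a single manual check with larger samples.
    for n in (1, 10, 100):
        print(f"{n:>4} requests -> {detection_probability(10, n):.4f}")
```

Under these assumptions a single manual request against a 10-server farm finds the bad server only one time in ten, while a modest automated run makes detection very likely.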

In a recent project we had the load balancer configured with its default maximum client header size of 2k bytes. That is fine in most cases, but in this particular case a small percentage of headers exceeded 2k because of cookie size. The result was an HTTP error that never seemed to occur in manual testing, simply because it was statistically uncommon. Only by load testing were we able to generate these errors consistently, eventually discover their cause, and fix them by raising the maximum header size in the load balancer.

This is a perfect example of why load testing is critical, not just for measuring performance but for uncovering the kinds of problems that require a large statistical sampling of client instances. The objective is not to cause stress on the system but to throw enough variations at it to make sure bugs like this are not lurking below the surface. We’ve named this “statistical regression testing” because it is a cross between functional and load testing designed to uncover issues that are statistically uncommon.
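How large does the "large statistical sampling" need to be? A rough sketch, assuming each request fails independently with the same probability: the number of requests needed to see at least one failure with a chosen confidence follows from solving 1 - (1 - p)^n >= confidence for n. The function name and the 1% failure rate used below are hypothetical; the project described above only established that the oversized headers were a small percentage.

```python
import math


def requests_needed(failure_rate: float, confidence: float) -> int:
    """Smallest n such that n independent requests show at least one
    failure with probability >= confidence.

    failure_rate is the per-request chance of hitting the bug
    (a hypothetical input; in practice it is usually unknown up front).
    """
    if not 0 < failure_rate < 1:
        raise ValueError("failure_rate must be strictly between 0 and 1")
    if not 0 < confidence < 1:
        raise ValueError("confidence must be strictly between 0 and 1")
    # Solve (1 - failure_rate) ** n <= 1 - confidence for n.
    n = math.log(1 - confidence) / math.log(1 - failure_rate)
    return math.ceil(n)


if __name__ == "__main__":
    # Hypothetical: 1% of requests carry an oversized header.
    print(requests_needed(0.01, 0.95))
```

With a hypothetical 1% failure rate, it takes about 300 requests to be 95% sure of seeing the error at least once, which is far more than a person is likely to click through by hand and trivial for a load testing tool.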

I’m sure if you think about it you’ll come up with a lot of examples in your own career where statistical regression testing either did help or would have helped. Maybe you let a functional test run all night or got several people to bang on their keyboards at the same time. Or maybe you WERE running a load test when the issue popped up and blamed it on the load testing tool until you discovered otherwise (and yes, you know who you are)!

One reason that load testing is normally done at the end of a sprint or development cycle is that there’s not much point in stressing a system that doesn’t work to begin with. The Catch-22, as you can see, is that you may need to run a load test or a whole lot of functional tests before you can say that the system works well enough to stress it. In any case, “statistical regression testing” is just as important as functional and load testing, and the risk of releasing bad code goes way down if you employ all three.

We’ll continue to revisit this topic and find more examples as we go along. The statisticians among us will discover that their knowledge is critical in determining the kinds of tests to run. I am not exactly a statistician, but I do admire them, and I hope that my understanding of this field will grow over time.
