Imagine you’ve modified an algorithm and want to know whether its performance has changed.
You run your test suite before and after the change, measuring execution time for each test case. It’s easy enough to come up with an ad hoc performance metric and get a sense of good/bad, but it’s tricky to be more rigorous, accounting for the different baseline performance of each test, separating signal from noise, and determining the statistical significance of the results.
Of the numerous applicable statistical tests, the Wilcoxon signed-rank test achieves a nice balance in this scenario by focusing on the ranks of the changes rather than their absolute values.
First though, a brief disclaimer. I’ve never been very good with statistics and I’m not a great person to elucidate even a relatively basic concept like this. I became aware of the Wilcoxon signed-rank test at a previous job when it was used for this sort of benchmarking, but I was afraid of it. Like a mystical artifact I admire without understanding, I put it on a shelf somewhere in a dark closet in the back of my mind and decided to dust it off when the time came to finally use it. That time has arrived, so I’m using this notebook to rid myself of any fear and figure out what’s really going on.
The Wilcoxon signed-rank test might be a good fit if you’ve measured some performance indicator and want to determine whether performance has improved or degraded. This could be something like wall clock run time. The key requirement for this test is that the measurements are paired, like a single test suite run before and after some change.
Measuring algorithm performance can be weird. Things are pretty consistent, except every now and then garbage collection or some other browser tab eats up all the CPU cycles and you get some extreme outliers. The Wilcoxon signed-rank test is non-parametric, meaning it doesn’t assume your measurements follow a particular distribution. Instead, it ranks the measurements by magnitude and uses the numerical rank as the input, thus preserving some information about the relative size of changes while making it less sensitive to noise and outliers.
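Here’s a toy illustration of that insensitivity, with made-up difference values ranked via scipy.stats.rankdata (the numbers are arbitrary, not from any real benchmark):

```python
import numpy as np
from scipy.stats import rankdata

# In the signed-rank test, an extreme difference just receives the largest rank;
# it can't dominate the result the way it would dominate an average.
diffs = np.array([0.2, -0.1, 0.4, 0.3])
diffs_with_outlier = np.array([0.2, -0.1, 40.0, 0.3])   # one run hit by GC

print(rankdata(np.abs(diffs)))               # [2. 1. 4. 3.]
print(rankdata(np.abs(diffs_with_outlier)))  # [2. 1. 4. 3.]  -- same ranks either way
```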
There are, of course, tests for normally distributed data (paired t-test), for when you trust only the sign of each change (sign test), or for rank-based comparison of samples that aren’t paired (Mann–Whitney U test). In short, there are many methods for many configurations, but if I were setting up some basic benchmarking, the Wilcoxon signed-rank test would be a pretty good fit.
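To make that menu concrete, here’s a rough sketch of how these options map onto SciPy; the synthetic `before`/`after` arrays and the pretend ~5% slowdown are made up purely for illustration:

```python
import numpy as np
from scipy import stats

# Hypothetical paired timings: one "before" and one "after" measurement per test case.
rng = np.random.default_rng(0)
before = rng.lognormal(mean=3.0, sigma=0.1, size=30)
after = before * 1.05 + rng.normal(0.0, 0.5, size=30)   # pretend the change is ~5% slower

# Paired, differences roughly normal: paired t-test.
print(stats.ttest_rel(before, after))

# Paired, only the signs of the differences trusted: sign test,
# i.e. a binomial test on how many differences are positive.
d = after - before
print(stats.binomtest(int((d > 0).sum()), n=int((d != 0).sum()), p=0.5))

# Two independent (unpaired) samples, rank-based: Mann-Whitney U test.
print(stats.mannwhitneyu(before, after))

# Paired, rank-based: Wilcoxon signed-rank test, the focus here.
print(stats.wilcoxon(before, after))
```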
Consider measuring the performance of an algorithm before and after some changes, say $b_1, \dots, b_n$ and $a_1, \dots, a_n$ across $n$ test cases. The Wilcoxon signed-rank test proceeds in several steps.
- Calculate differences. For each test case $i$, compute the difference $d_i = a_i - b_i$, where positive values indicate the algorithm got slower and negative values indicate it got faster.
- Rank the absolute differences. Ignore the signs temporarily and rank the absolute differences $|d_i|$ from smallest (rank 1) to largest (rank $n$). If there are ties, assign each tied value the average of the ranks they would have received.
- Apply signs to ranks. Restore the signs from the original differences to the ranks. This gives us signed ranks $R_i = \operatorname{sign}(d_i)\,\operatorname{rank}(|d_i|)$.
- Sum ranks by sign. Calculate:
  - $W^+ = \sum_{d_i > 0} R_i$ (sum of ranks where the algorithm got slower)
  - $W^- = \sum_{d_i < 0} |R_i|$ (sum of absolute ranks where the algorithm got faster)

  Note that $W^+ + W^- = \frac{n(n+1)}{2}$, the sum of all ranks from 1 to $n$.
- Test Statistic and Null Hypothesis. The test statistic is $W = \min(W^+, W^-)$. Under the null hypothesis that there’s no systematic difference (the median difference is zero), we expect $W^+$ and $W^-$ to be roughly equal. A small value of $W$ indicates one-sided behavior.
  For large samples (say, $n \gtrsim 20$), $W$ is approximately normally distributed with mean $\mu_W = \frac{n(n+1)}{4}$ and standard deviation $\sigma_W = \sqrt{\frac{n(n+1)(2n+1)}{24}}$. We can compute a z-score, $z = \frac{W - \mu_W}{\sigma_W}$, and derive a p-value from the standard normal distribution.
  A continuity correction accounts for approximating a discrete distribution (integer ranks) with a continuous distribution (normal). Since $W$ can only take integer values, we adjust by 0.5 in the direction of the mean, which improves the accuracy of the p-value for small to moderate sample sizes.

  I sorta get the reasoning, but the precise details are a bit fuzzy to me. At the end of the day, we use the adjusted equation $z = \frac{W - \mu_W + 0.5}{\sigma_W}$, where the $+0.5$ nudges $W$ toward its mean (recall that $W = \min(W^+, W^-) \le \mu_W$).
- P-value. Finally, we compute a “two-tailed” p-value using the normal CDF. After working out a line or two of algebra, we end up with $p = 2\,\Phi(z)$, where $\Phi$ is the standard normal CDF and the factor of 2 accounts for both tails.
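Here’s a minimal sketch of the whole procedure in Python. The function name, the choice to drop zero differences, and skipping the tie correction to the variance (which SciPy applies) are my own simplifications, not anything canonical:

```python
import numpy as np
from scipy.stats import norm, rankdata


def wilcoxon_signed_rank(before, after):
    """Two-tailed Wilcoxon signed-rank test via the normal approximation
    with a continuity correction. Returns (W, z, p)."""
    d = np.asarray(after, dtype=float) - np.asarray(before, dtype=float)
    d = d[d != 0]                       # common convention: drop zero differences
    n = len(d)

    ranks = rankdata(np.abs(d))         # ranks 1..n; ties get the average rank
    w_plus = ranks[d > 0].sum()         # sum of ranks where the algorithm got slower
    w_minus = ranks[d < 0].sum()        # sum of ranks where the algorithm got faster
    W = min(w_plus, w_minus)            # test statistic

    mu = n * (n + 1) / 4                             # E[W] under the null hypothesis
    sigma = np.sqrt(n * (n + 1) * (2 * n + 1) / 24)  # SD of W (no tie correction)
    z = (W - mu + 0.5) / sigma          # +0.5 continuity correction toward the mean
    p = 2 * norm.cdf(z)                 # two-tailed p-value
    return W, z, min(p, 1.0)
```

The interactive demo below walks through the same computation step by step on synthetic data.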
Interactive Demo
Adjust the parameters below to explore how different scenarios affect the results:
Below is the synthetic data generated with the chosen parameters, showing before and after timings for each test case.
The table below shows these same differences ranked by magnitude. Notice we’re ranking by absolute time difference rather than percent change. One or the other may be more appropriate, depending on how you expect performance to change across tests, for example whether by a fixed latency or a proportional change.
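Here’s a toy illustration of that choice, with made-up timings:

```python
import numpy as np
from scipy.stats import rankdata

before = np.array([10.0, 200.0, 55.0, 3.0])   # hypothetical baseline timings (ms)
after = np.array([12.0, 205.0, 54.0, 3.3])    # hypothetical timings after the change (ms)

abs_diff = after - before                     # fixed-latency view:  +2.0, +5.0, -1.0, +0.3
pct_diff = (after - before) / before          # proportional view:   +20%, +2.5%, -1.8%, +10%

print(rankdata(np.abs(abs_diff)))             # ranks by absolute change: [3. 4. 2. 1.]
print(rankdata(np.abs(pct_diff)))             # ranks by percent change:  [4. 2. 1. 3.]
```

Ranking by percent change reorders the cases here because the baselines span roughly two orders of magnitude.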
The test proceeds by computing the sum of the positive (W+) and negative (W-) ranks:
Sample size ($n$): the number of paired observations
Sum of positive ranks ($W^+$): summed over cases where the algorithm got slower
Sum of negative ranks ($W^-$): summed over cases where the algorithm got faster
Test statistic ($W$): $\min(W^+, W^-)$
With the test statistic $W$ in hand, we compute a z-score and p-value using the normal approximation.
For reasonably large samples (say, $n \gtrsim 20$), the test statistic is approximately normally distributed. We can compute the expected value and standard deviation under the null hypothesis (no systematic change):
Expected value: $\mu_W = \frac{n(n+1)}{4}$
Variance: $\sigma_W^2 = \frac{n(n+1)(2n+1)}{24}$
Standard deviation: $\sigma_W = \sqrt{\frac{n(n+1)(2n+1)}{24}}$
Z-score (with continuity correction): $z = \frac{W - \mu_W + 0.5}{\sigma_W}$
P-value: $p = 2\,\Phi(z)$
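As a sanity check on these formulas outside the interactive demo, here’s a sketch that generates synthetic paired timings (the 2% slowdown, noise level, and outlier positions are all arbitrary) and compares the hand-rolled wilcoxon_signed_rank sketch from earlier against scipy.stats.wilcoxon, which may give a slightly different p-value since it can use the exact distribution for smaller samples:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 40
baseline = rng.uniform(5, 500, size=n)                       # each test has its own baseline (ms)
before = baseline * rng.lognormal(0.0, 0.03, size=n)         # measurement noise
after = baseline * 1.02 * rng.lognormal(0.0, 0.03, size=n)   # ~2% systematic slowdown
after[[3, 17]] *= 5                                          # a couple of GC-style outliers

# wilcoxon_signed_rank is the hand-rolled sketch from earlier in the post.
W, z, p = wilcoxon_signed_rank(before, after)
print(f"manual: W={W:.1f}, z={z:.2f}, p={p:.4f}")

# SciPy's version; correction=True enables the same continuity correction.
# The p-value can differ a bit (SciPy may use the exact distribution, plus tie handling).
res = stats.wilcoxon(before, after, correction=True)
print(f"scipy:  W={res.statistic:.1f}, p={res.pvalue:.4f}")
```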
Conclusion: The p-value tells us the probability of seeing differences this extreme if there were no performance change. A p-value below 0.05 is conventionally taken as evidence of a real performance change; a p-value of 0.05 or higher means there isn’t sufficient evidence to conclude there’s a systematic change.
At the end of the day, this was a lot more intuitive and less scary than I expected. We really just rank the changes and compare the rankings to what we’d expect if there were no performance difference. I like the precision and rigor with which it does this, and I like that it respects the relative size of changes without being fully sensitive to bizarre outliers. I’m still not great with statistics, but for what I need, it feels like the right tool for the job and I’m pretty content here.