Apr 8, 2014
So you want to test and experiment with your apps to optimize the user experience, aka do science on your apps. But, as it turns out, that's harder than it looks. Gathering data used to be the hard part, but with the improvement of analytics tools, it's getting easier and easier. Now, the difficulty lies in experimental design. Testing too much invites false positive results, so it's critical to be selective about what and how you test.
First of all, why is experimental design so easy to screw up? It has a lot to do with uncertainty. All tests have finite time and sample size, so you have to set your standards: conventionally, a significance threshold of p &lt; 0.05 and a statistical power of 80%.
As this helpful chart in The Economist explains, even if you use these scientifically accepted standards, you can still end up making a disconcerting number of mistakes. In fact, if only 1 out of 10 hypotheses is true (or, to put it in A/B terms, if only 1 out of 10 tests has a truly winning variant), you'll still end up with almost half of your positive results being false!
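The arithmetic behind that claim is worth walking through. Here's a quick sketch, assuming the conventional significance level of 0.05 and, purely for illustration, a statistical power of 50% (the exact power in the chart may differ):

```python
# Suppose we run 1000 A/B tests and only 1 in 10 has a truly winning variant.
total_tests = 1000
true_effects = 100                       # 1 out of 10 hypotheses is actually true
null_effects = total_tests - true_effects

alpha = 0.05                             # conventional significance threshold
power = 0.5                              # illustrative power; real tests vary

false_positives = null_effects * alpha   # 900 * 0.05 = 45 nulls flagged anyway
true_positives = true_effects * power    # 100 * 0.5  = 50 real effects detected

# Fraction of "significant" results that are actually false:
false_discovery_rate = false_positives / (false_positives + true_positives)
print(round(false_discovery_rate, 3))    # 0.474: almost half
```

Notice that the false discovery rate depends heavily on the base rate of true hypotheses: the fewer of your ideas that actually work, the larger the share of your "wins" that are mirages.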
So, statistics tells us that most science is wrong.
Actually, the way most people do statistics is wrong too. If it tickles your fancy, you can read an intensely detailed synopsis of the many pitfalls of scientific statistics. Luckily for you, the Apptimize framework makes it easy to do tests right. But there are still a few important pitfalls you should take care to avoid. One of the most tempting errors is inviting false positives through multiple hypothesis testing. That's what leads to the frightening scenario described above.
Running a ton of tests at once is called multiple hypothesis testing. The problem with this approach, as shown above, is that if you test all possible combinations of several variables, some of them will seem to show a significant correlation simply by chance. If you really do need to run many tests, you can correct for the false positives in a variety of ways, from the relatively lenient (like the Benjamini-Hochberg procedure) to the draconian (like the Bonferroni correction). But beware: any method of multiple hypothesis correction decreases the power of each test, and you might end up burying valid results beneath the correction threshold. Typically, the best solution is not to test and correct, but to avoid opening yourself up to that danger in the first place.
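To make the trade-off concrete, here's a minimal sketch of the two best-known corrections: the draconian Bonferroni bound, which shrinks the per-test threshold, and the gentler Benjamini-Hochberg procedure, which controls the false discovery rate. The p-values below are made up for illustration:

```python
def bonferroni(p_values, alpha=0.05):
    """Reject hypothesis i iff p_i <= alpha / m (strict family-wise control)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Reject the k smallest p-values, where k is the largest rank such that
    p_(k) <= (k / m) * alpha. Controls the false discovery rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            k = rank                      # remember the largest passing rank
    rejected = [False] * m
    for i in order[:k]:
        rejected[i] = True
    return rejected

p = [0.01, 0.04, 0.03, 0.005]            # hypothetical results from 4 tests
print(bonferroni(p))                     # [True, False, False, True]
print(benjamini_hochberg(p))             # [True, True, True, True]
```

Note how Bonferroni throws out two results that Benjamini-Hochberg keeps: that lost power is exactly the cost of correction the paragraph above warns about.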
Let's see an example of how multiple hypotheses can get out of hand. Say you collected user properties including age, gender, education level, height, and dietary restrictions, and made a two-variant test for each. After combining enough filters, regardless of whether you have any actual causal connections in your underlying population, you're statistically bound to find something. Perhaps you'll find that 6'1″ male pescatarians with a bachelor's degree who are between the ages of 23 and 25 have a special propensity to click the red button instead of the blue one! In the modern age of apps, where we enjoy the luxury and suffer the curse of too much data, we are especially prone to being led astray by the ghosts of illusory significance.
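To see how quickly the combinations pile up, consider a back-of-the-envelope count. The number of levels assigned to each property below is invented for illustration:

```python
# Hypothetical number of levels for each collected user property
levels = {
    "age bracket": 5,
    "gender": 3,
    "education": 4,
    "height bracket": 6,
    "dietary restriction": 4,
}

# Every combination of filters defines a distinct segment you could test.
segments = 1
for n in levels.values():
    segments *= n
print(segments)                  # 1440 possible segments

# At p < 0.05, pure chance alone "wins" in about 5% of them.
expected_false_wins = segments * 0.05
print(expected_false_wins)       # 72.0 spurious "significant" results
```

Five innocuous-looking properties yield well over a thousand testable segments, and dozens of those will look significant even if the button color changes nothing at all.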
Fortunately, you can combat this curse by augmenting the machines’ intelligence with your own. Just take a manageable set of changes you think might help your metrics, test them, and you’ll run a much lower risk of falsely significant results. In other words, don’t run twenty tests targeted by three different parameters – run two or three larger tests, targeted by one or two. The same goes for variant design; two to four will serve better than dozens.
This kind of moderation saves you time and effort, gives you statistical power, and improves the chances that the differences you find are true ones. Of course, through Apptimize, you can still discover the outliers: our results dashboard lets you see when a metric you weren't thinking about skyrockets or plunges. But in the default case, you'll likely find that sensible theories and smart guessing lead to good results.
Now, go forth and test, and remember: With great science comes great responsibility.