
4 Threats That Could Make Your A/B Test Results Invalid

A/B testing may seem straightforward, but the truth is that there are a lot of “little-known” factors that can shift how your tests perform.

You might have heard about sample sizes, statistical significance, and time periods required for proper testing of landing pages…

But what about the History Effect? The Novelty Effect? The Instrumentation Effect? Or the Selection Effect?

Read on to find out why the above threats could invalidate your A/B tests (and how to avoid skewing your data).

 

1. The History Effect

The history effect is a big one – and it can happen when an event from the outside world skews your testing data. Let me explain…

Let’s say your company just closed a new round of funding and it’s announced publicly while you’re running a test. This might result in increased press coverage, which in turn results in an unusual traffic spike for your website.

The problem is that a traffic spike caused by an unusual event means there's a high probability that those visitors differ from your usual, targeted traffic. In other words, they might have different needs, wants and browsing behaviours.

And because this traffic is only temporary, your test data could shift completely during the event, resulting in one of your variations winning when, with your regular traffic, it should have lost.

How to avoid it skewing your tests

Maybe this is cliché, but the key to avoidance is prevention. If you're aware of a major event coming up that could impact your test results, it's a good idea not to test hypotheses and funnel steps that are likely to be affected.

For example, if you're testing for a SaaS product and you're expecting a lot of unusual traffic for a week, focus on testing flows that only your existing, logged-in users see, rather than the main public-facing parts of the website, such as tour/feature pages and pricing pages.

Be aware that you'll always have traffic fluctuations and outside events affecting your test data; this can never be completely avoided. So the #1 thing you need to do to minimize the damage the History Effect can do to your testing program is simply to be aware of the fluctuations and differences in your traffic.

When you're aware of what's happening, you can dig deeper in Google Analytics to analyze your variations' performance and confirm whether your winning variation is indeed a winner. Never analyze a test solely through your testing tool.
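To make that comparison concrete, here's a minimal Python sketch of the kind of sanity check I mean, assuming you can export daily visitor and conversion counts per variation from your analytics. The dates, numbers and the spike_days set below are purely illustrative:

```python
from collections import defaultdict
from datetime import date

# Hypothetical daily export from your analytics tool (illustrative numbers only):
# (day, variation, visitors, conversions)
daily_results = [
    (date(2024, 3, 1), "control", 1200, 48),
    (date(2024, 3, 1), "variation_b", 1180, 55),
    (date(2024, 3, 2), "control", 5400, 110),     # funding announcement: traffic spike
    (date(2024, 3, 2), "variation_b", 5350, 102),  # funding announcement: traffic spike
    (date(2024, 3, 3), "control", 1150, 46),
    (date(2024, 3, 3), "variation_b", 1210, 57),
]

# Days you know were affected by an outside event (press coverage, launch, etc.)
spike_days = {date(2024, 3, 2)}

def conversion_rates(rows):
    """Aggregate visitors and conversions per variation and return conversion rates."""
    totals = defaultdict(lambda: [0, 0])
    for day, variation, visitors, conversions in rows:
        totals[variation][0] += visitors
        totals[variation][1] += conversions
    return {v: conv / vis for v, (vis, conv) in totals.items()}

print("All days:     ", conversion_rates(daily_results))
print("Spike removed:", conversion_rates(
    [row for row in daily_results if row[0] not in spike_days]))
```

If the winner changes once the spike days are excluded, the History Effect was probably doing the talking.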

2. The Instrumentation Effect

The Instrumentation Effect is quite possibly the most frequent validity threat among companies new to testing, and it happens when problems with your testing tools or test variations cause your data to be flawed.

A common example is when the code of one or more of your variations is not functioning properly with all devices or browser types, and often, without the company even being aware!

It can be a big problem: let’s say variation C of your test isn’t displaying properly in Firefox… this means a portion of your visitors will be served a problematic page.

As you can see, in this case variation C is at a disadvantage – its chances of winning are slim.

The thing is… if variation C had been coded and tested properly without any bugs, it may have won by a large margin!

How to avoid it skewing your tests

Before launching ANY test, you should always perform cross-browser and cross-device testing on your new variations. The good news is that many testing tools, such as Optimizely and VWO, have this feature built in, so checking the quality of your variations is as simple as a click.
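If you'd rather not rely only on those built-in previews, a lightweight automated check can also catch rendering problems before a test goes live. Here's a rough Python sketch using Playwright (my choice of tool, not something any particular testing platform requires); the preview URL and the "#signup-cta" selector are hypothetical placeholders for your own variation link and a key element that must render:

```python
# Requires: pip install playwright && playwright install
from playwright.sync_api import sync_playwright

PREVIEW_URL = "https://example.com/?preview_variation=c"  # hypothetical preview link
KEY_SELECTOR = "#signup-cta"  # hypothetical element every variation must display

with sync_playwright() as p:
    for browser_type in (p.chromium, p.firefox, p.webkit):
        browser = browser_type.launch()
        page = browser.new_page()
        try:
            page.goto(PREVIEW_URL)
            # Fail fast if the key element never becomes visible in this engine.
            page.wait_for_selector(KEY_SELECTOR, state="visible", timeout=5000)
            print(f"{browser_type.name}: OK")
        except Exception as exc:
            print(f"{browser_type.name}: problem rendering {KEY_SELECTOR} ({exc})")
        finally:
            browser.close()
```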

My second recommendation is to avoid using the drag-and-drop editors provided by testing tools. Sure, it makes the creation of your variations much easier, but you’ll be shooting yourself in the foot.

When you use a testing tool's drag-and-drop editor, your variations' code is auto-generated and usually ends up being quite messy.

This is not just a problem for the neat-freaks out there. Messy, auto-generated code typically equals a lot of browser issues – the Instrumentation Effect in action.

So instead, get a developer to code your test variations by hand, perform quality assurance tests, and the results will pay off.

3. The Selection Effect

There’s a lot of bad conversion optimization advice out there, and I’m not afraid to call it out…

One erroneous piece of advice I hear far too often is the following: “If you don’t have enough traffic to test one of your pages, temporarily send paid traffic to it for the duration of the test”.

Please, don’t do this.

This “piece of advice” assumes that traffic coming from your paid traffic channel will have the same needs, wants and behaviours as your regular traffic. And that’s a false assumption.

I recently had a client who used the same landing page for both email traffic and Facebook ads. Of course, traffic sources were tracked and analyzed… and the result? Facebook traffic converted at 6%, and email at 43%. HUGE difference.

Each traffic source brings its own type of visitors, and you can’t assume that paid traffic from a few ads and one channel mirrors the behaviors, context, mindset and needs of the totality of your usual traffic.

How to avoid it skewing your tests

Simple: be aware of your different traffic sources when running a test. When you’re analyzing the test results, make sure to segment by those sources to see the real data that lies behind averages.

If you don’t segment and analyze your results in Google Analytics, your testing tool could tell you your Control won, but you might discover it won only because of one specific traffic source that doesn’t represent your usual traffic well.
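Here's a small Python sketch of what that segmentation looks like in practice, using made-up numbers loosely inspired by the email vs. Facebook example above; in reality you'd pull these counts from your analytics export:

```python
from collections import defaultdict

# Made-up counts per variation and traffic source (illustrative only).
# (variation, traffic_source, visitors, conversions)
results = [
    ("control",   "email",     800, 344),
    ("control",   "facebook", 3200, 192),
    ("variation", "email",     780, 296),
    ("variation", "facebook", 3150, 252),
]

# Blended totals: what your testing tool reports.
blended = defaultdict(lambda: [0, 0])
for variation, source, visitors, conversions in results:
    blended[variation][0] += visitors
    blended[variation][1] += conversions

print("Blended (what your testing tool shows):")
for variation, (visitors, conversions) in blended.items():
    print(f"  {variation:<10} {conversions / visitors:.1%}")

print("Segmented by traffic source (what actually happened):")
for variation, source, visitors, conversions in results:
    print(f"  {variation:<10} {source:<9} {conversions / visitors:.1%}")
```

The blended numbers can look close while one traffic source is quietly carrying (or sinking) a variation, and that's exactly what the Selection Effect hides.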

4. The Novelty Effect

This effect is more likely to come into play when a large portion of your traffic comes from returning visitors rather than brand-new ones (as you'd see on landing pages driven by paid traffic, for example), so please be aware of it when making drastic changes to a webpage for a test…

Let me explain: The novelty effect happens when the engagement and interaction with one of your variations is substantially higher than previously, but only temporarily – giving you a false positive.

For example, if you launch a newly redesigned feature for your SaaS users and test it against the existing version, people will need to figure out how to use the new design. They'll click around, spend more time exploring it and, ultimately, give the impression that your variation is performing better than it really is.

The truth is, in the long run it may actually perform worse; the early lift is simply a result of your changes being novel to users during the testing period.

How to avoid it skewing your tests

Because the novelty effect is temporary, if you're testing a variation that dramatically impacts your users' flow, it is critical that you run your test for at least 4 weeks. In most cases, 4 weeks is enough time for the novelty to start wearing off and for the test results to stabilize.
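One way to check whether that's happening is to compare the lift week by week rather than only looking at the cumulative number. Here's a quick Python sketch, assuming you can pull weekly visitor and conversion counts per variation (the figures below are illustrative):

```python
# Weekly visitor/conversion counts per variation (illustrative figures only).
weekly = {
    1: {"control": (5000, 200), "redesign": (5000, 260)},
    2: {"control": (5100, 205), "redesign": (5050, 235)},
    3: {"control": (4900, 198), "redesign": (4950, 210)},
    4: {"control": (5050, 202), "redesign": (5000, 205)},
}

for week, counts in weekly.items():
    c_visitors, c_conversions = counts["control"]
    r_visitors, r_conversions = counts["redesign"]
    control_rate = c_conversions / c_visitors
    redesign_rate = r_conversions / r_visitors
    lift = (redesign_rate - control_rate) / control_rate
    print(f"Week {week}: control {control_rate:.2%}, "
          f"redesign {redesign_rate:.2%}, lift {lift:+.1%}")

# A lift that keeps shrinking week over week (as it does here) suggests the
# early "win" was novelty, not a genuinely better experience.
```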

If you have a variation that wins and you decide to implement it, be sure to keep tracking it in your analytics to confirm its long-term performance.

 

Conclusion

The History, Instrumentation, Selection and Novelty effects are 4 validity threats that could invalidate your A/B test data, giving you the illusion that one variation won when in reality, it lost.

Keep them all in mind when analyzing your test data, and don't forget to analyze your results in Google Analytics and/or Mixpanel to see what truly lies behind the averages and spot the signs of flawed tests.

In the comments below, tell me which of the 4 validity threats you think impacts you the most, and I’ll be happy to respond.