Hi, I’m Sim Wishlade, and I’m a Design Manager / Principal Product Designer at OpenTable. To give this article some context as to why I wrote it: I hate losing. I hate losing more than I love winning. We all got told as children it’s the taking part that counts. Not for me. If I start losing at anything, I get annoyed and frustrated for not being better. I hate losing at cards to my wife, at FIFA to my eight-year-old nephew, or Scrabble against my sister (who cheats, by the way, to make sure she wins).
So when it comes to A/B testing, where on average a test loses 60–70% of the time (depending on the maturity of your product), you would think I would move heaven and earth to avoid spending my time being disappointed and frustrated. After five years of working at OpenTable (my first role involving A/B testing), I’ve distilled some of my learnings into five principles that I hope will let you lose less in your A/B testing journey.
Firstly, I’m going to talk to you about an A/B test we ran around messaging. I’m sure we’ve all suffered from the fear of missing out, and we can see various sites/industries around the web using messaging to create a feeling of ‘I’m missing out’ or ‘I will miss out if I don’t move quickly’ in various degrees of persuasiveness. Since this type of messaging draws on scarcity, a core economic principle that affects perceptions of desire and value and motivates us to act, it’s hardly surprising to see tactics like this being used.
On the left it’s factual: the message is driven by data rather than emotion and gives us context on how many items are left. On the right it’s the opposite: it plays on the emotions of missing out because something might change in the future that affects the item you’re viewing.
We used to use the message ‘Hurry, we only have [X] time slots’ in our restaurant booking widget. And while it had been a positive A/B test, the messaging felt inconsistent with how we want to talk to our users. We asked our content team to look at other options to see if we could frame it in a more positive light and reduce the persuasiveness of the messaging.
We ran the test with the new string ‘You’re in luck! We still have [number] timeslots left’ and saw no change to booking behaviour. As we looked more closely at the data, we saw some other changes: bookings and cancellations dropped by almost the same number. Our hypothesis from this is that when users are faced with the ‘Hurry’ message, they’re making a booking in case they miss out, then booking somewhere else and cancelling the first booking. That’s not a great experience for our users to go through, nor for our restaurants, who don’t want to see bookings made and then cancelled. By changing the string of text we give our users a better experience, and also create a better experience for our restaurants.
Ratings are one of the core drivers for decision-making on search results for OpenTable, as I’m sure they are for most businesses. In a piece of research we had a few comments that it was hard to distinguish one rating from another when ratings were close together. A rating of 3.9 looked similar to a 4.1, and distinguishing between a few pixels wasn’t that easy to do.
We wanted to test whether seeing a numerical value made the decision-making process easier. We had previously seen in qualitative research that users preferred a rating out of ten, and when it came to A/B testing and getting quantitative data we saw the same: the test performed well for our users.
We had seen in our research that users’ perceptions changed as we changed the rating scale. Some users would change their minimum from a 3 out of 5 to a 5 out of 10; others went from a 3 out of 5 to either an 8 or a 9 out of 10. So, even though this was a test win, we had to decide: was changing to a rating out of 10 the right thing to do?
If we keep the rating out of 10, do we need to reset all our ratings and start all over again? More importantly, are our users now overlooking a restaurant that would be perfect for their dining experience because the minimum rating a restaurant should have to be considered for dining has changed?
To be fair and honest to the ratings and diners, we went back to using ratings out of 5 stars. When a test wins, we want to make sure it’s right not only for our users but also for our restaurants. Sometimes, a win is not a win.
This raises interesting questions around the ethics of A/B testing. In this example we could easily have switched to a rating out of 10, manipulated the figures a little (since simply doubling the rating didn’t reflect the correct rating, mainly to do with starting a scale at 0 or 1… but that’s for a different article) and seen a benefit for us with more bookings. But we didn’t, because of the impact it might have on our restaurants. I would suggest that anyone who adopts this approach, putting the goals of the company before the experience of their users rather than balancing both, is looking for short-term gains that are likely to be harmful to the business in the long term.
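To make the point about doubling concrete, here is a minimal sketch of a linear rescale between two rating scales. It assumes a 1–5 star scale mapped onto a 1–10 scale; the article doesn’t say which endpoints were actually used, so the defaults and the `rescale` helper are purely illustrative.

```python
def rescale(rating, old_min=1.0, old_max=5.0, new_min=1.0, new_max=10.0):
    """Linearly map a rating from one scale onto another.

    The default endpoints (1-5 stars onto 1-10) are an assumption made
    for illustration, not the scales used in the test described above.
    """
    if not old_min <= rating <= old_max:
        raise ValueError("rating is outside the original scale")
    # How far along the old scale the rating sits, from 0.0 to 1.0.
    fraction = (rating - old_min) / (old_max - old_min)
    return new_min + fraction * (new_max - new_min)


# Compare naive doubling with the linear rescale for a few star ratings.
for stars in (1.0, 3.0, 5.0):
    print(stars, "doubled:", stars * 2, "rescaled:", rescale(stars))
```

Under these assumptions the midpoint of the star scale (3 out of 5) lands on the midpoint of the new scale (5.5 out of 10), whereas doubling gives 6. The lowest possible rating stays at 1 rather than becoming 2. The two approaches only agree at the top of the scale.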
I spent a period of time helping refine our Diner Feedback Form (shown in the image above) to make it more effective and usable for our diners. The review form is made up of separate modules including a rating, review, adding photos, and tags about the restaurant. Each piece of content is valuable to other users (reviews tell you how the experience was, photos show you the food/restaurant, tags help us curate lists of restaurants). So as we try to refine how users interact with each module we don’t want to negatively affect the user completing the other modules.
We made some initial improvements to the form that helped our users complete it. But over time we saw our tests becoming less effective, and no matter what we did we couldn’t move the metrics from where they were. We had reached our local maximum.
So what is a local maximum? There comes a point in A/B testing where you’ve optimised the feature, page, or component, and no matter what you do you can’t seem to move the needle on a metric, or by doing so you harm other metrics on the page. It has become as optimised as it’s ever going to be.
So what to do next? Our next steps were to really look at where we were: was this the best experience in asking for feedback, how were others doing it, where could we add value, and what was our long-term vision for this form. We went through the design process, testing with users, iterating, re-testing, and seeing what would really drive our users to fill in the form.
Once we were happy with the new design for this feature (as shown above) we launched it to a small percentage of users and we saw a dip in our metrics. This can be typical when launching a redesign or new feature, but now that we had a benchmark (and some quantitative data) we could bug fix and iterate on the design, relaunch, and continue to do this until we were happy to roll it out to all traffic arriving at the page.
By being aware that we’d reached our local maximum for the feature, we were able to take a leap forward within the design, rather than smaller iterations. By doing so we could move beyond the local maximum set by the previous design and towards the global maximum. As tests become more and more ineffective, A/B testing empowers us as designers to make these step changes that otherwise we would never get to.
What is a global maximum and how do you know you’ve reached it?
To know you’ve reached your global maximum you need to keep being bold with your design changes to find which one works most effectively for your users. The new layout for our Diner Feedback Form might have proved better for our users than the previous version, but a totally different design for the form might be even more effective, and we’ll only know this by testing a new layout. Ideally, we would reach a point where we’re seeing 100% completion, with all modules within the form filled in, photos added etc. We’re well aware that this will never be the case, but we’ll keep iterating to get as close to that as possible!
Be aware of reaching your local maximums, so you know when it’s better to leap than to keep taking small steps.
Our search results page is designed to help users shortlist restaurants and make a decision on where to book. Over time, our search card (shown above) had more and more information added to it — each piece we added proved positive in an A/B test and so was helping users decide on where to make their reservation. But we failed to ask ourselves two questions:
How many pieces of information can users process to make a good decision?
How many of these tests were still effective and not detracting from the user experience?
We re-tested each element on the card, removing one element at a time to see the impact it had, and found two pieces of information that were now hindering users’ decision-making rather than helping it. By removing these pieces of information we are creating a better experience in several ways:
The effectiveness of positive tests will depreciate over time, so it’s always good to re-test them. Is there a novelty value in seeing a new layout/piece of information for the first time? When the user sees it for the second/third/fourth time does this become a hindrance? It’s always worth going back and checking.
Making sure you write a hypothesis correctly — we always try to use the same format so we’re clear about the decision we’re making.
We state what we’re changing, what will happen, and why we expect it to happen. We’d include a paragraph of anything we found in user testing or research that backs this up. This provides clarity whether someone is reading our Product Brief or looks at the test in our A/B testing tool.
The great thing about A/B tests is the ability to, well, test. Be aware though: you will damage your results when you start running too many tests on one page. Imagine you’re testing the hierarchy of filters while also testing the information layout in your search card: you now have four variants of that page. Add a third test to the page and you have eight variants. The more variants of the page, the less confident you can be that a test is a win, because of the increased chance of test interference.
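The arithmetic behind those four and eight variants can be sketched in a few lines. The test names and the `page_variants` and `visitors_per_combination` helpers are hypothetical, and the sketch assumes every test splits traffic evenly and independently between a control and one variant.

```python
from itertools import product


def page_variants(tests):
    """Return every combination of arms a visitor could see on one page.

    `tests` maps a (hypothetical) test name to its list of arms.
    """
    names = list(tests)
    arms = (tests[name] for name in names)
    return [dict(zip(names, combo)) for combo in product(*arms)]


def visitors_per_combination(total_visitors, tests):
    """Average traffic per combination, assuming even, independent splits."""
    return total_visitors / len(page_variants(tests))


tests = {
    "filter_hierarchy": ["control", "variant"],
    "search_card_layout": ["control", "variant"],
}
print(len(page_variants(tests)))  # two tests on the page

tests["third_test"] = ["control", "variant"]
print(len(page_variants(tests)))  # three tests on the page
```

Each extra two-arm test doubles the number of page combinations, so the same traffic is spread more thinly across them.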
There are various times of the year when you might see abnormal traffic. For instance, around Cyber Monday it’s likely that your customers will behave differently, and your customer base might even be very different (e.g. deal hunters). For us, we try to avoid running tests in the run-up to Valentine’s Day: our core user base is different and it’s a time when everyone is looking for reservations, so we can’t be wholly confident that a test win is really a win.
Peeking is when a tester extends or reduces a test length to reach significance.
An example: you originally planned to run a test for two weeks, but after ten days you saw that it had reached the right level of confidence, so you stopped the test. But your users might behave differently at different times of the week, so had you left it running for the full two weeks the result might have been different.
And the opposite is true too: extending a test. Your test may not have reached significance after two weeks, which means it didn’t reach your expected uplift. Refine the solution and re-test. Don’t extend the test to get the results you want.
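One way to see why stopping as soon as a test looks significant is risky is to simulate experiments where both arms are identical, so any ‘win’ is a false positive. The sketch below is illustrative only: the traffic numbers, the conversion rate, the simple two-proportion z-test and the 1.96 threshold (95% two-sided confidence) are all assumptions, not a description of any particular testing tool.

```python
import math
import random


def z_score(conv_a, n_a, conv_b, n_b):
    """Two-proportion z statistic using a pooled conversion rate."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0  # no conversions (or all conversions): nothing to compare
    return (conv_b / n_b - conv_a / n_a) / se


def simulate(experiments=300, days=14, daily_visitors=100, rate=0.10, seed=1):
    """Return (false-positive rate with daily peeking, rate at planned end).

    Both arms share the same true conversion rate, so every significant
    result is a false positive.
    """
    rng = random.Random(seed)
    peeking_wins = 0
    planned_end_wins = 0
    for _experiment in range(experiments):
        conv_a = conv_b = visitors = 0
        significant = False
        ever_significant = False
        for _day in range(days):
            conv_a += sum(rng.random() < rate for _ in range(daily_visitors))
            conv_b += sum(rng.random() < rate for _ in range(daily_visitors))
            visitors += daily_visitors
            # Peek: check significance on the data collected so far.
            significant = abs(z_score(conv_a, visitors, conv_b, visitors)) > 1.96
            ever_significant = ever_significant or significant
        peeking_wins += ever_significant      # would have stopped early at some point
        planned_end_wins += significant       # significant on the final planned day
    return peeking_wins / experiments, planned_end_wins / experiments


peeking_rate, planned_rate = simulate()
print("false positives when stopping at the first significant peek:", peeking_rate)
print("false positives when judging only at the planned end:", planned_rate)
```

Because the peeking count includes every experiment that was significant on any day, it can never be lower than the count taken only at the planned end date.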
Do not change the experiment settings, this includes:
This will invalidate your test. If you need to change an element, stop the test, make the changes and restart your test.
This is called A/A testing. Essentially this means creating an experiment that has equal traffic but the variants are exactly the same. Run the test for two weeks and the results should come out very similar.
We have our own in-house tool, which we test regularly. There are other tools out there that you could use: we’ve previously used Optimizely, but there are also products like Google Optimize. Regardless of the tool you use, it’s always good to run A/A tests to make sure that the way you’ve set up your tests is correct and the data you’re collecting is accurate.
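As a rough sketch of what ‘very similar’ could mean in practice, the check below looks at two things in an A/A test: that traffic was split roughly evenly, and that the difference in conversion rate is not statistically significant. The function names, the thresholds and the visitor counts are all hypothetical, and the two-proportion z-test is one common choice rather than the method used by any specific tool.

```python
import math


def two_sided_p_value(conv_a, n_a, conv_b, n_b):
    """p-value of a two-proportion z-test with a pooled conversion rate."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # identical, degenerate rates: no evidence of a difference
    z = (conv_b / n_b - conv_a / n_a) / se
    # Two-sided tail probability of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))


def aa_test_looks_healthy(conv_a, n_a, conv_b, n_b, alpha=0.05, max_split_gap=0.02):
    """True if traffic is evenly split and the arms do not differ significantly.

    `alpha` and `max_split_gap` are illustrative thresholds, not standards.
    """
    split_gap = abs(n_a - n_b) / (n_a + n_b)
    return split_gap <= max_split_gap and two_sided_p_value(conv_a, n_a, conv_b, n_b) >= alpha


# Hypothetical counts: conversions and visitors for each identical arm.
print(aa_test_looks_healthy(1000, 10000, 1010, 10000))  # similar arms
print(aa_test_looks_healthy(1000, 10000, 1200, 10000))  # suspiciously different arms
```

Even a correctly configured setup will occasionally fail a check like this by chance (around one run in twenty at a 0.05 threshold), so a single failure is a prompt to investigate rather than proof that the setup is broken.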
Losing tests are really valuable to the business. What can you learn from them? Is there a behaviour that your users are performing that you didn’t know about previously? Is there a particular part of the experience that users are dropping out of that might need refining, whether visually or from within engineering? Whatever the outcome, it’s always worth sharing with the wider team: your insights might help another area within a product solve a problem they’re working on, or help the team think about your users in a different way.
After five years at OpenTable, A/B testing has become a really good tool in my toolbox, but I still hate losing.
Thanks to Liz Gershman for her suggestions, proof reading and helping this all make sense.
Sim helps create tools for restaurants to showcase themselves, and how diners can view and interact with this content. He’s based in London, loves food, loves running, loves solving problems.