Confession time: Sequential testing and why machines won't replace us any time soon 🎙 CRO·CAFE

Shownotes

Merritt's Sequential A/B Test Calculator
Go Ahead and Peek: The Sequential Test Calculator (explanatory blogpost)
What is sequential testing?
TLC episode on sequential test design

Book(s) recommended in this episode

Transcript

Please note that the transcript below is generated automatically and isn't checked for accuracy. As a result, the transcript might not reflect the exact words spoken by the people in the interview.

Guido: [00:00:00] So Merritt, thank you so much for joining the CRO Cafe podcast. First, of course, we'd like to know a bit more about you. Could you enlighten us to start with: why are you still working in CRO?

Merritt: [00:01:11] Yeah, and the keyword there is "still," right? Because it's been about 10 years for me. I think I go back to when I got into CRO. I stumbled my way into it, actually, when I was in college doing a graduate degree.

But what I immediately loved about it was the breadth of skills that went into CRO, right? There's the analysis part of things. There's the technical part of things and the coding. There's the statistics part of things. There's the creativity part of things, in what you're presenting to people and how you're solving problems.

And there's the visual and design part of things. And then the method by which all of that happens is that you have to be good at project management and business skills. Not only is all of that relevant to CRO, but it rewards you if you have some breadth in all of those things.

I've always been a little bit of a jack of all trades. I don't have formal training in any of those things, but I have a lot of interest in all of them.

Guido: [00:02:12] Many people call themselves CRO specialists; it should be CRO generalists.

Merritt: [00:02:16] Pretty much. Yeah, I'd make a terrible statistician, a terrible developer, a terrible designer, and probably a terrible general business person.

But you put all those things together, and I'm just dangerous enough in all of them to be decent at my job.

Guido: [00:02:33] Yeah. And you're still working in it. So do you expect that in 10 years we will have the same conversation and you're still working in CRO? Do you think things will have changed? Not necessarily for you personally, but in the field, in what CRO means.

Merritt: [00:02:45] I don't know. I think they may kick me out by then. What I just said about what makes the field really interesting to me, the diversity of skills and challenges: those challenges continue, but I honestly think it'd be hard for someone to follow that pathway today and be a generalist.

I think as experimentation becomes a bigger part of the DNA of organizations, it will require more specialization, because more people will have the general knowledge of how to do things and you'll just need people who are really good at one particular thing. If I look into the future, that's what I'm seeing.

Guido: [00:03:17] Yeah, exactly. And to give people some context: you're optimization director at Search Discovery. What do you do?

Merritt: [00:03:25] That's a great question.

Guido: [00:03:28] Or should I say: what do you say at parties, or at a family reunion, when people ask you what you do?

Merritt: [00:03:34] Yeah, actually one of my favorite things to ask my family is, "What do you guys think I do for a living?" I get all sorts of answers, from "make stuff online" to design. People think I design websites.

I don't design websites. I tell people that I run experiments on people who don't know they're getting experiments run on them, on the internet, when they visit websites and applications. On a day-to-day basis, my job is working with clients, mainly to set up and operate optimization programs.

Sometimes that involves executing experiments, not nearly as much as I would like, because I think that's the really fun part of the job. A lot of the time it's just working with organizations to figure out how to set up the capability inside their organization. Sometimes it's just troubleshooting or support in some way, or office hours.

But there's a little bit of everything. No two days are the same.

Guido: [00:04:27] And one of the things you've been working on lately is building an A/B test calculator. Please tell us why we should be using yours and not all the others. You're not the only one with an A/B test calculator, let's be fair.

So why did you feel this urge to build your own?

Merritt: [00:04:41] Yeah, and to be honest, I don't care if you use mine.

I'm pretty darn sure I'm the number one user of my own calculator, so I can tell you who I built it for. I've been using calculators that have been free and available online for years. I rely on them to do my job effectively, and there are some great resources out there, and the list of resources only grows.

So I'm not trying to be the be-all and end-all for everyone's needs. But sequential testing, and I don't know how deeply you want to get into this, was something that I actually just started out wanting to understand at a deeper level, to get the statistics a little bit better.

I've used a number of calculators out there. I had a front-row seat when Optimizely went to a sequential stats engine, which is how they framed it. I've been a user of the analytics-toolkit.com resources and the AGILE calculator that Georgi built there. I've used Evan Miller's simple sequential calculator.

With all of those, I wanted to understand a little more, and I wanted to be able to self-serve. So I got into this and, to my knowledge, this is the only one that is free and could be called robust out there today.

I'm sure that won't last for long, but right now I think it's a unique resource and, like I said, I use it all the time.

Guido: [00:06:11] For those who are not familiar, what is specific about sequential A/B testing?

Merritt: [00:06:17] So if you're anything like me, at some point you heard that you can't peek at your tests.

That was a revelation to me the first time I heard it, because I was peeking at my tests, because the testing tools encouraged us to do it, right? Even today, some of the biggest testing tools out there are showing you stats that aren't valid, because they're rerunning the test every time you look at it.

And so that's peeking. That's not how fixed-horizon testing works, which is the breed of testing that most of us use. The alternative to that would be Bayesian testing, but I have neither the wits nor the stamina to go down that road today.

In the testing that most of us are used to, you create a sample size plan, and you're supposed to stick to that sample size. You're supposed to not look at the results, or at least not calculate a statistical result, until you collect the full sample size. By peeking you increase your error rate, your type I error rate.

And that's a problem. Even peeking at every 10% of your sample size, you may go from a 10% type I error rate to a 30% type I error rate over the course of your test. And that's with a flat result. So peeking is bad, and sequential testing basically allows you to peek responsibly.
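
Editor's sketch (not from the episode): the inflation described here can be checked with a small simulation. The code below runs A/A tests, where both arms share the same true conversion rate, and counts how often a naive two-sided z-test crosses the significance threshold at any of ten equally spaced peeks. The 10% conversion rate, the 500 visitors per arm, and all function names are illustrative assumptions, not settings from any calculator mentioned in the conversation.

```python
import math
import random
from statistics import NormalDist


def z_stat(conv_a, conv_b, n):
    """Two-proportion z statistic with n visitors in each arm."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    return 0.0 if se == 0 else (conv_b / n - conv_a / n) / se


def false_positive_rate(looks, n_per_arm=500, rate=0.10, alpha=0.10,
                        sims=2000, seed=1):
    """Share of A/A tests flagged significant at ANY of `looks` equally spaced peeks."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # naive two-sided threshold
    step = n_per_arm // looks                     # visitors added per arm between peeks
    hits = 0
    for _ in range(sims):
        conv_a = conv_b = 0
        for k in range(1, looks + 1):
            # Both arms convert at the same true rate, so any "win" is a false positive.
            conv_a += sum(rng.random() < rate for _ in range(step))
            conv_b += sum(rng.random() < rate for _ in range(step))
            if abs(z_stat(conv_a, conv_b, k * step)) > z_crit:
                hits += 1
                break  # the tester stops at the first "significant" peek
    return hits / sims


single = false_positive_rate(looks=1)
peeking = false_positive_rate(looks=10)
print(f"1 look: {single:.3f}   10 looks: {peeking:.3f}")
```

With one look the rate should land near the nominal 10%; with ten uncorrected looks it should come out well above that, which is the behavior a sequential design is meant to prevent.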

It allows you to guarantee that you maintain that type I error rate as you go through a test and look at the results. And I like to say that peeking is useful in the face of extreme results. The MDE, the minimum detectable effect or minimum effect of interest, whatever you call it, is the thing we don't know much about before we run the test,

Guido: [00:08:08] but which you need in order to calculate your sample size. Yes.

Merritt: [00:08:12] There's this whole test design around this thing that we know so little about. If there's one thing we learn in testing, it's that we have no freaking clue what the outcome of a test is going to be.

We don't know if it's going to be good or bad or really good, and the most surprising results are where testing becomes most valuable. So acknowledging that this is the hardest thing to predict, and then giving yourself a tool to run an efficient test regardless of what that impact is:

that's why sequential testing is really valuable.

Guido: [00:08:43] Yeah, I had that exact case today, actually. They wanted to test three or four variants of a promotion on the website. Normally when you see promotions on the website, the default is: okay, you can get this amount of discount,

but only when you order from a certain amount. You can get the 25% discount, but only if you order over 50 euros or over 100 euros. That's the default. One of the variants was: okay, you get this discount, period, without any minimum order value requirement. And the local teams said, "This is very risky.

We can do the test, but maybe let's do it on only 1% of the visitors." And I'm like, okay, but then the test is going to run forever. We don't have the sample size to pull that off, so we need to do an equal split. But then they come with requirements and say, okay, but then we need to have some stopping measures in place.

We want to run this for four weeks, but if after one week we already find out that this is going extremely negatively, we need to be able to stop it. This is a perfect case for sequential testing, I think. So how would you go about setting this up and then testing whether the effect is big enough?

Merritt: [00:09:57] That is exactly one of the key use cases for sequential testing.

And I didn't mention this, but

Guido: [00:10:03] I don't mind. Personally, I would just let the test go for four weeks and then see what happens. But business teams might think differently.

Merritt: [00:10:12] Yeah. One thing I see a lot is people checking in on a test early on, right? You're a couple of percent, 2 to 5%, into the sample and you're checking to see if anything catastrophic is happening.

A lot of times, I don't see people having a good framework for saying, okay, what is catastrophic in this case? Is this decline that I'm seeing significant enough to be concerned about? The other thing you get out of sequential testing is a rubric by which to measure a decline early on

and to determine whether it's a statistically significant decline or not. One of the concepts, at least in my sequential calculator and many others, is a futility boundary: should you give up early? So in your case, or in many others, you're able to say, okay, we'll plan the test.

This is the maximum duration of the test, but we're going to check in early on, at 5%, just to see if there's anything really extreme going on in the results. Hopefully you're using something like a decision plan up front, where you can say: if we cross this boundary, then we're going to pull the plug. That's what we'll call a significant decline.

Otherwise we're going to keep going. Even if it's down 5%, if that's not past our futility boundary, we're going to stick with it.
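
Editor's sketch (not from the episode): a decision plan of the kind described here can be written down before the test starts as a list of check-ins, each with an upper (efficacy) and lower (futility) boundary on the z-score. The boundary numbers, the `Look` class, and the `decide` function below are hypothetical placeholders chosen for illustration; real boundaries would come from a sequential design tool.

```python
from dataclasses import dataclass


@dataclass
class Look:
    fraction: float    # share of the planned maximum sample collected at this check-in
    efficacy_z: float  # stop and call a win if z is at or above this value
    futility_z: float  # stop and give up if z is at or below this value


# Hypothetical plan: the z values are illustrative, not output from any calculator.
# Positive z means the variant is ahead of control.
PLAN = [
    Look(fraction=0.05, efficacy_z=4.0, futility_z=-2.5),
    Look(fraction=0.50, efficacy_z=2.5, futility_z=-0.5),
    Look(fraction=1.00, efficacy_z=1.7, futility_z=1.7),  # boundaries meet at the end
]


def decide(look_index, z):
    """Apply the pre-registered rule for one check-in."""
    look = PLAN[look_index]
    if z >= look.efficacy_z:
        return "stop: significant improvement"
    if z <= look.futility_z:
        return "stop: futility"
    return "continue"


# An early dip that has not crossed the futility boundary: keep the test running.
print(decide(0, -1.2))
# An early collapse that has crossed it: pull the plug.
print(decide(0, -3.0))
```

The point of fixing the plan up front is that "is this decline catastrophic?" becomes a lookup rather than a judgment call made while staring at the dashboard.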

Guido: [00:11:30] Yeah. And I can imagine that it's quite difficult to include such a boundary in your calculations, because it can be different for each test, right? Of course it can be revenue based.

If you go down 5%, then that's the boundary. But maybe my experiment is not about revenue at all. Maybe it's about newsletter subscribers, and you might say, okay, we're not necessarily losing money, so it's fine if it goes down 10%.

Merritt: [00:11:54] That's actually a really good point.

There are many different ways to do sequential testing. If you start reading the literature on it, it becomes a little overwhelming because, like all stats literature, there are all these names, like Pocock and O'Brien-Fleming, and I can't even pronounce some of the names because they're very foreign to me. All these people have come up with different methods of approaching sequential testing.

Some of them are more conservative than others, and those who are providing calculators have actually made some decisions about how conservative those methods are going to be. I actually hope, at some point, to allow someone to shift those parameters a little bit, to choose something that's more conservative or less conservative in terms of their approach to the decision boundaries.

As of right now, it's fixed. I have been pretty transparent about how those parameters are set. Those are fairly technical details in the test design, but it's good for us to know that they're out there. It's just like when you're designing the test and you're saying, okay, what confidence do I need?

What power do I need? And what's my MDE? There are a lot more parameters to your tests that you can configure to say, this is how aggressive my decision boundary needs to be, and usually it ends up affecting the shape of it. I realize we're talking about this kind of abstractly, what these decision boundaries are, without being able to point to them. But yeah, it's like you said: you do need a pretty diverse toolkit to approach these things, because the problems are different and your risk levels are going to be different.
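
Editor's sketch (not from the episode): to make the "shape" of a decision boundary less abstract, the code below calibrates two classic shapes by simulation for five equally spaced looks at a two-sided 5% error rate. A Pocock-style boundary uses the same z threshold at every look; an O'Brien-Fleming-style boundary starts very strict and relaxes toward the end. The calibration-by-simulation approach, the function name `calibrate`, and the parameter choices are the editor's assumptions; the episode does not say which method Merritt's calculator uses.

```python
import math
import random


def calibrate(looks, alpha=0.05, sims=20000, seed=7):
    """Estimate two-sided boundaries that hold the overall type I error near alpha.

    Under the null, the z-score at look k behaves like S_k / sqrt(k), where S_k
    is a sum of k independent standard normal increments (equal-sized looks).
    """
    rng = random.Random(seed)
    flat_max, scaled_max = [], []
    for _ in range(sims):
        s = 0.0
        m_flat = m_scaled = 0.0
        for k in range(1, looks + 1):
            s += rng.gauss(0, 1)
            m_flat = max(m_flat, abs(s) / math.sqrt(k))          # |z_k|
            m_scaled = max(m_scaled, abs(s) / math.sqrt(looks))  # |z_k| * sqrt(k / K)
        flat_max.append(m_flat)
        scaled_max.append(m_scaled)
    idx = int((1 - alpha) * sims)
    c_flat = sorted(flat_max)[idx]      # constant threshold (Pocock-style)
    c_scaled = sorted(scaled_max)[idx]  # final-look threshold (O'Brien-Fleming-style)
    pocock = [c_flat] * looks
    obf = [c_scaled * math.sqrt(looks / k) for k in range(1, looks + 1)]
    return pocock, obf


pocock, obf = calibrate(looks=5)
for k, (p, o) in enumerate(zip(pocock, obf), start=1):
    print(f"look {k}: Pocock-style z >= {p:.2f}   O'Brien-Fleming-style z >= {o:.2f}")
```

The flat shape makes early stopping easier but demands more evidence at the final look than a fixed-horizon test would; the steep shape makes early stopping rare but leaves the final threshold close to the familiar fixed-horizon value. That trade-off is the "how aggressive" dial discussed above.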

Guido: [00:13:26] For people who want to use your sequential test calculator specifically, what kind of data do they need to have as input?

Merritt: [00:13:32] There's only one thing you need to prepare in excess of what you normally need for a normal fixed-horizon test: you need to plan out the number of check-ins, the number of analyses that you're going to do on the data.

I usually plan to check in once or twice on the data. Usually the reason you're doing a sequential test is that you're looking at a pretty long test horizon, something like three, four, five, or six weeks. And so sequential tests can help you get to that answer more efficiently.

Other than that, it's your alpha, which is one minus your confidence level. It's your beta, which is one minus your power. It's your base conversion rate, the base rate that you're building off of, or your mean if it's a continuous metric. By the way, my calculator does not yet support continuous metrics.

We'll get to it eventually. So: base rates. You need the minimum detectable effect, or minimum effect of interest, or effect size, however you want to title that one. And your tails. You actually get a fairly different test design if you select two-tailed versus one-tailed. I know that's probably a topic we could chat about for hours.

Normally I go one-tailed. If it's two-tailed, the boundaries look a little bit different. Beyond that there are some optional inputs. We have support for putting in your current traffic size for your test area, how much traffic you get in a week, a month, or whatever.

And then you can estimate how many days it's going to take you. It's just a utility function. It's not necessary, but it's helpful.
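
Editor's sketch (not from the episode): the inputs listed here are the same ones a fixed-horizon sample size calculation needs, so the code below shows that baseline calculation using the standard normal-approximation formula for comparing two proportions, plus the "how many days" utility. The function names and example numbers are illustrative assumptions. A sequential design with interim looks would typically have a somewhat larger maximum sample than this fixed-horizon figure, and this code does not compute that.

```python
import math
from statistics import NormalDist


def fixed_horizon_sample_size(base_rate, relative_mde, alpha=0.10, power=0.80, tails=1):
    """Visitors needed PER ARM to detect a relative lift of `relative_mde`."""
    p1 = base_rate
    p2 = base_rate * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / tails)  # alpha = 1 - confidence
    z_beta = NormalDist().inv_cdf(power)               # beta = 1 - power
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)


def days_to_run(n_per_arm, weekly_visitors, arms=2):
    """Optional utility: days needed given the weekly traffic to the test area."""
    return math.ceil(n_per_arm * arms / (weekly_visitors / 7))


# Example: 10% base rate, 10% relative MDE, 95% confidence, 80% power.
n_two = fixed_horizon_sample_size(0.10, 0.10, alpha=0.05, power=0.80, tails=2)
n_one = fixed_horizon_sample_size(0.10, 0.10, alpha=0.05, power=0.80, tails=1)
print(f"two-tailed: {n_two} per arm, one-tailed: {n_one} per arm")
print(f"days at 20,000 visitors per week (two-tailed): {days_to_run(n_two, 20000)}")
```

The one-tailed design needs fewer visitors than the two-tailed one at the same alpha, which is one concrete way the choice of tails changes the test design.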

Guido: [00:15:10] Yeah. Can you explain to us why it matters how often I peek? My test is going to run for four weeks.

How often I peek is not going to change the results. So why does it matter for the calculator that I predetermine how often I'll look?

Merritt: [00:15:30] I will tell you right now, I'm going to fail at explaining this easily. This is one of those things where, if you don't get it, it's really hard to understand in the abstract.

And then once you get it, you instantaneously lose the ability to communicate it effectively to someone who doesn't. But I'll give it a shot. In basic probability, you learn the probability of a thing happening. Let's say the probability of getting heads if you flip a coin: 50%, if it's an unweighted coin. Okay. But what's the probability of hitting heads if you flip a coin twice? I should know this offhand, but there's a really simple calculation for doing that. The probability of getting heads compounds. And it's the same thing in testing, right?

The probability of getting a false positive the first time you check on your data is set by whatever confidence level you're setting for your test. That's actually interesting to know, because a lot of us think you have to wait until the end of the test to check your data before you can be sure of your false positive rate, or be sure that you have the right confidence level, but that's not true.

You can look at data after you've collected 50 samples and run a t-test on it. If it says that you're 90% confident, you can be 90% confident that you don't have a false positive there. That doesn't change. However, if you run that test and then run it again in a week, you've just increased your odds of having a false positive, just like flipping a coin twice

increases your odds of getting heads. And so every successive time that you run the test, and I realize we're mincing words a little bit here: "running the test" can mean collecting your full sample size, but running the test is also just calculating the test statistic, the z-score, whatever your test statistic is.

So every time you calculate that test statistic, you are giving yourself another chance at a false outcome. Terrible job explaining that.
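
Editor's sketch (not from the episode): the "really simple calculation" for the coin is one minus the chance of missing every time. The helper name below is the editor's own. The same formula gives an upper-end intuition for repeated significance checks, under the simplifying assumption that each check is independent.

```python
def chance_of_at_least_one(p, tries):
    """Probability that an event with per-try probability p happens at least once."""
    return 1 - (1 - p) ** tries


# Heads at least once in two flips of a fair coin.
print(chance_of_at_least_one(0.5, 2))  # 0.75

# Ten checks at a 10% false positive rate, IF the checks were independent.
print(round(chance_of_at_least_one(0.10, 10), 3))
```

Successive peeks at the same accumulating data are not independent, because each look reuses the earlier visitors, so the real inflation is smaller than the independent-checks figure. That is consistent with the roughly 30% Merritt cites earlier for ten peeks at a 10% error rate.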

Merritt: [00:17:50] You're going to fact-check this afterwards, right?

Guido: [00:17:54] We'll put the actual description in the show notes. Yeah. This is often where it goes wrong, even when people know this.

Merritt: [00:18:03] I can't say that people make a ton of bad decisions, but I can say that they're not controlling their errors. They don't know which decisions are bad. The fact that there are error rates to begin with guarantees that we're going to make bad decisions at some point.

It's a question of the rate of bad decisions that you're making, or the rate of ignorance in your decision-making, and it goes to an unquantifiable and unquantified amount.

Guido: [00:18:33] Yeah. Basically you need to determine how comfortable you are with being wrong.

Merritt: [00:18:39] Yeah. And if you're peeking, if you're not using a sequential method and not doing something to control your errors, you're just not quantifying that

precision in your decision-making. I would say the biggest issue you'll find with people making decisions early is that they're going to favor large effects. I guess this is true in all sequential testing, but you're favoring large effects and you're probably biasing your effect sizes upwards.

And so where you may have a confidence interval saying this could be a change of between 1% and 30%, your mean change is probably going to be on the high side. Say you're taking 20 winners and the average effect size in those winners was 10%.

If you've been using a sequential method, you've probably biased that upwards, and the true average is actually much lower than that. That's going to be the case whether you're using a sequential method or not: if you're peeking, you're going to be responding to higher effect sizes than actually exist.
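
Editor's sketch (not from the episode): the upward bias in winners' effect sizes can be illustrated with a simple selection simulation. Every simulated test below has the same true lift, but only the tests that clear a one-tailed significance threshold are counted as winners. The true lift, standard error, and function name are illustrative assumptions, and the sketch uses a single look per test; stopping early on extreme interim results adds to the effect shown here.

```python
import random
from statistics import NormalDist, mean


def average_winning_effect(true_lift=0.02, std_error=0.02, alpha=0.05,
                           tests=20000, seed=3):
    """Average OBSERVED lift among tests that reached significance."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha)  # one-tailed threshold
    observed = (rng.gauss(true_lift, std_error) for _ in range(tests))
    winners = [lift for lift in observed if lift / std_error > z_crit]
    return mean(winners), len(winners) / tests


avg_lift, win_rate = average_winning_effect()
print(f"true lift: 2.0%   average lift among winners: {avg_lift:.1%}   win rate: {win_rate:.0%}")
```

Because a test only becomes a winner when noise pushes its observed lift above the threshold, the winners' average overstates the true lift, which is why a portfolio of reported wins tends to look better than what shows up in the business metrics later.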

Okay.

Guido: [00:19:43] Are there scenarios where you would still say, please don't use sequential testing, just let the test run anyway?

Merritt: [00:19:50] Yeah.

Guido: [00:19:53] One example, and maybe I'm wrong, is if there's some kind of cycle, some cyclic pattern in the behavior of your clients. I can imagine that might matter where you would expect, for example, different behavior in the first two weeks versus the second two weeks or something.

Merritt: [00:20:11] Yeah, absolutely. That's a consideration: you've got to put the lens of business cycles and seasonality over your data. That's not necessarily a reason, in my view, to not do sequential testing, because just because those cycles exist doesn't mean you want to ignore a really negative result. Say you have tons of traffic, and reaching sample size in a reasonable amount of time is not an issue.

In fact, you're testing on smaller subsets of the data because you have so much traffic, and you still want to run a test for a week, because I do typically recommend people run a test for at least a week. Two weeks is probably preferable, so you get a couple of those short business cycles in.

Even then, things can tank and do really poorly, and that could be evident early on. So having this statistical calculation that you can make early on in the test is still valuable. But you're right: if you haven't run a test for a week, you should think twice about ending the test early on a negative result,

because you may not have seen a good cross section of your users. You're never going to have a sample of future users, right? Every time we sample users, we're assuming that the sample represents all users today and all users tomorrow. But you do want to make an effort, and that's why we'll try to run through a business cycle or two, to get a more representative sample.

Guido: [00:21:43] Yeah. And you already mentioned that we could discuss one-tailed versus two-tailed for hours. And then of course we have the maybe even bigger one: frequentist versus Bayesian.

Merritt: [00:21:56] No, it doesn't only work for one. In fact, I use an R package to make the application, that sequential calculator, work, and that R package actually supports Bayesian testing. We're going to hit the limits of my expertise real fast with this,

Guido: [00:22:16] because

Merritt: [00:22:17] I have a couple of books on Bayesian stats and, as hard as it is to read a stats book to begin with.

Guido: [00:22:24] No, yeah, I didn't want to go into Bayesian versus frequentist. I just want to know: is sequential testing purely something that's limited to people doing frequentist experiments, or can both benefit?

Merritt: [00:22:35] Yeah, no, they do apply to Bayesian stats, and there is support, again, in the R package that I use for it.

I don't grok it as well. And I also know that there's a lot of, hard feelings is maybe the wrong term, but there's a lot of criticism about Bayesian stats, about the way they present the statistics and about knowing the assumptions that are going into those stats.

People will say that Bayesian stats are immune to peeking, and that's maybe glossing over some of the details a little bit. It's not completely true. You still do need to worry about type I error rates. Bayesian statisticians, or people who are proponents of

Bayesian methods, will say that the test statistic is valid all the time, whereas in frequentist stats the p-value does not always tell the truth. They'll say that the stat you get from Bayesian is always true. And again, I think that glosses over some of the detail and the assumptions that are being made in Bayesian stats.

So yeah,

Guido: [00:23:44] To me, it feels like sequential testing is a good mix. You want the more rigorous method, and frequentist is, in my view, the more rigorous method. It's also something I grew up with; it's what universities mainly use when they run experiments. But you also want

to make the approach work for businesses, right? We're not applying this just to do science and figure out exactly what's happening. You also want to make business decisions here. And this is a good way to at least capture those moments where things can go wrong quickly if you're not careful, and you're actually losing a lot of money.

Merritt: [00:24:22] Yeah, I'm with you on that. And it's funny to say that sequential testing is a way to get some of the benefits that people tout from Bayesian testing, but to do it in the language and the format that you're used to. It's like the stats you grew up with, the stats you learned in

Guido: [00:24:40] college.

Merritt: [00:24:42] And that's a funny way to look at why people do it. Maybe there's just a bunch of inertia behind it. But I'm with you on that. And I certainly take some cues from the medical and research communities, which seem to overwhelmingly favor frequentist stats.

Guido: [00:24:58] Yeah. So, stepping back a bit:

you said that you've been doing this for over 10 years now. Are there any insights you think you've gathered in those 10 years that other people, in CRO or in e-commerce in general, haven't picked up on yet?

Merritt: [00:25:17] I'm a little bit cynical. I've had too many servings of humble pie to be overconfident about any single one of my opinions. But one of the things that sticks with me: someone asked Daniel Kahneman, he's the Thinking, Fast and Slow author, the granddaddy of biases and heuristics, right? Someone asked him once if understanding the biases that he did, the follies in human thinking, made him more resilient

to those biases. And he said, no, not really. I think that's true of us in the CRO community, right? We tend to look for ways to, exploit is maybe too loaded a term, but we look for ways to speak to some of those biases and to work within the system of human psychology.

And you would think that we'd be really good at understanding how our own thinking is flawed. I think we're not. And I think that manifests itself in myriad ways in the way we approach our work. One of those is that sometimes we make emotional decisions.

We live in a world of data, and we use that data like a security blanket, right? It makes us feel good about the decisions we're making. But I think there's a propensity for us to still make emotional decisions and to use the data as a weapon to rationalize those emotional decisions.

I've seen it in my own work and in other people's work, where you take these hard numbers and these stats and you just put enough spin on them, enough of a personal interpretation. That's where a lot of p-hacking comes from. And even if you build up more and more sophisticated stats methods to make your conclusions seem more bulletproof, opinion still seeps in.

I think that's something we have to actively combat throughout our careers, and maybe babysit others as well. But that's my opinion: there's way too much opinion in all the work that we do, as much as we want to believe that we're using cold numbers to make decisions.

Guido: [00:27:36] Yeah. There should be an app that warns you about all the fallacies, all the biases that are going to drive your decisions. That's probably not there yet.

In my experience, it does help if you talk a lot with people who do the same as you do, but outside of your company. The interest groups, of course nowadays everything is online, so maybe it's a Facebook group for you, or a Slack group for people doing CRO, or meetup.com,

when we're back to doing that. Talk to those people to discuss the experiments that you run. I think that always helps me to figure those out, or to get a hint as to where I'm going wrong.

Merritt: [00:28:14] Yeah. There's another thing that comes to mind. You're talking about insights; let's call these opinions that I have, that's maybe a safer way to present them.

I like to say that ideas are a dime a dozen. In CRO we learn a lot about those same biases, the tactics that work, the heuristics that we should evaluate a page with, how design works, how colors work, how words work, and the impact they have on consumers and people as they experience different user interfaces.

But at the end of the day, our ability to use those things to actually make a difference to business metrics is pretty bad. We need to know that stuff, and it makes our ideas better. But going from a 25% win rate to a 35% win rate, that's still a 65% loss rate. For the record,

seven out of 10 of your ideas are still not good. So, again with the humble pie that I've eaten over the years, I have to remind myself constantly, whenever I hear someone sharing an idea that I think is absolutely terrible, poorly reasoned, in bad taste, that, you know what, my ideas aren't any better.

I've put my weight behind stuff that I thought was sure to win and it flopped or made no difference. We're just not that good at finding the solution the first time. That's not to say that it's not worth putting in the hard work to learn about what tends to work, what should work.

At the end of the day, our ideas just aren't that great. And so when I find myself arguing about what test we should move forward with, or how we should prioritize this idea, man, we should all just do ourselves a favor and acknowledge that

Guido: [00:30:04] Yeah, we don't know.

So I have a statement for you then: if your win rate is 35% or up, you're not testing enough.

Merritt: [00:30:14] Yeah. And honestly, most of the organizations that I work with aren't testing enough, because once they have learned to flex that muscle really well and really quickly, they don't need a consultant or an agency to come in and show them how to do it anymore.

They're beyond us. So yeah, that's true. Those are just the problems that I work with on a daily basis. But the mature organizations, those that run a lot of tests, their win rates go way down: one in 10, maybe even less than that. And that should tell us something, right?

Guido: [00:30:46] Yeah. I think the public number that Booking.com always uses in their presentations is one in 10. And I know that at Microsoft they do all these automated tests on their own Bing search engine, and it's one in a couple of thousand experiments that actually moves the needle.

Merritt: [00:31:06] And I've heard people like Tom whistling say on stage that our ideas are so bad that we shouldn't even bother with them. The future of CRO is letting the machine try every single iteration of an experience, because we are limited in what we can think might work, and we bring a whole bunch of baggage behind us,

assumptions that may not be right. And so just letting the machine take care of design is the future of CRO. I don't know that I agree with that. Again, I do think it's worth putting in the hard work to understand how brains work, how interfaces work, what's good design, what's bad design,

what tactics we should try, or what other people have tested that works. I do think it's worth educating ourselves on that. But especially the more tests you run, and the more you reach that local maximum, where you're actually optimized, the harder those wins are to find.

Guido: [00:32:04] Yeah. So I ran experiments along those lines a couple of years ago with what was then called Sentient AI, this evolutionary-algorithm A/B testing or multivariate testing platform. And sure, you can feed it all these variants, but basically it's like a multi-armed bandit algorithm that runs in separate stages.

But the limitation those machines still have, I think, is that of course they can find the optimum solution among the variants you give them. Is that where you think we, as CRO people, will still be valuable? Maybe the machine can figure it all out and do all that completely automated, but are we still needed in the future to figure out what the machine is going to test to begin with?

Merritt: [00:32:53] And the more power you give to that machine to just operate independently, right? It's literally going to move every pixel to every point on the screen. And it may learn very quickly that way what works and where that pixel needs to be. But somebody out there is going to see Greek letters in an English setting, right? Somebody is going to see something that doesn't make any sense, where a human could very easily say, hold on, no, we're not going to try that one, but operate within these boundaries. So yeah, I don't see the machines taking over completely anytime soon.

They're just not smart enough to get the first couple of iterations right. They have to be trained and learn. I don't know. Maybe there's some solution out there where you can have a training set internally that helps the machine learn some basics, and then from there it can just go wild.

Yeah. That'd be interesting. But here's the other limitation, right? You have to be really confident that the metric you're optimizing for encompasses everything that you're interested in. And that's even a problem today in our decision-making, right? You may find something that's ugly and hideous and evokes a gut-wrenching response from people, but it's super effective at drawing clicks, or getting people to check out, or getting people to accidentally enter their credit card information and buy. And you're just not measuring how they feel when they see that.

Or what's their likelihood of coming back six months from now to buy from me again, or next year to buy from me again? Those metrics always seem to be just a little out of reach, and we want to acknowledge them and use them in our decision-making models. If you just set a machine to optimize for that stuff, it's not going to take every data point into account. I don't know.

Guido: [00:34:51] That's what a lot of CROs are struggling with. You probably want to optimize for lifetime value, but that's a metric that you often don't have available in Google Analytics or whatever you use to analyze your A/B tests.

So you have to settle for proxies like conversion rate or average order value or revenue per visitor or whatever you're doing it with, but it's very short-term, what we're optimizing for right now.

Merritt: [00:35:14] Yeah. Or what if you're a luxury brand, and just the feeling of it, the feeling of what you're presenting, is every bit as important as the action that the user takes, right? You have to prioritize how people experience your brand. And if you just run a test on conversion rate, that's not necessarily what you're optimizing for.

Guido: [00:35:43] Yeah. And so if you're listening to this: this is basically the survival instinct of two people working in CRO, trying to figure out how our job can still exist in 10 years. We're desperately looking for ways to figure out, okay, it's not going to happen.

Merritt: [00:36:04] How many of us have tried to dictate something to our phone? I don't know about you, but for me there are certain words, like every time I say the word "come", as in "come on over here", it'll say "cone". Every single time. And every time I see that, I'm like, yeah, no, the machines aren't coming for us anytime soon.

We're good. We're fine.

Guido: [00:36:29] My main frustration that pops up right now is, I think it's mainly on LinkedIn, when I try to reply to someone on a message that they sent, I think it's even on public posts, and I try to @-mention them. And then there's this big list of people who have the same starting letter, but not the person who wrote this, the person who posted.

I'm like, this should be the most obvious one. Why is that not there? So no, we're all safe for the next 10 years. We're good. Yeah. Very good. Back to your calculator. You just released version two. What's going to happen in the next 12 months? Is there going to be a version three?

Is there already a roadmap?

Merritt: [00:37:12] I do have a backlog of ideas and features and stuff that I'd like to work in. I'm really excited about some of the things that came out in v2, some stuff to geek out about. But yeah, in the next 12 months I do want to put in support for continuous metrics. It's not that hard.

It's just one of those things that takes the time to do it and reframe everything and then work that into the user interface. Honestly, for me though, I didn't set out to build this tool set. I set out to just understand things better, and it took me to: I need a tool to be able to do what I want here.

Guido: [00:37:50] That's scratching your own itch, but that's a great way to start, right?

Merritt: [00:37:54] Yeah. And I think there are a number of other tools that could be valuable in our space. One of the things that I've learned is that simulating can actually improve your decision-making.

If you can see the way scenarios play out, you can see the impact and the weight of what you're doing. In fact, simulation was one of the ways that I got into this in the first place: I wanted to know what happened when people peeked. I wanted to know what the actual impact of peeking was.

And the only way I could figure out how to do that was to simulate a bunch of data and look at different scenarios. What happened if someone peeked after 50% of the sample versus after 10% of the sample? This is how dumb I am: I actually thought there was maybe some way that you could have a rule of thumb on how to peek without messing up your stats.

And that's not the case. Don't do it. But yeah, I think there are some tools and stuff that we could add into different calculators that would give people an ability to simulate data and see how those things play out. Also, from a personal perspective, in the next 12 months I do want to get through Bayesian.

I've read so many articles and, I dunno, I'm thick-skulled with this stuff. But yeah, I want to be able to give you a clear answer on how you use sequential testing with Bayesian stats next year.
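[Editor's note] The kind of peeking simulation Merritt describes can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not his calculator or his method: it simulates A/A tests (no true difference) on a binary conversion metric, applies an unadjusted two-proportion z-test at each peek, and stops at the first "significant" result. The function names, the 10% base rate, the sample size, and the number of simulations are all hypothetical choices made for illustration.

```python
import math
import random


def z_test_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value of a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation observed, so no evidence of a difference
    z = (conv_a / n_a - conv_b / n_b) / se
    return math.erfc(abs(z) / math.sqrt(2))


def false_positive_rate(peek_fractions, n_per_arm=1000, base_rate=0.10,
                        alpha=0.05, n_sims=1000, seed=1):
    """Share of simulated A/A tests declared 'significant' at any peek.

    peek_fractions lists the fractions of the planned sample at which the
    (unadjusted) test is calculated; [1.0] means no peeking at all.
    All default values are illustrative assumptions.
    """
    rng = random.Random(seed)
    checkpoints = sorted({max(1, round(f * n_per_arm)) for f in peek_fractions})
    false_positives = 0
    for _ in range(n_sims):
        conv_a = conv_b = seen = 0
        for checkpoint in checkpoints:
            # Collect visitors up to the next peek; both arms share one true rate.
            for _visitor in range(checkpoint - seen):
                conv_a += rng.random() < base_rate
                conv_b += rng.random() < base_rate
            seen = checkpoint
            if z_test_p_value(conv_a, seen, conv_b, seen) < alpha:
                false_positives += 1  # stopped early on a spurious "winner"
                break
    return false_positives / n_sims


if __name__ == "__main__":
    scenarios = {
        "final analysis only": [1.0],
        "peek at 50%, then final": [0.5, 1.0],
        "peek at 10%, then final": [0.1, 1.0],
        "peek every 10%": [i / 10 for i in range(1, 11)],
    }
    for label, fractions in scenarios.items():
        print(f"{label}: {false_positive_rate(fractions):.3f}")
```

Because every simulated test has no real effect, any rate above the chosen alpha is pure inflation from peeking. Running the scenarios side by side lets you compare a single early look against repeated looks, which is the comparison described in the conversation.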

Guido: [00:39:17] Yeah. So to continue on that, any other things you're reading, or tips for our listeners to develop themselves in this area?

Merritt: [00:39:28] Yeah, a couple of things. The number of bloggers and content creators out there has just blossomed in the past five years in terms of experimentation. If you go on Medium, even if you only do a free subscription, the amount of A/B testing content on Medium is huge. I get a weekly digest and there are new articles every week from people that I've never heard of.

A lot of the more common experimentation blogs are on there. Netflix does their stuff on Medium. So I'd say go subscribe to a couple of topics of interest on Medium and see what comes out in a weekly or daily digest. The other thing that I think is really astonishingly good: Alex Birkett, over the past year, has been building up some of the content on his personal blog, alexbirkett.com, and his stuff is really good.

It's really thorough and well thought out, and he comes at it as someone who has learned in the trenches. So his blog is a great place for someone to actually get started today in understanding A/B testing, because it really spans the breadth. It's comprehensive.

Guido: [00:40:32] He's a CRO specialist, right? What is he doing today?

Merritt: [00:40:36] I think he's like a freelancer; today he's doing his own company. But he's been a writer. He worked for HubSpot. He worked for CXL. I think he did a bunch of content for CXL, and he's just been putting out great stuff.

Guido: [00:40:53] Speaking of people, who should we invite for another podcast episode?

Merritt: [00:40:59] Have you spoken to Matt Gershoff? Not yet? Okay, you need to get Gershoff. Matt is great. He owns and runs Conductrics. He's very present in the industry and he's just a thoughtful, kind guy, as rough as he is around the edges, as prickly as he is around the edges sometimes.

I will tell you that he's a very smart guy, very willing to educate people. And he's just a clear thinker about a lot of things.

Guido: [00:41:27] I'll connect with him and I'll tell him you tapped him on the shoulder. Good. Thank you so much for sharing this with us, Merritt. That's really interesting, and I'm definitely looking forward to the Bayesian update.

My final question for you: how often do you peek personally? What's your rule for it?

Merritt: [00:41:48] Remember, peeking is just when you calculate the statistics. No, I use this method in almost every test that I design for clients. The only time I don't is when there's a stats engine with the test tool that I trust and that I think is putting out good stuff. There are some tools out there that I think have good stats methods behind them.

But for the most part, I'm doing all of it offline using my calculator.

Guido: [00:42:12] You can name names, but let's just not name the bad ones. The very good ones we want to know.

Merritt: [00:42:19] Google Optimize has its Bayesian engine. And yeah, they're not completely straightforward with it.

And there are a lot of gotchas with their stats, but I tend to trust their engine. I trust Optimizely's engine as well. I trust VWO's engine. Conductrics is great on the machine learning side and I trust their stats engine. The problem with some of the others is not that their stats are wrong. It's that they're easy to abuse.

I think of it in terms of: can I send my client there to get results? Can I share this with them and let them interpret it as they want to every time? And I don't do that with Target.

I don't do that with anything that's frequentist that doesn't have some adjustment or some sort of mechanism to stop them from interpreting a test statistic before the sample size has been collected.

Guido: [00:43:18] Fair enough. Thank you so much, Merritt, and hope to talk to you soon. Thanks. Bye-bye.

Episode guest: Merritt Aho (Search Discovery)

Episode host: Guido X Jansen
