This is a bonus episode with Emily Robinson (Senior Data Scientist at Warby Parker) and Lukas Vermeer (Director of Experimentation at Booking.com). In her earlier session that day, Emily said that real progress starts when you put your work online for others to see and comment on, which in this case meant GitHub. Someone from the audience wondered how that works out in larger companies, where a manager or even a legal department might not be overly joyous about that, to say the least, so I asked Emily for her thoughts.
Emily: [00:00:47] Yes, I should have clarified that when I talk about posting things online, it's about personal projects. Sometimes you can clear it with the company. So for example, I did a post about some simulations based on problems I was facing at Etsy, and that was fine.
I cleared it with the legal department. I've given presentations about what I've learned about A/B testing at the companies I've worked at, but you definitely want to talk with them. And generally they won't accept you posting their data. So for projects where you're analyzing data, I often advise finding some personal projects and some public data.
Lukas: [00:01:19] I think so. I literally presented something that I posted online in my talk afterwards. And so it's a great setup from Emily, because she was talking about the tools, and then in my talk I actually used exactly those tools. So yeah, it's possible.
Guido: [00:01:35] It's possible. You can do it.
Lukas: [00:01:39] With proper checks first, and it's probably easier for personal projects, for
Guido: [00:01:43] sure.
Yeah. I think, Lukas, checking for SRM might make you feel good. Like, oh, I fixed all my issues with my sample size now, because I checked it with the Lukas tool. Now I'm all good, right? But there might still be quite some issues with the quality of your sample. So can you name a few of the more serious ones and how to cope with them?
Lukas: [00:02:07] So serious problems that would not be captured
Guido: [00:02:09] by SRM? Yeah.
Lukas: [00:02:11] Well, in data science, I don't know if Emily's familiar with this, but there's this concept of type one and type two errors. And then someone added type three. A type one error is a false positive, type two is a false negative, and type three is you answered the wrong question.
And I think this is not really a data problem, but it's quite common that people will try to answer a question with the data that they have. So they'll go, I need to answer this question, let me see what I already have. And then they'll come up with an answer based on what they have.
That's not necessarily the answer that people needed. So I think that's something I'd watch out for, but it's not really a check that you can run. I mean, to be clear, SRM is not a panacea, right? It's not a perfect way to say: yes, this data is certainly correct, and I can now no longer make any mistakes.
So it's definitely not that. What it is, is a very simple check that will capture a whole spectrum of data quality issues, and so there's really no reason that you wouldn't do this. And I think that's the point of my talk, right? If you do one thing when you're checking your experiments, you should be doing an SRM check.
Because it's so simple and the implications are so broad.
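To make "so simple" concrete, here is a minimal sketch of an SRM check for a two-group experiment, written as a chi-squared goodness-of-fit test of observed against expected group sizes. It is not the plugin or website mentioned in the conversation. The function names, the 50/50 default split, and the strict 0.001 alert threshold are assumptions made for illustration.

```python
import math


def srm_p_value(control_count, treatment_count, expected_control_share=0.5):
    """p-value of a chi-squared goodness-of-fit test for two groups (1 degree of freedom)."""
    total = control_count + treatment_count
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = ((control_count - expected_control) ** 2 / expected_control
            + (treatment_count - expected_treatment) ** 2 / expected_treatment)
    # For 1 degree of freedom, P(X > chi2) equals erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


def has_srm(control_count, treatment_count, expected_control_share=0.5, alpha=0.001):
    """Flag a sample ratio mismatch when the p-value falls below alpha (assumed threshold)."""
    return srm_p_value(control_count, treatment_count, expected_control_share) < alpha


# Hypothetical counts: a small imbalance versus a large one.
print(has_srm(50_000, 50_300))  # small imbalance, plausibly chance
print(has_srm(50_000, 51_500))  # large imbalance, worth investigating
```

A flagged result only says that the split is unlikely under the intended ratio; as discussed above, finding the root cause is a separate job.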
Guido: [00:03:26] And when you figure out that something is wrong, how often do you actually find out what's wrong there?
Lukas: [00:03:30] This was actually part of our interviews. So when we wrote the paper, Aleksander, the main author, made an interview script
of all the questions that we would ask the people we were interviewing to gather all the data. And one of the questions was: how often do you actually discover what the root cause of the SRM is? And there an interesting distinction emerged, at least in the interviews I was doing, which is that people working with clients, or at agencies, would say: not always.
But that's because they are limited by time and they're working with clients. And so at some point they'll say, well, we give up. We have no more hours to spend on this. Whereas people like my group basically said, well, 100% of the time we figure out what it is. Sometimes it takes a year. Right? But we will always, always, always keep going, because we take this so seriously and we have time to dedicate to this.
This is our work, right? This is our day-to-day job. So we don't have these constraints where you're working with a client and the client might say, well, you spent 10 hours working on the experiment this week, what did you do? And you go, well, I was trying to figure out why it was broken. That's not really an acceptable answer,
I think, in an agency setting.
Guido: [00:04:47] Fair enough. Do we already have an audience question? Does anyone want to go to the mic? No? I can go on, I have many questions, but it's up to you guys. So, I think it was Lukas, you mentioned two tools out there that do do a good job.
Lukas: [00:05:02] I didn't. I very purposely steered away from naming and shaming, because I don't want this to be about who's doing an SRM check and who's not.
I just want there to be awareness that this exists, and I think everyone should be doing this.
Guido: [00:05:23] Probably it's easy to check for, to look at your own tool and find out if they do it or not.
Lukas: [00:05:27] Yeah. I mean, I think at this point in the industry, it's safe to assume that your A/B testing platform does not do this.
So I mean, as far as I'm aware, there are two. I don't actually use any of those tools, so I can't speak to their quality, but I think it's safe to say. And certainly once you find one, and you're going to find something that has an SRM, then you know whether they are telling you about this SRM or not.
And then it's a matter, I think, of convincing vendors that this is something that we care about. Ultimately, we're their users, right? We are the people who use their platforms. If we tell the vendors that this is something that is important to us... and I tried to argue in my talk that this should be important to you, right?
The quality of your decisions should be the most important thing when you're running experiments. And so we should try to convince the vendors that this is something that we deeply care about. And since it's trivial to add, I don't see why they couldn't. And that's also part of the plugin that I presented, that I'm working on: it's more of a proof of concept to show how easy it is.
Right. On the one hand, I want to provide something that's useful to people, so that they can actually use this and they don't have to manually compute an SRM check every time. But I also wanted to show that this can simply be done, so that no one can say, well, you know, it's complicated,
this is not done this easily.
Guido: [00:06:50] Emily, in your talk you spoke about three different kinds of data scientists. How do I create awareness within my company that there are actually three different people they should be hiring? How do you go about doing that?
Emily: [00:07:04] Yeah, one thing I didn't talk about is that often, for an organization's first data scientist or in a startup,
someone is playing all three of those roles. But something to think about there is that they're usually not at as high a level as you might get in a specialist, and often that's not what you need. Usually an organization's first data scientists are spending a lot of time, depending on how much data engineering has been done, working on some data engineering, setting some tables up, doing some basic analysis that no one's ever looked at before.
And maybe part of that is a little bit of machine learning, a little bit of, you know, how to set up some experimentation stuff. But I do think for a company's first data scientist, or first couple, it's okay to have some more generalists, but not to expect that person to be, I would say, a unicorn who's world-class at all of them, because that's actually not what you need at that point.
Many, many problems do not need someone with a PhD in AI to solve.
Lukas: [00:07:58] And I can only speak to my own experience, but when I started at Booking, there was a lot of low-hanging fruit. There was lots of stuff that didn't actually require a PhD to figure out. And you had this great example that a lot of data science is actually counting and maybe some division, and you can get a long way with proper counting and division.
And so I totally agree with Emily that when you're only starting out with this, you want someone who's a bit of a generalist, maybe with a CRO kind of background, someone who has actually done lots of different things.
Guido: [00:08:28] I think that story resonates with me, with hiring a CRO specialist, which is a weird term, because most people are more like generalists.
You need design skills, you need developer skills, you need analytics skills. Ideally you'll have some psychology skills. So there's a whole range of things that you need to have, and you might not find that single person with all these skills.
Lukas: [00:08:52] Right. I don't know about Emily, but this is actually what attracts me to this crowd. I don't consider myself a CRO person at all.
But when I walk around at this conference, I see a lot of like-minded people. And I think it's because data scientists, just like CRO specialists, are generalists, and they have a wide variety of interests, and data-driven decision making is one of them, which is my field. I think CRO is more interested in the usability side of things or the optimization side of things, but there's a lot of overlap between data science and CRO, for sure.
Guido: [00:09:24] Yup. Yup. Lukas, is SRM basically making multi-armed bandit testing not an option?
Lukas: [00:09:31] This is the third time I've had this one. So, what the SRM check does is compare the expected ratio against the observed ratio, right? But when you're using a bandit, what is the expected ratio? Because the bandit is constantly adjusting the expected ratio,
Guido: [00:09:47] which you shouldn't be doing.
Lukas: [00:09:49] As you said, if you want to use standard statistics, you shouldn't be doing that, but if you're using a bandit, all of that is out the window anyway. And so when you're doing a bandit, the SRM check no longer works, so don't use it. But the standard statistics also no longer work, which is why, when you talk to vendors about bandit algorithms and how they do statistics on top of them, they'll often do something like weighting the different samples, or they have some way of segmenting the results, or they have some way of having a fixed control group. But they have some additional method
of dealing with the fact that the bandit is constantly adjusting the sample ratio, which you should do.
Guido: [00:10:28] Yep. Okay. Anyone,
Lukas: [00:10:33] Everyone is just afraid of us. Yeah. So after your talks about SRM, you can come up
Guido: [00:10:41] if you guys want to.
Audience member: [00:10:42] Ruben, CRO manager at Online Dialogue. First of all, thank you very much for the great talks this morning.
Great to have you both together on the panel here. Let's see if you can combine the talks. I've been actively tracking SRM for a while already, and when you find it, it's hard work to solve the issue. But the question is, can AI solve that issue someday?
Lukas: That's such a good question.
Emily: [00:11:03] Yeah. I think some of
Guido: [00:11:05] AI, machine learning.
Lukas: [00:11:08] AI, to solve SRM.
Emily: [00:11:10] Doesn't Booking have some of these checks automatically, right? Like looking at different breakdowns of browsers and other stuff like that.
Lukas: [00:11:16] So like many experimentation platforms, we have breakdowns, and we do SRM checks on the breakdowns. So that makes finding out
if there's a particular segment that's problematic very easy. That's actually the second thing, I think, on the list of rules of thumb that we suggest. Microsoft has the same thing, right? So step one is you look at the overall report and you see if something's clearly broken. Step two is you go into each individual segment and you see whether there's a particular browser or a particular language that's a problem.
But the question here is whether AI would solve this. So if I look at the state of the playing field out there, the reality is that most platforms don't even check. So I think step one is check, and then step two would be, once you uncover the issues, if you want to help a user understand the root cause, then segments help.
Maybe you can automatically surface in the UI the segments that are actually problematic. So you could automatically say, well, we found an SRM, and maybe Internet Explorer 6 seems to be a problem here. So that's something you could do. But if we go back to Guy's talk on Friday, he talks about AI at these different levels, where really you start out rules-based.
Right. That's step one, and I think that's where we're at. Step one: check. Step two: think of some rules that you could implement that would make it easier for the user to find these problems. But that's a far cry removed from AI. I'm not sure.
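The two-step, rules-based approach described here (check the overall split, then each segment) can be sketched as follows. This is an illustration only: the segment names and counts are made up, the 0.001 threshold is an assumption, and it does not reproduce any platform's actual implementation. Checking many segments also raises the chance of false alarms, so a real setup would likely want a stricter or corrected threshold.

```python
import math


def srm_p_value(control_count, treatment_count, expected_control_share=0.5):
    """p-value of a two-group chi-squared goodness-of-fit test (1 degree of freedom)."""
    total = control_count + treatment_count
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = ((control_count - expected_control) ** 2 / expected_control
            + (treatment_count - expected_treatment) ** 2 / expected_treatment)
    return math.erfc(math.sqrt(chi2 / 2))


def flag_segments(segment_counts, alpha=0.001):
    """Return segment names whose split looks mismatched, most suspicious first."""
    flagged = []
    for name, (control_count, treatment_count) in segment_counts.items():
        p_value = srm_p_value(control_count, treatment_count)
        if p_value < alpha:
            flagged.append((name, p_value))
    return [name for name, _ in sorted(flagged, key=lambda item: item[1])]


# Hypothetical per-browser (control, treatment) counts.
counts_by_browser = {
    "chrome": (30_000, 30_100),
    "firefox": (12_000, 11_950),
    "old_browser": (8_000, 7_000),
}
print(flag_segments(counts_by_browser))  # the segments a UI could surface
```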
Guido: [00:12:45] I'm not a panelist, but I think my personal answer would be:
yes, AI will fix it, but it will have wiped out humanity before that, so it doesn't really matter.
Emily: [00:12:55] I also think with AI and machine learning, it's often still very helpful to have a human in the loop. So Stitch Fix is a clothing company that ships you five outfits, and they do it with a combination of algorithms based on what you say you want, your sizes, and other information you fill out.
But then a human stylist kind of picks from the clothes that the algorithm suggests, puts them together, writes a note and all of that, and sends it off. And so I think there are going to be very few problems where it's only going to be, you know, machine learning or AI. There's always going to be that space for a human with some domain knowledge to come in and connect.
Lukas: [00:13:27] Absolutely. Human-in-the-loop computing, they also call this.
And I think there's even a feedback loop there, where what the human does in that loop is not just play a part in this one recommendation to this one customer; they also provide more information for the machine learning. So let's take the Stitch Fix example, right? The human goes: yes, this is a good recommendation, computer.
We're going to send this and I'm going to add a note. And then every once in a while they go: this is ridiculous, we're sending this person five pairs of jeans and nothing else. Then that feedback right there, that they now have insight into when the algorithm fails, can feed back into the machine learning itself.
So the human can say, this recommendation I'm going to change in this way. And a good recommendation system with a human in the loop will then adjust, right? It will use that information. So the human in the loop is not just a necessity for the operational element. It's also something that helps the algorithm improve.
Yeah. So these feedback systems, I really liked this about Guy's talk, that he talked about reinforcement learning, because that's essentially what you're doing here. You're thinking about this in terms of a continuously improving process, improving the recommendations.
Guido: [00:14:45] Can you have an SRM on session level, but not on a user level, or vice versa?
Lukas: [00:14:50] Yes. Yeah. I mean, Ronny has mentioned this in some of his papers as well. If you flip a coin for who gets to see treatment and control at the user level, but then you measure things at the session level, then any time users in treatment generate more sessions, you will essentially have an SRM at the session level.
And ironically, this is often what you want, right? You want your experiment to cause people to have more sessions. But let's take a simple example. Say that I'm measuring conversion, so whether people purchase something, and I am measuring at the session level. So I'm measuring how many sessions led to a conversion.
Imagine I have an experiment that makes people who were not purchasing come back the next day and purchase. Now, on the one hand, purchases will go up, because these people are coming back. But sessions go up by approximately the same amount, or probably more. And so if you look in that instance at purchases per session, it will go down.
So you sold more, you make more money, customers are probably happy, that's why they're coming back. But you look at the results for your test and your main KPI is going down. So in those cases, I struggle a little bit with session-level metrics in experimentation. But yeah, you can definitely have an SRM
at that level when you're flipping at the user level.
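The arithmetic behind this example is easy to check with made-up numbers. In the sketch below, all counts are hypothetical: both arms have the same number of users, but the treatment brings some non-purchasers back for extra sessions, and a minority of those return visits end in a purchase.

```python
# Hypothetical totals per arm; users are randomized 50/50.
control = {"users": 1_000, "sessions": 1_000, "purchases": 100}
# Treatment: 300 extra return sessions, 20 of which end in a purchase.
treatment = {"users": 1_000, "sessions": 1_300, "purchases": 120}


def conversion_rates(group):
    """Conversion measured per user and per session for one arm."""
    return {
        "per_user": group["purchases"] / group["users"],
        "per_session": group["purchases"] / group["sessions"],
    }


control_rates = conversion_rates(control)
treatment_rates = conversion_rates(treatment)

# Purchases per user go up, purchases per session go down.
print(control_rates["per_user"], treatment_rates["per_user"])
print(control_rates["per_session"], treatment_rates["per_session"])
```

The unequal session counts (1,000 versus 1,300) are also what a session-level SRM check would pick up, even though the user-level split is perfectly balanced.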
Guido: [00:16:16] Okay. Any more audience questions? I have a couple more. What is this with you guys and Harry Potter?
Lukas: [00:16:25] but
Emily: [00:16:28] did you
Lukas: [00:16:28] reference him? I told him to
Guido: [00:16:30] go to that, but you
Lukas: [00:16:33] had, you had this slip there? Yeah. I
Emily: [00:16:35] probably just picked someone's arm
Guido: [00:16:36] Dumbledore, right?
Lukas: [00:16:42] Yeah, yeah,
Guido: [00:16:42] yeah. Already is Dumbledore.
Lukas: [00:16:47] I don't know. It's popular
Emily: [00:16:49] nerd thing, I guess.
Guido: [00:16:50] Yeah. Yeah. Well, we have a similar question here, and that's: how do you have time? How do you find enough time for writing papers and doing analysis while keeping up with the latest memes?
Emily: [00:17:02] I don't... Twitter.
Guido: [00:17:02] Twitter.
Lukas: [00:17:03] Sleep less.
Guido: [00:17:04] Sleep less.
Lukas: [00:17:06] Hmm. I mean, do you want a real answer?
So it was funny. I was talking to a bunch of academics, and I said, oh boy, I would really love to join academia at some point. When I'm tired of industry, in a bunch of years, I want to join academia, so I finally have time to write papers. And they looked at me and rolled their eyes and said, you have no idea.
We are writing these papers in the evenings. You're laughing. Yeah. So apparently in academia it's the same thing. They'll say you write these papers in your weekends and your afternoons.
Guido: [00:17:37] In your time off, and teaching during the day.
Lukas: [00:17:41] Yep.
Guido: [00:17:42] One of my final ones. So, talking about groups in tests that are not equal: how do you correct for a test that takes place at the bottom of the page, compared to a group which doesn't have the treatment at the bottom of the page?
Lukas: [00:17:59] This is a very specific question for them.
Emily: [00:18:03] I think what it's asking is, for example, if the control is triggered when someone visits the page, but the treatment's only triggered when someone goes to the bottom, I'd have a lot more people in the control than in the treatment.
I mean, my general answer to that is I always trigger only on people who would have seen the change. So it should be triggering at the same moment for the control and the treatment. There are a couple of papers talking about that, and also about the idea of, let's say you're changing the checkout page and you're changing an offer for free shipping.
And it'll only be a change from what people would have seen for $25 to $35. And so you want to trigger just on that, because otherwise you're adding noise to your experiment, if you're having people enter who would see the same thing as in the control.
Lukas: [00:18:44] Yeah. I'm so torn about this topic. So you say it adds noise.
Why is that a problem? The power goes down, right? So the reason I'm torn is that you will have less power if you just include everyone, less statistical power. So you won't be able to pick up smaller effects. The reason that you trigger, and the reason that you zoom in, is that you want to pick up smaller effects, but it comes at a price, and the price is risk, right?
So there might be a need to do this, but it is a trade-off between that additional power and the risk that you take in having bias or missing data in your measurements. And I think it's really difficult to give a general statement of here's how you should make that trade-off, because it really, really depends on what's going to be in that footer.
Lukas: [00:20:08] Yeah. Yeah. Okay. I was assuming that was a given. Yeah, no, no.
So definitely, the triggering should be identical in the treatment and the control. This is actually one of the things in the SRM paper: one of the reasons that SRMs occur is when you use different triggering for treatment and control. And it was one of those that I had never thought of, because in the Booking infrastructure that can't happen: we wrote the API and the infrastructure in such a way that this is impossible, essentially.
But I talked to practitioners using other platforms and different implementations, and especially companies using redirect strategies, where this is actually possible. Let's say you trigger control when you land on the page, and in the treatment you redirect, and then you trigger on the page that you land on.
Now that seems like a subtle difference, but it does mean that if the redirect fails, then these people will not be triggered into treatment, which then causes an SRM. That's why you want to be absolutely sure that the triggering is identical in treatment and control.
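The redirect scenario can be worked through with numbers. The sketch below assumes a 50/50 assignment, a hypothetical 3% redirect failure rate, and a 0.001 alert threshold; none of these figures come from the conversation. Control is counted on landing, treatment only after a successful redirect.

```python
import math


def srm_p_value(control_count, treatment_count, expected_control_share=0.5):
    """p-value of a two-group chi-squared goodness-of-fit test (1 degree of freedom)."""
    total = control_count + treatment_count
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = ((control_count - expected_control) ** 2 / expected_control
            + (treatment_count - expected_treatment) ** 2 / expected_treatment)
    return math.erfc(math.sqrt(chi2 / 2))


assigned_per_arm = 50_000        # users assigned to each arm (hypothetical)
redirect_failure_rate = 0.03     # share of treatment redirects that fail (hypothetical)

# Control triggers on landing, so every assigned user is counted.
control_triggered = assigned_per_arm
# Treatment triggers only on the redirected page, so failed redirects go missing.
treatment_triggered = round(assigned_per_arm * (1 - redirect_failure_rate))

p_value = srm_p_value(control_triggered, treatment_triggered)
print(control_triggered, treatment_triggered, p_value < 0.001)
```

With identical triggering in both arms, the same failure would either affect both groups or neither, and the counts would stay balanced.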
Guido: [00:21:10] And the article you wrote, was it already published, or was that the one pending publication?
Lukas: [00:21:17] The paper? Yeah, that's already published. That was the sample ratio mismatch one, the taxonomy and rules of thumb. I think it was at KDD that we posted that.
Guido: [00:21:29] I'll look it up and put it in the show notes.
Lukas: [00:21:31] Nice, thank you. Yeah. So it's an academic paper, right?
But we tried to keep it as readable as possible. I don't think it's very dense, impenetrable material, but you know, I'm obviously biased here.
Guido: [00:21:46] You can start with the intro and the conclusion and see.
Lukas: [00:21:49] You can. I mean, the other thing that I presented in the talk was that I'm trying to provide more directly usable tools, like the plugin, like the website where you can test for SRM.
So maybe add those to the show notes so people can use them. I'm just trying to get everyone to be aware that this is a thing. The triggering is a good example, right? That's something where making sure that triggering is identical in control and treatment is important. And one of the ways that you would figure out that this is not the case is through an SRM.
And again, it's not a panacea, right? We might not catch it. But it's a start.
Guido: [00:22:24] Yeah. So, moving away from SRM: it's your job to teach people at Booking more about statistics, and the way you do this is through stories. So can you tell us what are the latest additions to your storybook?
So Lukas, well, you can probably tell it better yourself, but mainly Lukas is doing this to make it easier for people to remember those difficult statistical concepts. And it's not necessarily fun reading academic papers on it.
Lukas: [00:23:00] So at Booking, experimentation culture is pretty strong, because we care deeply about the impact that our work has on user experience, which means we want to make sure that when we have an idea and we put it on our website, we actually test that it does the thing that we wanted it to do, that it works.
And so I don't really have to do a lot of convincing that testing is important. But then the next question is, how do we actually do this, and how do I interpret results? So a lot of the training we do is more around helping people understand and interpret statistics. I think, Emily, you do pretty much the same thing.
Right. And I've tried to use stories, but you're putting me on the spot: I haven't really added any stories, because I started doing this years ago and I'm just rehashing the same stories.
Guido: [00:23:52] So
Emily: [00:23:52] I can make a pitch for something else on Twitter for folks to look up. There's a woman named Allison Horst who's just become an artist in residence at RStudio, which is a development environment for R.
And she makes drawings, actually, with little monsters. Some of them are stats related, some of them are related to functions in R, but they're really fun. Because she was a teacher, and she found that when she had introduction to programming classes, she'd show some code and she was very excited.
She was like, hey, you see all the cool things that this can do? But the students weren't really getting excited, and, you know, it felt a little bit inaccessible. So she started making these drawings, and I really recommend looking them up, because it's a great way, I think, of finding different ways to engage people, with her drawings.
Guido: [00:24:34] Yeah. I guess xkcd also has a couple of those explanations in cartoon form. Yes, you can just, oh, let's just send them the link. Hopefully they'll get it.
Lukas: [00:24:44] The thing that inspired me to start doing this in the form of stories was a book called What Is a p-value Anyway?
It's a very short, very thin book, and it's basically a statistician telling stories, explaining basic statistical concepts. So that's where I got some of the stories, some I actually took directly from the book. The author has this great example of one night he's playing basketball with Michael Jordan, which is
probably not true, but you know, the story is he's playing basketball with Michael Jordan, and they do free throws, so they try to put the ball in the net. And Michael Jordan hits seven out of seven, and he hits four out of seven, but that difference is not statistically significant. So he's just as good at basketball as Michael Jordan.
And I like that example because you laugh, right? You go, that's ridiculous. But that is what people often do when they look at an insignificant result. They say, oh, well, there's no significant difference, so A and B are the same. And that is literally saying that you're just as good at basketball as Michael Jordan is.
So this book is filled with examples like this. And the paper I used at the beginning of the talk, The Pitfalls of Experimenting on the Web, I love that paper for exactly the same reason, because it's using actual experiments to tell a story about why this particular problem of SRM is so problematic.
And so, for the people listening to the podcast, the experiment they ran is essentially showing that thinking about eyeliner makes people lose weight. And this is such a ridiculous conclusion that when you explain the experiment, you understand that this conclusion cannot be true.
Therefore there must be something wrong with this experiment. And once you set your mind to trying to figure out what's wrong with it, you understand how this would apply to other experiments that you run. And I really love this sort of storytelling, trying to get people to understand statistics, rather than throwing mathematical equations at them and then just hoping that somehow they will understand.
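The basketball story earlier in this answer can be checked numerically. The sketch below uses a two-sided Fisher's exact test on the scores as told here (seven out of seven versus four out of seven); choosing that particular test is an assumption, since the conversation does not say which test the book uses.

```python
from math import comb


def fisher_exact_two_sided(a_hits, a_misses, b_hits, b_misses):
    """Two-sided Fisher's exact test p-value for a 2x2 table of hits and misses."""
    n_a = a_hits + a_misses
    n_b = b_hits + b_misses
    total_hits = a_hits + b_hits
    all_tables = comb(n_a + n_b, total_hits)

    def probability(k):
        # Chance that player A has exactly k of the total hits, if both players are equally good.
        return comb(n_a, k) * comb(n_b, total_hits - k) / all_tables

    observed = probability(a_hits)
    lowest = max(0, total_hits - n_b)
    highest = min(n_a, total_hits)
    # Add up every outcome that is at most as likely as the observed one
    # (small tolerance so that exact ties are included).
    return sum(probability(k) for k in range(lowest, highest + 1)
               if probability(k) <= observed * (1 + 1e-9))


p_value = fisher_exact_two_sided(7, 0, 4, 3)
print(round(p_value, 3))  # well above the conventional 0.05 threshold
```

A p-value this large only means seven shots each are too few to tell the two players apart, which is exactly the point of the story: "not significant" is not the same as "equally good".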
Emily: [00:26:42] Yeah, I do think it's important for those folks listening to the podcast who weren't at Lukas's presentation to understand that he was trying to mime people putting on eyeliner, and he later admitted he didn't know how people did it. So he was miming putting on mascara. And
you don't put on eyeliner by, like, waving your hand
Lukas: [00:27:01] in front of your face. Wow, today I learned. So, yeah, I literally do not know how to do this. But this is the whole point of that experiment, right? The problem is that the men drop out of the experiment because they don't know how to put on eyeliner.
And then they refuse to participate, which is why the SRM occurs. And the same is true for me. I have no idea about eyeliner.
Guido: [00:27:21] Yeah. I feel an angle for a session coming up. Probably not the one in the sauna.
Lukas: [00:27:29] Eyeliner in the sauna. Would it run? Is that the one that runs?
I'm so confused.
Guido: [00:27:37] Do we have one final audience question or should we wrap it up? Yeah, go ahead.
Audience member: [00:27:45] I'm a new account strategist at [unclear] marketing. And Emily, just out of curiosity, I'm wondering, what is your main goal, your main motivation for writing your own book? What kind of vision do you feel like you have to share with the world?
Emily: [00:27:58]
Yeah, thanks. So for listeners, I'm writing a book called Build a Career in Data Science with my coauthor, Jacqueline Nolis. It's career advice for aspiring and junior data scientists. And really the motivation, as any folks who've written a book know, is certainly not financial, in terms of the amount of time you spend writing it compared to, yeah.
But really it was because Jacqueline and I both felt that, because it's such a new field, you know, its career paths aren't really well defined. A lot of folks may be the first data scientist at their company. They don't have someone to look up to. There are a lot of folks wanting to enter the field.
They try to Google how to become a data scientist, and there are millions of articles. So it's really the book that both of us wish we'd had, and it's also about getting some different perspectives. So at the end of every chapter, we interview a different data scientist, because, you know, we've learned a lot,
we've talked to some folks, but we also wanted to make sure we included a lot of different voices in the data science field.
Guido: [00:28:50] Very good. Thank you, Emily. Thank you, Lukas.