They reported the earlier numbers showing about 79% effectiveness, even though more data had come in since then and overall effectiveness had dropped to about 69%.
How big was the dataset on which you saw the 80% improvement? Remember, this test was done on a relatively small dataset (28 MB), so it may have hit the point of diminishing returns much faster than it would with a larger dataset.
The graphs tell a slightly different story. I was hoping to find their explanation of why there's no significant difference, when TRE was clearly better for about 6 months and then leveled off, yet was still about 25% more effective in the end.
I think you're mixing up effectiveness and efficacy. You're both correct but talking about different terms, except that the parent comment needs some prior adjustment to the numbers to get the posterior.
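To make the "prior adjustment" concrete: this is just Bayes' rule. A minimal sketch with made-up numbers (the prior and likelihoods here are hypothetical, not from the article):

```python
# Hypothetical Bayes update: adjust a prior belief P(H) using how likely
# the observed data is under H vs. not-H. All numbers are illustrative.

def posterior(prior, p_data_given_h, p_data_given_not_h):
    """P(H | data) via Bayes' rule."""
    numerator = prior * p_data_given_h
    denominator = numerator + (1 - prior) * p_data_given_not_h
    return numerator / denominator

# e.g. a 30% prior and a 0.9 vs 0.2 likelihood split
print(round(posterior(0.30, 0.9, 0.2), 3))  # → 0.659
```

The point is just that the raw reported rate isn't the posterior until you fold in the prior.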
EDIT: Just to confirm, I think I was incorrect in the message below.
I'm not completely sure I'm correct here (please correct me if I'm not!), but as I understand it the article does not support the claim in the headline. The headline claims that signups increased by 28% in the changed version and that this was entirely attributable to the change.
It's the second part that isn't supported. They say the result was statistically significant, but what I understand them as saying is that it was statistically significant that the new variation was better than the control. It could be better by an amount more or less than 28%; all we know from that is that it's almost certainly (95%) at least a little better. We would need to know the number of trials to get a confidence interval for the amount of improvement.
Could someone with a slightly better understanding of statistics chip in? I could use this in my own A/B tests: sometimes I know a change is going to be a pain to maintain, so I want to know not just whether it is better, but by how much.
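Not an expert, but the usual way to get a "by how much" is a confidence interval on the difference in conversion rates rather than just a significance verdict. A sketch using the normal (Wald) approximation; the signup counts below are made up for illustration:

```python
import math

def lift_ci(conv_a, n_a, conv_b, n_b, z=1.96):
    """Approximate 95% CI for the difference in conversion rates
    between variant B and control A (normal/Wald approximation)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_b - p_a
    return diff - z * se, diff + z * se

# Hypothetical data: 500/10000 signups on control, 640/10000 on variant
lo, hi = lift_ci(500, 10000, 640, 10000)
print(round(lo, 4), round(hi, 4))  # → 0.0076 0.0204
```

If the whole interval is above zero the result is "significant", but notice how wide it is: the true lift could plausibly be anywhere in that range, which is exactly why a bare "significant at 95%" doesn't pin the improvement to a single number like 28%.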
They mention this in the paper. The chance of all 7 departments changing their reporting simultaneously is vanishingly small. Much of the paper is dedicated to understanding the cause of the lack of difference between experimental groups; check it out.
Wow! I wonder, though, how they account for overfitting of the data. Is it a real solution or a statistical anomaly? I ask because it seems that progress over the last year(s) has been small increments within the 9-10% range.
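For anyone unfamiliar with what overfitting looks like in practice, here's a deliberately silly toy (nothing to do with their actual model): a "model" that memorizes random training labels scores perfectly on its training set and near chance on held-out data, which is why a held-out test set is the standard check.

```python
import random

# Toy overfitting demo: memorize random labels, then compare training
# accuracy to held-out accuracy. All data here is synthetic.
random.seed(0)
data = [(i, random.randint(0, 1)) for i in range(200)]  # random 0/1 labels
train, test = data[:100], data[100:]

memorized = dict(train)  # the "model" is a lookup table

def predict(x):
    return memorized.get(x, 0)  # guess 0 for anything unseen

train_acc = sum(predict(x) == y for x, y in train) / len(train)
test_acc = sum(predict(x) == y for x, y in test) / len(test)
print(train_acc, test_acc)  # perfect on train, roughly chance on test
```

A genuine improvement should show up on data the model never saw, not just on the leaderboard set it was tuned against.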
Interesting that the difference between the winning solution, which blended 107 approaches, and the nearby solutions using only a single approach was on the order of a fraction of a percent of improvement.