
Getting statistically useful data out of this will be difficult. A/B tests can tell you which of a set of options performs better on a measurable metric, but not why, and they give you no visibility into the 'metrics' you can't easily measure -- of which there are a lot.



With A/B tests, the only decisions you can make are among alternatives that are very close to the base alternative.

Generally, you can only generate data and do experiments based on your present state.


Also, one reason a lot of teams can't test more than two options (A/B/C/D/... testing) is that you need a TON of traffic for the results to be statistically significant.

A/B tests are only useful when you can gather statistically significant amounts of data from them. For a lot of small websites, or infrequently used features on larger websites, that is not the case.
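
As a back-of-the-envelope illustration (the 2% baseline conversion rate and 10% relative lift below are made-up assumptions, not numbers from this thread), here is a rough Python sketch of the per-arm sample size a two-proportion z-test needs -- which is also why every extra variant multiplies the traffic requirement:

    # Rough sample-size sketch for a two-proportion A/B test (two-sided z-test).
    # Baseline rate and lift are illustrative assumptions only.
    from scipy.stats import norm

    def required_n_per_arm(p_base, p_variant, alpha=0.05, power=0.8):
        """Approximate visitors needed in each arm to detect p_base -> p_variant."""
        z_alpha = norm.ppf(1 - alpha / 2)   # two-sided significance threshold
        z_power = norm.ppf(power)           # desired statistical power
        variance = p_base * (1 - p_base) + p_variant * (1 - p_variant)
        return (z_alpha + z_power) ** 2 * variance / (p_base - p_variant) ** 2

    # Detecting a 10% relative lift on a 2% conversion rate:
    print(round(required_n_per_arm(0.02, 0.022)))   # roughly 80,000 visitors per arm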

In a way, actually! The things that are hard to A/B test get neglected. If you need A/B testing (i.e., data) to make any decision, then in places where the evidence is purely qualitative there is no data to be had, and therefore no decision gets made.

The Expedia example comes to mind here too: micro-optimizations, with everyone working towards their own org's goal, can each make sense individually but fail badly when taken as a whole: https://www.qualtrics.com/blog/upstream-thinking-saved-exped...

Expedia had an example of this:

- the sales team wanted sales

- the phone support team wanted to turn over calls quickly

If there were a way to measure confusion, it would be improved and

Can I A/B test whether there will be a 10% increase in sales if I lower prices on EC2 instances by 5%? Yes! Can I A/B test whether the presence of 30 service options is the tipping point between spending 1 hour to get something done vs. wanting to hire an "AWS"


Maximizing a numerical reward signal is definitely not what we're doing when we do an A/B test.

We collect a variety of metrics. When we do an A/B test, we look at all of them as a way of understanding what effect our change has on user behavior and long-term outcomes.

A particular change may be intended to affect just one metric, but that's in an all-else-equal way. It's not often the case that our changes affect only one metric. And that's great, because it gives us hints as to what our next test should be.


Yep, that's right. Though you can do multivariate analysis, sticking with a plain A/B test is best.

That said, you can (and should) measure performance on multiple metrics like clicks, comments, and time spent. That gives you a more accurate picture of the tradeoffs.


> On top of that, doing anything economical with the analytics is very rare.

If this were true, then A/B testing would be useless, when in practice it's quite the opposite.


A/B testing is a way to conduct an experiment. Instrumentation and talking to users is another good way to gain insights, but it is not an experiment. They are two different (and often complementary) activities.

Many, many people have successfully used A/B Testing. I've personally used it to great effect several times. I certainly don't make decisions purely based on the statistical results, but I find it to be an extremely useful input to the decision making process. All models are flawed; some are useful.


I think companies run A/B tests precisely because they don't know which option provides the worse service.

Some people actually suggest running A/A/B tests just to gauge how much noise is in their numbers, though that requires even more visitors to achieve statistical confidence since they're spread out among more options.
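
A minimal simulation sketch of that A/A idea (the 5% conversion rate, 10,000 visitors per arm, and 1,000 runs are arbitrary assumptions for illustration):

    # A/A simulation: both groups share the same true rate, so every
    # "significant" result is pure noise.
    import numpy as np
    from scipy.stats import chi2_contingency

    rng = np.random.default_rng(0)
    runs, n, p = 1000, 10_000, 0.05
    false_positives = 0

    for _ in range(runs):
        a = rng.binomial(n, p)
        b = rng.binomial(n, p)
        table = [[a, n - a], [b, n - b]]
        _, p_value, _, _ = chi2_contingency(table, correction=False)
        false_positives += p_value < 0.05

    print(false_positives / runs)   # hovers around 0.05: ~5% of A/A runs look "significant"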

The types of effects you want to measure would take months or years to show up, and they are the combination of many different small decisions. Teams need to apply common-sense thinking, empirical data, and a willingness to wrestle with uncertainty. All of that is hard; blindly following A/B test results relieves people of that cognitive burden.

> For example, you can see if Group A or Group B from a test are more likely to still use the site 1 year later.

This isn't very feasible for most products, and it's certainly limited by the amount of data collected.


A/B testing has two big and obvious problems. One is that, even in the best case, A/B testing can only lead you to a local maximum. The other is that A/B testing, being statistical hypothesis testing, is prone to interpretation problems, which is why the changes you introduce in the variation have to be small; otherwise you don't know what you're measuring.

In other words, yes it can help you optimize the color or size of a Buy Now button. But it can't help you build a product.


This is something about A/B tests that always bothers me. They always seem to assume that if variation A is best for a proxy stat (signups, views, etc.), it's also best for what I'm really interested in (long-term value). I still use them, though -- not sure there's much you can do about it anyway.

The problem I have seen with these frameworks that handle A/B testing is that they ignore the basic laws of statistics. All groups are different given a large enough n. That's what p is about: do you have enough data to tell the means apart? The closer the means, the more data you need. What you really care about is how large the difference is. That's called effect size, which is essentially how far apart the means are, divided by a combined standard deviation for both groups.

I certainly wouldn't be making rash decisions on data without a good effect metric.
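
A small sketch of that effect-size calculation (Cohen's d with a pooled standard deviation; the data below is simulated purely for illustration):

    # Cohen's d: difference in means divided by a pooled standard deviation.
    import numpy as np

    def cohens_d(a, b):
        na, nb = len(a), len(b)
        pooled_var = ((na - 1) * np.var(a, ddof=1) +
                      (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2)
        return (np.mean(a) - np.mean(b)) / np.sqrt(pooled_var)

    rng = np.random.default_rng(1)
    a = rng.normal(10.00, 2.0, 1_000_000)   # with n this large, almost any
    b = rng.normal(10.02, 2.0, 1_000_000)   # difference is "significant"...
    print(cohens_d(a, b))                   # ...but d is about -0.01: a negligible effect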


I'm confused: surely the right way to compare would be to say the A/B tests pick the best option 100% of the time after 'reaching statistical significance'[1] - this is at least apples to apples.

[1] Like others here, I believe testing until you see significance is Doing It Wrong.
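
To make that footnote concrete, here's a small simulation sketch (the batch size, number of peeks, and conversion rate are all made-up assumptions) of how checking for significance after every batch inflates the false-positive rate:

    # Both arms share the same true rate, so any "win" is noise. Peeking after
    # every batch and stopping at the first p < 0.05 triggers far more often
    # than the nominal 5%.
    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(2)

    def peeking_trial(checks=20, batch=500, p=0.05):
        a = b = na = nb = 0
        for _ in range(checks):
            a += rng.binomial(batch, p); na += batch
            b += rng.binomial(batch, p); nb += batch
            pooled = (a + b) / (na + nb)
            se = np.sqrt(pooled * (1 - pooled) * (1 / na + 1 / nb))
            z = (a / na - b / nb) / se
            if 2 * norm.sf(abs(z)) < 0.05:   # "significant!" -- stop and ship it
                return True
        return False

    runs = 2000
    print(sum(peeking_trial() for _ in range(runs)) / runs)   # well above 0.05, typically ~0.2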


They're only different if you've selected bad metrics. If you've got two different search algorithms, running an A/B test and measuring how often the user selects the first item returned is a good measure for how well your search algorithm is returning the information the customer wanted, which is good customer experience.

I really like the idea of having many different metrics that you automatically track for every test you run. I've been telling people to do that for years, and saying that the fact that standard A/B test frameworks don't is good enough reason to roll your own.

However, suppose that you have 20 metrics you are following and you run 20 tests. That's 400 test-metric combinations, so at 99% confidence (a 1% false-positive rate) the odds are that about 4 times you'll see "significant" results on random metrics purely by chance. This is just a side effect of having many tests and many metrics.

Therefore if you find yourself in that situation, you should be very predisposed to assume that random results you were not expecting on random metrics that seem unconnected to your test really are due to random chance. Because the odds of weird chance results are higher than you would have guessed.
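
The arithmetic behind that, plus a conventional Bonferroni-style correction (not mentioned above, just one common response), as a quick sketch:

    # 20 tests x 20 metrics at a 1% false-positive rate (99% confidence).
    comparisons = 20 * 20
    alpha = 0.01

    print(comparisons * alpha)             # 4.0 spurious "wins" expected by chance
    print(1 - (1 - alpha) ** comparisons)  # ~0.98: near-certain to see at least one
    print(alpha / comparisons)             # 2.5e-05: Bonferroni per-comparison threshold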


I'm probably just missing something. But for us the purpose of A/B tests isn't really optimization; it's learning. We have a hypothesis about how to improve something and we try it out. The most valuable tests for us are the ones that don't work, because they force us to go back and think things through again.

A magic multivariable optimizer seems fine for the kinds of things a human won't be thinking about (e.g., most interesting tweets this hour and the best ads to show next to them). But from this article I'm not seeing an advantage in using a similar mechanism for testing product hypotheses.

