
Getting statistically useful data out of this will be difficult. A/B tests can tell you which of a set of options performs better on a measurable metric, but not why, and they give you no visibility into the 'metrics' you can't easily measure -- of which there are a lot.



With A/B tests, the only decisions you can make are among alternatives that are very close to the base alternative.

Generally, you can only generate data and do experiments based on your present state.


Also, one reason a lot of teams can't test more than two options (A/B/C/D/... testing) is that you need a TON of traffic for the results to be statistically significant.

A/B tests are only useful when you can gather statistically significant amounts of data from them. For a lot of small websites, or infrequently used features on larger websites, that is not the case.
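
As a back-of-the-envelope illustration (the 2% baseline conversion rate and 10% relative lift below are made-up assumptions, not numbers from this thread), here is a rough Python sketch of the per-arm sample size a two-proportion z-test needs -- which is also why every extra variant multiplies the traffic requirement:

    # Rough sample-size sketch for a two-proportion A/B test (two-sided z-test).
    # Baseline rate and lift are illustrative assumptions only.
    from scipy.stats import norm

    def required_n_per_arm(p_base, p_variant, alpha=0.05, power=0.8):
        """Approximate visitors needed in each arm to detect p_base -> p_variant."""
        z_alpha = norm.ppf(1 - alpha / 2)   # two-sided significance threshold
        z_power = norm.ppf(power)           # desired statistical power
        variance = p_base * (1 - p_base) + p_variant * (1 - p_variant)
        return (z_alpha + z_power) ** 2 * variance / (p_base - p_variant) ** 2

    # Detecting a 10% relative lift on a 2% conversion rate:
    print(round(required_n_per_arm(0.02, 0.022)))   # roughly 80,000 visitors per arm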

In a way, actually! The things that are hard to A/B test get neglected. If you need A/B testing (i.e., data) to make any decision, then in places where the evidence is purely qualitative there is no data to be had, and therefore no decision gets made.

The Expedia example comes to mind here too: micro-optimizations, with everyone working towards their own org's goal, can each make sense individually but fail badly when taken as a whole: https://www.qualtrics.com/blog/upstream-thinking-saved-exped...

Expedia had an example of this:

- the sales team wanted sales

- the phone support team wanted to turn over calls quickly

If there were a way to measure confusion, it would be improved and

Can I A/B test whether there will be a 10% increase in sales if I lower prices on EC2 instances by 5%? Yes! Can I A/B test whether the presence of 30 service options is the tipping point between spending 1 hour to get something done vs. wanting to hire an "AWS"


Maximizing a numerical reward signal is definitely not what we're doing when we do an A/B test.

We collect a variety of metrics. When we do an A/B test, we look at all of them as a way of understanding what effect our change has on user behavior and long-term outcomes.

A particular change may be intended to affect just one metric, but that's in an all-else-equal way. It's not often the case that our changes affect only one metric. And that's great, because it gives us hints as to what our next test should be.


Yep, that's right. Though you can do multivariate analysis, sticking with a plain A/B test is best.

That said, you can (and should) measure performance on multiple metrics like clicks, comments, and time spent. That gives you a more accurate picture of the tradeoffs.


> On top of that, doing anything economical with the analytics is very rare.

If this were true, then A/B testing would be useless, when in practice it's quite the opposite.


A/B testing is a way to conduct an experiment. Instrumentation and talking to users is another good way to gain insights, but it is not an experiment. They are two different (and often complementary) activities.

Many, many people have successfully used A/B Testing. I've personally used it to great effect several times. I certainly don't make decisions purely based on the statistical results, but I find it to be an extremely useful input to the decision making process. All models are flawed; some are useful.


I think companies run A/B tests precisely because they don't know which option provides the worse service.

Some people actually suggest running A/A/B tests just to gauge how much noise is in their numbers, though that requires even more visitors to achieve statistical confidence since they're spread out among more options.
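
A minimal simulation sketch of that A/A idea (the 5% conversion rate, 10,000 visitors per arm, and 1,000 runs are arbitrary assumptions for illustration):

    # A/A simulation: both groups share the same true rate, so every
    # "significant" result is pure noise.
    import numpy as np
    from scipy.stats import chi2_contingency

    rng = np.random.default_rng(0)
    runs, n, p = 1000, 10_000, 0.05
    false_positives = 0

    for _ in range(runs):
        a = rng.binomial(n, p)
        b = rng.binomial(n, p)
        table = [[a, n - a], [b, n - b]]
        _, p_value, _, _ = chi2_contingency(table, correction=False)
        false_positives += p_value < 0.05

    print(false_positives / runs)   # hovers around 0.05: ~5% of A/A runs look "significant"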

The types of effects you want to measure would take months or years to show up, and they are the combination of many different small decisions. Teams need to apply common-sense thinking, empirical data, and a willingness to wrestle with uncertainty. All of that is hard; blindly following A/B test results relieves people of that cognitive burden.

> For example, you can see if Group A or Group B from a test are more likely to still use the site 1 year later.

This isn't very feasible for most products, and it's certainly limited by the amount of data collected.


A/B testing has two big and obvious problems. One is that, even in the best case, A/B testing can only lead you to a local maximum. The other is that A/B testing, being statistical hypothesis testing, is prone to interpretation problems, which is why the changes you introduce in the variation have to be small; otherwise you don't know what you're measuring.

In other words, yes it can help you optimize the color or size of a Buy Now button. But it can't help you build a product.


This is something about A/B tests that always bothers me. They always seem to assume that if variation A is best for a proxy stat (signups, views, etc.), it's also best for what I'm really interested in (long-term value). I still use them, though -- not sure there's much you can do about it anyway.

The problem I have seen with these frameworks that handle A/B testing is that they ignore the basic laws of statistics. All groups are different given a large enough n. That's what p is about: do you have enough data to tell the means apart? The closer the means, the more data you need. What you really care about is how large the difference is. That's called effect size, which is essentially how far apart the means are, divided by a combined standard deviation for both groups.

I certainly wouldn't be making rash decisions on data without a good effect metric.
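
A small sketch of that effect-size calculation (Cohen's d with a pooled standard deviation; the data below is simulated purely for illustration):

    # Cohen's d: difference in means divided by a pooled standard deviation.
    import numpy as np

    def cohens_d(a, b):
        na, nb = len(a), len(b)
        pooled_var = ((na - 1) * np.var(a, ddof=1) +
                      (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2)
        return (np.mean(a) - np.mean(b)) / np.sqrt(pooled_var)

    rng = np.random.default_rng(1)
    a = rng.normal(10.00, 2.0, 1_000_000)   # with n this large, almost any
    b = rng.normal(10.02, 2.0, 1_000_000)   # difference is "significant"...
    print(cohens_d(a, b))                   # ...but d is about -0.01: a negligible effect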


I'm confused: surely the right way to compare would be to say the A/B tests pick the best option 100% of the time after 'reaching statistical significance'[1] - this is at least apples to apples.

[1] Like others here, I believe testing until you see significance is Doing It Wrong.
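
To make that footnote concrete, here's a small simulation sketch (the batch size, number of peeks, and conversion rate are all made-up assumptions) of how checking for significance after every batch inflates the false-positive rate:

    # Both arms share the same true rate, so any "win" is noise. Peeking after
    # every batch and stopping at the first p < 0.05 triggers far more often
    # than the nominal 5%.
    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(2)

    def peeking_trial(checks=20, batch=500, p=0.05):
        a = b = na = nb = 0
        for _ in range(checks):
            a += rng.binomial(batch, p); na += batch
            b += rng.binomial(batch, p); nb += batch
            pooled = (a + b) / (na + nb)
            se = np.sqrt(pooled * (1 - pooled) * (1 / na + 1 / nb))
            z = (a / na - b / nb) / se
            if 2 * norm.sf(abs(z)) < 0.05:   # "significant!" -- stop and ship it
                return True
        return False

    runs = 2000
    print(sum(peeking_trial() for _ in range(runs)) / runs)   # well above 0.05, typically ~0.2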


They're only different if you've selected bad metrics. If you've got two different search algorithms, running an A/B test and measuring how often the user selects the first item returned is a good measure for how well your search algorithm is returning the information the customer wanted, which is good customer experience.

I really like the idea of having many different metrics that you automatically track for every test you run. I've been telling people to do that for years, and saying that the fact that standard A/B test frameworks don't is good enough reason to roll your own.

However, suppose that you have 20 metrics you are following and you run 20 tests. That's 400 test-metric combinations, so at 99% confidence (a 1% false-positive rate) the odds are that about 4 times you'll see "significant" results on random metrics purely by chance. This is just a side effect of having many tests and many metrics.

Therefore if you find yourself in that situation, you should be very predisposed to assume that random results you were not expecting on random metrics that seem unconnected to your test really are due to random chance. Because the odds of weird chance results are higher than you would have guessed.
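
The arithmetic behind that, plus a conventional Bonferroni-style correction (not mentioned above, just one common response), as a quick sketch:

    # 20 tests x 20 metrics at a 1% false-positive rate (99% confidence).
    comparisons = 20 * 20
    alpha = 0.01

    print(comparisons * alpha)             # 4.0 spurious "wins" expected by chance
    print(1 - (1 - alpha) ** comparisons)  # ~0.98: near-certain to see at least one
    print(alpha / comparisons)             # 2.5e-05: Bonferroni per-comparison threshold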


I'm probably just missing something. But for us the purpose of A/B tests isn't really optimization; it's learning. We have a hypothesis about how to improve something and we try it out. The most valuable tests for us are the ones that don't work, because they force us to go back and think things through again.

A magic multivariable optimizer seems fine for the kinds of things a human won't be thinking about (e.g., most interesting tweets this hour and the best ads to show next to them). But from this article I'm not seeing an advantage in using a similar mechanism for testing product hypotheses.

