
> It’s astounding the correlation

Can you post your research? How many millions of accounts did you analyse? What tools did you use?




> But data seems to confirm it.

Can you share this data? I am curious.


Yev -> that's really interesting. Do you want to publish your data so we can compare notes? ;-)

>> In the general case, it's really easy to detect

> Why do you say that? I've worked hard on this problem and would not call it easy.

Are there reference data sets for this problem? If not, you or reddit should publish data sets.


> Why are stats like that not shared? That seems like very valuable information.

Because it's only 163, and the data is of unknown quality.


> the data is quite clear.

Please, share it with us then


>> Although these are very imperfect indicators of success, here they are.

You got your answer right there at the beginning of the post. It seems you are attacking for the sake of attacking. Sam is providing these data because people asked for them.


> did the recommendations team develop the system specifically with those KPIs in mind?

Yes they did - in fact they had input on defining them and helped in tracking them.

> Did they ever have access to adequate information to truly solve the problem your team needed solved?

They believed so. Their team was also responsible for our company's data warehousing, so they knew even better than I did what data was available. Basically, they had access to any piece of data that could be made available.

> And was the same result observed for other uses of their recommendation systems?

I did not have first-hand access to the results of their recommendation systems in other contexts. As I mentioned in my original post, I only had second-hand accounts from other teams that went the same route; they reported similar results to me.


> We have very good data

I don't recognize your username, so who is the "we" with very good data?


> I don't know, most of the metrics seemed unsurprising. Was there anything you found surprising in these datasets?

Nope, thought it was interesting data worth sharing, even just the fact it exists.

Some of the Fast Facts are mildly interesting: https://www.bgsu.edu/ncfmr/resources/data/fast-facts.html


>> their models have 20,000 vectors in determining credit worthiness. How would you begin to break that down to something explainable?

Well, somehow they decided that their 20,000-vector model is accurate. They should at least be able to explain why they made that decision, even if the model itself is too complex to explain directly.
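One common place to begin is a global surrogate: fit a small, interpretable model to the black box's own predictions and report how faithfully it tracks them. A minimal numpy sketch, using a made-up stand-in scorer (not any real lender's model):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in for an opaque scoring model: mostly linear,
# with a small nonlinear wrinkle the surrogate won't capture.
w = rng.standard_normal(10)

def black_box_score(X):
    return X @ w + 0.5 * np.sin(X[:, 0])

X = rng.standard_normal((500, 10))
scores = black_box_score(X)

# Global surrogate: a plain linear model fit to the black box's outputs.
design = np.c_[np.ones(len(X)), X]
coef, *_ = np.linalg.lstsq(design, scores, rcond=None)
approx = design @ coef

# R^2 of the surrogate measures how much of the opaque model's behaviour
# the simple, explainable one accounts for.
r2 = 1 - np.sum((scores - approx) ** 2) / np.sum((scores - scores.mean()) ** 2)
print(f"surrogate R^2 = {r2:.2f}")
```

A high surrogate R^2 at least tells you which inputs are doing most of the work, even when the full model stays opaque.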


> They absolutely sell the multi-feature profile of you from that data

I'd like to learn more. Do you have a source for that?


>I think you're over-estimating the amount of data involved.

The one billion figure was from the article.


> It’s not entirely clear why the University of Washington team gets such a weird result — since their data isn’t public, we can’t check it — but it’s worth noting at least two important issues with their study.

I cannot understand how economic studies are supposed to be credible when the data they use is not provided along with their methodology.

Is this for privacy reasons? If so, surely we can come up with obfuscation standards?


Submission statement: The author argues for decentralising databases in science while maintaining consistent file formats.

I am not sure about this. In genetic epidemiology we have three databases for genome-wide association summary data, called "sumstats". Each has its own way of formatting the data, and they are in various states of maintenance. GWAS Atlas is no longer receiving many of the latest summary statistics, while MRC's IEU database stores its fairly up-to-date sumstats in a very different file format (a custom VCF), which is fairly simple to convert to a more standard format but confusing for less tech-savvy users.

This is arguably a pretty centralised system already, and it is already very difficult to be sure you have the most recent (and best-quality) sumstats for a particular phenotype. Decentralisation would make this much worse! On the other hand, centralisation risks the singular database becoming unmaintained due to funding constraints, and because the whole thing is likely managed by a single post-doc who needs to move on from their job every 3 years to have a chance at career progression.


> I store anonymous metrics that I use to gauge usage levels

How valuable has that data proven to be? Is it worth collecting?


> Any stories of inadequate examples or meaningless correlations?

A customer's marketing group was tying visitor data to geodemographic data. They put together a database with tons of variables, went searching, found a multiple regression with a Pearson coefficient of 0.8+ and a low p-value, decided to rewrite personas, and started devising new tactics based on the discovery.

Fortunately, they briefed the CEO and the CEO said that the dimensions in question (I honestly don't remember what they were) didn't make intuitive sense, and demanded more details before supporting such a major shift in tactics. More research was done, and this time somebody remembered that this was a product where the customers aren't the users, so they need to be treated separately. And it turned out the original analysis (done without fancy analytics) was very close to correct.

If the CEO hadn't been engaged during that meeting, they would've thrown away good tactics on a simple mistake. The regression was "reliable" by most statistical measures, but it was noise.
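That failure mode is easy to reproduce: search enough candidate variables against a small sample and you will "discover" a strong correlation in pure noise. A toy Python sketch (synthetic data, nothing to do with the customer's actual dataset):

```python
import numpy as np

rng = np.random.default_rng(0)

n_obs, n_features = 20, 2000                  # few observations, many candidate variables
y = rng.standard_normal(n_obs)                # the "outcome" -- pure noise
X = rng.standard_normal((n_obs, n_features))  # candidate variables -- also pure noise

# Pearson correlation of every candidate column with y
yc = y - y.mean()
Xc = X - X.mean(axis=0)
r = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))

best = np.max(np.abs(r))
print(f"strongest 'discovered' correlation: r = {best:.2f}")
```

With 2000 tries on 20 data points, the best correlation comes out impressively high despite there being no relationship at all, which is why an out-of-sample check (or an intuition check from the CEO) matters.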

A similar example holds for validity: I saw a team build wonderfully accurate promotion-response models, but they only measured to the first "conversion" instead of measuring LTV (lifetime value). After several months of the new campaign, it turned out that the new customers had much higher churn, so they weren't nearly as valuable as the original customers.

> Care to elaborate on how to be more sure of reliability and validity?

I'm not a statistician or an actuary. I'm a guy who took four stat classes during undergrad. I know just enough to know that I don't know that much.

Disclaimer aside: my biggest rules of thumb are to make sure that you're measuring the thing you want to measure (not a substitute), to make sure the statistical methods you're using are appropriate for the data you're collecting, and to make sure you understand the segmentation of your market.


> Not saying its a bad idea, but I'm still struggling to see the full logic of it.

You don't have to see the full logic of it! There's data collected and analyzed by people who have studied the problem!


> The math on calculating all this is extremely complex.

Hmm, I would be more reassured by past statistics than by a probability evaluation. Do they happen to have loss data going back to their creation?


>I've only taken a quick look at the data, but the problem doesn't seem to be focused on their core competencies, but instead is much more general

How can you tell? All the features are completely anonymized.

