
I'm willing to take a punt and guess that the models they're using rely on short-term, isolated data, not 10 years' worth of your entire browsing history.



They also state their data is 2 months old.

I tried it, and the results aren't up to par with whatever I'm searching. They'll be forced to relinquish data like all the other companies when they get big enough.

I'm not saying whether this destruction is happening or not, but the data they offer to support their claims is dreadful: it's all circumstantial link and page views, Google News trends, and Alexa profiles. Sadly, these days claims that may well be correct get ignored when the supporting data is weak (and vice versa).

Sorry, I wasn't clear. I meant, is there any reason to believe that they'd not filter out those kinds of exchanges before updating the model? I am absolutely certain they don't train on user data indiscriminately because so much of it would be garbage.

Yeah, but they're still providing a dataset that's just plain bad. It's hardly relevant how many sites link to some other site if that site is dead.

The Flash viewer and ridiculous TOS don't give me much confidence in their data.

It’s bad data. They changed the way they collect data when the price of API access went up. The drop you see in July 2023 isn’t real.

Just playing devil's advocate here: They mention "aggregate user behavior", so they could be building a large, aggregated Markov chain that stores no user data whatsoever -- just site transition data for the world.

I haven't read in-depth analysis of how they do their stuff, though.
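For illustration only, here's a minimal sketch of what such an aggregate transition store could look like. This is pure speculation, not their actual implementation; the class and site names are hypothetical. The point is that only (from_site, to_site) counts are kept, with no per-user records at all:

```python
from collections import defaultdict

class TransitionChain:
    """Hypothetical aggregate first-order Markov chain over site transitions.
    Stores only global counts per (from_site, to_site) pair -- no user data."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def record(self, from_site, to_site):
        # Every observed navigation just bumps an aggregate counter.
        self.counts[from_site][to_site] += 1

    def transition_prob(self, from_site, to_site):
        # Estimated probability of going from from_site to to_site.
        total = sum(self.counts[from_site].values())
        return self.counts[from_site][to_site] / total if total else 0.0

chain = TransitionChain()
chain.record("news.ycombinator.com", "example.com")
chain.record("news.ycombinator.com", "example.com")
chain.record("news.ycombinator.com", "other.org")
print(chain.transition_prob("news.ycombinator.com", "example.com"))  # 2/3
```

A structure like this would support "aggregate user behavior" claims literally: you can answer "where do visitors of site X go next?" without ever being able to reconstruct any individual's browsing history.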


How do you expect to get that data? It's buried in their Mediamath and Optimizely accounts.

From a research standpoint this data set is much less interesting than a bunch of students/faculty/bots/apps clicking and surfing their way around the whole Internets.

Even with analytics, it doesn't seem like they know.

It just gives a false confidence in bullshit.

Most internet analytics and the billions spent on data mining is all for naught.


Not talking about raw data. Clearly that isn't needed for the web. But my point is that it got thrown out with the bathwater.

Who the hell paid $60M for their garbage data? This can't be real. It's also another reason why scraping content to use for models has to be free for everyone, or incumbents are gonna pull stuff like this.

If that's what's actually happening on the back end, they're doing a pretty bad job with my recommendations. What I suspect is really happening, though, is that the data collection and analytics are being used to optimize revenue, whether it be from advertising, minimizing production costs or ascertaining just how much bullshit users will put up with, and not to build or release higher quality products. That, or the data is sold.

They aren't good. Hence the dragnet approach of collecting all data, and then waiting for Google or some such entity to come up with the research for mining methodology.

Notice how the actual data on these pages is exactly the same and from 2022-2023?

This is what I mean by making up links that don't work. The links further down the page don't even go back that far.

You ought to inspect the links before posting them as proof. By not doing so, you demonstrate the limitations and fallacies of humans putting too much faith in these tools.

This graph made 7 years ago would appear to contradict the numbers ChatGPT gave you: https://www.reddit.com/r/dataisbeautiful/comments/3s3c8o/rel...

Are you ready to rethink your exuberance for ChatGPT?

---

I performed two searches, both from image search, and it took 1 minute to find a good graph. I suspect ChatGPT took longer to write its combined, inaccurate responses, judging by how slowly it streams the answers back.

1. windows version usage by year, 2000 - 2010

2. operating system usage by year, 2000 - 2010


I also do not trust their data as much.

I understand how Compete works. I'm just showing you one immediate example of how comically wrong their data often is, both in absolute value and in misreporting trends. This has always been my experience with their data on both small and large sites, unless you use their paid product that allows you to self-report. Kind of reminiscent of BBB/DUNs/Yelp-type protection rackets, actually.

This is no surprise to me and I wager that far worse is happening with your data.
