
I'm seeing a fair few duplicates in the results; I probably need to work on the algorithm for filtering these out.



During search, we do remove duplicates. It's not a bad idea, though; I'll see how we can support it.

I just modified the way the queries run; duplicates should be MUCH less common now.

We're still testing this out with real data, but it looks like it's actually quite useful to have duplicates in the database. The key is how to return results to someone coming along later. We're working on that now.
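One way that could work (a sketch only, not the site's actual code; the `url` key and row shape are made up): keep every duplicate in storage, but collapse them by a canonical key at query time so the reader only sees one result per item.

```python
def collapse_duplicates(rows, key=lambda r: r["url"]):
    """Return only the first row seen for each key; storage keeps everything."""
    seen = set()
    out = []
    for row in rows:
        k = key(row)
        if k not in seen:
            seen.add(k)
            out.append(row)
    return out

rows = [
    {"url": "a", "points": 3},
    {"url": "a", "points": 7},
    {"url": "b", "points": 1},
]
deduped = collapse_duplicates(rows)  # first "a" row and the "b" row survive
```

Swapping the `key` function (or keeping the highest-points row instead of the first) changes what "later" readers see without touching the stored data.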

We committed upfront to not letting the site become overrun with "Yahoo Answers" style duplicated/low quality stuff. We'd much rather delete useless stuff than get an extra page view or two.


Huh, there's a ton of duplicates in the data set... I would have expected that it would be worthwhile to remove those. Maybe multiple descriptions of the same thing helps, but some of the duplicates have duplicated descriptions as well. Maybe deduplication happens after this step?

http://laion-aesthetic.datasette.io/laion-aesthetic-6pls/ima...


Sorry for the duplicate. It seems HN should have a better duplicate-removal algorithm.

Thanks! It wouldn't be so bad if there weren't so many duplicates :)

This is an interesting dataset. I suspect the main things that get removed are A) politics B) duplicates.

Looks very cool! A couple of things I noticed, though: there seem to be a lot of duplicates cluttering the search results, and the first thing I searched for was available, but the correct result didn't appear until the second page.

Removing duplicates and very near duplicates ought to be one of the first things done to any training dataset...

I wonder why this wasn't done? Too computation heavy?
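Exact and trivially-near duplicates are cheap to remove; it's the fuzzier near-duplicates (requiring MinHash/SimHash or embedding comparisons) that get computation-heavy. A minimal sketch of the cheap pass, assuming records with a hypothetical `caption` field:

```python
import hashlib
import re

def normalize(text):
    # crude canonicalisation: lowercase and collapse non-word characters
    return re.sub(r"\W+", " ", text.lower()).strip()

def dedupe(records, field="caption"):
    """Drop records whose normalised text has been seen before."""
    seen, kept = set(), []
    for rec in records:
        digest = hashlib.sha256(normalize(rec[field]).encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(rec)
    return kept
```

This catches casing, punctuation, and whitespace variants in one linear pass; anything subtler than that is where the real cost lives.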


Yes, I'm observing the same. The duplicate entries seem to differ only in their number of points.

It shouldn't allow duplicates; I'll work on the rest.

Thanks :) Duplicate entries are not allowed, but I'm still trying to figure out the best way to handle them. If you come across one, simply flag the most recently entered copy so I can delete it manually. SICP also got added a couple of times.

Very interesting concept. I signed up; we'll see where it goes.

One thing to note, though - on the left I'm seeing a number of duplicated entries, with the duplicates immediately after the original. I'm using Firefox, if that matters.


You have a lot of duplicate entries there.

I notice a surprising number of duplicates. E.g. if I sort by aesthetic, there’s the same 500x500 Tuscan village painting multiple times on the first page of results.

Presumably it wouldn’t be so hard to hash the images and filter out repeats. Is the idea to keep the duplicates to preserve the description mappings?
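The hashing half of that is indeed easy; a sketch (the input shape, pairs of raw image bytes and a description, is an assumption) that collapses byte-identical images while preserving every distinct description attached to them:

```python
import hashlib
from collections import defaultdict

def group_by_image(items):
    """Collapse byte-identical images into one entry while keeping
    every distinct description that was attached to them."""
    groups = defaultdict(set)
    for image_bytes, description in items:
        groups[hashlib.sha256(image_bytes).hexdigest()].add(description)
    return groups
```

Note this only catches exact byte-level repeats; resized or re-encoded copies of the same painting would need a perceptual hash instead.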


Duplicate removal generally happens after the initial count is received. That's why you can get major variations in the count and the actual number of results.
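That ordering can be sketched in two lines: the engine reports a raw match count first, then dedupes the rows it actually returns, so the two numbers diverge.

```python
raw_matches = ["a", "a", "b", "c", "b"]      # what the reported count is taken from
estimated_count = len(raw_matches)           # reported first: 5
results = list(dict.fromkeys(raw_matches))   # deduped afterwards: 3 rows shown
```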

Cool idea, but there seem to be a few duplicates for Java. When I search 'Java', I see 3-4 duplicate results; only the first one has any companies listed, and the rest seem to be empty?

Probably the result of some early attempt at either sorting or removing duplicates.

dang, you need a better clustering engine for duplicates. Try an LLM and DBSCAN!
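For what it's worth, the DBSCAN half of that suggestion fits in stdlib Python. A toy sketch, using `1 - SequenceMatcher.ratio()` on titles as the distance metric in place of LLM embeddings (that substitution is just to keep the example self-contained):

```python
from difflib import SequenceMatcher

def dbscan_titles(titles, eps=0.25, min_pts=2):
    """Toy DBSCAN over titles. Returns a cluster label per title;
    noise points (no near-duplicates) get None."""
    dist = lambda a, b: 1 - SequenceMatcher(None, a, b).ratio()
    neighbors = [
        [j for j in range(len(titles)) if dist(titles[i], titles[j]) <= eps]
        for i in range(len(titles))
    ]
    labels = [None] * len(titles)
    cluster = 0
    for i in range(len(titles)):
        if labels[i] is not None or len(neighbors[i]) < min_pts:
            continue  # already assigned, or not a core point
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:  # expand the cluster through density-connected points
            j = queue.pop()
            if labels[j] is None:
                labels[j] = cluster
                if len(neighbors[j]) >= min_pts:
                    queue.extend(neighbors[j])
        cluster += 1
    return labels
```

The pairwise distance matrix makes this O(n²), which is exactly where embeddings plus an index would come in for real HN-scale data.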
