
I'm seeing a fair few duplicates in the results; I probably need to work on the algorithm for filtering these out.



During search, we do remove duplicates. It's not a bad idea, though; I'll see how we can support it.

I just modified the way the queries run; duplicates should be MUCH less common now.

We're still testing this out with real data, but it looks like it's actually quite useful to have duplicates in the database. The key is how to return results to someone coming along later. We're working on that now.
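One way that could work (a sketch only, not the site's actual code; the `url` key and row shape are made up): keep every duplicate in storage, but collapse them by a canonical key at query time so the reader only sees one result per item.

```python
def collapse_duplicates(rows, key=lambda r: r["url"]):
    """Return only the first row seen for each key; storage keeps everything."""
    seen = set()
    out = []
    for row in rows:
        k = key(row)
        if k not in seen:
            seen.add(k)
            out.append(row)
    return out

rows = [
    {"url": "a", "points": 3},
    {"url": "a", "points": 7},
    {"url": "b", "points": 1},
]
deduped = collapse_duplicates(rows)  # first "a" row and the "b" row survive
```

Swapping the `key` function (or keeping the highest-points row instead of the first) changes what "later" readers see without touching the stored data.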

We committed upfront to not letting the site become overrun with "Yahoo Answers" style duplicated/low quality stuff. We'd much rather delete useless stuff than get an extra page view or two.


Huh, there's a ton of duplicates in the data set... I would have expected that it would be worthwhile to remove those. Maybe multiple descriptions of the same thing helps, but some of the duplicates have duplicated descriptions as well. Maybe deduplication happens after this step?

http://laion-aesthetic.datasette.io/laion-aesthetic-6pls/ima...


Sorry for the duplicate. It seems HN should have a better duplicate-removal algorithm.

Thanks! It wouldn't be so bad if there weren't so many duplicates :)

This is an interesting dataset. I suspect the main things that get removed are A) politics B) duplicates.

Looks very cool! A couple of things I noticed, though: there seem to be a lot of duplicates cluttering the search results, and the first thing I searched for was available, but the correct result didn't appear until the second page.

Removing duplicates and very near duplicates ought to be one of the first things done to any training dataset...

I wonder why this wasn't done? Too computation heavy?
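Exact and trivially-near duplicates are cheap to remove; it's the fuzzier near-duplicates (requiring MinHash/SimHash or embedding comparisons) that get computation-heavy. A minimal sketch of the cheap pass, assuming records with a hypothetical `caption` field:

```python
import hashlib
import re

def normalize(text):
    # crude canonicalisation: lowercase and collapse non-word characters
    return re.sub(r"\W+", " ", text.lower()).strip()

def dedupe(records, field="caption"):
    """Drop records whose normalised text has been seen before."""
    seen, kept = set(), []
    for rec in records:
        digest = hashlib.sha256(normalize(rec[field]).encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(rec)
    return kept
```

This catches casing, punctuation, and whitespace variants in one linear pass; anything subtler than that is where the real cost lives.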


Yes, I'm observing the same. The duplicate entries seem to differ only in their number of points.

It shouldn't allow duplicates; I'll work on the rest.

Thanks :) Duplicate entries are not allowed, but I'm still trying to figure out the best way to handle them. If you come across one, simply flag the most recently entered copy so I can delete it manually. SICP also got added a couple of times.

Very interesting concept. I signed up; we'll see where it goes.

One thing to note, though - on the left I'm seeing a number of duplicated entries, with the duplicates immediately after the original. I'm using Firefox, if that matters.


You have a lot of duplicate entries there.

I notice a surprising number of duplicates. E.g. if I sort by aesthetic, there’s the same 500x500 Tuscan village painting multiple times on the first page of results.

Presumably it wouldn’t be so hard to hash the images and filter out repeats. Is the idea to keep the duplicates to preserve the description mappings?
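The hashing half of that is indeed easy; a sketch (the input shape, pairs of raw image bytes and a description, is an assumption) that collapses byte-identical images while preserving every distinct description attached to them:

```python
import hashlib
from collections import defaultdict

def group_by_image(items):
    """Collapse byte-identical images into one entry while keeping
    every distinct description that was attached to them."""
    groups = defaultdict(set)
    for image_bytes, description in items:
        groups[hashlib.sha256(image_bytes).hexdigest()].add(description)
    return groups
```

Note this only catches exact byte-level repeats; resized or re-encoded copies of the same painting would need a perceptual hash instead.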


Duplicate removal generally happens after the initial count is received. That's why you can get major variations in the count and the actual number of results.
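That ordering can be sketched in two lines: the engine reports a raw match count first, then dedupes the rows it actually returns, so the two numbers diverge.

```python
raw_matches = ["a", "a", "b", "c", "b"]      # what the reported count is taken from
estimated_count = len(raw_matches)           # reported first: 5
results = list(dict.fromkeys(raw_matches))   # deduped afterwards: 3 rows shown
```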

Cool idea, but there seem to be a few duplicates for Java. When I search 'Java', I see 3-4 duplicate results; only the first one has any companies listed, and the rest seem to be empty?

Probably the result of some early attempt at either sorting or removing duplicates.

dang, you need a better clustering engine for duplicates. Try an LLM and DBSCAN!
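For what it's worth, the DBSCAN half of that suggestion fits in stdlib Python. A toy sketch, using `1 - SequenceMatcher.ratio()` on titles as the distance metric in place of LLM embeddings (that substitution is just to keep the example self-contained):

```python
from difflib import SequenceMatcher

def dbscan_titles(titles, eps=0.25, min_pts=2):
    """Toy DBSCAN over titles. Returns a cluster label per title;
    noise points (no near-duplicates) get None."""
    dist = lambda a, b: 1 - SequenceMatcher(None, a, b).ratio()
    neighbors = [
        [j for j in range(len(titles)) if dist(titles[i], titles[j]) <= eps]
        for i in range(len(titles))
    ]
    labels = [None] * len(titles)
    cluster = 0
    for i in range(len(titles)):
        if labels[i] is not None or len(neighbors[i]) < min_pts:
            continue  # already assigned, or not a core point
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:  # expand the cluster through density-connected points
            j = queue.pop()
            if labels[j] is None:
                labels[j] = cluster
                if len(neighbors[j]) >= min_pts:
                    queue.extend(neighbors[j])
        cluster += 1
    return labels
```

The pairwise distance matrix makes this O(n²), which is exactly where embeddings plus an index would come in for real HN-scale data.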
