
But doesn't that just mean that closing duplicates is irrelevant? You're already getting a never-ending stream of the same basic low-quality submissions. So the two possibilities are: filter them, or go away.



It would only be useful to me if it removed duplicates. Otherwise, what's the point? You just end up seeing the same thing twice.

It would be nice if 5x duplicates didn't show up in the RSS feed.

By deduplication, do you mean that you want articles on the same subject to be filtered out or do you just want the same article not to be displayed multiple times?

> By deduplication, do you mean that you want articles on the same subject to be filtered out or do you just want the same article not to be displayed multiple times?

I just want the same article not displayed multiple times. For example, if you subscribe to multiple feeds from a newspaper, some articles which fit more than one category will appear multiple times.

Articles on the same subject would be handled by 'grouping', according to what I wrote above.
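
As a sketch of that per-article dedup (not any particular reader's implementation), assuming the feed entries are already parsed into plain dicts with optional guid/link fields:

    # Drop entries that point at the same article, keyed on GUID when present,
    # otherwise on a normalized link. Assumes entries are dicts with optional
    # "guid" and "link" keys; the field names are illustrative.
    from urllib.parse import urlsplit, urlunsplit

    def canonical_key(entry: dict) -> str:
        guid = entry.get("guid")
        if guid:
            return guid
        # Fall back to the link with query string and fragment stripped,
        # so tracking parameters don't defeat the comparison.
        parts = urlsplit(entry.get("link", ""))
        return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

    def dedupe_entries(entries: list[dict]) -> list[dict]:
        seen: set[str] = set()
        unique = []
        for entry in entries:
            key = canonical_key(entry)
            if key not in seen:
                seen.add(key)
                unique.append(entry)
        return unique

Grouping same-subject articles would then be a separate, fuzzier pass (title similarity or the like) on top of this exact-identity check.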


Meh. It's just a method of filtration.

When you can only tackle so many things, there is zero point in having a bug tracker that will only inflate. With little in the way of automatic triage to determine priority, it is a borderline impossible task to wade through them.

Allowing duplicates is fine; the important stuff comes up again whilst the unimportant bits die off.


One of the major problems is removing duplicate or near-duplicate content like images, text, etc.

During search, we do remove duplicates. It's not a bad idea, though, and I'll see how we can support it.
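
For the near-duplicate text case, a common baseline is word shingles compared with Jaccard similarity; a minimal sketch, with the threshold picked purely for illustration (images would need something different, such as perceptual hashing):

    # Flag two documents as near-duplicates when their word-shingle sets
    # overlap heavily (Jaccard similarity above a threshold).
    def shingles(text: str, k: int = 5) -> set[tuple[str, ...]]:
        words = text.lower().split()
        return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

    def jaccard(a: set, b: set) -> float:
        if not a and not b:
            return 1.0
        return len(a & b) / len(a | b)

    def near_duplicate(doc1: str, doc2: str, threshold: float = 0.8) -> bool:
        return jaccard(shingles(doc1), shingles(doc2)) >= threshold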

While I'm inclined to agree, the duplicates need to be easily identifiable, and preferably filterable by quality, for bulk downloads.

I also don't understand why they're doing it. On my long-running full history, duplicates cost me a factor of 2 in space, and presumably the same in search time (though it's still perceptually instant).

Without questioning this line of thought, it seems like deduplicating by lowercasing and perhaps removing dots is a good choice, but stripping +suffixes seems likely to generate more user annoyance than it prevents. If I filter based on those suffixes and you send me mail and strip the suffix, I'm going to be pissed.
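
A sketch of the normalization being debated, with dot removal treated as a Gmail-style assumption (not all providers ignore dots) and the contested +suffix stripping left as an explicit opt-in:

    # Normalize an email address for deduplication. Dot removal in the local
    # part is a Gmail-style convention and not universally valid; +suffix
    # stripping is optional because, as noted above, it can break user filters.
    def normalize_email(address: str, strip_plus_suffix: bool = False) -> str:
        local, _, domain = address.strip().lower().rpartition("@")
        if strip_plus_suffix:
            local = local.split("+", 1)[0]
        local = local.replace(".", "")
        return f"{local}@{domain}"

With strip_plus_suffix=False, "John.Doe+news@Example.com" becomes "johndoe+news@example.com"; only the opt-in collapses it to "johndoe@example.com".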

No, networks are dumb. They do not detect duplicates; that would require vast storage. Unless you have a TCP (e.g. HTTP) proxy on the route, you will be the one filtering the duplicates.
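
If the filtering does land on your side of the wire, the simplest version is to hash each payload as it arrives and drop repeats; a minimal in-memory sketch (the growing hash set is exactly the storage cost the network itself won't pay):

    import hashlib

    # Drop payloads whose content hash has already been seen. Keeping every
    # hash around is precisely the state the network refuses to hold, which
    # is why this lives at the endpoint or at a proxy.
    class DuplicateFilter:
        def __init__(self) -> None:
            self._seen: set[str] = set()

        def is_new(self, payload: bytes) -> bool:
            digest = hashlib.sha256(payload).hexdigest()
            if digest in self._seen:
                return False
            self._seen.add(digest)
            return True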

I'm seeing a fair few duplicates in the results; I probably need to work on the algorithm for filtering these out.

We're still testing this out with real data, but it looks like it's actually quite useful to have duplicates in the database. The key is how to return results to someone coming along later. We're working on that now.
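
One way to keep duplicates in the database while sparing later readers is to collapse on a dedup key at read time; a rough sketch, with the record shape and field names invented for illustration:

    from itertools import groupby
    from operator import itemgetter

    # Keep all raw records, but when serving results, return only the most
    # recent record per dedup key. The "key" and "updated_at" fields are
    # illustrative, not a real schema.
    def latest_per_key(records: list[dict]) -> list[dict]:
        ordered = sorted(records, key=itemgetter("key", "updated_at"))
        return [list(group)[-1] for _, group in groupby(ordered, key=itemgetter("key"))]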

We committed upfront to not letting the site become overrun with "Yahoo Answers" style duplicated/low quality stuff. We'd much rather delete useless stuff than get an extra page view or two.


With a single stream, everything is contaminated.

And then many of the categories simply aren't worth anything, sorted or not.


+100

Which is exactly why static analysis tools that force you to do something need to be shot. Static analysis tools that inform you about a possible duplicate are totally fine. Give me an option to disable that particular instance.
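
As an illustration of per-instance opt-out, many linters accept inline suppression comments; with pylint's duplicate-code check it would look roughly like this (how well inline suppression works for that particular check has varied across pylint versions, so treat this as the shape of the idea rather than a guarantee; the function is a made-up placeholder):

    # pylint: disable=duplicate-code
    # Hypothetical module: the pragma above asks pylint not to report this
    # file's intentional similarity to code elsewhere; other checks still run.

    def load_config(path: str) -> dict:
        # Intentionally mirrors a sibling loader; the duplication is
        # acknowledged instead of being contorted away to satisfy the tool.
        with open(path, encoding="utf-8") as handle:
            return dict(line.strip().split("=", 1) for line in handle if "=" in line)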

Coincidentally, micro-services do away with such problems in many cases because the code is "separate", so analyzers and sticklers don't find the "duplicates" and you can write beautifully simple code. Unfortunately, it then has the opposite problem of leading to things like this Netflix architecture https://res.infoq.com/presentations/netflix-chaos-microservi... but for something simple like a personal blog (yes, I exaggerate - slightly).

In the end, I think the only solution is to have the right people and stay small enough to keep the right culture. That probably goes against all the company's metrics and growth goals, of course.


1. Why isn't there a duplicate filter that catches this?

2. Seems like a partial consequence of the higher churn rate modifications.


The "removing duplicates" section is the kind of thing that has me worried about technologies like this.

Are you saying this is a bad thing? Personally, I think it's a good thing that the duplicate detector is easy to get around. It allows submissions to get multiple chances.

I would recommend a user-curated feature that marks 'overlaps' as duplicates, which then get sent to a human to moderate, similar to user-curated flagging of spam or inappropriate material.
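
A rough sketch of that flag-then-moderate flow, with the names and the escalation threshold invented for illustration:

    from collections import defaultdict
    from dataclasses import dataclass, field

    # Users flag a pair of items as overlapping duplicates; once a pair has
    # enough independent flags, it is queued for a human moderator. The
    # threshold and types are illustrative only.
    @dataclass
    class DuplicateFlags:
        escalation_threshold: int = 3
        _flaggers: dict[tuple[str, str], set[str]] = field(default_factory=lambda: defaultdict(set))
        moderation_queue: list[tuple[str, str]] = field(default_factory=list)

        def flag(self, item_a: str, item_b: str, user: str) -> None:
            pair = tuple(sorted((item_a, item_b)))
            self._flaggers[pair].add(user)
            if len(self._flaggers[pair]) == self.escalation_threshold:
                self.moderation_queue.append(pair)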
