
And then half of the websites are not working because they think you are a crawler/malicious user.



And then crawlers completely mess up your data. And it’s another thing to maintain and process.

Until you get misbehaving users crawling every page.

Not to mention that if you were malicious, the crawler would probably be inferior to:

    while :; do curl "http://www.ycombinator.com"; done
That said, there are many crawlers out there, many of them probably more sophisticated in their ability to ruin someone else's day. Unless you're releasing an exploit, malicious users probably know how to abuse the internet more than you do.

Crawling sites actively hostile to your crawler seems like it has both useful and shady applications. What kinds of things do you use that for?

They kill your bandwidth. For a client's catalog site we discovered that crawlers accounted for more than half of the bandwidth costs.
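A crude way to estimate that share from a combined-format access log (a sketch; the log path and the user-agent patterns are assumptions, and field 10 is the response size in that format):

    awk '{ total += $10 }
         tolower($0) ~ /bot|crawler|spider|slurp/ { bots += $10 }
         END { printf "crawler share of bytes: %.1f%%\n", 100 * bots / total }' access.log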

Take THAT, evil web crawlers! Maybe they are trying to get poorly-written spiders to crash when they hit the site?

Indeed - about 39% of hits to my site come from crawlers.

How do I know? Thanks to GoAccess!! ;)
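For reference, GoAccess just parses the server's access log; a minimal invocation (the log path and format are assumptions) looks like:

    # interactive terminal dashboard
    goaccess access.log --log-format=COMBINED
    # or a static HTML report
    goaccess access.log --log-format=COMBINED -o report.html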


I used to work for a data-scraping firm and very often we would accidentally knock many web sites offline when we pointed our crawlers at them.

I'd love to agree with you, but the crawler problem is 100x worse today than it was a decade ago.


We should probably classify the crawler-identification problem as impossible and move along. Fewer resources wasted, and easier automation for everyone. Assuming a crawler is malicious is narrow-minded.

Here's an example of things you can do against malicious crawlers: http://www.hackerfactor.com/blog/index.php?/archives/762-Att....
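That post describes its own tricks; one generic measure in the same spirit (my sketch, not taken from the linked article) is to rate-limit aggressive clients at the firewall:

    # drop sources opening more than 10 new connections to port 80 within 10s
    # (thresholds are illustrative; uses the iptables "recent" match)
    iptables -A INPUT -p tcp --dport 80 -m state --state NEW \
        -m recent --set --name HTTP
    iptables -A INPUT -p tcp --dport 80 -m state --state NEW \
        -m recent --update --seconds 10 --hitcount 10 --name HTTP -j DROP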

It might be against the terms of service of the website you're crawling, which can put you in violation of the Computer Fraud and Abuse Act (i.e. you're considered to be "hacking" them).

Agree... this entire argument seems to assume I give a rats ass about 9000 different crawlers that give me literally zero benefit and only waste server resources. Most of those crawlers are for ad-soaked piss-poor search engines. I'd rather just block them all and only allow crawlers that know how to behave.

Clearly lots of people are trying but they have very strict anti-crawler measures that often trigger for me (a human) if I browse too quickly.

I don't stop crawlers; I just randomly feed them damaged/wrong data.

I especially love doing this for e-commerce sites. Now the tables have turned: try to guess which fraction of your scraped data is wrong.
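A minimal CGI-style sketch of the idea (the bot patterns, catalog.csv, and the price column are all hypothetical):

    #!/bin/sh
    # serve the real catalog to browsers, a randomly perturbed one to crawlers
    echo "Content-Type: text/csv"
    echo
    case "$HTTP_USER_AGENT" in
        *[Bb]ot*|*[Cc]rawler*|*[Ss]pider*)
            # scale the hypothetical price column (field 3) by 0.5..1.5
            awk -F, -v OFS=, 'BEGIN { srand() }
                NR > 1 { $3 = sprintf("%.2f", $3 * (0.5 + rand())) }
                { print }' catalog.csv ;;
        *)
            cat catalog.csv ;;
    esac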


I used to work at a webhosting company; crawlers were an enormous share of our traffic. For some of our users, crawler traffic was the only traffic they got.

I've put in a hard block for all crawlers on all pages. That works for my scenario, I think. Hopefully they don't lie in their user agent; if they do, it's going to get really bad.
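For what it's worth, the only blanket opt-out that honest crawlers respect is robots.txt; anything stricter has to key off the user agent, which is exactly what lying bots defeat. A one-liner (docroot path assumed):

    # ask every crawler that honors robots.txt to stay away entirely
    printf 'User-agent: *\nDisallow: /\n' > /var/www/html/robots.txt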

Still, you keep saying all that as if most websites even notice that they're being crawled, and as if their operators know exactly when and by whom. It's not like the admin gets a notification with precise details every time a crawler comes by. I don't think it's nearly as serious as you're trying to make it look.

What about the crawling? How much traffic does that generate? I'm just wondering if you are hitting one or more of the sites very hard and they complained to DO about it.

Also take a look at things like browser extensions. It's not uncommon for them to phone home every URL you visit, which someone then crawls for whatever reason.
