
And then half of the websites are not working because they think you are a crawler/malicious user.



And then crawlers completely mess up your data. And it’s another thing to maintain and process.

Until you get misbehaving users crawling every page.

Not to mention that if you were malicious, the crawler would probably be inferior to:

    while :; do curl "http://www.ycombinator.com"; done
That said, there are many crawlers out there, many of them probably more sophisticated in their ability to ruin someone else's day. Unless you're releasing an exploit, malicious users probably know how to abuse the internet more than you do.

Crawling sites actively hostile to your crawler seems like it has both useful and shady applications. What kinds of things do you use that for?

They kill your bandwidth. For a client's catalog site we discovered that crawlers accounted for more than half of the bandwidth costs.
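A crude way to estimate that share from a combined-format access log (a sketch; the log path and the user-agent patterns are assumptions, and field 10 is the response size in that format):

    awk '{ total += $10 }
         tolower($0) ~ /bot|crawler|spider|slurp/ { bots += $10 }
         END { printf "crawler share of bytes: %.1f%%\n", 100 * bots / total }' access.log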

Take THAT, evil web crawlers! Maybe they are trying to get poorly-written spiders to crash when they hit the site?

Indeed - about 39% of hits to my site come from crawlers.

How do I know? Thanks to GoAccess!! ;)
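For reference, GoAccess just parses the server's access log; a minimal invocation (the log path and format are assumptions) looks like:

    # interactive terminal dashboard
    goaccess access.log --log-format=COMBINED
    # or a static HTML report
    goaccess access.log --log-format=COMBINED -o report.html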


I used to work for a data-scraping firm and very often we would accidentally knock many web sites offline when we pointed our crawlers at them.

I'd love to agree with you, but the crawler problem is 100x worse today than it was a decade ago.


We should probably classify the crawler-identification problem as impossible and move along. Fewer resources wasted, and easier automation for everyone. Assuming a crawler is malicious is narrow-minded.

Here's an example of things you can do against malicious crawlers: http://www.hackerfactor.com/blog/index.php?/archives/762-Att....
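That post describes its own tricks; one generic measure in the same spirit (my sketch, not taken from the linked article) is to rate-limit aggressive clients at the firewall:

    # drop sources opening more than 10 new connections to port 80 within 10s
    # (thresholds are illustrative; uses the iptables "recent" match)
    iptables -A INPUT -p tcp --dport 80 -m state --state NEW \
        -m recent --set --name HTTP
    iptables -A INPUT -p tcp --dport 80 -m state --state NEW \
        -m recent --update --seconds 10 --hitcount 10 --name HTTP -j DROP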

It might be against the terms of service of the website you're crawling, which can put you in violation of the Computer Fraud and Abuse Act (i.e. you're considered to be "hacking" them).

Agree... this entire argument seems to assume I give a rats ass about 9000 different crawlers that give me literally zero benefit and only waste server resources. Most of those crawlers are for ad-soaked piss-poor search engines. I'd rather just block them all and only allow crawlers that know how to behave.

Clearly lots of people are trying but they have very strict anti-crawler measures that often trigger for me (a human) if I browse too quickly.

I don't stop crawlers; I just randomly feed them damaged/wrong data.

I especially love doing this for e-commerce sites. Now the tables have turned: try to guess which fraction of your scraped data is wrong.
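A minimal CGI-style sketch of the idea (the bot patterns, catalog.csv, and the price column are all hypothetical):

    #!/bin/sh
    # serve the real catalog to browsers, a randomly perturbed one to crawlers
    echo "Content-Type: text/csv"
    echo
    case "$HTTP_USER_AGENT" in
        *[Bb]ot*|*[Cc]rawler*|*[Ss]pider*)
            # scale the hypothetical price column (field 3) by 0.5..1.5
            awk -F, -v OFS=, 'BEGIN { srand() }
                NR > 1 { $3 = sprintf("%.2f", $3 * (0.5 + rand())) }
                { print }' catalog.csv ;;
        *)
            cat catalog.csv ;;
    esac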


I used to work at a webhosting company; crawlers were an enormous share of our traffic. For some of our users, crawler traffic was the only traffic they got.

I've put in a hard block for all crawlers on all pages. That works for my scenario, I think. Hopefully they don't lie in their user agent; if they do, it's going to get really bad.
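For what it's worth, the only blanket opt-out that honest crawlers respect is robots.txt; anything stricter has to key off the user agent, which is exactly what lying bots defeat. A one-liner (docroot path assumed):

    # ask every crawler that honors robots.txt to stay away entirely
    printf 'User-agent: *\nDisallow: /\n' > /var/www/html/robots.txt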

Still, you keep saying all that as if most websites even notice that they're being crawled, and as if their operators know exactly when and by whom. It's not like the admin gets a notification with precise details every time a crawler comes by. I don't think it's nearly as serious as you're trying to make it look.

What about the crawling? How much traffic does that generate? I'm just wondering if you are hitting one or more of the sites very hard and they complained to DO about it.

Also take a look at things like browser extensions. It's not uncommon for them to phone home every URL you visit, which someone then crawls for whatever reason.
