Not to mention if you were malicious, the crawler would probably be inferior to:
  while :; do curl "http://www.ycombinator.com"; done
That said, there are many crawlers out there, many of them probably more sophisticated in their ability to ruin someone else's day. Unless you're releasing an exploit, malicious users probably know how to abuse the internet more than you do.
We should probably classify the crawler-identification problem as impossible and move along. Fewer resources wasted and easier automation for everyone. Assuming every crawler is malicious is narrow-minded.
It might be against the terms of service of the website you're crawling, which could put you in violation of the Computer Fraud and Abuse Act (i.e. you'd be considered to be "hacking" them).
Agreed... this entire argument seems to assume I give a rat's ass about 9000 different crawlers that give me literally zero benefit and only waste server resources. Most of those crawlers are for ad-soaked, piss-poor search engines. I'd rather just block them all than allow crawlers that don't know how to behave.
I've put in a hard block for all crawlers on all pages. It works for my scenario, I think. Hopefully they don't lie in their user agent; if they do, it's going to be really bad.
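For reference, here's a minimal sketch of that kind of hard block, assuming an nginx setup (the user-agent pattern and paths are illustrative, not exhaustive). Well-behaved crawlers are turned away by robots.txt; the rest get caught by a user-agent check:

  # robots.txt — asks well-behaved crawlers to skip the whole site
  User-agent: *
  Disallow: /

  # nginx, inside a server {} block — refuse crawlers that ignore
  # robots.txt but still identify themselves honestly
  if ($http_user_agent ~* "(bot|crawler|spider|scraper)") {
      return 403;
  }

As you say, the gap is crawlers that lie in their user agent; those only show up later as anomalies in rate limits or logs.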
Still, you keep saying all that as if most websites even notice that they're being crawled, and as if their operators know exactly when and by whom. As if the admin gets a notification every time a crawler comes by, with precise details about it. I don't think it's nearly as serious as you're trying to make it look.
What about the crawling? How much traffic does that generate? I'm just wondering if you are hitting one or more of the sites very hard and they complained to DO about it.
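A quick way to answer that from the server side, assuming the default nginx "combined" access-log format (the log path is illustrative), is to tally requests per user agent; heavy crawlers float straight to the top:

  # field 6 of a quote-split combined log line is the user agent
  awk -F'"' '{print $6}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head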