I wouldn't be surprised if a lot of the pageviews are bots/crawlers. They don't hover over links.


It's not uncommon for lightly trafficked sites to have 90% or more of their resources taken up by bots of various kinds. Remember, Google isn't the only search engine, and there are many crawlers that aren't search engines at all.
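
A rough way to check this for your own site is to scan the access log and count requests whose User-Agent looks bot-like. This is only a sketch: the log path, the combined log format, and the keyword list are all assumptions, and plenty of bots send browser-like UAs that a check like this won't catch.

    # Rough sketch: estimate the share of requests that self-identify as bots
    # via User-Agent substrings. Log path and keyword list are illustrative.
    import re

    BOT_HINTS = ("bot", "crawl", "spider", "slurp")  # hypothetical keyword list
    # In the combined log format, the User-Agent is the last quoted field.
    ua_pattern = re.compile(r'"[^"]*" "(?P<ua>[^"]*)"$')

    total = bots = 0
    with open("access.log") as log:  # assumed log location
        for line in log:
            match = ua_pattern.search(line)
            if not match:
                continue
            total += 1
            if any(hint in match.group("ua").lower() for hint in BOT_HINTS):
                bots += 1

    if total:
        print(f"{bots}/{total} requests ({100 * bots / total:.0f}%) look like bots")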

I have no insider knowledge, but I'm going to guess that they even crawl using actual Chrome browsers once in a while... and penalize deviations from what you serve to the GoogleBot... and also factor in response times for both.

I don't think any single person can predict/know what's actually going on at this point.


Probably it's just a search bot like Googlebot.

I think it's more likely that they have hidden links and references to these URLs scattered across their "normal" pages. A human would never click on them, but any bot that's just blindly following links would stumble across them, which alerts Yelp that it's likely a bot accessing the site.
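
Nothing public confirms exactly what Yelp does, but the general honeypot-link idea is easy to sketch. Assuming a Flask app (the route name here is made up): a link no human should ever see is planted in normal pages, and anything that requests it gets flagged.

    # Minimal honeypot-link sketch (an assumed mechanism, not Yelp's actual one).
    # A link invisible to humans is embedded in normal pages; any client that
    # requests it is presumed to be a bot blindly following hrefs.
    from flask import Flask, request

    app = Flask(__name__)
    suspected_bots = set()  # in practice, persistent storage

    HIDDEN_LINK = '<a href="/trap-9f3a" style="display:none">.</a>'

    @app.route("/")
    def index():
        # The trap link is present in the markup but invisible to humans.
        return f"<html><body>Normal page content.{HIDDEN_LINK}</body></html>"

    @app.route("/trap-9f3a")
    def trap():
        suspected_bots.add(request.remote_addr)  # flag the requester
        return "", 204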

Maybe the page content is cloaked, so that Googlebot sees completely different text from what human visitors see.
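
One way to spot-check for that kind of cloaking is to fetch the same URL with a Googlebot UA and a browser UA and compare. The URL below is a placeholder, and a real cloaker may key on Google's IP ranges rather than the UA string, so identical responses prove nothing.

    # Spot-check for user-agent cloaking: fetch the same URL as Googlebot and
    # as an ordinary browser, then compare the responses.
    import requests

    GOOGLEBOT_UA = ("Mozilla/5.0 (compatible; Googlebot/2.1; "
                    "+http://www.google.com/bot.html)")
    BROWSER_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"  # example browser UA

    def fetch(url, ua):
        return requests.get(url, headers={"User-Agent": ua}, timeout=10).text

    url = "https://example.com/some-article"  # placeholder URL
    as_bot, as_human = fetch(url, GOOGLEBOT_UA), fetch(url, BROWSER_UA)
    if as_bot != as_human:
        print(f"responses differ: {len(as_bot)} vs {len(as_human)} chars")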

These don't seem like crawlers, but rather proxies requesting information on behalf of a user (perhaps passing along that user's browser in the UA string). Those are "Google bots", but they aren't the Googlebot.
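
Google does document a way to tell the real Googlebot from anything else that merely claims the name: reverse-DNS the client IP, check that the hostname ends in googlebot.com or google.com, then forward-resolve it and confirm it maps back to the same IP. A sketch using only the standard library:

    # Google's documented check for the real Googlebot: reverse DNS, hostname
    # suffix check, then a forward lookup to confirm the round trip.
    import socket

    def is_real_googlebot(ip: str) -> bool:
        try:
            host, _, _ = socket.gethostbyaddr(ip)              # reverse DNS
            if not host.endswith((".googlebot.com", ".google.com")):
                return False
            return ip in socket.gethostbyname_ex(host)[2]      # forward-confirm
        except (socket.herror, socket.gaierror):
            return False

    print(is_real_googlebot("66.249.66.1"))  # address in a published Googlebot range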

Maybe a different user agent or something? I'm sure the companies writing these articles want them to have good SEO and therefore allow all sorts of bots to crawl them, so that they show up on search engines.

So it looks like Google grabbed about 40 links before giving up? I wonder what a good "score" is. At first guess, less is better, but too few runs the risk of throwing out potentially good pages, while too many means the bot is just wasting effort. That score of 40 could also vary with parallel conditions, assuming many bot instances are sharing a task pool. Be sure to post the Microsoft results if/when they crawl you.

I'm curious as to whether you considered if this would have any effect on PageRank for content you host?

This type of behavior - using JavaScript to present the user with links different from the ones Googlebot found - seems like the type of thing they frown on.


They are known to crawl using human-like user agents instead of the typical Googlebot one, precisely to counter this (weak) attempt at gaming the system. I'm surprised WSJ is surprised by the outcome here.

Yeah, just like Googlebot crawls the web putting an enormous load on people's servers (it can pull hundreds of thousands of pages per day from a single server), while Google itself is very sensitive to automated searches and will ban your IP in a heartbeat. They sure don't want to be crawled by anyone.

I would think most bots don't run JavaScript and therefore don't register in GA. I think a lot of people are bored and do click the articles on the New page. That said, even clicks from the front page are often not from actual readers (as you can tell by many of the comments).

I'm 99% sure I've encountered a Googlebot crawling pages with the UA of a regular browser, presumably for exactly this purpose.

There are site viruses that serve up normal content to normal users, and so go undetected, but serve up spam like that only to GoogleBot, since they're designed to affect PageRank etc. of specific sites.

That was likely the case.


Or the site is bad. Some sites show some content to Googlebot, and other content to mere mortals.

I'm pretty sure that's just to see if the site is serving different content to mobile vs desktop. I think they also sometimes hit pages with no mention of googlebot.

> These scam sites load megabytes of junk, load slowly, have text interspersed with ads and modals that render right on top of them

Only if you're not googlebot. The crawler sees a much nicer site.


Yeah, MSNbot isn't very friendly, so they somewhat bring it on themselves. Both it and Yahoo's bot are also much worse than Google at predicting likely page updates: on one of my sites that has a blank robots.txt, MSNbot accounts for more than 10x as many hits as GoogleBot, yet Google still manages to keep its index just as up to date.

Some people have had success with crawl-delay, though there are occasional reports of MSNbot not honoring that either.
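
For what it's worth, a well-behaved crawler can read Crawl-delay itself with Python's standard library (3.6+); whether a given bot then honors it is the open question. The site and bot name below are placeholders.

    # Reading Crawl-delay and access rules via urllib.robotparser.
    import time
    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser("https://example.com/robots.txt")
    rp.read()

    delay = rp.crawl_delay("mybot") or 1.0   # fall back to 1s if unspecified
    for path in ("/", "/about"):             # hypothetical pages to fetch
        if rp.can_fetch("mybot", f"https://example.com{path}"):
            print(f"fetching {path}, then sleeping {delay}s")
            time.sleep(delay)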


I doubt it. Google's crawling bots, like most search engine bots, follow the published standard for bot access rules (robots.txt).