I don't think they do any such thing; if anything, they are rotating IPs and user agents to avoid being rate-limited or blocked.
Google requires sites to send the crawler the same content as someone clicking a link on a Google results page would see, so even if some sites get creative covering it up with blurred boxes and similar dark patterns, the data is there in the markup.
Hmm. True. Though obviously crawlers can be allowed different access.
Doesn't Google have a policy that sites must allow access from Google links in order to have content indexed? See for example the New York Times and their paywall, which is famously circumvented via Google.
That seems to indicate it's not that clear cut. My suspicion is they haven't really thought about it yet.
I actually think the crawlers are more sophisticated than that. Recently I got an email from Google saying the text on my website was not big enough when browsing my site at a low resolution, which is why they chose to lower my rank on the results page. Clearly they have pretty advanced tests for the accessibility of text, not just the content.
Google's crawler gets preferential treatment on many sites. Not necessarily because of private deals, but because it drives traffic from search results.
Quote via [2]
--- start quote ---
When Mr. Maril started researching how sites treated Google’s crawler, he downloaded 17 million so-called robots.txt files — essentially rules of the road posted by nearly every website laying out where crawlers can go — and found many examples where Google had greater access than competitors.
ScienceDirect, a site for peer-reviewed papers, permits only Google’s crawler to have access to links containing PDF documents. Only Google’s computers get access to listings on PBS Kids. On Alibaba.com, the U.S. site of the Chinese e-commerce giant Alibaba, only Google’s crawler is given access to pages that list products.
--- end quote ---
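The pattern the quote describes can be expressed in a few lines of robots.txt. This is a hypothetical sketch of the idea, not ScienceDirect's or Alibaba's actual file:

```
# Hypothetical robots.txt: Google's crawler gets access to PDF links,
# every other crawler is shut out of them.
User-agent: Googlebot
Allow: /pdf/

User-agent: *
Disallow: /pdf/
```

Note that robots.txt is advisory: it tells well-behaved crawlers where they may go, but it does not technically enforce anything.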
A number of paywalled sites allow Google through, either because it's in their self-interest to do so or because they have a business arrangement with Google. This has been going on for a long time. It used to be that Google would penalize a site in its search results for showing the web crawler something other than what a user would see, but that appears to have been relaxed in recent years, at least for the larger paywalled sites.[1]
[1] I believe that's the case... I don't recall in earlier years paywall pages ranking so high in the search results but my memory could be faulty on this.
Due to a black-hat trick called cloaking [1], Google penalizes pages that show different content to users and crawlers. So websites serve a free version of their content to Google.
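For the avoidance of doubt, "cloaking" just means branching on who is asking. A minimal sketch of the pattern (the function names and strings are illustrative, not any real site's code):

```python
def is_googlebot(user_agent: str) -> bool:
    # Naive check on the User-Agent header. Real crawler verification also
    # requires a reverse-DNS lookup of the client IP, since this header is
    # trivially spoofed (which is exactly why UA spoofing sometimes works).
    return "Googlebot" in user_agent

def serve_article(user_agent: str, full_text: str, teaser: str) -> str:
    # The cloaking pattern: the crawler gets the full article so it ranks,
    # while everyone else gets a teaser plus a paywall prompt.
    return full_text if is_googlebot(user_agent) else teaser
```

A site doing this would return `full_text` to Googlebot and `teaser` to a regular browser, which is precisely the behavior Google's policies penalize.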
Blame the site operators. They probably have specifically allowed the search engine web crawlers to index their sites, while paywalling everybody else.
One possible workaround would be to use the same user-agent string that the search engines use, so that you see the same content.
The next test could be: does Google crawl hidden text (display:none, very small, or very transparently colored text)?
My guess is they do crawl it, because it can have legitimate uses, but if there is too much of it on a page then they give it a lower ranking.
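The simplest form of such a check is just pattern-matching on inline styles. This is a toy sketch of the heuristics speculated about above; a real crawler would render the page and compute effective styles rather than grep for strings:

```python
import re

# Inline-style patterns that commonly indicate text hidden from users.
HIDDEN_PATTERNS = [
    r"display\s*:\s*none",
    r"visibility\s*:\s*hidden",
    r"font-size\s*:\s*0(px)?\s*(;|$)",
]

def looks_hidden(style: str) -> bool:
    # Flag a style attribute if any hiding pattern matches.
    return any(re.search(p, style, re.IGNORECASE) for p in HIDDEN_PATTERNS)
```

This misses plenty (off-screen positioning, text the same color as the background, zero-opacity overlays), which is why rendering the page, as Google evidently does, catches far more than static analysis.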
I wonder if it is Google's visual site previews/thumbnails (the ones you get when you click the arrow next to a search result) that are doing this.
Perhaps Google fetches the crawled page from the cache and then renders that for the previews?
I'm inclined to agree with you. It's B.S. when sites allow Google and/or other search engine spiders a full view to get the content indexed, then proceed to deny access and hide said content behind a paywall when human beings discover it via said search engines.
In fact, I thought Google had a policy that sites must show the same content to end users as to the web crawler. Anyone else recall their position on it? Are these policies actively enforced? If not actively enforced, why not?
It's not "their way of doing things", it is just the way the web works (per the HTTP spec). Any crawler would have done the same thing in that situation -- the fact that it happened to be Google is merely coincidental. Given the scale at which they operate, you can't expect Google or any other web-scale crawler to be mind-readers.
The reason for this is that Google penalizes sites that show content to the crawler and then hide it from people who visit the page (i.e. non-subscribers). If they want it in the index, it has to be actually viewable.
The alternative would be to not show paywalled content to Google's crawlers, and have them entirely left out of search results.
Isn't it actually against Google's policy to show different content to the crawler than to actual users? I might be misremembering, but I thought at one point this practice could get you delisted.
They use Google's results for terms users entered into Google to crawl pages that are not in their index (the torsorophy example, which is not an artificial one), thereby enriching their index based on Google's results and increasing their depth.
As for ranking it is blurrier, but when you record users' clicks, which directly correlate with ranking, it starts to stink.