
I don't think they do any such thing, if anything they are rotating IPs/user agents to avoid being limited or blocked.

Google requires sites to send the crawler the same content as someone clicking a link on a Google results page would see, so even if some sites get creative covering it up with blurred boxes and similar dark patterns, the data is there in the markup.




Hmm. True. Though obviously crawlers can be allowed different access.

Doesn't Google have a policy that sites must allow access from Google links in order to have content indexed? See for example the New York Times and their paywall, which is famously circumvented via Google.

That seems to indicate it's not that clear cut. My suspicion is they haven't really thought about it yet.


Doesn't Google penalize sites that show something different to the crawler than they do to a regular user?

I actually think the crawlers are more sophisticated than that. Recently I got an email from Google saying the text on my website was not big enough when the site was browsed at a low resolution, which is why they chose to lower my rank on the results page. Clearly they have pretty advanced tests for the accessibility of text, not only the content.

Yeah, the page when you click through has to be the same as what Google indexed; I think showing something different is referred to as cloaking?

Google's crawler gets preferential treatment on many sites. Not necessarily because of private deals, but because it drives traffic from search results.

Quote via [2]

--- start quote ---

When Mr. Maril started researching how sites treated Google’s crawler, he downloaded 17 million so-called robots.txt files — essentially rules of the road posted by nearly every website laying out where crawlers can go — and found many examples where Google had greater access than competitors.

ScienceDirect, a site for peer-reviewed papers, permits only Google’s crawler to have access to links containing PDF documents. Only Google’s computers get access to listings on PBS Kids. On Alibaba.com, the U.S. site of the Chinese e-commerce giant Alibaba, only Google’s crawler is given access to pages that list products.

--- end quote ---
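The robots.txt patterns described in the quote can be checked with the standard library. This is a sketch with a hypothetical rules file modeled on the ScienceDirect example, not the site's actual robots.txt:

```python
# Compare what different crawlers may fetch under one robots.txt.
# The rules below are invented for illustration.
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /papers/pdf/

User-agent: *
Disallow: /papers/pdf/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

url = "https://example.com/papers/pdf/123.pdf"
googlebot_ok = rp.can_fetch("Googlebot", url)    # matches the Googlebot group: allowed
other_bot_ok = rp.can_fetch("SomeOtherBot", url)  # falls through to *: disallowed
print(googlebot_ok, other_bot_ok)
```

Running this prints `True False`: only the crawler named in the dedicated group gets through, which is exactly the asymmetry the study measured at scale.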

[1] https://www.nytimes.com/2020/12/14/technology/how-google-dom...

[2] https://daringfireball.net/linked/2020/12/14/wakabayashi-goo...


Because Google will penalize the page if the crawled content is not the same as the content displayed when you click their link.

A number of paywalled sites allow Google through either because it's in their self interest to do so or because they have a business arrangement with them. This has been going on for a long time. It used to be that Google would penalize a site in its search results for showing their web crawler something other than what a user would see but that appears to have been relaxed in recent years, at least for the larger paywall sites.[1]

[1] I believe that's the case... I don't recall in earlier years paywall pages ranking so high in the search results but my memory could be faulty on this.


I'm fairly sure Google's search crawler already uses a masked UA, to detect when pages serve it different content than they do to users.

Because of a black-hat trick called cloaking [1], Google penalizes pages that show different content to users and crawlers. So web sites serve a free version of their content to Google.

[1] https://support.google.com/webmasters/answer/66355?hl=en


Blame the site operators. They probably have specifically allowed the search engine web crawlers to index their sites, while paywalling everybody else.

One possible solution would be for you to use the same user-agent string that the search engines are using, thus you will see the same content.

https://developers.google.com/search/docs/advanced/crawling/...
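The user-agent approach suggested above can be sketched with the standard library. Note this only works against sites that cloak by UA string alone; Google also publishes its crawler IP ranges and many sites verify them via reverse DNS, so spoofing the UA often isn't enough:

```python
# Fetch a URL while presenting Googlebot's user-agent string, to see
# whether the server varies its response by UA. Purely illustrative.
import urllib.request

GOOGLEBOT_UA = ("Mozilla/5.0 (compatible; Googlebot/2.1; "
                "+http://www.google.com/bot.html)")

def fetch_as(url, user_agent):
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

# Usage (requires network access; example.com is a placeholder):
# body = fetch_as("https://example.com/article", GOOGLEBOT_UA)
```

Comparing the bytes returned for the Googlebot UA against a normal browser UA would reveal simple UA-based cloaking.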


The next test could be: does Google crawl hidden text (display:none, very small, or very transparently colored text)? My guess is they do crawl it because it can have legitimate uses, but if there is too much of it on a page then they give it a lower ranking.
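A crude version of that hidden-text check can be sketched with the standard library's HTML parser. Real crawlers render the page, while this only inspects inline styles, so it is illustrative only (the class name and heuristics are my own):

```python
# Flag text nested inside elements whose inline style hides it
# (display:none or zero font size). A toy heuristic, not Google's.
from html.parser import HTMLParser

class HiddenTextFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.hidden_depth = 0    # nesting depth inside a hidden element
        self.hidden_text = []

    def handle_starttag(self, tag, attrs):
        style = dict(attrs).get("style", "").replace(" ", "").lower()
        if "display:none" in style or "font-size:0" in style:
            self.hidden_depth += 1
        elif self.hidden_depth:
            self.hidden_depth += 1   # children of hidden elements are hidden too

    def handle_endtag(self, tag):
        if self.hidden_depth:
            self.hidden_depth -= 1

    def handle_data(self, data):
        if self.hidden_depth and data.strip():
            self.hidden_text.append(data.strip())

finder = HiddenTextFinder()
finder.feed('<p>visible</p><div style="display:none"><p>stuffed keywords</p></div>')
print(finder.hidden_text)  # ['stuffed keywords']
```

A ranking heuristic could then compare the volume of hidden text against visible text, much as the comment above speculates.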

I wonder if it is Google's visual site previews/thumbnails that you get when you click on the arrow at the side of a search result, that are doing this.

Perhaps Google fetches the crawled page from the cache and then renders that for the previews?


How Google displays their own search results (tables or divs) is irrelevant to how their crawler interprets everyone else's pages.

I'm inclined to agree with you. It's B.S. when sites allow Google and / or other search engine spiders a full view to get the content indexed, then proceed to deny access and hide said content behind a paywall when human beings discover it via said search engines.

In fact, I thought Google had a policy that sites must show the same content to end users as to the web crawler. Anyone else recall their position on it? Are these policies actively enforced? If not actively enforced, why not?

Archive.is bypass: http://archive.fo/9v3dm

edit: The bypass didn't work out well in this case; the NYT page is too slow to load or render in time, and many of the images ended up blank.


But how is Google getting headers from the sites' users? The headers should be coming from their crawler.

It's not "their way of doing things", it is just the way the web works (per the HTTP spec). Any crawler would have done the same thing in that situation -- the fact that it happened to be Google is merely coincidental. Given the scale at which they operate, you can't expect Google or any other web-scale crawler to be mind-readers.

The reason for this is that Google penalizes sites that show content to the crawler and then hide it from people who visit the page (i.e. non-subscribers). If they want it in the index, it has to be actually viewable.

The alternative would be to not show paywalled content to Google's crawlers, and have them entirely left out of search results.


Isn't it actually against Google's policy to show different content to the crawler than to actual users? I might be misremembering, but I thought at one point this practice could get you delisted.

They definitely used results from Google.

They use results for terms users entered into Google to crawl pages that are not in their index (the torsorophy example, which is not an artificial one), thereby enriching their index based on Google's results and increasing its depth.

As for ranking, it is blurrier, but when you record users' clicks, which directly correlate with ranking, it starts to stink.

