They get fed into a web crawler and then into a giant hopper whence they become the backbone of that shiny "No Code" technology you've been hearing about.
Their crawlers are more advanced now. They already handle SPAs with # URLs and content that's only fetched once the JS executes, and they likely use some OCR and more, powered by AI magic.
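The JS-rendering part is the easy bit to picture. Just a sketch, assuming Playwright and a placeholder URL; whatever the big crawlers actually run is obviously not public:

```typescript
// Sketch: render a JS-heavy SPA in headless Chromium before extracting text.
// Assumes the 'playwright' npm package; a real crawler adds queueing,
// politeness delays, retries, and far more robust wait conditions.
import { chromium } from 'playwright';

async function renderAndExtract(url: string): Promise<string> {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  // Wait for network activity to settle so #-routed views and XHR-fetched
  // content have a chance to load before we read the DOM.
  await page.goto(url, { waitUntil: 'networkidle' });
  const text = await page.innerText('body');
  await browser.close();
  return text;
}

// Example (placeholder URL): renderAndExtract('https://example.com/#/products').then(console.log);
```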
So a "legit" crawler has no recourse but to resort to implementing technology that'll circumvent anti-crawler code? Can't say I didn't try to go legit :)
Huh, this was a really interesting write-up on semi-obscured code. I had never seriously thought about crawling through popular sites' code like that; I'm definitely going to have to give it a go!
They probably have a semi-automatic data collection system where you can type in a name, and it collects all sorts of public data in real time. To a non-technical person, this can look like they have a file on you, when in reality this crawler just compiled all the data in a few minutes.
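Pure guesswork on what such a system might look like; the endpoints below are made up, and a real one would have to parse each source's format and respect its terms of use:

```typescript
// Sketch of a semi-automatic lookup: fan a name out to several public
// sources and merge whatever comes back. Endpoints are placeholders.
async function lookup(name: string): Promise<Record<string, unknown>> {
  const sources: Record<string, string> = {
    // hypothetical public-data endpoints
    registry: `https://public-registry.example/search?q=${encodeURIComponent(name)}`,
    news: `https://news-archive.example/api?person=${encodeURIComponent(name)}`,
  };
  const results: Record<string, unknown> = {};
  await Promise.allSettled(
    Object.entries(sources).map(async ([key, url]) => {
      const res = await fetch(url);
      results[key] = res.ok ? await res.json() : null;
    })
  );
  return results; // a "file" assembled in minutes, not something kept on disk
}
```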
The idea of building a distributed crawler that runs in users' browsers sounds fascinating!! Now that is better than users burning power to mine bitcoins by solving stupid puzzles.
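Roughly what each visitor's tab would do, sketched below; the /task and /result endpoints are hypothetical, and in practice the same-origin policy (CORS) blocks most cross-site fetches, which is one reason this tends to stay a thought experiment:

```typescript
// Sketch of the in-browser idea: ask a coordinator for a URL, fetch it,
// and post the result back. Coordinator endpoints are made up.
async function crawlOneTask(coordinator: string): Promise<void> {
  const task = await (await fetch(`${coordinator}/task`)).json(); // { url: string }
  try {
    const html = await (await fetch(task.url)).text();
    await fetch(`${coordinator}/result`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ url: task.url, html }),
    });
  } catch {
    // CORS or network failure: report the task as undoable
    await fetch(`${coordinator}/result`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ url: task.url, error: true }),
    });
  }
}
```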
A web spider that's run by outsourcing work to China. We could have hundreds of thousands of people going to Flash websites and entering the contents into a database. Would work flawlessly.
Pure speculation here: crawlers go for bulk and likely don't care if they pick up garbage. There may be another level of email harvesting that goes as deep as you suggest, but given all the conventions in use, like the one you show, such code would have to cover a lot of them. The return might be very low.
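To illustrate why coverage is the hard part, here's a rough sketch that only handles a few of the spelled-out conventions, and happily produces garbage on ordinary prose, which fits the bulk-over-precision point:

```typescript
// Sketch of a de-obfuscator covering a handful of "[at]" / "(dot)" style
// conventions; real harvesting would need many more rules per convention.
function deobfuscateEmails(text: string): string[] {
  const normalized = text
    .replace(/\s*\[\s*at\s*\]\s*|\s*\(\s*at\s*\)\s*|\s+at\s+/gi, '@')
    .replace(/\s*\[\s*dot\s*\]\s*|\s*\(\s*dot\s*\)\s*|\s+dot\s+/gi, '.');
  const matches = normalized.match(/[\w.+-]+@[\w-]+(?:\.[\w-]+)+/g);
  return matches ? Array.from(new Set(matches)) : [];
}

// deobfuscateEmails('mail me at jane [at] example [dot] com')  ->  ['jane@example.com']
```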
Like everyone and their brother has a web spider. And some of them are VERY badly designed. We block them when they use too many resources, although we'd rather just let them be.
Can't speak for the OP, but we have APIs and move the ones scraping and reselling our content over to them. The majority are just a worthless suck on resources though.
Any idea how Google digests/consumes web pages while crawling? Does it strip out all the HTML and store just the plain text? If so, can you share some more info on how they do it?
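Google doesn't publish its pipeline, so this is only the general idea: any indexer's first step is boiling the markup down to text plus a bit of structure. A sketch using cheerio (my choice, not anything Google actually uses):

```typescript
// Sketch of the markup-to-text step an indexer needs; the real Google
// pipeline is not public. Assumes the 'cheerio' npm package.
import * as cheerio from 'cheerio';

function extractForIndex(html: string): { title: string; text: string } {
  const $ = cheerio.load(html);
  $('script, style, noscript').remove();      // drop non-content markup
  const title = $('title').text().trim();     // keep a little structure around
  const text = $('body').text().replace(/\s+/g, ' ').trim();
  return { title, text };
}
```

Note that this is structure-agnostic: you don't need per-site rules just to get the visible text out, which is how a crawler copes with millions of differently built sites.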
I think there is no way they are going to scrape the websites, as there are millions of them, each with its own structure.