
They get fed into a web crawler and then into a giant hopper whence they become the backbone of that shiny "No Code" technology you've been hearing about.



The way they're spliced in means they don't have to crawl anything. When you surf the Web you're crawling for them.

The same way Google handles it when they crawl these things.

Their crawlers are more advanced now. They already handle SPAs with # URLs and content that's only fetched once the JS executes, and they likely use some OCR and other AI-powered tricks on top.
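
No idea what they actually run internally, but the usual technique is a headless browser that waits for scripts and XHRs to finish before reading the DOM. A minimal sketch with Playwright (the URL is a placeholder):

    # Minimal sketch of JS-aware crawling with a headless browser,
    # assuming Playwright is installed (pip install playwright).
    from playwright.sync_api import sync_playwright

    def render(url: str) -> str:
        """Fetch a page and return the DOM after scripts have run."""
        with sync_playwright() as p:
            browser = p.chromium.launch()
            page = browser.new_page()
            # "networkidle" waits for XHR-fetched content to arrive
            page.goto(url, wait_until="networkidle")
            html = page.content()  # serialized DOM, not the raw response body
            browser.close()
            return html

    # Works for #-routed SPAs too, since routing happens client-side:
    print(render("https://example.com/#/products")[:500])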

They use distributed web crawlers to crawl hundreds of billions of web pages. Probably one of the following options (a toy sketch of the core loop follows the list):

1) They built their own crawlers.

2) They run an Apache Nutch/Heritrix cluster in a colo facility.

3) They use third-party services like mixnode.
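
To make option 1 concrete, here's a toy single-node sketch of the core loop (frontier, fetch, extract links). A real system shards the frontier across machines, dedupes at scale, and respects robots.txt; the seed URL and user-agent below are placeholders.

    from collections import deque
    from urllib.parse import urljoin, urlparse
    import requests
    from bs4 import BeautifulSoup

    def crawl(seed: str, limit: int = 50):
        frontier, seen = deque([seed]), {seed}
        while frontier and len(seen) <= limit:
            url = frontier.popleft()
            try:
                resp = requests.get(url, timeout=10,
                                    headers={"User-Agent": "toy-crawler/0.1"})
            except requests.RequestException:
                continue
            yield url, resp.text
            # Extract links, normalize them, and grow the frontier.
            for a in BeautifulSoup(resp.text, "html.parser").find_all("a", href=True):
                link = urljoin(url, a["href"]).split("#")[0]
                if urlparse(link).scheme in ("http", "https") and link not in seen:
                    seen.add(link)
                    frontier.append(link)

    for url, html in crawl("https://example.com"):
        print(url, len(html))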


So a "legit" crawler has no recourse but to resort to implementing technology that'll circumvent anti-crawler code? Can't say I didn't try to go legit :)

Huh this was a really interesting write up on semi-obscured code. I had never seriously thought about crawling through popular sites code like that, I'm definitely going to have to give it a go!

They kill your bandwidth. For a client's catalog site we discovered that crawlers accounted for more than half of the bandwidth costs.
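
If you want to check your own numbers, a rough pass over the access logs will do it. A sketch assuming the common "combined" log format; the bot-detection regex is crude and purely illustrative:

    import re

    BOT = re.compile(r"bot|crawl|spider|slurp", re.IGNORECASE)
    total = bots = 0
    with open("access.log") as log:
        for line in log:
            # combined format ends with: ... status bytes "referer" "user-agent"
            m = re.match(r'.* (\d+|-) "[^"]*" "([^"]*)"$', line)
            if not m:
                continue
            size = 0 if m.group(1) == "-" else int(m.group(1))
            total += size
            if BOT.search(m.group(2)):
                bots += size
    print(f"crawler share of bytes: {bots / max(total, 1):.1%}")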

They probably have a semi-automatic data collection system where you can type in a name, and it collects all sorts of public data in real time. To a non-technical person, this can look like they have a file on you, when in reality this crawler just compiled all the data in a few minutes.
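
To be clear about what I mean, here's a hypothetical sketch; every endpoint below is invented for illustration, not a real service:

    from concurrent.futures import ThreadPoolExecutor
    import requests

    SOURCES = {  # hypothetical public-records endpoints, not real URLs
        "court_records": "https://records.example.org/search?q={}",
        "property": "https://deeds.example.org/lookup?name={}",
        "social": "https://profiles.example.org/find?name={}",
    }

    def lookup(name: str) -> dict:
        """Query every source in parallel and collect whatever comes back."""
        def fetch(item):
            label, template = item
            try:
                return label, requests.get(template.format(name), timeout=10).json()
            except requests.RequestException:
                return label, None
        with ThreadPoolExecutor() as pool:
            return dict(pool.map(fetch, SOURCES.items()))

A couple of minutes of fan-out like this can look like a pre-built dossier to a non-technical person.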

That's how a web crawler works, though, right?

The idea of building a distributed crawler that runs in users' browsers sounds fascinating!! Now that is better than users burning power to mine bitcoins by solving pointless puzzles.
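
Just to sketch the idea (not a real system): the coordinator could be a tiny work-queue server, with a small JS snippet in each page calling /work and POSTing results back. In practice the browser's same-origin policy would limit which third-party pages a script can fetch, so treat this as a sketch only:

    # Hypothetical coordinator for browser-based crawling (Flask >= 2.0).
    from queue import Queue
    from flask import Flask, request, jsonify

    app = Flask(__name__)
    frontier = Queue()
    frontier.put("https://example.com/")  # placeholder seed

    @app.get("/work")
    def work():
        # Hand the visiting browser one URL to fetch, if any remain.
        return jsonify(url=frontier.get() if not frontier.empty() else None)

    @app.post("/result")
    def result():
        # Expects {"url": ..., "html": ..., "links": [...]} from the browser.
        doc = request.get_json()
        for link in doc.get("links", []):
            frontier.put(link)  # feed discovered links back into the frontier
        return ("", 204)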

I'm curious how they are crawling and indexing.

A web spider that's run by outsourcing work to China. We could have hundreds of thousands of people going to Flash websites and entering the contents into a database. Would work flawlessly.

Are there any details about how they're crawling the web? I've never encountered the BraveBot User-Agent and have never heard of such a crawler.

I'm guessing the work involves a web-crawler.

I used to work at a web hosting company; crawlers were an enormous share of our traffic. For some of our users, crawler traffic was the only traffic they got.

Pure speculation here: crawlers go for bulk and likely don't care if they pick up garbage. There may be another tier of email harvesting as thorough as you suggest, but given all the obfuscation conventions in use, like the one you show, such code would have to cover a lot of them, and the return might be very low.
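
To make that concrete, here's roughly what covering the conventions entails; each style needs its own pattern, and this barely scratches the surface:

    import re

    # One pattern per obfuscation convention; false positives creep in fast.
    PATTERNS = [
        re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),                      # plain
        re.compile(r"([\w.+-]+)\s*[\[(]?\s*at\s*[\])]?\s*([\w-]+)\s*"
                   r"[\[(]?\s*dot\s*[\])]?\s*(\w+)", re.IGNORECASE),  # a [at] b [dot] c
    ]

    def harvest(text: str) -> list[str]:
        found = []
        for pat in PATTERNS:
            for m in pat.finditer(text):
                g = m.groups()
                # Reassemble obfuscated matches back into user@domain.tld form.
                found.append(m.group(0) if not g else "{}@{}.{}".format(*g))
        return found

    print(harvest("contact jane [at] example [dot] com or bob@test.org"))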

Like everyone and their brother has a web spider. And some of them are VERY badly designed. We block them when they use too many resources, although we'd rather just let them be.
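
Not our actual setup, but the usual shape of such a throttle is a per-client token bucket; the rate and burst numbers here are placeholders:

    import time
    from collections import defaultdict

    RATE, BURST = 2.0, 20.0  # tokens refilled per second, bucket capacity

    buckets = defaultdict(lambda: [BURST, time.monotonic()])

    def allow(client_ip: str) -> bool:
        """Return False once a client has burned through its budget."""
        tokens, last = buckets[client_ip]
        now = time.monotonic()
        tokens = min(BURST, tokens + (now - last) * RATE)  # refill
        if tokens < 1.0:
            buckets[client_ip] = [tokens, now]
            return False  # over budget: serve a 429 instead of banning outright
        buckets[client_ip] = [tokens - 1.0, now]
        return True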

Can't speak for the OP, but we have APIs and move the ones scraping and reselling our content onto them. The majority are just a worthless suck on resources, though.


I mean, they did write their own crawler and have huge financial incentives to respect it.

What isn't known is whether those sites' content will still end up in training corpora built from CommonCrawl, ThePile, etc.
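
You can at least check whether a domain shows up in a given Common Crawl snapshot through its CDX index API. The crawl ID below is just an example; current IDs are listed at https://index.commoncrawl.org/:

    import requests

    def in_common_crawl(domain: str, crawl_id: str = "CC-MAIN-2023-50") -> bool:
        """Ask the CDX index whether any captures exist for this domain."""
        resp = requests.get(
            f"https://index.commoncrawl.org/{crawl_id}-index",
            params={"url": f"{domain}/*", "output": "json", "limit": "1"},
            timeout=30,
        )
        # The index returns NDJSON hits on success, a 404 when none exist.
        return resp.status_code == 200 and bool(resp.text.strip())

    print(in_common_crawl("example.com"))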


Any idea how Google digests/consumes web pages while crawling? Does it strip out all the HTML and store just the plain text? If that's the case, can you share some more info on how they do it?
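
Nobody outside Google knows the details, but the basic "strip the markup, keep the text" step can be sketched with BeautifulSoup. A real indexer keeps far more than plain text (titles, anchor text, term positions, links):

    from bs4 import BeautifulSoup

    def extract_text(html: str) -> str:
        soup = BeautifulSoup(html, "html.parser")
        for tag in soup(["script", "style", "noscript"]):
            tag.decompose()  # drop non-content markup entirely
        # Collapse whitespace so the stored text is compact.
        return " ".join(soup.get_text(separator=" ").split())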

I think there is no way they are going to scrape the websites, as there are millions of them, each with its own structure.

