
They get fed into a web crawler and then into a giant hopper whence they become the backbone of that shiny "No Code" technology you've been hearing about.



The way they're spliced in means they don't have to crawl anything. When you surf the Web you're crawling for them.

The same way Google handles it when they crawl these things.

Their crawlers are more advanced now. They already handle SPAs with # URLs and content that's only fetched once the JS executes, and they likely use some OCR and other AI-powered tricks on top.
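
No idea what they actually run internally, but the usual technique is a headless browser that waits for scripts and XHRs to finish before reading the DOM. A minimal sketch with Playwright (the URL is a placeholder):

    # Minimal sketch of JS-aware crawling with a headless browser,
    # assuming Playwright is installed (pip install playwright).
    from playwright.sync_api import sync_playwright

    def render(url: str) -> str:
        """Fetch a page and return the DOM after scripts have run."""
        with sync_playwright() as p:
            browser = p.chromium.launch()
            page = browser.new_page()
            # "networkidle" waits for XHR-fetched content to arrive
            page.goto(url, wait_until="networkidle")
            html = page.content()  # serialized DOM, not the raw response body
            browser.close()
            return html

    # Works for #-routed SPAs too, since routing happens client-side:
    print(render("https://example.com/#/products")[:500])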

They use distributed web crawlers to crawl hundreds of billions of web pages. Probably one of the following options (a toy sketch of the core loop follows the list):

1) They built their own crawlers.

2) They run an Apache Nutch/Heritrix cluster in a colo facility.

3) They use third-party services like mixnode.
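
To make option 1 concrete, here's a toy single-node sketch of the core loop (frontier, fetch, extract links). A real system shards the frontier across machines, dedupes at scale, and respects robots.txt; the seed URL and user-agent below are placeholders.

    from collections import deque
    from urllib.parse import urljoin, urlparse
    import requests
    from bs4 import BeautifulSoup

    def crawl(seed: str, limit: int = 50):
        frontier, seen = deque([seed]), {seed}
        while frontier and len(seen) <= limit:
            url = frontier.popleft()
            try:
                resp = requests.get(url, timeout=10,
                                    headers={"User-Agent": "toy-crawler/0.1"})
            except requests.RequestException:
                continue
            yield url, resp.text
            # Extract links, normalize them, and grow the frontier.
            for a in BeautifulSoup(resp.text, "html.parser").find_all("a", href=True):
                link = urljoin(url, a["href"]).split("#")[0]
                if urlparse(link).scheme in ("http", "https") and link not in seen:
                    seen.add(link)
                    frontier.append(link)

    for url, html in crawl("https://example.com"):
        print(url, len(html))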


So a "legit" crawler has no recourse but to resort to implementing technology that'll circumvent anti-crawler code? Can't say I didn't try to go legit :)

Huh this was a really interesting write up on semi-obscured code. I had never seriously thought about crawling through popular sites code like that, I'm definitely going to have to give it a go!

They kill your bandwidth. For a client's catalog site we discovered that crawlers accounted for more than half of the bandwidth costs.
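
If you want to check your own numbers, a rough pass over the access logs will do it. A sketch assuming the common "combined" log format; the bot-detection regex is crude and purely illustrative:

    import re

    BOT = re.compile(r"bot|crawl|spider|slurp", re.IGNORECASE)
    total = bots = 0
    with open("access.log") as log:
        for line in log:
            # combined format ends with: ... status bytes "referer" "user-agent"
            m = re.match(r'.* (\d+|-) "[^"]*" "([^"]*)"$', line)
            if not m:
                continue
            size = 0 if m.group(1) == "-" else int(m.group(1))
            total += size
            if BOT.search(m.group(2)):
                bots += size
    print(f"crawler share of bytes: {bots / max(total, 1):.1%}")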

They probably have a semi-automatic data collection system where you can type in a name, and it collects all sorts of public data in real time. To a non-technical person, this can look like they have a file on you, when in reality this crawler just compiled all the data in a few minutes.
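
To be clear about what I mean, here's a hypothetical sketch; every endpoint below is invented for illustration, not a real service:

    from concurrent.futures import ThreadPoolExecutor
    import requests

    SOURCES = {  # hypothetical public-records endpoints, not real URLs
        "court_records": "https://records.example.org/search?q={}",
        "property": "https://deeds.example.org/lookup?name={}",
        "social": "https://profiles.example.org/find?name={}",
    }

    def lookup(name: str) -> dict:
        """Query every source in parallel and collect whatever comes back."""
        def fetch(item):
            label, template = item
            try:
                return label, requests.get(template.format(name), timeout=10).json()
            except requests.RequestException:
                return label, None
        with ThreadPoolExecutor() as pool:
            return dict(pool.map(fetch, SOURCES.items()))

A couple of minutes of fan-out like this can look like a pre-built dossier to a non-technical person.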

That's how a web crawler works, though, right?

The idea of building a distributed crawler that runs in users' browsers sounds fascinating!! Now that is better than users burning power to mine bitcoins by solving pointless puzzles.
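
Just to sketch the idea (not a real system): the coordinator could be a tiny work-queue server, with a small JS snippet in each page calling /work and POSTing results back. In practice the browser's same-origin policy would limit which third-party pages a script can fetch, so treat this as a sketch only:

    # Hypothetical coordinator for browser-based crawling (Flask >= 2.0).
    from queue import Queue
    from flask import Flask, request, jsonify

    app = Flask(__name__)
    frontier = Queue()
    frontier.put("https://example.com/")  # placeholder seed

    @app.get("/work")
    def work():
        # Hand the visiting browser one URL to fetch, if any remain.
        return jsonify(url=frontier.get() if not frontier.empty() else None)

    @app.post("/result")
    def result():
        # Expects {"url": ..., "html": ..., "links": [...]} from the browser.
        doc = request.get_json()
        for link in doc.get("links", []):
            frontier.put(link)  # feed discovered links back into the frontier
        return ("", 204)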

I'm curious how they are crawling and indexing.

A web spider that's run by outsourcing work to China. We could have hundreds of thousands of people going to Flash websites and entering the contents into a database. Would work flawlessly.

Are there any details about how they're crawling the web? I've never encountered the BraveBot User-Agent and have never heard of such a crawler.

I'm guessing the work involves a web-crawler.

I used to work at a web hosting company; crawlers were an enormous share of our traffic. For some of our users, crawler traffic was the only traffic they got.

Pure speculation here: crawlers go for bulk and likely don't care if they pick up garbage. There may be another tier of email harvesting as thorough as you suggest, but given all the obfuscation conventions in use, like the one you show, such code would have to cover a lot of them, and the return might be very low.
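
To make that concrete, here's roughly what covering the conventions entails; each style needs its own pattern, and this barely scratches the surface:

    import re

    # One pattern per obfuscation convention; false positives creep in fast.
    PATTERNS = [
        re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),                      # plain
        re.compile(r"([\w.+-]+)\s*[\[(]?\s*at\s*[\])]?\s*([\w-]+)\s*"
                   r"[\[(]?\s*dot\s*[\])]?\s*(\w+)", re.IGNORECASE),  # a [at] b [dot] c
    ]

    def harvest(text: str) -> list[str]:
        found = []
        for pat in PATTERNS:
            for m in pat.finditer(text):
                g = m.groups()
                # Reassemble obfuscated matches back into user@domain.tld form.
                found.append(m.group(0) if not g else "{}@{}.{}".format(*g))
        return found

    print(harvest("contact jane [at] example [dot] com or bob@test.org"))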

Like everyone and their brother has a web spider. And some of them are VERY badly designed. We block them when they use too many resources, although we'd rather just let them be.
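
Not our actual setup, but the usual shape of such a throttle is a per-client token bucket; the rate and burst numbers here are placeholders:

    import time
    from collections import defaultdict

    RATE, BURST = 2.0, 20.0  # tokens refilled per second, bucket capacity

    buckets = defaultdict(lambda: [BURST, time.monotonic()])

    def allow(client_ip: str) -> bool:
        """Return False once a client has burned through its budget."""
        tokens, last = buckets[client_ip]
        now = time.monotonic()
        tokens = min(BURST, tokens + (now - last) * RATE)  # refill
        if tokens < 1.0:
            buckets[client_ip] = [tokens, now]
            return False  # over budget: serve a 429 instead of banning outright
        buckets[client_ip] = [tokens - 1.0, now]
        return True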

Can't speak for the OP, but we have APIs and move the ones scraping and reselling our content onto them. The majority are just a worthless suck on resources, though.


I mean, they did write their own crawler and have huge financial incentives to respect it.

What isn't known is whether those sites' content will still end up in training corpora built from CommonCrawl, ThePile, etc.
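
You can at least check whether a domain shows up in a given Common Crawl snapshot through its CDX index API. The crawl ID below is just an example; current IDs are listed at https://index.commoncrawl.org/:

    import requests

    def in_common_crawl(domain: str, crawl_id: str = "CC-MAIN-2023-50") -> bool:
        """Ask the CDX index whether any captures exist for this domain."""
        resp = requests.get(
            f"https://index.commoncrawl.org/{crawl_id}-index",
            params={"url": f"{domain}/*", "output": "json", "limit": "1"},
            timeout=30,
        )
        # The index returns NDJSON hits on success, a 404 when none exist.
        return resp.status_code == 200 and bool(resp.text.strip())

    print(in_common_crawl("example.com"))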


Any idea how Google digests/consumes web pages while crawling? Does it strip out all the HTML and store just the plain text? If that's the case, can you share some more info on how they do it?
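
Nobody outside Google knows the details, but the basic "strip the markup, keep the text" step can be sketched with BeautifulSoup. A real indexer keeps far more than plain text (titles, anchor text, term positions, links):

    from bs4 import BeautifulSoup

    def extract_text(html: str) -> str:
        soup = BeautifulSoup(html, "html.parser")
        for tag in soup(["script", "style", "noscript"]):
            tag.decompose()  # drop non-content markup entirely
        # Collapse whitespace so the stored text is compact.
        return " ".join(soup.get_text(separator=" ").split())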

I think there is no way they are going to scrape the websites, as there are millions of them, each with its own structure.

