Hacker Read

0x6c6f6c · 2017-03-31 20:44:51+00:00

Might actually look into writing a quick script and checking this out. It'd be interesting to look at. Plus, I may only be able to scrape public projects, whereas private repositories may account for a large share of the real number, since those are typically paid for (except for student accounts, afaik).

wkat4242 | karma 10400 | avg karma 2.0 · | 2023-09-01 05:19:20

Interesting! I've been making some scripts privately for specific tasks, like an Nvidia video card I wanted to buy. However, many sites are really hostile to scraping these days. I'll give it a try.

lippihom | karma 43 | avg karma 0.9 · | 2023-08-18 03:27:25

Very cool - I assume most of the data is gathered through scraping? Quite a few of the sources you're pulling from don't look like they have APIs available.

mistermann | karma 4634 | avg karma 0.56 · | 2018-03-22 00:02:39

This is brilliant, how did you do it, scraping or via an API? I'd love to do something similar for different subject areas.

soultrees | karma 638 | avg karma 3.94 · | 2023-10-13 13:57:01

I’m interested in hearing more about this. It might not pertain to the data I’m thinking of specifically, since that would be scraped, but finding out where the holes are and creating proprietary data around them would be a consideration. However, I’m interested in diving deeper into alternative data if you have some solid sources I can brush up on.

skinnymuch | karma 2879 | avg karma 0.75 · | 2022-01-23 21:53:44

Any public code of the scraping? I’ve been meaning to do something similar.

cylo | karma 2591 | avg karma 20.9 · | 2022-02-22 13:34:39

Thanks. Absolutely, please feel free to scrape. Though that's always a brittle solution. I've added a JSON API to my todo list for it.

fallenatreus | karma 5 | avg karma 0.36 · | 2021-01-10 22:08:14+00:00

Awesome, what's the data source? Are you using some kind of API or just scraping?

ushtaritk421 | karma 97 | avg karma 1.76 · | 2021-11-07 15:25:12

Here's a project that's been in the news recently that relies heavily on scraped data. http://www.thebillionpricesproject.com/

fireworks10 | karma 206 | avg karma 4.29 · | 2017-09-19 18:52:45

Awesome. Curious, what are you using to scrape sources, and for social counts?

mapster | karma 355 | avg karma 0.48 · | 2017-11-14 22:24:06+00:00

Mind if I ask what info/data you are scraping and for what ends?

bromquinn | karma 33 | avg karma 1.18 · | 2020-11-20 18:31:13

This is slick! I could see using this for one of my own projects. How do you get the data? Are you using the official APIs or are you 'scraping'?

simonw | karma 58201 | avg karma 7.31 · | 2020-10-09 22:41:40+00:00

That's really smart. I found the scraping action here: https://github.com/quacs/quacs-data/blob/master/.github/work...

Looks like it's even tracking covid data released by the school: https://github.com/quacs/quacs-data/commits/master/covid.jso...

hedora | karma 21378 | avg karma 2.8 · | 2020-02-08 15:34:55+00:00

I’ve implemented a few scrapers for various purposes, and run bulk data analysis over the results.

It was years ago, and it was not very difficult.

I’d guess this would be an easy summer intern project for a bright undergrad.

welanes | karma 1138 | avg karma 4.74 · | 2019-10-29 19:14:13

Sure. I'm thinking a certain amount of cloud scraping / API request credits in the free tier and any number above that falls into a paid tier.

vkb | karma 1536 | avg karma 10.74 · | 2015-07-21 12:12:18

Nice writeup. I may have missed this, but how did you get the data? Did you scrape the site?

tagfolder | karma 81 | avg karma 4.76 · | 2015-10-17 11:58:35+00:00

Yes, I have similar experience. I just developed my own Python scraping solution to automate the process. But many clients want the script plus the data and are willing to pay only $15.

Ayesh | karma 3416 | avg karma 3.83 · | 2023-08-11 07:01:28

I have a couple of similar scrapers as well. One is a private repo that collects visa information from Wikipedia (for Visalogy.com), and another collects GeoIP information from the MaxMind database (used with their permission).

https://github.com/Ayesh/Geo-IP-Database/

It downloads the repo, splits the data by the first 8 bits (the first octet) of the IP address, and saves each group to its own JSON file. On every new scraper run, it creates a new tag and pushes it as a package, so dependents can simply update through their dependency manager.
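
A minimal sketch of that splitting step, assuming the data is grouped by the first octet of each IPv4 network and each group is written to its own JSON file. This is not the code from the linked repository: the record fields, the sample rows, and the function names are hypothetical.

```python
import ipaddress
import json
import tempfile
from collections import defaultdict
from pathlib import Path


def split_by_first_octet(records):
    """Group records by the first octet of their IPv4 network.

    Assumption: each record is a dict with a "network" key in CIDR form.
    Networks wider than /8 are filed under their starting octet only.
    """
    buckets = defaultdict(list)
    for rec in records:
        network = ipaddress.ip_network(rec["network"])
        if network.version != 4:
            continue  # this sketch only handles IPv4
        first_octet = int(network.network_address) >> 24
        buckets[first_octet].append(rec)
    return dict(buckets)


def write_buckets(buckets, out_dir):
    """Write one JSON file per octet (e.g. 1.json, 8.json) and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for octet, recs in sorted(buckets.items()):
        path = out_dir / f"{octet}.json"
        path.write_text(json.dumps(recs, indent=2))
        paths.append(path)
    return paths


# Hypothetical sample rows, not real GeoIP data.
SAMPLE = [
    {"network": "1.0.0.0/24", "country": "AU"},
    {"network": "1.0.1.0/24", "country": "CN"},
    {"network": "8.8.8.0/24", "country": "US"},
]

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for p in write_buckets(split_by_first_octet(SAMPLE), tmp):
            print(p.name)
```

A lookup then only needs to load the one file named after the first octet of the address being queried, rather than the whole dataset.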

BbzzbB | karma 1608 | avg karma 3.06 · | 2021-10-12 15:22:29

Thanks, I'll look into that. Deterring the casual scraper would basically be the goal: make them work enough that it's not worth the hassle relative to the price of legitimate access. A motivated, technical person with a lot of time will always be able to extract data that is shown client-side.

_jal | karma 13880 | avg karma 4.13 · | 2020-08-19 00:14:40+00:00

In another life when I was a consultant, I did work for a startup that scraped public data for exactly this sort of use.

They still exist and use nonpublic sources now, but built a nice subscription business scraping lis pendens notices and such. No reason someone else can't do the same.
