
I might actually write a quick script and check this out. It'd be interesting to look at. That said, I may only be able to scrape public projects, and private repositories may account for a large share of the real number, since those are typically paid for (except for student accounts, afaik).



Interesting! I've been making some scripts privately for specific tasks, like an Nvidia video card I wanted to buy. However, many sites are really hostile to scraping these days. I'll give it a try.

Very cool - I assume most of the data is gathered through scraping? Quite a few of the sources you're pulling from don't look like they have APIs available.

This is brilliant, how did you do it, scraping or via an API? I'd love to do something similar for different subject areas.

I’m interested in hearing more about this. It might not pertain to the data I’m thinking of specifically, since that would be scraped, but finding out where the holes are and creating proprietary data around them would be a consideration. I’m also interested in diving deeper into alternative data if you have some solid sources I can brush up on.

Is any of the scraping code public? I’ve been meaning to do something similar.

Thanks. Absolutely, please feel free to scrape. Though that's always a brittle solution. I've added a JSON API to my todo list for it.

Awesome, what's the data source? Are you using some kind of API or just scraping?

Here's a project that's been in the news recently that relies heavily on scraped data. http://www.thebillionpricesproject.com/

Awesome. Curious, what are you using to scrape sources, and for social counts?

Mind if I ask what info/data you are scraping and for what ends?

This is slick! I could see using this for one of my own projects. How do you get the data? Are you using the official APIs or are you 'scraping'?

That's really smart. I found the scraping action here: https://github.com/quacs/quacs-data/blob/master/.github/work...

Looks like it's even tracking COVID data released by the school: https://github.com/quacs/quacs-data/commits/master/covid.jso...


I’ve implemented a few scrapers for various purposes, and run bulk data analysis over the results.

It was years ago, and it was not very difficult.

I’d guess this would be an easy summer intern project for a bright undergrad.


Sure. I'm thinking a certain amount of cloud scraping / API request credits in the free tier and any number above that falls into a paid tier.

Nice writeup. I may have missed this, but how did you get the data? Did you scrape the site?

Yes, I have similar experience. I just developed my own Python scraping solution to automate the process. But many clients want the script plus the data and are willing to pay only $15.

I have a couple of similar scrapers as well. One is a private repo that collects visa information from Wikipedia (for Visalogy.com), and another collects GeoIP information from the MaxMind database (used with their permission).

https://github.com/Ayesh/Geo-IP-Database/

It downloads the repo, and dumps the data split by the first 8 bytes of the IP address, and saves to individual JSON files. For every new scraper run, it creates a new tag and pushes it as a package, so the dependents can simply update them with their dependency manager.
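For anyone wanting to try the same split-into-JSON-files idea, here is a rough sketch of how it could look. It is not the code from the linked repo: the function name `shard_by_prefix`, the `(ip, country)` record shape, and the file naming are all made up for illustration, and the prefix width is left as a parameter (defaulting to one byte) since the exact split is an assumption.

```python
import ipaddress
import json
from collections import defaultdict
from pathlib import Path


def shard_by_prefix(records, out_dir, prefix_bytes=1):
    """Group (ip, country) pairs by their leading address bytes and
    write one JSON file per group. Hypothetical helper, not the repo's code."""
    shards = defaultdict(dict)
    for ip, country in records:
        # .packed is the raw big-endian address (4 bytes for IPv4, 16 for IPv6)
        packed = ipaddress.ip_address(ip).packed
        key = packed[:prefix_bytes].hex()
        shards[key][ip] = country

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for key, entries in shards.items():
        # One file per prefix, e.g. "08.json" for addresses starting with 8.
        (out / f"{key}.json").write_text(json.dumps(entries, sort_keys=True))
    return sorted(shards)
```

A lookup then only has to load the one small file matching the address's prefix instead of the whole dataset, which is presumably the point of splitting it up.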


Thanks, I'll look into that. Deterring the casual scraper would basically be the goal: make them work enough that it's not worth the hassle compared with the price of legitimate access. A motivated, technical person with a lot of time will always be able to extract data that is shown client-side.

In another life when I was a consultant, I did work for a startup that scraped public data for exactly this sort of use.

They still exist and use nonpublic sources now, but built a nice subscription business scraping lis pendens notices and such. No reason someone else can't do the same.

