Might actually look into writing a quick script and checking this out. It'd be interesting to look at. One catch: I may only be able to scrape public projects, and private repositories may account for a large share of the real number, since those are typically paid for (except for Student accounts, afaik).
Interesting! I've been writing some private scripts for specific tasks, like tracking an Nvidia video card I wanted to buy. However, many sites are really hostile to scraping these days. I'll give it a try.
Very cool - I assume most of the data is gathered through scraping? Quite a few of the sources you're pulling from don't look like they have APIs available.
I'm interested in hearing more about this. It might not apply to the specific data I have in mind, since that would be scraped, but finding out where the holes are and building proprietary data around them would be worth considering. I'm also interested in diving deeper into alternative data, if you have some solid sources I can brush up on.
Yes, I have had a similar experience. I developed my own Python scraping solution to automate the process. But many clients want the script plus the data and are willing to pay only $15.
I have a couple of similar scrapers as well. One is a private repo that collects visa information from Wikipedia (for Visalogy.com), and another pulls GeoIP information from the MaxMind database (used with their permission).
It downloads the repo, splits the data by the first 8 bytes of the IP address, and saves each group to its own JSON file. On every scraper run, it creates a new tag and pushes it as a package, so dependents can pick up the update through their dependency manager.
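For anyone curious what that kind of split might look like, here is a minimal sketch in Python, not the actual scraper. It assumes IPv4 records and shards on the first octet (the first 8 bits) of each network address, which is only one reading of the "first 8 bytes" above. The record layout, the `network` field, and the function names `shard_by_first_octet` and `write_shards` are all hypothetical.

```python
import ipaddress
import json
from collections import defaultdict
from pathlib import Path


def shard_by_first_octet(records):
    """Group records by the first octet of their IPv4 network address.

    Assumption: each record is a dict with a "network" key in CIDR form.
    Networks with a prefix shorter than /8 span several octets; this
    sketch files them under their starting octet only.
    """
    shards = defaultdict(list)
    for record in records:
        network = ipaddress.ip_network(record["network"], strict=False)
        if network.version != 4:
            continue  # IPv6 is out of scope for this sketch
        first_octet = network.network_address.packed[0]
        shards[first_octet].append(record)
    return dict(shards)


def write_shards(shards, out_dir):
    """Write one JSON file per shard, e.g. out_dir/192.json."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for octet, rows in sorted(shards.items()):
        target = out_path / f"{octet}.json"
        target.write_text(json.dumps(rows, indent=2), encoding="utf-8")
        written.append(target)
    return written


if __name__ == "__main__":
    # Hypothetical sample rows using documentation-only address ranges.
    sample = [
        {"network": "192.0.2.0/24", "label": "example-a"},
        {"network": "192.0.2.128/25", "label": "example-b"},
        {"network": "203.0.113.0/24", "label": "example-c"},
    ]
    for path in write_shards(shard_by_first_octet(sample), "geoip_shards"):
        print(path)
```

Small per-prefix files mean a consumer only has to load the one shard that covers the address it is looking up. Tagging and publishing the package would be a separate step that depends on the package registry in use.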
Thanks, I'll look into that. Deterring the casual scraper would basically be the goal: make them work hard enough that it's not worth the hassle compared to the price of legitimate access. A motivated, technical person with a lot of time will always be able to extract data that is shown client-side.
In another life when I was a consultant, I did work for a startup that scraped public data for exactly this sort of use.
They still exist and use nonpublic sources now, but built a nice subscription business scraping lis pendens notices and such. No reason someone else can't do the same.