
I crawled Reddit across several topics.

It's supported through their API.




Probably with the official Reddit API? There are several libraries for it.

Yeah, no need to scrape Reddit; their content is accessible via their API.

This is such a cool idea! Does it scrape reddit, or use the API?

Isn't that the non-public internal reddit API? They have proper APIs for other things, I think.

Data is available for at least the first three of those (reddit isn't "officially" available but there are 3rd party dumps).

Why not load it into OpenSearch or some such?


Not to worry, you can still search google/etc., which scrapes Reddit, or Reddit itself through its native search. No need to use an API explicitly to access content like that.

Reddit's APIs are not user-based from what I can tell.

But it's not really necessary for Reddit. Their API is fairly robust, and there are numerous options for scraping the site.

Reddit has a public API; that's how all these unofficial clients work. I don't know for sure whether this site is using it. (It probably is.)

By the way, isn't there an open-source desktop-native client for reading (and searching through) Reddit?

This doesn't address your question directly, but are you aware that Reddit provides an API? Why not use it instead of scraping?

Reddit is a website. Just make normal browser requests. You don't have to use their "sanctioned" "API."

Reddit has a JSON API you can access with your credentials.

Not particularly. The one thing that helps, and that most people don't know about Reddit, is that adding .json to the end of a URL returns that page's content in JSON format.

For example: reddit.com/r/funny.json

This makes crawling or fetching content from Reddit much simpler than old-school web crawling.
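To make the .json trick concrete, here is a minimal Python sketch. The helper names (listing_url, extract_titles, fetch_listing) and the User-Agent string are made up for illustration, and the payload shape (data, then children, then each child's data.title) is an assumption about how listing pages are laid out, so check a real response before relying on it.

```python
import json
import urllib.request


def listing_url(path: str) -> str:
    """Turn a page path such as 'r/funny' into its .json URL."""
    return "https://www.reddit.com/" + path.strip("/") + ".json"


def extract_titles(listing: dict) -> list:
    """Collect post titles from a listing payload.

    Assumed shape: {"data": {"children": [{"data": {"title": ...}}, ...]}}
    """
    return [child["data"]["title"] for child in listing["data"]["children"]]


def fetch_listing(path: str, user_agent: str = "example-script/0.1") -> dict:
    """Fetch and decode one listing page (hypothetical helper, needs network)."""
    request = urllib.request.Request(
        listing_url(path),
        headers={"User-Agent": user_agent},  # identify the script to the server
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


# Usage (performs a network request, so it is left commented out):
# print(extract_titles(fetch_listing("r/funny")))
```

Keeping the URL building and parsing separate from the fetch means the parsing can be checked against a saved response without hitting the site.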


They are fine with it as long as you abide by their terms. They have a subreddit dedicated to Reddit development and the Reddit API, which includes discussion of scraping: http://www.reddit.com/r/redditdev

There's an open-source Reddit client out there; if YC provided some kind of API (you don't want to do page scraping on the iPhone), it would probably be pretty easy to adapt it.

I built redditsearch.io -- It uses the pushshift API for the back-end. It was thrown together and barely works but hey, I'm just one guy maintaining this as a labor of love. :)

Why not use PRAW? It's a very mature, useful library built on the Reddit API.

The reddit api was never very good. It's easier just to scrape the site.
