
Why not use PRAW? It's a very mature, useful library for the Reddit API.



Reddit has a pretty decent API. PRAW is the most commonly used library for it (in Python), but there's https://github.com/not-an-aardvark/snoowrap if you're set on JS too.

Read the Reddit API documentation, follow the rules, and use Python with PRAW to make requests. Writing a script with BeautifulSoup or similar seems like a headache.

Awesome. Didn't know about the Reddit API wrapper PRAW! Thanks a ton, will give it a try.

One question: how does the locally run Python script feed the scraped data back to the bot's user account app?


Right! I used the official Reddit API. I created an app, got the API credentials, and then used the Python library PRAW to consume the API. https://praw.readthedocs.io/en/latest/

It took me 36 hours to collect the 4M posts. The Reddit API returns results in batches of 100, with a 2-second sleep between batches.
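As a rough sanity check on those numbers, here is a back-of-the-envelope sketch. It assumes one request per batch of 100 and one 2-second pause after every batch; the variable names are illustrative, and the leftover time per batch is presumably request latency and processing, which the post does not break down.

```python
# Figures quoted in the post above.
TOTAL_POSTS = 4_000_000
BATCH_SIZE = 100
SLEEP_SECONDS = 2
TOTAL_HOURS = 36

# Assumption: one request per batch, one sleep after each batch.
batches = TOTAL_POSTS // BATCH_SIZE                  # number of requests
sleep_hours = batches * SLEEP_SECONDS / 3600         # time spent only sleeping
seconds_per_batch = TOTAL_HOURS * 3600 / batches     # average wall time per batch

print(f"{batches} batches")
print(f"{sleep_hours:.1f} hours spent sleeping")
print(f"{seconds_per_batch:.2f} seconds per batch on average")
```

Under those assumptions the sleeps alone account for about 22 of the 36 hours, so the 2-second pause is the main cost rather than the requests themselves.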

You can find some more details on how it was built here https://blog.valohai.com/machine-learning-pipeline-classifyi...

I can publish the repository on GitHub if you are interested. It runs two commands to collect the data.


Probably with the official Reddit API? There are several libraries for it.

Why not just port to the Reddit codebase? The functionality seems similar.

The base Reddit API is pretty good. Just tack a .json onto any URL for an example.
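A minimal sketch of the ".json" trick, using only the standard library. The helper name `to_json_url` is made up for illustration, and the code only builds the URL; it does not fetch anything, so rate limits and User-Agent requirements are out of scope here.

```python
from urllib.parse import urlsplit, urlunsplit


def to_json_url(url: str) -> str:
    """Return the URL with '.json' appended to its path (hypothetical helper)."""
    parts = urlsplit(url)
    # Drop a trailing slash so "/r/python/" becomes "/r/python.json".
    path = parts.path.rstrip("/")
    if not path.endswith(".json"):
        path += ".json"
    # Keep the query string and fragment where they were.
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


print(to_json_url("https://www.reddit.com/r/python/"))
print(to_json_url("https://www.reddit.com/r/python/top?t=week"))
```

The query string is kept after the new suffix, so sort and time filters on the original URL should carry over.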

Reddit's API is free for my use cases. From their documentation: "Our API allows free access to moderators and developers creating these tools for non-commercial use cases."

Why not use libredd.it? It's a privacy friendly frontend for Reddit without any JS. You can host it yourself or run it off a docker container.

Isn't that the non-public internal reddit API? They have proper APIs for other things, I think.

This is such a cool idea! Does it scrape reddit, or use the API?

> Additionally, it's trivial to stream and analyze the entirety of Reddit comments and posts in real time (check out PRAW if you like Python).

Chunks of it. I don't think PRAW and the API can do all of it live.


I thought Reddit was rewritten in Rails, but after some research I realized that I was wrong. It's Python.

I'd use the reddit/HN codebase, or a reddit/HN clone.

Why not just use the reddit codebase?

Reddit has an API and many clients.

Not the worst idea... Build an API that sits as a middleman to Reddit's API. It's paid for, but it caches requests to cut costs.
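The caching middleman could look roughly like this. It is a sketch under assumptions: `CachingProxy`, its `fetch` callback, and the 60-second TTL are all invented for illustration, and a real service would also need eviction, error handling, and a check against Reddit's API terms.

```python
import time
from typing import Callable, Dict, Tuple


class CachingProxy:
    """Serve repeated requests from memory for ttl_seconds (hypothetical design)."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        ttl_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch            # the call that actually hits the upstream API
        self._ttl = ttl_seconds
        self._clock = clock            # injectable so the expiry logic can be tested
        self._cache: Dict[str, Tuple[float, str]] = {}
        self.upstream_calls = 0        # how many requests were actually paid for

    def get(self, path: str) -> str:
        now = self._clock()
        hit = self._cache.get(path)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]              # still fresh: no upstream request
        self.upstream_calls += 1
        body = self._fetch(path)
        self._cache[path] = (now, body)
        return body


# Usage with a stand-in fetcher instead of a real HTTP call.
proxy = CachingProxy(lambda path: f"response for {path}")
proxy.get("/r/python/hot")
proxy.get("/r/python/hot")
print(proxy.upstream_calls)
```

The savings depend entirely on how often different users ask for the same path within the TTL; per-user endpoints would see little benefit.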

Possibly use some amount of web scraping too


Reddit's APIs are not user-based from what I can tell.

Reddit has a public API; that's how all these unofficial clients work. I don't know for sure whether this site is using it. (It probably is.)
