Just append .json to any Reddit URL and you'll get a full dump of that page. We'll see if they get rid of this feature as well. Way easier than scraping.
The JSON endpoints are pretty neat: you can take any Reddit page, be it a subreddit, submission comments, or a user profile, append `/.json` to the URL, and tada, JSON data. Although, like you said, I'm pretty sure they will axe it at some point, and I'm honestly surprised they haven't already.
I can imagine, however, that old.reddit.com scraping/parsing layers will pop up left and right as soon as that happens (and those in turn could be the final nail in the coffin for old.reddit.com).
You can also tack ".json" onto the end of most Reddit URLs to get the page in JSON format, which is much easier to work with when scraping content. It's hardly documented anywhere online, but it's interesting that they offer both.
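For anyone who wants to do this in a script rather than by hand, here is a minimal sketch of the URL rewrite. The helper name `to_json_url` is made up for illustration, and the rule it uses (strip a trailing slash, then add `.json`, or use `/.json` for the front page) is an assumption based on the URLs people have posted in this thread, not on any official documentation.

```python
from urllib.parse import urlsplit, urlunsplit


def to_json_url(url: str) -> str:
    """Return the .json variant of a Reddit page URL (hypothetical helper).

    Query string and fragment are preserved; a URL that already ends
    in .json is returned unchanged.
    """
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    if path.endswith(".json"):
        return url
    # The front page has an empty path, so it becomes "/.json";
    # every other page just gets ".json" appended.
    path = "/.json" if path == "" else path + ".json"
    return urlunsplit(parts._replace(path=path))
```

With this, `to_json_url("https://www.reddit.com/r/funny/")` gives `https://www.reddit.com/r/funny.json`. Whether a given page actually answers with JSON is up to Reddit, so treat the result as something to try, not a guarantee.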
Reddit used to be really easy to scrape. I haven't really had the need to do it after the API changes drama, but the trick of appending `.json` to the URL apparently still works, example:
Reddit is not that bad with regards to scraping. You can append `.json` to any URL and it will return a JSON representation of this page: https://www.reddit.com/.json
As for API tokens, that's unfortunately just the current trend of basically every other site with user-generated content. It's unfair to single out Reddit in particular when nearly everyone else does this too.
For both Reddit and HN there are complete dumps that you can download easily in much less than a day. I have worked with both: the Reddit dump from Pushshift is quite big (https://files.pushshift.io/reddit/, several TB?). A complete scrape of HN from the API is much, much smaller, around a few GB if I remember correctly.
But of course this is still bullshit and doesn't really use that data.
Not particularly. The one thing that helps, but that most people don't know about Reddit, is that adding .json to the end of a URL displays the content of that page in JSON format.
For example:
reddit.com/r/funny.json
This makes crawling/fetching content from Reddit much simpler than old-school web crawling.
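To make that concrete, here is a rough sketch of fetching such a page and pulling out post titles using only the Python standard library. The function names and the User-Agent string are made up for illustration. The payload shape it relies on (`data` → `children` → `data` → `title`, with `kind` equal to `t3` for posts) is an assumption about how these listings are usually structured and may change without notice.

```python
import json
import urllib.request


def fetch_listing(url: str, user_agent: str = "example-script/0.1") -> dict:
    """Download a .json page and decode it (hypothetical helper).

    A custom User-Agent is sent because default library agents are
    often throttled; the value used here is just a placeholder.
    """
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


def post_titles(listing: dict) -> list:
    """Extract post titles from a listing payload.

    Assumes the usual shape: listing["data"]["children"] is a list of
    items, and posts are the items whose "kind" is "t3".
    """
    children = listing.get("data", {}).get("children", [])
    return [c["data"]["title"] for c in children if c.get("kind") == "t3"]
```

Usage would be something like `post_titles(fetch_listing("https://www.reddit.com/r/funny.json"))`. Expect rate limiting or outright blocking if you call it often.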
To be fair, Reddit is actively trying to prevent bots. How do I know? I scrape Reddit threads directly via old.reddit.com URLs and even the most sophisticated scraping tools like BrightData, Undetected Playwright (and Puppeteer), and others just don't work on Reddit threads anymore as of a few months ago.
I now have to use .json at the end of the URL to get the content, but I suspect that'll stop working at some point.
Not to worry, you can still search google/etc., which scrapes Reddit, or Reddit itself through its native search. No need to use an API explicitly to access content like that.
Given Reddit's inability to keep their website functioning (unless you use the far superior old.reddit.com) I find it hard to believe they would be able to stop a motivated developer from scraping the whole site.