
Just append .json to any Reddit URL and you'll get a full dump of that page. We'll see if they get rid of this feature as well. Way easier than scraping.



The JSON endpoints are pretty neat, you can just take any reddit page, be it subreddit, submission comments or username, and append `/.json` to the URL and tadaa, JSON data. Although like you said, I'm pretty sure they will axe it at some point, and I'm honestly surprised that they didn't already.
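For the record, a minimal sketch of that URL transformation (plain Python, nothing assumed beyond the `.json` suffix itself):

```python
# Turn a Reddit page URL into its .json endpoint.
# Handles a trailing slash and an existing query string.
def to_json_url(url: str) -> str:
    base, sep, query = url.partition("?")
    base = base.rstrip("/")
    return base + ".json" + sep + query

print(to_json_url("https://www.reddit.com/r/programming/"))
# https://www.reddit.com/r/programming.json
```

In my experience you also want to send a descriptive User-Agent header when fetching these URLs; Reddit tends to aggressively rate-limit the default ones.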

I can imagine however, that old.reddit.com scraping/parsing layers will pop up left and right as soon as that happens (and those in turn could be the final nail in the coffin for old.reddit.com)


You can also tack ".json" onto the end of most Reddit URLs to get the page in JSON format, which is much easier to work with when scraping content. It's hardly documented anywhere online, but it's interesting that they offer both formats.

Reddit used to be really easy to scrape. I haven't really had the need to do it after the API changes drama, but the trick of appending `.json` to the URL apparently still works, example:

https://www.reddit.com/r/TheWire/comments/1aqvtoz/every_year...

Not sure if this only works on posts or also entire subreddits.
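For what it's worth, a comments-page `.json` response comes back as a two-element array: the first element is a Listing wrapping the submission itself, the second a Listing of comments. A hedged sketch of pulling data out of the already-parsed payload (field names match the responses I've seen; verify against a live response):

```python
# `payload` is the parsed JSON from a /comments/... .json URL,
# e.g. json.loads(response_body).
def post_title(payload):
    # payload[0] is a Listing whose only child is the submission.
    return payload[0]["data"]["children"][0]["data"]["title"]

def top_level_comments(payload):
    # payload[1] is a Listing of comments; kind "t1" is a comment,
    # kind "more" is a stub for collapsed/unloaded replies.
    return [c["data"].get("body", "")
            for c in payload[1]["data"]["children"]
            if c.get("kind") == "t1"]
```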


Yeah no need to scrape Reddit, their content is accessible via their API.

but it's not really necessary for Reddit. their API is fairly robust and there are numerous options for scraping the site.

Reddit is not that bad with regards to scraping. You can append `.json` to any URL and it will return a JSON representation of that page: https://www.reddit.com/.json
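A sketch of reading that Listing shape once you've parsed the JSON (the `children`/`after` fields are what the endpoint returns for subreddit and front-page listings, as far as I've seen):

```python
# `listing` is the parsed JSON of e.g. https://www.reddit.com/.json
def listing_posts(listing):
    """Return (title, permalink) pairs from a parsed Listing dict."""
    return [(p["data"]["title"], p["data"]["permalink"])
            for p in listing["data"]["children"]]

def next_page_token(listing):
    """The `after` fullname used to request the next page, or None."""
    return listing["data"].get("after")
```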

As for API tokens, that's unfortunately just the current trend of basically every other site with user-generated content. It's unfair to single out Reddit in particular when nearly everyone else does this too.


The reddit api was never very good. It's easier just to scrape the site.

For both Reddit and HN there are complete dumps that you can download in well under a day. I have worked with both: the Reddit dump from pushshift is quite big (https://files.pushshift.io/reddit/, several TB?), while scraping HN completely from its API yields a much, much smaller dataset, around a few GB if I remember correctly.
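The pushshift files are zstandard-compressed newline-delimited JSON, one submission or comment object per line. Once decompressed (e.g. with the `zstd` CLI), you can stream-filter the lines; a sketch, assuming the usual `subreddit` field:

```python
import json

def iter_matching(lines, subreddit):
    """Yield parsed objects from NDJSON lines for one subreddit."""
    for line in lines:
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue  # the dumps occasionally contain truncated lines
        if obj.get("subreddit") == subreddit:
            yield obj
```

Streaming line by line matters here: at several TB, loading a dump file into memory isn't an option.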

But of course this is still bullshit and doesn't really use that data.


You can just add a .json at the end of any url on Reddit to get a JSON object of the page.

https://www.reddit.com/r/todayilearned/comments/3zfadv/til_t...

Maybe this will help?


Not particularly. One thing that helps, which most people don't know about Reddit, is that adding .json to the end of any URL returns the content of that page in JSON format.

for example: reddit.com/r/funny.json

This makes crawling/fetching content from Reddit much simpler than old-school web crawling.
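These listings also paginate: as far as I know, `limit` (up to 100) and `after` (the fullname of the last item you saw) let you walk a whole subreddit page by page. A sketch of building those page URLs:

```python
from urllib.parse import urlencode

def page_url(subreddit, after=None, limit=100):
    """Build the .json URL for one page of a subreddit listing."""
    params = {"limit": limit}
    if after:
        params["after"] = after  # fullname like "t3_abc123"
    return f"https://www.reddit.com/r/{subreddit}.json?{urlencode(params)}"
```

Each response carries an `after` token in its `data` object; feed it back into the next call until it comes back as null.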


To be fair, Reddit is actively trying to prevent bots. How do I know? I scrape Reddit threads directly via old.reddit.com URLs and even the most sophisticated scraping tools like BrightData, Undetected Playwright (and Puppeteer), and others just don't work on Reddit threads anymore as of a few months ago.

I now have to use .json at the end of the URL to get the content, but I suspect that'll stop working at some point.


This doesn't address your question directly, but are you aware that Reddit provides an API? Why not use it instead of scraping?

You can also use .json (for almost every Reddit URL, actually!)

Not to worry: you can still search Google and the like (which scrape Reddit), or search Reddit itself through its native search. No need to use an API explicitly to access content like that.

This is such a cool idea! Does it scrape reddit, or use the API?

As far as I'm aware, Reddit still allows you to append .json to any of their pages and you get the results as a nicely formatted json document.

No LLM required.


Errr... why not just use their official tool?

https://www.reddit.com/settings/data-request

I sent in a request and a few days later had a bunch of .csv files containing everything I've done with my account.

OK, CSV isn't JSON - but it's pretty easy to parse or import into a database of your choice.
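A sketch of that CSV-to-database step using only the standard library. The code takes its columns straight from the CSV header row, so it should work regardless of what headers the export actually contains (the `id`/`body` names below are illustrative):

```python
import csv
import io
import sqlite3

def csv_to_sqlite(csv_text, conn, table="comments"):
    """Load one exported CSV into a SQLite table; returns row count."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return 0
    cols = list(rows[0])  # column names come from the CSV header
    # Note: table/column names are interpolated directly, so only use
    # this with files you trust (it's a sketch, not hardened code).
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({', '.join(cols)})")
    conn.executemany(
        f"INSERT INTO {table} VALUES ({', '.join('?' * len(cols))})",
        [tuple(r[c] for c in cols) for r in rows],
    )
    return len(rows)
```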


Given Reddit's inability to keep their website functioning (unless you use the far superior old.reddit.com) I find it hard to believe they would be able to stop a motivated developer from scraping the whole site.

It would be great for someone to scrape Reddit and expose that information in a format compatible with the official API.

So if you call /get-comments/1234 it scrapes post 1234 and returns the JSON object exactly as the official API does.

Then third party clients can just point to this endpoint.
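A minimal sketch of that routing idea. `/get-comments/<id>` is the hypothetical endpoint from the comment above, not a real route, and the scrape target uses the public `.json` trick rather than the official API:

```python
import re

def resolve(path):
    """Map a hypothetical /get-comments/<id> request path to the
    public Reddit .json URL that the proxy would scrape."""
    m = re.fullmatch(r"/get-comments/(\w+)", path)
    if not m:
        return None  # unknown route
    return f"https://www.reddit.com/comments/{m.group(1)}.json"
```

A real proxy would then have to reshape the scraped response into whatever the official API returns, which is the hard part of the proposal.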

