Hacker Read

eulers_secret · 2023-06-09 17:44:53

Just append .json to any Reddit URL and you'll get a full dump of that page. We'll see if they get rid of this feature as well. Way easier than scraping.

lvncelot | karma 1460 | avg karma 5.84 · | 2023-07-12 02:06:00

The JSON endpoints are pretty neat, you can just take any reddit page, be it subreddit, submission comments or username, and append `/.json` to the URL and tadaa, JSON data. Although like you said, I'm pretty sure they will axe it at some point, and I'm honestly surprised that they didn't already.

I can imagine however, that old.reddit.com scraping/parsing layers will pop up left and right as soon as that happens (and those in turn could be the final nail in the coffin for old.reddit.com)
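
A minimal Python sketch of the URL trick described above, assuming the endpoint keeps behaving the way this thread says it does. `to_json_url` and `fetch_json` are hypothetical helper names, and the User-Agent string is a placeholder, not a value Reddit prescribes.

```python
import json
import urllib.request
from urllib.parse import urlsplit, urlunsplit


def to_json_url(page_url: str) -> str:
    """Return page_url with ".json" appended to its path (query string kept)."""
    parts = urlsplit(page_url)
    path = parts.path.rstrip("/")
    if not path.endswith(".json"):
        # An empty path is the front page, which becomes "/.json".
        path += ".json" if path else "/.json"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


def fetch_json(page_url: str, user_agent: str = "example-script/0.1"):
    """Fetch and decode the JSON view of a page. Performs a network request."""
    request = urllib.request.Request(
        to_json_url(page_url),
        headers={"User-Agent": user_agent},  # placeholder value, an assumption
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)
```

Only `fetch_json` touches the network. Whether a given page still answers, and how strictly requests are rate limited, is something to check rather than assume.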

protonscientist | karma 9 | avg karma 2.25 · | 2022-12-31 22:00:50

You can also tack ".json" onto the end of most Reddit URLs to get the page in JSON format, which is much easier to work with when scraping content. It's hardly documented anywhere online, but it's interesting that they offer both.

ailef | karma 784 | avg karma 3.21 · | 2024-03-11 06:42:55

Reddit used to be really easy to scrape. I haven't really had the need to do it since the API changes drama, but the trick of appending `.json` to the URL apparently still works. Example:

https://www.reddit.com/r/TheWire/comments/1aqvtoz/every_year...

Not sure if this only works on posts or also entire subreddits.

diminoten | karma 1949 | avg karma 0.55 · | 2019-10-13 22:17:15+00:00

Yeah no need to scrape Reddit, their content is accessible via their API.

bshipp | karma 1738 | avg karma 4.26 · | 2021-12-16 18:39:26

But it's not really necessary for Reddit. Their API is fairly robust and there are numerous options for scraping the site.

selfhoster11 | karma 5178 | avg karma 1.98 · | 2021-04-29 11:44:52+00:00

Reddit is not that bad with regards to scraping. You can append `.json` to any URL and it will return a JSON representation of this page: https://www.reddit.com/.json

As for API tokens, that's unfortunately just the current trend of basically every other site with user-generated content. It's unfair to single out Reddit in particular when nearly everyone else does this too.

ok123456 | karma 2762 | avg karma 2.22 · | 2023-04-19 10:47:48

The reddit api was never very good. It's easier just to scrape the site.

tomthe | karma 1202 | avg karma 11.56 · | 2022-10-21 05:22:39

For both Reddit and HN there are complete dumps that you can download easily in much less than a day. I have worked with both; the Reddit dump from Pushshift is quite big (https://files.pushshift.io/reddit/ several TB?). A complete scrape of HN from the API is much, much smaller, around a few GB if I remember correctly.

But of course this is still bullshit and doesn't really use that data.

kauegimenes | karma 284 | avg karma 3.89 · | 2016-01-05 19:16:52+00:00

You can just add a .json at the end of any url on Reddit to get a JSON object of the page.

https://www.reddit.com/r/todayilearned/comments/3zfadv/til_t...

Maybe this will help?

rezashirazian | karma 1033 | avg karma 3.12 · | 2016-09-16 15:03:59+00:00

Not particularly. The one thing that helps, but that most people don't know about Reddit, is that adding .json to the end of a URL displays the content of that page in JSON format.

for example: reddit.com/r/funny.json

This makes crawling/fetching content from Reddit much more trivial than old-school web crawling.
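
As a rough Python illustration of why the JSON view is easier than HTML crawling, here is a sketch that pulls post titles out of a listing. It assumes the payload has the usual listing shape (`data` -> `children`, with posts tagged as kind `t3`). `post_titles` is a hypothetical helper and `sample_listing` is hand-written, not real Reddit data.

```python
from typing import Any


def post_titles(listing: dict[str, Any]) -> list[str]:
    """Collect the title of every post ("t3" item) in a listing payload."""
    children = listing.get("data", {}).get("children", [])
    return [
        child["data"]["title"]
        for child in children
        if child.get("kind") == "t3" and "title" in child.get("data", {})
    ]


# Hand-written stand-in for the JSON a subreddit page returns; not real data.
sample_listing = {
    "kind": "Listing",
    "data": {
        "after": None,
        "children": [
            {"kind": "t3", "data": {"title": "First post", "score": 10}},
            {"kind": "t3", "data": {"title": "Second post", "score": 3}},
        ],
    },
}

print(post_titles(sample_listing))  # ['First post', 'Second post']
```

No HTML parsing or CSS selectors are involved, which is the whole appeal: the fields are already named.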

65 | karma 1627 | avg karma 5.57 · | 2024-04-30 15:23:06

To be fair, Reddit is actively trying to prevent bots. How do I know? I scrape Reddit threads directly via old.reddit.com URLs and even the most sophisticated scraping tools like BrightData, Undetected Playwright (and Puppeteer), and others just don't work on Reddit threads anymore as of a few months ago.

I now have to use .json at the end of the URL to get the content, but I suspect that'll stop working at some point.

nandemo | karma 4159 | avg karma 2.37 · | 2017-06-07 08:09:40+00:00

This doesn't address your question directly, but are you aware that Reddit provides an API? Why not use it instead of scraping?

cballard | karma 1147 | avg karma 3.98 · | 2016-03-28 22:14:55

You can also use .json (for almost every Reddit URL, actually!)

constantly | karma 1089 | avg karma 3.57 · | 2023-06-14 13:13:51

Not to worry, you can still search google/etc., which scrapes Reddit, or Reddit itself through its native search. No need to use an API explicitly to access content like that.

gitgud | karma 5259 | avg karma 2.23 · | 2018-08-21 03:26:27

This is such a cool idea! Does it scrape reddit, or use the API?

boolemancer | karma 222 | avg karma 2.77 · | 2023-06-14 12:29:20

As far as I'm aware, Reddit still allows you to append .json to any of their pages and you get the results as a nicely formatted json document.

No LLM required.

edent | karma 28585 | avg karma 9.66 · | 2023-06-09 11:40:57

Errr... why not just use their official tool?

https://www.reddit.com/settings/data-request

I sent in a request and a few days later had a bunch of .csv files containing everything I've done with my account.

OK, CSV isn't JSON - but it's pretty easy to parse or import into a database of your choice.
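
For the "import into a database" part, a small Python sketch using only the standard library. The column names below are made up for illustration, since the actual layout of the export files isn't described here, and `load_csv` is a hypothetical helper.

```python
import csv
import io
import sqlite3


def load_csv(conn: sqlite3.Connection, table: str, csv_file) -> int:
    """Create `table` from the CSV header row and insert every data row."""
    reader = csv.reader(csv_file)
    header = next(reader)
    # Quote identifiers so unusual column names do not break the SQL.
    columns = ", ".join('"' + name.replace('"', '""') + '"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    quoted_table = '"' + table.replace('"', '""') + '"'
    conn.execute(f"CREATE TABLE {quoted_table} ({columns})")
    # Skip malformed rows whose field count does not match the header.
    rows = [row for row in reader if len(row) == len(header)]
    conn.executemany(f"INSERT INTO {quoted_table} VALUES ({placeholders})", rows)
    conn.commit()
    return len(rows)


# Made-up two-column export; the real files and column names may differ.
example = io.StringIO("id,body\nabc,hello\ndef,world\n")
conn = sqlite3.connect(":memory:")
print(load_csv(conn, "comments", example))  # 2
```

With a real file, open it with `open(path, newline="")` and pass the file object in place of the `StringIO`.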

sebzim4500 | karma 5679 | avg karma 2.5 · | 2023-05-10 14:59:48

Given Reddit's inability to keep their website functioning (unless you use the far superior old.reddit.com) I find it hard to believe they would be able to stop a motivated developer from scraping the whole site.

KMnO4 | karma 3772 | avg karma 4.51 · | 2023-05-31 18:58:12

It would be great for someone to scrape Reddit and expose that information in a format compatible with the official API.

So if you call /get-comments/1234 it scrapes post 1234 and returns the JSON object exactly as the official API does.

Then third party clients can just point to this endpoint.
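
A very rough Python sketch of the scraping half of that idea. It assumes `/comments/<id>` resolves to a post's comment page, that comments come back as kind `t1` items, and that a comment's `replies` field is a nested listing when present. The route and both function names are hypothetical, and it makes no attempt to reproduce the official API's response format.

```python
import re
from typing import Any

ROUTE = re.compile(r"^/get-comments/([A-Za-z0-9]+)$")


def upstream_url(route: str) -> str:
    """Map the hypothetical /get-comments/<id> route to a .json page URL."""
    match = ROUTE.match(route)
    if match is None:
        raise ValueError(f"unsupported route: {route!r}")
    # Assumption: /comments/<id> resolves to the post's comment page.
    return f"https://www.reddit.com/comments/{match.group(1)}/.json"


def flatten_comments(listing: dict[str, Any]) -> list[dict[str, Any]]:
    """Walk a comment listing depth-first and return author/body pairs."""
    flat = []
    for child in listing.get("data", {}).get("children", []):
        if child.get("kind") != "t1":  # skip anything that is not a comment
            continue
        data = child["data"]
        flat.append({"author": data.get("author"), "body": data.get("body")})
        replies = data.get("replies")
        if isinstance(replies, dict):  # assumed non-dict when there are no replies
            flat.extend(flatten_comments(replies))
    return flat
```

A real shim would still need to fetch the upstream URL, reshape the result to match what clients expect, and cope with the endpoint changing or disappearing.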
