Reddit has a new AI training deal to sell user content (www.theverge.com)
33 points by namanyayg | 2024-02-18 14:58:23 | 20 comments




This means Reddit is selling the intellectual contribution of its users, on its face.

If the comments and posts are artwork, or music, would that also be legitimate usage by reddit according to their Terms of Service?

"The user is the product" appears to be true once again, this time at Reddit.


> If the comments and posts are artwork, or music, would that also be legitimate usage by reddit according to their Terms of Service?

I don't see why they wouldn't be.

In any case, I'll be spending this week scrambling my old comments and probably deleting my submissions. Reddit is sufficiently user-hostile at this point that I can find other places to entertain myself online.


All your comments are crawled already, within a few milliseconds of posting; there’s no point in deleting them.

What about the works that their users have stolen and (re)posted without permission?

When you post something on Reddit, you give Reddit a license to do more or less whatever it wants with the thing posted.

https://www.redditinc.com/policies/user-agreement

> When Your Content is created with or submitted to the Services, you grant us a worldwide, royalty-free, perpetual, irrevocable, non-exclusive, transferable, and sublicensable license to use, copy, modify, adapt, prepare derivative works of, distribute, store, perform, and display Your Content


> When Your Content is … submitted to the Services, you grant us a worldwide

From the way this is worded it sounds like if someone else posts (submits) your content you are somehow granting this despite being unaware.


IANAL but I think the "you grant us" part is what is key. If a license is not yours to grant, then that license is not theirs to receive.

The models trained thereon are rotten fruit and will be destroyed under court order. Frankly, a cartel with the power to change its country's laws would be wise to spend a few billion building out a state-of-the-art AI lab and blatantly training its models on everything it can download. Everyone would use it. Would the US actually invade ($country) because hyper-realistic Taylor Swift and Mickey Mouse porn is a tap and swipe away?

"Your Content," not "your content." It is legalese jargon, defined as: "Content created with or submitted to the Services by you or through your Account"

And: "By submitting Your Content to the Services, you represent and warrant that you have all rights, power, and authority necessary to grant the rights to Your Content"

Of course, how many people posting memes on Reddit have the rights to do so? How many know they're in violation of the user agreement in doing so, hah.


Wow, didn’t see that coming…. (lol)

Hopefully, the dataset will be pirated and released into the de facto public domain, since it was always considered by the users to be a commons, not an intellectual slave farm.


What's to pirate? Anyone can crawl reddit now, right?

Nope, not in bulk. They have anti-scraper limits in their API. But we could web scrape it with an AI agent looking for quality training data…

It would be a ToS violation, so I think it's technically still piracy.


People probably just do it the old-fashioned way, with a botnet scraping under the rate limit.

Show HN: I built a botnet to scrape Reddit and why I used OrbitDB and Rust


This just shows why scraping content to use for models has to be free for everyone, or incumbents are gonna pull stuff like this. Open source has to be able to compete with big tech, who can train all their stuff on their users' data, as per their ToS, and with big rights holders.

I seriously doubt that adding Reddit posts is going to increase the output quality of any LLM.

It's been a big part of the training data since the start, as glitch tokens demonstrate.

