Verbosity isn't the same thing as precision. The judgement is both vague and inconsistent, as GDPR-related rulings always are. They don't want OpenAI using personal data for training, which it doesn't do anyway, unless they mean the entire original training set, which (as they themselves note) they can't prove contains personal data of Italians; the ruling is too vague to tell which they mean. Yet at the same time they are banning the service for not collecting enough personal data.
This is a really interesting thread to read as a lawyer who has lately been spending his days considering how existing laws apply to AI training and operation.
I'll start by describing my understanding of the allegation, and then respond to some of the points raised.
I don't read Italian, but the English-language statement by the Garante (Italy's data protection authority) is here [0]. It appears to allege that both the original training (creation of the model) and the ongoing improvement via customer inputs were done without a "legal basis." GDPR requires companies to have a legal basis for processing personal data before they do so, whether that is consent, a legitimate interest (a very vague standard), or a task carried out in the public interest.
Italy isn't going into details; it's just saying it's banning the service while it investigates.
Some interesting questions raised in the thread: does the model actually contain personal data? Probably yes. Even if the data is converted to integers during training, those integers can be mapped straight back to the original text, so GDPR counts the data as "pseudonymized" rather than "anonymized," and it therefore remains subject to GDPR.
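To make the pseudonymization point concrete, here's a toy sketch (assuming the open-source tiktoken package, OpenAI's tokenizer library, is installed; the sample string is invented):

```python
# Tokenization maps text to integers, but the mapping is fully reversible,
# which is why the result counts as "pseudonymized" rather than "anonymized."
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-3.5/4-era models

text = "Mario Rossi, born 1985, lives in Milan."  # invented personal data
ids = enc.encode(text)
print(ids)              # a list of plain integers

# The original text, personal data included, comes straight back:
print(enc.decode(ids))  # -> "Mario Rossi, born 1985, lives in Milan."
```

Whether the trained weights themselves still encode such data is a harder question, but the tokenized corpus fed into training plainly does.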
Did OpenAI actually not get consent? It seems to me they did obtain consent for using user inputs for training, since their terms of service plainly state they will use inputs as training data. But GDPR tends to demand a higher standard of consent (explicit opt-ins such as checkboxes), and as far as I know OpenAI didn't use checkboxes.
I think this is pretty interesting to watch play out.
OpenAI is trying to argue that line, but given that most of OpenAI's models are trained on scraped data they had no legal right to beyond existing court precedents on scraping public data, they don't have much of a leg to stand on.
> Due to our concerns about malicious applications of the technology, we are not releasing the trained model.
Doesn't that go against the mission of OpenAI? I thought they were about making technology publicly accessible to everyone so that it can't be abused by only a few people. This makes them seem more like a business with proprietary data.
For that matter, I don't see how "OpenAI" could even try to legally enforce its terms against competitors training their models on the output of "OpenAI" models… at least not without being laughed out of the courtroom at best, or ending up paying far more itself at worst, given how blatantly "OpenAI" disregards any licenses on the content and data it uses to train its models.
> Can you point me to where OpenAI admitted they illegally acquired copies of works for training data?
Yes. It's on the first page of the linked PDF:
> For this response, we draw on our experience in developing cutting-edge technical AI systems, including by the use of large, publicly available datasets that include copyrighted works.
They then go on to claim that this is just how everyone does it, as though that makes it okay, including making copies of the copyrighted works:
> Modern AI systems require large amounts of data. For certain tasks, that data is derived from existing publicly accessible “corpora” (singular: “corpus”) of data that include copyrighted works. By analyzing large corpora (which necessarily involves first making copies of the data to be analyzed)...
> at this point you’re being deliberately obtuse
No, I just suspect we fundamentally disagree about the value and importance of many technologies.
Or maybe there just isn't evidence they've used it? Whereas for Google, we do have evidence they used that data.
I mean, I'm all for these companies disclosing their training data. But simply assuming OpenAI isn't being fined because of their site is (unless you know more) pure speculation. The same goes for the sibling comment suggesting Mistral should also be looked at.
Additionally, it's wild that they think their content hadn't already been scraped and trained on.
They were fine with humans training on this data and selling their expertise to employers, but they're not fine with a higher-order way of consuming their contribution, and are removing it so that smaller open-source competitors to OpenAI can't train on the same data OpenAI trained on.
The thing that bothers me about the whole situation is that OpenAI prohibits using its model output as training data for your own models.
It seems more than a bit hypocritical, no? When it comes to their own training data, they claim to have the right to use any/all of humanity’s intellectual output. But for your own training data, you can use everything except for their product, conveniently for them.
In fact, neither Facebook nor OpenAI can train their models without asking permission. Just wait for someone to start raising this concern. The EU is working on regulating these kinds of aspects; for example, this is not at all compliant with the GDPR (unless you train only on data that contains no personal data, which is rarer than you would think).
One challenge is that to build a large enough custom dataset you need either a small army of annotators or a very strong existing model, which in practice means you probably have to use OpenAI. And using OpenAI to generate training material for another model violates their terms.
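For context, the workflow being described looks roughly like this; a minimal sketch using the official openai Python package, where the model name, seed topics, prompt, and output file are all illustrative assumptions:

```python
# Sketch: generating synthetic training examples with a strong existing model.
# This is the pattern OpenAI's terms prohibit using to train a competing model.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

seed_topics = ["binary search", "GDPR consent", "tokenization"]  # illustrative seeds

with open("synthetic_train.jsonl", "w") as f:
    for topic in seed_topics:
        resp = client.chat.completions.create(
            model="gpt-4",  # the "very strong existing model"
            messages=[{
                "role": "user",
                "content": f"Write a question and a detailed answer about {topic}.",
            }],
        )
        # Each generated pair becomes one training example for your own model.
        f.write(json.dumps({"topic": topic,
                            "text": resp.choices[0].message.content}) + "\n")
```

Scale the seed list up a few orders of magnitude and you have the kind of instruction-tuning corpus that would otherwise require that small army.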
Has anyone taken them to court about this? Do we all just decide it's not fair and ignore it?
No, I'm not talking about the training data. I'm talking about restrictions put in place by their creators: topics that give me warnings about violating their terms of service.
I guess it is in OpenAI's best interest to downplay the memorization aspect in favor of the logical reasoning angle. If it turns out that GPT is memorizing and reproducing copyrighted data, it could land them in legal trouble.
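Memorization is straightforward to probe, for what it's worth: feed the model the opening of a known passage and measure how closely its continuation matches the real text. A hedged sketch follows; generate() is a stand-in for whatever model API you're testing, and the 0.9 threshold is an arbitrary illustration:

```python
# Probe for verbatim memorization: prompt with a known prefix, then compare
# the model's continuation against the true continuation of the source text.
import difflib

def memorization_score(generate, prefix: str, true_continuation: str) -> float:
    """generate() is assumed to be any callable mapping a prompt to model text."""
    output = generate(prefix)[: len(true_continuation)]
    return difflib.SequenceMatcher(None, output, true_continuation).ratio()

# Hypothetical usage (Dickens is public domain; a copyrighted novel is the
# interesting case, since the model shouldn't reproduce it unless memorized):
# score = memorization_score(my_model, "It was the best of times, ",
#                            "it was the worst of times, it was the age of wisdom")
# if score > 0.9:  # arbitrary cutoff for "near-verbatim"
#     print("likely memorized")
```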
Isn't there sufficient evidence to conclude that OpenAI is training on things they know they probably shouldn't be training on, like the full text of novels that are not publicly available on the internet? I don't think it's that unreasonable to extrapolate that they would blur the lines of legal agreements to get access to private user data they shouldn't have.
"Yes, we train our models on a good chunk of the internet without asking permission, but don't you dare train on our models' output without our permission!"