Verbosity isn't the same thing as precision. The judgement is both vague and inconsistent, as GDPR-related rulings always are. They don't want OpenAI using personal data for training, which it doesn't do anyway, unless they mean the entire original training set, which (as they themselves note) they can't prove contains personal data of Italians; the ruling is too vague to tell which they mean. Yet at the same time they are banning the service for not collecting enough personal data.
This is a really interesting thread to read as a lawyer who has lately been spending his days considering how existing laws apply to AI training and operation.
I'll start by describing my understanding of the allegation, and then respond to some of the points raised.
I don't read Italian, but the English-language statement by the Garante (Italy's data protection authority) is here [0]. It appears to allege that both the original training (creation of the model) and the ongoing improvement via customer inputs were done without a "legal basis." GDPR requires companies to have a legal basis for processing personal data before they do so, whether that is consent, a legitimate interest (a very vague standard), or a task carried out in the public interest.
Italy isn't going into details; it's just saying it's banning the service while it investigates.
Some interesting questions raised in the thread: does the model actually contain personal data? Probably yes. Even if the data is converted to integers during training, those integers can be mapped straight back to the original text, so GDPR counts the data as "pseudonymized" rather than "anonymized," and it therefore remains subject to GDPR.
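To make the pseudonymization point concrete, here's a toy sketch (assuming the open-source tiktoken package, OpenAI's tokenizer library, is installed; the sample string is invented):

```python
# Tokenization maps text to integers, but the mapping is fully reversible,
# which is why the result counts as "pseudonymized" rather than "anonymized."
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-3.5/4-era models

text = "Mario Rossi, born 1985, lives in Milan."  # invented personal data
ids = enc.encode(text)
print(ids)              # a list of plain integers

# The original text, personal data included, comes straight back:
print(enc.decode(ids))  # -> "Mario Rossi, born 1985, lives in Milan."
```

Whether the trained weights themselves still encode such data is a harder question, but the tokenized corpus fed into training plainly does.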
Did OpenAI actually not get consent? It seems to me they did obtain consent for using user inputs for training, since their terms of service plainly state they will use inputs as training data. But GDPR tends to demand a higher standard of consent (explicit opt-ins such as checkboxes), and as far as I know OpenAI didn't use checkboxes.
I think this is pretty interesting to watch play out.
OpenAI is trying to argue that line, but given that most of OpenAI's models are trained on scraped data they had no legal right to beyond existing court precedents on scraping public data, they don't have much of a leg to stand on.
> Due to our concerns about malicious applications of the technology, we are not releasing the trained model.
Doesn't that go against the mission of OpenAI? I thought they were about making technology publicly accessible to everyone so that it can't be abused by only a few people. This makes them seem more like a business with proprietary data.
For that matter, I don't see how "OpenAI" could even try to legally enforce its terms against competitors training their models on the output of "OpenAI" models… at least not without being laughed out of the courtroom at best, or ending up paying far more itself at worst, given how blatantly "OpenAI" disregards any licenses on the content and data it uses to train its models.
> Can you point me to where OpenAI admitted they illegally acquired copies of works for training data?
Yes. It's on the first page of the linked PDF:
> For this response, we draw on our experience in developing cutting-edge technical AI systems, including by the use of large, publicly available datasets that include copyrighted works.
They then go on to claim that this is just how everyone does it, as though that makes it okay, including making copies of the copyrighted works:
> Modern AI systems require large amounts of data. For certain tasks, that data is derived from existing publicly accessible “corpora” (singular: “corpus”) of data that include copyrighted works. By analyzing large corpora (which necessarily involves first making copies of the data to be analyzed)...
> at this point you’re being deliberately obtuse
No, I just suspect we fundamentally disagree about the value and importance of many technologies.
Or maybe there just isn't evidence they've used it? Whereas for Google, we do have evidence they used that data.
I mean, I'm all for these companies disclosing their training data. But simply assuming OpenAI isn't being fined because of their site is (unless you know more) pure speculation. The same goes for the sibling comment suggesting Mistral should also be looked at.
Additionally, it's wild that they think their content hadn't already been scraped and trained on.
They were fine with humans training on this data and selling their expertise to employers, but they're not fine with a higher-order way of consuming their contribution, and are removing it so that smaller open-source competitors to OpenAI can't train on the same data OpenAI trained on.
The thing that bothers me about the whole situation is that OpenAI prohibits using its model output as training data for your own models.
It seems more than a bit hypocritical, no? When it comes to their own training data, they claim to have the right to use any/all of humanity’s intellectual output. But for your own training data, you can use everything except for their product, conveniently for them.
In fact, neither Facebook nor OpenAI can train their models without asking permission. Just wait for someone to start raising this concern. The EU is working on regulating these kinds of aspects; for example, this is not at all compliant with the GDPR (unless you train only on data that contains no personal data, which is rarer than you would think).
One challenge is that to build a large enough custom dataset you need either a small army of annotators or a very strong existing model, which in practice means you probably have to use OpenAI. And using OpenAI to generate training material for another model violates their terms.
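For context, the workflow being described looks roughly like this; a minimal sketch using the official openai Python package, where the model name, seed topics, prompt, and output file are all illustrative assumptions:

```python
# Sketch: generating synthetic training examples with a strong existing model.
# This is the pattern OpenAI's terms prohibit using to train a competing model.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

seed_topics = ["binary search", "GDPR consent", "tokenization"]  # illustrative seeds

with open("synthetic_train.jsonl", "w") as f:
    for topic in seed_topics:
        resp = client.chat.completions.create(
            model="gpt-4",  # the "very strong existing model"
            messages=[{
                "role": "user",
                "content": f"Write a question and a detailed answer about {topic}.",
            }],
        )
        # Each generated pair becomes one training example for your own model.
        f.write(json.dumps({"topic": topic,
                            "text": resp.choices[0].message.content}) + "\n")
```

Scale the seed list up a few orders of magnitude and you have the kind of instruction-tuning corpus that would otherwise require that small army.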
Has anyone taken them to court about this? Do we all just decide it's not fair and ignore it?
No, I'm not talking about the training data. I'm talking about restrictions put in place by their creators: topics that give me warnings about violating their terms of service.
I guess it is in OpenAI's best interest to downplay the memorization aspect in favor of the logical reasoning angle. If it turns out that GPT is memorizing and reproducing copyrighted data, it could land them in legal trouble.
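Memorization is straightforward to probe, for what it's worth: feed the model the opening of a known passage and measure how closely its continuation matches the real text. A hedged sketch follows; generate() is a stand-in for whatever model API you're testing, and the 0.9 threshold is an arbitrary illustration:

```python
# Probe for verbatim memorization: prompt with a known prefix, then compare
# the model's continuation against the true continuation of the source text.
import difflib

def memorization_score(generate, prefix: str, true_continuation: str) -> float:
    """generate() is assumed to be any callable mapping a prompt to model text."""
    output = generate(prefix)[: len(true_continuation)]
    return difflib.SequenceMatcher(None, output, true_continuation).ratio()

# Hypothetical usage (Dickens is public domain; a copyrighted novel is the
# interesting case, since the model shouldn't reproduce it unless memorized):
# score = memorization_score(my_model, "It was the best of times, ",
#                            "it was the worst of times, it was the age of wisdom")
# if score > 0.9:  # arbitrary cutoff for "near-verbatim"
#     print("likely memorized")
```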
Isn't there sufficient evidence to conclude that OpenAI is training on things they know they probably shouldn't be training on, like the full text of novels that are not publicly available on the internet? I don't think it's that unreasonable to extrapolate that they would blur the lines of legal agreements to get access to private user data they shouldn't have.
"Yes, we train our models on a good chunk of the internet without asking permission, but don't you dare train on our models' output without our permission!"