Training a model essentially creates a derivative work from the input data. GPL-licensed code requires keeping the same license for derivative works. BSD-style licenses at the very least require retaining the copyright statement.
It feels pretty clear to me that the trained model and code generated from it would have to adhere to the original license.
In a way ML is a bit like lossy compression. Would you think JPEG encoding and decoding an image strips the result of its original copyright?
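To push the analogy: a lossy round trip changes the bytes but not the provenance. Here's a crude stand-in "codec" (pure illustration, not real JPEG) where the decoded output is no longer byte-identical to the input, yet is plainly derived from it:

```python
# Crude lossy "codec": quantize each byte to the nearest lower multiple
# of `step`. The round-tripped data differs from the original byte for
# byte, but it was produced mechanically from it - nobody would call
# the result an independent work.

def encode(data: bytes, step: int = 4) -> bytes:
    return bytes(b // step for b in data)

def decode(encoded: bytes, step: int = 4) -> bytes:
    return bytes(b * step for b in encoded)

original = b"Hello, copyrighted image data!"
roundtrip = decode(encode(original))
print(roundtrip == original)  # False: the copy is lossy
print(roundtrip)              # yet still mechanically derived from the input
```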
> You make GPL code, I make an AI that learns from GPL code, shouldn't its output be GPL licensed as well?
If the output (not just the model) can be determined to be a derivative work of the input, or the model is overfit and regurgitating training set data, then yes. It should. And a court would make the same demands, because fair use is intransitive - you cannot reach through a fair use to make an unfair use. So each model invocation creates a new question of "did I just copy GPL code or not".
Yes, sometimes code is returned that is a verbatim reproduction of the training data. This can be prevented if need be.
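On the "can be prevented" point, a minimal sketch of how such a filter might work (my own illustration, not GitHub's actual mechanism): index token n-grams of the training corpus and reject any generated snippet that shares a sufficiently long n-gram with it.

```python
# Sketch of a verbatim-output filter. The tokenizer (whitespace split)
# and n-gram threshold are made up for illustration; a production system
# would use a real tokenizer and a scalable index (suffix array, Bloom
# filter) over a far larger corpus.

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as a set of tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_index(corpus_texts, n=4):
    """Union of every training document's n-grams."""
    index = set()
    for text in corpus_texts:
        index |= ngrams(text.split(), n)
    return index

def is_verbatim(candidate, index, n=4):
    """True if the candidate shares any n-gram with the training corpus."""
    return any(g in index for g in ngrams(candidate.split(), n))

corpus = ["int main ( void ) { return 0 ; }"]
index = build_index(corpus)
print(is_verbatim("int main ( void ) { return 0 ; }", index))  # True
print(is_verbatim("fn main ( ) { }", index))                   # False
```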
What I really don't understand is how some people are complaining about GPL'ed code being used for training.
What's the difference between a machine looking at the code and learning from it, and a human being doing the same? As long as the code isn't patented, there's no reason why I shouldn't be able to look at GPL'ed code and implement the idea using my own code.
In other words, according to those who object to using GPL'ed code for ML training, is every implementation a derived work if I looked at GPL'ed code that implemented the same algorithm? Where's the line that separates plagiarism from original work? Is there even such a line? Does it matter whether the GPL'ed code is encoded in human neurons or network weights after looking at it, and if so, why?
> IMHO models trained on not-properly-licensed (pirated) data should at the very least not be copyrightable and should be public domain.
My understanding is that ML model weights cannot be copyrighted as an original creative work. They are trade secrets, protected through contracts, but once leaked to third parties it's not a copyright violation to use or distribute them.
Whether the model is actually a derivative work of the training data is another interesting question.
When a model is trained on 'all-rights-reserved' content like most image datasets, the community says it's fair game. But when it's 'just-a-few-rights-reserved' content like GPL code, apparently that crosses a line?
A) This is just taking divided opinion and treating it like a person with a contradictory opinion (as others have noted).
B) Nothing about GPL makes it "less copyrighted". Acting like a commercial copyright is "stronger" because it doesn't immediately grant certain uses is false and needs to be challenged whenever the claim is made.
C) If anything, I suspect image generation is going to be practically more problematic for users - you'll be exhibiting a result that might be very similar to the copyrighted training data, might contain a watermark, etc. If you paste a big piece of GPL'd code into your commercial source code, it won't necessarily be obvious once the thing is compiled (though there are whistleblowers etc., so don't do it).
> ...the jurisprudence here – broadly relied upon by the machine learning community – is that training ML models is fair use.
If you train an ML model on GPL code, and then make it output some code, would that not make the result a derivative of the GPL licensed inputs?
But I guess this could be similar to musical composition. If the output doesn't resemble any of the inputs and doesn't contain significant contiguous portions of them, then it's not a derivative.
It'd be nice to see some proof here. Copyright is not absolute and does not extend, for example, to things that have no creativity in them. There are only so many ways to write a for loop or an if condition. Training an ML model from a large body of code IMHO violates copyright no more than any of us reading code and learning from it, as long as GH Copilot doesn't spit out code that's exactly the same as something already existing.
You said that you're "not so sure that a trained model is (or even should be) subject to the copyright of the inputs."
You missed my point. I'm not saying that the model is subject to the copyright of the inputs; I'm saying that the model's outputs are, which is entirely different. We say that the output of a compiler is still subject to the copyright of the inputs, so why not this?
They only claimed that training the model was fair use. What about its output? I argue that its output is still affected by the copyright of its inputs, the same way the output of a compiler is affected by the copyright of its inputs.
If GPT-3 can learn in context it means both the training set and the prompt could be in copyright violation. So even a clean model, trained on licensed data, cannot guarantee there will be no copyright issue.
> If a ML model spits out snippets of copyrighted material and you try to monetize those then that’s clearly infringement.
If it spits it out because it was in the training set or prompt, yes. If it spits it out and it was in the training set (or prompt), it is at least difficult to make the case that it is not infringement. (If it was in the prompt, you basically have the same problem as any other case of proving that producing something matching someone else's copyrighted work was independent, since copyright only protects against copying, not coincidence.) If it spits it out but it was not in the training set or prompt, then clearly it was not infringement.
> But if it ingests copyrighted material and then spits out entirely new content influenced by the originals… how is that an issue?
Because a non-perfect mechanical copy is still a copy violating copyright if there is no license, and a derivative work produced by a human using AI as a tool is still an infringing derivative work if there is no license. While copyright protects against perfect copies, it protects against imperfect copies and derivative works as well. (Except where exceptions like Fair Use apply.)
If it's public domain, then no copyright is violated. I'm not talking about public-domain data; the G-G-GP specifically mentioned the possible legal interpretation that training on large amounts of publicly visible (but not public domain) data is itself a copyright violation.
Even if it does violate copyright, I think there is a strong argument for adjusting copyright law to allow training ML algorithms with any data, no matter the source.
In cases like this, the benefits of the technology are so huge, and the downsides to the original code author so tiny. Was he actually going to license that code to someone who said 'nah, I'll recreate it with copilot instead'?
Very few people successfully license anything less than 100,000 lines of code anyway.
There's no such thing. Without a license you can't enforce any restrictions.
AI training is basically just building a very complex Markov chain; that's obviously not a copyright violation, because the output product doesn't contain the input - only data about it. If your text has been copied, then please point to it in these weights here.
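To make the Markov-chain framing concrete, here's a toy word-level chain (a deliberately simple sketch, nothing like a real large model). The learned "weights" are just transition counts, not stored text; yet with sparse training data the chain can still reproduce long runs of its input verbatim, which is exactly the regurgitation issue debated upthread.

```python
import random
from collections import defaultdict

# Toy word-level Markov chain. The model stores only word-to-successor
# transitions, not the source text itself - but when the corpus is tiny,
# most words have a single successor, so generation can walk straight
# back through the original sentence.

def train(text):
    chain = defaultdict(list)
    words = text.split()
    for a, b in zip(words, words[1:]):
        chain[a].append(b)
    return chain

def generate(chain, start, length, seed=0):
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        successors = chain.get(out[-1])
        if not successors:
            break
        out.append(rng.choice(successors))
    return " ".join(out)

chain = train("the quick brown fox jumps over the lazy dog")
print(dict(chain))             # "weights": counts of transitions, no full text
print(generate(chain, "the", 6))
```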
In what context? You are planning on commercializing Copilot and in that case the calculus on whether or not using copyright protected material for your own benefit changes drastically.
I don't see why this would be a copyright violation anymore than somebody learning something from multiple sources and reformulating what they learned into an answer to a question. As long as it isn't explicitly reciting its training data, there shouldn't be an issue of copyright.
> even if training a dataset is fair use, distributing the result is copyright infringement
I would be inclined to agree that the current situation (ie reproducing training examples verbatim) violates copyright. On the other hand, I'm not so sure that a trained model is (or even should be) subject to the copyright of the inputs.
Of course I acknowledge that the latter view is controversial and also that such issues are so new that they haven't had a chance to be meaningfully addressed by either the courts or the legislature yet.
As an example of a similar situation, see (https://www.thiswaifudoesnotexist.net/) which was trained entirely on copyrighted artwork. Note that there are at least three distinct issues here - training the model, distributing the model itself, and distributing the output of the model.
> I would want my license to make that part clearer.
But again, GitHub's argument here is that the license is completely irrelevant because it doesn't apply in the first place. Thus they won't care one bit about any clarifications you make one way or the other.
> It’s unlikely that model weights can be copyrighted, as they’re the result of an automatic process.
If they can’t for that reason alone, then the model is a mechanical copy of the training set, which may be subject to a (compilation) copyright, and a mechanical copy of a copyright-protected work is still subject to the copyright of the thing of which it is a copy.
OTOH, the choices made beyond the training set and algorithm in any particular training may be sufficient creative input to make it a distinct work with its own copyright, or there may be some other basis for them not being copyright protected. But the mechanical process one alone just moves the point of copyright on the outcome, it doesn’t eliminate it.
I've also read GPL code, and that doesn't make anything I write GPL. It matters whether the code was substantially copied or not.
So I think I would apply the same rules to AI. In general, not all code produced is infringing on the copyright of all authors of training data. However, there have been some clear cases of copying (the GPL license text and a matrix multiplication routine, for example) that do appear to be copyright violations.
And yes, the implication is that a different less explicit prompt could still emit copyrighted code.