Assuming ML models are causal, bits of GPL code that fall out of the model have to have the color GPL, because the only way they could have gotten there was by training the model on GPL-colored bits. It seems to me the answer here is pretty obvious: it doesn't really matter how you copy a work.
Yes, sometimes code is returned that is a verbatim reproduction of the training data. This can be prevented if need be.
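One rough way to do that, as a sketch (the tokenizer, the 20-token threshold, and the training_files variable are all assumptions of mine, not anything Copilot documents): keep a hash index of token n-grams from the training corpus and refuse to return any completion that shares a long exact run with it.

    import hashlib
    import re

    NGRAM = 20  # refuse completions sharing a run of 20+ tokens with the corpus (arbitrary threshold)

    def tokens(text):
        # Crude tokenizer: identifiers, numbers, and single punctuation characters.
        return re.findall(r"[A-Za-z_]\w*|\d+|\S", text)

    def ngram_hashes(toks, n=NGRAM):
        for i in range(len(toks) - n + 1):
            yield hashlib.sha1(" ".join(toks[i:i + n]).encode()).hexdigest()

    def build_index(corpus_texts, n=NGRAM):
        index = set()
        for text in corpus_texts:
            index.update(ngram_hashes(tokens(text), n))
        return index

    def is_verbatim(completion, index, n=NGRAM):
        # True if the completion contains any n-token run already seen in the training corpus.
        return any(h in index for h in ngram_hashes(tokens(completion), n))

    # index = build_index(open(p).read() for p in training_files)  # training_files is hypothetical
    # if is_verbatim(candidate, index): suppress it, or surface the original file and its license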
What I really don't understand is how some people are complaining about GPL'ed code being used for training.
What's the difference between a machine looking at the code and learning from it and a human being doing the same? As long as the code isn't patented, there's no reason why I shouldn't be able to look at GPL'ed code and implement the idea using my own code.
In other words, is - according to those who think training an ML model on GPL'ed code taints its output - every implementation a derived work if I looked at GPL'ed code that implemented the same algorithm? Where's the line that separates plagiarism from original work? Is there even such a line? Does it matter whether the GPL'ed code is encoded in human neurons or network weights after looking at it, and if so, why?
It’s not possible to have a license over an ML model trained on other people’s works, since such models are uncopyrightable. They’re more like a phone book: a collection of facts assembled by an entirely uncreative process. https://news.ycombinator.com/item?id=36691050
This hasn’t been proven in court, but it seems the most likely outcome.
> ...the jurisprudence here – broadly relied upon by the machine learning community – is that training ML models is fair use.
If you train an ML model on GPL code, and then make it output some code, would that not make the result a derivative of the GPL-licensed inputs?
But I guess this could be similar to musical composition: if the output doesn't resemble any of the inputs and doesn't contain significant contiguous portions of them, then it's not a derivative.
Well yes, if you’re asked to memorize someone’s code, you’ll get surprisingly far too. The fact that models can do this isn’t evidence of anything. It’s a capability (or “tendency” if you overfit too much).
I think it’s pretty obvious that if you train a model to reproduce existing work in its entirety, it fails the “sufficiently transformative” test and thus loses legal protection.
But there’s nothing stopping you from re-coding existing implementations of GPL’ed code. I used to do it. And your new code has your own chosen license, even if your ideas came from someone else. Are you sure the same logic shouldn’t apply to models?
> You make GPL code, I make an AI that learns from GPL code; shouldn't its output be GPL licensed as well?
If the output (not just the model) can be determined to be a derivative work of the input, or the model is overfit and regurgitating training set data, then yes. It should. And a court would make the same demands, because fair use is intransitive - you cannot reach through a fair use to make an unfair use. So each model invocation creates a new question of "did I just copy GPL code or not".
Generally an ML model transforms the copyrighted material to the point where it isn't recognizable, so it should be treated as its own unrelated work that isn't infringing or derivative. But then you have e.g. GPT reproducing some (largeish) parts of the training set word-for-word, which might be infringing.
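One hedged way to check whether a model is regurgitating rather than generalizing, sketched below: feed it a prefix taken from a training document and measure the longest verbatim run shared by its continuation and the document's real continuation. Here generate is a stand-in for whatever completion call you have, and the lengths are arbitrary.

    def verbatim_overlap(generate, document, prefix_len=200, probe_len=400):
        # `generate(prompt, max_chars)` is a placeholder for your model's completion call.
        prefix = document[:prefix_len]
        truth = document[prefix_len:prefix_len + probe_len]
        completion = generate(prefix, max_chars=probe_len)

        # Longest common substring between the completion and the true continuation,
        # via simple dynamic programming (fine at these sizes).
        best, prev = 0, [0] * (len(truth) + 1)
        for ch_c in completion:
            cur = [0]
            for j, ch_t in enumerate(truth, 1):
                cur.append(prev[j - 1] + 1 if ch_c == ch_t else 0)
            best = max(best, max(cur))
            prev = cur
        return best  # runs of hundreds of characters suggest memorization, not generalization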
Also I don't think there have been any major court cases about this, so there's no clear precedent in either direction.
I'm not a lawyer and I can't say how existing copyright law applies to this situation, but, how is taking images and feeding them into an ML model different from taking library code and including it in your software?
In both cases, you take a series of bytes (the image data / the library source code) that is ultimately crucial to the functioning of your software, combine it with your own original code you wrote (training / compilation), and end up with a new output (the trained model / the binary executable) that is distinct from any of the original sources.
If you use a GPL'd library in your software, then it's uncontroversial to say that you have to follow the terms of the GPL. You can't say "well actually, the compiler is just reading your source code and learning what sort of binary it should produce, just like a human learns by studying source code, so I actually don't have to follow your licensing terms". No one would buy that. You clearly used that library, so you have to obey whatever terms come along with it.
Why is it fine to ignore the licensing terms for image data you incorporate into your software, but not third-party source code that you incorporate?
The fun question anyway is whether an ML model is copyright-protectable at all. Probably not, as it is produced by an algorithm (which is itself even GPL'ed). So the only tools would have been watermarking and pulling NDA-type clauses; however, a Google form seems not the best way to do that in the first place, and it's close to impossible to identify the leak (if they are not as stupid as it seems). Or am I missing anything? One backdoor would be if they included copyrighted material in the training data and showed how it can be extracted from the model. Maybe the whole stunt was about trying out how the legal system handles these cases :)
Training a model essentially creates a derivative work from the input data. GPL licensed code requires keeping the same license for derivative work. BSD-style licenses at the very least require retaining the copyright statement.
It feels pretty clear to me that the trained model and code generated from it would have to adhere to the original license.
In a way ML is a bit like lossy compression. Would you think JPEG encoding and decoding an image strips the result of its original copyright?
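To make the analogy concrete, a small sketch (assuming Pillow and NumPy are installed; the file name is a placeholder): push an image through an aggressive JPEG round-trip and measure how close the lossy copy stays to the original.

    import io
    import numpy as np
    from PIL import Image

    def jpeg_roundtrip_psnr(path, quality=10):
        # Encode at a very low JPEG quality, decode again, and report PSNR in dB.
        original = Image.open(path).convert("RGB")
        buf = io.BytesIO()
        original.save(buf, format="JPEG", quality=quality)
        lossy = Image.open(io.BytesIO(buf.getvalue())).convert("RGB")

        a = np.asarray(original, dtype=np.float64)
        b = np.asarray(lossy, dtype=np.float64)
        mse = np.mean((a - b) ** 2)
        return float("inf") if mse == 0 else 10 * np.log10(255.0 ** 2 / mse)

    print(jpeg_roundtrip_psnr("photo.jpg"))  # "photo.jpg" is a placeholder path

Even at quality=10 a typical photo comes back at roughly 25-30 dB: most of the bytes were thrown away, yet nobody would argue the decoded picture isn't the same copyrighted image.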
It'd be nice to see some proof here. Copyright is not absolute and does not extend, for example, to things that have no creativity in them. There are only so many ways to write a for loop or an if condition. Training an ML model from a large body of code IMHO violates copyright no more than any of us reading code and learning from it, as long as GH Copilot doesn't spit out code that's exactly the same as something already existing.
I largely agree with you, but I think there is one question that hasn't been addressed yet: Are the weights learned by an LLM a derivative work?
When a person learns from GPL code this question doesn't arise. The state of a person's brain is outside of copyright. But is the state of an LLM also outside of copyright or outside of the terms covered by the GPL? I'm not sure.
An LLM can not only emit source code derived from code published under the GPL; it can also potentially execute a program, and its weights could therefore be considered object code.
This isn't necessarily a problem as long as the model isn't distributed and does not include any AGPL code.
It's a good article (perhaps a little slow to get going), and an interesting topic.
I suppose today, with the surges of interest in ML-derived content, the concept of "color" of bits is more relevant than ever.
There's the quote "Machine learning is money laundering for bias" (consider, eg, that if you ask an image generator model to create a "woman," it will very likely be a white woman), and I suppose the same is true for copyright, and other things. Basically it adds a layer of plausible deniability, and some could argue changes the "color of the bits."
As the article discusses, the law often _does_ care about the color/provenance of bits, though CS people are more prone to take a "data is just data" approach.
Humans used to learn to code from copyrighted works (textbooks) without much reference to OSS or Free Software. Similarly, teaching ML models to code from copyrighted works isn't going to violate copyright more frequently than a human might; and detecting exact copies should be pretty easy by comparing with the corpus used to train it. Software houses already have to worry about infringement of snippets, and things like Codex are just one more potential source.
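A rough sketch of that corpus comparison, using MOSS-style winnowing fingerprints (the k, window, and threshold values are arbitrary choices of mine, and corpus_files is hypothetical):

    import hashlib

    def fingerprints(text, k=25, window=4):
        # Hash every k-character substring, then keep the minimum hash per sliding
        # window (winnowing), so small edits don't hide a copied passage.
        hashes = [int(hashlib.md5(text[i:i + k].encode()).hexdigest(), 16)
                  for i in range(len(text) - k + 1)]
        if len(hashes) <= window:
            return set(hashes)
        return {min(hashes[i:i + window]) for i in range(len(hashes) - window + 1)}

    def likely_copied(snippet, corpus_files, threshold=0.5):
        # Flag a generated snippet whose fingerprints overlap heavily with any training file.
        snip = fingerprints(snippet)
        if not snip:
            return False
        for path in corpus_files:
            with open(path, encoding="utf-8", errors="ignore") as f:
                overlap = len(snip & fingerprints(f.read()))
            if overlap / len(snip) >= threshold:
                return True
        return False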
> you can't make these machine things without literally feeding this copyrighted information into them, therefore they do contain a copy.
They don't necessarily. Think about it: you can take some copyrighted material (a fictional book, for instance) and transform the information contained in it. You can then write a summary. The summary contains information that was present in the original, but it has been transformed, and hence it's not a copy. The ML model contains information that has been generalized to some degree. So it's just a grey area, IMO.
This was a lot of words to say that LLMs aren't human. I agree they aren't human. Why do you think that training on GPL code that the model does not redistribute is a violation of copyright law?