The claim that most people training models make is that what they are doing is sufficiently transformative to count as fair use, and so doesn't require a license. If that holds, putting a clause in a software license that prohibits model training wouldn't do anything.
In this case, what the model is doing is clearly (to me, as a non-lawyer) not transformative enough to count as fair use, but it's possible that the Copilot folks will be able to fix this kind of thing with better output filtering.
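For what "output filtering" could look like, here is a minimal sketch, assuming the filter can hash shingles of the licensed training corpus up front. The window size, threshold, and function names are all made up for illustration; this is not anything Copilot actually does.

```python
import hashlib

WINDOW = 10  # tokens per shingle; illustrative, not a real Copilot setting

def shingles(tokens, window=WINDOW):
    """Yield a hash for every window of consecutive tokens."""
    for i in range(max(0, len(tokens) - window + 1)):
        chunk = " ".join(tokens[i:i + window])
        yield hashlib.sha256(chunk.encode()).hexdigest()

def build_index(corpus_files):
    """Hash every shingle in the licensed training corpus."""
    index = set()
    for path in corpus_files:
        with open(path) as f:
            index.update(shingles(f.read().split()))
    return index

def looks_copied(generated, index, threshold=0.5):
    """Flag output if too many of its shingles appear verbatim in the corpus."""
    hashes = list(shingles(generated.split()))
    if not hashes:
        return False
    hits = sum(h in index for h in hashes)
    return hits / len(hashes) >= threshold
```

The trade-off in any scheme like this is the window size: too small and ordinary idioms get flagged, too large and near-verbatim copies slip through.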
Developers had a similar response: they don't think that training a model on their copyrighted work is fair use, especially for commercial use.
They'd expect the model authors to comply with the open source license, as with any other use.
This is basically the debate here; it's the same for artists and developers.
The question is whether it makes sense to allow someone to use your copyrighted work to train a model that they'd then use for commercial purposes, without needing a license agreement.
Aren't they claiming that it's fair use? IANAL, but wouldn't that make the license irrelevant if training AI/ML models were found to be fair use? And if not, it's a license violation anyway?
In what context? If you are planning on commercializing Copilot, the calculus on whether using copyright-protected material for your own benefit is fair use changes drastically.
Ensuring a model never outputs copyrighted content is tangential and, frankly, irrelevant. We don't look for a way to guarantee humans never output copyrighted content; we address each time they do, case by case.
A model's training being ruled fair use doesn't mean all of its output can be used for any purpose whatsoever.
Then there is the argument that the fair use rules are never even reached, because training the model doesn't do anything that requires a fair use exemption in the first place.
I don’t think it makes sense for both model builders and the model’s users to separately obtain licenses for the same works used in the training set.
A model trained on several copyrighted data sources cannot somehow be used in a way that depends on only a subset of those sources; every source's contribution is blended inseparably into the weights.
So all parameters of usage and compensation should be settled by contract between the model builder and copyrighted data supplier, before the copyrighted material is used.
Or to put it simply: using copyrighted material to create a model would NOT be considered fair use.
That’s it. That’s the standard. No complicated new laws required.
Model builders obtain permission to use copyrighted material from copyright holders based on any terms both agree to.
Terms might involve model usage limits, term limits, one time compensation, per use compensation, data source credits, or anything else either party wants.
The likely result will be some standard sets of terms becoming popular and well known. But nobody has to agree to anything they don’t want to.
Whether or not copyright applies at all to model training is an entirely open question, and where rulings have come down, they lean toward these situations being fair use (e.g. the Google Books case, which was ruled transformative and not a direct replacement for the works in question).
The reality is, these models don't copy or distribute anything directly, which makes applying copyright a bit of a stretch. Many people feel it is a use that some sort of IP law should cover, which is why I think there's some chance that courts or legislators will decide to screw the letter of existing law and just wedge new interpretations in. But it's not super simple: they'd have to thread the needle without making things like search illegal, and that's tricky. Besides that, these models are out there, they're useful, and if they're ruled infringing they'll just be distributed illegally anyway.
I don't envy the people who will have to decide these cases, I suspect what's better for the world overall is to leave the law as-is and clarify that fair use holds (nobody will stop publishing content or code just because AI is slurping it up, a few weirdos like the article author excepted), but there are going to be a lot of pissed off people either way...
I firmly believe that training models qualifies as fair use. I think it falls under research, and is used to push the scientific community forward.
I also firmly believe that commercializing models built on top of copyrighted works (which all works start off as) does not qualify as fair use (or at least shouldn't), and that commercializing models built on copyrighted material is nothing more than license laundering. Companies that commercialize copyrighted work in this manner should be paying for a license to train on the data, or should stick to the licenses that the content was released under.
I don't think your example is valid either. The reason AI models generate content similar to other people's work is that those models were explicitly trained to do so. That is literally what they are and how they work. That is very different from people happening to have similar styles.
They only claimed that training the model was fair use. What about its output? I argue that its output is still affected by the copyright of its inputs, the same way the output of a compiler is affected by the copyright of its inputs.
The theory used here, though, is that fair use permits transformative use. Normally Copilot's output should be sufficiently transformative; this particular situation appears to be a sort of bug (overfitting, where the model memorizes training data and reproduces it verbatim).
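As a toy illustration of how overfitting leads to verbatim reproduction (the corpus, order, and function names below are all invented for the example): a high-order character model trained on a tiny corpus has seen exactly one continuation for every context, so sampling from it regurgitates its training data rather than transforming it.

```python
from collections import defaultdict
import random

# Hypothetical tiny "training set"; pretend it is copyrighted code.
CORPUS = (
    "def gpl_licensed_function(x):\n"
    "    # imagine this is someone's copyrighted code\n"
    "    return x * 42\n"
)
ORDER = 8  # context length; large relative to the corpus, so it memorizes

def train(text, order):
    """Map each `order`-character context to the characters that follow it."""
    model = defaultdict(list)
    for i in range(len(text) - order):
        model[text[i:i + order]].append(text[i + order])
    return model

def generate(model, seed, length):
    """Sample continuations; with one choice per context, output is verbatim."""
    out = seed
    for _ in range(length):
        choices = model.get(out[-ORDER:])
        if not choices:
            break
        out += random.choice(choices)
    return out

model = train(CORPUS, ORDER)
print(generate(model, CORPUS[:ORDER], 200))  # reproduces CORPUS exactly
```

Real language models are vastly larger, but the failure mode is the same in spirit: when a context appears only once in training, as an unusual function often does, the highest-probability continuation is the memorized original.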
Note that AI models already work somewhat similarly to how humans do (even becoming bug-for-bug compatible at times :-P). We may even need laws amended in the opposite direction, else it might become illegal for humans to learn too.
I am pretty sure that it is not established law, but I suspect that is how it will work out. US fair use provisions make training models likely OK, and the EU is carving out exemptions for it. See https://valohai.com/blog/copyright-laws-and-machine-learning... for more.
The question of whether the output of the model itself counts as a derivative work, though, is rather more complex. In the case of Github Copilot it has proven very adept at spitting out large chunks of clearly copyrighted code with no warning that it has done so. And lawsuits are being filed over this.
But in the case of the visual artwork, I'm pretty sure that it is going to be ruled not derivative. Because while it is imitative, you cannot produce anything that anyone can say is a copy of X.
But as ML continues to operate, we'll get cases that are ever closer to the arbitrary line we are trying to maintain about what is and is not a copyright violation. And I'm sure that any criteria that the courts try to put down is not going to age well.
So we either 1) declare this to not be copyright infringement, or 2) basically outright outlaw training of interesting ML models.
I personally believe that (1) is the way to go, and I find the whole outrage about Copilot to be essentially akin to collectively shooting ourselves in the foot in the long run.
I think we shouldn't forget why copyright exists in the first place. To quote the copyright clause of the US constitution (emphasis mine):
> "*To promote the Progress of Science and useful Arts*, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries."
If anything we already went too far with how draconian modern copyright enforcement is, up to the point that it often actively hampers our progress. Please don't make it even worse!
Conflating training a model with human learning is wrong.
When training a model you are deriving a function that takes some input and produces an output. The copyright and licensing issue here is that a copy of the work is made and reproduced numerous times during training (see the sketch below).
The model is not walking around a museum where viewing is authorized. It is not a being learning a skill. It is a function.
The further issue is that it may output material that competes with the original. So you may have a copyright violation in distributing the dataset, or in the model's output.
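To make the "copies during training" point concrete, here is a minimal sketch of a generic training loop; the directory name, epoch count, and batch size are hypothetical. Each stage reproduces the copyrighted text again: onto disk when scraped, into memory when loaded, and per epoch as it is fed to the trainer.

```python
from pathlib import Path

EPOCHS = 3                        # hypothetical settings for illustration
BATCH_SIZE = 2
DATA_DIR = Path("scraped_repos")  # copy #1: works duplicated onto disk

def load_corpus(data_dir):
    # Copy #2: each file's full text is reproduced into process memory.
    return [p.read_text() for p in data_dir.glob("**/*.py")]

def batches(corpus, size):
    # Each batch groups texts for the trainer; tokenizing them would
    # reproduce their contents yet again as tensors.
    for i in range(0, len(corpus), size):
        yield corpus[i:i + size]

def train_step(batch):
    # Stand-in for the real gradient update on the tokenized batch.
    pass

corpus = load_corpus(DATA_DIR)
for epoch in range(EPOCHS):       # the in-memory copies are reused every epoch
    for batch in batches(corpus, BATCH_SIZE):
        train_step(batch)
```

The point being made above is that every one of those stages is a literal reproduction of the work, unlike a human reading the code.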
I think for code it will be pretty hard, unless the author gives explicit permission to train the model on the copyrighted data, as is the case for Copilot on GitHub.
And I don't think it is fair use at all: imagine a company like OpenAI training the model on its own internal docs and code; then you'd be able to ask the model to replicate ChatGPT and Copilot, or even closed software like Photoshop.