This is true - afaik there have been no specific rulings on whether training models on copyrighted material is a violation. But to my mind it harkens back to stuff like Xerox, where the tool itself isn't the violating thing, it's the use of the tool. Likewise, derivative works are often largely reproductions with minor variations and are protected under fair use. A model that takes enormous amounts of data and distills it into a tiny vector representation, way below the information-theoretic threshold for any meaningful fidelity, and mixes and overlaps data such that the original data isn't plausibly stored in the model… I'm definitely not going to wager my life that's fair use, but I would wager my company on it.
That's right: the model is definitely capable of creating things that are clearly a derivative work of what it was trained on. But this still leaves a few questions:
* Does the model require a copyright license? Personally I think it's very likely a derivative work, but that doesn't necessarily mean you need a license. The standard way this works in the US is the four factors of fair use (https://copyright.columbia.edu/basics/fair-use.html) where Factor 1 is strongly in favor of the model being unrestricted while 2-4 are somewhat against (and in some cases 4 is strongly against).
* Is all output from the model a derivative work of all of the input? I think this is pretty likely no, but unclear.
* Does the model reliably emit derivative works of specific inputs only when the user is trying to get it to do that? Probably no, which makes using one of these models risky.
Training a model isn’t making a copy for your own use; it’s not making a copy at all. It’s converting the original media into a statistical aggregate combined with a lot of other stuff. There’s no copy of the original, even if it’s able to produce a similar product to the original. That’s the specific thing - the aggregation and the lack of direct reproduction in any form is fundamentally not reproducing or copying the material. The fact it can be induced to produce copyrighted material, just as you can induce a Xerox to reproduce copyrighted material, doesn’t make the original model or its training a violation of copyright. If its sole purpose were reproduction and distribution of the material, or if it carried a copy of the original around and produced it on demand, that would be a different story. But it’s not doing any of that, not even remotely. All this said, it’s a highly dynamic area - it depends on the local law, the media and medium, and the question hasn’t been fully explored. I’m wagering, though, that when it comes down to it, the model isn’t violating copyright for these reasons, but you can certainly violate copyrights using a model.
For sure, that could be an instance of infringement depending on how it is used. But that's a minuscule percentage of the output and still might be fair use (read the decision in Authors Guild, Inc. v. Google, Inc.). But even if that instance is determined to be infringement, it doesn't mean the process of training models on copyrighted work is also infringement.
I think this should generally be true. The aggregation performed by model training is highly lossy, and the model itself is a derived work at worst and is certainly fair use. It may produce stuff that violates copyright, but it’s the way you use or distribute the product of the model that can violate copyright. Making it write code that’s a clone of copyrighted code, or making it make pictures with copyrighted imagery in them, or making it reproduce books, etc., and then distributing the output, would be where the copyright violation occurs.
It is derivative. There’s then a question of whether the derived work is sufficiently transformative to be fair use, which depends on what the model is outputting.
But yeah that’s separate from the question of whether you properly licensed the data to train on in the first place.
A big chunk of the computing community seems to approach licensing as “I can see it, so I can use it.” (See GPL code used where it shouldn’t be.)
I don't agree that a model is a derivative work, and I think a judge would likely agree with me. I think you need to be able to show that the major copyrightable elements of the original work are actually present in the allegedly derivative work, something that is very non-trivial with even the most transparent of models like Stable Diffusion. Scientists doing intensive analysis of the SD model were only able to find around a hundred instances of reproduced images from the source material out of several hundred thousand attempts.
That said, it definitely would be copyright infringement to download a bunch of copyrighted material and actually use it in some way, for example to train a model. Luckily, in most jurisdictions it is recognised that this is the case, and so governments have specifically carved out exceptions to copyright law for this process (known as text and data mining, or TDM). This includes the UK, the EU, Japan, and China. In the US there is no specific law addressing the issue yet, but many companies are doing it there (and have been doing it for many years) with the presumption of legality based on the Authors Guild v. Google and Perfect 10 v. Google rulings. Basically, they are acting under the assumption that it is fair use, which I think is a ~reasonable assumption and one I think would be upheld by the US Supreme Court if it chose to take the question up.
You mean like how the model itself is a derivative work of tons of copyrighted content? If the original model can sidestep the issue of being trained on copyrighted content, then it should be fair game to train a new model off of a copyrighted model.
I am pretty sure that it is not established law, but I am pretty sure that that is how it will work out. US provisions for fair use make training models likely OK, and the EU is carving out exemptions for it. See https://valohai.com/blog/copyright-laws-and-machine-learning... for more.
The question of whether the output of the model itself counts as a derivative work, though, is rather more complex. In the case of Github Copilot it has proven very adept at spitting out large chunks of clearly copyrighted code with no warning that it has done so. And lawsuits are being filed over this.
But in the case of visual artwork, I'm pretty sure that it is going to be ruled not derivative, because while it is imitative, you cannot produce anything that anyone can say is a copy of X.
But as ML continues to operate, we'll get cases that are ever closer to the arbitrary line we are trying to maintain about what is and is not a copyright violation. And I'm sure that any criteria that the courts try to put down is not going to age well.
They only claimed that training the model was fair use. What about its output? I argue that its output is still affected by the copyright of its inputs, the same way the output of a compiler is affected by the copyright of its inputs.
I firmly believe that training models qualifies as fair use. I think it falls under research, and is used to push the scientific community forward.
I also firmly believe that commercializing models built on top of copyrighted works (which all works start out as) does not qualify as fair use (or at least shouldn't), and that commercializing models built on copyrighted material is nothing more than license laundering. Companies that commercialize copyrighted work in this manner should be paying for a license to train with the data, or should stick to using the licenses that the content was released under.
I don't think your example is valid either. The reason that AI models generate content similar to other people's work is that those models were explicitly trained to do that. That is literally what they are and how they work. That is very different from people having similar styles.
I think you're saying any work created by a model trained on copyrighted data is a derivative work of that copyrighted data.
But this can't be right; it is inconsistent with how copyright has worked so far. Artists and musicians and engineers all learn from each other, having seen and learned from ("trained on") many other examples of works from their field. Even when works are clearly inspired by other works, we tend not to give them the legal status of derivative work.
You're suggesting we treat models with a much stricter copyright regime than has previously existed.
My personal opinion is that the source data does not exist inside the model, so the model does not in and of itself constitute a copyright violation.
It is also not a derivative work, as it is not recognizable as any of the works it was trained on.
However, if the output it produces is close enough to existing copyrighted works, then that output cannot be used without a license.
That seems fair, and just like how we would judge a human being.
Testing the fair use argument for training won't necessarily answer the question about copyright laundering.
You could easily make the case that training is fair use, but that doesn't have to imply the model's output is non-infringing.
For example, it seems reasonable to train a model by feeding copyrighted texts and images, and that model could be useful for analyzing the content, finding facts, or detecting features. But we're in murky waters when the model also starts outputting the original content (be it verbatim or "derived").
Not all that different from human learning: you can study and learn from publicly available books but that doesn't grant you the right to recite their contents and claim it as your own, original work.
There are two sides here: is what goes in fair use (the training data), and is how the output is used fair use (people using the output)?
On the output side, I can for example (1) use a copyrighted image in an educational presentation (2) retype a novel for typing practice.
On the input side, this is where more of the debate is happening. Is "learning" from copyrighted material fair use?
It is within the realm of possibilities for a generative model to produce something that could violate a copyright, depending on how it is used, without being trained on copyrighted material. It's similar to the idea that given enough monkeys and enough time, you could reproduce a Shakespeare play.
It's not simply a given that using copyright material to train a model is copyright violation.
In my view it isn't. No one image contributes a significant amount, and the process the machine performs is analogous to what a human does when the human learns.
If the model produces content that would be a copyright violation in any other context, it doesn't stop being a copyright violation regardless of any of these decisions. No one has ever disagreed with that; the alternative is the functional abolition of copyright, and if you're okay with that, then you're not arguing for this stuff, you're arguing for the abolition of copyright.
The argument is that if you trained the model with copyrighted data, and then you or someone else separately used the model to generate novel media that was not legally similar enough to any copyrighted work to make it a copyright violation, that isn't violating copyright; it's fair use. Basically, using it to make your own original content is legal, and using it to create an unauthorised reproduction of a copyrighted work is illegal. Just like all other software.
Unless the resulting "derivative work" is sufficiently transformative. Which, I would argue, training an AI/ML model is.
Therefore, using a training dataset does not constitute copyright violation.
If the AI outputted an exact copy (or a close enough copy that a layman would agree it's a copy), then that particular instance of the AI's output is in violation of copyright. The AI model itself doesn't violate any copyright.