* Does the model itself violate copyright?
* Does the output of the model violate copyright?
I don't know how you could make an argument that the ingestion of information into a model through a training procedure in order to create something that can generate truly unique outputs isn't transformative of the original works. The legal standard for a new work to be considered a copyright violation of an original work is "substantial similarity". I don't know how you can make an argument that a generative model is "substantially similar" to thousands of original works...
Honestly, I'm not even sure if "fair use" comes into play for the model itself. In order for fair use to come into play, the model has to be deemed to be violating some copyright. Only once it is found to be violating does "fair use" come into play in order to figure out if it is illegal or not.
The second question is the one where fair use is likely to come into play more. And this question has to be asked of each output. The model's legality only becomes an issue here if, like Napster, you can't argue that the model has much point other than violating copyright. Napster didn't violate copyright (the code for Napster wasn't infringing on anything), but it enabled the violation of copyright and didn't have much point other than that.
I don't think you can make that argument though. I use ChatGPT most days, and I've never gotten copyrighted material out of it. I could ask it to write me some Disney fan fiction, which would violate a copyright. And I think there is a valid legal question here about who is responsible for preventing me from doing that. This is where I think the gray area is.
If the model produces content that would be a copyright violation in any other context, it doesn't stop being a copyright violation regardless of any of these decisions. I don't think anyone seriously disagrees with that; denying it would be the functional abolition of copyright, and if you're okay with that, then you're not arguing for this stuff, you're arguing for the abolition of copyright.
The argument is that if you trained the model with copyrighted data, and then you or someone else separately used the model to generate novel media which was not legally similar enough to any copyrighted work to make it a copyright violation, that that isn't violating copyright, it's fair use. Basically, using it to make your own original content is legal, and using it to create an unauthorised reproduction of a copyrighted work is illegal. Just like all other software.
I think this should generally be true. The aggregation performed by model training is highly lossy, and the model itself is a derived work at worst and is certainly fair use. It may produce stuff that violates copyright, but it's the way you use or distribute the output of the model that can violate copyright. Making it write code that's a clone of copyrighted code, making it generate pictures with copyrighted imagery in them, or making it reproduce books and so on, and then distributing the output, would be where the copyright violation occurs.
Copyright covers "derivative works." Verbatim is absolutely not a requirement for infringement.
If you take a copyrighted image and modify it, even to the point where it's unrecognizable, if the image is being used in the same way (i.e., isn't a "transformative use"), then it's still a derivative work.
Yes, you are likely to get away with it if you're not caught. But that doesn't mean what you're doing is considered fair use, just that you won't get sued.
Thing is, every piece of text generated by ChatGPT is incrementally using every character of training data. So legally speaking, everything it produces is arguably a derivative work of ALL of the training data.
Generative AI isn't even a legal gray area; under current law, there's no blanket exception for "how much" of a copyrighted work is used. At best there's a fair use _guideline_ that lists, as one of four criteria, the amount and nature of the copyrighted work used. But really it's the entirety of millions of copyrighted works being used to generate the models, and those works _can_ be reproduced verbatim in many cases, proving that the works are encoded into the model.
Generative AI is only permitted because there's big money behind it along with associated lobbyists. And there are many in-flight lawsuits trying to shut down both GPT and various art-generating AIs.
Maybe they'll change the law. Maybe courts will side with the AI companies. But until then, it seems obvious to me that anyone arguing that generative AI based on models built with copyrighted works is completely legal is using motivated reasoning.
That's right: the model is definitely capable of creating things that are clearly a derivative work of what they were trained on. But this still leaves two questions:
* Does the model require a copyright license? Personally I think it's very likely a derivative work, but that doesn't necessarily mean you need a license. The standard way this works in the US is the four factors of fair use (https://copyright.columbia.edu/basics/fair-use.html) where Factor 1 is strongly in favor of the model being unrestricted while 2-4 are somewhat against (and in some cases 4 is strongly against).
* Is all output from the model a derivative work of all of the input? I think this is pretty likely no, but unclear.
* Does the model reliably only emit derivative works of specific inputs when the user is trying to get it to do that? Probably no, which makes using one of these models risky.
They only claimed that training the model was fair use. What about its output? I argue that its output is still affected by the copyright of its inputs, the same way the output of a compiler is affected by the copyright of its inputs.
For sure, that could be an instance of infringement depending on how it is used. But that's a minuscule percentage of the output and still might be fair use (read the decision in Authors Guild, Inc. v. Google, Inc.). But even if that instance is determined to be infringement, it doesn't mean the process of training models on copyrighted work is also infringement.
Isn’t it just fair use? Reading the four factor test for fair use it seems like these generative models should be able to pass the test, if each artwork contributes only a small part to a transformative model that generates novel output. The onus will be on demonstrating that the model does not reproduce works wholesale on demand, which currently they sometimes still do.
Arguably also, the copy is achieved at generation time, not training time, so the copyright violation is not in making the model or distributing it, but in using it to create copies of artworks. The human artist is the same: in their brain is encoded the knowledge to create forbidden works, but it is only the act of creating the work which is illegal, not the ability. The model creators might still be liable for contributory infringement though.
Anyway, I reject the notion that any use of unlicensed copyrighted works in training models is wrong. That to me seems like the homeopathic theory of copyright, it’s just silly. If copyright works that way we might as well put a cross over AGI ever being legal.
My personal opinion is that the source data does not exist inside the model, so the model does not in and of itself constitute a copyright violation.
It is also not a derivative work, as it is not recognizable as any of the works it was trained on.
However, if the output it produces is close enough to existing copyrighted works, then that output cannot be used without a license.
That seems fair, and just like how we would judge a human being.
Training a model isn’t making a copy for your own use; it’s not making a copy at all. It’s converting the original media into a statistical aggregate combined with a lot of other stuff. There’s no copy of the original, even if the model is able to produce a similar product to the original. That’s the specific thing: the aggregation and the lack of direct reproduction in any form is fundamentally not reproducing or copying the material. The fact it can be induced to produce copyrighted material, just as you can induce a Xerox machine to reproduce copyrighted material, doesn’t make the model or its training a violation of copyright. If its sole purpose was reproduction and distribution of the material, or if it carried a copy of the original around and produced it on demand, that would be a different story. But it’s not doing any of that, not even remotely. All this said, it’s a highly dynamic area - it depends on the local law, the media and medium, and the question hasn’t been fully explored. I’m wagering, though, that when it comes down to it, the model isn’t violating copyright for these reasons, but you can certainly violate copyrights using a model.
Courts won't buy this, at least not completely. Copyright infringement liability accrues to the entire value chain - model users and developers alike. So the only way for a model to see copyrighted data during training and be non-infringing is if there's no conceivable way to reproduce training data.
I don't think fair use saves us either, at least not the model authors, because the whole selling point of these models is to replace artists. Yes, artists can use them as tools, but that is far less lucrative. The valuations and hype being thrown around specifically come from, among other things, being able to cut the creative class out of their own business. No court is going to look at that and say "ok, yeah, sure, it's perfectly fine for you to be using other people's work to train on".
An individual using an AI art generator may still wind up getting novel output that isn't obviously infringing to anything in the dataset. If that's the case then they probably haven't infringed copyright. Or at least, it'd be difficult to make a case around it.
> The aggregation performed by model training is highly lossy and the model itself is a derived work at worst and is certainly fair use.
Lossy or not, the training data provides value. If all the various someones had not spent time making all the stuff that ends up as training data, then the model it trains would not exist.
If you are going to use someone else's work in order to make something that you are going to profit off of, I believe that original author should be compensated. And should also be able to decide they don't want their work used in that way.
Note that I'm not talking about what existing copyright law says; I'm talking about how I believe we should be regulating this new facet of the industry.
> Making it write code that’s a clone of copyright code or making it make pictures with copy right imagery in it or making it reproduce books etc etc, then distributing the output, would be where the copyright violation occurs.
How is the end-user supposed to know this? Do we seriously believe that everyone who uses generative AI is going to run the output through some sort of process (assuming one even exists) to ensure it's not a substantial enough copy of some copyrighted work? I certainly don't think this is going to happen.
Regardless, copyright is about distribution. If a model trained on copyrighted material is considered a copy or derived work of the original work, then distributing that model is, in fact, copyright infringement (absent a successful fair use defense). I'm not saying that's the case, or how a court would look at it, but it's something to consider.
But can you seriously deny that everything the model generates is a derivative of the inputs? And if it's a derived work, it might not constitute "fair use" of the input materials. That depends on... well, I'm not a copyright lawyer, so I won't attempt to specify, but I don't think we have any clear answers yet.
This is true - afaik there have been no specific rulings on whether training models on copyrighted material is a violation. But to my mind it harkens back to cases like Xerox, where the tool itself isn’t the violating thing; it’s the use of the tool. Likewise, derivative works are often largely reproductions with minor variations and are protected under fair use. A model that takes enormous amounts of data and distills it into a tiny vector representation, way below the information-theoretic levels needed for any meaningful fidelity, and mixes and overlaps data in a way that the original data isn’t plausibly stored in the model… I’m definitely not going to wager my life that’s fair use, but I would wager my company on it.
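The "below information-theoretic levels" point can be made concrete with a back-of-the-envelope calculation. The figures below are ballpark public numbers chosen for illustration (roughly Stable Diffusion v1-scale), not claims from this thread:

```python
# Rough capacity-per-training-item estimate (illustrative assumptions only):
#   ~860M model parameters stored in fp16, trained on ~2 billion images.
params = 860e6            # model parameters (assumed)
bits_per_param = 16       # fp16 storage
training_images = 2e9     # training images (assumed)

model_bits = params * bits_per_param
bits_per_image = model_bits / training_images

print(f"~{bits_per_image:.1f} bits of model capacity per training image")
# A modest 100 KB JPEG is ~800,000 bits, so under these assumptions the
# model has room for well under a thousandth of a percent of the bits
# needed to store each training image verbatim.
```

This doesn't settle the legal question (memorization of frequently duplicated items can and does happen), but it shows why "the model is a compressed archive of its training set" is hard to sustain as a blanket claim.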
Sure, if you build your own model, train it on copyrighted works, then use it to create art; or if you use someone else's model which properly license its copyrighted sources, and use that to create art. In both cases your output is a new creative work sufficiently different from its parents to not constitute infringement and enjoy its own copyright protection.
However, the model creator/distributor will never be able to claim fair use on the model itself, which is chock-full of unlicensed material and can only exist if trained on such material. It's not really a subtle or particularly difficult legal distinction; in traditional terms it's like an artistic collage (model output) vs. a database of copyrighted works (trained model).
The trained model is not a sufficiently different work that stands on its own; in fact it is just a compressed algorithmic representation of the works used to train it. Legally speaking, it is those works.
I firmly believe that training models qualifies as fair use. I think it falls under research, and is used to push the scientific community forward.
I also firmly believe that commercializing models built on top of copyrighted works (which all works start off as) does not qualify as fair use (or at least shouldn't), and that commercializing models built on copyrighted material is nothing more than license laundering. Companies that commercialize copyrighted work in this manner should be paying for a license to train with the data, or should stick to using the licenses that the content was released under.
I don't think your example is valid either. The reason that AI models are generating content similar to other people's work is because those models were explicitly trained to do that. That is literally what they are and how they work. That is very different than people having similar styles.
There are two sides here, is what goes in fair use (training data), is how the output is used fair use (people using the output).
On the output side, I can for example (1) use a copyrighted image in an educational presentation (2) retype a novel for typing practice.
On the input side, this is where more of the debate is happening. Is "learning" from copyrighted material fair use?
It is within the realm of possibilities for a generative model to produce something that could violate a copyright, depending on how it is used, without being trained on copyrighted material. It's similar to the idea that given enough monkeys and enough time, you could reproduce a Shakespeare play.
The idea that models can't be copyrighted isn't far-fetched. The basic idea is that models are created by an automated process, not by a person.
The courts have already upheld that AI generated output is not copyrightable for this exact reason.
So if you do not buy that it applies to models then you would have to explain the difference between the process which outputs bits into a model's layers (aka training) and the process which takes bits into the input layer and then dumps out the subsequent bits of the output layer (inference /generation).
Then explain why that distinction is different in regards to the applicability of copyright.
But that's also not how copyright works. At least in the United States, the protections offered by copyright center around reproduction, performance, and derivative works[1].
If the AI models are reproducing copyrighted works, then that's a problem. And it does look like there are some examples where that might be happening beyond notions of fair use. But slurping up copyrighted content to train a model seems to fall under allowed use.