This is an abhorrent response from LAION. I could imagine such from a large corporation, but a non-profit generating 'models for democratization' (their words)? I guess they think that others' data should therefore be included without restriction? My hope is that this is just overzealous 3rd party IP lawyers, not their actual stance on the matter.
LAION just adds metadata to the publicly available content. It does not have the content. Sending them a legal take-down request would use their resources for a false claim, so it makes sense for them to seek damages.
Not sure if it is ethical or not, but it technically makes sense, this is all I'm saying.
What's in question is simply a dataset with links to the images, with associated alt tags. The copyright owner came along and complained that they were violating their copyright. LAION was not.
It's not an abhorrent response; it's what should be an acceptable response to having to deal with bullshit legal threats from ignorant people. If the copyright owner doesn't want their images linked to, they should take them down from the web or perform gymnastics such as restricting their serving to their own web server, which refuses to serve the images unless certain conditions are met (an HTTP referrer match, a required username/password, etc.)
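The referrer check mentioned here is simple to implement. A minimal sketch using Python's standard library (the domain, handler name, and port are hypothetical; real deployments usually do this in the web server config instead):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical portfolio domain; only pages from here may embed the images.
ALLOWED_REFERRER = "https://example-portfolio.test"

def is_allowed(referer: str) -> bool:
    """Serve the image only when the request came from our own pages."""
    return referer.startswith(ALLOWED_REFERRER)

class RefererGate(BaseHTTPRequestHandler):
    def do_GET(self):
        if is_allowed(self.headers.get("Referer", "")):
            self.send_response(200)
            self.send_header("Content-Type", "image/jpeg")
            self.end_headers()
            self.wfile.write(b"...image bytes...")
        else:
            # Hotlinked from elsewhere (or no referrer at all): refuse to serve.
            self.send_response(403)
            self.end_headers()

# HTTPServer(("", 8080), RefererGate).serve_forever()
```

This only makes hotlinking inconvenient, not impossible, since the Referer header is client-supplied.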
> complained that they were violating their copyright. LAION was not.
Actually, that's not clear cut. Copyright isn't a binary thing; a work can carry a license with a bunch of terms that could be anything. Just because it's on the internet doesn't mean you can do what you want with it. The chances of you getting caught and told off are very slim, but that doesn't mean it's right, or indeed legal.
> it's what should be an acceptable response to having to deal with bullshit legal threats from ignorant people.
Again, it's not bullshit. The person created those pictures and has the right to distribute them as they see fit. Now, if you want to allow people to use your pictures in any way, then you can. There is nothing stopping you.
It is almost perfectly clear cut. LAION is linking to the images, not redistributing them or storing them. A link. In a text file. To photos. Served from a web server. Presumably, the artist's web server or portfolio site hosted by a company they do business with.
>The person created those pictures, and has the right to distribute them as they see fit.
Yeah, they do have the right to distribute them as they see fit. So, if they don't want them linked to, they can put them behind a paywall.
It's a bullshit legal threat. The artist has a complete knuckle-dragging understanding of copyright and the web.
Again, it's not clear cut. The license could be something like: "you are able to view these images as part of this website, anything else is prohibited" or "direct linking to these images is not allowed; these images are licensed only for use as part of this website" or something along those lines.
Now, enforcing that is pretty hard. I'm also willing to bet that he didn't make any such license. But to say it's a clear-cut legal slam dunk isn't correct.
You own the copyright to your comment here (congrats!)
That does not prevent me from including a URI to the comment in a list, along with metadata like "comments that misunderstand copyright law".
The photographer posted photos on a website with the express intent of distributing them there. Neither he, nor you, nor I can retroactively decide we don't like the way someone linked to our publicly-posted content.
He literally sued someone for including a URI in a list. That deserves the smackdown he got.
Google lost a lawsuit for including a (defamatory) URI in a list: https://www.bbc.com/news/world-us-canada-65340444
Yes, copyright is different; however, you can lose trials for "promoting copyright infringement", e.g. The Pirate Bay, Gary Bowser. There's a decent shot at arguing the LAION list promotes copyright infringement (after AI gets smacked down a bit more).
You're assuming that using images or text to train AI is copyright infringement. The difference is also that The Pirate Bay was assisting copyright infringement of non-publicly available works. You have to purchase games, movies, software, and Pirate Bay gave you a directory of other people to commit copyright infringement with. They were assisting. Gary Bowser literally resold complete copies of copyrighted works.
For either your Google or TPB references to make any sense, you have to be asserting that the photographer was engaged in illegal (defamatory or copyright-violating) behavior by posting on his own website, and that as the party breaking the law as well as the party being wronged, he has a claim against the list that included his own posting of his own content.
It's very... odd. I can't imagine that anyone serious would argue a set of URIs for legally-published content could be legal until it hits some critical mass (or, worse, until the wrong person downloads it) and then becomes illegal.
It's nonsensical. If you don't want your content indexed and listed, don't publish it. If you want it seen by people but mostly ignored by well-behaved crawlers, use noindex. It is insane to publish world-readable stuff, expect to profit from Google/etc. indexing it, but then claim harm and liability for its inclusion in a list.
This is akin to going to Google, or more accurately Open Street Map since we want to keep the metaphor, and threatening them with legal action for creating a map with the address of your house on it and noting that there's a priceless mural which you originally advertised and made publicly available elsewhere.
Since people can take the map, find your address on it, and take pictures of the priceless mural, which they can then use to train a monkey to incorporate the style of the mural into other murals that the monkey creates, thus violating your copyright. And the murals that the monkey produces are not copyrightable.
But also, yes, in that instance, if it's public and the door's open. Your analogy isn't a good one though. I like mine better, to illustrate the hoops the artist had to jump through.
Yes, actually. If your door is open to the public and your documents are lying on your desk, then I can certainly walk in and take pictures of your documents with my phone.
Of course there are restrictions like if your office is already in a private building, probably not. But if it’s facing a public street and I just walk in, how would I know you didn’t leave it open all the time?
If you had some "no trespassing" or "private property" signs up and the door was open, then your scenario is less likely to work.
Or if your documents are locked up in a file cabinet, then I can’t break into it.
Reminds me of all those open source maintainers, putting their work out there only to be contacted with demands by entitled end users. Except in this case, those end users want you to pay because they looked at your code.
Is it legal though? The article mentioned forbidding the use of crawlers. Besides, it talks a lot about how they only link to the image... but that's not how AI works, right? Maybe after incorporating the photo into its weights the model no longer needs the original, but it doesn't seem like an explicitly legal thing.
What this organization is doing is making a database of links to pictures with metadata that includes what is in the pictures. AIs can use this to train. The organization is not using it to train AI nor do they host the image.
There's a lot of debate going on about whether training AI on such data is copyright infringement or not.
This is of course _immediately_ important as a pressing legal question, but less interesting longer-term. The interesting question, IMHO, is whether we should amend copyright to deal with this! As I read it, the spirit of copyright is to give creators safeguards against their work being misused, so they aren't cagey about sharing it in limited contexts. The public has some limited rights to such published work in return.
The question is: should we allow people to publish their work while also withholding consent for it to be used for AI training? I think, resoundingly, yes. The constructs of copyright just don't make sense in a world where a machine can perfectly recreate human skill, if only it can consume all of its data.
As a painter, say, it won't bother me that someone might use my paintings to learn, over the course of many years, and maybe even surpass my current level of skill - by the time they do I will be either better, or retired. But if a machine can do this over weeks or months and take my job away, sure I won't like it.
The alternative IMHO is just more paywalls, as people realise they don't want to be unpaid AI trainers and labellers.
Good point. The entire concept of copyright was, frankly, invented out of thin air. That’s not a bad thing, to be clear - that’s how pretty much all laws get invented. The point is just that you can’t analyze existing copyright law like it’s a sacred text that has the One True Answer to whether AI training without permission is ethical or not. If we as a society decide that we don’t want to allow it, then we can simply write a law that says that. What current laws say is irrelevant.
At least in the US though, meaningful action on major legislation like this has become virtually impossible, which is why everyone is staking their hopes on the interpretation of existing copyright law.
> As I read it, the spirit of copyright is to give creators safeguards against their work being misused, so they aren't cagey about sharing it, in limited contexts
"misused"? "Someone else making money off it" more like it. Not so they aren't cagey about sharing it, but aren't basically ripped off.
Of course, nowadays people have gone all crazy. I had a discussion with a dude who seriously thought that if we could trace the invention of the wheel or the alphabet back in time, the living descendants of those inventors should own these inventions!
Making money off of it is only one form of misuse that copyright protects against. My source code that I publish under the Apache 2.0 license gives me the right to defend against other forms of misuse than just making money off of it. Some of which are:
* Publishing the code without giving me attribution.
* Shipping software to end users without giving them the source code.
I care much more about those than I do whether someone is making money off of it.
Sure. But again, that's not why copyright was invented. The application of your terms is based on the same "I own ideas" meme that Disney used to increase their ownership of movies and characters pretty much indefinitely for decades.
In my opinion you are both wrong. You agree with Disney on a premise that I reject.
Attribution should imo be covered by some other law. It's a different thing. It's lying about facts.
In theory, each of the things I care about could be covered by different laws or regulations. However, they aren't. The only tool I have for that is copyright. If you find a route to getting those alternative methods established, I'm all ears, but as it stands such a route does not exist. Copyright is what you have, so copyright is what you have to use.
I contribute media to Wikipedia, and honestly, between people using media without giving attribution as the default (which is all they are asked for), and AI/ML companies feeling entitled to rip all data and use it for their profit because they allege copyright simply does not apply to their money printing, I've been thinking more and more about just not publishing anything anymore. These behaviors are simply spitting in people's faces and telling them they should be fine with that.
Same goes for open source honestly. (I burned out as a maintainer a few years ago and for the most part stopped doing code things publicly after that)
> But if a machine can do this over weeks or months and take my job away, sure I won't like it.
You make a good point which is often overlooked in the debate about training data. Machine models operate on a scale widely different from human learners: they need not sleep, they can operate in a widely parallel fashion, and therefore need not curate their own "knowledge" into a niche.
"Scale of action" (for lack of a better phrase) makes a big difference, for example quoting a passage from a book is fine, quoting five hundred passages from a book is not at all the same thing. Taking a few still frames of a Hollywood movie is one thing, screen-capping every single frame is very different.
On the other hand, if we could legally train AI with a large number of actual books, it might be a lot more useful. (And if the AI can cite sources, it might even drive book sales.)
Their lawyers responded to say that the pictures are not stored, so they can't delete them, and said that if you try to make an unjustified copyright claim, we can recoup the cost.
He made some kind of claims in a letter that, as far as I can tell, isn't in the article. I guess he pushed forward with making the claim about copyright and/or demanding they delete his pictures? What did he send?
They responded saying: well, we told you it was unjustified, and now we've incurred costs, so, as previously stated, we're charging them to you.
> while LAION continues to use his pictures.
Are they using his pictures? What they have is a link to the picture, right?
This is actually a good analogy. TPB is just linking to copyrighted materials. It’s the user who is committing the infringement by downloading, using, reproducing, etc the material. In this case they’re both linking to the material and controlling the application that’s downloading, using, reproducing, etc the material. That seems like a material difference to me. (Pun intended)
Even the film industry might have had trouble getting the Pirate Bay guys if the pictures they were linking to were hosted by the film industry themselves.
If we want to go down the rabbit hole where a hyperlink in a text file to a publicly accessible copyrighted photo on an artist's portfolio site/web server is "copyright infringement", or more accurately "assisting in making copyrighted content available", where the artist themselves is the one making those images available publicly, then we're officially in clown world.
It would be more like abetting copyright infringement by making it easier for scrapers to find content to train their models on. LAION isn't posting a single link on their Twitter account for their followers. It's clear what this dataset is being used for.
Artists put content on the internet so you can look at it and share it with your friends and say "wow, cool art", not so you can run it through an art-generating model to put them out of a job. That we never anticipated this possibility means you can fall back on the "fair use" argument, but legal does not mean ethical, especially when the law has yet to catch up to changing times.
That assumes that using copyrighted works to train AI is copyright infringement. And that simply pasting a link to a publicly available file that the artist themselves made available is "abetting." I'm betting that using copyrighted works to train "AI" is fair use, just like a human learning from a photo would be fair use.
You’re right, I can’t claim that it is. I understand that the models do not contain the original work.
But they are not simply sharing a single link and the “intent” is different. I personally feel that we should not treat human learning and AI learning the same. Unlike humans, the AI is essentially immortal and infinitely copiable.
The issue isn’t having a link; the issue is selling a dataset to people with the intent for them to perform copyright infringement.
Selling pipes, cellphones, igniters, and black powder is perfectly legal. Selling pre-assembled pipe bombs filled with black powder and fitted with remote detonators is going to very quickly get you into a great deal of legal trouble.
1. The actual use for training AI isn’t needed for copyright protection to kick in; simply downloading his images to build the training set is already copyright infringement.
2. The artist disabled crawling on their website so it’s accessible but not fully available.
3. Your link says many things that aren’t settled in terms of US law but rather are the opinions of the author. “even if infringement occurs during machine learning, training AI with copyrighted works would likely be excused by the fair use doctrine.” That’s a lot of possibilities, not actual guarantees.
However, it ignores the largest issue, namely whether the outputs of these AIs are all simply derivative works.
2. robots.txt is not legally enforceable. It's a gentleman's agreement, a widely accepted standard to codify into a scraper for it to observe and respect.
3. That's why it's called an opinion. No shit. Maybe instead of talking about promoting terrorism, you should have a level head and talk about fair use.
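On the robots.txt point above: the standard is entirely voluntary. A polite crawler has to actively fetch, parse, and honor it; a rude one simply skips the check. A sketch using Python's standard-library parser (the URLs and user agent are made up):

```python
from urllib import robotparser

# What a site might publish at /robots.txt to ask crawlers to skip a path.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",       # the rules below apply to every crawler
    "Disallow: /photos/",  # politely asks bots not to fetch this path
])

# A well-behaved scraper consults this before every fetch.
rp.can_fetch("MyBot", "https://example.test/photos/mural.jpg")  # False
rp.can_fetch("MyBot", "https://example.test/about.html")        # True
```

Nothing in HTTP enforces the `False` answer; the images stay publicly fetchable either way, which is why the legal weight of robots.txt is disputed in this thread.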
Robots.txt is permission: it allows you to do something that is otherwise illegal. You are not allowed to do this stuff by default.
Something people don’t understand is that “Right click, Save as…” on a copyrighted image breaks copyright, as you don’t have permission to make a copy. They do have implied permission to make incidental copies to view a website, but that’s it.
Are you saying that after all these years of omitting robots.txt from my public personal website of priceless blog posts and code repositories, I can take Google and Microsoft to court for scraping, linking to, and reproducing what I deem as substantial portions? They've violated my rights! They didn't have permission!
In all seriousness, that's total shit. It's publicly accessible. That's the permission. There have been different cases like weev where the judge has viewed it differently because of specific details like carrying out the action of enumeration, or guessing IDs, to reveal non-public information, but that had to do with the CFAA.
If you reproduce content in total in another work, resell, etc., that's another thing.
You brought up robots.txt as if it was illegal to scrape or crawl the artist's website since his site had one that was supposed to "turn off scraping." That's what I was referring to. And that's just flat out laughably wrong.
In the context you're now talking about: by the letter of copyright law, downloading a photo, image, or file of some "copyrighted work" that is publicly accessible but carries a license saying you may not reproduce it may technically be "unlawful", but I doubt any judge is going to see it that way versus the intent of the law, unless you literally reproduce or share a substantial portion of it, sell a complete copy, etc. It's almost certainly fair use to study it, and to use portions of the copyrighted work in your own copyrighted works.
The legality of scraping without explicit permission depends on the context, as has been demonstrated by multiple court cases. Robots.txt short-circuits that by giving permission to people who might not otherwise qualify.
As to making a copy with intent to X, that’s what fair use is. A student may photocopy the full text of a short article so they can accurately quote it in their term paper. They can’t simply photocopy an article because they are a student with access to the article and a photocopier, nor can they photocopy a full book because they want to use a short quote in their paper. Copying incidental to acceptable use becomes retroactively acceptable. This distinction may seem crazy to you, but the fact that intent matters means you can’t judge an action without context.
Fair use in a commercial context is looked at with vastly more suspicion than fair use in an academic context, which again demonstrates that specific actions on their own aren’t always enough information to say whether something is allowed.
>The legality of scraping without explicit permission depends on the context as has been demonstrated by multiple court cases. Robots.txt short-circuits that by giving permission to people who might not otherwise qualify.
It's been determined that it's legal if it's publicly accessible and you don't receive a cease-and-desist letter telling you to stop. If it's public, that's permission. Simply "turning off scraping" by putting a robots.txt there doesn't legally make the content and linked images of a public web page any less public or any more restricted from being scraped.
Just last year, in the LinkedIn case: LinkedIn had a robots.txt, and the judge didn't give a fuck. Nor did he care what their terms of use said. Rather, it was hiQ's continued scraping of LinkedIn data even after LinkedIn's cease-and-desist letter that constituted access of data "without authorization."
>As to making a copy with intent to X, that’s what fair use is.
Yes, and?
>This distinction may seem crazy to you
It's not a crazy distinction, that's what I'd basically said previously, so perhaps we're talking past each other.
Power Ventures' actions violated the CFAA because they bypassed security measures intended to make the content not-exactly-public. The judge dismissed claims of copyright infringement despite them hosting "cached" versions of the scraped profiles using Facebook's trade dress. The damages + discovery sanction had to do with them bypassing security restrictions with their scraping, creating profiles and using bots with those profiles to access more information than was public, Power Ventures' ignoring the explicit cease-and-desist the first time, and their non-compliance with discovery in some context. Read the case.
>I could go on, but I am not sure what exactly you’re trying to argue here.
That public is public. If you leave the door open in the real world, someone CAN enter your home. If you host your image on a public webpage, they can scrape it. robots.txt is not a security measure, nor is it a contract that magically grants the right to scrape where it wasn't given previously; it is a gentleman's agreement that, if you know about it, you can ignore if you want to be a dick about it. Ethically, that's wrong, but it is how it is.
Not to mention, as I came to understand while reading this discussion, LAION wasn't even crawling: they were using a public Common Crawl dump to gather their images. Common Crawl had crawled the author's site previously. They just took that data and extracted image links from it.
1. they weren't selling a dataset
2. the artist didn't "disable" scraping in any meaningful way, legally
3. linking to the image is not illegal, and they're justified to respond with an invoice in Germany to recover legal fees for this dumb copyright complaint
4. it may fall under fair use to download images and train neural nets on them, or it may not. It always depends on the context and the specific case.
No trespassing signs have legal weight under certain conditions, and it's up to the judge. I've been in a court case where simply taking pictures outside of the property and showing that the gap between each no trespassing sign was more than 100ft wide was enough for a judge to throw out the charges. I was dirt biking on the power company's property, you see. It hinges on the defendant not having noticed such restrictions. In both cases, the robots.txt meant jack. Even their security measures usually meant jack. Instead, it was a legal cease-and-desist from a lawyer that constituted "no authorization."
Is sharing someone's private location information on the internet illegal[0] or immoral? Because it is just linking to their home address; it isn't harming or doing anything physically to the person. There are obvious examples demonstrating that this type of slippery-slope argument is equivocation on some level; linking to something can be seen as an invitation for people to copy it and violate copyright.
[0] The answer at least in the US is the act itself is likely not illegal although it can be construed to lead to a crime, such as incitement.
You have a human you want to train to paint. You download an image of a painting and give it to the human to train it on, amongst many. The human, in the future, may paint something that has vaguely or substantially similar qualities to the original painting. Has copyright infringement occurred in this instance? No: the downloading of the image for study in this particular instance is for sure, 100%, fair use protected by copyright law.
Now, let's try something similar. You have a monkey you want to train to paint. You download an image of a painting and give it to the monkey to train it on, amongst many. The monkey, in the future, may paint something that has vaguely or substantially similar qualities to the original painting. Has copyright infringement occurred in this instance?
Let's go further: You have a neural network you want to train to paint. You download an image of a painting and give it to the neural network to train it on, amongst many. The neural network, in the future, may paint something that has vaguely or substantially similar qualities to the original painting. Has copyright infringement occurred when that happens?
I find it hilarious that in this thread, instead of asking the question of whether this is fair use, multiple posters have jumped to comparing this to thievery, promoting terrorism and posting of private information as an analogy to harm.
Well good luck trying to teach your monkey to paint.
Let me propose a simple vocabulary refresher: feeding into a human eye is "showing"; feeding into a machine is "copying". Calling this copying "training" is nice and cute, but training is what you do as a human. As submarines do not swim, LLMs do not train: they copy information and extract similarities inside a neural net.
As for your mention of various "hilarious" strawmen, it demonstrates that you know your logic is on shaky ground.
The neural network behind the LLM is "trained." Yes, it's a term of art, very observant. That doesn't change the fact that there's no substantial reproduction of a work in it. LLMs "learn" to predict text by being fed vectors; the resulting "weights" of that process could be thought of, and implemented, as a big walkable decision tree to predict the next token, if it weren't for combinatorial explosion, but there's no substantial reproduction, or "copy", in there.
Is there a substantial reproduction, in your head, of a song or a text or a painting you can perfectly recall and can't seem to forget? Be careful, the copyright police will get you, the only way to delete it is to delete yourself!
>feeding into a machine is called copying
Copyright law deals in "reproductions" as far as I understand it. Feeding a copyrighted document as vectors into a program that reconfigures some matrix of vectors, and then using those matrices to output probability distributions into something tangible, where that something isn't a complete or substantial reproduction of the copyrighted work, is not a copyright violation.
If you think it is, amend copyright law, or take it to court and let a judge settle the matter, and hope it's in your favor.
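To make "output probability distributions" concrete, here is a toy sketch: a bigram counter, vastly simpler than a real LLM, with a made-up corpus. What it stores after "training" is transition statistics derived from the text, not the text itself laid out as a copy:

```python
from collections import Counter, defaultdict

# Toy "training corpus" (made up).
text = "the cat sat on the mat".split()

# "Training": accumulate counts of which word follows which.
# Only these statistics survive; the corpus is not stored as a document.
model = defaultdict(Counter)
for prev, nxt in zip(text, text[1:]):
    model[prev][nxt] += 1

def next_token_distribution(word):
    """The model's output: a probability distribution over next tokens."""
    counts = model[word]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

next_token_distribution("the")  # {'cat': 0.5, 'mat': 0.5}
```

Whether deriving such statistics from a copyrighted text counts as a "reproduction" is exactly the question the thread is arguing about.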
How is the copyrighted document being "fed as vectors" if not through a process of copying the data that it contains? Indeed, the semantics are surely going to get their day in court.
The fact that you felt the need to qualify reproduction with the word "substantial" probably means that copyright will determine how much "substance" is allowed to be, or not to be, copied.
About your question:
"""Is there a substantial reproduction, in your head, of a song or a text or a painting you can perfectly recall and can't seem to forget? Be careful, the copyright police will get you, the only way to delete it is to delete yourself!"""
This equivalence between "perfect" human recall and a copy of the data input into an AI model is a bit of a strawman, I think: it is the distribution of copied information that copyright protects against. The information has been uploaded into vectors; have fun demonstrating it is not being used in producing this or that model output.
If I learn a popular song and do a cover, there is a copyright law for that, I owe rights.
>it is the distribution of copied information that copyright protects against, the information has been uploaded in vectors
The "vectors" are not "uploaded" or "copied" into a file or neural network; they're transformed. In the context of stable diffusion, they're progressively "noised" or "corrupted", "diffused" with random Gaussian noise, and the model is "trained" and "learns" how to "denoise" various noised stages of images, represented as vectors of pixel data, back into their original form.
Then, when it comes to generating images consistent with an "annotation" or "prompt", it is "conditioned towards" or "biased towards" them with more "training": an input image is noised, and that vector of pixels is concatenated or combined with a vector of the image's annotation. It then "learns" to denoise with that conditioning information, the annotation.
Then, you can take the trained model and do the same thing with just a text prompt as a vector concatenated to a vector of random Gaussian noise, and no input images.
That's basically and very simplistically how it works.
The output is not a substantial reproduction of the input images + annotations it was trained on. It takes the random noise and "tries" to denoise it into something consistent with the prompt, with conditioning to guide it.
Your attempted cover would be a substantially similar reproduction; your goal is to make a reproduction. Whereas the model "learns" to generate images consistent with an annotation/prompt, by conditioning it with that "goal" on top of how it "learned" to denoise the images.
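The "noising" step described above can be sketched in a few lines. This is an illustrative toy, not a real diffusion pipeline: the pixel vector is random made-up data, and the alpha values stand in for a real noise schedule:

```python
import math
import random

random.seed(0)

# A "training image" flattened to 64 pixel values in [0, 1] (made up).
x0 = [random.random() for _ in range(64)]

def noised(x, alpha_bar):
    """Mix the image with Gaussian noise: sqrt(a)*x + sqrt(1-a)*eps.
    alpha_bar near 1 keeps most of the image; near 0 leaves mostly noise."""
    return [math.sqrt(alpha_bar) * p + math.sqrt(1 - alpha_bar) * random.gauss(0, 1)
            for p in x]

x_early = noised(x0, 0.99)  # mostly image, a little noise
x_late = noised(x0, 0.01)   # almost pure Gaussian noise

# The model is trained to undo this step at every noise level; at
# generation time it starts from pure noise plus a prompt vector,
# so no stored training image is replayed.
```

The debated point is whether learning to invert this corruption process amounts to "copying" the training images into the weights.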
* In the Pirate Bay case, the copyright owners hadn't posted the copyrighted content anywhere. Someone who was not the copyright owner posted an unauthorized copy, and the Pirate Bay were linking to that unauthorized posting.
* In this case, the copyright owner has posted the content; and LAION is linking to the copyright owner's authorized posting.
If TPB had linked to the original source videos as hosted by Disney, I don't think they'd have had such a hard time arguing it was OK.
Your comparison would be more reasonable if LAION had an origin link and then a link to a copy of the file explicitly for the purposes of copyright infringement.
Let’s be realistic here. They’re not training the model from a link to the image. Whether or not they’ve downloaded and stored the image seems irrelevant to the copyright claim. They’ve clearly used and reproduced his content without a license. Does this count as fair use? I don’t know, but to claim that they’re just storing a URL is bogus.
If the image is available publicly, without restriction, then what license is needed?
If the artist doesn’t want people and algorithms to see the image, then put a password in front of it, or add a Disallow rule to robots.txt (or a noindex tag).
This is like driving around with a sign on your car and complaining that people are looking at it and imagining the image in ways you don’t like. If you don’t want people to see the image, don’t put it on a sign and drive around.
That's not how copyright works though. Whether you post an image publicly or not, it's covered by the standard copyright protection in most countries. It's not "without restriction" unless the author explicitly grants you such license.
There's a bit of nuance in this case, but in general, yes, you need a licence.
That is how copyright works. If you put something in public, people are allowed to look at it. And learn from it. They just can’t distribute copies of it.
People are allowed to learn from it, but as far as the law exists now, AIs are not explicitly allowed to learn from it, are they? As I understand it, a license has to be explicitly granted.
Laws don’t grant rights. They take them away. You generally are allowed to do something if it doesn’t violate a law. “Copyright” is not the right to copy, but rather a protection against other people copying and distributing your work.
There is no law that says you cannot train AI models on copyrighted data.
You don't train a model with data, as the verb "train" is used to describe something that humans and animals do (as per dictionary definition of the word).
The other thing that you can do with data is copy or destroy it.
Therefore, there is a law that says you cannot "train" (really, copy) data that is under copyright.
Train is an accurate word. Copy is not. Training is literally adjusting the weights of a network so that data flowing through it can look more like the original. But it doesn’t preserve the original.
Also, It is legal to make copies of copyrighted materials so long as you don’t distribute them.
Train is not accurate because this word applies to learning performed by humans or animals. An AI is neither. Taking data and putting it into any form of storage, i.e., weights in a model, is called copying. Producing an output out of the gathered data is then a step in distribution. Copyright laws ensure authors get protection against unwanted copying and distribution.
Also you do not have to preserve the original for copyright to apply: any extract is only allowed under well defined fair use rules. A GAN's purpose is definitely not parody, commentary, ...
> Train is not accurate because this word applies to learning performed by humans or animals. An AI is neither.
That is not a definition that people use, nor one that the legal system uses.
> Taking data and putting it into any form of storage, ie weights in a model, is called copying. Producing an output out of the gathered data is then a step in distribution.
It isn’t though.
If I train on Batman and I produce Spider-Man, the Batman copyright holder cannot sue me. It would be dumb to suggest otherwise.
Very interesting to see that every single one of my posts on this have been downvoted, clearly the stakes are high.
This will see its day in court, and clearly indeed the meaning of the verb "train" will feature prominently in the debate.
I tend to side with reliance on dictionaries for the meaning of words, and in my understanding the courts also do.
If you load up only Batman pictures in your model, you will clearly only produce Batman pictures. No one is being dumb here who does not want to be, I am sorry.
The courts have their own dictionaries though, so it depends on which court in which case you're talking about. Here's a list of definitions for how the (US) courts define the word software.
> Very interesting to see that every single one of my posts on this have been downvoted, clearly the stakes are high.
Perhaps. Or, more likely I think, your argument is weak and doesn’t contribute meaningfully to the discussion.
Personally, I downvoted you due to my decades-old rule of “downvote messages that complain about downvoting.” I don’t always downvote whiny comments, but I always downvote whiny comments about downvotes. (And I will be downvoting my own message, as I am both pedantic and reliable.)
> If you load up only Batman pictures in your model you will clearly only produce Batman pictures. No one is being dumb here who does not want to be, I am sorry
That’s not what I’m claiming though. I’m saying I take a model that has been trained on Batman and other things and I produce Spider-Man. If your point of view is to be believed, the Batman copyright holder can sue me for producing a work that contains Spider-Man.
I can’t make copies or distribute or sell publicly available copyrighted works. But I can definitely look at them and study them and make notes and sketch them, etc.
That’s definitely how copyright works.
I don’t think training on these has been tested, but I expect it will be allowed, otherwise how can a web browser receive, cache, and render a copyrighted photo? I think the closest precedent is some lawsuits about Google caching websites that contain copyrighted material. I assume they worked out in Google’s favor, as Google (and others) cache web sites even though they contain copyrighted material. The exception I remember is that there are some specific rulings and laws governing how news is treated.
That's not the point I was disagreeing with. What you're actually allowed to do in terms of training is likely going to be soon refined by courts. I was just raising the point that posting those images online or not, with protection or without, doesn't change what you can legally do with them.
"If the image is available publicly, without restriction" — is irrelevant.
> If the artist doesn’t want people and algorithms to see the image
Why quickly skip past the most unjustifiable part? Because you're about to use a human metaphor (like "see") for an algorithm. My algorithm is "copy." So instead of driving, it's like selling a book with a sign on the book that says "copyright 2017, all rights reserved" and saying that people can't copy it. If you don't want people to copy it, don't write and distribute the book.
I actually believe this, but I don't know how somebody believes in one algo and not the other and remains consistent.
The algorithms used to train models like SD (that use the LAION dataset) do not actually make a copy, though, in any reasonable sense of the word. [1]
LAION doesn't do the SD training themselves though. They just provide links with annotations.
[1] Exception: the ephemeral copy that is made when actually downloading the image. This is permitted by EU law. AFAIK, after the image has been analyzed for certain properties like "red" or "traffic light" or "bus" or "vertical line", etc., it is then discarded.
In the USA there was a circuit split on this in 2020 at least [1]. In the European union this form of copying is explicitly excepted in general under Directive 2001/29/EC Article 5(1) see [2]. Specifically in Germany (where LAION is established) it is permitted under Section 44a of the UrhG [3] . One could also argue based on Section 44b of same.
I am sorry but I have a very different reading of the mentioned EC Directive:
"""Article 5
Exceptions and limitations
1. Temporary acts of reproduction referred to in Article 2, which are transient or incidental [and] an integral and essential part of a technological process and whose sole purpose is to enable:
(a) a transmission in a network between third parties by an intermediary, or
(b) a lawful use of a work or other subject-matter to be made, and which have no independent economic significance, shall be exempted from the reproduction right provided for in Article 2.
"""
Upload to a GAN is not a transmission, and the economic significance is pretty clear.
Then the German 44a is an exact copy of the EC directive, so this also applies I think.
I haven't heard the phrasing "Upload to a GAN" before. It also doesn't seem to be defined in the sources at hand. Can you provide a source and/or definition?
As the word "train" refers to applying the uniquely human capability of acquiring information, the word "upload" seems the most appropriate fallback term to use when referring to putting information into some AI system.
You're welcome!
Reference: open any dictionary and look up the word "train". Choo choo!
Note that "training" is a pretty common established term-of-art used in AI and ML for particular kinds of operations. Re-using the term "upload" on the spot would be misleading.
"Upload", in my mind, is typically a network operation (and in the opposite direction at that), which is not what is happening here. I would prefer to stick to terminology used in the sources and literature, if possible (as opposed to inventing new nomenclature on the spot).
A very brief summary/introduction of the operations involved can be found in section I. A. of [1]. (a source I referred to previously)
I think the argument is that it’s not the kind of copy that is copyright infringement.
In the sense that yes, the bits are copied from one register to another. But they aren’t stored permanently, nor are they distributed. So there’s no copyright violation.
Similarly my brain is making copies when I look at something. And routers on the internet make many copies sending the material to me. Etc etc.
If training on an image is copyright infringement for making copies, then so is viewing the image through glasses (image is transformed by the lens) or dragging the image from one monitor to another or caching the image in your browser and viewing it later from the cache rather than retrieving it again.
I don't believe it's LAION that's copying the image. Just the URLs and alt text from links. They don't actually build the AI models. It's really no different from a search engine.
>They’ve clearly used and reproduced his content without a license.
Where have they reproduced his content?
You don't need a license for your fair use of something. Can you walk me through how downloading a photo to your personal computer and studying it, perhaps to become a better artist, is copyright infringement? Now let's apply that analogy to training a model: is the process of using a particular image to train a model, ultimately reconfiguring a bunch of weights in a matrix so it "learns" whatever particulars it learns, copyright infringement? Is training a painting monkey on a copyrighted painting, knowing it may paint something similar to that painting, copyright infringement?
The unfortunate answer is that the courts have yet to decide what exactly "training an AI" means in the context of copyright.
I personally do believe that training using copyrighted data is _not_ a violation of any existing copyright, but using the trained AI to reproduce and distribute something that an expert in the field would regard as a copy is a violation.
You're applying 20th century law and logic to 21st century technology.
FWIW I do believe training a model is fundamentally different from training a human.
The obvious analog is facial recognition, deemed privacy violating by laws and courts in multiple jurisdictions, vs a human doing the same thing manually.
>The obvious analog is facial recognition, deemed privacy violating by laws and courts in multiple jurisdictions, vs a human doing the same thing manually.
It was deemed "privacy violating" under specific privacy laws within those jurisdictions.
I.e., go pass a law amending copyright law if you think so. There's no substantial reproduction here, in the context of Stable Diffusion, until there is, and then yeah, in that instance it's probably a copyright violation, like a substantial reproduction made by a human would be. But training itself, probably not. But yes, there have been no rulings on it in this context.
I'm aware privacy law applied to facial recognition, not copyright law. Nowhere did I imply that it was copyright law. I brought it up to illustrate that "activity + human" vs "activity + computer" can differ in legality for the same activity.
> go pass a law amending copyright law if you think so
I don't need to do anything. Let's wait until one of these cases comes to the courts. You're talking as though it's a foregone conclusion that they'll rule the way you think. I wouldn't be so sure myself.
It boggles my mind when people seriously claim that if there’s a word “learning” in ML then it justifies OpenAI et al. ignoring basic copyright.
Training(2) software is to training(1) a human what a firewall is to an actual wall made of fire, or what killing a process is to killing a human. The word is a homonym—more precisely, industry slang that happens to resemble a word used outside of the industry due to superficial similarity.
So if you wonder how training(2) software is different from training(1) a person, first understand that it’s a different word that never signified the same concept.
Subsequently, if you want to argue that training(1) should apply to software now, the onus would be on you to explain why and how so. You should be prepared to argue that software is sufficiently like a human—you know, that it understands ideas, has agency and free will, thinks like a human and has human-like consciousness, is capable not only of performing instructions given to it by its operator and act as a tool but to actually consider its actions and make own moral judgements, things like that.
And if you come up with satisfactory evidence in favor of all that, and have grounds to believe some software is enough like a human that training(1) applies to it, then why are you fighting to allow its operator to ignore copyright—and not for more important things, such as to free this human-like being from abuse by its operator and grant it basic human rights? If we imagine that software is like a person enough that “training” it means the same thing as training a human, we should be prepared to acknowledge all the implications that come with that.
The reality is that there are companies who would like us to both believe their software is human-like (so that we don’t sue them for rampant copyright abuse) but also not at all human (so that we don’t demand them to stop profiting from what would be slave labor). Naturally, if they pick one or the other they stand to lose from “a lot of money” to “entire business model”—but we should help them make that choice.
That's great and all, "training", "learning", etc. are terms of art.
How does that change the fact that there is no substantial reproduction of the works in the resulting weights, or in the "output" of these models with the weights?
Can you point to me where it's illegal to take copyrighted works and "train" neural networks on them, as long as there's no substantial reproduction of the works in the output of that process, or in the output of a particular configuration of "trained" and "frozen weights"?
Eh, an AI/ML model is an AI/ML model. "Training" one is its own thing.
However you define it, training typically doesn't involve making permanent copies of the data.
This means that under EU law, as far as I can tell, it is probably legal. Under US law the different circuits have slightly different interpretations of the law, but they would probably agree that this is fair use.
<Training> by any definition you care to give it does not rise to the level of copyright violation (in the case of training e.g. Stable Diffusion, on average only a couple of bits worth of "notes" are stored per image; if that's a copyright violation then pretty much anything would be). The main issue, believe it or not, is actually the part where images get temporarily downloaded.
As stated: explicitly legal in the EU. May need some work in the US.
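The "couple of bits per image" claim is easy to sanity-check with rough public figures. The parameter count, precision, and dataset size below are approximate assumptions for illustration, not exact model specs:

```python
# Back-of-the-envelope check of the "couple of bits per image" claim.
# All figures are rough approximations, not exact specs:
model_params = 860_000_000        # ~860M parameters (roughly SD v1's UNet)
bits_per_param = 16               # fp16 weights
training_images = 2_300_000_000   # ~2.3B image-text pairs (LAION-scale)

# Upper bound: even if every weight bit encoded training data,
# this is all the capacity available per image.
bits_per_image = model_params * bits_per_param / training_images
print(f"~{bits_per_image:.1f} bits of model capacity per training image")
```

Under these assumptions the model simply lacks the capacity to store the images it was trained on, which is the quantitative core of the "notes, not copies" argument.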
> <Training> by any definition you care to give it does not rise to the level of copyright violation (in the case of training e.g. Stable Diffusion, on average only a couple of bits worth of "notes" are stored per image; if that's a copyright violation then pretty much anything would be).
You are talking about two different words, again. Human learning is not a copyright violation. Machine learning is. Machine is not human.
> The main issue -believe it or not- is actually the part where images get temporarily downloaded.
No, the main issue is using a tool to sell derivative works automatically generated from copyrighted original works. If images/text is not stored it wouldn’t change a thing.
If it is not about the transient copies, then what other step in the machine learning process could possibly rise to the level of a copyright violation? What is your reasoning? Do you have a source?
If images/text being stored or not is not part of the argument, then what IS the argument that AI works are derivative? Is it only certain works? All works? Is it automatic? Does it require human intent? How do you get from A to B here?
A work created this way is derivative. If it came through a human mind it could be called plagiarism or tribute, but through an automatic tool run for profit it’s a clear violation if the work was not licensed to allow derivative works.
And of course there is human intent, what are you even talking about? This is law. Law is sort of centered around human actions and intent plays a big role. In this case, operators fully intended to scrape copyrighted works, feed it to this tool and operate it for profit (because money smells good).
You did not answer my question. Define "derivative". Then explain why an arbitrary generated image would be a derivative according to your definition. This is not an adversarial question in and of itself. I'm just asking you to define your terms first.
“Our client fundamentally understands that your client may not like the temporary reproduction of his works,” LAION's lawyers wrote to Kneschke’s. “However, this has been expressly permitted by the European legislator.”
This is the crux. It’s by design. Europe decided to legislate broadly in favor of scraping because it was seen as a way to liberate data from big tech's monopoly.
Which is exactly what happened with Stable Diffusion: say what you will about it, it’s distributing the ability to create value from AI among a larger number of beneficiaries than OpenAI.
They've no more "reproduced" his copyrighted content, than a student studying photography or other art did, when they saw the picture online and made a mental note to imitate aspects of what they thought made the content interesting. That's the cold, hard reality of the situation. At least, that's how I would argue it as the defense lawyer.
People don't quite seem to get yet what 'artificial intelligence' intrinsically means.
He sent an initial demand that LAION declined. He did not want to give up, so he had his lawyer send a letter, then LAION paid a lawyer to answer, and the invoice is for the cost of that lawyer.
What I find interesting is that the photographer admits in the comments that he sells thousands of AI-generated images, probably made with an AI trained on LAION's dataset. He only doesn't want links (with alt text) to his own images in the dataset.
If the pictures have been integrated into the training data, then they have most definitely not been deleted: they are an integral part of the model IMHO. The only way to really delete them is to retrain the model, without the pictures.
LAION hasn't trained any models, the dataset they produce is a list of links to the original images essentially with some metadata and things like whether the image is probably nsfw.
Both lawyers and the photographer are not looking great, but the legal questions seem much harder to answer than it might appear at first. Keep in mind that German copyright law is quite different from its British or American equivalent. Also, IANAL.
If I understand it correctly:
0. Laion uses WAT-files from Common Crawl to generate a list of suitable image-text pairs, including a URL to the raw image. [0]
1. Laion downloads the raw images for further processing.
2. Based on step 1, Laion creates a curated list of image-text pairs. The images themselves are uncurated.
3. Using [1], anyone can download the images. Tags such as `noai` are respected.
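The steps above reduce to a simple shape: extract (image URL, alt text) pairs and keep only entries with both fields. A heavily simplified sketch with hypothetical field names (the real tooling adds language detection, CLIP-similarity filtering, NSFW scoring, etc.):

```python
# Sketch of steps 0-2: filter crawl records down to captioned image links.
# Field names ("url", "alt") are hypothetical, for illustration only.

def extract_pairs(records):
    """records: iterable of dicts like {"url": ..., "alt": ...}."""
    for rec in records:
        url = rec.get("url")
        alt = (rec.get("alt") or "").strip()
        if url and alt:                  # drop images without a caption
            yield {"url": url, "caption": alt}

sample = [
    {"url": "https://example.com/cat.jpg", "alt": "a cat on a sofa"},
    {"url": "https://example.com/logo.png", "alt": ""},  # dropped: no caption
]
pairs = list(extract_pairs(sample))
```

The legally relevant point is visible in the sketch: the dataset that leaves the pipeline holds URLs and text, not image bytes.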
-- Question 1: Under German law, is Laion allowed to do what they are doing?
Maybe, but far from clear.
The lawyers [2] point to § 44b UrhG [3], which is rooted in [7], and § 60d UrhG [4]. These norms state that you can use publicly available data for data mining, especially if you are doing it as part of scientific research. Copyright holders may prohibit such activities but must do so in machine-readable form on the website.
First, I am not positive that Laion's activities fall under the law's definition of "data mining". The law says data mining is an activity done in order to gain insight into "patterns, trends and correlations". But Laion does not even attempt to gain any insight from the data, it just wants to provide a data set for model building [0]. I don't think § 44b UrhG is applicable at all.
Second, Laion cannot simply yell "it's for research". To make use of § 60d UrhG, you (a) must qualify as a research organisation, and (b) there must not be any private organisation in the background directing the research. Whether or not Laion qualifies as such would have to be decided by the courts, but I think it's questionable.
Overall I think the lawyers' position as outlined in [2] is not very strong. Perhaps they find better arguments when a court is involved.
-- Question 2: Under German law, does the photographer have any claim against Laion?
Difficult.
The photographer asked Laion to remove his work from their dataset [5]. The lawyers replied that Laion does not store any photographs, so there is nothing to remove. It seems that the photographer did not understand this correctly when writing the initial cease-and-desist letter.
Does Laion infringe on the photographer's copyright if they merely distribute links? The photographer argues that Laion admittedly does download the raw images itself, which pertains to Question 1.
Aside from this, in German law there is a theory called "Störerhaftung" [6]. It means that if what you are doing enables copyright infringement you may be liable as well, even if you do not distribute the content yourself. If you get notified that you participate in any such infringement by a copyright holder you must cease and desist immediately.
In some cases, distributors of hyperlinks to copyrighted material have been found to commit copyright infringement, though it has been rather the exception. I have no idea how a court would decide in the current case.
One may argue that simply adding the equivalent of a `robots.txt` will suffice to end the infringement, but that is too simple. The photographer sells his work to a customer who may use it on their website, but does not want anyone else to use his work.
However, the photographer cannot control whether or not the customer uses a `noai` tag on the website. [3] is opt-out, but for this to work, the photographer must have the opportunity to opt out. It seems the law did not sufficiently appreciate this scenario. It also seems it did not foresee instances where content mining enables AI models that replicate (parts of) the content they mined.
One plausible scenario is that the court denies the photographer's claims against Laion because he can contractually mandate that clients place a `noai` tag on their website. In any case, I think the photographer would have to demonstrate how specifically Laion cost him money, as there are no punitive damages in German law.
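For context on what such an opt-out looks like in practice: there is no single standardized machine-readable reservation of rights yet, but the conventions some crawlers and dataset tools honor (support varies, and none of this is legally settled) look roughly like:

```text
# robots.txt — keep well-behaved crawlers out of the image directory
User-agent: *
Disallow: /photos/

# HTTP response header some scraping tools treat as an AI opt-out
X-Robots-Tag: noai, noimageai

# page-level equivalent as an HTML meta tag
<meta name="robots" content="noai, noimageai">
```

The directive names (`noai`, `noimageai`) are de facto conventions rather than part of any standard, which is exactly the gap the commenter above identifies.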
Pictures are stored in a different representation.
Otherwise I could download the image, re-encode it using WebP, put it inside a zip file with other random files, and then there would be no single file with their copyrighted material.
Because LLMs keep their data in files. Unless we don't keep the data in files but in a blob in a SQL database! Yay, this is fun.
This gets into hairy territory of derivative works. I think this will be an uphill battle for content creators until laws are updated for the new state of the art.
Prior to the unfettered access to generative AI, derivative works took time and effort comparable to the original. Now, derivatives can be created en masse and in seconds.
Where's the line? What's fair use now? These questions will likely be answered in a court or by legislative bodies.
In the meantime, protect your images, paywall high res versions, watermark thumbnails, etc. Also, if you're going up against a company? Lawyer up.
It kind of seems like an intimidation tactic. Not too many artists will be willing to claim copyright infringement if they can get hit with a suit in response. And in the meantime, they can happily continue training their model on whatever they want.
On the other hand, assuming that the following are true:
* The activity "has been expressly permitted by the European legislator"
* The person in question continued to threaten / press legal action
Doesn't it make sense to be able to ask for damages? IOW, shouldn't we be deterring people from making baseless legal threats?
Obviously if either of the above two are false, then the legal firm in question are in violation of professional ethics and need to be hauled up before the German equivalent of the Bar Association.
This is like people filing bogus DMCA takedowns on YouTube videos. The photographer doesn’t have a valid claim, yet continues to file claims knowing that he doesn’t.
What other remedy does the site have?
People can’t train models on whatever they want. They can train models on whatever is legally available. If I set up a photography site and post images for the entire world to see, then the entire world can see them. Including model trainers.
That's a terrible comparison, unless you're asserting that the photographer's own website was infringing his copyright, so LAION has the same vicarious liability that TPB (arguably) did.
In which case, I guess he should sue himself. That's the nonsense world this silly claim leads to.
The open web depends on being able to freely link to content. Don't let an emotional response to AI turn you against hyperlinks.
There's something to be said for not pissing people off.
Certain other search engines would still accept the filetype:torrent search key for the longest time after, with no problems whatsoever. (Currently doesn't seem to work anymore, mostly because I think people switched to magnet links).
Text of the law referenced:
(4) Where the notification is unjustified or ineffective, the person notified may demand reimbursement of the necessary expenses incurred in respect of defending their rights, unless the person giving notification was not able to recognise at the point in time when notification was made that the notification was unjustified.
So it seems the notified party doesn't have to demand any money, but if they do so then only necessary costs can be demanded. I find it hard to believe that $1000 (or whatever) was a necessary cost. Otherwise, you could create companies that operate on the legal fringe that pretend to "serve humanity" when the real MO of your org is to profit from charging legal fees to regular people confused about the law. This is a new type of patent troll.
Addendum - to clarify my point, I am arguing that someone is almost definitely profiting handsomely from charging $1000 to send an email saying "nope, we still aren't infringing" and that "necessary costs" (as required by law) probably doesn't include such leeway for egregious profit taking. Otherwise, you create a huge moral hazard that silences dissent under threat of arbitrarily inflated legal fees. Not a German resident and not a lawyer though.
$1000 is an hour (maybe two) of an attorney's time to draft the response. It was incurred after the photographer received a “free” letter saying “this is legal; if you pursue it, we are legally entitled to recoup fees from you, so please don’t.”
I don't doubt they found someone willing to charge that rate. Very convenient for them.
But you won't convince me that such a routine letter required 2 hours of work, at what is probably the top 0.1% wage rate in the nation, as a "necessary" cost.
A company that routinely crawls millions of websites can't send a form letter for less than $1000 a pop? Don't buy it. Scam.
Eh, I sometimes mess with bureaucracy and legal stuff (not a lawyer) and I can quickly lose a morning in figuring out what's actually going on, what the rules are, who to actually contact, etc etc. Not to mention just the initial phone call might cost an hour and it all adds up really quickly.
Yes the final letter can be written in 5 minutes by ChatGPT these days. It's the homework that comes before then that takes the time.
At a normal lawyer rate of Eur 300/hour * one morning (or afternoon) of 4 hours of work, that sounds about right?
I, thankfully, don’t interact with counsel frequently. But the few times I do, I’ve never encountered a letter, even one as perfunctory as described in the article, that took less than a few hours. You’ve got an hour conference call and then the drafting of the letter and transmission.
This enters the “lawyer time” world where it seems crazy to you and me, but they will describe all the rules and customs that show exactly how this time is permissible. And they’ll bill you for the time that they describe it to you.
It’s not that they can’t send the letter for less than $1000. It’s that they don’t have to, and have no motivation or incentive to do so. And why would they? What’s their benefit for cutting costs? And the risk is they mess something up. Therefore, they spend an hour or two and bill the time.
This sucks, but what are you going to do? You could petition the EU to set a statutory level for damages and fix the remedy at that amount. Good luck with that.
I actually think that $1000 is them being nice as if it was RIAA they’d probably bring in 20 lawyers to review the letter and bill $20k.
As I understand it, current EU law (both the AI team and the photographer are in Germany, so this seems relevant) explicitly allows data mining (for example, to train AI) if that data is available online and no machine-readable reservation of rights has been made, but please correct me if I am wrong in this belief. The German legal podcast "Rechtsbelehrung" also has an episode [2] on copyright law and AI that touches on this topic and might be interesting for any German planning on using AI.
> ...In the case of content that has been made publicly available online, it should only be considered appropriate to reserve those rights by the use of machine-readable means, including metadata and terms and conditions of a website or a service... [1]
I don’t like these forms of half-truth articles. The headline, and really article leaves out the important step of “photographer filed bogus copyright claims” that’s a very important step in the middle.
This article assumes the audience is already on the author’s side so seems to just stir up existing emotions and beliefs by telling an inaccurate story.
He filed a claim he legitimately considered valid, and persisted when told (by the target of his claim, not a court) that his claim was invalid. Then, as retribution for persisting, the target of his claim invoiced him.
That's retribution aimed at creating a chilling effect and they deserve every bit of criticism they are getting here and more.
If they want to use this as a test case to redefine copyright law to mean something favorable to their shareholders, fine. Good luck to them.
But attempting to intimidate your adversary with fines is just totally uncool. (And, frankly, a stupid strategy.)
LAION itself does not store the images. Therefore they are claiming they had no part in any AI use/misuse. They also committed no copyright infringement in what they did (which was to download an image, verify that it was in fact an image, and add it to their list).
However, since LAION had to retain a lawyer to deal with this (incorrect) request, they now want to be paid for that lawyer's time - which seems reasonable.
Other parties used the LAION list to train AI models - and those parties are who the photographer should have made claims against in the first place.
Potentially it can be argued that LAION is aiding in copyright infringement, but there is a very strong argument that no copyright infringement is occurring at all. (With one of the many interesting twists being that AI output is considered PD in the US)
My lawyer would say this is a "legally interesting situation": by which he means that the situation is legally and ethically very unclear, and lawyers can earn a lot of money sorting it out.
That's not what the court ruled. The particular case had no user input, so it failed the creativity test. Even writing a simple prompt, a creative process, is enough to make the result copyrightable, in the same manner that using AI in Photoshop to tweak things, or using AI filters to adjust audio or even write text, results in copyrightable content.
The requirement is that a human drive any tool with some creative input, which does not have to be extensive at all. I can write a few words and it's copyrighted automatically. How can doing that to create a prompt which drives a tool not be copyrightable? They're both creative content.
> some creative input, which does not have to be extensive at all.
The US copyright office's statement suggests otherwise:
> when an AI technology receives solely a prompt from a human and produces complex written, visual, or musical works in response, the “traditional elements of authorship” are determined and executed by the technology—not the human user. Based on the Office’s understanding of the generative AI technologies currently available, users do not exercise ultimate creative control over how such systems interpret prompts and generate material. Instead, these prompts function more like instructions to a commissioned artist—they identify what the prompter wishes to have depicted, but the machine determines how those instructions are implemented in its output.
Obviously one could draw a parallel to someone painting with a spray can - where each individual droplet of paint lands is mostly a matter of physics, so there is no creative control there. The human simply directs the can to give the general direction of where the paint should go.
A line must be drawn between these two cases, but it is unclear where it is.
That does not say they do not grant copyrights to such things, and they explicitly list mixtures of things they did grant copyrights to. The only place they did not grant copyright was the example with no human inputs. In the case I claimed, with human creative input to direct a tool, to quote:
"In the case of works containing AI-generated material, the Office will consider whether the AI contributions are the result of “mechanical reproduction” or instead of an author’s “own original mental conception, to which [the author] gave visible form.” The answer will depend on the circumstances, particularly how the AI tool operates and how it was used to create the final work. This is necessarily a case-by-case inquiry."
As I put it, prompt engineering, which is often iterative and directed by a human pursuing a goal, will easily clear that bar. The blanket "all AI generated content is not copyrightable" claim the OP made simply isn't true.
My reading was that the AI generated elements themselves are considered not copyrightable at this time. The additions might be considered copyrightable by themselves.
Did you have eg. a more recent court ruling that overrides this? (you might have been thinking of the monkey selfie case, which predates this). Or do you think I'm misreading?
The link is not a court ruling, nor does it even claim AI generated elements are not copyrightable. It is a guide - not binding, not court tested - with one goal being to get more public input on how to balance needs. It does say in several places that AI generated content can end up copyrighted in simple ways - make small changes, curate, select, or arrange.
From that text: "In the case of works containing AI-generated material, the Office will consider whether the AI contributions are the result of “mechanical reproduction” or instead of an author’s “own original mental conception, to which [the author] gave visible form.” The answer will depend on the circumstances, particularly how the AI tool operates and how it was used to create the final work. This is necessarily a case-by-case inquiry."
Prompt engineering that is iterative, directed, and the output of a human exercising creative control over the work is most likely going to be copyrightable. Having a tool create lots of items and having a human curate them usually produces a copyrightable result (a "compilation copyright").
The OP's simple (and unfortunately incorrect) pop assertion that AI works are not copyrightable misses the nuance of reality. And it takes very little editing of a work wholly made by an AI to make it 100% copyrightable, so discounting AI for use in making copyrighted work misses that nuance too.
If I threaten legal action against you and you hire a lawyer to respond to me, and I then take no further action, am I liable for your lawyer’s fees for the response? I believe the answer is no but I don’t know for sure.
Totally agree - the cost of frivolous copyright claims cannot be 0 if we want such a system to work.
> Other parties used the LAION list to train AI models - and those parties are who the photographer should have made claims against in the first place.
"Other parties" have not published a recognizable copy of his work either, only a set of weights that were computed based on his work. How was his copyright infringed by those?
Say I make an original work and hold all the copyright, I do not license it as "derivative commercial works allowed". LAION makes a dataset and says to everyone "you can use it to train your ML" without distinction and puts my work in it. Some guys like ClosedAI are 100% gonna use it to sell generative works for profit. LAION doesn't stop them.
LAION basically took my work and tells everyone it's OK to use it, as if it were magically public domain now.
On the whole I like the idea that someone sending bogus legal demands has to bear some costs, as doubtful as it is that they'd actually pursue him for costs.
Some countries (I know Canada is one, Germany, where the artist works, is possibly another) have the concept of moral rights enshrined into law. Boiled down, this means that anyone who wants to use an artist’s work in some context must get permission and/or be prepared to compensate the artist (ianal but I did study copyright law for artists in Canada)
The canonical example Canadians refer to is the Eaton Centre adorning Michael Snow’s sculpture Flight Stop with ribbons over Christmas. Snow sued the Eaton Centre for compromising the integrity of his art and won.
If a similar mechanism exists in German law (my quick research suggests it does) then the artist may well have an argument to make about AI compromising the integrity of his work.
This isn’t as clear to me as it is to others for two reasons:
1. The alt-text/caption from the original source is copyrighted and they outright copy it.
2. Facilitating copyright infringement is often illegal, and it’s clear that LAION put together this resource to facilitate automatic copying of the works for unauthorized use — and then encouraged other people to do exactly that.
NNs are basically working as a kind of laundromat for digital artifacts. They replace a single copy operation with multiple transform operations all leading to the same intended result. I wonder how soon we will see NNs which can produce a bit-perfect copy of the training dataset on demand. And after that, where will we draw the line? Imagine a NN which produces a bit-perfect copy of an image and then changes 1 pixel to make it different. Is that original art? No? Well, what if we change 2 pixels then? Still not original? What about 3 pixels?..
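The "how many pixels" question has a sharp technical edge: bit-exactness is trivial to test mechanically, but "original art" is not. A minimal sketch (plain Python, raw RGB bytes standing in for a hypothetical image, no image library assumed) showing that flipping a single byte destroys a cryptographic hash match while leaving the two "images" 99.99% identical:

```python
import hashlib

# A fake 100x100 RGB "image" as raw bytes (30,000 bytes of gradient data).
original = bytes((x * 7) % 256 for x in range(100 * 100 * 3))

# "Change 1 pixel": flip a single byte.
altered = bytearray(original)
altered[0] ^= 0xFF
altered = bytes(altered)

# Bit-exact comparison fails immediately...
print(hashlib.sha256(original).hexdigest() == hashlib.sha256(altered).hexdigest())  # False

# ...yet the two differ in only 1 of 30,000 bytes.
diff = sum(a != b for a, b in zip(original, altered))
print(diff, len(original))  # 1 30000
```

Any line the law draws would have to live somewhere between those two tests - exact equality, which a 1-pixel change defeats, and perceptual similarity, which has no crisp threshold.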
The amount of popcorn to be consumed watching all these fancy new legal battles will be astronomical :)
I'm not sure about bit perfect, but very similar images have already happened at times; it is considered a bug, and newer models do their utmost to avoid it.
These days -under normal circumstances- it is highly unlikely to occur.
Either way, the actual model(s) themselves don't store images. (A single SD model based off of (among other things) LAION fits on a single USB stick. AFAICT it takes up several orders of magnitude less space than the LAION URL list, let alone the images themselves.)
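The back-of-the-envelope arithmetic behind that claim (rough public figures, treat both numbers as approximations: an SD v1 checkpoint is on the order of 4 GB, and LAION-5B links to roughly 5.85 billion images):

```python
model_bytes = 4 * 1024**3      # ~4 GB Stable Diffusion checkpoint (approximate)
images = 5_850_000_000         # ~5.85 billion images linked in LAION-5B (approximate)

# Average storage "budget" per training image inside the model:
print(model_bytes / images)    # well under 1 byte per image
```

Under 1 byte per image leaves no room to store the images themselves, which is the usual argument that memorization is the exception rather than the mechanism.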
Yes, I read the PR statements about recent NNs. But what if some new company did it intentionally? So for example, starting from today:
- Currently NNs are in a grey legal area all over the world, because nobody knows whether so-called "training" is copying or fair use.
- Let's imagine that the NN corpos win and legalize all NN training as fair use, disregarding any copyrights.
- Now this hypothetical company appears which has a NN producing copies of its original training data on demand. When confronted, they simply wave the question away, saying that all "training" of NNs is legal, so they aren't breaking any law. And when perfect copies start to annoy too many people, they fine-tune their model to produce almost-perfect copies instead. Because there is no definitive line in art separating a copy from an original.
Maybe I was not clear enough, I'll try to rephrase. I'm not fearmongering against NNs (at least in this particular thread); here I just assume they exist and there is not much we can do about them. What I am discussing is that, because of the complete absence of a legal definition of the "training" process, a lot of different situations can arise, some of them even dangerous (legally) to the NN corpos themselves.
I wonder if he had a license that explicitly forbids AI training, and/or an appropriate robots.txt file, would that have strengthened his case, or not really?
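For the robots.txt half of that question: robots.txt is a voluntary convention with no settled legal force, so it is unclear whether it would strengthen a case, but it does at least record intent. `CCBot` (Common Crawl, the corpus LAION's links are drawn from) and `GPTBot` (OpenAI) are the real user-agent names those crawlers publish; whether they are honored is up to the crawler:

```
# Block Common Crawl, the corpus LAION's image links are drawn from
User-agent: CCBot
Disallow: /

# Block OpenAI's crawler
User-agent: GPTBot
Disallow: /
```

Note this only affects future crawls; it does nothing about images already in an existing dataset snapshot.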
In a subtle way, this case exposes that copyright law itself is unrighteous and fundamentally unethical. Most people have bought into the thinking behind copyright laws (as children of their time), which makes it hard to see how it could be wrong (illustrating the Stockholm -and related- syndromes). But trying to make the case that using copyrighted images as data for a neural network is wrong, because it "feels wrong" to those whose sense of fairness is calibrated against the concepts behind copyright law, highlights that the case cannot logically be made. "But they made use of the 'copyrighted' images, they are profiting off of them" - which copyright law is supposed to protect against: only the copyright holder should derive financial benefit from the material. It now appears that this is not just hard to enforce, but ultimately doesn't make sense; it just enormously complicates the world (giving rise to a whole class of well-paid lawyers).