[Rumors that start to become lawsuits]
Some of the rumored training sources are:
- LibGen (4M+ books)
- Sci-Hub (80M+ papers)
- All of GitHub
This is the funniest, but in the end saddest, aspect. If ChatGPT was indeed trained on pirated content and can be(come) such a powerful tool, then copyright laws should have been abolished yesterday. If ChatGPT was not trained on all these resources out there, then think how much more powerful a tool it would be if it were; in that case copyright laws are actively stifling advancement and should have been abolished yesterday.
I now routinely introduce this technology as "copyright laundering" and the hype put out by start-up boards and VCs as a ploy to disguise this fact. The "AI threat" is smoke-and-mirrors to dress up what's happening.
I derive a huge amount of value from ChatGPT because I can copy/paste without any IP impact. I could always have done this: from GitHub, from ebooks, from many sources.
Now I can benefit from the labour of many for free -- their copyrights laundered through a thin statistical trick.
As with crypto (and pyramid schemes, etc.), the big "philosophical pitch" becomes a disguise for a brutal material reality.
Midjourney, ChatGPT, etc. are doing automatically what would be illegal by hand.
An argument could also be made that if you pulled all the copyright infringements out of the training set for ChatGPT, it would not be half as intelligent.
One avenue that ChatGPT has, and I'm not sure if it is being utilized at all yet, would be the ability to feed it the unimaginably huge body of information locked behind copyrighted textbooks, books, academic papers and other paywalled information.
Imagine the knowledge that could be accessed by feeding all that information into an AI engine like ChatGPT. Presumably, it would not break copyright rules any more than a regular human reading a bunch of papers behind a paywall and regurgitating the learned information.
There was a rumor about ChatGPT using LibGen for training data; if true, I find it hard to believe Google's legal team would touch that with a long stick.
The irony here is that ChatGPT can operate as a fairly good copywriter. But since it is detectable, there are now human copywriting services like this one to humanize the output.
I suspect ChatGPT is using a form of clean-room design to keep copyrighted material out of the training set of deployed models.
One model is trained on copyrighted works in a jurisdiction where this is allowed and outputs "transformative" summaries of book chapters. This serves as training data for the deployed model.
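As a rough illustration of that two-stage idea, here is a minimal Python sketch. It is purely speculative: `summarize_chapter` is a hypothetical stand-in for whatever model would be trained on copyrighted works in the permissive jurisdiction, and the JSONL corpus format is an assumption; nothing here is confirmed about how OpenAI actually builds its training data.

```python
import json
from pathlib import Path


def summarize_chapter(chapter_text: str) -> str:
    """Hypothetical stand-in for the jurisdiction-A model: takes raw
    (copyrighted) text and returns a 'transformative' summary."""
    raise NotImplementedError("placeholder for the first-stage model")


def build_cleanroom_corpus(book_dir: str, out_path: str) -> None:
    """Write summary-only training data for the deployed model, which
    never sees the original copyrighted chapters."""
    with open(out_path, "w", encoding="utf-8") as out:
        for chapter_file in sorted(Path(book_dir).glob("*.txt")):
            summary = summarize_chapter(chapter_file.read_text(encoding="utf-8"))
            # Only the summary crosses the "clean room" boundary.
            out.write(json.dumps({"text": summary}) + "\n")


# Example usage (paths are illustrative):
# build_cleanroom_corpus("books/", "cleanroom_corpus.jsonl")
```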
Your sentiment is exactly what I intended, albeit I was terse and a little facetious. ChatGPT is like introducing a bunch of new skilled labor, it’s just for the first time this skilled labor isn’t human. The fact that this skilled labor learned from copyrighted material is like saying human labor learned from copyrighted material.
Inevitable outcome. Since ChatGPT launched, nobody has a clue as to what is legal and what is illegal with these chat-based LLMs.
Is the content that LLMs produce enough to rise to the level of copyright infringement? Is the fact that a company trained their LLM on your data, with the knowledge it would be used for outputs (=profit), enough that all of their outputs should be considered, to at least a minuscule degree, influenced by your work? How would ChatGPT's "training" differ from, say, another journalist who reads the NYT, and subconsciously uses that to help provide better services?
None of us can answer these questions definitively. That the courts would end up hearing these sorts of arguments was a foregone conclusion. I think a lot of the large LLMs (certainly OpenAI competitors) are going to breathe a sigh of relief that this is happening sooner rather than later, so they know where the legal lines are to be drawn.
I still don't understand how they can keep a straight face claiming that training on all human-written material (copyrighted or not) that can be found on the Internet is perfectly fine, but training on ChatGPT output is not (or in other words, that human writers cannot have a choice on whether their output is used, but bot owners can).
This is known to maybe half the people in the tech world.
> programmed purely for predictive text
Way fewer people know this.
Most people think of it as some magical AI or whatever. Even with huge disclaimers about it hallucinating, there are so many APIs into ChatGPT...I can only imagine this getting worse in terms of lawsuits.
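For anyone unsure what "purely predictive text" means in practice, here is a minimal sketch using GPT-2 via the Hugging Face `transformers` library. GPT-2 is only a stand-in (ChatGPT's own weights are not public), and this is not how ChatGPT is actually served; it just shows that the core operation is scoring and picking the next token, repeated.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 as a small, public stand-in for a chat LLM.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Copyright law was created to"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # a score for every possible next token

# The whole trick: take (or sample) the most probable next token, then repeat.
next_token_id = logits[0, -1].argmax().item()
print(prompt + tokenizer.decode(next_token_id))
```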
> It's utility seems like it will steamroll any attempts to stop or slow it down.
What? I don't see any utility outside of education and even there it's pretty sketchy.
For business, legal compliance is not a joke and instantly shuts it down. The only businesses willing to use ChatGPT for generating code would be naive young startups who don't realize some assembly is still required and the instructions are missing no matter how much they query the bot. That's called expertise (which they don't yet have). It's not good enough to just write the code. Someone has to comprehend it so they can tweak it as needed. At some point the tweaks will become unwieldy and require actual software engineering that the bot doesn't know how to do (transform from one design pattern to another and know which to use). More power to them if they can cobble something together and then succeed at maintaining it. By the time they're through they'll have pulled off so many miracles that they won't need the bot anymore and become experts. That's quite the trial by fire, but hey everyone has to find their way!
Well it is a machine so of course it doesn't care:
ChatGPT is a bullshitter in the (Frankfurt) sense of having no concern for the truth at all
Detecting true plagiarism with it (or a derived entity offered as a service) would be as useful as the currently proposed watermarking. Turn the technology to some advantage, because the profit seekers and free riders certainly won't be deterred.
While I 100% agree, there is another angle to consider this from, in that ChatGPT replaces reading the NYT. ChatGPT competes with it in the delivery of information.
To add to your point though, a sufficiently advanced AI trained on licensed data could reproduce copyrighted content from a prompt alone. It's the next step, where someone does something with the output, that would cause infringement.