The output isn't guaranteed to be free of copyright from its training materials. It just usually is. There have been clear demonstrations of it regurgitating code from the training set verbatim, which would of course still be covered by the original license.
Microsoft isn't going to train Copilot on Windows code for the same reason it didn't train it on private repos: the code is private and they don't want to risk leaking private code.
I imagine there would be no problem training it on e.g. the Unreal Engine code which is not open source but is available to read.
The big practical issue is that there's no warning when Copilot produces code that might violate copyright, so you're taking a bit of a risk if you use it to generate huge chunks of code as-is. I imagine they are working on a solution to that though. It's not difficult conceptually.
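To make "it's not difficult conceptually" concrete: one naive approach is to fingerprint every fixed-length window of the training corpus and warn whenever a suggestion contains a matching window. A minimal sketch of that idea; the corpus file name and the 40-character threshold are made up for illustration, and this is not a claim about how Copilot's actual filtering works:

```python
import hashlib

SHINGLE = 40  # arbitrary threshold: flag 40-char spans that match training text

def fingerprints(text: str, k: int = SHINGLE) -> set[str]:
    """Hash every k-character window so membership checks are cheap."""
    return {hashlib.sha1(text[i:i + k].encode()).hexdigest()
            for i in range(max(0, len(text) - k + 1))}

# Build the index once over the (hypothetical) training corpus.
with open("training_corpus.txt") as f:
    corpus_index = fingerprints(f.read())

def warn_if_verbatim(suggestion: str) -> bool:
    """True if any window of the suggestion appears verbatim in the corpus."""
    return bool(fingerprints(suggestion) & corpus_index)
```

A production version would normalise whitespace and identifiers before hashing, but the principle is the same: surface a warning (or the matching source's licence) whenever the overlap is long enough to matter.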
An interesting thought experiment is how keen Microsoft would be to allow Copilot to be trained on the Office or Windows source code. If the output is truly free of copyright from its training materials, they should have no objection; and if they would object, why?
The question is more: did Microsoft train Copilot on their own code repositories? If there are no copyright concerns, why not train it on the Windows code base? It should be a treasure trove of good code to train with.
Copilot also got its training sets for free, without any real consent from the owners of that code, and it's genuinely ambiguous whether what it's doing violates the many different open source licenses covering its training data.
Microsoft is selling AI services based on training data they don't own and didn't acquire rights to. Nobody writing the licenses of the code it uses had the opportunity to address this kind of use without license, attribution, or consent (and the training data is a huge part of the value of an AI product).
> Microsoft suggests that no matter what the license is for training code, generated code is not infringing
They claim no such thing. They claim that Copilot itself is not infringing any license by being trained on code - that their own use of the code for AI training purposes is fair use and that the Copilot NN is not a derivative work of any code in the training set.
But they explicitly warn you that code that Copilot generates may infringe others' copyright, and that you should use "standard IP scanning tools" to ensure you are not infringing copyright by using the code Copilot spit out.
Personally, I'm not worried about the end user using copyrighted code; that is their responsibility. But if you end up with verbatim GPL code in your commercial closed-source code base, that's a liability, and Copilot might be dangerous to use for exactly that reason.
What I have more of a problem with is Microsoft charging for Copilot, which was trained on copyrighted code without any permission whatsoever and which they really have no right to utilize or charge for.
They would need to remove all code that isn't put in the public domain from their training data. Permissive licenses like MIT require derivatives to propagate the copyright notice, which Copilot does not do.
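For reference, the notice-preservation requirement is right there in the MIT license text itself:

> The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.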
The objection here is that its training was based on code on github without paying any attention to the license of that code. It’s generally considered ok for people to learn from code and then produce new code without the new code being considered a derived work of what they learned from (I’m not sure if there is a specific fair use clause covering this). But it’s not obvious that copilot should be able to ignore the licenses of the code it was trained on, especially given it sometimes outputs code from the training set verbatim. One could imagine a system very similar to copilot which reads in GPL or proprietary code and writes functionally equivalent code while claiming it’s not a derived work of the original and so isn’t subject to its licensing constraints.
Framing this as learning is unhelpful. Copilot isn't just learning. It can reproduce code it saw during training verbatim. It will even reproduce the GPLv3 licence itself. If you copy-paste code from a GPLv3-licensed source without attribution, it's a copyright violation. Why should it be different for Copilot?
EDIT: After more reading on the subject, I'm willing to accept that copyright infringement is unlikely here. This link [1] was the one I found most convincing.
However, I would still shift the goalposts and look at this ethically, and I still think it's wrong that Microsoft is profiting from code with licences like GPLv3. This is a whole other topic, though.
Well no, because only GitHub has access to the training set. But more importantly this misunderstands how Copilot even works -- even if Windows was in the training set, you couldn't get Copilot to reproduce it. It only generates a few lines of code at a time, and even then it's almost certainly entirely novel code.
Now, if you knew the code you wanted Copilot to generate you could certainly type it character by character and you might save yourself a few keystrokes with the TAB key, but it's going to be much MUCH easier to simply copy the whole codebase as files, and now you're right back where you started.
Suing Microsoft for training Copilot on your code would require clearing the same hurdle that the Authors Guild could not: i.e. that it is fair use to scan a massive corpus[0] of texts (or images) in order to search through them.
My real worry is downstream infringement risk, since fair use is non-transitive. Microsoft can legally provide you a code generator AI, but you cannot legally use regurgitated training set output[1]. GitHub Copilot is creating all sorts of opportunities to put your project in legal jeopardy and Microsoft is being kind of irresponsible with how they market it.
[0] Note that we're assuming published work. Doing the exact same thing Microsoft did, but on unpublished work (say, for irony's sake, the NT kernel source code) might actually not be fair use.
[1] This may give rise to some novel inducement claims, but the irony of anyone in the FOSS community relying on MGM v. Grokster to enforce the GPL is palpable.
Microsoft can argue that Copilot emits a mixture of intellectual property (a pattern from here, a pattern from there), so they don't need to give attribution.
But if we disallow training, it's unambiguous.
Either you fed the program into your training system or you didn't. The No-AI 3-Clause License forbids use in training, no question about it. If you train your model on this text, you are obviously violating the license.
Systems like Microsoft Copilot are a new threat to intellectual property. The open source licenses need to change to adapt to the threat.
Otherwise Microsoft and Amazon will pillage open source software, converting all our work into anonymous common property that they can monetize.
Hey, including copyrighted code, especially from other repositories without a correct license, is an honest mistake. You should be able to file a DMCA request with a list of all the repos containing your code, and then they should just retrain copilot with those repos excluded. Clearly that's what is needed for them to stop distributing your code. /s
Sarcasm aside, I think there are several possible legal viewpoints here (IANAL):
1. copilot is distributing copies of code and it's not a safe harbor: Microsoft is directly liable for copyright infringement when copilot produces code without the appropriate license/attribution.
2. copilot is distributing copies of code and it is a safe harbor: Microsoft is not directly liable, but it must comply with DMCA requests. Practically, that would mean retraining with the named code snippets/repositories excluded in a timely manner (see the sketch after this list); otherwise I don't see how they could disentangle the affected IP from the already-trained models.
3. copilot produces novel work by itself, not subject to the copyright of the training data: I think this is really a stretch. IANAL, but producing novel creative work is a right exclusive to living human beings, so machines almost by definition can't do it. (There is the monkey selfie copyright case, where at least the "living" box was ticked.)
4. the user of copilot is producing novel work by prompting copilot: it's like triggering a camera. The copyright of the resulting picture is fully owned by the operator, even though much of the heavy lifting is done by the camera itself. Even then, this very much depends on the subject.
IMO option 3 doesn't have legal standing. Microsoft and users of copilot would very much like option 4 to apply always, but this particular case clearly falls under option 1 or option 2, in which case Microsoft should bear some legal liability, even if they can't always track the correct license ahead of time.
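On option 2's retraining point, the exclusion step itself is mechanically simple even if the retraining run is expensive. A sketch, with made-up names, of how a takedown list could be honored before training (IANAL still applies, and this is purely illustrative):

```python
from pathlib import Path

def load_training_corpus(repo_dirs: list[Path], excluded: set[str]) -> list[str]:
    """Collect source files for training, skipping repos on a takedown list."""
    corpus = []
    for repo in repo_dirs:
        # Exclusion has to happen before training: there is no known way to
        # surgically delete one repo's influence from already-trained weights.
        if repo.name in excluded:
            continue
        corpus.extend(p.read_text(errors="ignore") for p in repo.rglob("*.py"))
    return corpus
```

The hard part is not the filter but the consequence: every valid takedown implies retraining the model from scratch.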
Where's the part of the Copilot EULA that indemnifies users against copyright infringement for the generated code?
If the model was trained entirely using code that Microsoft has copyright over (for example: the MS Windows codebase, the MS Office codebase, etc.) then they could offer legal assurances that they have usage rights to the generated code as derivative works.
Without such assurances, how do you know that the generated code is not subject to copyright and what license the generated code is under?
Are you comfortable risking your company's IP by unknowingly using AGPLv3-licensed code [1] that Copilot "generated" in your company's products?
No. Either such training is fair use, or it isn't. If it is fair use, then it's always allowed even if the license explicitly says it's not. If it isn't fair use, then Microsoft is already violating the licenses anyway, such as the GPL, by not making Copilot's source available under the same license (and ditto for things like DALL-E, and also by violating the attribution clauses even of permissive licenses).
Copilot is trained on public repos. I'd imagine if Microsoft doesn't want you to use their code, that code would be in a private repo. There's nothing stopping me from using code in a public repo, regardless of the license.
You could slap some kind of a clause in your license that forbids the use of your code in conjunction with an AI-powered code tool, and then try to sue people who use Copilot. Microsoft's claim is after all that training Copilot is fair use.
Although if you find your code in a project even without such a clause, you can probably still attempt to sue them for copyright violation, since most likely they are not abiding by your license anyways.
The problem here is that Copilot exploits a loophole that allows it to produce derivative works without a license. Copilot is not sophisticated enough to structure source code in general; it is overtrained. What is an overtrained neural network but memcpy?
The problem isn't even that this technology will eventually replace programmers: the problem is that it produces parts of the training set VERBATIM, sans copyright notice.
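To make the memcpy jab concrete, here is a deliberately degenerate toy: a lookup-table "model" that has perfectly memorized a single training file and therefore regurgitates it verbatim from a short prompt. The lookup table stands in for overfit weights; it illustrates memorization, not Copilot's actual architecture:

```python
def train(text: str, context: int = 8) -> dict[str, str]:
    """Map every context window to the character that followed it.
    Assumes windows are unique, which holds for this tiny example."""
    return {text[i:i + context]: text[i + context]
            for i in range(len(text) - context)}

def generate(model: dict[str, str], prompt: str, length: int, context: int = 8) -> str:
    out = prompt
    for _ in range(length):
        nxt = model.get(out[-context:])
        if nxt is None:
            break  # fell off the end of the memorized text
        out += nxt
    return out

training_set = "def covered_function(x):\n    return x * 42  # imagine this is GPL code\n"
model = train(training_set)
print(generate(model, training_set[:8], 100))  # prints the training text back verbatim
```

A model that generalizes would produce something new here; one that has merely memorized can only copy, licence and all.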
No. I'm pretty optimistic that we'll quickly come to a solution once we start using this to void all Microsoft/GitHub copyright.