Hey, including copyrighted code, especially from other repositories without a correct license, is an honest mistake. You should be able to file a DMCA request with a list of all the repos containing your code, and then they should just retrain copilot with those repos excluded. Clearly that's what is needed for them to stop distributing your code. /s
Sarcasm aside, I think there are several possible legal viewpoints here (IANAL):
1. Copilot is distributing copies of code and it's not a safe harbor: Microsoft is directly liable for copyright infringement when Copilot produces code without the appropriate license/attribution.
2. Copilot is distributing copies of code and it is a safe harbor: Microsoft is not directly liable, but it must comply with DMCA requests. Practically, that would mean retraining in a timely manner with the named code snippets/repositories excluded; otherwise I don't see how they could disentangle the affected IP from the already-trained models.
3. Copilot produces novel work by itself, not subject to the copyright of the training data: I think this is really a stretch. IANAL, but I believe producing novel creative work is a right exclusive to living human beings, so machines almost by definition can't produce it. (There is the monkey selfie copyright case, but at least the "living" box was ticked there.)
4. The user of Copilot is producing novel work by prompting Copilot: it's like triggering a camera. The copyright of the resulting picture is fully owned by the operator, even though much of the heavy lifting is done by the camera itself. Even then, this very much depends on the subject.
IMO option 3 has no legal standing. Microsoft and users of Copilot would very much like option 4 to apply always, but this particular case clearly falls under option 1 or option 2, in which case Microsoft should bear some legal liability, even if they can't always track the correct license ahead of time.
Sure, and I agree that this can be an honest mistake. But distributing the code through Copilot might still be copyright infringement. If Copilot indeed infringes copyright in some cases, then there must be a way to make Copilot comply and stop distributing the copyrighted code in question.
Normally, if someone steals your code and hosts it on GitHub in violation of the original license, you can send a DMCA notice to GitHub to take the code down. A similar process could apply to Copilot too.
If Copilot were a DMCA safe harbor, then they would need to comply with DMCA notices and stop distributing the offending code in a timely manner. That might be implemented with elaborate filters (which are probably not 100% accurate), or by batching up DMCA notices and regularly retraining the model with the offending repos/code excluded.
Of course, Microsoft drags its feet and says that Copilot never infringes copyright. They don't want to do any of this.
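Not that the filtering half would be rocket science. As a purely illustrative sketch (nothing here is a real Copilot API; every name is made up), a post-generation check against a blocklist of hashed snippets from DMCA notices could look roughly like this in Python:

    import hashlib

    def normalize(code: str) -> str:
        # crude whitespace normalization, so trivial reformatting
        # doesn't evade the check
        return " ".join(code.split())

    def snippet_hashes(code: str, window: int = 50) -> set:
        # hash every `window`-token run of the text
        tokens = normalize(code).split()
        return {
            hashlib.sha256(" ".join(tokens[i:i + window]).encode()).hexdigest()
            for i in range(max(1, len(tokens) - window + 1))
        }

    # built offline from snippets named in DMCA notices,
    # hashed with the same windowing as above
    dmca_blocklist = set()

    def is_blocked(completion: str) -> bool:
        return not snippet_hashes(completion).isdisjoint(dmca_blocklist)

A blocklist like this would only buy time between a takedown and the next retraining run; it removes nothing from the model itself.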
You could slap some kind of a clause in your license that forbids the use of your code in conjunction with an AI-powered code tool, and then try to sue people who use Copilot. Microsoft's claim, after all, is that training Copilot is fair use.
Although if you find your code in a project even without such a clause, you can probably still sue them for copyright violation, since most likely they are not abiding by your license anyway.
> Microsoft suggests that no matter what the license is for training code, generated code is not infringing
They claim no such thing. They claim that Copilot itself is not infringing any license by being trained on code - that their own use of the code for AI training purposes is fair use and that the Copilot NN is not a derivative work of any code in the training set.
But they explicitly warn you that code Copilot generates may infringe others' copyright, and that you should use "standard IP scanning tools" to ensure you are not infringing copyright by using the code Copilot spits out.
NOTE: there was a large discussion of this yesterday [1], but that was almost entirely about copyright. This submission is to a different link at Microsoft that makes it clear they are covering much more than copyright. It seemed then that it might be useful to have a separate submission to discuss the non-copyright aspects of this.
They say:
> Specifically, the Copilot Copyright Commitment will:
> • Cover third-party IP claims based on copyright, patent, trademark, trade secrets, or right of publicity, but not claims based on trademark use in trade or commerce, defamation, false light, or other causes of action that are not related to IP rights.
> • Cover the customer’s use and distribution of the output content generated by our Copilot services, but not the customer’s input data, modifications of the output content, or uses of output that the customer knows or should know will infringe the rights of others.
> • Require the customer to use the content filters and other safety systems built into the product and the customer must not attempt to generate infringing materials, including not providing input to a Copilot service that the customer does not have appropriate rights to use.
I'm somewhat at a loss to understand how they can do this. With copyright, filtering to keep the output from matching too much of any particular training input too closely goes a long way toward reducing the chances of infringement. It might also be possible to train an AI to only include in its output things that are found in multiple training items from different sources, which would also greatly reduce the chances of emitting something infringing. The key is that copyright infringement is a textual matter: does the output text too closely match too much text from some copyrighted work that the AI had access to when it was trained? Also, they only actually have to reduce copying enough to push whatever slips through into fair use territory (in those jurisdictions where fair use is a thing).
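To make the "textual matter" point concrete, here is a toy Python sketch of both ideas; the n-gram size and all names are invented for illustration:

    from collections import defaultdict

    N = 30  # a verbatim run of this many tokens is "too close" to one source

    def ngrams(tokens, n=N):
        return (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    # built offline: which training files contain each n-gram
    index = defaultdict(set)

    def add_training_file(path, tokens):
        for gram in ngrams(tokens):
            index[gram].add(path)

    def risky_sources(output_tokens):
        # flag runs unique to a single work; a run that appears in many
        # independent sources is the multiple-sources case above
        risky = set()
        for gram in ngrams(output_tokens):
            sources = index.get(gram, set())
            if len(sources) == 1:
                risky |= sources
        return risky

A suggestion that trips risky_sources() could be suppressed or trimmed until whatever remains plausibly stays in fair use territory.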
With patents it is not a text problem. If I upload code to GitHub that implements a patented algorithm, Microsoft trains Copilot on that, and Copilot outputs code that implements that algorithm, then using that code will be patent infringement even if the output code has nothing whatsoever copied from my code. I don’t see how they will be able to filter that out or train to reduce its likelihood. And with patents there is no fair use exception.
Humans do violate copyright if they use copyrighted passages directly in their work and pass them off as their own without any attribution, which is what Copilot has been shown to sometimes do, though not always: Copilot will sometimes offer chunks of code that can be found verbatim in open source code bases, and pass them along to users without attribution. I agree it is fine to learn from copyrighted work and produce new, different work, whether by a human or a machine learning algorithm, but it isn't fine to pass along exact copies as your own without attribution. Microsoft will likely need to add checks that stop Copilot from offering verbatim copies of code going forward, to avoid copyright violations here.
copilot is great, and ignorance is bliss, isn't it
The situation this lawsuit is trying to save you from is this: (1) Copilot blurts out some code X that you use, and then redistribute in some form (monetized or not); (2) it turns out company C owns the copyright on something Y that Copilot was trained on; and (3) C makes a strong case that X is part of Y, and that your use of X does not fall under "fair use", i.e. you infringed the licensing terms that C set for Y.
You are now in legal trouble, and Copilot put you there, because it never warned you that X is part of Y, and that Y comes with such-and-such licensing terms.
Whether we like Copilot or not, we should be grateful that this case seeks to clarify some things that are currently legally untested. Microsoft's assertions may muddy the waters, but assertions don't make law.
Considering the opinions here [1] and the fact that Microsoft’s lawyers even signed off on something as seemingly risky as Copilot, it seems very likely that courts will not find Copilot to infringe on copyright.
I encourage you to read the linked article and respond to the authors instead of making me argue their case for them!
Suing Microsoft for training Copilot on your code would require clearing the same hurdle that the Authors Guild could not: i.e. that it is fair use to scan a massive corpus[0] of texts (or images) in order to search through them.
My real worry is downstream infringement risk, since fair use is non-transitive. Microsoft can legally provide you a code generator AI, but you cannot legally use regurgitated training set output[1]. GitHub Copilot is creating all sorts of opportunities to put your project in legal jeopardy and Microsoft is being kind of irresponsible with how they market it.
[0] Note that we're assuming published work. Doing the exact same thing Microsoft did, but on unpublished work (say, for irony's sake, the NT kernel source code) might actually not be fair use.
[1] This may give rise to some novel inducement claims, but the irony of anyone in the FOSS community relying on MGM v. Grokster to enforce the GPL is palpable.
You could try to contribute a license-finder for Copilot that would detect potential copyright violations and emit valid attribution / reproduce the licenses of the copyrighted works that Copilot's output is derived from.
Oh, and you'll want to detect incompatible software licenses and prevent derivative works from being created with such conflicts.
In so doing, you would be helping Microsoft to follow the law and respect my copyright. The lawsuit would probably evaporate overnight.
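For flavor, here is a toy Python sketch of the output stage of such a license-finder, assuming some upstream matcher has already mapped a suggestion back to its source files; the compatibility table is invented for the example:

    # grossly simplified: license pairs known to conflict
    INCOMPATIBLE = {frozenset({"GPL-2.0-only", "Apache-2.0"})}

    def attribution_header(matches):
        # matches: {source_file_path: spdx_license_id}
        licenses = set(matches.values())
        for pair in INCOMPATIBLE:
            if pair <= licenses:
                raise ValueError(f"license conflict in suggestion: {sorted(pair)}")
        lines = ["# This suggestion derives from:"]
        for path, spdx in sorted(matches.items()):
            lines.append(f"#   {path} (SPDX: {spdx})")
        return "\n".join(lines)

Emitting the header is the easy part, of course; the matcher feeding it is where all the real work would be.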
There is also the option that MS claims: Copilot may be perfectly fine as it is, BUT you may not be allowed to distribute (or even use internally) code it generates for you.
This is what the current terms of service of Copilot say: you as the user are responsible for ensuring you are not breaking anyone's copyright when you accept an auto-completion from Copilot. How you would know that it just spit out someone else's code is unspecified, of course.
Of course, similar tools that make it easy to infringe others' copyright are less accepted when they don't come from corporate behemoths (cough Popcorn Time cough), but such is the world we live in: copyright protection for me but not for thee.
The output isn't guaranteed to be free of copyright from its training materials. It just usually is. There have been clear demonstrations of it regurgitating code from the training set verbatim, which would of course still be covered by the original license.
Microsoft isn't going to train Copilot on Windows code for the same reason it didn't train it on private repos: the code is private and they don't want to risk leaking private code.
I imagine there would be no problem training it on e.g. the Unreal Engine code which is not open source but is available to read.
The big practical issue is that there's no warning when Copilot produces code that might violate copyright, so you're taking a bit of a risk if you use it to generate huge chunks of code as-is. I imagine they are working on a solution to that though. It's not difficult conceptually.
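As a purely hypothetical illustration of the "not difficult conceptually" part: even a naive client-side check could diff a large accepted completion against a local mirror of suspect sources and surface a warning. In Python (the path and threshold are placeholders):

    import difflib
    from pathlib import Path

    def warn_if_close(completion: str, corpus_dir="training_mirror/",
                      threshold=0.8):
        # brute force and slow -- a real system would use an index --
        # but it shows the shape of the check
        hits = []
        for path in Path(corpus_dir).rglob("*.py"):
            text = path.read_text(errors="ignore")
            if difflib.SequenceMatcher(None, completion, text).ratio() >= threshold:
                hits.append(str(path))
        return hits  # non-empty => show the user a copyright warning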
Copilot was trained on copyrighted code without respecting the licenses, and it can generate pieces of that copyrighted code verbatim, likewise without respecting the original license.
I pay for Copilot and this is very much the truth, but let's see how the court rules.
Where's the part of the Copilot EULA that indemnifies users against copyright infringement for the generated code?
If the model was trained entirely using code that Microsoft has copyright over (for example: the MS Windows codebase, the MS Office codebase, etc.) then they could offer legal assurances that they have usage rights to the generated code as derivative works.
Without such assurances, how do you know that the generated code is not subject to copyright and what license the generated code is under?
Are you comfortable risking your company's IP by unknowingly using AGPLv3-licensed code [1] that Copilot "generated" in your company's products?
You're mistaking the end-user's copyright infringement with Copilot's alleged infringement.
Copilot is fair use and transformative -- that is, unless there were an open source Copilot that Copilot was training on; only then would it be competing, and it would be easy for GitHub or OpenAI to exclude the repos of Copilot alternatives from the training set.
Framing this as learning is unhelpful. Copilot isn't just learning. It can reproduce code it saw during training verbatim. It will even reproduce the GPLv3 licence itself. If you copy-paste code from a GPLv3-licensed source without attribution, it's a copyright violation. Why should it be different for Copilot?
EDIT: After more reading on the subject, I'm willing to accept that copyright infringement is unlikely here. This link [1] was the one I found most convincing.
However, I would still shift the goalposts and look at this ethically, and I still think it's wrong that Microsoft is profiting from code with licences like GPLv3. This is a whole other topic, though.
I'm not a lawyer so it's entirely possible I used the wrong term. Thank you for clarifying below.
Using the terms as you explained them below, I meant that Microsoft/GitHub has permission to reproduce the code so why wouldn't that extend to copilot?
Microsoft can argue that Copilot emits a mixture of intellectual property (a pattern from here, a pattern from there), so they don't need to give attribution.
But if we disallow training, it's unambiguous.
Either you fed the program into your training system or you didn't. The No-AI 3-Clause License forbids use in training, no question about it. If you train your model on this text, you are obviously violating the license.
Systems like Microsoft Copilot are a new threat to intellectual property. The open source licenses need to change to adapt to the threat.
Otherwise Microsoft and Amazon will pillage open source software, converting all our work into anonymous common property that they can monetize.
I think some of the negativity about Copilot comes from the perception that if an individual or small startup trained an ML model on public source code and commercialised a service from it, they would be drowning in legal issues from big companies unhappy about their code being used in such a product.
In addition, just because code is publicly available on GitHub does not necessarily mean it is permissively licensed for use elsewhere, even with attribution. Copyright holders unhappy with their copyrighted works being publicly accessible can use the DMCA to issue takedowns, which GitHub does comply with; how that interacts with Copilot and its training data is a different question.
As much as the DMCA is bad law, it is rather funny to see Microsoft charged in this lawsuit under the lesser-known provision against 'removal of copyright management information'. Microsoft has more resources to mount a defence, so it will probably end up differently than it would for a smaller player facing the same action.
So as a code author I am pretty upset about Copilot specifically, and it seems like SD is similar (I hadn't heard before that DeviantArt did the same thing GitHub did). But I agree with this take: the tech is here, it's going to be used, and it's not going to be shut down by a lawsuit. Nor should it, frankly.
What I object to is not the AI itself, or even that my code has been used to train it. It's the "copyright for me but not for thee" way that it's been deployed. Does GitHub/Microsoft's assertion that training sidesteps licensing apply to GitHub/Microsoft's own code? Do they want to allow (a hypothetical) FSFPilot to be trained on their proprietary source? Have they actually trained Copilot on their own source? If not, why not?
I published my source subject to a license, and the force of that license is provided by my copyright. I'm happy to find other ways of doing things, but it has to be equitable. I'm not simply ceding my authorship to the latest commercial content grab.