
Suing Microsoft for training Copilot on your code would require jumping over the same hurdle that the Authors Guild could not: i.e. that it is fair use to scan a massive corpus[0] of texts (or images) in order to search through them.

My real worry is downstream infringement risk, since fair use is non-transitive. Microsoft can legally provide you a code generator AI, but you cannot legally use regurgitated training set output[1]. GitHub Copilot is creating all sorts of opportunities to put your project in legal jeopardy and Microsoft is being kind of irresponsible with how they market it.

[0] Note that we're assuming published work. Doing the exact same thing Microsoft did, but on unpublished work (say, for irony's sake, the NT kernel source code) might actually not be fair use.

[1] This may give rise to some novel inducement claims, but the irony of anyone in the FOSS community relying on MGM v. Grokster to enforce the GPL is palpable.




You could slap some kind of a clause in your license that forbids the use of your code in conjunction with an AI-powered code tool, and then try to sue people who use Copilot. Microsoft's claim is after all that training Copilot is fair use.

Although if you find your code in a project even without such a clause, you can probably still attempt to sue them for copyright violation, since most likely they are not abiding by your license anyways.


> Microsoft suggests that no matter what the license is for training code, generated code is not infringing

They claim no such thing. They claim that Copilot itself is not infringing any license by being trained on code - that their own use of the code for AI training purposes is fair use and that the Copilot NN is not a derivative work of any code in the training set.

But they explicitly warn you that code that Copilot generates may infringe others' copyright, and that you should use "standard IP scanning tools" to ensure you are not infringing copyright by using the code Copilot spit out.


Hey, including copyrighted code, especially from other repositories without a correct license, is an honest mistake. You should be able to file a DMCA request with a list of all the repos containing your code, and then they should just retrain copilot with those repos excluded. Clearly that's what is needed for them to stop distributing your code. /s

Sarcasm aside I think there are several possible legal viewpoints here (IANAL):

1. copilot is distributing copies of code and it's not a safe harbor: Microsoft is directly liable for copyright infringement when copilot produces code without the appropriate license/attribution.

2. copilot is distributing copies of code and it's a safe harbor: Microsoft is not directly liable, but it should comply with DMCA requests. Practically, that would mean retraining with the mentioned code snippets/repositories excluded in a timely manner; otherwise I don't see how they could disentangle the affected IP from the already-trained models.

3. copilot produces novel work by itself, not subject to the copyright of the training data: I think this is really a stretch. IANAL, but I think producing novel creative work is a right exclusive to living human beings, so machines almost by definition can't do it. (There is the monkey selfie copyright case, but at least the "living" requirement was met there.)

4. the user of copilot is producing novel work by prompting copilot: it's like triggering a camera. The copyright of the resulting picture is fully owned by the operator, even though much of the heavy lifting is done by the camera itself. Even then, this very much depends on the subject.

IMO option 3 doesn't have legal standing. Microsoft and users of copilot would very much like option 4 to always apply, but this particular case clearly falls under option 1 or 2, in which case Microsoft should bear some legal liability, even if they can't always track the correct license ahead of time.


The output isn't guaranteed to be free of its training materials' copyright; it just usually is. There have been clear demonstrations of it regurgitating code from the training set verbatim, which would of course still be covered by the original license.

Microsoft isn't going to train Copilot on Windows code for the same reason it didn't train it on private repos: the code is private and they don't want to risk leaking private code.

I imagine there would be no problem training it on e.g. the Unreal Engine code, which is not open source but is available to read.

The big practical issue is that there's no warning when Copilot produces code that might violate copyright, so you're taking a bit of a risk if you use it to generate huge chunks of code as-is. I imagine they are working on a solution to that though. It's not difficult conceptually.


Well I am "a fan" of Copilot and I do think AI is the future, but I think the author has a valid point.

I think the fair use violation he describes doesn't happen during training. I do think training an AI on anything that is publicly accessible is fair use, just as it would be for a person learning by reading/watching the same materials.

However, that fair use rationale is violated the moment the resulting AI starts suggesting verbatim copies of code from licensed works without attribution.

So one could argue the source code is not being used in a transformative way, and that copilot is just a more efficient method of retrieving licensed code. This misses the fact that copilot is actually capable of writing new code. I've used it as "an autocomplete on steroids", letting it suggest maybe half a line or one line of code at a time (or trivial stuff we automate even without copilot, like getters/setters in Java). But when actual licensed code is suggested, then yes, that is IMO a license violation.

Therefore one way of resolving this would be to pair copilot with a tool that scanned the resulting code for the presence of licensed code and then made a list of "credits" or references, as sketched below. There should also be measures taken (perhaps during training) to penalise generation of verbatim (or extremely similar) code. Would this make copilot less of a useful tool? I'm not sure.
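
To make that concrete, here is a minimal sketch of what such a scanning tool might look like, assuming you have a corpus of license-tagged source files to check generated code against. Everything here (the shingle width, the threshold, the function names) is illustrative, and real scanners use smarter fingerprinting (e.g. winnowing):

    from collections import defaultdict

    SHINGLE = 8  # shingle width in tokens; smaller = more sensitive

    def shingles(src):
        toks = src.split()  # a real tool would use a language-aware lexer
        for i in range(len(toks) - SHINGLE + 1):
            yield " ".join(toks[i:i + SHINGLE])

    def build_index(corpus):
        # corpus maps file path -> (license name, source text)
        index = defaultdict(set)
        for path, (_license, src) in corpus.items():
            for sh in shingles(src):
                index[sh].add(path)
        return index

    def credits_for(snippet, index, corpus, threshold=3):
        # corpus files sharing at least `threshold` shingles with the snippet
        hits = defaultdict(int)
        for sh in shingles(snippet):
            for path in index.get(sh, ()):
                hits[path] += 1
        return [(path, corpus[path][0])
                for path, n in hits.items() if n >= threshold]

Anything credits_for returns for a suggestion becomes an attribution entry, or grounds for suppressing the suggestion entirely.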

One thing that's not going to happen is putting tools like copilot back "in the bottle". We now have similar models anyone can download (FauxPilot), and I, like many others, have found those tools to speed up mundane tasks a lot. This translates into a monetary advantage for users. Therefore there is no way this will disappear, lawsuit or no lawsuit.


>Is Copilot's training on public repositories infringing copyright? Is it fair use?

My money's on yes, but this isn't settled until SCOTUS says so.

>How likely is the output of Copilot to generate actionable claims of violations on GPL-licensed works?

This depends on how likely Copilot is to regurgitate its training input instead of generating new code. If it only does so IF you specifically ask it to (e.g. by adding Quake source comments to deliberately get Quake output), then the likelihood of innocent users - i.e. people trying to write new programs and not just launder source code - infringing copyright is also low. However, if Copilot tends to spit out substantially similar output for unrelated inputs, then this goes up by a lot. This will require an actual investigation into the statistical properties of Copilot output, something you won't really be able to do without unrestricted access to both the Copilot model and its training corpus.
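
To sketch what that investigation might look like, assuming the access described above (a way to sample the model, plus the raw training text), you could estimate how often completions contain long verbatim runs from the corpus. Here sample_fn is a stand-in for model access, and the 60-character run length is an arbitrary choice:

    def has_verbatim_run(output, corpus_text, run=60):
        # True if some `run`-char window of the output appears verbatim
        # in the corpus (probed at overlapping offsets)
        step = max(1, run // 2)
        for i in range(0, len(output) - run + 1, step):
            if output[i:i + run] in corpus_text:
                return True
        return False

    def regurgitation_rate(sample_fn, prompts, corpus_text, run=60):
        # fraction of sampled completions containing a verbatim corpus run
        samples = [sample_fn(p) for p in prompts]
        hits = sum(has_verbatim_run(s, corpus_text, run) for s in samples)
        return hits / max(1, len(samples))

Comparing that rate between neutral prompts and prompts that deliberately bait the model (the Quake-comment trick) is exactly the unrelated-vs-targeted distinction made above.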

>How can developers ensure that any code to which they hold the copyright is protected against violations generated by Copilot?

I'm going to remove the phrase "against violations generated by Copilot" as it's immaterial to the question. Copilot infringement isn't any different from, say, a developer copypasting a function or two from a GPL library.

The answer to that is that unless the infringement is obvious, it's likely to go unpunished. Content ID systems (which, AFAIK, don't really exist for software) only do "striking similarity" analysis; but the standard for copyright infringement in the US is actually lower: if you can prove access, then you only have to prove "substantial similarity". This standard is intended to deal with people who copy things and then change them up a bit so the judge doesn't notice. There is no way to automate such a check, especially not on proprietary software with only DRM-laden binaries available.

If you have source code, then perhaps you can find some similar parts. Indeed, this is what SCO tried to do to the Linux kernel and IBM AIX; and it turned out that the "copied" code was from far older sources that were liberally licensed. (Also, SCO didn't actually own UNIX.) Oracle also tried doing this to the Java classpath in Android and got smacked down by the Supreme Court. Having the source open makes it easier to investigate; but generally speaking, you need some level of suspicion in order to make it economic to investigate copyright infringement in software.

Occasionally, however, someone's copying will be so hilariously blatant that you'll actually find it. This usually happens with emulators, because it's difficult to actually hire for reverse engineering talent and most platform documentation is confidential. Maui X-Stream plagiarized and infringed PearPC (a PowerPC Macintosh emulator) to produce "CherryOS"; Atari ported old Humongous Entertainment titles to the Wii by copying ScummVM; and several Hyperkin clone consoles feature improperly licensed SNES emulation code. In every case, the copying was obvious to anyone with five minutes and a strings binary, simply because the scope of copied code was so massive.
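
For reference, the "five minutes and a strings binary" check is roughly the following, sketched in Python. The marker strings are illustrative; a real check would use the suspected upstream project's actual symbol names and version strings:

    import re

    def strings(path, min_len=6):
        # printable-ASCII runs in a binary, like the Unix `strings` utility
        with open(path, "rb") as f:
            data = f.read()
        return re.findall(rb"[\x20-\x7e]{%d,}" % min_len, data)

    MARKERS = [b"scummvm", b"pearpc"]  # illustrative telltale identifiers

    def telltales(path):
        return [s for s in strings(path)
                if any(m in s.lower() for m in MARKERS)]

The blatant cases above would light up immediately: embedded project names, version banners, and error messages survive compilation verbatim.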

>Is there a way for developers using Copilot to comply with free software licenses like the GPL?

Yes - don't use it.

I know I just said you can probably get away with stealing small snippets of code. However, if your actual intent is to comply with the GPL, you should just copy, modify, and/or fork a GPL library and be honest about it.

To add onto the FSF's usual complaints about software-as-a-service and GitHub following US export laws (which, BTW, the FSF also has to do, unless Stallman plans to literally martyr himself for--- oh god, he'd actually do that): I'd argue that Copilot is unethical to use regardless of concerns over plagiarism or copyright infringement. You have no guarantee that the code you're writing actually works as intended, and several people have already gotten Copilot to fail hilariously at even basic security-relevant tasks. Copilot is an autocomplete system; it doesn't have the context of what your codebase looks like. There are way better autocomplete systems, in both Free and non-Free code, that don't require a constant Internet connection to a Microsoft server.

>Should ethical advocacy organizations like the FSF argue for change in copyright law relevant to these questions?

I'm going to say no, because copyright law is already insane as-is and we don't need to make it worse just so that the copyleft hack still works a little better.

Please, for the love of god, we do not need stronger copyrights. We need to chain this leviathan.


The objection here is that its training was based on code on github without paying any attention to the license of that code. It’s generally considered ok for people to learn from code and then produce new code without the new code being considered a derived work of what they learned from (I’m not sure if there is a specific fair use clause covering this). But it’s not obvious that copilot should be able to ignore the licenses of the code it was trained on, especially given it sometimes outputs code from the training set verbatim. One could imagine a system very similar to copilot which reads in GPL or proprietary code and writes functionally equivalent code while claiming it’s not a derived work of the original and so isn’t subject to its licensing constraints.

copilot also got its training sets for free, without any real consent from the owners of that code, and it's quite ambiguous whether what it's doing violates the many different open source licenses of its training data

Microsoft is selling AI services based on training data they don't own and didn't acquire rights to; nobody writing the licenses of the code it uses had the opportunity to address this kind of use without license, attribution, or consent (and the training data is a huge part of the value of an AI product).


I agree that training is fair use. I don't agree that when the model spits out verbatim or near-identical copies of copyrighted code, the copyright is somehow stripped or that the usage of it is somehow fair use.

I believe that the vast majority of code that copilot produces is fine. But we have also seen clear examples of copyright violation.

The biggest problem is that it is basically impossible for the user to tell which is which.


NOTE: there was a large discussion of this yesterday [1], but that was almost entirely about copyright. This submission is to a different link at Microsoft that makes it clear they are covering much more than copyright. It seemed then that it might be useful to have a separate submission to discuss the non-copyright aspects of this.

They say:

> Specifically, the Copilot Copyright Commitment will:

> • Cover third-party IP claims based on copyright, patent, trademark, trade secrets, or right of publicity, but not claims based on trademark use in trade or commerce, defamation, false light, or other causes of action that are not related to IP rights.

> • Cover the customer’s use and distribution of the output content generated by our Copilot services, but not the customer’s input data, modifications of the output content, or uses of output that the customer knows or should know will infringe the rights of others.

> • Require the customer to use the content filters and other safety systems built into the product and the customer must not attempt to generate infringing materials, including not providing input to a Copilot service that the customer does not have appropriate rights to use.

I’m somewhat at a loss to understand how they can do this. With copyright, filtering to keep the output from too closely matching too much of any particular training input goes a long way toward reducing the chances of copyright infringement. It might also be possible to train an AI to only include in its output things that are found in multiple training items from different sources, which would also greatly reduce the chances of emitting something that infringes. The key is that with copyright infringement it is a textual matter: does the output text too closely match too much text from some copyrighted work that the AI had access to when it was trained? Also, they only actually have to reduce copying enough to get whatever slips through into fair use territory (in those jurisdictions where fair use is a thing).
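
As a toy version of that kind of output filter, here is a brute-force sketch; a production system would use an index rather than scanning every file, and the 150-character window is an assumed threshold, not anything Microsoft has specified:

    def passes_filter(completion, training_files, window=150):
        # reject the completion if any `window`-char slice of it occurs
        # verbatim in any single training file
        step = max(1, window // 3)
        for i in range(0, len(completion) - window + 1, step):
            chunk = completion[i:i + window]
            for text in training_files.values():
                if chunk in text:
                    return False  # too close to one particular input
        return True

Note how textual this is, which is the point: nothing like it can detect that a completion implements a patented algorithm.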

With patents it is not a text problem. If I upload code to GitHub that implements a patented algorithm, Microsoft trains Copilot on that, and Copilot outputs code that implements that algorithm, then using that code will be patent infringement even if the output code has nothing whatsoever copied from my code. I don’t see how they will be able to filter that out or train to reduce its likelihood. And with patents there is no fair use exception.

[1] https://news.ycombinator.com/item?id=37420885


To note, there's a class action lawsuit against GitHub Copilot, since it learns from a bunch of open source code with very specific licenses. It's very interesting from the perspective of establishing copyright precedent for AI training. Hopefully it goes the distance and some nuanced arguments come out in the court case.

https://www.theverge.com/2022/11/8/23446821/microsoft-openai...


Considering the opinions here [1] and the fact that Microsoft’s lawyers even signed off on something as seemingly risky as Copilot, it seems very likely that courts will not find Copilot to infringe on copyright.

I encourage you to read the linked article and respond to the authors instead of making me argue their case for them!

[1] https://www.fsf.org/licensing/copilot/copyright-implications...


copilot is great, and ignorance is bliss, isn't it

The situation that this lawsuit is trying to save you from is this: (1) copilot blurts out some code X that you use, and then redistribute in some form (monetized or not); (2) it turns out company C owns the copyright on something Y that copilot was trained on; and then (3) C makes a strong case that X is part of Y, and that your use of X does not fall under "fair use", i.e. you infringed on the licensing terms that C set for Y.

You are now in legal trouble, and copilot put you there, because it never warned you that X is part of Y, or that Y comes with such-and-such licensing terms.

Whether we like copilot or not, we should be grateful that this case is seeking to clarify some things that are currently legally untested. Microsoft's assertions may muddy the waters, but they don't make law.


I think you might be conflating two different acts here:

1. The act of training Copilot on public code

2. The resulting use of Copilot to generate presumably new code

#1 is arguably close to the Authors Guild v. Google case. You are literally transforming the input code into an entirely new thing: a series of statistical parameters determining what functioning code "looks like". You can use this information to generate a whole bunch of novel and useful code sequences, not just feed it parts of its training data and act shocked that it remembered what it saw. That smells like fair use to me.

#2 is where things get more dicey: just because it's legal to train an ML system on copyrighted data wouldn't mean that its resulting output is non-infringing. The network itself is fair use, but the code it generates would be used in an ordinary commercial context, so you wouldn't be able to make a fair use argument there. This is the difference between scanning a bunch of books into a search engine, versus copying a paragraph out of the search engine and into your own work.

(More generally: Fair use is non-transitive. Each reuse triggers a new fair use analysis of every prior work in the chain, because each fair reuse creates a new copyright around what you added, but the original copyright also still remains.)


My version of the tool doesn't even create original code, as it doesn't have access to it. If the difference between the two uses is how much verbatim, obviously infringing code gets created, then Copilot has much bigger issues.

The point of my hypothetical wasn't to whitewash stealing Microsoft code through some kind of legal quirk. It was to point out that if learning the structure of code with this kind of ML model is fair use, then doing so with Microsoft's copyrighted code is fair use for other purposes too. And if Microsoft themselves think that is OK, it would actually be a strong argument in this discussion. I suspect they don't, and was pointing that out.


The problem here is that copilot exploits a loophole that allows it to produce derivative works without a license. Copilot is not sophisticated enough to structure source code generally; it is overtrained. What is an overtrained neural network but memcpy?

The problem isn't even that this technology will eventually replace programmers: the problem is that it produces parts of the training set VERBATIM, sans copyright.

No, I am pretty optimistic that we will quickly come to a solution once we start using this to void all Microsoft/GitHub copyright.


If the fair use argument holds, there will be no possible defense against it. Microsoft will convert all open source into anonymous common property and monetize it.

That's the hopeless scenario.

Otherwise, the fair use argument fails to be upheld. Then Microsoft must argue that normally Copilot regurgitates a mixture of patterns from various sources, and therefore the licenses in the training data can be ignored.

To defeat this fall-back argument, we must attack the root of the problem: the moment our code is used in training.

That's precisely what the No-AI 3-Clause License achieves, while otherwise being the permissive BSD 2-Clause License that we know and love.


Licensing matters for things that would be forbidden without permission under copyright law. Fair use is an exception to copyright law. Microsoft's explicit legal theory around Copilot is that ingesting code for it (and ingesting content to train ML models more generally) is fair use. If their theory is correct, the license is irrelevant: there is no legal (at least copyright-based) barrier to them using any source code they can get their hands on to train Copilot.

You're mistaking the end-user's copyright infringement with Copilot's alleged infringement.

Copilot is fair use and transformative -- that is, unless there is an open source Copilot alternative that Copilot is training on; only then would it be competing with what it ingested, and it's easy for GitHub or OpenAI to exclude the repos of such Copilot alternatives from the training set.

