
So as a code author I am pretty upset about Copilot specifically, and it seems like SD is similar (I hadn't heard before that DeviantArt did the same thing GitHub did). But I agree with this take: the tech is here, it's going to be used, and it's not going to be shut down by a lawsuit. Nor should it be, frankly.

What I object to is not the AI itself, or even that my code has been used to train it. It's the "copyright for me but not for thee" way that it's been deployed. Does GitHub/Microsoft's assertion that training sidesteps licensing apply to GitHub/Microsoft's own code? Do they want to allow a (hypothetical) FSFPilot to be trained on their proprietary source? Have they actually trained Copilot on their own source? If not, why not?

I published my source subject to a license, and the force of that license is provided by my copyright. I'm happy to find other ways of doing things, but it has to be equitable. I'm not simply ceding my authorship to the latest commercial content grab.




The objection here is that its training was based on code on GitHub without paying any attention to the license of that code. It's generally considered OK for people to learn from code and then produce new code without the new code being considered a derived work of what they learned from (I'm not sure if there is a specific fair use doctrine covering this). But it's not obvious that Copilot should be able to ignore the licenses of the code it was trained on, especially given it sometimes outputs code from the training set verbatim. One could imagine a system very similar to Copilot which reads in GPL or proprietary code and writes functionally equivalent code, as sketched below, while claiming it's not a derived work of the original and so isn't subject to its licensing constraints.
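To make that concrete, here is a toy illustration (hypothetical function names, not drawn from any real codebase). The second function is a mechanical transformation of the first, with identifiers renamed and the loop replaced by a builtin, yet it computes exactly the same thing:

    # Original (imagine this carries a GPL header):
    def checksum(data):
        total = 0
        for byte in data:
            total = (total + byte) % 256
        return total

    # The "functionally equivalent" rewrite such a system might emit:
    def digest(buf):
        return sum(buf) % 256

If the second version is not a derived work, the license on the first is effectively unenforceable against this kind of laundering.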

To note, there's a class action lawsuit against GitHub Copilot, since it learned from a bunch of open source code with very specific licenses. It's very interesting from the perspective of establishing how copyright applies to AI training. Hopefully it goes the distance and some nuanced arguments come out in the court case.

https://www.theverge.com/2022/11/8/23446821/microsoft-openai...


I think some of the negativity about Copilot may come from the perception that if an individual or small startup attempted to train an ML model on public source code and commercialised a service from it, they would be drowning in legal issues from big companies unhappy with their code being used in such a product.

In addition, just because code is publicly available on GitHub does not necessarily mean it is permissively licensed for use elsewhere, even with attribution. Copyright holders unhappy with their copyrighted works being publicly accessible can use the DMCA to issue take-downs, which GitHub does comply with, but how that interacts with Copilot and any of its training data is a different question.

As much as the DMCA is bad law, it's rather funny seeing Microsoft charged in this lawsuit under the lesser-known provision against 'removal of copyright management information'. Microsoft has more resources to mount a defence, so the outcome will probably look different from what a smaller player facing this action would see.


The problem here is that Copilot exploits a loophole that allows it to produce derivative works without a license. Copilot is not sophisticated enough to structure source code in general; it is overtrained. What is an overtrained neural network but memcpy?

The problem isn't even that this technology will eventually replace programmers: the problem is that it produces parts of the training set VERBATIM, sans copyright notice.
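To make the overtrained-network-is-memcpy point concrete, here is a minimal sketch in pure Python (the training snippet is hypothetical). A next-token model whose every context was seen exactly once degenerates into a lookup table that replays its training data byte for byte:

    # "Train" an order-6 next-character model on a single snippet.
    training_code = "def gpl_licensed_fn(x):\n    return x * 42\n"
    K = 6
    model = {}
    for i in range(len(training_code) - K):
        ctx = training_code[i:i + K]
        model.setdefault(ctx, []).append(training_code[i + K])

    # "Generate": every context has exactly one continuation, so greedy
    # sampling walks straight back through the original text.
    out = training_code[:K]
    while out[-K:] in model:
        out += model[out[-K:]][0]
    print(out == training_code)  # True: the training set, verbatim

Real language models are vastly bigger, but the failure mode is the same: when capacity dwarfs the number of times each context appears, "generation" collapses into retrieval.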

No, I am pretty optimistic that we will quickly come to a solution when we start using this to void all Microsoft/GitHub copyright.


> So where is github's legal paperwork that shows that they're only processing the code that's copyrighted by the user who uploaded it

Under other circumstances they don't need it. But if Copilot is creating a derivative work including parts of that code without including the licence terms or attribution (as required by many licences), things are far more grey, or possibly fully black.

Some argue that the AI is unaware of the terms so can't be held responsible. Two possible counters to that: 1. it is the licence that gives you the right to use the copyrighted code; if you are unaware of the licence, then why assume you have the right to use the code? 2. if I found some useful code that happened, unbeknownst to me, to be from MS, and used it in a way I wasn't licensed to, and MS noticed, it is a pretty safe bet that they'd state ignorance of the copyright terms doesn't mean you can't be held to them.

Or another angle: the tool is allowing, even encouraging, people to use code or other materials in a way that infringes copyright (again: you don't have the right to use the code under most licences unless you give correct attribution and such) – the very conditions often stated as reasons for trying to ban other tools.

Plus of course the general argument: if this is entirely a non-issue, why is no Windows, Office, or SQL Server code in the training set? Surely those would be great examples of how to do things for the AI to train on?


One interesting aspect that I think will make it difficult for GitHub to argue it's not a license violation is the answer to the following question: was Copilot trained using Microsoft internal source code, or will it be in the future?

As GitHub is a Microsoft company, and OpenAI, although a non-profit, just got a massive one billion dollar investment from Microsoft (presumably not for free), will it once in a while start spitting out Windows kernel code? :-)

And if it was NOT trained on Microsoft source code because it could start suggesting some of it... is that not a validation that the results it produces are a derivative work of the open source code corpus it was trained on? IANAL...


99% sure that Nat Friedman is wrong there.

He's correct that an AI can train on code, but that's not the relevant issue.

(1) The issue is with Copilot outputting copyrighted code AND attaching a separate license to it (it looks like it hides the initial license and can generate another license automatically). This is reproducing copyrighted code and misrepresenting its license, neither of which is allowed.

(2) He posted that the output belongs to the operator. That's factually wrong. Original copyrighted code belongs to its original writer, not the operator. In effect GitHub Copilot cannot "launder" licenses, and it is very wrong for their CEO to claim otherwise.

(3) It looks like he's trying to waive GitHub's responsibility by stating that the developer receiving the code is responsible? Wrong: GitHub is responsible for its actions (the details vary with jurisdiction). There's a complex question of who's responsible when proprietary code ends up in a company product and the company gets sued. The company is in violation; it can turn against GitHub for providing the copyrighted code, since GitHub was responsible for that. There's a complex chain of responsibility, and any lawyer worth their salt would cringe at the claims GitHub is making and the liability it is exposing itself to.


Copilot also got its training sets for free and not really with any kind of consent from the owners of that code, and it's really quite ambiguous whether what it's doing violates many different open source licenses in its training data.

Microsoft is selling AI services based on training data they don't own and didn't acquire rights to; nobody writing the licenses of the code it uses had the opportunity to address this kind of use without license, attribution, or consent. (And the training data is a huge part of the value of an AI product.)


I disagree with this article. GitHub Copilot is indeed infringing copyright, and not only in a grey zone but in a very clear black-and-white fashion that our corporate taskmasters (Microsoft included) have themselves defined as infringement.

The legal debate around copyright infringement has always centered on the rights granted by the owner vs the rights appropriated by the user, with the owner's wishes superseding the user's needs/wants. Any open-source code available on GitHub is controlled by the copyright notice of the owner granting specific rights to users. Copilot is a commercial product; therefore, GitHub can only use code that the owners make available for commercial use. Every other instance of code used is a case of copyright infringement, a clear case by Microsoft's own definition of copyright infringement [1][2].

GitHub (and by extension Microsoft) is gambling that its license agreement, which grants it a license to the code in exchange for access to the platform, supersedes the individual copyright notices attached to each repo. This is a fine line to walk and will likely not survive in a court of law. They are betting on deep lawyer pockets to see them through, but are more likely than not to lose this battle. I suspect we will see how this plays out in the coming months.

[1] https://www.microsoft.com/info/Cloud.html

[2] https://github.com/contact/dmca


Here is a snippet of things that copyright is intended to cover:

>the right to exclude others from making certain uses of the work: copying it, making a derivative work based on that work, distributing copies of the work to the public, and publicly performing or displaying the work.

So why would "training" "AI" on code with the intention of emitting derived works not be copyright infringement exactly?

This product is transforming copyrighted code into something that's intended to be used or sold in other works. The snippets it emits are directly derived from copyrighted code.

The most common argument against this is that humans also learn from copyrighted material. My argument against this is that CoPilot is not a human and should not be assumed to inherit rules intended for humans.

>in a field that benefits us all

As it stands currently, CoPilot is proprietary and does not benefit anyone except Microsoft. If CoPilot was released under a FOSS license it would actually benefit us all. Most of the people against CoPilot are not against AI, but rather against a proprietary AI product transforming FOSS work into other, potentially proprietary, works with the intention of profiting off the completion service and hoarding the code that powers it.


Considering the opinions here [1] and the fact that Microsoft’s lawyers even signed off on something as seemingly risky as Copilot, it seems very likely that courts will not find Copilot to infringe on copyright.

I encourage you to read the linked article and respond to the authors instead of making me argue their case for them!

[1] https://www.fsf.org/licensing/copilot/copyright-implications...


If GitHub could guarantee that the code Copilot ingested consisted only of code released under OSS licenses, then I don't see what the problem is.

But as far as I understand, GitHub trained Copilot on every public repository on GitHub, including repos with no license specified (where the publishing user retains full copyright), and I don't see how that can be OK.
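For illustration, here is a minimal sketch of the kind of license gate that guarantee would require, using GitHub's public REST API (the repository endpoint reports the license GitHub detected; the allowlist and the whole policy here are hypothetical and would need legal review):

    import json
    import urllib.request

    # Hypothetical allowlist of permissive license keys.
    PERMISSIVE = {"mit", "apache-2.0", "bsd-2-clause", "bsd-3-clause", "isc"}

    def detected_license(owner, repo):
        """Return the license key GitHub detected for a repo, or None."""
        url = f"https://api.github.com/repos/{owner}/{repo}"
        with urllib.request.urlopen(url) as resp:
            meta = json.load(resp)
        lic = meta.get("license")
        return lic["key"] if lic else None

    def ok_to_ingest(owner, repo):
        # No license detected means all rights reserved by default:
        # the repo must not enter the training corpus.
        return detected_license(owner, repo) in PERMISSIVE

Even this is incomplete: no-license repos are the easy case, and copyleft licenses like the GPL are OSS but impose conditions a commercial model would still have to honour.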


Microsoft can argue that Copilot emits a mixture of intellectual property (a pattern from here, a pattern from there), so they don't need to give attribution.

But if we disallow training, it's unambiguous.

Either you fed the program into your training system or you didn't. The No-AI 3-Clause License forbids use in training, no question about it. If you train your model on this text, you are obviously violating the license.

Systems like Microsoft Copilot are a new threat to intellectual property. The open source licenses need to change to adapt to the threat.

Otherwise Microsoft and Amazon will pillage open source software, converting all our work into anonymous common property that they can monetize.

We're watching it happen.


I can see the GitHub Copilot controversy being resolved in this way. If Microsoft, GitHub, and OpenAI successfully use the fair use defense for Copilot's appropriation of proprietary and incompatibly licensed code, then a free and open source alternative to Copilot can be trained on Copilot's outputs.

After all, the GitHub Copilot Product Specific Terms say:

> 2. Ownership of Suggestions and Your Code

> GitHub does not claim any ownership rights in Suggestions. You retain ownership of Your Code.

https://github.com/customer-terms/github-copilot-product-spe...


I wasn't staking a position on whether Copilot is fair use, just pointing out that fair use doesn't care about license.

That said, Copilot itself is not a replacement for your open source project that it was trained on. The code it generates may or may not be, but that's probably not GitHub's problem as far as copyright law is concerned.


The flip side is that if, from a policy perspective, we decide that the licensing doesn't matter, then it's extremely likely that anyone else can make a Copilot competitor.

If we decide that licensing is required, it's likely that no one except GitHub (or a few other huge players) could ever make something like this, just due to access to the code.

(GitHub would make it a requirement of the TOS that your code is licensed to allow Copilot to use it, and require users to indemnify GitHub against third-party legal action arising from GitHub using code you posted in Copilot.)

The permissive handling levels the playing field.


Is it just me, or do the comments in this thread seem to be the exact opposite of the sentiment in the comments on similar GitHub Copilot threads?

I just find it a bit ironic that programmers are irate about GitHub Copilot using their copyrighted material for training, yet when it's an ML model training on copyrighted artists' material, clearly it's a transformative work. I just find the opposing sentiments in these scenarios a bit funny.


Suing Microsoft for training Copilot on your code would require jumping over the same hurdle that the Authors Guild could not: courts have held that it is fair use to scan a massive corpus[0] of texts (or images) in order to search through them.

My real worry is downstream infringement risk, since fair use is non-transitive. Microsoft can legally provide you a code generator AI, but you cannot legally use regurgitated training set output[1]. GitHub Copilot is creating all sorts of opportunities to put your project in legal jeopardy and Microsoft is being kind of irresponsible with how they market it.

[0] Note that we're assuming published work. Doing the exact same thing Microsoft did, but on unpublished work (say, for irony's sake, the NT kernel source code) might actually not be fair use.

[1] This may give rise to some novel inducement claims, but the irony of anyone in the FOSS community relying on MGM v. Grokster to enforce the GPL is palpable.


Yes. GitHub can get away with "oh well, we're all learning" because if the code is violating copyright, it's the user who is infringing directly by publishing it, not GitHub via Copilot. Either the user would have to bring a case against GitHub demonstrating liability (good luck) or the copyright holder would have to bring a case against GitHub demonstrating copyright violation (again, good luck). Otherwise this is entirely between the copyright holder and the Copilot user, legally speaking.

Of course, if someone does manage to set a precedent that including copyrighted works in AI training data without an explicit license is itself infringement, GitHub Copilot would be screwed and would at best have to start over with a blank slate if it can't be grandfathered in. But this would affect almost all products based on the recent advancements in AI, and they're backed by fairly large companies (after all, GitHub is owned by Microsoft, a lot of the other AI stuff traces back to Alphabet, and there are a lot of startups funded by huge and influential VC firms). Given the US's history of business-friendly legislation, I doubt we'll see copyright laws being enforced against training data unless someone upsets Disney.

