
To note, there's a class action lawsuit against GitHub Copilot, since it learns from a large body of open source code released under very specific licenses. It's very interesting from the perspective of establishing how copyright applies to AI training. Hopefully it goes the distance and some nuanced arguments come out in the court case.

https://www.theverge.com/2022/11/8/23446821/microsoft-openai...




Related. Others?

The lawsuit against Microsoft, GitHub and OpenAI that could change rules of AI - https://news.ycombinator.com/item?id=33546009 - Nov 2022 (5 comments)

An open source lawyer’s view on the copilot class action lawsuit - https://news.ycombinator.com/item?id=33542813 - Nov 2022 (175 comments)

Microsoft sued for open-source piracy through GitHub Copilot - https://news.ycombinator.com/item?id=33485544 - Nov 2022 (288 comments)

We've filed a lawsuit against GitHub Copilot - https://news.ycombinator.com/item?id=33457063 - Nov 2022 (781 comments)

GitHub Copilot may steer Microsoft into a copyright lawsuit - https://news.ycombinator.com/item?id=33278726 - Oct 2022 (11 comments)

GitHub Copilot investigation - https://news.ycombinator.com/item?id=33240341 - Oct 2022 (1219 comments)

What Copilot means for open source - https://news.ycombinator.com/item?id=31878290 - June 2022 (137 comments)

Should GitHub be sued for training Copilot on GPL code? - https://news.ycombinator.com/item?id=31847931 - June 2022 (300 comments)



Copilot also got its training sets for free, without any real consent from the owners of that code, and it's quite ambiguous whether what it's doing violates the many different open source licenses covering its training data.

Microsoft is selling AI services based on training data they don't own and didn't acquire rights to. Nobody writing the licenses for the code it uses had the opportunity to address this kind of use, which comes without license, attribution, or consent. (And the training data is a huge part of the value of an AI product.)


One interesting aspect that I think will make it difficult for GitHub to argue it's not a license violation is the answer to the following question: was Copilot trained on Microsoft's internal source code, or will it be in the future?

As GitHub is a Microsoft company, and OpenAI, although a non-profit, just got a massive one-billion-dollar investment from Microsoft (presumably not for free), will Copilot start spitting out Windows kernel code once in a while? :-)

And if it was NOT trained on Microsoft source code, precisely because it could start suggesting some of it... is that not a validation that the results it produces are derivative works based on the open source code corpus it was trained on? IANAL...


So as a code author I am pretty upset about Copilot specifically, and it seems like Stable Diffusion is similar (I hadn't heard before about DeviantArt doing the same as what GitHub did). But I agree with this take: the tech is here, it's going to be used, and it's not going to be shut down by a lawsuit. Nor should it be, frankly.

What I object to is not the AI itself, or even that my code has been used to train it. It's the "copyright for me but not for thee" way it's been deployed. Does GitHub/Microsoft's assertion that training sidesteps licensing apply to GitHub/Microsoft's own code? Would they allow a (hypothetical) FSFPilot to be trained on their proprietary source? Have they actually trained Copilot on their own source? If not, why not?

I published my source subject to a license, and the force of that license is provided by my copyright. I'm happy to find other ways of doing things, but it has to be equitable. I'm not simply ceding my authorship to the latest commercial content grab.


You're confusing the end user's copyright infringement with Copilot's alleged infringement.

Copilot is fair use and transformative -- unless, that is, there were an open source Copilot that Copilot was training on. Only then would it be competing with what it trained on, and it would be easy for GitHub or OpenAI to exclude the repos of such Copilot alternatives from the training set.
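
To make that exclusion concrete, here's a minimal sketch of what license- and blocklist-based filtering of a training corpus might look like. The Repo structure, the license identifiers, and the blocklist entries are all hypothetical illustrations, not GitHub's or OpenAI's actual pipeline:

    # Hypothetical sketch of training-set filtering; none of these names
    # come from GitHub's or OpenAI's real systems.
    from dataclasses import dataclass

    ALLOWED_LICENSES = {"mit", "apache-2.0", "bsd-3-clause"}  # permissive only
    EXCLUDED_REPOS = {"example/open-copilot"}  # hypothetical Copilot alternatives

    @dataclass
    class Repo:
        full_name: str   # e.g. "owner/name"
        license_id: str  # SPDX-style identifier, lowercased

    def include_in_training(repo: Repo) -> bool:
        """True if the repo may enter the training corpus."""
        return (repo.full_name not in EXCLUDED_REPOS
                and repo.license_id in ALLOWED_LICENSES)

    repos = [Repo("example/open-copilot", "mit"), Repo("acme/tools", "apache-2.0")]
    corpus = [r for r in repos if include_in_training(r)]  # keeps only acme/tools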


I think some of the negativity about Copilot comes from the perception that if an individual or small startup attempted to train an ML model on public source code and commercialise a service from it, they would be drowning in legal issues from big companies unhappy with their code being used in such a product.

In addition, just because code is publicly available on GitHub does not necessarily mean it is permissively licensed for use elsewhere, even with attribution. Copyright holders unhappy with their copyrighted works being publicly accessible can use the DMCA to issue takedowns, which GitHub does comply with; how that interacts with Copilot and its training data is a different question.

As much as the DMCA is bad law, it is rather funny to see Microsoft charged in this lawsuit under the lesser-known provision against 'removal of copyright management information'. Microsoft has more resources to mount a defence, so the case will probably end up differently than it would for a smaller player facing the same action.


99% sure that Nat Friedman is wrong there.

He's correct that an AI is allowed to train on the code, but that's not the relevant issue.

(1) The issue is with Copilot outputting copyrighted code AND attaching a separate license to it (it appears to hide the initial license and can generate another license automatically). That is reproducing copyrighted code and misrepresenting its license, neither of which is allowed.

(2) He posted that the output belongs to the operator. That is factually wrong. Original copyrighted code belongs to its original author, not the operator. In effect, GitHub Copilot cannot "launder" a license, and it is very wrong for their CEO to claim otherwise.

(3) It looks like he's trying to disclaim GitHub's responsibility by stating that the developer receiving the code is responsible. Wrong: GitHub is responsible for its own actions (the details vary with jurisdiction). There's a complex question of who is responsible when proprietary code ends up in a company's product and the company gets sued. The company is in violation; it can then turn against GitHub for providing the copyrighted code, which GitHub was responsible for. It's a complex chain of responsibility, and any lawyer worth their salt would cringe at the claims GitHub is making and the liability it is exposing itself to.


Out of curiosity - has no one sued OpenAI/GitHub for this? I remember seeing threads like this ever since Copilot launched. If there were enough legal pressure, I'd imagine OpenAI/GitHub would train on opt-in repos instead of using the model they currently have.

Yes. GitHub can get away with "oh well, we're all learning" because if the code is violating copyright, it's the user who is infringing directly by publishing it, not GitHub via Copilot. Either the user would have to bring a case against GitHub demonstrating liability (good luck) or the copyright holder would have to bring a case against GitHub demonstrating copyright violation (again, good luck). Otherwise this is entirely between the copyright holder and the Copilot user, legally speaking.

Of course, if someone does manage to set a precedent that including copyrighted works in AI training data without an explicit license is itself infringement, GitHub Copilot would be screwed and would at best have to start over with a blank slate if it can't be grandfathered in. But this would affect almost all products based on the recent advancements in AI, and those are backed by fairly large companies (after all, GitHub is owned by Microsoft, a lot of the other AI work traces back to Alphabet, and there are plenty of startups funded by huge and influential VC firms). Given the US's history of business-friendly legislation, I doubt we'll see copyright law enforced against training data unless someone upsets Disney.


I can see the GitHub Copilot controversy being resolved in this way. If Microsoft, GitHub, and OpenAI successfully use the fair use defense for Copilot's appropriation of proprietary and incompatibly licensed code, then a free and open source alternative to Copilot can be trained on Copilot's outputs.

After all, the GitHub Copilot Product Specific Terms say:

> 2. Ownership of Suggestions and Your Code

> GitHub does not claim any ownership rights in Suggestions. You retain ownership of Your Code.

https://github.com/customer-terms/github-copilot-product-spe...


Copilot is actually not trained just on GitHub data, and the model is owned by OpenAI, not Microsoft. It relies on fair use, not any particular terms of service.

Source: the FAQ at the bottom of https://copilot.github.com/

> GitHub Copilot is powered by OpenAI Codex, a new AI system created by OpenAI. It has been trained on a selection of English language and source code from publicly available sources, including code in public repositories on GitHub.


Suing Microsoft for training Copilot on your code would require clearing the same hurdle that the Authors Guild could not: the holding that it is fair use to scan a massive corpus[0] of texts (or images) in order to search through them.

My real worry is downstream infringement risk, since fair use is non-transitive. Microsoft can legally provide you a code generator AI, but you cannot legally use regurgitated training set output[1]. GitHub Copilot is creating all sorts of opportunities to put your project in legal jeopardy, and Microsoft is being kind of irresponsible with how they market it. (A rough sketch of one way to screen for regurgitation follows the footnotes.)

[0] Note that we're assuming published work. Doing the exact same thing Microsoft did, but on unpublished work (say, for irony's sake, the NT kernel source code) might actually not be fair use.

[1] This may give rise to some novel inducement claims, but the irony of anyone in the FOSS community relying on MGM v. Grokster to enforce the GPL is palpable.
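
As a rough illustration of the screening mentioned above, a cautious team could scan generated snippets for long verbatim overlaps with a reference corpus before accepting them. This is only a sketch under assumed inputs; the corpus files, the 25-token window, and whitespace tokenization are arbitrary illustrative choices:

    # Sketch: flag generated code that reproduces long verbatim token runs
    # from a reference corpus of open source files. The threshold and the
    # tokenizer are illustrative assumptions, not a vetted legal standard.
    def ngrams(tokens, n=25):
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def build_index(corpus_files, n=25):
        index = set()
        for path in corpus_files:
            with open(path, encoding="utf-8", errors="ignore") as f:
                index |= ngrams(f.read().split(), n)
        return index

    def looks_regurgitated(snippet, index, n=25):
        """True if any n-token window of the snippet appears verbatim in the corpus."""
        return bool(ngrams(snippet.split(), n) & index)

A match doesn't prove infringement, and a miss doesn't prove safety; it's just a cheap tripwire.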


A lot of open source licenses demand that if the licensed code is included in a derivative work, the new work has to carry the same license. GitHub Copilot is straightforwardly violating these terms in many cases, and I hope the pending class action lawsuit sets clear legal boundaries around the inclusion of data in a training set, and that those boundaries apply retroactively.

I disagree with this article. GitHub Copilot is indeed infringing copyright, and not merely in a grey zone but in a very clear, black-and-white fashion that our corporate taskmasters (Microsoft included) have themselves defined as infringement.

The legal debate around copyright infringement has always centered on the rights granted by the owner versus the rights appropriated by the user, with the owner's wishes superseding the user's needs and wants. Any open source code available on GitHub is governed by the owner's copyright notice, which grants specific rights to users. Copilot is a commercial product; therefore, GitHub can only use code that owners make available for commercial use. Every other instance of code use is a case of copyright infringement, a clear case by Microsoft's own definition of copyright infringement [1][2].

GitHub (and by extension Microsoft) is gambling that its platform agreement, which grants it a license to the code in exchange for access to the platform, supersedes the individual copyright notices attached to each repo. This is a fine line to walk and will likely not survive in a court of law. They are betting on deep lawyer pockets to see them through, but are more likely than not to lose this battle. I suspect we will see how this plays out in the coming months.

[1] https://www.microsoft.com/info/Cloud.html

[2] https://github.com/contact/dmca


GitHub isn't liable. That's been established in court with regard to training AIs. Who is liable is you, who may or may not have the legal right to use the code Copilot spits out for you.

If the lawsuit goes through, it's not likely that Copilot would disappear, but there would be a checkbox to opt in your code. You could check it, and your code would be used to train the model.

I have some code on GitHub as well and would not want it used in training, neither by Microsoft nor by any other company. It is under the GPL to ensure that any derived use remains public and is not stripped of copyright and locked into a proprietary codebase; Copilot is pretty much the 100% opposite of this.


Microsoft can argue that Copilot emits a mixture of intellectual property (a pattern from here, a pattern from there), so they don't need to give attribution.

But if we disallow training, it's unambiguous.

Either you fed the program into your training system or you didn't. The No-AI 3-Clause License forbids use in training, no question about it. If you train your model on this text, you are obviously violating the license.

Systems like Microsoft Copilot are a new threat to intellectual property. The open source licenses need to change to adapt to the threat.

Otherwise Microsoft and Amazon will pillage open source software, converting all our work into anonymous common property that they can monetize.

We're watching it happen.


The author is also one of the lawyers bringing the suit against (Microsoft's) GitHub Copilot.[2]

[2] https://githubcopilotlitigation.com/
