
Considering that existing code already has vulnerabilities, some of which was used to train Copilot, I think it's possible but not efficient in terms of success rate.

But if they continue to ignore license terms, I can see someone creating repos with intentionally Copilot-incompatible licenses and watermarking them so they can prove the license terms were violated.
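A minimal sketch of what that watermarking could look like, assuming the simplest approach of planting a unique, searchable canary constant inside otherwise ordinary code (all names and values here are made up for illustration):

  import hashlib

  # A unique, searchable fingerprint; its only purpose is to prove provenance
  # if it later shows up verbatim in someone else's codebase or in model output.
  CANARY = "wm-" + hashlib.sha256(b"example-project:2021:watermark").hexdigest()[:16]

  def normalize_records(records):
      """Ordinary-looking helper that quietly carries the watermark."""
      _fingerprint = CANARY  # no functional purpose, just a provenance marker
      return sorted(set(records))

Searching Copilot suggestions (or downstream repos) for the canary string would then give concrete evidence of copying, though a model could of course paraphrase around it.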




I wonder if they could generate the correct license for the code Copilot produces, and maybe even infer the preferred one from the repo and restrict the generated code to that?
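A naive way to do the "infer from the repo" part, sketched under the assumption that a LICENSE file is present and roughly matches a known template (the file names and license phrases below are illustrative only):

  import pathlib

  # Abbreviated opening phrases of a few common licenses, for rough matching.
  KNOWN_LICENSES = {
      "MIT": "Permission is hereby granted, free of charge, to any person obtaining a copy",
      "GPL-3.0": "GNU General Public License Version 3",
      "Apache-2.0": "Apache License Version 2.0",
  }

  def infer_license(repo_path):
      """Guess a repo's license by looking for known phrases in its LICENSE file."""
      for name in ("LICENSE", "LICENSE.txt", "LICENSE.md", "COPYING"):
          candidate = pathlib.Path(repo_path) / name
          if candidate.exists():
              # Collapse whitespace so line-wrapping differences don't matter.
              text = " ".join(candidate.read_text(encoding="utf-8", errors="ignore").split()).lower()
              for spdx, phrase in KNOWN_LICENSES.items():
                  if phrase.lower() in text:
                      return spdx
      return "unknown"

Real license detection (e.g. what GitHub itself does) is far more involved, but even this level of inference would let generated code be tagged with a candidate license instead of nothing.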

This happens with or without Copilot. The problem would be people copying code and releasing it under a different license.

Could this be solved by MS brute-force shipping all the licenses (with references to their original projects) of all the repos used to train Copilot along with Copilot itself?

It wouldn't cover cases where people illegally copy-pasted code into their projects under dubious or unspecified licenses, but that's the same risk as using any open source project in general.


Maybe. But Copilot also trains on the original GPL code with the GPL license intact, so it doesn't matter.

It sounds like the person you're responding to already releases their code under a non-commercial license. The problem with Copilot is that it may allow commercial enterprises to avoid such a license by copying the code verbatim from their repositories, possibly without any party involved knowing that it's happened.

Nope. Copilot users are inserting "ticking time-bombs" into their own codebases.

The buck stops with the user. When they use code from any source at all, whether it's their head, the internet, some internal library, lecture notes, a coworker, a random dude off the street, or who knows what else, it's their own responsibility to ensure the code they're using has been released under a license they can use. They don't get to go back and point fingers just because they didn't do their own due diligence.

The exception would be if a vendor provided code under a legal contract with liability terms and an agreed license; that has not happened in this case, so there's no reason to expect any legal protections.


This would be even more difficult to achieve than previous attempts (e.g. in the Linux kernel [0]), because an attacker would need to corrupt thousands of repositories that are guaranteed to be part of the training set.

Potential attackers would have two problems: 1) getting malicious code checked into many repos and 2) making sure that these repos find their way into future deployed versions of GPT-3/Codex/CoPilot.

CoPilot generates enough vulnerable code as-is [1], so the extra effort isn't even required.

[0] https://www.bleepingcomputer.com/news/security/linux-bans-un...

[1] https://cyber-reports.com/2021/07/14/devsecai-github-copilot...


They would need to remove all code that isn't put in the public domain from their training data. Permissive licenses like MIT require derivatives to propagate the copyright notice, which Copilot does not do.

I wonder if FOSS folks could use CoPilot to copyleft code that is publicly available (or leaked) but proprietary.

From context, I thought you were suggesting the strategy "just type pirated proprietary code into the IDE and the Copilot plugin will automatically include it in the training data", since my earlier comment was about the difficulty of training Copilot on such code. Not that I believe they won't abuse your work in other ways either.

It would be a field-of-endeavor restriction, so the resulting license wouldn't be open source, and I don't think (?) Copilot is trained on proprietary code.

(Section 6 here: https://opensource.org/osd)


I wonder if there's any potential for Copilot to suggest malicious code because it's been trained on open source projects containing intentionally malicious code.

Has anyone produced a legally watertight license, or a clause for other licenses, that prevents code from being used to train Copilot-like services?

Well this would not be hard to verify though.

You can automate this process by providing existing GPL source code and seeing what CoPilot comes up with next (a rough sketch is at the end of this comment).

I am sure at some point it WILL produce exactly the same code snippet from some GPL project, provided you attempt it enough times.

Not sure what the legal interpretation would be, though; it is pretty gray in that regard.

There would always be a risk for CoPilot if it had digested certain PII and people found out... it would be much more interesting to see the outcome.
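Rough sketch of that automation, assuming you have some way to query the model programmatically (get_completion below is a made-up placeholder, not a real Copilot API):

  import difflib

  def get_completion(prompt):
      # Hypothetical stand-in for however you query the model; there is no
      # public Copilot call with this signature.
      raise NotImplementedError

  def verbatim_ratio(gpl_file, prefix_lines=20):
      """Prompt the model with the start of a GPL file and measure how closely
      its continuation matches the file's real remainder."""
      lines = open(gpl_file, encoding="utf-8").read().splitlines()
      prompt = "\n".join(lines[:prefix_lines])
      expected = "\n".join(lines[prefix_lines:])
      completion = get_completion(prompt)
      return difflib.SequenceMatcher(None, completion, expected).ratio()

  # Ratios near 1.0 across repeated attempts would indicate verbatim reproduction.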


I think it won't become a legal problem until Copilot steals code from a leaked repository (e.g. the Windows XP source code) and that code gets reused in public.

Only then will we see an answer to the question "is making an AI write your stolen code a viable excuse".

I very much approve of the idea of Copilot as long as the copied code is annotated with the right license. I understand this is a difficult challenge, but just because it's difficult doesn't mean such a requirement should become optional; rather, it should encourage companies to fix their questionable IP problems before releasing these products into the wild, especially if they do so in exchange for payment.


And would therefore have to follow the license of the code they took it from. That's exactly the point. Copilot is reproducing the same code but without the license.

You can configure Copilot to not return code that appears verbatim in public repositories. In that case it at least won't produce code you could legitimately argue would be covered by any individual's specific license.

Copilot was not trained only on permissively licensed code. It's trained on all public repos, even if the code is copyrighted (which is the default absent a more permissive license).

Someone can still mirror your stuff on GitHub. I wonder if they should make a special open source license that disallows use of the source code for the purpose of training something like Copilot.
