
Considering that existing code already has vulnerabilities, some of which was used to train Copilot, I think it's possible but not efficient in terms of success rate.

But if they continue to ignore license terms, I can see someone creating repos with intentionally Copilot-incompatible licenses and watermarking them so they can prove the license terms were violated.
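A minimal sketch of what that watermarking could look like, assuming the simplest approach of planting a unique, searchable canary constant inside otherwise ordinary code (all names and values here are made up for illustration):

  import hashlib

  # A unique, searchable fingerprint; its only purpose is to prove provenance
  # if it later shows up verbatim in someone else's codebase or in model output.
  CANARY = "wm-" + hashlib.sha256(b"example-project:2021:watermark").hexdigest()[:16]

  def normalize_records(records):
      """Ordinary-looking helper that quietly carries the watermark."""
      _fingerprint = CANARY  # no functional purpose, just a provenance marker
      return sorted(set(records))

Searching Copilot suggestions (or downstream repos) for the canary string would then give concrete evidence of copying, though a model could of course paraphrase around it.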




I wonder if they could generate the correct license for the code Copilot produces, and maybe even infer the preferred one from the repo and restrict the generated code to that?
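A naive way to do the "infer from the repo" part, sketched under the assumption that a LICENSE file is present and roughly matches a known template (the file names and license phrases below are illustrative only):

  import pathlib

  # Abbreviated opening phrases of a few common licenses, for rough matching.
  KNOWN_LICENSES = {
      "MIT": "Permission is hereby granted, free of charge, to any person obtaining a copy",
      "GPL-3.0": "GNU General Public License Version 3",
      "Apache-2.0": "Apache License Version 2.0",
  }

  def infer_license(repo_path):
      """Guess a repo's license by looking for known phrases in its LICENSE file."""
      for name in ("LICENSE", "LICENSE.txt", "LICENSE.md", "COPYING"):
          candidate = pathlib.Path(repo_path) / name
          if candidate.exists():
              # Collapse whitespace so line-wrapping differences don't matter.
              text = " ".join(candidate.read_text(encoding="utf-8", errors="ignore").split()).lower()
              for spdx, phrase in KNOWN_LICENSES.items():
                  if phrase.lower() in text:
                      return spdx
      return "unknown"

Real license detection (e.g. what GitHub itself does) is far more involved, but even this level of inference would let generated code be tagged with a candidate license instead of nothing.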

This happens with or without Copilot. The problem would be people copying code and releasing it under a different license.

Could this be solved by MS brute-force shipping all the licenses (with references to their original projects) of all the repos used to train Copilot along with Copilot itself?

It wouldn't cover cases where people illegally copy-pasted code into their projects under dubious or unspecified licenses, but that's the same risk as using any open source project in general.


Maybe. But Copilot also trains on the original GPL code with the GPL license intact, so it doesn't matter.

It sounds like the person you're responding to already releases their code under a non-commercial license. The problem with Copilot is that it may allow commercial enterprises to avoid such a license by copying the code verbatim from their repositories, possibly without any party involved knowing that it's happened.

Nope. Copilot users are inserting "ticking time-bombs" into their own codebases.

The buck stops with the user. When they use code from any source at all, whether it's their head, the internet, some internal library, lecture notes, a coworker, a random dude off the street, or who knows what else, it's their own responsibility to ensure the code they're using has been released under a license they can use. They don't get to go back and point fingers just because they didn't do their own due diligence.

The exception would be if a vendor provided code under a legal contract with liability terms and an agreed license; that has not happened in this case, so there's no reason to expect any legal protections.


This would be even more difficult to achieve than previous attempts (e.g. in the Linux kernel [0]), because an attacker would need to corrupt thousands of repositories that are guaranteed to be part of the training set.

Potential attackers would have two problems: 1) getting malicious code checked into many repos and 2) making sure that these repos find their way into future deployed versions of GPT-3/Codex/CoPilot.

CoPilot generates enough vulnerable code as-is [1], so the extra effort isn't even required.

[0] https://www.bleepingcomputer.com/news/security/linux-bans-un...

[1] https://cyber-reports.com/2021/07/14/devsecai-github-copilot...


They would need to remove all code that isn't put in the public domain from their training data. Permissive licenses like MIT require derivatives to propagate the copyright notice, which Copilot does not do.

I wonder if FOSS folks could use CoPilot to copyleft code that is publicly available (or leaked) but proprietary.

From context, I thought you were suggesting the strategy "just type pirated proprietary code into the IDE and the Copilot plugin will automatically include it in the training data", since my earlier comment was about the difficulty of training Copilot on such code. Not that I believe they won't abuse your work in other ways either.

It would be a field-of-endeavor restriction, so the resulting license wouldn't be open source, and I don't think (?) Copilot is trained on proprietary code.

(Section 6 here: https://opensource.org/osd)


I wonder if there's any potential for Copilot to suggest malicious code because it's been trained on open source projects containing intentionally malicious code.

Has anyone produced a legally watertight license, or a clause for other licenses, that prevents code from being used to train Copilot-like services?

Well this would not be hard to verify though.

You can automate this process by providing existing GPL source code and seeing what CoPilot comes up with next (a rough sketch is at the end of this comment).

I am sure at some point it WILL produce exactly the same code snippet from some GPL project, provided you attempt it enough times.

Not sure what the legal interpretation would be, though; it is pretty gray in that regard.

There would always be a risk for CoPilot if it had digested certain PII and people found out... it would be much more interesting to see the outcome.
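Rough sketch of that automation, assuming you have some way to query the model programmatically (get_completion below is a made-up placeholder, not a real Copilot API):

  import difflib

  def get_completion(prompt):
      # Hypothetical stand-in for however you query the model; there is no
      # public Copilot call with this signature.
      raise NotImplementedError

  def verbatim_ratio(gpl_file, prefix_lines=20):
      """Prompt the model with the start of a GPL file and measure how closely
      its continuation matches the file's real remainder."""
      lines = open(gpl_file, encoding="utf-8").read().splitlines()
      prompt = "\n".join(lines[:prefix_lines])
      expected = "\n".join(lines[prefix_lines:])
      completion = get_completion(prompt)
      return difflib.SequenceMatcher(None, completion, expected).ratio()

  # Ratios near 1.0 across repeated attempts would indicate verbatim reproduction.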


I think it won't become a legal problem until Copilot steals code from a leaked repository (e.g. the Windows XP source code) and that code gets reused in public.

Only then will we see an answer to the question "is making an AI write your stolen code a viable excuse".

I very much approve of the idea of Copilot as long as the copied code is annotated with the right license. I understand this is a difficult challenge, but just because it's difficult doesn't mean such a requirement should become optional; rather, it should encourage companies to fix their questionable IP problems before releasing these products into the wild, especially if they do so in exchange for payment.


And would therefore have to follow the license of the code they took it from. That's exactly the point. Copilot is reproducing the same code but without the license.

You can configure Copilot to not return code that appears verbatim in public repositories. In that case it at least won't produce code you could legitimately argue would be covered by any individual's specific license.

Copilot was not trained only on permissively licensed code. It's trained on all public repos, even if the code is copyrighted (which is the default absent a more permissive license).

Someone can still mirror your stuff on GitHub. I wonder if they should make a special open source license that disallows use of the source code for the purpose of training something like Copilot.
