So... say Microsoft retrained Copilot only on code explicitly marked as open source. As an activist or vandal you could start publishing proprietary code with fraudulent license files to pollute Copilot again.
Expanding on that, even if Microsoft sees the error of its ways and retrains Copilot on only permissively licensed source, or with explicit opt-in, it may still get trained on proprietary code that the old version of Copilot inserted into a permissively licensed project.
You would have to just hope that you can take down every instance of your code and keep it down, all while copilot keeps making more instances for the next version to train on and plagiarize.
Arguably copilot was built using open source code against the spirit of the licenses. After all Microsoft has done to undermine open source in the past, can we allow them to have the last laugh? What if we organize a boycott?
Could this be solved by MS brute-force shipping the licenses (with references to their original projects) of all the repos they used to train Copilot, along with Copilot itself?
It wouldn't cover cases where people illegally copy-pasted code into their projects under dubious or unclear licenses, but that's the same risk as using any open source project in general.
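To make that concrete, here is a rough sketch (in Python; the directory layout, file names, and output path are all made up for illustration) of what brute-force collecting every license file from the training repos into one attribution bundle might look like:

    import json
    from pathlib import Path

    # Hypothetical layout: one directory per repo used for training,
    # each containing whatever license file the project shipped.
    TRAINING_REPOS = Path("training-data/repos")
    LICENSE_NAMES = ("LICENSE", "LICENSE.txt", "LICENSE.md", "COPYING")

    manifest = []
    for repo in sorted(TRAINING_REPOS.iterdir()):
        for name in LICENSE_NAMES:
            license_file = repo / name
            if license_file.is_file():
                manifest.append({
                    "repo": repo.name,  # reference back to the original project
                    "license_file": name,
                    "text": license_file.read_text(errors="replace"),
                })
                break

    # Ship this alongside Copilot itself as a single attribution bundle.
    Path("copilot-attributions.json").write_text(json.dumps(manifest, indent=2))

Even then you'd only capture the top-level license of each repo, not the provenance of every snippet inside it, which is exactly the copy-paste gap mentioned above.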
I'm not really sure what I think about this. How responsible should Microsoft be for someone's badly licensed code on their platform? If they somehow had the ability to ban projects using stolen snippets of code, I don't think I'd dare to host my hobby projects there.
If you can't trust that the code in a project is compatible with the license of the project then the only option I see is that copilot cannot exist.
I love free software and whatnot, but I have a feeling this situation would've been quite different if Copilot had been made by the free software community and accidentally trained on some non-free code...
Considering that existing code already has vulnerabilities, some of which were used to train Copilot, I think it's possible but not efficient in terms of success rate.
But if they continue to ignore license terms, I can see someone creating repos with intentionally Copilot-incompatible licenses and watermarking them so they can prove the license terms were violated.
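A minimal sketch of that watermark idea (Python; the repo name, comment format, and token scheme are purely hypothetical): stamp every file with a unique canary string, so any verbatim regurgitation by the model can be traced back to the watermarked repo.

    import secrets
    from pathlib import Path

    # Hypothetical: a repo published under a deliberately Copilot-incompatible
    # license, with a unique canary token embedded in every source file.
    CANARY = f"license-canary-{secrets.token_hex(16)}"

    def watermark_repo(repo_root: str) -> None:
        for source_file in Path(repo_root).rglob("*.py"):
            original = source_file.read_text()
            # Prepend a comment that survives copy/paste; if a model ever
            # emits this exact token, it demonstrably reproduced this repo.
            source_file.write_text(f"# {CANARY}\n{original}")

    if __name__ == "__main__":
        watermark_repo("my-no-ai-licensed-repo")
        print("Keep this token as private proof of origin:", CANARY)

Of course a model is more likely to reproduce the surrounding code than the comment itself, so a stronger watermark would hide the token in identifiers or string constants, but the principle is the same.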
It is now proven that copilot returns code from codebases with non-permissive licenses [1].
I'm curious - what are the legal implications of this going forward? I've so many questions.
1. Will Microsoft ever face lawsuits for these license violations?
2. If so, who/how? Class-action?
3. Will Copilot be forced to be open-sourced in the future? Under which license? Some open source licenses are incompatible with others, but Copilot uses code from probably every OSS license ever conceived.
4. If Microsoft faces no justice, will we start seeing more OSS license violations? Will Google start using AGPL-licensed code?
And what if the code generated by Copilot came attached to a license that you had to obey? Suddenly your proprietary solution must be released as open source or rewritten, because Copilot is effectively laundering open source code.
Life's a lot easier when you can just copy whoever did the hard work without crediting/paying/etc for it.
Perhaps one should create an open-source alternative to GitHub Copilot, also trained on proprietary source code such as some leaked Windows source code, and everyone will be happy and appreciate that we can use fair use to train such AIs.
Copilot is a product -- at least indirectly -- of Microsoft, a company that, for about a decade, made very public pronouncements about how it disagreed with the GPL (or copyleft generally), found it problematic, and actively tried to discourage its use.
Today's MS isn't really the same, and they've clearly made their peace with Linux. But it remains true that the GPL is in some fundamental ways at odds with commercial exploitation of open source code. So any corporate entity is going to struggle with it, because at best it requires being very careful about distribution, or trying to negotiate a deal with the copyright holder. At worst it can lead to legal problems and IP leakage in your own product.
So, I'm not claiming any conspiracy, or deliberate intent to violate. But it is in the convenient interests of companies like MS/OpenAI/GitHub to treat open source work as effectively public domain rather than under copyright, and to push the limits there.
The risk to an employer is of course the accidental introduction of such copylefted material into their code-base through copilot or similar tools.
I suspect there are two sources of disconnect with the broader Hacker News community, which doesn't seem to see the issue here:
a) Many of the folks on this forum work in the full-stack/web space, where fundamentally novel, patented, or conceptually difficult algorithms and data structures are rare. For them Copilot is an absolute blessing, helping to reduce the tedium of boilerplate. However, in the embedded systems, operating systems, compiler, game engine, and database internals world, there are other aspects at work. In certain contexts, Copilot has been shown to reproduce complicated or difficult code taken from copyrighted or copylefted (or maybe even patented) sources without attribution. And apparently now with some explicit obfuscation.
To put it another way: it's unlikely that Copilot is going to violate licenses while helping you turn your value/model objects from one structure to another, or write a call into a SQL ORM. But it's quite possible that if I'm writing a DB join algorithm, some complicated math in a rendering engine, or a compiler optimization pass, it could "crib notes" from a source under a restrictive license... because those things are absolutely in its training set and the LLM doesn't "know" about the licensing behind them.
b) Either a misunderstanding of, a lack of knowledge of, or outright hostility to... copylefted or attribution licenses, which require special handling.
Microsoft can argue that Copilot emits a mixture of intellectual property (a pattern from here, a pattern from there), so they don't need to give attribution.
But if we disallow training, it's unambiguous.
Either you fed the program into your training system or you didn't. The No-AI 3-Clause License forbids use in training, no question about it. If you train your model on this text, you are obviously violating the license.
Systems like Microsoft Copilot are a new threat to intellectual property. The open source licenses need to change to adapt to the threat.
Otherwise Microsoft and Amazon will pillage open source software, converting all our work into anonymous common property that they can monetize.
Open source licenses aren't a free-for-all. Many have terms like GPL's copyleft/share-alike or the attribution requirements of many other licenses. If copilot was trained on such code, then it seems that it, and/or the code it generates, violates those licenses.
Keep in mind that CoPilot might itself be a well-poisoning tactic.
If use of GitHub presumes use of CoPilot, and of commingled code under incompatible or proprietary licenses, such that use of that code could then create contributory infringement claims against distributors, users, or developers, there's something of a problem across the Free Software world.
Though that does seem rather a bit of a major footgun for Github / Microsoft themselves.
The Free Software movement didn't create, or even want, a world in which copyrighted software was the norm. But it adapted to the circumstance by treating copyright as a serious matter and being diligent in its practices of use, appropriation, and licensing.
It's rather ironic that the source of the "Letter to Hobbyists"[1] is now advocating a devil-may-care attitude to copyright, software, and licensing in its own works and service offerings.
Microsoft already launders open source code by just hiring people in China and Romania to rewrite it. Copilot is their engineering culture, distilled. However, most big companies do this.
I'm already terrified by how many developers have been working on proprietary code bases with Copilot, letting an extension in their editor upload all their employer's proprietary code to Microsoft, who then shares it with OpenAI. Then they've taken code of unknown authorship that OpenAI and Microsoft sent back to them and added it into their own codebase.
And now those devs are going to have to go to their boss and explain all the ways they’ve opened their company up to liability?
This could be terribly fun.