
> But that's about as unlikely as the code containing trade secrets.

Unpublished code is itself a trade secret. Even just the processes, procedures, organisation, tooling, library use, etc. in the code provide a competitive advantage; i.e., the 'metadata' is also a trade secret.

The only intent you'd need to prove is that the accused is using the trade secret to the 'economic benefit of anyone other than the owner'.

It seems obvious that Kite is training a proprietary ML algorithm, with trade secrets, for their own economic benefit.




> As pointed out upthread, if the source code is leaked then there may be trade secret protections. The GPL specifically allows the code to be posted online, so by design it is not secret.

The reverse may be true. I may be GPL'ing code to prevent a useful algorithm from being buried deep inside a commercial codebase with an incompatible license. What makes code "trade secret" level? I have a 25-line algorithm that is worthy of its own paper. What if I open its reference implementation under AGPLv3+?

I have no problem with you reading the paper and implementing it yourself. I don't obfuscate my papers, but I put the implementation out under AGPLv3+. You can't use that in a codebase with an incompatible license. I expect and want you to respect the license of my implementation.

> The code snippets that copilot generates look more like fair use than infringement. They are small, adapted to the destination context, and usually not direct copies of one source but more of an average of many different sources. And usually the programmer does not keep the suggestion that copilot suggests unmodified - the programmer does their own editing of the snippet afterwards to further tune it to the surrounding context.

Emphasis mine. First, there's no consensus on fair use yet. Second, they may be direct copies of the code. Third, they're remixed with other code pieces, which makes the output a derivative work of many pieces at once; then, lastly, the programmer re-derives the derived work. That is clearly a derivative of GPL code, which brings the GPL license with it (if Copilot derives the code from GPL-licensed repositories, which it does).

I have no problem with Copilot as a technology. I have no problem with other licenses whose terms are not breached when Copilot uses and derives from the code. The point which makes my blood boil is Copilot using this GPL corpus without admitting it publicly, breaching the terms of the GPL en masse, and outright ignoring it. Then feeding this GPL-derived code to any and all projects which pay for a Copilot membership, and calling it a day.


> The maker of the software, Cybergenetics, has insisted in lower court proceedings that the program's source code is a trade secret.

Should this be, in general, a disqualifying condition when something is to be used as evidence?


> considering your competitors' employees already infiltrated your slack channel.

It was a public community slack channel. There may have been proprietary/protected information shared in there but in my naive opinion (NAL!) that wouldn't be protected by trade secrets if it was shared publicly. Some things could still be protected by copyright or trademark but it's pretty unlikely. It's possible copying the API could be infringing but Google v. Oracle[0] makes that a very uphill battle.

"Secret sauce" is highly valued by founders and product managers but IMHO it's generally overvalued and doesn't constitute a moat. For the most part, competitors are allowed to copy most things and leverage whatever other advantages they have over the original innovators.

0: https://en.wikipedia.org/wiki/Google_LLC_v._Oracle_America,_....


> Any secrets it generated would have been already public and compromised.

Two things:

1. This is not some vigilante hacking group you are talking about. This is MS, a humongous corp. They know doing this is wrong, so that excuse shouldn't matter. It is still illegal, and making a proprietary software product is exactly the reason these protections exist. Don't you think?

2. "Already public" was unintentional. AFAIK, any code that I produce is automatically copyrighted to me. This means that if I publish something publicly without providing a license, IT IS LEGALLY under the copyright protection provided to me by my country. At least that is the case in the US and India, which are home to a huge portion of OSS; do correct me if I am wrong. Putting it out in public like that would be plain stupid, for sure, but legally it is still mine. Reproducing and remixing my work would be illegal. It is just a PITA for me to prove that in court.

> Whether doing so violates the licenses is probably up for courts to decide.

I have yet to see any attribution to ANY OSS code that it is trained on. The PII and secrets are enough to find out the license of a repo, which would make it easier to prove whether they violated it or not. Don't tell me all the OSS that Copilot has trained on is only public domain stuff. Even the ISC license requires attribution.

> Human programmers can do this with search engines.

OK, they CAN. And that is wrong too. So when they do, we should incriminate them as well, if the evidence is out there or if a matter such as this comes to light. How does that change anything I mentioned? Genuinely curious.


> That is only enforceable for Trade Secrets

Do you have legal citations supporting this assertion? I'm fairly certain this isn't so.


> If their AI uses my code as the basis to generate similar code, how is that not a derivative work?

How would you know the LLM used your work specifically and then prove that in court?


> Trade secrets, IP, secret sauce: covered by NDA and IP assignment agreements

As a developer, pretty much none of these matters or protects anything.

Imagine this scenario:

- John has no idea about video encoding, but he's a good developer.

- John joins a video encoding startup.

- This startup encodes videos 3 times faster than the competitor

- After working on the core product for 2 years, John knows a lot about video encoding, because he's been trained. He also knows why they can do it faster than anyone else. It's not one thing; it's a bunch of things.

- Then John receives an offer from the competitor with a 50% higher salary (obviously this is a smart thing for the competing company to do). He obviously leaves, because 50% more! All the know-how, experience, etc. will just automatically transfer to this competitor. NDAs, copyright, etc. can prevent none of it.

So how is this good for anyone but John? If you think this kind of stuff doesn't happen, and that all such advancements are public domain anyway, you are wrong. There are many niche fields where competitive advantage comes from technical excellence and from understanding a couple of key things better than your competition.

Not to mention John will have inside knowledge of so many other non-technical but important details that can give an obvious unfair competitive advantage.

When non-competes are removed, companies do need to treat their employees differently: "If I don't trust this employee enough, I shouldn't give them the important bits of the source code, shouldn't train them on know-how that we produced internally," etc., which is pretty bad for everyone.


>> If you had said it can copy/paste its training data I wouldn't have argued, but "it just copy/pastes the code it was trained on" is demonstrably false, and anyone who's really tried it will tell you the same thing.

So if "it could commit copyright infringement, but does not always do so" is good enough for your company's legal review team, then go for it.


> It is a statistical function that produces some optimized outputs for some inputs.

So is a human mind.

> In the US you are allowed to take a photo of anything you want in public. These models are not being trained on datasets collected wholly in public though, it is very insidious to suggest that they are.

How so? What non-public training data are they using, and why does it matter?

> The internet is not "the public". It is a series of digital properties that define terms for interacting with them. Now, a lot of material is publicly accessible online, but that does not mean that it is not still governed by copyright. For example, my code on Github is publicly accessible, but that doesn't mean you can disregard the license.

It does mean you can read the code and learn from it without concern for the license (morally, if not legally).


> to create and sell a product.

This is not model training.

> Copilot is a derivative work of that code, the license terms of that code (e.g. GPL, non-commercial) should extend to the new function that was derived from it.

But the very act of training Copilot is not problematic. In fact, if GitHub never did anything with Copilot, the physical act of training the model would not be problematic at all. And that's what's at issue here. How Copilot is used is orthogonal to the article.

> Sure, I can make a shirt with Spider Man on it and give it to my brother, but if a company were to use what I made or I tried to sell it, I would expect a cease and desist from Disney.

Yes. And training the model isn't the part where you sell it. It's the part where you make it.

> Training the model may very well be a copyright issue. The images have been copied, they are being used.

What do you think "being used" means here? If I work for a company and download a bunch of text and save it to a flash drive, have I violated copyright? Of course not. If I put that data in a spreadsheet, is it copyright infringement? Of course not. If I use Excel formulas on that text is it infringement? Still no.

And so how can you claim in any way that the creation of a model is anything more than aggregating freely available information?

I don't disagree with you about the use of a model. But training the model is just taking some information and running code against it. That's what's important here.
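The "running code against information" claim can be made concrete with a toy sketch (my illustration, not anything from Copilot's actual pipeline): the simplest possible "training" is just counting co-occurrence statistics over text, which produces a model that is an aggregate over the input rather than a copy of it.

```python
from collections import Counter

def train_bigram_counts(corpus: str) -> Counter:
    """Toy 'training': count adjacent word pairs in a corpus.
    The result is aggregate statistics, not the text itself."""
    words = corpus.split()
    return Counter(zip(words, words[1:]))

# Hypothetical corpus standing in for 'freely available information'.
counts = train_bigram_counts("the cat sat on the mat the cat ran")
print(counts[("the", "cat")])  # this pair occurs twice
```

Whether aggregating at this scale versus at Copilot's scale changes the legal analysis is, of course, exactly what the thread is arguing about.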


> Are there any open source licenses that explicitly forbid use when training an AI?

Github/OpenAI's defense is "training ML systems on public data is fair use" (https://news.ycombinator.com/item?id=27678354). Unless this assertion gets invalidated in court, I think they mostly don't care about the wording of your license.

> I want to release some open source code, but I don't want to make the mistake of training my replacement.

Most of a software engineer's value is not in the code they write. By and large, employers care that you solve their problems. Code is just a by-product, a means to achieve that.


> Some are indeed producing some innovative code

I think anything that reveals something about their infra or customers would be a problem.

> Normally this shouldn't be an issue though?

It doesn't have to be impossible, just hard enough to not be worth it e.g. having to hire a foreign lawyer.


> In contrast when talking about training code generation models there are multiple comments mentioning this is not ok if licenses weren't respected.

I think one of the differences is that people are seeing non-trivial amounts of copyrighted code being output by AI models.

If a 262,144 pixel image has a few 2x2 squares copied directly, you can't tell.

If a 300 line source file has 20 lines copied directly from a copyrighted source, well, that is more blatant.
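A back-of-the-envelope calculation makes the contrast concrete ("a few" squares is taken as 3 here, an assumption; the other numbers are the ones from the comment):

```python
# A few 2x2 pixel squares copied from a 262,144-pixel (512x512) image:
copied_pixels = 3 * (2 * 2)              # assuming "a few" = 3 squares
pixel_fraction = copied_pixels / 262_144

# 20 lines copied directly into a 300-line source file:
line_fraction = 20 / 300

print(f"image: {pixel_fraction:.5%} copied")  # ~0.00458%
print(f"code:  {line_fraction:.2%} copied")   # ~6.67%
print(f"ratio: ~{line_fraction / pixel_fraction:,.0f}x more prominent")
```

Roughly a thousand-fold difference in how much of the work is copied, which goes some way toward explaining why the code case draws more complaints.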


>The difference is that you can't claim open source code is a trade secret that contains algorithms that you can't reimplement in your own code.

You can't use "trade secret", but MS Windows code is also no longer secret: they shared the code with third parties, and it was also leaked. So they can still claim you did not respect the license, TOS, NDAs, etc. I'm not sure "secret" has anything to do with this; proprietary stuff is not necessarily secret. It could be source-available, or scripting-language stuff, or something where finding the original implementation is trivial, like C# and Java.


> I'd much, much rather code be a "trade secret" than a patent.

These are not the only two options. If I understood it right, the programmer is asking for money to develop something that will be available for download for free, but whose source code would be secret.

As a programmer myself (who writes free-as-in-speech software), the source code is the most relevant artifact and the only one that interests me. I would not sponsor the development of a project where the only deliverable is an opaque binary executable where other development models are possible.


> How can they have legal standing for such a claim, unless one of the authors had privileged access to Wolfram Research knowledge to misappropriate a trade secret?

The absence of legal standing can obviously be raised once a lawsuit is underway, but it doesn't prevent the threat of one, and many organizations will knuckle under to the threat of a lawsuit from a wealthy opponent just to avoid the expense of consulting lawyers, as long as avoiding the lawsuit doesn't carry a big cost.


> I see a lot of people claiming that AI is producing code they've written verbatim, but sometimes I wonder if everyone just writes certain things the same way. For the above rangeCheck function, there isn't much opportunity for the individual programmer to be creative.

This point is absolutely going to come up in any lawsuits; because the law does sometimes examine how much creativity there is available in a field before making a determination (Oracle v Google comes to mind). If you can show that there are very, very few reasonable ways to accomplish a goal, and said goal is otherwise not patented or prohibited, it's either not copyrightable or Fair Use, take your pick.

This even applies under the interoperability section of the DMCA and similar laws for huge projects. Assuming that ReactOS, for example, is actually completely clean-room; that would be protected despite having the same API names and, likely, a lot of similar code implementing most of the most basic APIs.


> So to that point, your doom & gloom absolutist scenario could not play out if the product of the model was sufficiently different.

It would absolutely play out, because it is impossible to guarantee that such a model will always produce something "sufficiently different". These models are black boxes with billions of parameters. That's how they work. It's just as unrealistic as those politicians pushing through "lawful access to encrypted data" that'd effectively make strong end-to-end encryption illegal. There's no middle ground here. We either accept that such a model might sometimes output a snippet from its training data and benefit from the 99% of times it doesn't, or we can be copyright maximalists and ensure no one benefits. (Except ironically huge corporations like Microsoft, either because they can license the data for training, or because they have sufficiently well funded legal departments.)

> Now imagine you're a not-Google sized company, are you going to take the chance that Copilot will spit out something that they consider copyrighted?
>
> I think in terms of legal/business risk, it's just too high as it stands now.

This is a fair point, but you can say this about any other interesting model. Is that piece of text generated by GPT-NeoX-20B (which is a fully free and open model trained by essentially hobbyists) illegal to use because it might infringe on someone else's copyright? You don't know. And it was also trained on code from Github. Where are the posts calling for people to sue their authors because they're not respecting the GPL?

Here, I've just tried it and screenshotted it for you, spitting out GPL'd code: https://i.imgur.com/2T4uSJR.png

Again, this is not Copilot. This is the free GPT-NeoX-20B model that anyone can download. The model's not under the GPL, and yet it clearly "contains" GPL'd code. Anything which affects Copilot's legal status will also affect GPT-NeoX-20B, but even more severely, since GPT-NeoX was also trained on a ton of "all rights reserved" data. So when you raise your pitchfork at Copilot, you should also ask yourself: are you fine with also killing projects such as GPT-NeoX, or is a more lax copyright law perhaps more beneficial to society as a whole when it comes to machine learning?


> I think people may be drastically over-valuing their code. If it was emitting an entire meaningful product, that would be something else. But it’s emitting nuts and bolts.

Please refrain from this kind of blatant gaslighting. You're not the one to assess its value or usefulness, and your point is at most tangential to the issue. The problem is that the model systematically took non-public-domain code without any permission from the authors, not whether the code is useful. This complaint is worth hearing, and the Copilot team should be held more accountable for this problem, since it could lead to more serious copyright infringement fights for its users.

