
Even if you use it for personal projects.

To be safe, we'd have to get Microsoft to agree to indemnify users (if they really believe using this is safe, they should do so), or wait until a court case on copyright as it regards training corpora for large models is decided and appeals are exhausted.




I'm not sure if most data that these models are trained on is copyrighted, but I feel pretty safe saying that a majority of data that human beings have created is copyrighted. Think movies, books written recently, every website that isn't explicitly "creative commons" or something similar, code that isn't permissively licensed, etc.

We definitely need clarification, but however long the first court case takes, there will be an appeal, and then probably several more. So I'm afraid we're going to be living in limbo for at least a decade, which is sort of an answer in and of itself, since by that time services like this will have become pervasive and will have been integrated into workflows across the planet.

It seems to me that training on MS proprietary code is perfectly legal, but how you acquire that code is probably important. If you are able to decompile the code from your Windows machine and use it for training then that looks A-OK, but if you use Microsoft code that was leaked as part of a hack then maybe that's a different story since you're in possession of stolen property.


Unless that's directly in the contract, I wouldn't trust Microsoft to discard anything that might be of value in a training dataset at some point in the future

say, if training is determined to be fair use


Yes. Even if it may be permitted under some licenses, training models off millions of developers' code and capitalizing on those models goes against the spirit of open source software. I'd expect nothing less from Microsoft.

It shouldn't even hinge on a TOS.

If Microsoft loses this case, it actually means Microsoft wins and we all lose.

Who has a large enough corpus of training data? Only institutional copyright holders.

This is probably going to play out like Oracle vs. Google when Google suddenly realized that they should lose and intentionally threw the case.

I'm so worried about this case. Treating copyrighted training data as fair use, and letting models learn as a child might learn from a book or movie, is the best way to proceed. It widens the playing field for both development and disruption.


Well, at least in the US training an AI will probably fall under Fair Use. In the EU there is an explicit copyright exception for data mining. So I don't think there's a legal obligation for Microsoft to only train within the bounds of public GitHub repos.

It doesn't matter how it's used. Do you think Microsoft would be happy with someone training a model on Windows source code, as long as they didn't use it to reproduce the code?

I'm pretty sure DALL-E was trained only on non-copyrighted material (or so they say :| ).

But to be honest, if your code is open source, I'm pretty sure Microsoft doesn't care about the license; they'll just use it because "reasons". Same with Stable Diffusion: they don't care about the data. If it's on the internet, they'll use it, so this is a topic that will probably be regulated in a few years.

Until then, let's hope they both get milked (Microsoft and NovelAI) for illegal content usage, and I seriously hope at least a few lawyers try milking them ASAP, especially NovelAI, which illegally used a lot of copyrighted art in its training data.


Depending on the outcome of the recent court cases related to AIs like Copilot and Stable Diffusion, this might not be of use at all.

If courts rule that training, distributing, and using models without regard for the data authors' consent is fair use (which has a high chance of happening), then the license is worthless.


Thanks for the clarification. So according to their theory, I could train a model on any code, even private Microsoft code, and that would be okay? That sounds surprising to me.

The Pile is still used to train LLMs and it's still very much available on the net. I agree it's a risk to train your models on the dataset until the legal implications are worked out, but it doesn't seem to be stopping people.

Well upthread the discussion changed to using a non-open license that prevents people from training AI on it. If you released software under such a license, someone re-uploading to Github would probably be violating their terms or yours. Regardless, Microsoft would probably remove the repo if you contacted them to let them know you're the copyright holder, and the software license is incompatible with their terms.

It remains to be seen if they have a way to then clean their training data of the influence.

It would be the same situation if someone uploaded any other proprietary code.


I think we should be able to use any data as fair use for the purpose of training of models. Otherwise data will be owned by the privileged few, and most of us will never be able to compete.

By hoping Microsoft loses this case, you're hoping they gain the permanent upper hand.


It's not really a question of what it's used for, but what it's used with.

Training on all the Old Masters and all public domain art? No problem.

Training on people's work posted on the internet to which they retain copyright? Not fine.

Training on Microsoft's own huge codebase for their tool, and calling for voluntary authorizations from the public? Fine.

Hoovering it up with not so much as a by-your-leave, effectively creating a license filter that sells the output as copyrightable?

Well you know.

(edited for tone, I get pissed about this...)


I don't think it's possible to have an "open training data" model because it would get DMCA'd immediately and open you up to lawsuits from everyone who found their works in the training set.

I wonder if this is legally enforceable. If Microsoft is right with Copilot on the point that training models counts as "fair use", then the only thing stopping people from fine tuning their own models are these terms of use, which seems like it can be easily sidestepped by having an intermediate person/company to "process" the outputs.

They should just mention that the content is only available for people to train their ML models on.

If Microsoft can get away with it, we might as well do the same.


Yes - to me, this is a much more obvious hazard than the copyright arguments back and forth around the training data.

There's no way any reasonable corporate legal or risk department should or would allow this plug-in to be installed on their developers' computers, for this reason alone.

I am surprised this is not being talked about more, given the massive conversation around the regurgitation / rote learning arguments.


So how long until we see new software licenses that prohibit any use of the code for model-training purposes? I'd be willing to bet there's a significant group of people who won't be happy regardless of whether the reproduction is literal or not; the fact that their code was used in training might be enough.

Unfortunately the way IP law works, at least in the US, is that you can use essentially whatever you want as training data and it's up to the user to make sure none of the generated code violates licensing agreements.
