So, I can't see how they can argue that the generated code is not a derivative of at least some of the code that it was trained on, and therefore encumbered by complicated copyright claims that are impossible for anyone other than GitHub to disentangle. If they haven't even been careful to only use software under one license that does not require the original author to be attributed, then I don't see how it can even be legal for them to be running the service.
All that said, I'm not confident that anyone will stop them in court anyway. This hasn't tended to be very easy when companies infringe other open source code copyright terms.
Until it is cleared up though, it would seem extremely unwise for anyone to use any code from it.
I am not a lawyer... There may be an argument here that
1. GitHub has a valid license to distribute it (as a result of their TOS)
2. Running the downloaded code is not copyright infringement (or not obviously so, and hasn't been established as so in any court that I am aware of)
3. Using the APIs is not copyright infringement (see Oracle v Google, if that was fair use this almost certainly is)
Thus no copyright infringement has occurred.
Still, keeping this in the codebase is at best boobytrapping your code to create accidental future instances of copyright infringement, and it's an interesting case of people not checking licenses (since it's pretty clear they didn't realize this in advance).
An interesting code-as-speech argument could be made here for sure. Just because one of the effects (even the main effect) of the software is facilitating copyright violations doesn't mean that the code itself is violating any copyright. I don't think GitHub will fight this, they probably don't want to pick a battle with some of the most monied legal interests around, but I sure wish they would.
That may be true, although even GitHub doesn't know for sure. But the problem remains: they're reproducing other people's code without regard to license status.
Until then all you have is opinions, mine is pretty straightforward: if the generative model can be made to work without first training it on other people's code then it isn't copyright infringement, if not then it is transforming one set of works into another.
The only thing that might let GitHub off the hook is their terms of service, but that might mean mass exodus from GitHub because if they interpret you using GitHub to host your code as a blanket permission to do with that code whatever they want then that's clearly not the original intent of the service.
If Microsoft claims that buying GitHub gave them a blanket license to do as they please with the contributions of millions of FOSS contributors, then they are still just as bad as they were in the past.
Almost every GitHub repository comes with a license file, even GitHub should have to abide by that license, otherwise the whole thing is pointless.
It will not be GitHub that will get sued. It'll be the developers that use the code without attribution.
The copyright infringement might not matter if code from individual developers is being used - they usually don't sue. But once this happens to say Oracle's copyrighted code... Well, that is going to be interesting.
This seems to ignore the widely repeated claim that GitHub's terms of service explicitly grant them a license beyond the actual open source license attached to the code and thus transfer the burden of liability to the uploader when it comes to code they can not control the licensing of.
So either this is about code authored by people who did not use GitHub (in which case GitHub would be immediately liable, though they could try to sue whoever uploaded that code to GitHub for damages) or it's going to have to argue that the terms of service can't smuggle in a provision that effectively sidesteps even the most permissive open source licenses.
It's not at all obvious they'd lose. Github is not violating anybody's copyright. They're only distributing source code, presumably with the full consent of the authors.
But that doesn't address code licensing. GitHub is attempting to dance around the licensing issue by reframing the argument as a pure copyright issue. There is no such thing as 'fair use' when it comes to licensed code unless it's implicitly granted by the license. For example, you can't take GPLv3 licensed code, copypasta it into your proprietary app and call it 'fair use'. GitHub can't legally indemnify devs from using 3rd party licensed code.
Sorry, to be clear, I meant even if a Github user asserts their code is public-domain/no-attribution/unlicensed, they could have lifted it off a codebase that doesn't allow it. It would be tricky for Github to establish the code was indeed original and hence their agreement with the user allows them to train their models on it.
GitHub’s terms of service explicitly allow them to do what they did, if anyone is in violation of the licenses you’re referencing it’s the people that loaded the code into GitHub.
You are right in asking that question. It just happens that when there are thousands of small violators, it's not easy for lawsuits to proceed. When there's a central entity involved, GitHub in this case, there is someone discrete to be challenged or talked about.
What GitHub has done is that they have not claimed rights over the generated code, and have passed on the responsibility to the developer/entity using the generated code. In other words, they are supplying a tool that can make copies, not necessarily violating the copyright themselves {Footnote 1}. So the Betamax ruling can come into play here, AFAIK.
What's thereby needed is that the developers now need to be educated about it, which is what the articles like the OP will end up doing. The developers often aren't well-educated about this stuff themselves, and presume that using generated code is just fine. See how I was downvoted here, as an example: https://news.ycombinator.com/item?id=27735484
{Footnote 1}: Whether the GitHub Copilot model itself is violating copyrights is another matter. If the training data has included AGPL-licensed code, even that may be a violation. However, several other licenses would not be violated until the model is "distributed".
>“If you look at the GitHub Terms of Service, no matter what license you use, you give GitHub the right to host your code and to use your code to improve their products and features,” Downing says. “So with respect to code that’s already on GitHub, I think the answer to the question of copyright infringement is fairly straightforward.”
Not as straightforward as they think, though.
If a code project used (a)gpl code found elsewhere on the internet in its repo, and another user took the project and hosted it on github, the tos cannot give github a license to use the code outside of the license given by (a)gpl. Even if github thinks they have one, that won't shield them from legal liability, nor will it shield co-pilot users from being legally compelled to (a)gpl their code if a court case was won on those grounds.
The github tos is basically a non-factor in this case.
GitHub used code that wasn't under any license at all, just publicly visible. Their claim is not that the license allows what they're doing, but that they do not need a license.
I can't imagine a scenario in which any lawyer would consider granting Github the right to "analyze" code anywhere close to granting Github the right to spit out that same code verbatim without your copyright notice (even if laundered by AI).
I would assume github could supersede your license by putting its own claim to your code in the TOS. I doubt they have done that, but just pointing it out.
I was thinking the same. It being published on GitHub makes it especially difficult to claim damages for reuse, as I think it’s fair to say the very act of publishing to GitHub implies a desire to share the code.
The real issue would be if the author decides they want to close source it later, updates the license, and then demands everyone stop using it.
Can they even pull the AGPL license? The re-licensing of current/new code is one thing, but removing the license on the old code? How is that possible? What about copies elsewhere, are those no longer validly licensed? That's not how I thought licenses work.
These guys just appear to go from bad to worse, ignorantly making shit up as they go (on a road to certain self-destruction, if you'd ask me).
Also, why don't I hear more people talking about how GitHub was used not to host any code at all, but only as a (marketing) referral point for proprietary stuff somewhere else? I said goodbye to GitHub a while ago, so I wasn't aware that this had become acceptable practice.
Anyways, if this was a court case .. this would probably be the point where a judge would look these fellows straight in the eyes and say something along the lines of: "With the arrogant behavior you so clearly displayed, you yourselves have squandered any of the sympathy, lenience, and benefit of the doubt I usually give to defendants, and I will hereby sentence you to the harshest possible punishment. Not only to give a clear indication that your behavior is in no way acceptable, but also to serve as a reminder for any future parasites, whenever they think of making up their own rules as they see fit".
Either way, I think this shit is going to haunt these guys. Unless I got it all completely wrong, probably well deserved too.