
I think the core argument has much more to do with plagiarism than with learning.

Sure, if I use some code as inspiration for solving a problem at work, that seems fine.

But if I copy some licensed code verbatim and then put it in my commercial product, that's the issue.

It's a lot easier to imagine for other applications like generating music. If I trained a music model on publicly available YouTube music videos, then my model generated music identical to Interstellar Love by The Avalanches, and I used the "generated" music in my product, that's clearly a use that is against the intent of the law.




I'm not talking about copying, but about learning by reading code. If you then synthesize the code you read, surely you don't expect any copyright law to apply in such cases.

You could make the same argument for piracy and remix culture (e.g. sampling parts of songs to make new music), yet for both of these the legal situation is not particularly great. Currently the argument seems to be that "learning" is sufficiently distinct from both, but code hosting websites still tend to explicitly carve such rights out in their user agreements, because the line is a bit blurry.

Yes; similarly, I could definitely create a simple ML algorithm and train it on a single codebase. It could then predictively output code such that it reproduces that entire codebase verbatim.
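
To make that concrete, here is a minimal sketch of such a "model": a character-level Markov chain (the training file name is hypothetical, and this is an illustration, not any real product's pipeline):

    from collections import defaultdict

    K = 8  # context length; long enough that most contexts in one file are unique

    def train(text):
        """Map each K-character context to the characters observed after it."""
        model = defaultdict(list)
        for i in range(len(text) - K):
            model[text[i:i + K]].append(text[i + K])
        return model

    def generate(model, seed, length):
        """Greedy continuation. Trained on a single codebase, nearly every
        context has exactly one follower, so this replays the source verbatim."""
        out = seed
        for _ in range(length):
            followers = model.get(out[-K:])
            if not followers:
                break
            out += followers[0]
        return out

    source = open("licensed_project.py").read()  # hypothetical training file
    model = train(source)
    # Prompt with the file's opening characters and it "predicts" the rest:
    print(generate(model, source[:K], 200) == source[:K + 200])  # usually True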

There's nothing magical about a complex copying method that makes that less of a copyright infringement.

There may be some threshold where it becomes fair use, but I agree with you that it's not cut and dried at all.

The argument that someone could create all possible works of art or music or whatever and copyright them all is a ridiculous idea, from someone who doesn't understand exponentials.


You also wouldn't see them giving their code to people learning to code to read and learn from, but that doesn't necessarily mean that learning from something violates copyright. It seems like a bit of a grey area, where the distinction is between learning from code and using it directly.

In this case, though, we have a machine learning model that is trained on some code and is not merely learning abstract concepts to be applied generally in different domains; instead, it can use that knowledge to produce code that looks pretty much the same as the training material, given a context that fits the training material.

If humans did that, it would be hard to argue they didn't outright copy the source.

When a machine does it, does it matter if the machine literally copied it from sources, or first transformed it into an isomorphic model in its "head" before regurgitating it back?

If yes, why doesn't parsing the source into an AST and then rendering it back also insulate you from having to abide by copyright?
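
For what it's worth, that exact round trip is a few lines with Python's standard ast module (ast.unparse exists as of Python 3.9); comments and formatting are lost, but the program that comes out is the same work that went in:

    import ast

    source = "def fib(n):\n    return n if n < 2 else fib(n - 1) + fib(n - 2)\n"

    tree = ast.parse(source)      # source -> abstract syntax tree
    rebuilt = ast.unparse(tree)   # tree -> source text again
    print(rebuilt)
    # Formatting is normalized and comments are dropped, but the
    # function is behaviorally identical -- clearly still the same code.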


That's cute but code has been copyrightable for decades, especially when it's Microsoft's code.

> as the AI did not use any copyrightable content in determining the output.

uuhhhhhh... didn't it? I thought we were talking about AI systems that output code from their training sets verbatim. If the system reproduces the exact code it was trained on, I don't see how you could possibly argue that the copyrighted training code wasn't used in determining the output.


And this ruling will prove to be a disaster for music creation, as there will be fewer copyright-free spaces for music as time goes on.

A similar ruling would also be a disaster for software, as our tools of expression are very restricted: code is based on Boolean algebra and predicate calculus, on practice guides like design patterns, and on books teaching algorithms and data structures.

There are lots of ways to write bad code and only a few for good, correct code. Recognizing this led me to replicating known working code, code I had created, for multiple employers. So whose copyright did I intentionally violate?

I think we are attacking the wrong problem WRT ML and copyright. To me, ML shows that the foundation on which copyright is built is a lie. We should use ML to break copyright for code.


That's probably not viable under US copyright law, especially given the Bright Tunes Music v. Harrisongs Music precedent. If someone is going to reimplement the algorithms and concepts without a copyright license, they're better off not reading your code, so they don't have to prove in court, to a computer-illiterate jury, that the aspects their code had in common with yours were purely functional rather than creative.

> What if a human reads some restrictively licensed code and years later uses some idea he noticed in that code, maybe even no longer being aware from where this idea comes?

In general, using the idea is fine, whether the code is AI- or human-written. I think the major concern here is when the code is copied verbatim, or near verbatim (i.e. the produced code is not "transformative" of the original).

> But what if the system memorizes entire functions? What if a human does so?

In both of these cases I believe it would be a copyright concern. It is not strictly defined, and it depends on the complexity of the function. If you memorized (|a| a + 1), I doubt any court would call that copying a creative work. But if you memorized the Quake fast inverse square root, it is likely protected under copyright, even if you changed the variable names and formatting.
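
For reference, here is that function as a Python port of the algorithm (the original is C from Quake III Arena; this rendering is just to show the kind of thing being memorized):

    import struct

    def fast_inverse_sqrt(x):
        """Python port of the Quake III fast inverse square root trick."""
        # Reinterpret the float's bits as a 32-bit integer
        i = struct.unpack("<I", struct.pack("<f", x))[0]
        # The famous magic-constant bit hack
        i = 0x5F3759DF - (i >> 1)
        y = struct.unpack("<f", struct.pack("<I", i))[0]
        # One Newton-Raphson step to refine the estimate
        return y * (1.5 - 0.5 * x * y * y)

    print(fast_inverse_sqrt(4.0))  # ~0.4991, vs. the exact 0.5

It is ten lines and re-derivable from the published algorithm, yet the original expression of it is what is arguably protected.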

It seems clear to me that GitHub Copilot is capable of producing code that is copyrighted and needs to be used according to the copyright owner's license. Worse still, it doesn't appear capable of knowing when it is doing that, or what the source is.


If a human """learns""" the code and ends up writing a 1:1 copy of a large function, comments and all, without respecting the license, they are breaking the law. People need to stop twisting this into a different issue than it is: copying is not the same as learning.

> output and weights are deterministic transformations of the inputs;

That may be true, but I fail to see how any process that produces the same content that was fed into it somehow strips the license. If the generated code is novel, then there is no copyright concern and it is just the output of the tool. If the code is a copy but non-creative (for example, a trivial function), then it isn't covered by copyright in the source anyway, so the output is not protected by copyright either. But if the output is a copy and creative, I don't think it matters how complicated your copying process was. What matters is that the code was copied, and you need to obey copyright.
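
As a toy illustration of that point, here is a deterministic, multi-stage transformation whose output is byte-for-byte its input (the file name is hypothetical); nobody would claim the intermediate complexity strips the license:

    import base64
    import zlib

    original = open("gpl_licensed_file.c", "rb").read()  # hypothetical input

    # An arbitrarily convoluted, fully deterministic transformation...
    blob = base64.b85encode(zlib.compress(original, 9))
    # ...followed by its inverse.
    roundtrip = zlib.decompress(base64.b85decode(blob))

    assert roundtrip == original  # same bytes, same copyright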

Again, I don't think that novel code generated from being trained on copyrighted code is the problem. I think it is just the verbatim (or minimally transformed) copying that is the issue.


Am I the only person who feels that it's copyright that's the issue rather than machine learning training sets?

Consider a new additional feature added on to Copilot - a language aware rewriting tool that transforms the initial generated code into a new form with equivalent functionality.

It would be nearly impossible to trace the original code or make a copyright claim.

However - you could use this same trick directly on copyrighted code. Now things are even murkier...
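
A crude sketch of such a rewriter, using Python's ast module (a real tool would respect scopes and builtins and rewrite structure too, not just names, but even this defeats naive text diffing):

    import ast

    class Renamer(ast.NodeTransformer):
        """Give every variable, argument and function an anonymous name."""
        def __init__(self):
            self.names = {}

        def _fresh(self, old):
            return self.names.setdefault(old, f"v{len(self.names)}")

        def visit_Name(self, node):
            node.id = self._fresh(node.id)
            return node

        def visit_arg(self, node):
            node.arg = self._fresh(node.arg)
            return node

        def visit_FunctionDef(self, node):
            node.name = self._fresh(node.name)
            self.generic_visit(node)
            return node

    source = "def total(xs):\n    acc = 0\n    for x in xs:\n        acc += x\n    return acc\n"
    print(ast.unparse(Renamer().visit(ast.parse(source))))
    # def v0(v1): ... -- equivalent behavior, textually unrecognizable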

But I would argue that this is essentially what our brains are doing. I've read code, gotten the gist of it, and written my own version. Technically it's not a clean-room reimplementation, but an average coder wouldn't realistically expect to get sued for copyright infringement for doing this.

Maybe they should, but if you're an open source advocate and you've reached this position, then there's something very weird going on.

I always thought the idea of open source was to use copyright against itself because we believed in openness, not to embrace it and just throw out one small aspect of it.


Copyright. Copyright. That is the issue. If you reproduce the code verbatim, then you are in violation. This is what the AI is doing.

Just learning from the GPL code to make yourself smarter is not the problem.


A lot of comments here keep drawing parallels to writing as for why this is right or wrong, but I think a more apt comparison would be music, where components like riffs or rhythms can be reused to make something wholly different. Many a musician has claimed IP infringement over another musician using similar melodies, but just like programming, if you look closely enough, you'll see that everyone is copying each other and there's not much that can be done to stop it.

Personally, I'm a bit bothered by this, but I'd be lying if I said I never once got an idea by looking at the source code of a GPL project.


If you write it yourself, it's fine. If you directly copy it from somewhere you aren't allowed to copy from, then it is wrong.

There are no rules about the form of the code itself that govern whether or not someone owns it. Common sense applies. Sure, you could "steal" very small, common code snippets and get away with it; but that doesn't make it less wrong.

When a commercial entity explicitly does it, however, sometimes we can catch them - for example, if they do it through algorithms whose workings we more or less understand, i.e. the algorithm is using advanced control-flow logic to copy and paste from its training data set, and copyrighted material is in that data set.


For this reason, and a few others, my workplace simply put a blanket ban on these kinds of tools. If our code is never exposed to the learning tool, it’s never in danger of showing up somewhere else.

Incidentally, I feel like these tools expose how fallacious the whole idea of “copyrighting code/math” is. If a tool can generate the efficient methods of achieving a result, I think it becomes obvious that one shouldn’t be able to protect them via IP law.


We can't yet equate ML systems with human beings. Maybe one day. But at the moment, it's probably better to compare this to a compiler being fed licensed code: the compilation output is still subject to the license, regardless of how fancy the compiler is.

Also, a human being who reproduces licensed code from memory - because they read that code - is committing a license violation. The line between a derivative work and an authentically new original creation is not a well defined one. This is why we still have human arbiters of these decisions rather than formal differential definitions. This happens in music, for example, all the time.


Just because code exists in a copyrighted project doesn't mean that it is the only instance of that code in the world.

In a lot of scenarios there is an existing best practice, or simply only one real 'good' way to achieve something. In those cases, are we really going to say that, despite the fact that a human would reasonably come to the same output code, the AI can't produce it because someone else wrote it already?


It's also a license violation to train an AI on Open Source code, generate "new" code from that model even if it's not an exact copy, and ignore the licenses of the input.

That's not exactly a given that we can simply take as true. Of course that's borderline a trite tautology about any legal issue, but I'd argue that this is even fuzzier than usual. If a human writes some code, after having seen a given corpus of code previously, the "new" code might or might not be a derivative work of that corpus. It's not clear that replacing the human with an AI somehow changes the equation so categorically that it becomes automatic to consider the output of the AI a derivative work.

License violations don't suddenly become acceptable just because you're violating a million licenses at once.

No, but if either a human or an AI emits a given line of code, and that line of code can't be shown to have been cribbed from some corpus of existing code, or to be substantially similar to such, then why wouldn't it be considered original work in both cases?
