In this case, though, we have a machine learning model that is trained on code, and it is not merely learning abstract concepts to apply generally across domains; it can use that knowledge to produce code that looks almost identical to the training material, given a context that matches the training material.
If humans did that, it would be hard to argue they didn't outright copy the source.
When a machine does it, does it matter if the machine literally copied it from sources, or first transformed it into an isomorphic model in its "head" before regurgitating it back?
If yes, why doesn't parsing the source into an AST and then rendering it back also insulate you from copyright infringement?
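For what it's worth, that round trip is trivial to demonstrate (a minimal sketch using Python's stdlib ast module; the gcd snippet is just an illustrative stand-in for someone else's code):

```python
import ast

# Parse a snippet into an abstract syntax tree, then render it back.
# The round-tripped output is textually near-identical to the input,
# so the intermediate representation plainly doesn't erase authorship.
source = """
def gcd(a, b):
    while b:
        a, b = b, a % b
    return a
"""

tree = ast.parse(source)      # source -> AST
rendered = ast.unparse(tree)  # AST -> source (requires Python 3.9+)
print(rendered)               # prints essentially the original code
```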
>When a machine does it, does it matter if the machine literally copied it from sources, or first transformed it into an isomorphic model in its "head" before regurgitating it back?
You've hit the nail on the head here. If this is okay, then neural nets are simply machines for laundering IP. We don't worry about people memorizing proprietary source code and "accidentally" using it because it's virtually impossible for a human to do that without realizing it. But it's trivial for a neural net to do it, so comparisons to humans applying their knowledge are flawed.
This is not such a big problem in reality, because the output of Copilot can be filtered to exclude snippets too similar to the training data, or to any corpus of code you want to avoid. It's much easier to guarantee clean output than to train the model in the first place.
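Something like the following would do it (a rough sketch in Python; the token n-gram metric, the n-gram size, and the threshold are my own assumptions, not Copilot's actual mechanism):

```python
import io
import tokenize

# A post-hoc similarity filter: reject generated code whose token
# n-grams overlap too heavily with a reference corpus. Assumes the
# generated code tokenizes cleanly as Python.

def ngrams(code: str, n: int = 8) -> set:
    """Token n-grams of a code string, skipping comments and layout tokens."""
    skip = {tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
            tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER}
    toks = [t.string for t in tokenize.generate_tokens(io.StringIO(code).readline)
            if t.type not in skip]
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def too_similar(generated: str, corpus: list, threshold: float = 0.25) -> bool:
    """True if enough of the generated n-grams already appear in the corpus."""
    gen = ngrams(generated)
    if not gen:
        return False
    seen = set().union(*(ngrams(doc) for doc in corpus))
    return len(gen & seen) / len(gen) >= threshold
```

In practice you'd precompute the corpus n-grams once rather than rebuilding them per query, but the filtering step itself really is cheap compared to training.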
That's a really good observation. Perhaps it highlights an essential difference between two modes of thought - a fuzzy, intuitive, statistical mode based on previously seen examples, and a reasoned, analytical calculating mode which depends on a precise model of the system. Plausibly, the landscape of valid musical compositions is more continuous than the landscape of valid source code, and therefore more amenable to fuzzy, example-based generation; it's entirely possible to blend two songs and make a third song. Such an activity is nonsensical with source code, and so humans don't even try. We probably do apply that sort of learning to short snippets (idioms), but source code diverges too rapidly for it to be useful beyond that horizon.