In this case, though, we have a machine learning model that is trained on code, and it is not merely learning abstract concepts to apply generally across domains; it can use that knowledge to produce code that looks almost identical to the training material, given a context that matches the training material.
If humans did that, it would be hard to argue they didn't outright copy the source.
When a machine does it, does it matter if the machine literally copied it from sources, or first transformed it into an isomorphic model in its "head" before regurgitating it back?
If yes, why doesn't parsing the source into an AST and then rendering it back also insulate you from copyright infringement?
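For what it's worth, that round trip is trivial to demonstrate (a minimal sketch using Python's stdlib ast module; the gcd snippet is just an illustrative stand-in for someone else's code):

```python
import ast

# Parse a snippet into an abstract syntax tree, then render it back.
# The round-tripped output is textually near-identical to the input,
# so the intermediate representation plainly doesn't erase authorship.
source = """
def gcd(a, b):
    while b:
        a, b = b, a % b
    return a
"""

tree = ast.parse(source)      # source -> AST
rendered = ast.unparse(tree)  # AST -> source (requires Python 3.9+)
print(rendered)               # prints essentially the original code
```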
>When a machine does it, does it matter if the machine literally copied it from sources, or first transformed it into an isomorphic model in its "head" before regurgitating it back?
You've hit the nail on the head here. If this is okay, then neural nets are simply machines for laundering IP. We don't worry about people memorizing proprietary source code and "accidentally" using it because it's virtually impossible for a human to do that without realizing it. But it's trivial for a neural net to do it, so comparisons to humans applying their knowledge are flawed.
This is not such a big problem in reality, because the output of Copilot can be filtered to exclude snippets too similar to the training data, or to any corpus of code you want to avoid. It's much easier to guarantee clean output than to train the model in the first place.
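Something like the following would do it (a rough sketch in Python; the token n-gram metric, the n-gram size, and the threshold are my own assumptions, not Copilot's actual mechanism):

```python
import io
import tokenize

# A post-hoc similarity filter: reject generated code whose token
# n-grams overlap too heavily with a reference corpus. Assumes the
# generated code tokenizes cleanly as Python.

def ngrams(code: str, n: int = 8) -> set:
    """Token n-grams of a code string, skipping comments and layout tokens."""
    skip = {tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
            tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER}
    toks = [t.string for t in tokenize.generate_tokens(io.StringIO(code).readline)
            if t.type not in skip]
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def too_similar(generated: str, corpus: list, threshold: float = 0.25) -> bool:
    """True if enough of the generated n-grams already appear in the corpus."""
    gen = ngrams(generated)
    if not gen:
        return False
    seen = set().union(*(ngrams(doc) for doc in corpus))
    return len(gen & seen) / len(gen) >= threshold
```

In practice you'd precompute the corpus n-grams once rather than rebuilding them per query, but the filtering step itself really is cheap compared to training.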
That's a really good observation. Perhaps it highlights an essential difference between two modes of thought - a fuzzy, intuitive, statistical mode based on previously seen examples, and a reasoned, analytical calculating mode which depends on a precise model of the system. Plausibly, the landscape of valid musical compositions is more continuous than the landscape of valid source code, and therefore more amenable to fuzzy, example-based generation; it's entirely possible to blend two songs and make a third song. Such an activity is nonsensical with source code, and so humans don't even try. We probably do apply that sort of learning to short snippets (idioms), but source code diverges too rapidly for it to be useful beyond that horizon.