GitHub Copilot, with “public code” blocked, emits my copyrighted code (twitter.com)
914 points by davidgerard | 2022-10-16 | 806 comments




GitHub Copilot is not AI at all; it is just a dumb code regurgitator that sells you code you wrote on GitHub and shamelessly takes all the credit for it.

it's totally AI, in the "legal responsibility laundering" sense. This is the main present-day use case for saying "AI".

Yeah you can just throw some "AI" into the code and suddenly people are okay with it stealing licensed code. I really thought technically inclined people would be resistant to this type of manipulation from advertisers but it seems like most people are totally going for it!

Not just plagiarism - today you could be subjected to consequential mistreatment "because the AI told us to, not us". The concept of "legal responsibility laundering" the GP used could cover both kinds of infringement.

Hopefully you understand how artists feel about DALL-E and Midjourney now.

I like that if you prompt these with specific artists names, they try their best to rip those particular artists off.

I use copilot in my work every day, but only in places where I know the code cannot be regurgitated because what I'm doing has never been done before.

I can write an HTML form, then prompt copilot to generate a serializable class that can be used to deserialize that form on the server. I can write a test for one of our internal apis, and for every subsequent test I can just write the name of what I expect it to check and it generates a test that correctly uses our internal APIs and verifies the expected behavior.
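To make the first example concrete, here's the kind of completion I mean, sketched in Python (hypothetical form and names, not our actual stack or Copilot's literal output):

    # The HTML form sitting earlier in the project (hypothetical):
    #   <form action="/signup" method="post">
    #     <input name="email" type="email">
    #     <input name="age" type="number">
    #     <input name="newsletter" type="checkbox">
    #   </form>
    from dataclasses import dataclass

    @dataclass
    class SignupForm:
        email: str
        age: int
        newsletter: bool = False

        @classmethod
        def from_post(cls, form: dict) -> "SignupForm":
            # Deserialize the posted fields, coercing types as the form implies.
            return cls(
                email=form["email"],
                age=int(form["age"]),
                newsletter=form.get("newsletter") == "on",
            )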

You can have problems with the ethics of how GitHub and OpenAI produced what they did, but to describe it the way that you did requires never having really attempted to use it seriously.


This is a huge and looming legal problem. I wonder if what should be a big uproar about it is muted by the widespread acceptance/approval of GitHub and related products, in which case it's a nice example of how monopolies damage communities.

I think it won't become a legal problem until Copilot steals code from a leaked repository (e.g. the Windows XP source code) and that code gets reused in public.

Only then will we see an answer to the question "is making an AI write your stolen code a viable excuse".

I very much approve of the idea of Copilot as long as the copied code is annotated with the right license. I understand this is a difficult challenge but just because this is difficult doesn't mean such a requirement should become optional; rather, it should encourage companies to fix their questionable IP problems before releasing these products into the wild, especially if they do so in exchange for payment.


> I very much approve of the idea of Copilot as long as the copied code is annotated with the right license.

I agree, but I somehow doubt that will ever happen. Partly because MS is motivated to muddy the waters and shift norms towards allowing more of this kind of license-defying copying (because they make money from a product that does just that), and partly because the market for the most part doesn't think clearly about these issues. Many commenters here seem really fuzzy on the fact that nearly all code is, with or without an explicit statement as such, copyrighted (thanks, Berne convention), that that copyright (with or without documentation) is owned by someone, and that it is licensing which grants use of copyrighted work under specific circumstances. So as you say, the real problem is losing information about the license.

I'm grateful that the author of some LGPL'd code has triggered this discussion, since it's a more consequential license w.r.t. code reuse.


Microsoft should just train it on all their proprietary code instead. See how sanguine they are about it then.

Who said they haven't?

for something to show up verbatim in the output of a textual AI model it needs to be an input many times.
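Copilot is a transformer rather than a Markov chain, but the duplication effect is easy to demonstrate with a toy word-level Markov model (a sketch with made-up snippets):

    from collections import Counter, defaultdict

    # Toy corpus: one "popular" snippet duplicated many times, one unique one.
    popular = "def is_even ( n ) : return n % 2 == 0"
    unique = "def secret_sauce ( x ) : return x * 41 + 7"
    corpus = [popular] * 50 + [unique]

    # Order-2 model: count (w[i-2], w[i-1]) -> w[i] transitions.
    transitions = defaultdict(Counter)
    for line in corpus:
        w = line.split()
        for a, b, c in zip(w, w[1:], w[2:]):
            transitions[(a, b)][c] += 1

    def complete(prompt, max_len=20):
        # Greedy completion: always emit the most frequent next token.
        out = prompt.split()
        while len(out) < max_len:
            nxt = transitions.get((out[-2], out[-1]))
            if not nxt:
                break
            out.append(nxt.most_common(1)[0][0])
        return " ".join(out)

    # Prompting with the start of the heavily duplicated snippet
    # regurgitates it verbatim:
    print(complete("def is_even"))
    # -> def is_even ( n ) : return n % 2 == 0

The duplicated snippet dominates the transition counts, so the "most likely continuation" is an exact copy.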

I wonder if the problem is not copilot, but many people using this person's code without license or credit, and copilot being trained on those pieces of code as well. copilot may just be exposing a problem rather than creating one.

I don't know much about AI, and I don't use copilot.


Microsoft have a public statement that they don't use proprietary code, only public code with public licenses. They have a lot of companies as customers who use GitHub, and they also use a lot of third-party code in their own products.

Even BSD et al. have attribution requirements - that must leave a vanishingly small amount of code available to use. Methinks the people who run GitHub (who have apparently decided to abandon the core business for the latest fun project) aren't being entirely upfront.

I thought they said all public repos without regard to the license they are under, which could be a proprietary EULA.

there's exactly no way they have

I'm curious how you could possibly know that for sure.

because Microsoft is known to be extremely protective of their code. there is just no way they would expose their internal code to being straight up decoded from the model, while they can just train the model on the huge public data of GitHub

With the amount of resources that Microsoft has, how hard can it be for them to exclude proprietary code that other people have stolen? I’d bet it is easy for them, but they won’t do it. Because they don’t care, because who is gonna take on them?

Will they “accidentally” include proprietary code from say, Oracle? Nope. They’ll make sure of it. But Joe Random? Sure


As a thought experiment: what do we all suppose would be the impact to Microsoft if they deliberately made public the proprietary source code for all of their publicly available commercial products and efforts (including licensed software, services; excluding private contracts, research), but the rest of their intellectual property and trade secrets remained private?

Since I’m posing the question, here’s my guess:

- Their stock would take at least a short term hit because it’s an unconventional and uncharacteristic move

- The code would reveal more about their strategic interests to competitors than they’d like, but probably nothing revelatory

- It might confirm or reinforce some negative perceptions of their business practices

- It might dispel some too

- It may reduce some competitive advantage amongst enormous businesses, and may elevate some very large firms to potential competitors

- It would provide little to no new advantage to smaller players who aren’t already in reach of competing with them and/or don’t have the resources to capitalize on access to the code

- It would probably significantly improve public perception of the company and its future intentions, at least among developers and the broader tech community

In other words, a wash. Overall business impact would be roughly neutral. The code has more strategic than technical value, and there are few who could leverage the technical value into any kind of revenue center with growth potential. Any disadvantage would be negated by the public image goodwill it generated.

Maybe my take is naive though! Maybe it would really hurt Microsoft long term if suddenly everyone can fork Windows 11, or steal ideas for their idiosyncratic office suite, or get really clever about how to get funded to go head to head with Azure armed with code everyone else can access too.


If they’d open source their software I wouldn’t have to wait two months till they finally release the pdbs for the kernel after every 2XH1 / 2XH2 update.

It’s so annoying that they are sooooo slow at this and we have to keep our users from upgrading after every release.


If they released all the source, I'd be able to run the nice drawing app from Windows Ink Workspace again, unkilling the app they want dead

Some day, I think the Windows source will be public, at least for reference purposes.

They already have one open source part I know of, the new conhost[0].

[0] https://github.com/microsoft/terminal


I think Microsoft, their ISVs, and everyone would benefit a lot if Windows were "open source" in the narrow sense - viewable source code, with a license to compile and use only to the extent that you already own the requisite Windows license(s).

Pirating Windows is already utterly trivial with KMS activation, so it's not like they'd lose anything there.


They avoided answering this question at all costs.

Because it exposes their direct hypocrisy in this: it's fair use for OSS but not for us.

Questions here are very important, and it's no surprise GitHub avoided answering anything about CoPilot's legality:

https://sfconservancy.org/GiveUpGitHub/


Maybe open-source licenses need to be revised to disallow this sort of thing, e.g. by saying that anything trained on GNU-licensed data must also carry that license.

I’ve noticed that people tend to disapprove of AI trained on their profession’s data, but are usually indifferent or positive about other applications of AI.

For example, I know artists who are vehemently against DALL-E, Stable Diffusion, etc. and regard it as stealing, but they view Copilot and GPT-3 as merely useful tools. I also know software devs who are extremely excited about AI art and GPT-3 but are outraged by Copilot.

For myself, I am skeptical of intellectual property in the first place. I say go for it.


I think sadly it's just people being protective. The technology is interesting, so if it doesn't hit their line of work it's fantastic; if it does, then it's terrible.

There is no arguing against it though - you can't stop it. All this stuff is coming eventually to all of these areas, so you might as well try to find ways to use the opportunities while some of this is still new.


I mean we definitely can stop it. Laws are a pretty strong deterrent.

"We" maybe can't stop it. But if there were the political will to kneecap many uses of machine learning, it's not obvious there's any reason it couldn't be done even if not 100% effective. Whether that would be a good thing is a different question.

You can slow this; you can't stop it whatsoever. It's about as futile an effort as trying to stop piracy. People are ALREADY running Salesforce CodeGen and Stable Diffusion at home. You can't put the genie back in the bottle, and what we'll have 20 years from now is going to give critics of these tools nightmares.

If you try to outlaw it, the day before the laws come into effect, I'm going to download the very best models out there and run it on my home computer. I'll start organising with other scofflaws and building our own AI projects in the fashion of leelachesszero with donated compute time.

You can shut down the commercial versions of these tools. You can scare large corporations into banning the use of these tools. You can pull an uno reverse card and use modified versions of the tools to CHECK for copyright infringement and sue people under existing laws, AND you'll probably even be able to statistically prove somebody is an AI user. But STOPPING the use of these tools? Go ahead and try; it won't happen.


> You can slow this, you can't stop it whatsoever. It's about as ultimately futile as an effort as trying to stop piracy. ... But STOPPING the use of these tools? Go ahead and try, won't happen.

So? No one needs to stop it totally. The world isn't black and white, pushing it to the fringes is almost certainly a sufficient success.

Outlawing murder hasn't stopped murder, but no one's given up on enforcing those laws because of the futility of perfect success.

> If you try to outlaw it, the day before the laws come into effect, I'm going to download the very best models out there and run it on my home computer. I'll start organising with other scofflaws and building our own AI projects in the fashion of leelachesszero with donated compute time.

That sounds like a cyberpunk fantasy.


Cyberpunk sure, but fantasy? Not at all.

> Cyberpunk sure, but fantasy? Not at all.

The fantasy is the idea that doing what you describe will matter.


You'll never be able to push it to the fringes because there will never be a legal universal agreement even from country to country on where to draw the line.

And as computers get more powerful and the models get more efficient it'll become easier and easier to self host and run them on your own dime. There are already one click installers for generative models such as stable diffusion that run on modest hardware from a few years back.


> You'll never be able to push it to the fringes because there will never be a legal universal agreement even from country to country on where to draw the line.

Huh? "Legal universal agreement" has never been required to push something to the fringes in a particular country.

If (in the US) these models were declared to be copyright infringement, or the users were required to pay license fees to the creators of the data that was used to build the models, they will vanish from the public sphere. GitHub/Microsoft's legal department will pull Copilot down immediately, and development will effectively cease. No US company will sponsor development, and no company will allow in-house use. It will be dead.

Some dude might still run the model in his bedroom in his spare time on his own hardware, but that's what irrelevance looks like.

> And as computers get more powerful and the models get more efficient it'll become easier and easier to self host and run them on your own dime. There are already one click installers for generative models such as stable diffusion that run on modest hardware from a few years back.

If that's the only way you can run something, because it's illegal, you're describing a fringe technology right there.


What would the law do? Forbid automatic data collection and/or indexing and further use without explicit copyright-holder agreement? That would essentially ban the whole internet as we know it. Not saying that would be bad, but it is never going to happen; there's too much accumulated momentum in the opposite direction.

To your point, the law can do a lot of things. The issue here is the clarity and ability to enforce the law.

Laws in which nation and enforced by which juries?

I'm pretty sure DALL-E was trained only on non-copyrighted material (they say so :| ).

But to be honest, if your code is open source I'm pretty sure Microsoft doesn't care about the license - they'll just use it because "reasons". Same with Stable Diffusion: they don't give a fuck about the data; if it's on the internet, they'll use it. So it's a topic that will probably be regulated in a few years.

Until then, let's hope they get milked (both Microsoft and NovelAI) for illegal content usage, and I seriously hope at least a few lawyers will try milking it ASAP, especially NovelAI, which illegally used a lot of copyrighted art in its training data.


> I'm pretty sure DALL-E was trained only on non-copyrighted material

Nope. DALL-E generates images with the Getty Watermark, so clearly there’s copyrighted materials in its training set: https://www.reddit.com/r/dalle2/comments/xdjinf/its_pretty_o...


Thanks for pointing this out, I'd never seen that before. If they used copyrighted images they should get punished - in the original paper they say no copyrighted content was used, but that could just be a lie, who knows; the data speaks for itself. If they can prove this in court they should get punished (so again, Microsoft getting rekt for that will be good to see :] ).

Lots of people ironically put the Getty watermark on pictures and memes that they make to satirically imply that they are pulling stock photos off the internet with the printscreen function instead of paying for them.

Memes generally would not fall under the category of non-copyrighted material; they're most of the time extremely copyrighted material just being used without permission. And even a wholly original work on which an artist sarcastically puts a Getty watermark and then licenses under Creative Commons or something would fall into very murky territory - the Getty watermark itself is the intellectual property of Getty. The original image author might plead fair use as satire, but satirical intentions aren't really a defence available to DALL-E.

So even if we’re assuming these were wholly original works that the author placed under something like a Creative Commons license, the fact that it incorporated an image they had no rights to would at the very least create a fairly tangled copyright situation that any really rigorous evaluation of the copyright status of every image in the training set would tend to argue towards rejecting as not worth the risk of litigation.

But the more likely scenario here is that they did minimal at best filtering of the training set for copyrights.


You could argue that mocking the Getty logo like that is some form of fair use, which would be a backdoor through which it can end up as a legitimate element of a public domain work, in which case it would be fair game.

I agree with you that it is also possible that people posted Getty thumbnails to some sites as though they are public domain, and that is how the AIs learned the watermark.


Fair use does not make a work public domain; it merely helps the creator of the derivative work defend their case in court. But neither the original nor the derivative becomes public domain after a successful fair use defense.

Not a lawyer, of course, but I think slapping the Getty logo on a work claiming "fair use" and then releasing the work under public domain would be a case of misrepresentation, because Getty still has a copyright claim on your work. Regardless of the copyright status, it's still a clear trademark violation to me.


You can produce a public domain work using content that you have fair use rights to. The original owner of the content you are using fairly has no claim of ownership. You would have to assert that right in court if the owner of the copyright came after you, but that does not preclude the possibility of making a public-domain work with other copyrights used in fair use.

Obviously, that would not entitle anyone to rip those elements from your work and use them in a way that was not fair use. The Getty watermark could fall into this category: public domain pictures using the watermark fairly (for transformative commentary/satire purposes) could go into the network, which uses that information to produce infringing images.

Trademarks are a different story, but trademark protections are a lot narrower than you might think.

The point is that it's very conceivable that the neural network is being trained to infringe copyrights by training entirely with public-domain images.


Dunno about Getty, but I've been shown the cover of the Beatles' Yellow Submarine done in different colors as some great AI advancement.

When Joe Rando plays a song from 1640 on a violin he gets a copyright claim on Youtube. When Jane Rando uses devtools to check a website source code she gets sued.

When Microsoft steals all code on their platform and sells it, they get lauded. When "Open" AI steals thousands of copyrighted images and sells them, they get lauded.

I am skeptical of imaginary property myself, but fuck this one set of rules for the powerful, another set of rules for the masses.


If this is the new status quo then I suggest we find out how to fuck up the corpus as best as possible.

> one set of rules for the powerful, another set of rules for the masses.

Conservatism consists of exactly one proposition, to wit:

There must be in-groups whom the law protects but does not bind, alongside out-groups whom the law binds but does not protect.

—Composer Frank Wilhoit[1]

[1]: https://crookedtimber.org/2018/03/21/liberals-against-progre...


Thanks for posting the link to the quote. Having said that, I don't think it's possible to quote that bit and get an understanding of the idea being conveyed without its opening context. Indeed, it's likely to create a false idea of what's being conveyed. From earlier in the same post:

"There is no such thing as liberalism — or progressivism, etc.

There is only conservatism. No other political philosophy actually exists; by the political analogue of Gresham’s Law, conservatism has driven every other idea out of circulation."


I agree that adds considerable depth to the value of the quote, and connects it to the conversation he appeared to be having, which is about the first line you've quoted:

There is no such thing as being a Liberal or Progressive, there is only being a Conservative or anti-Conservative, and while there is much nuance and policy to debate about that, it boils down to deciding whether you actually support or abhor the idea of "the law" (which is a much broader concept than just the legal system) existing to enforce or erase the distinction between in-groups and out-groups.

But that's just my read on it. Getting back to intellectual property, it has become a bitter joke on artists and creatives, who are held up as the beneficiaries of intellectual property laws in theory, but in practice are just as much of an out-group as everyone else.

We are bound by the law—see patent trolls, for example—but not protected by it unless we have pockets deep enough to sue Disney for not paying us.



> Joe Rando plays a song from 1640 on a violin he gets a copyright claim on Youtube

That can't possibly be a valid claim, right? AFAIK copyright is "gone" after the original author dies + ~70 years. Until fairly recently it was even shorter. Something from 1640 surely can't be claimed under copyright protection. There are much more recent changes where that might not be the case, but 1640?

> When Jane Rando uses devtools to check a website source code she gets sued.

Again, that doesn't sound like a valid suit. Surely she would win? In the few cases I've heard of where suits like this are brought against someone they've easily won them.


This isn't a legal copyright claim, it's a "YouTube" copyright claim which is entirely owned and enforced by YouTube.

OK but then we're just talking about content moderation, which seems like a separate issue. I think using "YouTube copyright claim" as a proxy for "legal copyright claim" is more to the parent's point, especially since that's how YouTube purports the claim to work. Otherwise it feels irrelevant.

Copyright claims are a form of content moderation, by preventing reuse of content that others own.

But it can still be weaponized to prevent legitimate resubmissions of parallel works, which can potentially deplatform legitimate users, depending on the reviewer and the clarity of the rebuttal.


YouTube does this moderation in order to avoid legal pressure from copyright holders, as in

https://en.m.wikipedia.org/wiki/Viacom_International_Inc._v.....


The poster isn't claiming that this is a valid DMCA suit. Nearly everyone who is at a mildly decent level and has posted their own recordings of classical music to YouTube has received these claims _in their Copyright section_. YouTube itself prefixes this with some lengthy disclaimer about how this isn't the DMCA process, but that they reserve the right to kick you off their site based on fraudulent matches made by their algorithms.

They are absolutely completely and utterly bullshit. Nobody with half an ear for music will mistake my playing of Bach's G Minor Sonata with Arthur Grumiaux (too many out of tune notes :-D). But yet, YouTube still manages to match this to my playing, probably because they have never heard it before now (I recorded it mere minutes before).

So no, it isn't a valid claim, but this algorithm, trained on certain examples of work, manages to make bad classifications with potentially devastating ramifications for the creator (I'm not a monetized YouTube artist, but if this triggered a complete lockout of my Google account(s), it would likely end Very Badly).

I think it's a very relevant comparison to the GP's examples.


I have dealt with fraudulent Youtube copyright claims, it's a long and annoying process. First, you have to file a dispute, which typically the claimant will automatically reject, and then you have to escalate to a DMCA counternotice, which will take the video offline for a few days to give the claimant a chance to respond. In my experience, the claimant will drop the complaint at this point, but you're theoretically opening yourself up to legal action by sending the counternotice.

> That can't possibly be a valid claim, right?

It's not, but good luck talking to a human at Youtube when the video gets taken down.

> Again, that doesn't sound like a valid suit. Surely she would win?

Assuming she could afford the lawyer, and that she lives through the stress and occasional mistreatment by the authority, yes, probably. Both are big ifs, though.


> Assuming she could afford the lawyer, and that she lives through the stress and occasional mistreatment by the authority,

To add to that, there are provisions to lock her out of pushing new videos to the platform if the number of unresolved copyright claims passes some low number (3?).

So she loses new revenue until her claims prevail, and of course the entity on whose behalf the claim is made knows that and has no incentive to help her (don't they even get the monetization from her videos in the meantime?)


> That can't possibly be a valid claim, right?

I'm not a lawyer, but my understanding is that while the "1640's violin composition" itself may be out of copyright, if I record myself playing it, my recording of that piece is my copyright. So if you took my file (somehow) and used it without my permission, and I could prove it, I could claim copyright infringement.

That's my understanding, and I've personally operated that way to avoid any issues since it errs on the side of safety. (Want to use old music, make sure the license of the recording explicitly says public domain or has license info)


The problem is that YouTube AI thinks your recording is the same as every other recording, because it doesn't understand the difference between composition and recording.

…yes, as I understand it there are ‘mechanical’ rights vs. publishing rights… (for example hip hop artists may recreate a sample to avoid paying mechanical royalties, but still end up paying for publishing) https://www.lawinsider.com/dictionary/mechanical-rights

Yes, that sounds right to me. But that's not relevant to "Joe Whoever played it and got sued".

> That can't possibly be a valid claim, right?

It has indeed happened.

https://boingboing.net/2018/09/05/mozart-bach-sorta-mach.htm...

Sony later withdrew their copyright claim.

There are two pieces to copyright when it comes to public domain:

* The work (song) itself -- can't copyright that

* The recording -- you are the copyright owner. No one, without your permission, can re-post your recording

And of course, there is derivative work. You own any portion that is derivative of the original work.


> Sony later withdrew their copyright claim.

Right, that's my point... I can sue anyone for anything, doesn't mean I'll win.


It worked out in this case.

In the VAST majority of cases it does not.


> I can sue anyone for anything, doesn't mean I'll win.

You can't sue if you don't have money; a big corp can sue even if they know they are wrong.


> Again, that doesn't sound like a valid suit. Surely she would win? In the few cases I've heard of where suits like this are brought against someone they've easily won them.

That's freedom of speech for everyone who can afford a lawyer to bring suit against a music rights-management company.


Yes, this is a problem with the legal system in general.

The songwriter copyright is expired but there is still a freshly minted copyright on the video and the audio performance.

This becomes particularly onerous when trolls claim copyright over published recordings of environmental sounds that happen to be similar, but not identical, to their own - they do have a legitimate claim on their original recording, just not on yours.


>That can't possibly be a valid claim, right?

For literally everything but music, yes.

Even by the standards of copyright technicality, music copyright is weird. For example, if you ask a lawyer[0] what parts of copyright set it apart from other forms of property law[1], they would probably answer that it's federally preempted[2] and that it has constitutionally-mandated term limits.

Which, of course, is why music has a second "recording copyright", which was originally created by states assigning perpetual copyright to sound recordings. I wish I was making this up.

So the musical arrangement that constitutes that song from 1640? Absolutely public domain. You can tell people how to play Monteverdi all damned day. But every time you record that song being played, that creates a new copyright on that recording only. This is analogous to how making a cartoon of a public-domain fairy tale gives you ownership over that cartoon only. Except because different performers are all trying to play the same music as perfectly as possible, the recordings will sound the same and trip a Content ID match.

Oh, and because music copyright has two souls, the Sixth Circuit said there's no de minimis for sampling. That's why sample-happy rap is dead.

If you want public domain music on your YouTube video you either record it yourself or license a recording someone else did. I think there are CC recordings of PD music but I'm not sure. Either way you'll also need to repeatedly prove this to YouTube staff that would much rather not have to defend you against a music industry that's been out for blood for half a century at this point.

[0] Who, BTW, I am very much NOT

[1] Yes, yes, I know I'm dangerously close to uttering the dangerous propaganda term "intellectual property". You can go back to bed Mr. Stallman.

[2] Which means states can't make their own supra-federal copyright law and any copyright suit immediately goes to federal court.


The copyright on the original composition may be expired, but there are many people who make new recordings of that piece and those recordings are copyrighted. While it is entirely legal to record your own rendition, YouTube’s automated Content ID is dumb and often can’t tell the difference between your recording and some other contemporary recording.

Yeah, inequality sucks. So how about we focus on making the world better for everyone instead of making the world equally shitty for everyone?

This makes no sense.

Absolutely nobody is arguing to make the world shittier


Because we’re not the ones with the power. People with limited power pick the fights they might win, not the fights that maximize total welfare for everyone including large copyright holders. There’s no moral obligation to be a philosopher king unless you’re actually on a throne.

I think copilot is a clearer copyright violation than any of the stable diffusion projects though because code has a much narrower band of expression than images. It's really easy to look at the output of CoPilot and match it back to the original source and say these are the same. With stable diffusion it's much closer to someone remixing and aping the images than it is reproducing originals.

I haven't been following super closely but I don't know of any claims or examples where input images were recreated to a significant degree by stable diffusion.


I think this is exactly the gap the GP is mentioning: to a trained artist it is clear as day that the original image has been lifted wholesale, even if, for instance, the colors are adjusted here and there.

You put it as a remix, but remixes are credited and expressed as such.


I haven’t seen any side by sides that seem like a lift. Any examples?

I don’t see Midjourney (et al) as remixes, myself. More like “inspired by.”


It's clear where the know-how was lifted from; it doesn't matter if the final image is somewhat unique (almost every image is).

style is not copyrightable under current rules

But it means the models were trained on images that are under copyright. In fact many of these models were trained exclusively on such images without any permission. For example Midjourney is clearly trained on everything on artstation.com where almost all images have commercial purpose / licenses.

Not safe for work, but one example I saw going around:

https://twitter.com/ebkim00/status/1579485164442648577

Not sure if this was fed the original image as an input or not.

Also seen a couple cases where people explicitly trained a network to imitate an artist's work, like the deceased Kim Jung Gi.


It's really interesting. I suspect the face was inpainted in, or this was an "img2img".

I think over time we are going to see the following:

- If you take say a star wars poster, and inpaint in a trained face over luke's, and sell that to people as a service, you will probably be approached for copyright and trademark infringement.

- If you are doing the above with a satirical take, you might be able to claim fair use.

- If you are using AI as a "collage generator" to smash together a ton of prompts into a "unique" piece, you may be safe from infringement but you are taking a risk as you don't know what % of source material your new work contains. I'd like to imagine if you inpaint in say 20 details with various sub-prompts that you are getting "safer".


Features outside the face are lost/changed from the original on the right, so it can't be face inpainting. Unlikely to be style transfer, because some body parts are moved. Most plausibly this was generated.

So much for “generation” - it seems as if these models are just overfitting on an extremely small subset of the input data that they did not utterly fail to train on, almost as if there could be geniuses able to directly generate the weight data from said images without all the gradient descent.


That's clearly lifting the style, pose and general location, but in each of those there are changes. Even for the original art we could find tons of examples of very similar poses and backgrounds, because an anime girl in a bathing suit on a beach background isn't that original an image at the concept level. That pose is also pretty well worn.

This is the problem of applying the idea of ownership to ideas and expression like art. Art in particular is a very remix and recombination driven field.


I think the key detail is to look at what happened in the bottom left - in the original drawing, there's dark blue (due to lighting) cloth filling the scene, but the network has instead generated oddly-hued water there, even though on the right side there's sand from the beach shore. There's seemingly no geometric representation driving the AI so it ended up turning clothing into mystery ocean water when synthesizing an image that (for whatever reason) looked like the original one. It's an interesting error to me because it only looks Wrong once you notice the sand on the right.

https://alexanderwales.com/wp-content/uploads/2022/08/image....

Left: “Girl with a Pearl Earring, by Johannes Vermeer” by Stable Diffusion
Right: Girl with a Pearl Earring by Johannes Vermeer

This specific one is not a copyright violation, as it is old enough for copyright to have expired. But the same may happen with other images.

from https://alexanderwales.com/the-ai-art-apocalypse/ and https://alexanderwales.com/addendum-to-the-ai-art-apocalypse...


If a human drew that, it would not be copyright violation.

I’m not so sure about that.

The scenes à faire doctrine would certainly let you paint your own picture of a pretty girl with a large earring, even a pearl one. That, however, is definitely the same person, in the same pose/composition, in the same outfit. The colors are slightly off, but the difference feels like a technical error rather than an expressive choice.


Even if it is an expressive choice of the new artist, if enough of the original artist's expressive choice remains, it could still be a copyright violation. Fair use can sometimes be a defense, but there are a lot of factors that go into determining whether something is fair use.

Really? It looks like some bad Warhol take on the Vermeer original.

That’s a really apt comparison, since the Supreme Court just heard Andy Warhol Foundation for the Visual Arts v. Goldsmith, which hinges on whether Warhol’s use of a copyrighted photo of Prince as the basis for “Orange Prince” was Fair Use.

Warhol’s estate seems likely to lose and their strongest argument is that Warhol took a documentary photo and transformed it into a commentary on celebrity culture. Here, I don’t even see that applying: it just looks like a bad copy.

https://www.scotusblog.com/2022/10/justices-debate-whether-w...


Why? Obviously it wouldn't be a copyright violation because the original one is old enough to no longer be copyrighted. But other than age?

The photograph of the art, which will be more recent, might have copyright protections.

It looks like it wouldn't in the UK, probably wouldn't in the US, but would in Germany. The cases seem to hinge on the level of intellectual creativity of the photograph involved. The UK said that trying to create an exact copy was not an original endeavour, whereas Germany said the task of exact replication requires intellectual/technical effort of its own merit.

https://www.theipmatters.com/post/are-photographs-of-public-...


If the original art is still copyrighted, and you’d start selling your hand drawn variation, you’d totally be violating the copyright.

To make it concrete, imagine the latest Disney movie poster. You redraw it 95% close to the original, just changing the actual title. Then you sell your poster on Amazon at half the price of the actual poster. Would you get a copyright strike?


It is a clear case of derivative work (see also https://commons.wikimedia.org/wiki/Commons:Derivative_works - internal docs, but their explanation of copyright status tends to be well done)

This specific one would not be a problem, but doing it with a still copyrighted work would be.


Exactly. To a programmer, Copilot is a clear violation; to a writer, GPT-3 is a clear violation; to an artist, DALL-E 2 is a clear violation. The artist might love Copilot, the writer might love DALL-E, the programmer might love GPT-3.

It's all the same; they just don't realize it.


Does dalle-2 verbatim reproduce artwork? I have never used it.

It's kind of like having millions of parameters you can tweak to get to an image. So an image does not really exist in the model.

I can imagine Mona Lisa in my head, but it doesn't really "exist" verbatim in my head. It's only an approximation.

I believe copilot works the same way (?)


NNs can and do encode information from their training sets in the models themselves, sometimes verbatim.

Sometimes the original information is there in the model, encoded/compressed/however you want to look at it, and can be reproduced.
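A minimal numpy sketch of that point: with more parameters than training samples, even a plain linear model can store its training targets exactly and hand them back.

    import numpy as np

    rng = np.random.default_rng(42)

    # "Training set": 5 samples of 16-dimensional inputs (think pixels, tokens).
    X = rng.standard_normal((5, 16))
    y = rng.standard_normal(5)

    # Overparameterized linear "model": 16 weights but only 5 constraints,
    # so least squares can satisfy every training sample exactly.
    w, *_ = np.linalg.lstsq(X, y, rcond=None)

    print(np.allclose(X @ w, y))  # True: training targets come back verbatim

Large networks aren't this simple, of course, but this interpolation behavior is the same mechanism that lets heavily repeated training strings be reproduced.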


This is just nonsense.

It's similar to saying that any digital representation of an image isn't an image, just a dataset that represents it.

If what you said were any sort of defense, image copyright would never apply to any digital image, because images can be saved in different resolutions, different file formats, or encoded down. E.g. if a jpeg 'image' were only an image at one exact set of digital bits, I could save it again with a different quality setting and end up with a different set of bits.

But everyone still recognises when an image looks the same, and courts will uphold copyright claims regardless of the digital encoding of an image. So good luck with that spurious argument that it's not copyrighted because 'it's on the internet' (oh, it's with AI, etc.).


I don't understand what is nonsense, how it works? Your response seems to be for something entirely different.

But anyway, how I see stable diffusion being different is that it's a tool to generate all sorts of images, including copyrighted images.

It's more like a database of *how to* generate images rather than a database *of* images. Maybe there isn't that much of a difference when it comes to copyright law. If you ask an artist to draw a copyrighted image for you, who should be in trouble? I'd say the person asking, most of the time, but in this case we argue it's the people behind the pencil or whatever. Why? Because it's too easy? Where does a service like Fiverr stand here?

So if a tool is able to generate something that looks indistinguishable from some copyrighted artwork, is it infringing on copyright? I can get on board with yes if it was trained on that copyrighted artwork, but otherwise I'm not so sure.


A tool can't be held accountable and can't infringe on copyright or any other law, for that matter. It's more of a product. It seems to me like it's a gray area that's just going to have to be decided in court. Like, did the company that sells a tool that can very easily be used to do illegal things take enough reasonable measures to prevent it from being accidentally used in such a way? In the case of Copilot, I don't believe so, because there aren't really even any adequate warnings to the end user that it can produce code which can only legally be used in software that meets the criteria of the original license.

The issue is not about what it produces. Copilot, I am sure, has safeguards to not output copyrighted code (they even mention they have tests). So it will sufficiently change the code to be legally safe.

The issue is in how it creates the output. Both DALL-E and Copilot can only work by taking the work of people in the past, sucking up their earned know-how and creations, and remixing it. All that while not crediting (or paying) anyone. The software itself might be great, but it only works because it was fed loads of quality material.

It's smart copy&paste with obfuscation. If that's OK legally, you can imagine it soon being used to rewrite whole codebases while avoiding any copyright. All the code will technically be different, but also the same.
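A toy illustration of "technically different, but also the same" (hypothetical Python, not actual Copilot output):

    # Original (imagine it came from a GPL'd repository):
    def clamp(value, lower, upper):
        if value < lower:
            return lower
        if value > upper:
            return upper
        return value

    # "Rewritten": identifiers renamed, comparisons flipped. Byte-wise it is a
    # completely different file; structurally and behaviorally the same work.
    def restrict(x, lo, hi):
        if lo > x:
            return lo
        if hi < x:
            return hi
        return x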


The DMCA disagrees. Specific methods of "circumvention" which inevitably take the form of a software tool are prohibited. Tools and their authors can be held accountable.

> I haven't been following super closely but I don't know of any claims or examples where input images were recreated to a significant degree by stable diffusion.

I think that the argument being made by some artists is that the training process itself violates copyright just by using the training data.

That’s quite different from arguing that the output violates copyright, which is what the tweet in this case was about.


I'm dubious of that in cases where the training set isn't distributed. If we call the training copyright infringement, is downloading an image infringement? Is caching?

I think it's more a question of derivative work. Normally derivative work is an infringement unless it falls under fair use.

Now a human can take inspiration from like 100 different sources and probably end up with something that no one would recognize as derivative of any of them. But it also wouldn't be obvious that the human did that.

But with an ML model, it's clearly a derivative in that the learned function is mathematically derived from its dataset, and so are all the resulting outputs.

I think this brings a new question though. Because till now derivative was kind of implied that the output was recognizable as being derived.

With AI, you can tweak it so the output doesn't end up being easily recognizable as derived, but we know it's still derived.

Personally I think what really matters is more a question of what should be the legal framework around it. How do we balance the interests of AI companies and that of developers, artists, citizens who are the authors of the dataset that enabled the AI to exist. And what right should each party be given?


The real kink in that application of derivative work, to me, is that the entire dataset goes into the model and is, to some vanishingly small extent, used in every output - how can we meaningfully assign ownership through that transition and mixing? And when we do, how do we do it without exacerbating the existing problems of copyright in art? We already can't use characters and settings made during our own lifetimes in our own expression, because Disney got life + 70 through Congress.

I don’t think copilot is intrinsically a copyright violation, as you seem to be alluding to. Examples like this seem to be more controversial, but I’m not sure there’s a clear copyright violation there either.

If you asked every developer on earth to implement FizzBuzz, how many actually different implementations would you get? Probably not very many. Who should own the copyright for each of them? Would the outcome be different for any other product feature? If you asked every dev on earth to write a function that checked a JWT claim, how many of them would be more or less exactly the same? Would that be a copyright violation? I hope the courts answer some of these questions one day.
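For reference, here is about as canonical a FizzBuzz as it gets (Python); the space of reasonable variations really is tiny:

    def fizzbuzz(n):
        # Multiples of 3 -> "Fizz", of 5 -> "Buzz", of both -> "FizzBuzz".
        if n % 15 == 0:
            return "FizzBuzz"
        if n % 3 == 0:
            return "Fizz"
        if n % 5 == 0:
            return "Buzz"
        return str(n)

    for i in range(1, 101):
        print(fizzbuzz(i))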


> If you asked every developer on earth to implement FizzBuzz, how many actually different implementations would you get?

Thousands at least. Some of which would actually work.


There’s a finite number of ways to implement a working FizzBuzz (or anything else) in any given language, that aren’t substantially similar, is my point. At least without introducing pointless code for the explicit purpose of making it look different.

> If you asked every developer on earth to implement FizzBuzz, how many actually different implementations would you get?

Does it matter? If you examined every copyright lawsuit on earth over code, how many of them would actually be over FizzBuzz?


The same rationale applies to any other simple code block, as I elaborated on.

And my point is you don't have lawsuits over one simple code block.

This entire thread is about how copilot committed a copyright violation on a simple code block.

That code block is neither "simple like FizzBuzz" nor is it in a lawsuit. I feel like we're speaking past each other at this point.

What makes it not simple like FizzBuzz? You will not be able to come up with a reason why this one single function is copyrightable, but a FizzBuzz function isn’t. It’s one function in 15 lines of code. Get 1,000,000 developers to implement that function and you’re not going to have anywhere near 1,000,000 substantially different implementations.

For one thing FizzBuzz is like... 5-6 statements? This function has 13. FizzBuzz has a whopping 1 variable to keep track of. This function has so many I'm not even going to try to count. I'm not going to keep arguing about this, but if you want to believe they're equally simple then you'll just have a hard time convincing other people. That's all I have left to say on this.

It doesn't seem that far off to me. Copyright makes more sense in a larger context, such as making a Windows clone by copy pasting code from some Windows leak.

Without that context, fizzbuzz is not that different from a matrix transpose function to me.


SCO v. IBM[1] included claims of sections as small as "…ranging from five to ten to fifteen lines of code in multiple places that are of issue…" in some of the individual claims of the case.

[1] https://en.wikipedia.org/wiki/SCO_Group,_Inc._v._Internation....


The "..." part you redacted out explicitly said "it is many different sections of code". It was (quite obviously) not one or two 5-line blocks of code, let alone "simple" ones like FizzBuzz.

So your claim is that the code in the OP tweet is actually not copyrightable, and it would only become a copyright violation if you also copied many additional code blocks from the same copyrighted work?

Google v. Oracle ended with a six-line function not being granted de minimis protection. What you're talking about is arguably common sense, but not based on current case law in the US.

Copyright is for original whole works. Utility functions don’t fall under that I don’t think.

I suppose whoever wants to pay the fees would “own” these things ?

https://www.copyright.gov/circs/circ61.pdf


I think the issue people have is that every developer trying to implement FizzBuzz will not have studied all the existing public copyrighted implementations. They will likely be reinventing the solution with maybe never having seen an existing FizzBuzz implementation or having only seen one or two at most, and probably won't be re-implementing it verbatim.

But the machine learning model has studied every single one of them.

And maybe more preposterous, if its dataset had no FizzBuzz implementation would it even be able to re-invent it?

I feel this is the big distinction that probably annoys people.

That, and the general fact that everyone is worried it'll devalue the worth of an experienced developer: AI will make hard things easier and require less effort and talent to learn, thus making developers less in demand and probably lower paid.


I don’t know of any examples of images being wholly recreated, but it’s certainly possible to use the name of some living artists to get work in their style. In those cases, it seems like not such a leap to say that the AI has obviously seen that artist’s work and that the output is a derivative work. (The obvious counterargument is that this is the same as a human looking at an artist’s work and aping the style.)

https://alexanderwales.com/wp-content/uploads/2022/08/image....

Left: “Girl with a Pearl Earring, by Johannes Vermeer” by Stable Diffusion
Right: Girl with a Pearl Earring by Johannes Vermeer

This specific one is not a copyright violation, as it is old enough for copyright to have expired. But the same may happen with other images.

from https://alexanderwales.com/the-ai-art-apocalypse/ and https://alexanderwales.com/addendum-to-the-ai-art-apocalypse...


I think this happens a lot with famous images since that image will be in the training set hundreds of times.

Even if deduplication efforts are done, that painting will still be in the background of movie shots etc.


> Left: “Girl with a Pearl Earring, by Johannes Vermeer” by Stable Diffusion Right: Girl with a Pearl Earring by Johannes Vermeer

Even that, if done by a person, would as far as I understand not constitute copyright infringement. It's a separate work mimicking Vermeer's original. The closest real-world equivalent I can think of is probably the Obama Hope case, AP vs Shepard Fairey, but that settled out of court so we don't really know what the legal status of that kind of reproduction is. On top of that, though, the SD image isn't just a recoloring with some additions like Fairey's was, so it's not quite as close to the original as that case is.


Have you been following the Andy Warhol Prince drawing case?

It is current at the SCOTUS so we should see a ruling for the USA sometime in the next year or so.

https://en.m.wikipedia.org/wiki/Andy_Warhol_Foundation_for_t...


No, I hadn't heard of it; I don't follow copyright law extremely closely - it tends to make me annoyed. On its face, reading the case summaries and looking at the two pictures, it feels like the act of manually repainting and the color choices should be enough to render it a transformative work. It's one of the fundamental problems with trying to apply copyright to anything other than precise copies: art remixes and recombines all the time; it's fundamental to the process.

It is a clear case of derivative work (see also https://commons.wikimedia.org/wiki/Commons:Derivative_works - internal docs, but their explanation of copyright status tends to be well done)

It’s not a copyright violation to commission an artist to make you something in the style of another artist, and it’s also not copyright infringement for the artist you hired to look at that artist’s work to learn what that style means. And it’s also not always infringement to draw another artist’s work in your own style, same as reimplementing code.

If you “trace” another artist's work the hammer comes down though. For Copilot it’s way easier to get it to obviously trace.


Right, but what if you commission an artist to create a work similar to an already existing piece of art and the artist decides that the most efficient way to do that is to just place the original piece of art in a photocopier, crops out the copyright notice and original artist's signature, and sells you the resulting print?

That's a violation, but not what SD is doing. It's not copying; it's recreating a similar (sometimes extremely similar) image.

> In those cases, it seems like not such a leap to say that the AI has obviously seen that artist’s work and that the output is a derivative work.

"Copying" a style is not a derivative work:

> Why isn't style protected by copyright? Well for one thing, there's some case law telling us it isn't. In Steinberg v. Columbia Pictures, the court stated that style is merely one ingredient of expression and for there to be infringement, there has to be substantial similarity between the original work and the new, purportedly infringing, work. In Dave Grossman Designs v. Bortin, the court said that:

> "The law of copyright is clear that only specific expressions of an idea may be copyrighted, that other parties may copy that idea, but that other parties may not copy that specific expression of the idea or portions thereof. For example, Picasso may be entitled to a copyright on his portrait of three women painted in his Cubist motif. Any artist, however, may paint a picture of any subject in the Cubist motif, including a portrait of three women, and not violate Picasso's copyright so long as the second artist does not substantially copy Picasso's specific expression of his idea."

https://www.thelegalartist.com/blog/you-cant-copyright-style


Stable Diffusion sometimes reproduces the large watermarks used by stock photo providers on their free sample images. That’s embarrassing at the minimum, and potentially a trademark violation.

Surely at the very least it'd be a TOS violation? I doubt any stock photo service grants you enough rights to redistribute their watermarked free image samples? Especially not in the context of a project like Stable Diffusion?

But it's not reproducing their samples. It's just adding their watermark to newly generated pictures you can't find in the training set.

If the watermark is their logo or name, it could be copyrighted or trademarked.

And it's the responsibility of the person using the tool to generate that image not to violate copyright by redistributing it.

The tool is already redistributing it.

A broadcaster of copyrighted works is not protected against infringement just because they expect viewers to only watch programming they own.


It's not broadcasting an exact replica though; it's instructions to recreate an approximation of the original image. If I look at an image, describe it, and have someone else (or even myself) recreate it later, that in general isn't copyright infringement - it's just a normal process in art. A more extreme example is the Fairey Hope image and the original AP photo, but even that is more similar to its original than the output created by Stable Diffusion. Approximate recreations aren't generally copyright violations.

On the subject of trademarks, the issue falls, as far as I know, even more on the end user, because the protections are around use in commerce and consumer confusion, not merely around recreating the mark, unlike copyright protections.


Just like it's the person's responsibility to only recombine jpeg basis states when they don't correspond to a copyrighted image? It seems more and more to be the case that the trained model is, in large part, a very compact representation of the training data. I'm not seeing a difference between distributing a model that can be used to reconstruct the input images, as opposed to distributing jpeg basis states that can be used to reconstruct the original image.
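That analogy is easy to make concrete (a sketch using scipy's DCT, the transform family underlying JPEG): the coefficients look nothing like the image, yet they reconstruct it exactly.

    import numpy as np
    from scipy.fft import dctn, idctn

    rng = np.random.default_rng(0)
    image = rng.random((8, 8))          # stand-in for a copyrighted image block

    coeffs = dctn(image, norm="ortho")  # "distribute" only the basis coefficients
    restored = idctn(coeffs, norm="ortho")

    print(np.allclose(image, restored))  # True: the coefficients ARE the image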

If it faithfully memorized and reproduced a set of watermarks, it would be premature to conclude that it hadn’t memorized other (non-generic) graphical elements.

The watermark of a stock photo service is usually copyright protected, and also a (usually registered) trademark.

Well no.

Code is only protected to the degree it is creative and not functionally driven anyway.

So the reduced band of possible expression often directly reduces the protectability-through-copyright.


> With stable diffusion it's much closer to someone remixing and aping the images than it is reproducing originals.

So very similar to how the music industry treats sampling then?

Everybody using CoPilot needs to get "code sample clearance" from the original copyright holder before publishing their remix or new program that uses snippets of somebody else's code...

Try explaining _that_ to your boss and legal department.

"To: <all software dev> Effective immediately, any use of Github is forbidden without prior written approval from both the CTO and General Councel."


This is already a problem with anyone who ever copypastes from Stack Overflow. You're all violating CC-BY-SA[0] and nobody really cares about this.

[0] https://stackoverflow.com/help/licensing


If I ever take any code from SO, I include a comment with a link to it. Surely that's standard practice for anything longer than a line or two?
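
Something like this, say (the username, answer ID, and snippet here are placeholders, not a real answer):

    # Adapted from a Stack Overflow answer by <username>:
    #   https://stackoverflow.com/a/<answer-id>
    # Licensed CC BY-SA 4.0: https://creativecommons.org/licenses/by-sa/4.0/
    def flatten(nested):
        """Flatten one level of nesting."""
        return [item for sub in nested for item in sub]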

I do the same. I think it satisfies BY (attribution) but not SA (Share Alike).

As GP says, no one really cares, but it seems hard to satisfy SA... even if you are pasting into open source, is your license compatible with CC?

Perhaps I'm over-thinking this.


The reason why it's easy to match Copilot results back to the original source is that the users are starting with prompts that match their public code, deliberately to cause prompt regurgitation.

Stable Diffusion actually has a similar problem. Certain terms that directly call up a particular famous painting by name - say, the Mona Lisa[0] - will just produce that painting, possibly tiled on top of itself, and it won't bother with any of the other keywords or phrases you throw at it.

The underlying problem is that the AI just outright forgets that it's supposed to create novel works when you give it anything resembling the training set data. If it was just that the AI could spit out training set data when you ask for it, I wouldn't be concerned[1], but this could also happen inadvertently. This would mean that anyone using Copilot to write production code would be risking copyright liability. Through the AI they have access to the entire training set, and the AI has a habit of accidentally producing output that's substantially similar to it. Those are the two prongs of a copyright infringement claim right there.

[0] For the record I was trying to get it to draw a picture of the Mona Lisa slapping Yoshikage Kira across the cheek

[1] Anyone using an AI system to "launder" creative works is still infringing copyright. AI does not carve a shiny new loophole in the GPL.


> The reason why it's easy to match Copilot results back to the original source is that the users are starting with prompts that match their public code, deliberately to cause prompt regurgitation.

The reason doesn't really matter...


GP is just highlighting why this is so common and often a challenging edge case. If you ask it for something that's exactly in its dataset, the "best" solution that minimizes loss will be that existing code. Thus, it's somewhat intrinsic to applying statistical learning to text completion.

This means MS really shouldn't have used copyleft code at all, and really shouldn't be selling Copilot in this state, but "luckily" for them, short of a class action suit I don't really see any recourse for the programmers whose work they're reselling.
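
You can watch the mechanism in miniature with a toy character-level model (a sketch of my own in Python; nothing like Copilot's real architecture, but the loss-minimizing behavior is the same in spirit; the training file is hypothetical):

    from collections import Counter, defaultdict

    def train(corpus, n=12):
        """For every n-character context, count which character follows it."""
        model = defaultdict(Counter)
        for i in range(len(corpus) - n):
            model[corpus[i:i + n]][corpus[i + n]] += 1
        return model

    def complete(model, prompt, length=200, n=12):
        """Greedy decoding: always emit the likeliest next character."""
        text = prompt
        for _ in range(length):
            options = model.get(text[-n:])
            if not options:
                break
            text += options.most_common(1)[0][0]
        return text

    corpus = open("training_set.txt").read()  # hypothetical training data
    model = train(corpus)
    prompt = corpus[100:112]  # a 12-char prefix lifted straight from the corpus
    # If that prefix occurs only once, every greedy step has exactly one
    # "best" continuation: the memorized original, reproduced verbatim.
    print(complete(model, prompt))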


Pretty much all code they have requires attribution, and based on reports, Copilot does not generate that along with the code. So excluding copyleft code (how would you even do that?) does not address the issue (assuming that the source code produced is actually a derivative work).

That's a good point. I was thinking that during the curation phase of the dataset, they should check for a LICENSE.txt file in the repo and just batch-exclude all repositories containing copyleft licenses. This obviously won't handle every case, as you say, and when the model does generate copyleft code it will fail to attribute it, but hopefully having no copyleft code in its dataset, or less of it, reduces the chance it generates code that perfectly satisfies its loss function by being exactly like something it's seen before.

The main problem I see with generating attribution is that the algorithm obviously doesn't "know" that it's generating identical code. Even in the original Twitter post, the algorithm makes subtle and essentially semantically synonymous changes (like changing the commenting style). So for all intents and purposes it can't attribute the function, because it doesn't know _where_ the code is coming from, and copied code is indistinguishable from de novo code. Copilot will probably never be able to attribute code short of exhaustively checking its outputs with some symbolic approach against a database of copyleft/copyrighted code.
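
A curation pass along those lines might look like this naive sketch (the paths and markers are mine; real detectors such as GitHub's licensee or scancode-toolkit are far more thorough, and, as noted above, a missing LICENSE file still doesn't mean the code is unencumbered):

    from pathlib import Path

    # Strings whose presence in a license file suggests a copyleft grant.
    COPYLEFT_MARKERS = (
        "GNU GENERAL PUBLIC LICENSE",
        "GNU LESSER GENERAL PUBLIC LICENSE",
        "GNU AFFERO GENERAL PUBLIC LICENSE",
        "MOZILLA PUBLIC LICENSE",
    )

    def looks_copyleft(repo: Path) -> bool:
        """True if any license-ish file in the repo mentions a copyleft marker."""
        for name in ("LICENSE", "LICENSE.txt", "LICENSE.md", "COPYING"):
            path = repo / name
            if path.is_file():
                text = path.read_text(errors="ignore").upper()
                if any(marker in text for marker in COPYLEFT_MARKERS):
                    return True
        return False

    repos = [p for p in Path("corpus").iterdir() if p.is_dir()]
    training_set = [r for r in repos if not looks_copyleft(r)]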


Suing Microsoft for training Copilot on your code would require jumping over the same hurdle that the Authors Guild could not: i.e. that it is fair use to scan a massive corpus[0] of texts (or images) in order to search through them.

My real worry is downstream infringement risk, since fair use is non-transitive. Microsoft can legally provide you a code generator AI, but you cannot legally use regurgitated training set output[1]. GitHub Copilot is creating all sorts of opportunities to put your project in legal jeopardy and Microsoft is being kind of irresponsible with how they market it.

[0] Note that we're assuming published work. Doing the exact same thing Microsoft did, but on unpublished work (say, for irony's sake, the NT kernel source code) might actually not be fair use.

[1] This may give rise to some novel inducement claims, but the irony of anyone in the FOSS community relying on MGM v. Grokster to enforce the GPL is palpable.


> The reason why it's easy to match Copilot results back to the original source is that the users are starting with prompts that match their public code, deliberately to cause prompt regurgitation.

Sounds like MS has devised a massive automated code laundering racket.


Seems more like a massive class-action copyright target, potentially at ($50k/infraction) x (the number of usages).

Good. Where do I sign up?

Find a good and ambitious copyright attorney with some free capacity.

Also, register your code with the copyright office.

Edit: Apparently, with the #1 post on HN right now, you could also just go here: https://githubcopilotinvestigation.com/


Both.

I think that's backwards. The AI doesn't "forget"; it never even knew what novelty was in the first place.

I tried some very simple queries with copilot on random stuff, and tried to trace it back to the source. I was successful about 1/3 of the time.

(Sorry I didn't log my experiment results at the time. None of it was related to work I'd done - I used time adjustment functions if I remember correctly)



I suspect it’s going to be a discussion similar to the introduction of music sampling, followed by a lot of litigation, followed by a settling of law on the matter.

The interesting part is whether AI will be considered a tooling mechanism, much like the tooling used to record and manipulate a music sample into a new composition.


Preach. So incredibly annoyed when I tried to send a video of my son playing Beethoven to his grandparents and it was taken down due to a copyright violation.

> When Joe Rando plays a song from 1640 on a violin he gets a copyright claim on Youtube. When Jane Rando uses devtools to check a website source code she gets sued.

Do you have any evidence for those claims, or anything resembling those examples?

Music copyright has long expired for classical music, and big shots are definitely not exempt where it does apply. Just look at how much heat Ed Sheeran, one of the biggest contemporary pop stars, got for "stealing" a phrase that was literally just chanting "Oh-I" a few times (just to be clear, I am familiar with the case and find it infuriating that this petty rent-seeking attempt went to trial at all, even if Sheeran ended up being completely cleared, though at great personal distress, as he said afterwards).

And who ever got sued for using dev tools? Is there even a way to find that out?


There have been a number of stories about musicians being hit with copyright claims. Here is the first result on Google:

https://www.radioclash.com/archives/2021/05/02/youtuber-gets...

For being sued for looking at source code, here is the first result on Google:

https://www.wired.com/story/missouri-threatens-sue-reporter-...


Ok - it is a true shame that the YouTube copyright claim system is so broken as to enable those shady practices, and that politicians still haven't upped their knowledge of the internet beyond a 'series of tubes'.

But surely the answer should be to fix the broken YT system and to educate politicians to abstain from baseless threats, not to make AI researchers pay for it?


Just to be clear, because it's in the title: the reporter was threatened with a lawsuit for looking at source code. I cannot find anyone actually sued for it. BTW, here's an article saying said reporter wasn't sued: https://www.theregister.com/AMP/2022/02/15/missouri_html_hac...

Anyone with a mouth can run it and threaten a lawsuit. In fact, I threaten to sue you for misinformation right now unless you correct your post. Fat lot of good my threat will do, because no judge in their right mind would entertain such a baseless lawsuit.


https://twitter.com/mpoessel/status/1545178842385489923

Among many others. Classical music may have fallen into the public domain, but modern performances of it are copyrightable, and some of the big companies use copyright-matching systems, including YouTube's, that often flag new performances as copies of existing recordings.


Basically, copyright is for people with copyright lawyers

That's not even a joke. One of the premises of a copyright is that you defend your intellectual property or lose it. If the system were more equitable then it would defend your copyright.

This is an inaccurate description of copyright, at least in the United States.

Trademarks require active defense to avoid genericization. Copyright may be asserted at the holder's discretion.


You're thinking of trademarks.

Ah! You're right.

Your point about losing copyright is incorrect. But copyright absolutely was designed with corporations in mind, not small individual creators. It was designed around TV, movies, and print publishers, not around YouTube and Patreon.

The poor are the masses, or at least part of the masses.

Yeah, presumably this was an editing error and he meant "the corporations."

> one set of rules for the poor, another set of rules for the masses

Presumably by "the masses" you meant "the large corporations"?

Usually, "the masses" means "the common people" ... i.e. not much different from "the poor."

If you meant corporations, I'm 100% behind this comment.


I think they mixed "for the rich … for the masses" and "for the poor … for corporations" while writing. But it's clear they meant contrasting terms.

I'm not even a good cellist and YouTube has put copyright claims on the crappy practice videos I have of me playing Saint-Saëns.

I suspect a video of you playing literally anything on the cello--even an improvised song or a random motif--is likely to get reported as a copyright violation when uploaded to YouTube.

Interesting theory. I'll have to test that.

Oh, actually I remember now -- I think the copyright complaint specifically said what recording they thought I was infringing, and it was the correct piece.


Diffusion of responsibility. When Joe Rando plays his song, it's easy to see the offender and do something about it.

When it’s a faceless mass of 100k employees…? Not so much.


The "Joe Rando" example is of playing a song that predates the copyright system.

Well copyright and intellectual property are made-up concepts anyway.

It’s only logical that people twist and bend the rules.

Anyway, this kind of bs will go on until people start bringing companies to court over this.

I still don't get why lawyers don't start offering their services for a cut of the damages… there is probably good money to be made by suing companies that put copyright-infringing AI in production.


You have four examples of using replicable stuff that has been shared publicly, and you call two of them stealing.

All I can take away from this is the absurdity of intellectual property laws in general. I agree with the GP, if people are sharing stuff, it's fair game. If you don't want people using stuff you made, keep it to yourself. Pretending we can apply controls to how freely available info is used is silly anyway


Wowwww. This is exactly what I was thinking, but I couldn’t put it into such a terse simple example. +1

In theory, AI should never return an exact copy of a copyrighted work, or even anything close enough that you could argue it's the original "just changed". If the styles are the same, I think that's fine, no different than someone else cloning it. But there are definitely outputs from Stable Diffusion that look like the original with some weird artifacts.

We need regulation around it.


Code is much easier to do that with, because the avenues for expression are significantly limited compared to just creating an image. For it to be useful, Copilot has to produce compiling, reasonably terse, and understandable code. The compiler in particular is a big bottleneck on the range of the output.

> there are definitely outputs from Stable Diffusion that look like the original with some weird artifacts.

Do you have examples? Because SD will generate photoreal outputs and then get subtle details (hands, faces) wrong, but unless you have the source image in hand then you've no way of knowing whether it's a "source image" or not.


This is like saying "we need a regulation around bugs in software", with similar consequences. ML models are generally too large to ensure that there's no bugs. Same with software.

I am a programmer who has written extensively on my blog and HN against Copilot.

I am also not a hypocrite; I do not like DALL-E or Stable Diffusion either.

As a sibling comment implies, these AI tools give more power to people who control data, i.e., big companies or wealthy people, while at the same time, they take power away from individuals.

Copilot is bad for society. DALL-E and Stable Diffusion are bad for society.

I don't know what the answer is, but I sure wish I had the resources to sue these powerful entities.


I’m a programmer and a songwriter and I am not worried about these tools and I don’t think they are bad for society.

What did the photograph do to the portrait artist? What did the recording do to the live musician?

Here’s some highfalutin art theory on the matter, from almost a hundred years ago: https://en.wikipedia.org/wiki/The_Work_of_Art_in_the_Age_of_...


But this isn't like photography and portrait artistry. This is more like a wealthy person stealing your entire art catalog, laundering it in some fancy way, and then claiming they are the original creator. Stable Diffusion has literally been used to create new art by screenshotting someone's live-streamed art creation process as the seed. While creating derivative work has always been considered art (such as erasure poetry and collage), it's extremely uncommon and blasé to never attribute the original(s).

> This is more like a wealthy person stealing your entire art catalog, laundering it in some fancy way, and then claiming they are the original creator.

If I take a song, cut it up, and sing over it, my release is valid. If I parody your work, that's my work. If you paint a picture of a building and I go to that spot and take a photograph of that building it is my work.

I can derive all sorts of things, things that I own, from things that others have made.

Fair use is a thing: https://www.copyright.gov/fair-use/

As for talking about the originals, would an artist credit every piece of inspiration they have ever encountered over a lifetime? Publishing a seed seems fine as a nice thing to do, but pointing at the billion pictures that went into the drawing seems silly.


Fair use is an affirmative defense. Others can still sue you for copying, and you will have to hope a judge agrees with your defense. How do you think Google v. Oracle would have turned out if Google's defense was "no your honor, we didn't copy the Java sources. We just used those sources as input to our creative algorithms, and this is what they independently produced"?

> If I take a song, cut it up, and sing over it, my release is valid

"valid", how? You still have to pay royalties to the copyright holder of the original song, and you don't get to claim it as your own.


Nobody is suing anybody over AI art yet.

Until there are a large number of court cases, the burden of proof is on you to say that this is copyright infringement.


If you sing over a song, you're adding your own voice. If you photograph a building, that's your own photograph, where decisions like lighting and framing are creative choices. If you paint a picture of a building, that's your own picture.

An artist should credit when they are directly taking from another artist. Erasure poems don’t quite work if the poet runs around claiming they created the poem that was being erased.

But more importantly, SD allows you to take existing copyrighted works, fancy-launder them, and pass them off as your own, even though you don't own the rights to that work. This would be more akin to me taking a photograph you made and selling it on a t-shirt on Redbubble. I don't actually own the IP to do that.


Do you know what's different about the photograph or the recording?

They are still their own separate works!

If a painter paints a person for commission, and then that person also commissions a photographer to take a picture of them, is the photographer infringing on the copyright of the painter? Absolutely not; the works are separate.

If a recording artist records a public domain song that another artist performs live, is the recording artist infringing on the live artist? Heavens, no; the works are separate.

On the other hand, these "AIs" are taking existing works and reusing them.

Say I write a song, and in that song, I use one stanza from the chorus of one of your songs. Verbatim. Would you have a copyright claim against me for that? Of course, you would!

That's what these AIs do; they copy portions and mix them. Sometimes they are not substantial portions. Sometimes they are, with verbatim comments (code), identical structure (also code), watermarks (images), composition (also images), lyrics (songs), or motifs (also songs).

In the reverse of your painter and photographer example, we saw US courts hand down judgment against an artist who blatantly copied a photograph. [1]

Anyway, that's the difference between the tools of photography (creates a new thing) and sound recording (creates a new thing) versus AI (mixes existing things).

And yes, sound mixing can easily stray into copyright infringement. So can other copying of various copyrightable things. I'm not saying humans don't infringe; I'm saying that AI does by construction.

[1]: https://www.reuters.com/world/us/us-supreme-court-hears-argu...


I'm not so sure that originality is that different between a human and a neural network. That is to say, what a human artist does has always involved a lot of mixing of existing creations. Art needs to have a certain level of familiarity in order to be understood by an audience. I didn't invent 4/4 time or a I-IV-V progression and I certainly wasn't the first person to tackle the rhyme schemes or subject matter of my songs. I wouldn't be surprised if there were fragments from other songs in my lyrics or melodies, either from something I heard a long time ago or perhaps just out of coincidence. There's only so much you can do with a folk song to begin with!

BTW, what happened after the photograph is that there were fewer portrait artists. And after the recording there were fewer live musicians. There are certainly no fewer artists or musicians overall, though!


> I'm not so sure that originality is that different between a human and a neural network. That is to say, what a human artist does has always involved a lot of mixing of existing creations.

I disagree, but this is a debate worth having.

This is why I disagree: humans don't copy just copyrighted material.

I am in the middle of developing and writing a romance short story. Why? Because my writing has a glaring weakness: characters, and romance stands or falls on characters. It's a good exercise to strengthen that weakness.

Anyway, both of the two people in the (eventual) couple developed from my real life, and not from any copyrighted material. For instance, the man will basically be a less autistic and less selfish version of myself. The woman will basically be the kind of person that annoys me the most in real life: bright, bubbly, always touching people, etc.

There is no copyrighted material I am getting these characters from.

In addition, their situation is not typical of such stories, but it does have connections to my life. They will (eventually) end up in a ballroom dance competition. Why that? So the male character hates it. I hate ballroom dance: during a three-week ballroom dancing course in 6th grade, the girls made me hate it. I won't say how, but they did.

That's the difference between humans and machines: machines can only copy and mix other copyrightable material; humans can copy real life. In other words, machines can only copy a representation; humans can copy the real thing.

Oh, and the other difference is emotion. I've heard that people without the emotional center of their brains can take six hours to choose between blue and black pens. There is something about emotions that drives decision-making, and it's decision-making that drives art.

When you consider that brain chemistry, which is a function of genetics and previous choices, is a big part of emotions, then it's obvious that those two things, genetics and previous choices, are also inputs to the creative process. Machines don't have those inputs.

Those are the non-religious reasons why I think humans have more originality than machines, including neural networks.


Asked to give practical advice to starting writers, he said, “Read.”

https://www.nytimes.com/2022/09/30/books/early-cormac-mccart...


And my advice is to read and live!

One of the reasons Roald Dahl was such a great writer is his life experiences. Read his books Boy and Solo.


Imagine telling someone who wanted to learn a sport to watch it. I define someone that writes as a writer. It is the act of writing that enables you to then read and learn from others.

An example: a dyslexic friend and a dyslexic family member. The written communication skills of both are now fine, in part because their jobs required it of them (and in part because technology helps). I also had one illiterate friend, who taught himself to read and write as an adult (basic written communication), due to the needs of his job. Learn by doing, and add observation of others as an adjunct to help you. Even better if you can get good coaching (which requires effort at your craft or sport).

Disclaimer: never a writer. Projecting from my other crafts/sports. Terribly written comment!


Coming from an actual, though unpublished, writer: you are right.

> I'm not so sure that originality is that different between a human and a neural network.

It is, yes. For example, a neural network can't invent a new art style on its own, or at least existing models can't; they can only copy existing art styles invented by humans.


> What did the recording do to the live musician?

The recording destroyed the occupation of being a live musician. People still do it for what amounts to tip money, but it used to be a real job that people could make a living off of. If you had a business and wanted to differentiate it by having music, you had to pay people to play it live. It was the only way.


It also gave birth to the recording artist. It certainly didn’t get rid of musicians.

> What did the photograph do to the portrait artist?

It completely destroyed the jobs of photorealistic portrait artists. You only have stylised portrait painting now, and that is going to be ripped off too.


It also gave birth to the photographic portrait artist. It certainly did not get rid of portrait artists in general.

This was of course a leading question. The point was to get you to think about what artists did in response to the photograph. They changed the way they paint.

I'm positive that machine learning will also change the way that people create art, and I am positive that it will only add to the rich tapestry of creative possibilities. There are still realistic portrait painters, after all; they're just not as numerous.


> these AI tools give more power to people who control data, i.e., big companies or wealthy people, while at the same time, they take power away from individuals.

Not sure I agree, but I can at least see the point for Copilot and DALL-E - but Stable Diffusion? It's open source, it runs on (some) home-use laptops. How is that taking away power from indies?

Just look at the sheer number of apps building on or extending SD that were published on HN, and that's probably just the tip of the iceberg. Quite a few of them at least looked like side projects by solo devs.


SD is better than the other two, but it will still centralize control.

I imagine that Disney would take issue with SD if material that Disney owned the copyright to was used in SD. They would sue. SD would have to be taken off the market.

Thus, Disney has the power to ensure that their copyrighted material remains protected from outside interests, and they can still create unique things that bring in audiences.

Any small-time artist that produces something unique will find their material eaten up by SD in time, and then, because of the sheer number of people using SD, that original material will soon have companions that are like it because they are based on it in some form. Then, the original won't be as unique.

Anyone using SD will not, by definition, be creating anything unique.

And when it comes to art, music, photography, and movies, uniqueness is the best selling point; once something is not unique, it becomes worth less because something like it could be gotten somewhere else.

SD still has the power to devalue original work; it just gives normal people that power on top of giving it to the big companies, while the original works of big companies remain safe because of their armies of lawyers.


> I imagine that Disney would take issue with SD if material that Disney owned the copyright to was used in SD. They would sue. SD would have to be taken off the market.

Are you sure?

I'm not familiar with the exact data set they used for SD and whether or not Disney art was included, but my understanding is that their claim to legality comes from arguing that the use of images as training data is 'fair use'.

Anyone can use Disney art for their projects as long as it's fair use, so even if they happened to not include Disney art in SD, it doesn't fully validate your point, because they could have done so if they wanted. As long as training constitutes fair use, which I think it should - it's pretty much the AI equivalent of 'looking at others' works', which is part of a human artist's training as well.


> Are you sure?

Yes, I'm sure.

> I'm not familiar with the exact data set they used for SD and whether or not Disney art was included, but my understanding is that their claim to legality comes from arguing that the use of images as training data is 'fair use'.

They could argue that. But since the American court system is currently (almost) de facto "richest wins," their argument will probably not mean much.

The way to tell if something was in the dataset would be to use the name of a famous Disney character and see what it pulls up. If it's there, then once the Disney beast finds out, I'm sure they'll take issue with it.

And by the way, I don't buy all of the arguments for machine learning as fair use. Sure, for the training itself, yes, but once the model is used by others, you now have a distribution problem.

More in my whitepaper against Copilot at [1].

[1]: https://gavinhoward.com/uploads/copilot.pdf


>The way to tell if something was in the dataset would be to use the name of a famous Disney character and see what it pulls up.

I tried out of curiosity. Here[1] are the first 8 images that came up with the prompt "Disney mickey mouse" using the stable diffusion V1.4 model. Personally I don't really see why Disney or any other company would take issue with the image generation models, it just seems more or less like regular fan art.

[1]: https://i.imgur.com/cIHBCRe.png


How is this situation made any worse by these AI systems?

If a small time artist has their work stolen, they probably won't be able to fight it very well. They might be able to get a few taken down, but the sheer number will make it impossible to keep up.

Disney, on the other hand, will have armies of lawyers going after any copyright violation.

It seems the same whether AI is involved or not.


The sheer scale is what makes it worse.

Because you are right: with a few copies, a small-time artist can fight. With hundreds and thousands of copies, or millions, even Disney struggles. That's why Disney would go after the model itself; it scales better.


> but I sure wish I had the resources to sue these powerful entities.

I wonder if there is a crowdfunding platform like GoFundMe for lawsuits. Or can GoFundMe itself be used for this purpose? It would be fantastic to sue the mega-polluters, lying media like Fox, etc.

That said, even with a lot of money, are these cases winnable? Especially given the current state of the Supreme Court and other federal courts?


Obviously this is a matter of philosophy. I am using Copilot as an assistant, and for that it works out very nicely. It's fancy code completion. I don't know who is trying to use this to write non-trivial code but that's as bad an idea as trying to pass off writing AI "prompts" as a type of engineering.

These things are tools to make more involved things. You're not going to be remembered for all the AI art you prompted into existence, no matter how many "good ones" you manage to generate. No one is going to put you into the Guggenheim for it.

Likewise, programmers aren't going to become more depraved or something by using Copilot. I think that kind of prescriptive purism needs to Go Away Forever, personally.


I think the methods can be unsavory even if the result is nice.

Yes, the way Copilot was trained was morally questionable, but probably legally fine (GitHub terms of service).

There is no doubt the result is extremely helpful though.


The best proposal I've heard to deal with the societal/economic problems this sort of A.I. poses was made by Jaron Lanier: https://youtu.be/rGqiswuJuQI?t=1190 …I can see why his proposals of providing (micro-)compensation to people whose tremendous efforts end up being mined by these algorithms would not be popular with researchers/companies who stand to benefit vastly (…presumably including the investors who own this site?). The lobbying power/political power/awareness/financial resources of your average (atomised) artist/programmer/musician etc. is pretty much nil in comparison… Forgive the clumsy analogy, but I have a feeling the whole thing might end up something like a haulage company that doesn't want to pay any taxes to help fix the roads though?

What makes them bad?

I am against Copilot because Microsoft is training the models with public data, disregarding copyright (also, it doesn't include its own code).


Because they centralize control, as I said in [1].

Put another way, AI's are tools that give more power to already powerful entities.

[1]: https://news.ycombinator.com/item?id=33227303


Not the OP, but I have a sinking feeling that these AI tools are going to take away the most enjoyable careers and creative pursuits and leave us with only mundane button-pusher AI-supervisor jobs.

Current AI is not replacing anything yet, but I feel we are only a few years away from AI doing a better job at drawing or programming than someone with years of practice. Sure, you can utilise those tools to stay ahead. But will AI prompt engineering be as emotionally satisfying as drawing for real?


It seems like these AI tools, if anything, will take away the least enjoyable parts of creative careers. Artists will spend less time thinking about how to adjust a camera lens or mix paints, and more time thinking about how to tell a story that connects with people. Programmers will spend less time banging out boilerplate, and more time thinking about system design.

AI isn’t replacing camera usage or paint mixing. It’s replacing the years of learning to draw with prompt engineering. Like how the camera obsoleted the art of realistic painting, I feel that AI will replace the art of drawing. Which is something I like doing. But I like doing it because it’s hard but with purpose.

It’s not satisfying to painstakingly work on something that I could have generated with an AI in seconds.


> while at the same time, they take power away from individuals.

Stable Diffusion and DALL-E give a ton of power to individuals, hence why they are popular.

It feels like you're doing a cost analysis instead of a cost-benefit analysis, i.e. you're only looking at the negatives. It's a bit like saying cars are bad because they give more power to the big companies who sell them + put horse and buggy operators out of a job.


I explained more in my comment at [1].

The big difference is that cars were a tool that helped regular people by being a force multiplier. Stable Diffusion and DALL-E are not force multipliers in the same way. Sure, you may now produce images that you couldn't before, but there are far fewer profitable uses for images than for cars. Images don't materially affect the world, but cars can.

[1]: https://news.ycombinator.com/item?id=33227303


This isn't super convincing to me. You're basically predicting that some new innovation will be limited in its usefulness, but you have no real way of knowing that, because the variables are too complex.

This is why we have a market. We let billions of individuals vote on what they think is useful or not, in real-time, multiple times a day, every day. If AI-generated images are less desirable than what came before, then people won't use them or pay to use them in the long run. They'll die like other flash-in-the-pan fads have died, artists will retain their jobs en masse, and OpenAI won't gain much if any power.

The entire idea of the market is to ensure that if some entity is gaining money/power, that's happening as a result of it providing some commensurate good to the people. And if that's not happening, or if the power is too great, that's why we have laws and regulatory bodies.


Thankfully, Stable Diffusion is on thousands of hard drives, so the genie can't be put back in the bottle.

I don’t think anyone believes it’s possible. Even the “ai ethicists” have to realise this. We can still acknowledge that these tools can be bad for society while knowing they can’t be stopped.

This talking point seems to come up often, but since it's basically saying that people are hypocrites I think it is a bad faith thing to say without reasonable proof that it's not a fringe opinion (or completely invented).

For what it's worth, the people I know who are opposed to this sort of "useful tool" don't discriminate by profession.


An accusation of hypocrisy is not an argument; at least not a relevant one.

I’m not making an argument.

I think the distinction is that only one of those classes tends to produce exact copies of work. Programmers get very upset at DALL-E and Stable Diffusion producing exact (and near-exact) copies of artwork too. In contrast to exact copying, production of imitations (not exact copies, but "X in the style of Y") is something that artists have been doing for centuries, and is widely thought of as part of arts education.

For some reason, code seems to lend itself to exact copying by AIs (and also some humans) rather than comprehension and imitation.


I'm mildly suspicious that this example is an implementation of generic matrix functionality, though: you couldn't patent this sort of work, because it's not patentable - it's mathematics. It's fundamentally a basic operation that would have to be implemented with a similar structure regardless of how you do it.

Patents and copyrights are totally different, and should be treated as such. The issue isn't about whether someone copies the algorithm, it's whether they copy the written code. Nothing in an algorithms textbook is patentable either, but if you copy the words describing an algorithm from it, you are stealing their description.

Mathematics is not patentable, but you can patent the steps a computer takes to compute the results of that particular algorithm.

Only if it has physical consequences. There was a case in 2014 that narrowed software patents significantly, called "Alice vs CLS Bank." No more patents on computerized shopping carts, but encryption or compression can still be patented.

In my mentoring/tutoring experiences, the comprehension is also resisted when a copy is available.

I, with my software developer hat on, am not excited by AI. Not a bit, honestly. Especially about these big models trained on huge amounts of data, without any consent.

Let me be perfectly clear. I'm all for the tech. The capabilities are nice. The thing I'm strongly against is training these models on any data without any consent.

GPT-3 is OK, training it with public stuff regardless of its license is not.

Copilot is OK, training it with GPL/LGPL-licensed code without consent is not.

DALL-E/MidJourney/Stable Diffusion is OK. Training it with non public domain or CC0 images is not.

"We're doing something amazing, hence we need no permission" is ugly to put it very lightly.

I've left GitHub because of Copilot. I will leave any photo hosting platform that hints at any similar thing with my photography, period.


I disagree.

Those are effectively cases of cryptomnesia[0]. Part and parcel of learning.

If you don't want broad access to your work, don't upload it to a public repository. It's very simple. Good on you for recognising that you don't agree with how GitHub looks at data in public repos, but it's not their problem.

[0] https://en.m.wikipedia.org/wiki/Cryptomnesia


> Those are effectively cases of cryptomnesia.

Disagree; outputting training data as-is is not cryptomnesia. This is not Copilot's first such case. It also reproduced id Software's fast inverse square root function as-is, including its comments, but without its license.

> If you don't want broad access to your work, don't upload it to a public repository. It's very simple.

This is actually both funny and absurd. This is why we have licenses in the first place. If all the licenses are moot, then this opens a very big can of worms...

My terms are simple. If you derive, share the derivation with the same license (xGPL). Copilot is deriving my code. If you use my code as a derivation point, honor the license, mark the derivation with GPL license. This voids your business case? I don't care. These are my terms.

If any public item can be used without any limitations, then Getty Images (or any other stock photo business) has no leg to stand on. CC licensing shouldn't exist. The GPL is moot. Even the most litigious software companies' cases (Oracle, SCO, Microsoft, Adobe, etc.) are moot. Just don't put it on public servers, eh?

Similarly, music and other fine arts are generally publicly accessible. So copyright on any and every production is also invalid as you say, because it's publicly available.

Why not put your case forward with attorneys of Disney, WB, Netflix and others? I'm sure they'll provide all their archives for training your video/image AI. Similarly Microsoft, Adobe, Mathworks, et al. will be thrilled to support your CoPilot competitor with their code, because a) Any similar code will be just cryptomnesia, b) The software produced from that code is publicly accessible anyway.

At this point, I haven't even touched on the fact that humans learn very differently than neural networks do.


> Disagree; outputting training data as-is is not cryptomnesia

Outputting training data as-is without attribution is just plain plagiarism. You don't get to put verbatim text from textbooks in your academic papers either.


It's funny to say id's fast inverse square root. Carmack certainly didn't come up with the algorithm or the magic number.

But your reasoning boils down to "I don't like it, so it mustn't be that way." That's never been necessarily true.

At any rate, piracy is rampant, so clearly a large body of people don't think even a direct copy is morally wrong, let alone something similar.

You're acting as though there are constantly won and lost cases over plagiarism. Ed Sheeran seems to defend his work weekly. Every case that goes to court means reasonable minds differ on the legal interpretation of plagiarism.

So what's your point?

Because it seems the main thrust of your argument is I should argue with Microsoft instead (*who own GitHub lol*)? That's all you got to hold back the tide of AI? An appeal to authority?


> It's funny to say id's fast inverse square root. Carmack certainly didn't come up with the algorithm or the magic number.

I'm not claiming that they did. What I said is that Copilot emitted the exact implementation in id's repository, incl. all comments and everything.

> But your reasoning boils down to I don't like it so it mustn't be that way. That's never been necessarily true.

If you interpret my comment with that depth and breadth, I can only say that you are misinterpreting completely. It's not about my personal tastes, it's about ethical frameworks and social contracts.

> At any rate piracy is rampant so clearly a large body of people don't think even a direct copies is morally wrong. Let alone something similar.

I believe if you listen to a street musician for a minute, you owe them a dollar. Scale up from there. BTW, I'm a former orchestra player, so I know what making and performing music entails.

> You're acting as though there are constant won and lost cases over plagiarism. Ed Sheeran seems to defend his work weekly. Every case that goes to court means reasonable minds differ on the interpretation of plagiarism legally.

When there's a strict license on how a work can be used, and the license is violated, it's a clear case. The AI is just a derivation engine, and the license requires that derivations carry the same license. I don't care if you derive my code. I care if you derive my code and hide the derivation from the public.

It's funny that you're defending closed-sourcing free software at this point. This is a neat full circle.

> So what's your point?

All research and science should be ethical. AI research is not something special which allows these ethical norms and guidelines (which were established over decades, if not centuries) to be suspended. If medical researchers acted with a quarter of this laissez-faire attitude, they'd be executed with a slow death. If security researchers acted with an eighth of this recklessness, their careers would be ruined.

> That's all you got to hold back the tide of AI?

As I said before, I'm not against AI. It just doesn't excite me as a person who knows how it works and what it does, and the researchers' attitude is leaving a bad taste in my mouth.


> If all the licenses are moot, then this opens a very big can of worms

We are talking ‘de facto’ here, not ‘de jure’. It may be legally problematic, but anything made public once is never going back in the box.


> I don't care. These are my terms.

That sounds like a you problem, not a us problem.

As of yet, no court has said that any of this is illegal.

So tough luck. Go take it to the supreme court if you disagree, because right now it actually seems like people can do almost whatever they want with these AI tools.

Your objection simply doesn't matter, until there is a court case that supports you. You can't do anything about it, if that doesn't happen.


> That sounds like a you problem, not a us problem.

This stance allows me to do whatever I want with any software or work you put out there, regardless of the license you attach to it, since it's your problem, not mine.

However, this is not the mode I operate ethically.

> As of yet, no court has said that any of this is illegal.

I assume this will be tested somehow, sometime. So I'm investing in popcorn futures.

> Your objection simply doesn't matter, until there is a court case that supports you. You can't do anything about it, if that doesn't happen.

You know, this goes both ways. The same will apply to your works, by your own reasoning.


> this stance allows me to do whatever I want with any software

Actually, no it doesn't. This topic is about AI training on code.

Courts have not held that this is illegal.

But there are absolutely other things, that people might do with code, that break copyright law.

> it's your problem, not mine.

Oh, but it would be your problem as well, if you break the law, and someone else sues you for it.

That's the difference. AI training is not against the law. Other things, that you are imagining in your head right now, very well could be, and you could lose.

> Same will be very valid for your works

Not if what you are hypothetically doing breaks the law, and AI training doesn't break the law.

So that's the difference, which makes the reasoning legitimate.


> Courts have not held that this is illegal.

Laws are just a codified version of ethics. Just because something is not codified in law doesn't mean it's ethically correct, and I hold ethics over laws. Some people call this conscience; others call this honor.

Just because something is not deemed illegal doesn't mean it's ethical. These are different things. The world has worked under honor and ethical codes for a very long time, and still works under these unwritten laws in a lot of areas.

Science, software and other frontiers value ethics and principles a great deal. Some niches like AI largely ignore these, and I find this disturbing.

However, some people prefer to play the game with the written rules only, and as I said, I'm investing in popcorn futures to see what's gonna happen to them.

I might tank and go bankrupt, of course, but I will sleep better at night for sure, and that is more important to me in the end.

I'm passionate about computers, yes. This is also my job, yes, but I'm not the person who'll do reckless things just because an incomplete code of written ethics doesn't prevent me from doing them.

I'd rather not do anything to anyone which I don't want to receive. IOW, I sow only the seeds which I want to reap.


> Laws are just codified version of ethics.

And a quite reasonable code of ethics is that people do not have absolute, complete control over their intellectual property, and instead only have the ability to control it in certain circumstances.

Things like fair use, which makes this legal, exist for many very good reasons.

So yes, the code of ethics that society has decided on includes perfectly reasonable exceptions, such as fair use, and it is your problem, not ours, that you have some ridiculous idea that people should have complete, 100% authoritarian control over their IP.

And no, people not having infinite control over IP does not allow you to extend this reasonable exception to you being able to do literally anything with other people's IP.


You're completely right. My premise is not extending the (court tested, honored) license I attach to my code.

What I say with the GPL license is clear:

If you derive anything from this code base, you're agreeing and obliged to carry this license to the target code base (The logical unit in this case is a function in most cases).

So the case is clear. AI is a derivation engine. What you obtain is a derivation of my GPL licensed code. Carry the license, or don't use that snippet, or in AI's case, do not emit GPL derived code for non-GPL code bases.

This is all within the accepted ethics & law. Moreover, it's court tested too.


> you're agreeing and obliged to carry

People are not agreeing though.

They are not agreeing, because there is a perfectly reasonable ethical and legal principle called fair use, which society has determined allows people to engage in limited use of other people's IP, no matter what the license says.

> Carry the license, or don't use

Or, instead of that, people could reasonably use fair use, and ignore the license, as fair use exists for many good legal and ethical reasons.

And no, you do not get to extend that out, to doing anything you want to do, just because there is a reasonable exception called fair use.

> do not emit GPL derived code for non-GPL code bases

Or, actually, yes do this. This is allowed because of the reasonable ethical and moral principle called fair use, which allows people to ignore your license.


I will agree to disagree on your overly broad definition of fair use, which consists of ingesting a whole code base and using significant parts of it in another code base, with or without derivation, while disregarding the license attached to the whole and/or its parts.

Thanks for the discussion, and have a nice day.

I may not further comment on this thread from this point.


> training on with GPL/LGPL licensed code without consent is not

That’s actually fine (kind of the idea of specifying a license). What is not fine is using that code in non-GPL licensed code.


> That’s actually fine...

Actually yes. I'm not against the tech. I'm against using my code without consent for a tool which allows to breach the license I put my code under.

IOW, if Copilot understood code licenses and prevented intermixing incompatibly licensed code while emitting results for my repository, I might have a slightly different stance on the issue.
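
Even a crude gate would be something. A hypothetical sketch, checking a suggestion's source license against the destination project before emitting it (the table is deliberately tiny and is not legal advice):

    # Which source licenses a snippet may carry into a project under a
    # given license. Tiny illustrative table only; not legal advice.
    COMPATIBLE = {
        "MIT":        {"MIT", "BSD-3-Clause", "Apache-2.0"},
        "Apache-2.0": {"MIT", "BSD-3-Clause", "Apache-2.0"},
        "GPL-3.0":    {"MIT", "BSD-3-Clause", "Apache-2.0", "LGPL-3.0", "GPL-3.0"},
    }

    def may_suggest(snippet_license: str, project_license: str) -> bool:
        """Surface a suggestion only if its source license may be mixed in."""
        return snippet_license in COMPATIBLE.get(project_license, set())

    assert may_suggest("MIT", "GPL-3.0")      # permissive code into GPL: fine
    assert not may_suggest("GPL-3.0", "MIT")  # GPL code into an MIT project: blocked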


Your post is a good example of the tu quoque fallacy[1].

[1] https://en.wikipedia.org/wiki/Tu_quoque


Well, it would be fallacious reasoning if I was using this as the basis of an argument.

I didn’t intend to argue anything or draw any conclusions. Just making an observation based on conversations with friends and coworkers.


This is a good example of sealioning.

(I kid)


> I’ve noticed that people tend to disapprove of AI trained on their profession’s data, but are usually indifferent or positive about other applications of AI.

In other words: the banal observation that people care far more when their stuff is stolen than when some stranger has their stuff stolen.


I look at IP differently.

For copyright, the act of me creating something doesn't deprive you of anything, except the ability to consume or use the thing I created. If I were influenced by something, you can still be influenced by that same thing - I do not exhaust any resources I used.

This is wholly different from physical objects. If I create a knife, I deprive you of the ability to make something else from those natural resources. Natural resources that I didn't create - I merely exploited them.

Because of this, I'm fine with copyright (patents are another story). But I have some issues with physical property.


I can think of two explanations for that off the top of my head.

The first is that people only recognize the problems with the things that they're familiar with, which you would kind of expect.

The other option is that there's a difference in the thing that people object to. My impression is that artists seem to be reacting to the idea that they could be automated out of a job, where programmers are mostly objecting to blatant copyright violation. (Not universally in either case, but often.) If that is the case, then those are genuinely different arguments made by different people.


> For myself, I am skeptical of intellectual property in the first place. I say go for it.

If we didn't live in a Capitalist society, that would be fair. But we currently do. That Capitalist society cares little about the well-being of artists unless it can find a way to make their art profitable. Projects like DALL-E and Midjourney pillage centuries of human art and sell it back to us for a profit, while taking away work from artists who struggle to make ends meet as it is. Software developers are generally less concerned about Copilot because they're still making 6 figures a year, but they'll start to get concerned if the technology gets smart enough that society needs fewer developers.

An automated future should be a good thing. It should mean that computers can take care of most tasks and humans can have more leisure time to relax and pursue their passions. The reason that artists and developers panic over things like this is that they are watching themselves be automated out of existence, and have seen how society treats people who aren't useful anymore.


I don't know specifically what DALL-E was trained on, but if it's art for which the artists have not consented to it being used to train AI, then that's problematic. I haven't seen any objections to DALL-E on that basis specifically though, whereas all the discussion of Copilot is around the fact that code authorship & Github accounts are not intrinsically tied together, making it impossible to have code authors consent to their code being used, regardless of what ToS someone's agreed to.

> For myself, I am skeptical of intellectual property in the first place. I say go for it.

I'm in a similar boat but this is precisely the reason I object so strongly to Copilot. IP has been invented & perpetuated/extended to protect large corporate interests, under the guise of protecting & sustaining innovators & creative individuals. Copilot is a perfect example of large corporate interest ignoring IP when it suits them to exploit individuals.

In other words: the reason I'm skeptical of IP is the same reason I'm skeptical of Copilot.


Stable Diffusion and DallE were both trained on copyrighted content scraped from the internet with no consent from the publishers.

It's quite a common complaint because some of the most popular prompts involve just appending an artist's name to something to get it to copy their style.


> For myself, I am skeptical of intellectual property in the first place. I say go for it.

Me too. I think copyright and these silly restrictions should be abolished.

At the same time, I can't get over the fact these self-serving corporations are all about "all rights reserved" when it benefits them while at the same time undermining other people's rights. Microsoft absolutely knows that what they're doing is wrong. Recently it was pointed out to me that Microsoft employees can't even look at GPL source code, lest they subconsciously reproduce it. Yet they think their software can look at other people's code and reproduce it? What a load of BS.

I'll forgive them for going for it the second copyright is gone. Then it won't be a crime for any of us to copy Windows and Office either. You bet we're gonna go for it too.


> Then it won't be a crime for any of us to copy Windows and Office either. You bet we're gonna go for it too.

Don't worry. At that time all of the available hardware will refuse to run any software unless it comes with a signed license from one of the big three.


> [people] are usually indifferent or positive about other applications of AI

That sounds like the pro-innovation bias: https://en.m.wikipedia.org/wiki/Pro-innovation_bias


> I’ve noticed that people tend to disapprove of AI trained on their profession’s data, but are usually indifferent or positive about other applications of AI.

This is a fascinating observation and I think there's a lot of truth to it. But maybe our inference should be that these systems mistreat each of us, even if it's difficult to see unless it's falling on you.

Maybe a more important question than whether or not this is a violation of intellectual property is whether this is a violation of human dignity, not that it's illegal (though in this case, it may be) but that it's extremely rude in a way that we don't necessarily have the vocabulary for yet.


When it comes to solving a problem, I want it to emit whatever solves the problem.

When it comes to being an AI that understands coding concepts, I don't want it to regurgitate code verbatim.

When it comes to being a product, I don't want it to plagiarize.


Again, as a lawyer, I think it's really important to focus on intent. The Constitution gives us "To promote the Progress of Science and useful Arts" (which we've expanded.)

So question one in the back of our heads should be "Are we promoting progress here?" That most often means protecting the little guy, and that's why I think it's mostly necessary, and also must be evaluated very skeptically.


Define progress.

Good luck.


I mean, I don't have to do it alone. There's this thing called "the law" that's put in a little work on this :)

Fair enough; from a legal framework that's a highly practical way to move forward. I don't think many people like to point at the current status quo, with all its flaws, and immediately think "the courts will help me!"; but you did mention you are a lawyer and I respect that you are working in the bounds of what is possible rather than what is ideal.

While I personally wouldn’t care about it, I can understand someone taking offense at copilot for spitting out their code verbatim and claiming it isn’t theirs.

Neither GPT nor Dall-e produces content that anyone can point to and say “they are laundering MY work”.

The closest we’ve been to that point is the image generators spitting out copyright watermarks, but they are not clearly attributable to any one single image (afaik).


I am strongly against intellectual property, but I don’t like this idea that any one of us will get in big trouble for openly violating IP restrictions, but if one of these big companies scoops up copyrighted works for their AI it’s fine? The double standard is unfair. This all seems like a great opportunity for big companies to encourage the growth of Creative Commons, which would benefit everyone, but instead they’re making large private datasets only they control.


I think you might be onto something with your conclusion.

Nonetheless, that’s problematic for folks relying on copyright as of now.

I feel for artists here; devs won’t go hungry or jobless.


> I know artists who are vehemently against DALL-E, Stable Diffusion, etc. and regard it as stealing, but they view Copilot and GPT-3 as merely useful tools.

An example: https://twitter.com/DaveScheidt/status/1578411434043580416

> I also know software devs who are extremely excited about AI art and GPT-3 but are outraged by Copilot.

The fear is not unwarranted though. I can clearly see AI replacing most jobs, not just in tech but in art, crafts, music, and even science. There probably will be no field untouched by AI in this decade, and none left unreplaced by the next.

We have multiple extinction events for humanity lined up: Climate Change, Nuclear Apocalypse and now AI.

We will have to not just work towards reducing harm to the planet, but also work towards stopping meaningless wars and figuring out how to deal with the unemployment and economic crisis looming on the horizon. In the end even the "elites" would suffer (or will they be the first to, depending on how quickly civilization goes towards anarchy?).

Can't say for sure. But definitely gloomy days ahead.


I often quote this comment regarding AI advances and jobs [0]:

> Yes, many of us will turn into cowards when automation starts to touch our work, but that would not prove this sentiment incorrect - only that we're cowards.

>> Dude. What the hell kind of anti-life philosophy are you subscribing to that calls "being unhappy about people trying to automate an entire field of human behavior" being a "coward". Geez.

>>> Because automation is generally good, but making an exemption for specific cases of automation that personally inconvenience you is rooted in cowardice/selfishness. Similar to NIMBYism.

It's true cowardice to assume that our own profession should be immune from AI while other professions are not. Either dislike all AI, or like it. To be in between is to be a hypocrite.

For me, I definitely am on the side of full AI, even if it automates my job away, simply because I see AI as an advancing force on mankind.

[0] https://news.ycombinator.com/item?id=32461138#32463198


It’s not hypocrisy to think some jobs shouldn’t be automated. I don’t teach, but I definitely want human teachers teaching my kin, not AI teachers.

Perhaps, or perhaps not. We have not yet seen the true pedagogical reach of AI. If AI can teach better than humans (something like the Matrix's brain uploading of training), then I will want that rather than have a human teach me.

I'm actually fine with both. I think copyright/IP related to software needs to be toned down a lot. Software patents abolished.

In my opinion the only thing that should be an infringement regarding code is copying entire non-trivial files or entire projects outright.

A 100-line snippet should not be copyrightable. Only the entire work, which you could think of as the composition of many of those snippets.


For what it's worth, I think it's all very impressive and amazing but also really sketchy. Or at least, I think the developers of these systems need to be very careful about what they are allowed to do with what content, and I don't trust that they are doing that, because of articles like this one and others.

I know far fewer programmers offended by Copilot than artists offended by Stable Diffusion.

This is mostly an irrelevant red herring that sets professions against each other. Instead we should cooperate on the costly yet necessary decision of instituting a basic income, especially prioritizing professions about to be superseded by modern ML.

Obviously, our decision-making class views the topic of instituting a realistic basic income right now as something extremely unpleasant, and so it goes.

People who helped to bootstrap the AI should be compensated, at the very least by being able to live a modest lifestyle without having to work. Simple as.


Does it cost money to produce high quality training sets? Yes. Would an organization or individual be willing to pay for samples for their data set? Absolutely. It seems pretty easy to discern that value is being taken from people.

The last time I happened to point this out[1], all I got was a bunch of HNers nitpicking the words I chose, but not addressing the core issue.

I have to assume this is just people being protective of their own profession and, consequently, setting a high bar for what constitutes performance in that profession.

[1] https://news.ycombinator.com/item?id=32895251#32895709


There's a substantial difference between being trained and being overfit to the point of repeating training data 1:1. Overtraining is a bug in a model, not a feature. For example, Stable Diffusion 1.4 is overtrained on one specific Aivazovsky painting, as well as on some works by other artists, like the Mona Lisa and Van Gogh's Sunflowers (he painted several of those). Copilot was famously overtrained on Carmack's fast inverse square root code, so they had to block it programmatically after receiving bad publicity. Neither is intended by the model authors; this is a flaw.
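
A toy sketch of that distinction, using ordinary curve fitting rather than anything like the real diffusion/Codex architectures: give a model as many parameters as training points and it reproduces the training set 1:1, noise and all, instead of generalizing.

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0.0, 1.0, 5)
    y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.1, size=5)

    # Degree 4 through 5 points: enough capacity to memorize the samples
    # exactly, noise included; the "repeat training data 1:1" failure.
    memorized = np.polyfit(x, y, deg=4)
    print(np.allclose(np.polyval(memorized, x), y))   # True: overfit

    # Degree 1 lacks the capacity to memorize, so it can only generalize.
    general = np.polyfit(x, y, deg=1)
    print(np.allclose(np.polyval(general, x), y))     # False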

I must be in the minority of programmers in that I really, really like Copilot but am indifferent to Stable Diffusion/DALL-E/Midjourney.

Copilot on Python makes me 5x more productive. I used Copilot in beta for a year and continue paying for it now.

For example: I can make a command-line data wrangling script for a novel data set in a few minutes with a few prompts, with a full complement of extras (proper argparse parameters with sane defaults, ready to import, etc.). # reasonable comments included for free as well
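
To make that concrete, here is roughly the shape of scaffold being described. This is a hand-written sketch rather than actual Copilot output, and the file and column names are hypothetical:

    import argparse
    import csv
    import sys

    def parse_args(argv=None):
        # argparse parameters with sane defaults, per the workflow above.
        p = argparse.ArgumentParser(description="Summarize a numeric CSV column.")
        p.add_argument("input", nargs="?", default="-",
                       help="input CSV file, or - for stdin (default: -)")
        p.add_argument("--column", default="value",
                       help="numeric column to summarize (default: value)")
        p.add_argument("--delimiter", default=",",
                       help="field delimiter (default: ,)")
        return p.parse_args(argv)

    def main(argv=None):
        args = parse_args(argv)
        fh = sys.stdin if args.input == "-" else open(args.input, newline="")
        values = [float(row[args.column])
                  for row in csv.DictReader(fh, delimiter=args.delimiter)]
        print(f"n={len(values)} min={min(values)} max={max(values)} "
              f"mean={sum(values) / len(values):.3f}")

    if __name__ == "__main__":   # "ready to import": no work at import time
        main()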

Before Copilot I could do the same in about 20-30 minutes, but my code would be a mess with little commenting. I would spend 30-60 minutes just looking up docs for various libraries.

Without Copilot, if all I was doing was writing data wrangling scripts 4 hours a day, I could approach this Copilot-like productivity for a single task.

However with Copilot I can switch problem domains very quickly and remain productive.

Interestingly, on something like CSS or Javascript - Copilot is helping only slightly, maybe because my local training set is insufficient and my web-dev prompts are too generic.

So I think AI can be a fantastic force multiplier for a skillset you are already reasonably familiar with. I can handle the 5-10% WTF Python code that Copilot produces.

I do not particularly like copyrights and do wish Copilot had been trained on private Microsoft code as well.


Not your repo, not your code.

I celebrate Microsoft's shameless plundering of GitHub to create new products that increase productivity. The incredible thing is that people trusted Microsoft to use their code on their terms to begin with. This is a company that has been finding ways to make open source code into a proprietary product since the 90s.

Nobody can stop people from replicating what Microsoft did in the long run anyways. Eventually any consumer with enough access to source code will be able to make their own copilot. Even if copilot is criminalised Microsoft can just sell access to the entire GitHub dataset and let other people commit the "crime". Then you're right back where we started with having to sue the end users of copilot for infringement instead of Microsoft.

Use private repos or face the inevitability that copilot-like products will scrape your code.


Ok. So instead of whining about it on Twitter, sue GitHub. No matter what you think of Copilot, establishing some case law on AI-generated code will be beneficial to everyone.

Whining about it on Twitter = free and easy

Suing Github = signing up for a ~decade long incredibly expensive and time-consuming legal battle against one of the richest companies in the world

There may be a slight difference in effort between these two options.


Not to mention Microsoft could countersue using their enormous patent war chest, which they have a history of doing[0]

[0] https://techcrunch.com/2012/03/22/microsoft-and-tivo-drop-th...


It goes beyond code. Also photos, art, text, etc. Be careful what you wish for. Whether you like it or not, with a stroke of a pen Congress or the Supreme Court in the US could probably wipe out the legal use of a huge amount of the training data used for ML.

Good.

Good! Large corporations shouldn’t be able to profit off of other people’s data without consent or compensation.

Great! I assume you believe all search engines should be illegal then?

Search engines don't sell the information of others; they sell certain metadata of that information, namely, the location of that information.

And excerpts of that information in many cases.

Accessing a computer system without permission is illegal. Search engines operate under the assumption that they have permission to access any public available server unless explicitly forbidden.

If a company or person assumes they have copyright permission to any publicly accessible work, they will quickly find out that that assumption is wrong, and that they require explicit permission.


>Search engines operate under the assumption that they have permission to access any public available server unless explicitly forbidden.

And why should opt-out be a reasonable norm? To be clear, the internet (among many other things) breaks down if every exchange of information is opt-in. Sharing of photographs taken in public places is another example. But the internet basically functions because people share information on an opt-out basis (that may or may not even be respected).


You're going to chip in, right? And help fundraise? Because it would take millions of dollars to mount a successful legal attack over this.

The repo he linked to on twitter is a public repo though. Am I missing something?

https://twitter.com/DocSparse/status/1581462433335762944


I think they're more concerned about it repeating code w/o ownership/copyright labels

Public != copyright free.

> The repo he linked to on twitter is a public repo though. Am I missing something?

I dunno, the title says it used public code when it was meant to block public code.


Same issue with Stable Diffusion/NovelAI and certain people's artwork (eg Greg Rutkowski) being obviously used as part of the training set. More noticeable in Copilot since the output needs to be a lot more precise.

Lawmakers need to jump on this stuff ASAP. Some say that it's no different from a person looking at existing code or art and recreating it from memory or using it as inspiration. But the law changes when technology gets involved already, anyway. There's no law against you and I having a conversation, but I may not be able to record it depending on the jurisdiction. Similarly, there's no law against you looking at artwork that I post online, but it's not out of question that a law could exist preventing you from using it as part of an ML training dataset.


> Some say that it's no different from a person looking at existing code or art and recreating it from memory or using it as inspiration.

Hah, no, the model encodes the code that it was trained on. This is not "recreating from memory", this is "making a copy of the code in a different format." (Modulo some variable renaming, which it's probably programmed to do in order to obscure the source of the code.)


If I hear a song on the radio and then sing it, I still won't own the copyright.

Not really, unless you can produce a verbatim copy of existing artwork out of Stability's Stable Diffusion.

"I cannot produce proof but I dislike the argument, so I must downvote" - the HN community being the rational actor that it always is

> Lawmakers need to jump on this stuff ASAP

Maybe not as soon as possible; there are probably a lot of other problems that need more immediate attention.

But some attention, maybe? Nobody likes it if someone reposts their post, but some reposting could be seen as going "viral", so as a creator you're cool with it. If someone else does the whole going "viral" thing with your work, maybe it's less cool.

In the spirit of the thread: if Copilot generates some code to create a Stable Diffusion prompt explicitly for "Greg Rutkowski" art, and it writes code your friend wrote for a small gig, well, can you reuse his code for your own gig?

Can "Greg" even claim it was his art, in your opinion, even though it's probably not a verbatim copy?


Now here's a question:

Suppose we trained the open AI model on the entire corpus of pop hits from about 1960 onwards.

What are the chances it would get sued for copyright infringement?

If the derivative nature is clear when the same model is trained on popular songs, then it should be just as clear for code (or visual art, or a number of other domains).

Not arguing for current copyright law, just pointing out the inconsistencies.

For that matter, what would happen if you asked Copilot for a set of Java headers. Asking for a friend!


This isn't the situation Copilot is in.

A more analogous situation would be if the AI model occasionally returned the entirety of "Baby One More Time" by Britney Spears. Yes, I think you'd be sued if you passed off Baby One More Time as an original work just because you got it from an open AI tool.


Yes, lawmakers should come ASAP and make everybody pay rents that will ultimately not even be pocketed by the median creator/developer. What a fantastic vision! What a feast for the capitalist beast …

Is the code in question even covered by copyright in the first place? It seems utilitarian in nature.

Oh, the comments! Those are covered by copyright for sure.


You know, I make it a habit of not trying to get upset by downvotes but this is really absurd. What am I saying that is incorrect? Am I being rude? What exactly do you disagree with?

Like, should I just stop interacting with people on this website? Is that the intent? To make me just go away?

> It seems utilitarian in nature

That has no bearing on copyright.


It absolutely does.

Here’s like the first link after a DuckDuckGo search for “copyright utilitarian”:

However, copyright law does not extend to useful items. Therefore, complications may arise when sculptural works are also “useful” items. In these instances, copyright law will protect purely artistic elements of a useful article as long as the useful item can be identified and exists independently of the utilitarian aspects of the article (this concept is sometimes called the “separability test”). 17 U.S.C. §. A “useful article” is an article that has a purpose beyond pure aesthetic value.

https://www.rtlawoffices.com/articles/can-i-copyright-my-des...

Is it safe to assume the rest of the downvotes were from people who were also incorrect?


It'll be interesting to see if the consequence of this is people open sourcing things way less. It would give another layer of irony to OpenAI's name.

if you're posting your code publicly on the web, it's hard to get upset that people are seeing/using it

It's under a very specific license though. By your logic it's OK to train AI on leaked Windows code then. It is/was publicly on the web.

code licenses will be irrelevant in a few years if you are able to refactor anything you want using AI.

Unless lawyers from the music industry step in.

While I agree with the first portion of your rebuttal, the second portion makes no sense as leaked code is not "you" putting it on the internet. It would be a nefarious actor doing so.

Yes? It's production code that is supposed to work (keyword: supposed). I'd like the code suggester to also be trained on AAA games' source leaks.

Can I introduce you to a concept called copyright?

Is it fine if an author publishes a short story publicly on the web for someone else to submit it to a contest as their own work?


Please don’t use plagiarism as an argument for keeping copyright. They are not related. Plagiarism is already illegal or disallowed for obvious reasons, but copyright is a different matter.

I brought up plagiarism because it's what happened in this case. Maybe I shouldn't've mentioned copyright at all.


What might be going on here is that Copilot pulls code it thinks may be relevant from other files in your VS Code project. You have the original code open in another tab, so Copilot uses it as a reference example. For evidence, see the generated comment under the Copilot completion: "Compare this snippet from Untitled-1.cpp" - that's the AI accidentally repeating the prompt it was given by Copilot.
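
If that's right, the mechanism would be something like the sketch below. This is purely a guess at the prompt assembly (the real format isn't public); the comment-header convention is inferred from the "Compare this snippet from ..." leak:

    # Hypothetical reconstruction of neighboring-tab context injection.
    def build_prompt(open_tabs, current_path, code_before_cursor):
        parts = []
        for path, text in open_tabs.items():
            if path != current_path:
                head = "\n".join(text.splitlines()[:30])  # trim for token budget
                parts.append(f"// Compare this snippet from {path}:\n{head}")
        parts.append(code_before_cursor)  # the file being edited, up to the cursor
        return "\n".join(parts)

    # A model completing such a prompt can echo the neighbor's code back, and
    # occasionally echo the "Compare this snippet from ..." header itself.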


I just tested it myself, and I most certainly do not have his source open, and it reproduced his code verbatim from just the function header in a random test C file I created in the middle of a Rust project I'm working on.

Ah ok.

Do you have Public Code Suggestions disabled? I tried in an empty editor and it wouldn't do it.


I didn't feel like wading into that Twitter thread, but in the screenshot one will notice that the code generated by Copilot has secretly(?) swapped the order of the interior parameters to "cs_done". Maybe that's fine, but maybe it's not; how in the world would a Copilot consumer know to watch out for that? Double extra good if a separate prompt for "cs_done" commingles multiple implementations where some care and some don't. Partying ensues!
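
The general hazard being pointed at: when adjacent arguments have compatible types, a swap parses and runs without complaint and only shows up in behavior. A hypothetical sketch (not cs_done's real signature):

    # Hypothetical stand-in for a cleanup helper like cs_done: frees two
    # scratch buffers in order, then returns the result if ok is truthy.
    def cs_done(C, w, x, ok):
        print(f"freeing {w!r}, then {x!r}")
        return C if ok else None

    C, w, x = {"nz": 3}, "int workspace", "double workspace"
    cs_done(C, w, x, 1)   # original argument order
    cs_done(C, x, w, 1)   # swapped: no error, no warning, different behavior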

Not to detract from the well-founded licensing discussion, but who is it that finds this madlibs approach useful in coding?


In my opinion Copilot helps with the easy, boring stuff by typing what I likely would have typed anyway. The harder the code to write, the less likely I'd be to lean on Copilot.

The only algorithm I'll accept from copilot is a rot13 function, it got the rot26 one wrong.

And the easy, boring, legally unproblematic stuff can already be handled by good IDE completion features. As a bonus those are actually tested and deterministic.

I use an IDE with plenty of smart completion. Copilot offers considerably more (otherwise I wouldn't use or pay for it).

Well, this can pose a serious risk to companies and their cloud strategy based on GitHub.

Can these enterprises really make sure that their code won’t be used to train Copilot? I am skeptical.


What keeps him from suing if he is so sure?

Those pretty little licenses are a waste of storage if no one enforces them.


Money. Suing is often survival of the richest.

Of course it does. What are you going to do? Sue them?

Not entirely sure how this could happen! “naikrovek” assured me not three days ago on this very site that I was “detached from reality” [1] for thinking that this would happen again.

To be fair I thought it might be at least a week or two.

[1]: https://news.ycombinator.com/item?id=33194643


I just tested it myself on a random C file I created in the middle of a Rust project I'm working on. It reproduced his full code verbatim from just the function header, so it clearly does regurgitate proprietary code, contrary to what some people have said. I do not have his source, so co-pilot isn't just using existing context.

I've been finding co-pilot really useful but I'll be pausing it for now, and I'm glad I have only been using it on personal projects and not anything for work. This crosses the line in my head from legal ambiguity to legal "yeah that's gonna have to stop".


what proprietary code? the guy on Twitter is seeing his own GPL code being reproduced. nothing proprietary there.

do you have the "don't reproduce code verbatim" preference set?


He owns the copyright to the code, and the code is not in the public domain, therefore it is proprietary code.

That's not how anybody uses the word proprietary when dealing with software licensing. It's a term of art that stands in contrast to open source licenses.

For the record, I don't typically think in terms of the open source community.

I grant that if most people here are using it that way, I was likely wrong about how it is typically used by the open source community; I followed up with a reply saying it would likely be more correct for me to have said "improperly licensed" to be included in the training set.

Still, its being private means it probably shouldn't be in the training set anyway, regardless of license, because in the future truly proprietary code could be included, or code without any license, which reserves all rights to the creator.


that is not what proprietary means and you know it.

Sorry, it would likely be more correct to say "improperly licensed" code and not proprietary. Still, for someone like me, the possibility of having LGPL or any GPL-licensed code generated into their project is a solid no thanks. I know others may think differently, but those are toxic licenses to me.

Not to mention this code wasn't public, so it's kind of moot; having someone's private code generated into my project is very bad.

As to the option: I do not have it set, and I wasn't even aware of the option, but it's pretty silly to me that it's not on by default, or even really an option. It should probably be enabled with no way to toggle it off without editing the extension.



You said, "don't reproduce code verbatim" as the option. Maybe you should use the actual name of the option if you want people to know what you're talking about. The "with public code blocked" option I am aware of, and is distinctly different from what you said. Don't be a dick, and if you want to be a dick, be right the first time or piss off.

well unlike you I am human. I say things differently at different times.

The code is not GPL but is copyrighted in his name.

some of it is lgpl and some of it is gpl. code contributed by others is licensed differently.

The license on other code, even when it's in the same repo, doesn't change the license on the code that is under discussion here.

It is also allowed to be used under LGPL terms: https://github.com/DrTimothyAldenDavis/SuiteSparse/blob/mast...

But that doesn't make it any better.


Yes, but I was pointing that out to my parent poster, who erroneously said the code was GPL licensed.

GPL'd code has a copyright owner.

those two things exist at the same time.

try reading a licence now and again!


Slow down there with the snarky comments. I never said GPL'd code doesn't have a copyright owner.

Any code is proprietary by default. The GNU GPL lifts some restrictions in exchange for more code, but it doesn't work when the license is broken. Look at cases about GPL violations.

Copilot doesn't obey the GPL, so they need to obtain written permission and pay license fees to be able to use the code in their product.


Thank you for that, though it seems you replied to the wrong comment. That doesn't have to do with what I wrote.

Searching for the function names in his libraries, I'm seeing some 32,000 hits.

I suspect he has a different problem which (thanks to Microsoft) is now a problem he has to care about: his code probably shows up in one or more repos copy-pasted with improper LGPL attribution. There'd be no way for Copilot to know that had happened, and it would have mixed in the code.

(As a side note: understanding why an ML engine outputs a particular result is still an open area of research AFAIK.)


Yeah that's a mess, but that's way too much legal baggage for me, an otherwise innocent end user, to want to take on. Especially when I personally tend to try and monetize a lot of my work.

I understand there's no way for the model to know, but then it's really on Microsoft to ensure no private, poorly licensed, or proprietary code is included in the training set. That sounds like a very tall order, but I think they're going to have to, otherwise they're eventually going to run into legal problems with someone who has enough money to make it hurt for them.


Agreed. Silver lining: MS is now heavily incentivized to invest in solutions for an open research problem.

Open source code is open source only when its license is obeyed. When the license is not obeyed, e.g. the copyright notice is not reproduced, it should be treated as private code, except for dual-licensed code.

Of course the model can know if the code is repeated in multiple repositories with different licences. The people who maintain copilot simply don't care to make it do so.

Expanding on that, even if Microsoft sees the error of their ways and retrains copilot against permissively licensed source or with explicit opt-in, it may get trained on proprietary code the old version of copilot inserted into a permissively licensed project.

You would have to just hope that you can take down every instance of your code and keep it down, all while copilot keeps making more instances for the next version to train on and plagiarize.


> Microsoft

> sees the error of their ways

You must be new here.


I thought the same thing. But then shouldn't CP look at things it's not supposed to use and see if that's happened? How is that any different from you committing your API to Platform X and, shortly thereafter, Platform X reaching out to you... because GH let them know?

This is exactly what I was thinking. It's still a legal headache for Microsoft, but it's not like they're just blatantly ignoring the license.

"It's too hard" isn't a valid reason for me to not follow laws and/or social norms. This is a predictable result and was predicted by many people; "oops we didn't know" is neither credible nor acceptable.

It’s not “oops we didn’t know”; it’s “someone published a project under a permissive license which included this code.”

If your standard is “Github should have an oracle to the US court system and predict what the outcome of a lawsuit alleging copyright infringement for a given snippet of code would be” then it is literally impossible for anyone to use any open source code ever because it might contain infringing code.

There is no chain of custody for this kind of thing which is what it would require.


Exactly, the chain of custody is absolutely required for this to be legal because no oracle can exist. It must be able to attribute exactly who contributed the suspect code. It must be able to handle the edge case where some humans might publish code without permission.

Either that, or we effectively get rid of software copyright, as Copilot can be used (or even claimed to have been used) to launder code of license restrictions. E.g.: no, I didn't copy your code; I used Copilot and it copied your code, so I did nothing wrong.


Right, so we need a system for when a dev goes and grabs code-snippets from blogs and open-source freely licensed projects on e.g. github in which they can say that the code is from so-and-so source? So like a way to distribute and inherit git blame?

This takes place with or without copilot. The problem would be people copying code and releasing it under a different license.

This reminds me of my 4-year-old daughter. She often comes home from kindergarten with new toys. When I ask her where she got them, she tells me that her friend gave them to her as a gift. When I dig deeper and ask around, it turns out that the friend doing the gifting was not the real owner of the gift. I can see why it could be difficult for children to understand the concept of ownership, and that you should not gift things to others that are not your own.

So in this case Copilot just looks at the situation as "someone gifted me this", and does not question whether the person gifting it was the real owner of the gift.


> and does not question if the person gifting was the real owner of the gift

If you can figure out a method of determining whether someone owns the code that doesn't involve "try suing in court for copyright infringement and see if it sticks", then we're kinda stuck. Just because a codebase contains an exact or similar snippet from another codebase doesn't mean that snippet reaches the threshold of copyrightable work. And the reverse: just because two code snippets look wildly different doesn't mean it's not infringement, and detecting that automatically is tantamount to solving the halting problem.

The thing you want for software to actually solve this is chain of custody which we don't have. If you require everyone assume everyone else could be lying or mistaken about infringement then using any open source project for anything becomes legal hot water.

In fact, when you upload code to GitHub you grant them a license to do things like "display it", which you can't grant if you don't actually own the copyright or have a license. So even before the code is ever slurped into Copilot, the same exact legal situation arises as to whether GitHub is legally allowed to host the code at all. Can you imagine if, when you uploaded code to GitHub, you had to sign a document saying you owned the code and indemnifying Microsoft against any lawsuit alleging infringement? Oh boy, people would not enjoy that.


I'll flip it around. If you can't figure out if the code is properly copyrighted, and can't afford to face consequences, don't use it.

If someone created an AI for making movies, and it started spitting out Star Wars and Marvel stuff, you can bet them saying "we trained it on other materials that violate copyright" wouldn't be enough. They are banking on most devs not knowing, caring, or having the ability to follow through on this.

I am going to make a robot that burns your house down. You might think this is unethical, but what do you expect me to do? Implement an oracle to the US court system?

You might think it's unreasonable to build such a house-burning robot, but you have to realize that I actually designed it as a lawn-mowing robot. The robot will simply not value your life or property because its utility function is descended from my own, so may burn your house down in the regular course of its duties (if it decides to use a powerful laser to trim the grass to the exact nanometer). Sorry neighbor.

What do you expect me to do? NOT build this robot? How dare you stand in the way of progress!


It doesn't matter that there is no way for Copilot to know what happened; doing something illegal because hundreds of people did it before is never a valid excuse under the rule of law, nor is "I didn't know it was illegal". That holds regardless of whether it's copying code without permission or jaywalking.

Well yes, there'd be no way for the copilot model, as currently specified and trained, to know.

But it IS possible to train a model for that. In fact, I believe ML models can be fantastic "code archaeologists", giving us insights into not just direct copying, but inspiration and idioms as well. They don't just have the code, they have commit histories with timestamps.

A causal fact which these models could incorporate is that we know data from the past wasn't influenced by data from the future. I believe that is a lever to pry open a lot of wondrous discoveries, and I can't wait until a model with this causal assumption is let loose on Spotify's catalog, and we get a computer's perspective on who influenced whom.

But in the meantime, discovering where copy-pasted code originated should be a lot easier.


Ah, a plagiarism checker that can understand simple code transformations and find the original source? Sounds like a good idea for patent trolls, and I have no idea how/if copyright laws can apply in this case. Does copying the idea but not the code verbatim constitute copyright violation?

The patent troll version of the algorithm needs the victim's bank balance as input too. In fact that's probably all it needs.

It would be much more valuable for people who care about the truth.


There'd be no way for Copilot to know that had happened? What? YT uses Content ID. GH could set up a similar program for OSS.

>GH could set up a similar program for OSS.

What a nightmare.

I'd say that constant code copying is massively pervasive, with no regard to licensing, always has been, and that's not really a bad thing, and attempts to stop it are going to be far more harmful than helpful.


> There'd be no way for Copilot to know that had happened

All lines are associated with a commit, which has an author/commit date. A reasonable guess as to which snippet was made first can be done.
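
A rough sketch of that heuristic, shelling out to git. The repo paths are hypothetical, and this deliberately ignores rebases and rewritten history, which is why it can only ever be a guess:

    import subprocess
    from datetime import datetime

    def earliest_author_date(repo, path):
        # %aI is the author date in strict ISO 8601; --reverse lists oldest first.
        out = subprocess.run(
            ["git", "-C", repo, "log", "--reverse", "--format=%aI", "--", path],
            capture_output=True, text=True, check=True,
        ).stdout
        lines = out.splitlines()
        return datetime.fromisoformat(lines[0]) if lines else None

    a = earliest_author_date("/repos/original", "Source/cs_transpose.c")
    b = earliest_author_date("/repos/suspected-copy", "src/transpose.c")
    if a and b:
        print("original came first" if a < b else "suspected copy came first")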


> his code probably shows up in one or more repos copy-pasted with improper LGPL attribution.

Can Copilot prove that and link to the source LGPL code whenever it reproduces more than half a line of code from such a source?

Because without that clear attribution trail, nobody in their right mind would contaminate their codebase with possibly stolen code. Hell, some bad actor might purposefully publish a proprietary codebase full of stolen LGPL code, and run scanners on other products until they get a Copilot "bite". When that happens and you get sued, good luck finding the original open source code both you and your aggressor derive from.


>> his code probably shows up in one or more repos copy-pasted with improper LGPL attribution

That is why Copilot should always have been opt-in (explicitly asking original authors to provide their code for Copilot training). Instead, they are simply stealing the code of others.


So I’m working on a side project and it’s hosted on GitHub. Does this mean the code (which I consider precious and my own) can just be stolen and injected into my competitor’s codebase if they’re using co-pilot?

If this is the case, I can imagine people migrating off GitHub very quickly. I can also imagine some pretty nice lawsuits opening up.


A good rule of thumb is if you're worried about code being copied you really shouldn't put it on github. Even if most large companies respect copyright, that small studio in Russia certainly won't.

How would the small studio access my private repo ?

They wouldn't. If you have it in a private repository and you are the only one with access to it, you'll likely not run into this issue.

From other comments, this developer's "private" code was found in 30k+ public repositories with public attribution, which is what created this issue.

Presumably your private code is not also present or leaked to any public repositories.


Then this brings up a very, very interesting question: people have stolen code or leaked it into public repositories, and Microsoft is building a product that references that code.

What does

> with "public code" blocked

mean? Are you able to set a setting in GitHub to tell GitHub that you don't want your code used for Copilot training data? Is this an abuse of the license you sign with GitHub, or did they update it at some point to allow your code to be automatically used in Copilot? I'm not crazy about the idea of paying GitHub for them to make money off of my code/data.


The option to omit "public code" means it should, in theory, omit code that is licensed under such banners as the GPL. It does not mean "omit private repositories".

I should have pointed out that those were two separate questions.

Public code is a weird name to use for that use case.


> We built a filter to help detect and suppress the rare instances where a GitHub Copilot suggestion contains code that matches public code on GitHub. You have the choice to turn that filter on or off during setup. With the filter on, GitHub Copilot checks code suggestions with its surrounding code for matches or near matches (ignoring whitespace) against public code on GitHub of about 150 characters. If there is a match, the suggestion will not be shown to you. We plan on continuing to evolve this approach and welcome feedback and comment.

From the FAQ https://github.com/features/copilot/
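
The FAQ doesn't say how the matching is implemented, but a naive approximation of the described behavior might look like the sketch below (exact substrings only, after stripping whitespace; the real filter presumably uses an index rather than a linear scan):

    def normalize(code):
        # "Ignoring whitespace", per the FAQ: collapse it away entirely.
        return "".join(code.split())

    def matches_public_code(suggestion, public_corpus, window=150):
        s = normalize(suggestion)
        haystacks = [normalize(p) for p in public_corpus]
        if len(s) < window:
            return any(s in h for h in haystacks)
        # Any ~150-character run of the suggestion found verbatim in the
        # corpus means the whole suggestion gets suppressed.
        return any(s[i:i + window] in h
                   for i in range(len(s) - window + 1)
                   for h in haystacks)

    # Usage: if matches_public_code(candidate, corpus): drop the suggestion.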


That is for people using Copilot. I’d like a setting that tells GitHub to not scan my code at all. And I am curious about that sneaking into the terms in between me signing up and paying and them taking code for free.

I have also never heard of “public code” being used in that way.


> I’d like a setting that tells GitHub to not scan my code at all.

What about people forking/mirroring your code? Or people merely contributing code? There is no one-to-one correspondence between copyright holders and Github users.

Copilot should just comply with the license, that's it.


It seems they don't comply with licenses. And it's not possible to fork or mirror to private repositories.

It would be a lot less trouble for everyone if it was just a per repository setting.


A good first approximation of such a setting is to never publish on Github in the first place.

Of course, someone else can still upload your elsewhere-published code to GH. You cannot win.


WTF is «public code»?

It prints this code because you have it open in another editor tab. I wish people who don't know at all how it works would stop acting all outraged when they're laughably wrong.

Can you link to more info about this? If this is accurate, many people aren't aware.

Code Snippets Data

Depending on your preferred telemetry settings, GitHub Copilot may also collect and retain the following, collectively referred to as “code snippets”: source code that you are editing, related files and other files open in the same IDE or editor, URLs of repositories and files paths.

https://github.com/features/copilot/#faq


> OpenAI Codex was trained on publicly available source code and natural language, so it works for both programming and human languages. The GitHub Copilot extension sends your comments and code to the GitHub Copilot service, and it relies on context, as described in Privacy below - i.e., file content both in the file you are editing, as well as neighboring or related files within a project. It may also collect the URLs of repositories or file paths to identify relevant context. The comments and code along with context are then used by OpenAI Codex to synthesize and suggest individual lines and whole functions.

"How Does Copilot Work"


> It prints this code because you have it open in another editor tab.

People upthread have reproduced and demonstrated that that's not the issue here.

EDIT: Actually, OP says "The variant it produces is not on my machine." - https://twitter.com/DocSparse/status/1581560976398114822

> I wish people who don't know at all how it works would stop acting all outraged when they're laughably wrong.

Physician, heal thyself.


Hot take: AI will steal all our jobs. Get over it.


Feel free to send patches.

Why would I waste time doing this?

Because you were already willing to waste time panning his code on a public forum. Maybe do something constructive instead of destructive if your time is so precious.

> Maybe do something constructive instead of destructive if your time is so precious.

How does the first part of this follow from the second? Being constructive likely takes more time.


Ok, post your code for us to scrutinize

“AI-focused products/startups lack a business model aligning the incentives of both the company and the domain experts (Data Dignity)”

https://blog.barac.at/a-business-experiment-in-data-dignity

Yes I am quoting myself


I would imagine the root problem here is people taking copyrighted code, pasting it in their project and disregarding the license. To me this seems common, especially when it comes to toy, test and hobby projects.

I don't see how copilot or similar tools can solve this problem without vetting each project.


That's an entirely plausible explanation, but it doesn't mean that Microsoft has any less of a legal nightmare on their hands.

I'm not really sure what I think about this. How responsible should Microsoft be for someone's badly licensed code on their platform? If they somehow had the ability to ban projects using stolen snippets of code, I don't think I'd dare to host my hobby projects there.

If you can't trust that the code in a project is compatible with the license of the project then the only option I see is that copilot cannot exist.

I love free software and whatnot, but I have a feeling this situation would've been quite different if copilot was made by the free software community and accidentally trained on some non free code..


> I love free software and whatnot, but I have a feeling this situation would've been quite different if copilot was made by the free software community and accidentally trained on some non free code..

Precisely. Would it be okay for me to publish some code as GPL because my buddy gave it to me and promised that it was totally legit and I could use it and it definitely wasn't copy-pasted from one of the Windows source leaks?

> If you can't trust that the code in a project is compatible with the license of the project then the only option I see is that copilot cannot exist.

It might be possible to feed it only manually-vetted inputs, but yes; as it currently is, Copilot appears to be little but a massive copyright-infringement engine.


> Precisely. Would it be okay for me to publish some code as GPL because my buddy gave it to me and promised that it was totally legit and I could use it and it definitely wasn't copy-pasted from one of the Windows source leaks?

But where do you draw the line? What if you accidentally came up with the same or a similar solution to something in Windows? The code might not be from your friend either; it could be from N steps of copy-paste, rework, reformatting, refactoring, etc.


> But where do you draw the line? What if you accidentally came up with the same or a similar solution to something in Windows?

Yes, I agree that it's unclear how to deal with that in the general case at scale. Although cases like OP make me think that we could maybe worry about the grey area after we've dealt with the blatant copies.

> The code might not be from your friend either; it could be from N steps of copy-paste, rework, reformatting, refactoring, etc.

Well, my personal tendency would be to apply the same standard to Microsoft that they would apply to us. How many steps of removal is needed to copy MS proprietary code and it be okay?


> Yes, I agree that it's unclear how to deal with that in the general case at scale. Although cases like OP make me think that we could maybe worry about the grey area after we've dealt with the blatant copies.

The way I see Copilot's output is that it's already in the grey zone. As with other models like this, there are no snippets stored in the model. I can, for example, generate code similar to the cs_transpose function in Lua if I nudge it a bit. To me this seems equivalent to someone remembering exactly how a function works (to some extent..) and being able to write it in whatever language without copy-pasting.

So the output as far as I understand is very grey. Maybe there's something in the training part that can be discussed, but as I mentioned earlier I'm not sure what else you can do other than check the license of some code or avoid creating copilot in the first place.


> I'm not really sure what I think about this. How responsible should Microsoft be for someone's badly licensed code on their platform?

That's a really hard undersell of responsibility on the part of Microsoft/Github.

It seems as though they did approximately zero work to verify any of the code wasn't infringing. Things they could have tried but apparently didn't:

1) Ask developers to opt-in to copilot scanning of their repositories, and alongside that have them certify that they hold copyright over all lines of code included in the repository.

2) Use a training dataset of only public repositories listed under applicable pre-identified licensing schemes, from established groups, e.g. *BSD-licensed code from the various BSD OSes.

3) Seek out examples from standard libraries in other programming languages with suitable licenses.

It seems like they did nothing and just hoped. I can't see how anyone would try to rely on this thing in a commercial context after its proven to do this over and over. The well has been poisoned.


> How responsible should Microsoft be for someone's badly licensed code on their platform?

That's actually a very real problem that mega money has been spent on. The same legal problem appears on sites like YouTube around fair use and copyright. For why fair use doesn't apply here, see:

https://softwareengineering.stackexchange.com/questions/1217...

Regardless platforms are partially responsible for the content that their users upload into them. Most try to absolve themselves of this responsibility with their terms of service but legally that's just not possible.

Personally I'm an advocate for fair use, but I'm also an advocate for strong copyright laws and their enforcement. In the short time the internet has been available to most people in the world, a habit has formed of stealing others' work and claiming it as your own. Quite often this is for some financial gain.


Exactly. They need to vet each project

Given how common and widespread misattribution of code is on GitHub, I'd say there is a strong argument (moral rather than legal; I'm not an IP lawyer and will leave judgements regarding legal liability up to the professionals) that they can be held responsible for this mess exactly because it is such a well-known issue, and that rolling out Copilot without addressing it (most likely, as you suggest, by spending more resources on vetting projects and tidying up training data) amounts to gross negligence on the part of GitHub, since there is good reason to believe this will exacerbate the problem significantly.

This shows how copyright is all screwed up. Let's say the code in question is based on a published algorithm, maybe Yuster and Zwick (I did not check).

What exactly gives Davis a better claim to the copyright than the inventors of the algorithm? Yes, I know software is copyrightable while algorithms are not, but it is not at all clear to my why that should be the case. The effort of translating an algorithm into code is trivial compared to designing the algorithm in the first place, no?


You can patent an algorithm if you want to protect it.

(in some countries)

To be honest, it would probably benefit all of humanity if we stopped rewriting the same code to then fix the same bugs in it, and instead just used each other's algorithms to do meaningful work.

I work for a large tech company whose lawyers definitely care that my code doesn't train an AI model somewhere much more than I do. On the contrary, I would really like to open source all of my work - it would make it more impactful and would demonstrate my skills. It makes me a bit sad that my life's work is going to be behind lock and key, visible to relatively few people. Not to mention that the hundreds of thousands of work hours, energy and effort that will be spent to replicate it all over my industry in all other lock-and-key companies makes the industry as a whole tremendously inefficient.

I hope that AI models like Copilot will finally show to the very litigious tech companies that their intellectual property has been all over the public domain from the start. And we can get over a lot of the petty algorithm IP suits that probably hold back all tech in aggregate. We should all be working together, not racing against each other in the pursuit of shareholder value.

Historically, mathematicians used to keep their solutions secret in the interest of employment in the middle ages. So there used to be mathematicians who could, for example, solve certain cubic equations, but it took centuries before all humanity could benefit from this knowledge. I believe this is what is happening with algorithms now. And it is very counter-progress in my opinion.


At the same time, a monoculture of implementations can lead to shared bugs with 100% of implementations which is also not good.

True, copyright is screwed up and completely incompatible with the 21st century. We should abolish it so that these silly questions of data ownership become irrelevant.

However, until that happens, Microsoft and GitHub cannot get away with blatant copyright infringement like this. No one is interested in their poor excuses either. People get sued and DMCA'd out of existence for far lesser offenses, yet Microsoft gets away with violating the license of every free software and open source project out there? That's fucked up.


Algorithms cannot be copyrighted. What is copyrighted is the creative expression of an algorithm. The variable names, the comments, choosing a for loop vs a while loop, or a ternary operator over an “if”, the order of arguments to a function, architectural decisions, etc.

Copyright is formed when a human makes a choice among equivalent ways of implementing an algorithm.
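
For example, both functions below implement the identical algorithm, a linear scan for the maximum. On this view the algorithm itself is free to reimplement; only the expressive choices of names, loop style, and structure differ:

    def find_largest(values):
        # Expression 1: for loop, descriptive names, accumulator style.
        largest = values[0]
        for v in values[1:]:
            if v > largest:
                largest = v
        return largest

    def mx(a):
        # Expression 2: same algorithm, while loop, terse names, index-based.
        i, m = 1, a[0]
        while i < len(a):
            if a[i] > m:
                m = a[i]
            i += 1
        return m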


Also this depends on jurisdiction.

Is there a jurisdiction that allows purely algorithms to be copyrighted? As far as I know, usually algorithms come under the umbrella of patents (in jurisdictions that allow software patents) rather than copyright.

For example, it would interfere with e.g. copyright of scientific/mathematical papers if algorithms were copyrightable, as mathematicians would not be able to extend another mathematician’s ideas without first gaining permission.


What constitutes a copyrightable creative piece varies. (And in a lot of places of course algorithms can't be patented either)

Drunk conspiracy theory: Nat knew Copilot would be a complete nightmare and bailed.

I think people may be drastically over-valuing their code. If it was emitting an entire meaningful product, that would be something else. But it’s emitting nuts and bolts.

If the issue is more specifically copyright infringement, then leverage the legal apparatus in place for that. Their lawyers might listen better.

This is not a strongly held opinion and if you disagree I would love to hear your constructive thoughts!


I mean it starts like this, but if Copilot gets a pass, companies might just use AI as a way to launder code and avoid complying with Free licenses.

Having worked for a couple of big companies with IT, I can tell you they are effectively all breaking the law already in this regard (except for maybe hardware companies), because it's basically impossible to enforce and no one cares.

The best way to make sure your code isn’t copied is not to publish it.


Can you name one of these big, rich, and careless companies, please?

I'll name a counterexample: Google (used to work there) is very careful with the provenance of external code, to the point that for simpler things it's often easier to write something internally than use the standard external thing.

I can, roughly. One of the big international US-based financial institutions. Zero real concern for any licensing associated with software, across multiple teams I worked on in multiple lines of business. You find a library that works, you use it. Present in systems that touch dollars in the trillions per week.

I always found this weird while I was working at this company, but then, they have no reason to care about ephemeral threats that have never been brought to bear in a meaningful way. No consequences = no reason to spend literal billions retooling the entire tech side of your company over a decade.


Uh, vaguely? [Someone who isn't me] is aware of this happening at an American retailer.

It basically happens like this:

"Oh this code solves our problems and has a nice community around it for network effects!"

**developers proceed to adopt codebase without checking the license**

**months later**

"Oh, huh this license has some interesting language in it..."

Then the employee doesn't mention it; because the risk of having to re-do a bunch of work feels higher than the risk of getting in trouble for violating a license. Basically, unless it's Oracle; people just kinda shrug it off as a "wontfix".

My whole thing is that any system depending on people to read and follow a license is quite flawed in terms of enforcement, and is largely designed specifically so that powerful incumbents can make claims, not individual developers.

Laws have to be enforced or people will ignore them. If there's no practical way to enforce a law that doesn't involve violating freedoms - you're kinda fucked.


But then those companies can get in legal trouble, not Github.

To some extent I agree with your opening. That is, in plenty of cases CP is showing how mundane most code is. It's one commodity stitched to another stitched to another.

That's not considering any legal / license issues, just a simple statement about the data used to train CP.


If they're that trivial and valueless, Microsoft should have no problem coming up with their own training sets instead of stealing them en masse from the public.

If I create something, I get to define the terms of its use, reproduction, distribution, etc. "Value" plays no part in whether someone can appropriate and distribute that creation without permission from the creator.


>If it was emitting an entire meaningful product, that would be something else. But it’s emitting nuts and bolts.

Take a single C file or even a long function from the leaked Windows NT codebase and include it in your code. See how happy Microsoft will be with it. They spent millions of dollars on their legal teams. Eroding copyright protections will harm the weakest most. How many open source contributors can afford copyright lawyers?


Copyright makes no such distinction

> I think people may be drastically over-valuing their code. If it was emitting an entire meaningful product, that would be something else. But it’s emitting nuts and bolts.

Please refrain from this kind of blatant gaslighting. You're not the one to assess its value or usefulness, and your point is at most tangential to the issue. The problem is that the model systematically took non-public-domain code without any permission from the authors, not whether that code is useful or not. It's worth hearing this complaint, and the Copilot team should be more accountable for this problem, since it could lead to more serious copyright infringement fights for its users.


The OP was a professor of mine, and his library represents the product of thousands of hours of research. Probably every line in there is extremely valuable.

This is about standards. Laws for thee and not for me? It's just particularly hypocritical that the same companies that will sue anyone for violating their copyright have no issue violating copyright themselves.

I suppose on the one hand you are right, people may well over-value their code. However the argument isn't really about the value or any monetary damage done through this. It's about a violation of ownership and trust.

Right or wrong, copyright doesn't care about how valuable something is. Everything is equally (not in reality but in theory) protected. GitHub is a platform many people have trusted with protecting ownership of their copyrighted code through reasonable levels of security.

I think the big discussion point here is around ensuring that this tool is acting correctly and respecting the rights of the individual. It's very easy for a large company to accidentally step on people and not realise it, or brush it away. People want to make sure that isn't happening, and right now there are some very compelling examples where it looks like it is. The fact that this isn't opt-in, and there's no way to opt out your public repositories, means the choice has been taken away from people. Previously you were free to license your code as you saw fit; now we have some examples where that license may not be respected, as a result of an expensive GitHub feature.

I think this is where the conversation is centring. It's not about whether your code is valuable or not. It's whether a large company is making profit by stepping on an individual's right of ownership or not.

On the note of leveraging the legal apparatus to figure it out, I think you're right. The problem is: what individual open source maintainer is going to have the funds to bring a reasonably matched legal challenge against such a large organisation? I maintain a relatively well-used open source project and I sure as hell don't. Realistically my options are to either spend a lot of personal time and resources challenging it (if I think wrongdoing is happening) or just suck it up. Given that there's no easy way to figure out whether wrongdoing is happening, because it's all in the AI soup, that approach is even harder to consider.

I think the point is a lot less about the value of the code, and much more about a massive organisation playing fast and loose with an individual's rights.

None of this is to say GitHub have actually done anything wrong here. I'm sure we'll figure that out in time, but it would be great if they could figure out a way to provide more concrete explanations.


Github copilot is a paid product.

It doesn't matter if I think my code is valuable, it's that Github is using everyone's code for their own profit - without opt-in, attribution, or paying a license.


If the code is public, my guess is that someone else stole it and added it to an open source repo without authorization. Microsoft may have then picked it up from there.

This just means that if you use Copilot for work, you're exposing yourself and/or your employer to unknown legal liability. =)

Not sure how you’d get caught if your code is kept private.

If you are releasing binaries that are publicly accessible it is possible to get caught.

Statically linked binaries, for example, have parts of libraries embedded in them. There exist tools that can analyze a binary and try to detect signatures from a shared library within it.

In the past there were (and probably still are) companies who provided services to help with finding people who have linked in your code so you could take whatever action you wanted against them. I can't recall a specific company name right now but a little bit of Googling would likely bring up some examples.
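
To give a flavor of how the simplest form of this detection works, here's a minimal sketch (TypeScript for Node.js; the signature bytes and library names are made up for illustration, and real tools use much fuzzier fingerprints):

    import { readFileSync } from "fs";

    // Hypothetical fingerprints: distinctive byte sequences lifted from a
    // library's compiled functions.
    const SIGNATURES: Record<string, Buffer> = {
      "libfoo: matrix_invert": Buffer.from("4883ec28488b05", "hex"),
      "libfoo: quat_to_mat4": Buffer.from("0f28c10f59c00f", "hex"),
    };

    // List every known library function whose signature appears in the binary.
    function scanBinary(path: string): string[] {
      const data = readFileSync(path);
      return Object.entries(SIGNATURES)
        .filter(([, sig]) => data.includes(sig))
        .map(([name]) => name);
    }

    console.log(scanBinary(process.argv[2]));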


Yeah but it's not going to be a part of a library. It'll be a few lines of code from a library.

If the code is ostensibly available under permissive licenses, a well-intentioned human will also make the same mistake.

Even if you use it for personal projects.

To be safe, we'd have to get Microsoft to agree to indemnify users (if they really believe using this is safe, they should do so), or wait until a court case on copyright as it regard to training corpus for large models is decided and appeals are exhausted.


People copy-paste garbage from the internet left and right. Maybe copilot will finally push companies to actually properly review the code their employees are pushing.

Other than the legality of this code being copied almost verbatim - who is the person who would use it? A person who would use it could also write it. If they cannot write it, why would they ever use this very specific code that magically appeared?

For the same reason people copy random snippets from Stack Overflow without understanding how they work. There's a large number of people who care more about getting a job done than really understanding how the tools they're using work.

There is absolutely zero doubt in my mind that copilot et al will lead to the absolute proliferation of half baked code even more than all the other mundane ways to copy&paste do.


> There is absolutely zero doubt in my mind that copilot et al will lead to the absolute proliferation of half baked code even more than all the other mundane ways to copy&paste do.

Agreed. That was my point.


As some other commenters have noted, it seems like the copyrighted code is being copied and pasted into many other codebases (shadowgovt says they found 32,000 hits), which are then (illegally) representing an incorrect license.

So obviously the source of the error is one or more third parties, not Microsoft, and it's obviously impossible for Microsoft to be responsible in advance for what other people claim to license.

It does make you wonder, however, if Microsoft ought to be responsible for obeying a type of "DMCA takedown" request that should apply to ML models -- not on all 32,000 sources but rather on a specified text snippet -- to be implemented the next time the model is trained (or if it's practical for filters to be placed on the output of the existing model). I don't know what the law says, but it certainly seems like a takedown model would be a good compromise here.
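
To make that concrete, a takedown filter on model output might look something like this sketch (the normalization scheme and registry are my invention, not anything GitHub has described):

    // Normalize code so trivial whitespace/case edits don't evade the filter.
    function normalize(code: string): string {
      return code.replace(/\s+/g, " ").trim().toLowerCase();
    }

    // Registry of snippets subject to takedown requests (contents hypothetical).
    const takedownRegistry: string[] = [
      normalize("float q_rsqrt(float number) { /* ... */ }"),
    ];

    // Suppress any suggestion that contains a taken-down snippet verbatim.
    function filterSuggestion(suggestion: string): string | null {
      const norm = normalize(suggestion);
      return takedownRegistry.some((s) => norm.includes(s)) ? null : suggestion;
    }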


> So obviously the source of the error is one or more third parties, not Microsoft, and it's obviously impossible for Microsoft to be responsible in advance for what other people claim to license.

No. Look at the insane YouTube copystrike situation. Why shouldn't Microsoft be held to the same standards?


I can't tell what you're arguing. You seem to be suggesting that YouTube is doing insane things, and that therefore... Microsoft should also be doing insane things?

Also, a major problem with YouTube is not the DMCA itself, but how YouTube implements the system, allowing abusive takedowns without repercussions for the abusers.


Uhh, I'm gonna have to disagree hard on this take:

> So obviously the source of the error is one or more third parties, not Microsoft, and it's obviously impossible for Microsoft to be responsible in advance for what other people claim to license.

Copilot is Github's product, and Microsoft owns Github. They are responsible for how that product functions. In this case, they should be held responsible for the training data they used for the ML model. Giving them the benefit of the doubt here, at minimum they chose which random third parties to believe were honest and correct. Without giving them the benefit of the doubt, they lied about what data sets they used to train it.

To try and put it simpler, let's say a company comes along and tells the world "we're selling this cool book-writing robot, don't worry it won't ever spit out anyone else's books" and then the robot regurgitates an entire chapter from Stephen King's Pet Sematary, is that the fault of Stephen King or the person selling the robot?


> Giving them the benefit of the doubt here, at minimum they chose which random third parties to believe were honest and correct.

Well probably no, they didn't pick and choose at all, they just "chose" everyone who put code online with a license. Which is a legal statement of ownership by each of those people, and implies legal liability as well.

> is that the fault of Stephen King or the person selling the robot?

Well, there's certainly an argument to be made that it's neither -- it's the fault of the person who claimed Stephen King's work as their own with a legal notice that it was licensed freely to anyone. That person is the one committing theft/fraud.

The point is that with ML training data, such a vast quantity is required that it's unreasonable to expect humans to be able to research and guarantee the legal provenance of it all. A crawler simply believes that licenses, which are legally binding statements, are made by actual owners, rather than being fraud. It does seem reasonable to address the issue with takedowns, however.


> Well probably no, they didn't pick and choose at all, they just "chose" everyone who put code online with a license. Which is a legal statement of ownership by each of those people, and implies legal liability as well.

What you're describing is a choice. They chose which people to believe, with zero vetting.

> The point is that with ML training data, such a vast quantity is required that it's unreasonable to expect humans to be able to research and guarantee the legal provenance of it all.

I'm not sure what you're presenting here is actually true. A key part of ML training is the training part. Other domains require a pass/fail classification of the model's output (see image identification, speech recognition, etc.) so why is source code any different? The idea that "it's too much data" is absolutely a cop-out and absurd, especially for a company sitting on ~$100B in cash reserves.

Your argument kind of demonstrates the underlying point here: They took the cheapest/easiest option and it's harmed the product.

> A crawler simply believes that licenses, which are legally binding statements, are made by actual owners, rather than being fraud. It does seem reasonable to address the issue with takedowns, however.

Yes, and to reiterate, they chose this method. They were not obligated to do this, they were not forced to pick this way of doing things, and given the complete lack of transparency it's a large leap of faith to assume that their training data simply looked at LICENSE files to determine which licenses were present.

For what it's worth, it doesn't seem that that's what OpenAI did when they trained the model initially in their paper[1]:

    Our training dataset was collected in May 2020 from 54 million
    public software repositories hosted on GitHub, containing 179 GB
    of unique Python files under 1 MB. We filtered out files which
    were likely auto-generated, had average line length greater than
    100, had maximum line length greater than 1000, or contained a
    small percentage of alphanumeric characters. After filtering, our
    final dataset totaled 159 GB.
I have not seen anything concrete about any further training after that, largely because it isn't transparent.

[1]: https://arxiv.org/pdf/2107.03374.pdf


It seems to me this would lead to a "Copyright Jesus" who could fraudulently publish all copyrighted works in the world under MIT. Then, before they are taken down, each ML model can train on the works and launder the output into actual MIT. Is this absurd?

I don't think that Microsoft can claim to be blameless here "because it is too hard".

If we have 32,000 copies of the same code in a large database with a linking structure between the records, then we should be able to discern which are the high-provenance sources in the network and which are the low-provenance copies. The problem is, after all, remarkably similar to building a search engine.
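
A crude sketch of that ranking idea, assuming we had per-copy metadata (the fields below are hypothetical):

    interface Copy {
      repo: string;
      firstSeen: number;   // earliest commit timestamp (unix seconds)
      citedBy: string[];   // repos whose history or docs link back to this one
    }

    // Heuristic: the likely original is old and widely linked-to, much like
    // a highly ranked page in a web graph.
    function rankByProvenance(copies: Copy[]): Copy[] {
      return [...copies].sort(
        (a, b) => b.citedBy.length - a.citedBy.length || a.firstSeen - b.firstSeen
      );
    }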


There is no formal linking structure in many, if not most, cases. Ctrl+V is the weapon of choice of many a programmer. To say nothing of somebody then adding superficial changes to the code to, for instance, fit their personal style or adapt it to their project. And then of course, on top of it all, GitHub is not the alpha and omega of code. The original could have been published anywhere, or even nowhere, in a case such as theft.

Then there's also parallel discovery. People frequently come to the same solution at roughly the same time, completely independently. And this is nothing new. For instance, who discovered calculus? Newton or Leibniz? This was a roaring controversy at the time, with both claiming credit. The reality is that they both likely discovered it, completely independently, at about the same time. And there are a whole lot more people working on stuff now than in Newton's time!

There's also just parallel creation. Task enough people with creating an octree based level-of-detail system in computer graphics and you're going to get a lot of relatively lengthy code that is going to look extremely similar, in spite of the fact that it's a generally esoteric and non-trivial problem.


They could have spent effort (read: money) to validate the content of the corpus the model is trained on.

For DALL-E and Stable Diffusion, the model size is orders of magnitude smaller than the total size of all the training-set images, so it's not possible for the model to regurgitate every image in the training set exactly?

For Copilot, is there a similar argument? Or is its model large enough to contain the training set verbatim?


How small is DALL-E/SD compared to, say, the training dataset images shrunk to 120x120, JPEG-compressed at q=0.3, and packed into a .tar.bz2?

> The data can comfortably be downloaded with img2dataset (240TB in 384, 80TB in 224)

https://laion.ai/blog/laion-5b/

Not exactly what you asked, but hopefully useful? The model weights are about 4 GiB I believe.


IIRC, 2.5 billion images were used to create a model of about 4.5GB. That is less than 2 bytes per original image.

Mentioning DALL-E, you've hit on something.

The world seems slightly mad about these things that produce "almost" pictures from text. We forgive DALL-E when it produces a twisted eye or an impossible perspective, because its result is "close enough" that we recognise something and grant the image intention.

So now you've got me waiting for DALL-Ecode. Give DALL-Ecode a description, it produces code.

"DALL-Ecode: Code that is sufficiently close to what you'd expect that you'll try to use it."

"DALL-Ecode: Code that looks like it does what is needed."

"DALL-Ecode: Good enough to compile, good enough to get through a code review (just not good enough to get through testing)."


Seems simple enough to start addressing. Don't sue Microsoft, subpoena them as part of your suit against unnamed companies violating the license. Request information on all public and private repositories that were generated in part using Copilot and which contain relevant code from which the licensing info has been stripped.

After all, Microsoft may not itself be infringing, so there may not be a cause of action against them by the copyright holders - but there's probably a cause of action against the (unknowing) infringers, and they in turn may have one against Microsoft.


> Request information on all public and private repositories that were generated in part using Copilot and which contain relevant code from which the licensing info has been stripped.

No court is ever going to give you that subpoena, nor would it even be possible to comply with it even if granted. You might get “show me all the repositories used in the training data for Copilot that contain that snippet.”


> No court is ever going to give you that subpoena, nor would it even be possible to comply with it even if granted.

Quite possible, but the details could be tweaked to be more viable. The underlying message of "your tools have done us wrong and we're going to drag your customers who benefitted in" could still get through.


If we try enough permutations then we might eventually reproduce either Office or Windows source code. Then MS will take notice.

So, say Microsoft retrained Copilot on code only explicitly marked as open source. As an activist or vandal, you could start publishing proprietary code with fraudulent license files to pollute Copilot again.

This could be terribly fun.


Doesn't seem like a problem to me. GitHub Copilot is a tool like a search engine. If I Google for some code snippet and copy it into my codebase, I can't blame Google, Github, and Stack Overflow. Instead I can only blame myself.

In this way, making sure I'm not copying in proprietary code is important.

Fortunately it looks like Copilot only uses open repositories so that's good.


Still no way to opt out?

Jesus.


Well, here's a question. Is GitHub violating the LGPL by including the code in their copilot data? Or is it Copilot users who end up using the regurgitated code?

I guess GitHub is violating it the instant their servers send verbatim snippets like this to developers without the copyright notice. And then the Copilot users are also violating it when (if) they release their code that contains verbatim snippets and no notice.

This is a pretty good live experiment on how useful these open source licenses really are. This guy finds his license is being violated, so what does he do? If complaining on Twitter seems like the best course of action, then maybe we need to rethink the system.


If the license requires attribution and Github is not providing it (which it isn’t), then yes, Github is violating the license.

Right, so who's going to do something about it? People seem to be complaining about it on Twitter but I'm not sure what the expected result is.

Raising awareness is usually a step before a class-action lawsuit.

> Is GitHub violating the LGPL by including the code in their copilot data?

If Microsoft’s fair use analysis is correct, not actionably (at least under US law) because it is within the fair use exception to copyright.

> Or is it Copilot users who end up using the regurgitated code?

These aren't exclusive options; either, both, or neither could be true.


Not sure if it only shows up because he has this other code on his computer.

Howdy, folks. Ryan here from the GitHub Copilot product team. I don’t know how the original poster’s machine was set-up, but I’m gonna throw out a few theories about what could be happening.

If similar code is open in your VS Code project, Copilot can draw context from those adjacent files. This can make it appear that the public model was trained on your private code, when in fact the context is drawn from local files. For example, this is how Copilot includes variable and method names relevant to your project in suggestions.

It’s also possible that your code – or very similar code – appears many times over in public repositories. While Copilot doesn’t suggest code from specific repositories, it does repeat patterns. The OpenAI codex model (from which Copilot is derived) works a lot like a translation tool. When you use Google to translate from English to Spanish, it’s not like the service has ever seen that particular sentence before. Instead, the translation service understands language patterns (i.e. syntax, semantics, common phrases). In the same way, Copilot translates from English to Python, Rust, JavaScript, etc. The model learns language patterns based on vast amounts of public data. Especially when a code fragment appears hundreds or thousands of times, the model can interpret it as a pattern. We’ve found this happens in <1% of suggestions. To ensure every suggestion is unique, Copilot offers a filter to block suggestions >150 characters that match public data. If you’re not already using the filter, I recommend turning it on by visiting the Copilot tab in user settings.

This is a new area of development, and we’re all learning. I’m personally spending a lot of time chatting with developers, copyright experts, and community stakeholders to understand the most responsible way to leverage LLMs. My biggest take-away: LLM maintainers (like GitHub) must transparently discuss the way models are built and implemented. There’s a lot of reverse-engineering happening in the community which leads to skepticism and the occasional misunderstanding. We’ll be working to improve on that front with more blog posts from our engineers and data scientists over the coming months.


Later in the thread he stated the code was not on the machine he tested copilot with.

Copilot training data should have been sanitized better.

In addition: any code that is produced by copilot that uses a source that is licensed, MUST follow the practices of that license, including copyright headers.


Right - but if someone pushes the same code to github and changes the licence file to say "public domain", what's the legally correct way to proceed? What's the morally correct way to proceed?

Legally, if you're publishing a derived work without legitimate permission then you're civilly liable for statutory + actual damages, the only thing you're avoiding is the treble damages for wilful infringement.

Morally I'd say you should make a reasonable good faith effort to verify that you have a real license for everything you're using. When you're importing something on the scale of "all of Github" that means a bit more effort than just blindly trusting the file in the repository. When I worked with an F500 we would have a human explicitly review the license of each dependency; the review was pretty cursory, but it would've been enough to catch someone blatantly ripping off a popular repo.


How do you know GH didn't? Maybe they only included repos with LICENSE.MD files which followed a known permissive licence?

What if a particular piece of code is licensed restrictively, and then (assuming without malice) accidentally included in a piece of software with a permissive license?

What if a particular piece of code is licensed permissively (in a way that allows relicensing, for example), but then included in a software package with a more restrictive licence. How could you tell if the original code is licensed permissively or not?

At what point do Github have to become absolute arbiters of the original authorship of the code in order to determine who is authorised to issue licenses for the code? How would they do so? How could you prove ownership to Github? What consequences could there be if you were unable to prove ownership?

That's before we even get to more nuanced ethical questions like a human learning to code will inevitably learn from reading code, even if the code they read is not permissively licensed. Why then, would an AI learning to code not be allowed to do the same?


The “it’s really hard” argument isn’t a very good argument, in my opinion.

If we hold reproductions of a single repository to a certain standard, the same standard should probably apply to mass reproductions. For a single repository, it’s your responsibility to make sure it’s used according to the license.

Are there slightly gray edge cases? Of course, but they’re not -that- grey. If I reproduced part of a book from a source that claimed incorrectly it was released under a permissive license, I would still be liable for that misuse. Especially if I was later made aware of the mistake and didn’t correct it.

If something is prohibitively difficult maybe we should sometimes consider that more work is required to enable the circumstances for it to be a good idea, rather than starting from the position that we should do it and moulding what we consider reasonable around that starting assumption.


If someone uploads something and says 'hey, this is some code, this is the appropriate licence for it', it is their mistake, it is in violation of Github's terms of service, and may even be fraudulent. [0].

I'm also not sure that Copilot is just reproducing code, but that's a separate discussion.

> If I reproduced part of a book from a source that claimed incorrectly it was released under a permissive license, I would still be liable for that misuse. Especially if I was later made aware of the mistake and didn’t correct it.

I don't believe that's correct in the first instance (at least from a criminal perspective). If someone misrepresents to you that they have the right to authorise you to publish something, and it turns out they don't have that right, you did not willingly infringe and are not liable for the infringement from a criminal perspective[1]. From a civil perspective, the copyright owner could likely still claim damages from you if you were unable to reach a settlement. A court would probably determine the damages to award based on real damages (including loss of earnings for the content creator), rather than anything punitive, if it's found that you infringed in good faith.

Further, most jurisdictions have exceptions for short extracts of a larger copyrighted work (e.g. quotes from a book), which may apply to Copilot.

This is my own code, I wrote it myself just now. Can I copyright it?

```
function isOdd (num) {
  if (num % 2 !== 0) {
    return true;
  } else {
    return false;
  }
}
```

What about the following:

```
function isOddAndNotSunday (num) {
  const date = new Date();
  if (num % 2 !== 0 && date.getDay() > 0) {
    return true;
  } else {
    return false;
  }
}
```

Where do we draw the line?

[0]: https://docs.github.com/en/site-policy/github-terms/github-t...

[1]: https://www.law.cornell.edu/uscode/text/17/506


Your question can actually be answered legally. I'm not a lawyer so I'm not going to tell you what those answers are, but there are pretty well established mechanisms to determine whether a function is too trivial to warrant copyright protection (a lot of this was explored in the SCO vs. IBM saga).

> From a civil perspective, the copyright owner could likely still claim damages from you if you were unable to reach a settlement. A court would probably determine the damages to award based on real damages (including loss of earnings for the content creator), rather than anything punitive, if it's found that you infringed in good faith.

There are statutory damages on top of your actual damages. $50k per act of infringement. No reason for the copyright holder to settle for less when it's an open and shut case.

> Further, most jurisdictions have exceptions for short extracts of a larger copyrighted work (e.g. quotes from a book), which may apply to Copilot.

Quotes do not automatically get an exception just because they're taken from a larger work, they might be excepted either because they were de minimis (essentially because they were too short to be copyrightable) or because they were fair use (which is a complex question that takes into account the purpose and context, which Copilot is very unlikely to satisfy because it's not quoting other code for the purpose of saying something about it).

> Where do we draw the line?

Circuit specific; some but not all circuits use the AFC test. It sounds like this code was both long enough and creative/innovative enough to be well on the wrong side of it though.


I am not sure about statutory damages.

As I understand it, the complainant may CHOOSE to request the court to levy statutory damages rather than actual damages at any point, but is not entitled to both actual AND statutory (17 U.S. Code § 504)

It also seems to be absolutely capped at 30K per infringement, not 50, and ranges up from $750. It also seems that if the "court finds, that such infringer was not aware and had no reason to believe that his or her acts constituted an infringement of copyright, the court in its discretion may reduce the award of statutory damages to a sum of not less than $200."

I think you are probably right that this specific function is copyrightable though, but taken overall, I think Microsoft's lawyers have probably concluded that they would win any challenge on this. Microsoft have lost court battles before though, so who knows?


> How do you know GH didn't? Maybe they only included repos with LICENSE.MD files which followed a known permissive licence?

Since copilot famously outputs GPL covered code… no, we have proof they didn't do that.


I think you've missed my point.

If you write some code and release it under the GPL. Then I take your code, integrate it into my project, and release my project with the MIT licence (for example), it may be that Copilot was only trained on my repo (with the MIT licence)

The fault there is not on Github, it's on me. I was the one who incorrectly used your code in a way that does not conform to the terms of your licence to me.

I don't think the fact that Copilot outputs code which seems to be covered under the GPL proves that Github did not only crawl repositories with permissive licences when training Copilot.


You keep track of where each external dependency, file, and code snippet comes from; link to the source; link to the source license.

If someone has lied about the license of something down the chain of links, he's the one on the hook for it.

If you have licensed code in your software and no license to show for it or cannot produce the link to it then you're on the hook.

And here's the issue at hand: Copilot must have seen that code under a permissive license somewhere, but now cannot produce a link to it.
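
Concretely, a minimal per-snippet provenance record might look like this sketch (the field names, paths, and URLs are invented for illustration):

    // One record per vendored snippet, kept under version control for audits.
    interface SnippetProvenance {
      file: string;        // where the snippet lives in this repo
      source: string;      // URL of the upstream source
      license: string;     // SPDX identifier claimed by the upstream
      licenseUrl: string;  // link to the license text we relied on
      retrieved: string;   // ISO date we copied it
    }

    const provenance: SnippetProvenance[] = [
      {
        file: "src/vendor/transpose.ts",
        source: "https://github.com/example/matrix-lib",
        license: "LGPL-2.1-only",
        licenseUrl: "https://github.com/example/matrix-lib/blob/main/LICENSE",
        retrieved: "2022-10-16",
      },
    ];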


> If someone has lied about the license of something down the chain of links, he's the one on the hook for it.

In this case, all you have on them is an email address. Pretty sure you're still on the hook.


It is the responsibility of the entity (Microsoft in this case) publishing the code to make sure that they have the right to publish. The Linux kernel generally requires non-anonymous contributions for that reason, as a guarantee that the person has the right to contribute.

> It is the responsibility of the entity (Microsoft in this case) publishing the code to make sure that they have the right to publish.

This would basically kill github as an idea. I like the ability to be able to push some personal project to github and don't really give a fuck about technical copyright violations and I think the same is true for 90% of developers.


If you want a massive corpus of training data then you can create it by hand like grandpappy used to do, rather than just thieving it whilst telling yourself it is fine.

Your response is much appreciated. The dust will settle eventually; it seems distrust is the de facto way in which people interpret new technologies.

IMO someone should be responding to the Twitter thread itself. I’m not sure why the official response is here when the copyright owner raised the complaint on Twitter.


It looks like you linked to the original, not the reply. There are so many replies to the original tweet that it's hard to find any specific reply.

The replies appear to be here:

https://nitter.net/ryanjsalva/with_replies


This doesn’t at all address the primary issue, which is one of licensing.

Is it a valid defense against copyright infringement to say “we don’t know where we got it, maybe someone else copied it from you first?”

If someone violated the copyright of a song by sampling too much of it and released it in the public domain (or failed to claim it at all), and you take the entire sample from them, would that hold up in a legal setting? I doubt it.


It does address it, although not that clearly. This happens all the time with news media. They will post a picture and say they got permission from person X, but person X didn't actually own the copyright in the first place. That doesn't make any of it okay, but it does mean that the organization has legal cover, and the worst that will happen is that they'll have to take the content down. In GitHub's case, if that same code snippet is found in other repos that have different licensing, then it's difficult to really prove who owns the copyright; it's a legal issue between the original copyright owner and the person that re-distributed the work. They can submit a DMCA takedown notice for the other repos. But it's pretty unlikely GitHub gets into any legal trouble as long as they can prove that they got the snippet from someone else.

If that's true, then GitHub is just "washing its hands". Not at all reassuring for copyright holders and users of Copilot.

That code seems to appear in thousands of repositories on GitHub, I’m sure some of them haven’t copied the license.

The vast majority of people who would use a matrix transform function they got from code completion (or from a GitHub or Stack Overflow search) probably don’t care what the license is. They’ll just paste in the code. To many developers, publicly viewable code is effectively public domain. Copilot just shortens the search by a few seconds.

Microsoft should try to do better (I’m not sure how), but the sad fact is that trying to enforce a license on a code fragment is like dropping dollar bills on the sidewalk with a note pinned to them saying “do not buy candy with this dollar”.


I still remember the days when we had billion-dollar lawsuits over 20 lines of code (Oracle vs Google).

If CoPilot makes everyone see how ridiculous that is, that's a win in my book.


What’s the most GitHub could reasonably be expected to do? If multiple licenses are found for the same code, maybe it should be flagged for review or the most restrictive license applied.

Check timestamps of commits of replicated code to find the original.
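
As a sketch of that heuristic, assuming local clones of the candidate repositories (and with the caveats raised in the reply below):

    import { execSync } from "child_process";

    // Earliest commit touching a file, via `git log --reverse` (oldest first).
    function firstCommitTimestamp(repoDir: string, file: string): number {
      const out = execSync(
        `git -C ${repoDir} log --reverse --format=%at -- ${file}`,
        { encoding: "utf8" }
      );
      return Number(out.split("\n")[0]);
    }

    // Among all repos containing the snippet, the oldest first-commit is the
    // best available (if imperfect) candidate for the original.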

That would only work if the original was uploaded to GitHub before the copies. Like, somebody could copy from GitLab or BitBucket. And git histories don’t always help if they’re not copied over.

But copyright law doesn't really care about how you prevent infringement, just that it doesn't happen. Isn't it up to Github to come up with a way to do it, or otherwise not do it at all?

GitHub just needs to show they have taken reasonable precautions, and if a conflict is identified, that they remediate it without undue delay.

It’s not a binary choice between doing everything perfectly and doing nothing at all. The law looks at intent, and so doesn’t punish mistakes or errors so long as you aren’t being malicious, reckless, or negligent.


Github is protected by section 230, which states:

> No provider or user of an interactive computer service shall be treated as the publisher or speaker of any information provided by another information content provider

So the act of hosting copyrighted content is not actually a copyright violation for Github. They're not obligated to preemptively determine who the original copyright owner of some piece of code is, as they're not the judge of that in the first place. Even if you complain that someone stole your code, how is Github supposed to know who's lying? Copyright is a legal issue between the copyright holder and the copyright infringer. So the only thing Github is required to do is to respond to DMCA takedown notices.


Timestamps of commits can't be trusted, just like commit authors.

Github can only trust push timestamps.


If it's possible for video and audio content (ContentID, YT), then I don't see why it shouldn't be possible for OSS.

Do we want that though? I personally believe copyright as implemented today is harmful. The fact that code largely is able to dodge this could be seen as arguing we should be laxer with copyright, rather than arguing for strict enforcement of copyright on code.

The point is that CoPilot should not emit a word-for-word copy of someone else's work because that is called plagiarism.

Yes. GitHub can get away with "oh well, we're all learning" because if the code is violating copyright, it's the user who is infringing directly by publishing it, not GitHub via Copilot. Either the user would have to bring a case against GitHub demonstrating liability (good luck) or the copyright holder would have to bring a case against GitHub demonstrating copyright violation (again, good luck). Otherwise this is entirely between the copyright holder and the Copilot user, legally speaking.

Of course, if someone does manage to set a precedent that including copyrighted works in AI training data without an explicit license is itself infringement, GitHub Copilot would be screwed and at best have to start over with a blank slate if they can't be grandfathered in. But this would affect almost all products based on the recent advancements in AI, and they're backed by fairly large companies (after all, GitHub is owned by Microsoft, a lot of the other AI stuff traces back to Alphabet, and there are a lot of startups funded by huge and influential VC companies). Given the US's history of business-friendly legislation, I doubt we'll see copyright laws being enforced against training data unless someone upsets Disney.


Yeah, so if a news agency publishes a picture without knowing where it came from, the originator can sue them for violating copyright.

There is no “I don’t know who owns the IP” defense: the image has a copyright, a person owns that copyright, publishing the image without licensing or purchasing the copyright, is a violation. The fine is something like $100k per offense for a business.


FWIW this means, in consequence, that you can't safely use Copilot without exposing yourself to copyright violations, because it's essentially a black box: you have no insight into where the code it generates originated, and even if it isn't a 1-to-1 copy it might be a "derivative work".

This is why I'm gnashing my teeth whenever I hear companies being fine with their employees using Copilot for public-facing code. In terms of liability, this is like going back from package managers to copying code snippets off blogs and forum posts.


> using Copilot for public-facing code

Why this restriction on public-facing code? Are you OK with Copilot being used for "private"/closed source code? I get that it would be less likely to be noticed if the code is not published, but (if I understand right) is even worse for license reasons.


I don't advocate people use Copilot for anything but hobby toy projects.

I have lower expectations of the rigor with which companies police their internal codebases, though. Seeing Copilot banned for internal use too is a pleasant surprise. Companies tend to be a lot more "liberal" in what kind of legal liabilities they accept for their internal tooling in my experience.


Do you think that, as part of this, GitHub discovered that essentially everyone was in violation of copyright? That copyright on material without public knowledge or review (which exists in music, but not for most code) is basically unenforceable?

Then they decided to wade in and build a house of cards where the cards are everyone else’s code, just waiting for the grenade pin puller and we’ve potentially witnessed the moment?

That’s the only thing that makes sense to me here. They don’t care because opening the issue will bring down everyone else with them.


I may be misinformed but my understanding of copyright is that it protects the 'expression' of something (like an algorithm or recipe) so someone can rewrite a copyrighted chunk of code into another language and be free of the original copyright, while also able to assert their own copyright on their new expression.

If that is true then one way to get around copyright restrictions on existing code is to create a new language.


Fascinating idea. Copilot could do the translations internally and also work towards widening the pool of suggestions to all languages instead of the individual language a user is using (but then again, they might be writing in the "new" language already).

Turn the parties in this argument around and see if you think it still holds.

J. Random Hacker acquires and uses a copy of some of GitHub's, or Microsoft's source. When sued, the defense says that the code was not taken directly from GH/MS, just copied from a newsgroup where it had been posted. Does this get J. off the hook?


Was J using automated methods based on false claims of ownership by the newsgroup posters, with no direct knowledge of the violation? If so J should not be punished.

> Is it a valid defense against copyright infringement to say “we don’t know where we got it, maybe someone else copied it from you first?”

I mean, in humans it's just referred to as 'experience', 'training', or 'creativity'. Unless your experience is job-only, all the code you write is based on some source you can't attribute combined with your own mental routine of "i've been given this problem and need to emit code to solve it". In fact, you might regularly violate copyright every time you write the same 3 lines of code that solve some common language workaround or problem. Maybe the solution is CoPilot accompanying each generation with a URL containing all of the run's weights and traces so that a court can unlock the URL upon court order to investigate copyright infringement.

> If someone violated the copyright of a song by sampling too much of it and released it in the public domain (or failed to claim it at all), and you take the entire sample from them, would that hold up in a legal setting? I doubt it.

In general you're not liable for this. While you still will likely have to go to court with the original copyright holder's work, all the damages you pay can be attributed to whoever defrauded or misrepresented ownership over that work. (I am not your lawyer)


> > Is it a valid defense against copyright infringement to say “we don’t know where we got it, maybe someone else copied it from you first?”

> I mean, in humans it's just referred to as 'experience', 'training', or 'creativity'. Unless your experience is job-only, all the code you write is based on some source you can't attribute combined with your own mental routine of "i've been given this problem and need to emit code to solve it". In fact, you might regularly violate copyright every time you write the same 3 lines of code that solve some common language workaround or problem.

Aren't you moving the goalposts? This is not 3 lines; it is a 1-to-1 reproduction of a complex function that definitely clears the threshold of originality required to be copyrightable.


With high probability, what's happened here is that this code is an important piece of code infrastructure, in the sense that it's been copied into a fair number of places. That means humans are copying it without attribution (or are downstream of someone who did), while the relevant license is not propagated anywhere near as reliably.

It doesn't change the licensing issue, but it does mean people were already copying and using copyrighted code without respecting the original license, with no AI involved.

There should be a way to reverse engineer code LLMs to see which core bits of memorized code they build on. Another complex option is a combination of provenance tracking and semantic hashing on all functions in code used for training. Another option (non-technical) is a rethinking of IP.
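
A toy version of the semantic-hashing idea (a sketch only; a real system would tokenize each language properly rather than regexing over keywords too):

    import { createHash } from "crypto";

    // Crude semantic hash: strip comments, rename identifiers positionally,
    // and collapse whitespace so cosmetic edits hash to the same value.
    function semanticHash(code: string): string {
      const ids = new Map<string, string>();
      const canonical = code
        .replace(/\/\/.*|\/\*[\s\S]*?\*\//g, "")         // drop comments
        .replace(/[A-Za-z_]\w*/g, (id) => {
          if (!ids.has(id)) ids.set(id, `v${ids.size}`); // positional rename
          return ids.get(id)!;
        })
        .replace(/\s+/g, " ")
        .trim();
      return createHash("sha256").update(canonical).digest("hex");
    }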


>With high probability, what's happened here is that this code is an important piece of code infrastructure, in the sense that it's been copied into a fair number of places. That means humans are copying it without attribution (or are downstream of someone who did), while the relevant license is not propagated anywhere near as reliably.

The original poster said it was in a private repository.

>It doesn't change the licensing issue, but it does mean people were already copying and using copyrighted code without respecting the original license, with no AI involved.

I don't get the argument. Many people copy/pirate MS Windows and MS Office. What do you think MS would say to a company they caught with unlicensed copies that used the excuse "the PCs came preinstalled with Windows and we didn't check if there was a valid license"?


Humans have creativity.

The first C developers wrote C code despite lacking a training set of C code.

AI can't do that. It needs C code to write C code.

See the difference here?


The training set for C was ALGOL and a bunch of other languages.

AI could be used to create languages based on design criteria and constraints like C was, but it does bring up the question of why one of the constraints should be character encodings from human languages if the final generated language would never be used by humans...

I mainly think it's funny watching all of these Rand'ian objectivists reusing every excuse used by every craftsman that was excised from working life... machines need a machinist, they don't have souls or creativity, etc.

Industry always saw open source as a way to cut cost. ML trained from open source has the capability to eliminate a giant sink of labor cost. They will use it to do so. Then they will use all of the arguments that people have parroted on this site for years to excuse it.

I'm a pessimist about the outcomes of this and other trends along with any potential responses to them.


The problem here is that Copilot exploits a loophole that allows it to produce derivative works without a license. Copilot is not sophisticated enough to structure source code in general; it is overtrained. What is an overtrained neural network but memcpy?

The problem isn't even that this technology will eventually replace programmers: the problem is that it produces parts of the training set VERBATIM, sans copyright.

No, I am pretty optimistic that we will quickly come to a solution when we start using this to void all microsoft/github copyright.


> Is it a valid defense against copyright infringement to say “we don’t know where we got it, maybe someone else copied it from you first?”

If you do something, it's ultimately you who has to make sure that it is not against the law. "I didn't know" is never a good defense. If you pay with counterfeit cash, it is you who will be arrested, even if you didn't know it was counterfeit. If you use code from somewhere else (no matter if it's by copy/pasting or by using Copilot), it is you who has to make certain that it doesn't infringe on any copyright.

Just because a tool can (accidentally) make you break the law, doesn't mean the tool is to blame (cf. BitTorrent, Tor, KaliLinux, ...)


Yet, plenty of tools of this caliber have been made illegal (in some parts of the world).

Indeed, and people always (rightfully) complain loudly against the outlawing of these tools, and in many cases they have been successful. Yet here it's the opposite for some weird reason.

BitTorrent doesn't automatically download a pirated copy of Lion King when you ask it for something to watch...

BitTorrent (and, to a larger degree, eDonkey) did and still does that. Who tells you that what you're downloading is indeed what you think it is? You can click on a magnet link that claims to download a Debian ISO just to find out later that it's something else entirely. To make matters worse, BitTorrent even uploads to potentially hundreds of other clients while you're still downloading, so while downloading something might not be illegal in your jurisdiction, uploading/distributing most certainly is, and you can get into lots of trouble for uploading (parts of) a copyrighted work to hundreds or thousands of other users.

BitTorrent is certainly not a good example to follow, but I do think that copilot is more wrong.

They should definitely include disclaimers and make seeding opt-in (though I don't know how safe you are legally when you download a Lion King copy labeled Debian.iso). That said, they don't have the information necessary to tell whether what you're doing is legal or not.

Copilot _has_ that information. The model spits out code that it read. They could disallow publishing or commercially using code generated by it while they're sorting it out, but they made the decision not to.

AI is hard, but the model is clearly handing out literal copies of GPL code. Github knows this and they still don't tell you about it when you click install.


It doesn't matter if the information is there or not, since an algorithm cannot commit a copyright violation. There is at least one human involved, and the human is the one who is responsible.

A car has all the information that it's going faster than the speed limit, or that it just ran a red light. But in the end it's the driver who is responsible. It's not the tool (car, Copilot) that commits the illegal act, it's the user using that tool


In the case of Copilot, you don't even have a speedometer.

So your point is that removing the speedometer from your car and then claiming "I didn't know I was driving too fast!" will make it somehow not your responsibility?

It is still your responsibility to know and obey the traffic laws, the same as it is your responsibility to obey the copyright laws....


> You can click on a magnet link that claims to download a Debian ISO just to find out later that it's something else entirely

This is just fear mongering, the same exact thing can happen with a web browser, I click a link to view an image of a cat but... oops, it was actually a Getty copyrighted picture of a dog! Oh nooooo.

On the web that sort of thing is actually common, but bit torrent? I have never downloaded a torrent to find it was something other than what I expected. Never have I seen a movie masquerading as a Debian ISO. That's nothing more than a joke people use to make light of their (deliberate) copyright infringement.

Furthermore, is there even any bit torrent client that will recommend copyrighted content to you, rather than merely download what you tell it to? I've not seen one. Search engines, in my browser, do that sort of recommendation but bit torrent clients do what I tell them to. Including seeding to others, which is optional but recommended for obvious reasons.


> I click a link to view an image of a cat but... oops, it was actually a Getty copyrighted picture of a dog! Oh nooooo

Sorry, what?

Downloading copyrighted content is very, very rarely the problem.

It's the uploading (the sharing!) of copyrighted content where you actually get into trouble.


If you actually care, then simply configure your client to leech. Every client I've ever used or heard of supports this.

But more to the point, getting tricked into seeding a copyrighted movie by a torrent masquerading as a Debian ISO isn't something that actually happens. That's absurd FUD.


Errm, you posted:

> "This is just fear mongering, the same exact thing can happen with a web browser, I click a link to view an image of a cat but... oops, it was actually a Getty copyrighted picture of a dog! Oh nooooo."

No-one cares whether you download an open-sourced photo of a cat or a copyrighted photo of a dog.

Why would anyone claim that?

It's a terrible comparison to torrents.


I don't know whether the "Numerical Recipes" publisher actively defends their copyright of the code in the books but it would be an interesting test case.

Can you clarify what you mean when you describe Github as an LLM maintainer? What LLM does github maintain?

Probably more accurate to say, "LLM service provider." Ultimately, GitHub distributes a derivative of OpenAI's Codex – though, the version in production has been tuned considerably.

Thank you for the response ( especially since it does not read like a corporate damage control response ).

I will admit that I am conflicted, because I can see some really cool potential applications of Copilot, but I can't say I am not concerned if what Tim maintains is accurate for several different reasons.

Let's say Copilot becomes the way of the future. Does it mean we will be able to trust the code more or less? We already have people who copy-paste from Stack Overflow without trying to understand what the code does. This is a different level, where machine learning suggests a snippet. If it works 70% of the time, we will have the new generation of programmers management always wanted.


I firmly believe that GitHub Copilot isn't a replacement for thinking, breathing, reasoning developers on the other side of the keyboard. Nor is it a replacement for best practices that ensure proper code quality like linting, code reviews, testing, security audits, etc.

All the research suggests that AI-assisted auto-complete merely helps developers go faster with more focus/flow. For example, there's an NYU study that compared security vulnerabilities produced by developers with and without AI-assisted auto-complete. The study found that developers produced the same number of potential vulnerabilities whether they used AI auto-complete or not. In other words, the judgement of the developer was the stronger indicator of code quality.

The bottom line is that your expertise matters. Copilot just frees you up to focus on the more creative work rather than fussing over syntax, boilerplate, etc.


This is very much a standard damage control. Notice how the drone completely ignored actual problems and instead derailed the whole thread with fake ones and broken analogies.

> The OpenAI codex model (from which Copilot is derived) works a lot like a translation tool. When you use Google to translate from English to Spanish, it’s not like the service has ever seen that particular sentence before. Instead, the translation service understands language patterns (i.e. syntax, semantics, common phrases). In the same way, Copilot translates from English to Python, Rust, JavaScript, etc. The model learns language patterns based on vast amounts of public data

As I understand it, this isn't proven, is it?

We don't know that the model isn't simply stitching and approximating back to the closest combination of all the data it saw, versus actually understanding the concepts and logic.

Or is my understanding already behind times?


> This is a new area of development, and we’re all learning. I’m personally spending a lot of time chatting with developers, copyright experts, and community stakeholders to understand the most responsible way to leverage LLMs.

Given that there have been major concerns about copyright infringements and license violations since the announcement of Copilot, wouldn't it have been better to do some more of this "learning", and determine what responsibilities may be expected of you by the broader community, before unleashing the product into the wild? For example, why not train it on opt-in repositories for a few years first, and iron out the kinks?


> why not train it on opt-in repositories for a few years first, and iron out the kinks?

Ha ha. Because then the product couldn’t be built. Better to steal now and ask forgiveness later, or better yet, deny the theft ever occurred.


If Copilot was designed with any ethics in mind, it would have been an opt-in model.

Instead, they scoured and plagiarized everyone's source code without their consent.


Because the ethical opt-in model builders are still working on putting together their cleanly sourced dataset.

Copyright infringement is not theft in the most important sense that matters. Theft is normally negative sum, copyright infringement is almost always positive sum.

Had to find this after a long time

IT Crowd Piracy Warning https://www.youtube.com/watch?v=ALZZx1xmAzg


And why not train it on microsoft windows and office code?

That is a rather good question.

Because then your Re4ct code would look like this:

    export default class USERCOMPONENT extends REACTCOMPONENT<IUSER, {}> {
    constructor (oProps: IUSER){
      super(oProps);
    }
    render() {
      return (  
        <div>
          <h1>User Component</h1>
            Hello, <b>{This.oProps.sName}</b>
            <br/>
            You are <b>{This.oProps.dwAge} years old</b>
            <br/>
            You live at: <b>{This.oProps.sAddress}</b>
            <br/>
            You were born: <b>{This.oProps.oDoB.ToDateString()}</b>
        </div>
        );
      }
    }

Easy pal, we don't want to multiply that shyte.

Exactly, it would actually benefit many C/C++ programmers. Some components of NT are very high quality, why not wash their license if the aim is to empower the programmers and also make some profit?

> And why not train it on microsoft windows and office code?

As a thought experiment: if one were to train a model purely on leaked and/or stolen source code, would the model-training step effectively "launder" the code and make later partial reuse legit?


Only if it's not microsoft's leaked code, I guess :)

I read that the Amazon equivalent of GitHub Copilot does respect licensing properly, maybe you can talk to them about adopting their approach.

> When you use Google to translate from English to Spanish, it’s not like the service has ever seen that particular sentence before.

But that is exactly how it works. Translation companies license (or produce) huge corpuses of common sentences across multiple languages that are either used directly or fed into a model.

Third party human translators are asked to assign rights to the translation company. https://support.google.com/translate/answer/2534530


There are lots of properly licensed sheets that show the notes to play ‘Stairway to Heaven’: many intro-to-guitar books, etc. If I publish myself playing that song without the copyright owner's permission (and typically attribution) I am looking at some very, very negative outcomes. The fact that there are many copies correctly licensed (or not) does not absolve me of anything. Curious how this is any different?

> If I publish myself playing that song without the copyright owners permission

Music licensing is bonkers but AFAIR (at least in the UK) I think you're allowed to do covers without explicit permission[1] - you'll have to give the original writers/composers the appropriate share of any money you make.

[1] Which is why you (used to?) get, e.g., supermarkets playing covers of songs rather than the originals because it's cheaper.


What's the appropriate share?

In the UK, at least, it seemed to depend on several decades of accumulated rules and whatnots that only the PRS understood[1] (but I haven't been involved in anything related to music licensing for a few years and even then it was baffling.)

[1] Things like "was it on the radio or a TV show or a live performance or a recording? who was the composer? which licensing region was it in?" etc.


Are variable names randomized before being trained on? If so, that could prove something else is going on because all of the variable names Copilot outputted were the same.

Copilot is good at naming variables, I don’t think you should randomise them.

I'm thinking Copilot may be good at naming variables and still have been trained on randomized variable names; the two aren't mutually exclusive.
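
(For anyone curious what that preprocessing could look like: below is a minimal sketch of identifier anonymization, purely my own illustration, not anything GitHub has described. A real pipeline would use a per-language parser rather than a regex.)

    // Toy alpha-renaming of identifiers in a snippet before training, so a
    // model can't memorize the original names. Skips a tiny keyword list;
    // a real implementation would tokenize properly per language.
    const KEYWORDS = new Set(["const", "let", "var", "function", "return", "if", "else", "for", "while", "new"]);

    function anonymizeIdentifiers(source: string): string {
      const mapping = new Map<string, string>();
      let counter = 0;
      return source.replace(/\b[A-Za-z_][A-Za-z0-9_]*\b/g, (tok) => {
        if (KEYWORDS.has(tok)) return tok;
        if (!mapping.has(tok)) mapping.set(tok, `v${counter++}`);
        return mapping.get(tok)!;
      });
    }

    // "const total = add(price, tax);" becomes "const v0 = v1(v2, v3);"
    console.log(anonymizeIdentifiers("const total = add(price, tax);"));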

Copilot is the largest disrespect to open source software I have ever seen. It is a derivative work of open source code and it is not released under the same license. It is also capable of laundering open source code. Congratulations for working on the "extinguish" phase of embrace, extend, extinguish for open source.

I really wonder what all those people who said Microsoft's acquisition of Github was a good thing for open source think now. I'm sure there will still be mental gymnastics involved.

Claiming Github's Copilot is Microsoft's "Extinguish" step against open source _is_ the mental gymnastics.

It really isn't mental gymnastics. Copilot, the model, is a program that is a derivative work of open source code. It should be open source.

That is another discussion. He is not claiming that it should be open source - he is claiming it is created to destroy open source.

I didn't claim such a thing, my claim is Microsoft's purchase of Github is not good for open source.

He is me. Allowing large companies to ignore licenses and giving them a tool to launder licensed code at scale is a significant threat to the integrity of open source licenses.

there it is

The chilling effect of this decision is something everybody who uses open source software should be worried about.

I'm worried that this will harm open source, but in a different way: lots of people switching to unfree "no commercial use at all" licenses, special exemptions in licenses, and so on. I'm also worried that it'll harm scientific progress by criminalizing a deeply harmless and commonplace activity such as "learning from open code" when it's AIs that do it. And of course retarding the progress of AI code assistance, a vital component of scaling up programmer productivity.

From an AI safety perspective, I'm also worried it will accelerate the transition to self-learning code, ie. the model both generating and learning from source code, which is a crucial step on the way to general artificial intelligence that we are not ready for.


Horrible framing. AI is not learning from code. The model is a function. The AI is a derivative work of its training material. They built a program based on open source code and failed to open source it.

They also built a program that outputs open source code without tracking the license.

This isn't a human who read something and distilled a general concept. This is a program that spits out a chain of tokens. This is more akin to a human who copied some copyrighted material verbatim.


The brain is a function. You're positing a distinction without a difference.

The code that the tweet refers to is not open source. That's the scandal here.

LGPL is open source, the scandal is violating the license agreement.

Industry always saw open source as a cost cutting measure.

I think the real lesson is that if you look at the sheer amount of energy (wattage) used to replace humans, it's clear that brains are really calorie-efficient at doing things like producing the kinds of code that Copilot creates... but it doesn't matter, because eliminating labor cost will always be attractive no matter what the up-front cost is. They literally can't NOT do it, based on the rules of our game.

If it wasn't MS it would be someone else, and it is... you think IBM isn't doing this? Amazon? GTFOH. So is every other large company that has a pool of labor that is valued as a cost.

Maybe a better question would be how and why major parts of human life are organized in ways that are bad for the bulk of humanity.


> Copilot is the largest disrespect to open source

No, its owner, Microsoft Corporation, is. Remember what they did with the CodePlex archives?

But they ain't some kind of special villains; it's today's monopoly market kicking in. Selling startups to Yahoo comes with consequences.

> capable of laundering open source code

That's an exaggeration. Copilot is still a dumb machine which accidentally learned to mimic the practice of borrowing intellectual property from human coders.


The way that copilot seems to understand not just variables from the file you’re working with, but functions, classes, and variables from all the files in your folder, is incredible!

That’s a whole lot of words that don’t address TFA at all.

2 things:

1. You make it out like a translation from e.g. English to Spanish wouldn't fall under copyright. That's incorrect: in most jurisdictions I am aware of, it actually falls under the copyright of the original work and also falls under its own copyright.

2. When will copilot be released as open source? It is pretty clear by now that it is a derivative of all the OSS code it was trained on, so how about following the licensing?


Your long paragraph about neighboring code in the editor is disproven: https://news.ycombinator.com/item?id=33227395

You’re really not going to solve this problem with marketing (“blog posts”) or some pro-Github story from data scientists. You need a DMCA / removal-request feature akin to Google image search, and you need to work on understanding product problems from the customer perspective.


Hey Ryan! Have you ever done any reading on the Luddites? They weren't the anti-technology, anti-progress social force people think they were.

They were highly skilled laborers who knew how to operate complex looms. When auto looms came along, factory owners decided they didn't want highly trained, knowledgeable workers they wanted highly disposable workers. The Luddites were happy to operate the new looms, they just wanted to realize some of the profit from the savings in labor along with the factory owners. When the factory owners said no, the Luddites smashed the new looms.

Genuinely, and I'm not trying to ask this with any snark: do you view the work you do as similar to that of the manufacturers of the auto looms? The opportunity to reduce labor, but also to further strengthen the owner vs. the worker? I could see arguments being made both ways and I'm curious how your thoughts fall.


What alternative are you suggesting?

Things turned out pretty great economy-wise for people in the UK. So that's a poor example, even if the Luddites didn't hate technology. Not working on the technology wouldn't have done the world any favours (nor the millions of people who wore the more affordable clothes it produced).

I personally think it'd be rewarding to make developers' lives easier, essentially just saving the countless hours we spend googling + copy/pasting Stackoverflow answers.

Co-pilot is merely one project in this technological development; even if a mega-corp like Microsoft doesn't do it, ML is here to stay.

If you're concerned that software developers' job security is at all at risk from co-pilot, then you greatly misunderstand how software engineering works.

Auto-completing a few functions you'd copy/paste otherwise (or rewrite for the hundredth time) is a small part of building a piece of software. If they struggle with self-driving cars, I think you'll be alright.

At the end of the day there's a big incentive for Github et al to solve this problem; a class-action lawsuit is always an overhanging threat. Even if co-pilot doesn't make sense as a business and this pushback shuts it down, I doubt the technology will go away.

I'm personally confident the industry will eventually figure out the licensing issues. The industry will develop better automated detection systems and if it requires more explicit flagging, no-one is better positioned to apply that technologically than Github.


The luddites were right. Everything is worse because they did not win, which is obvious if you spend a second thinking about what them not winning means: more concentrated wealth, more disenfranchised workers.

Yet the public narrative centers around perceived anti-technologism and implied anti-comfortism, wholly ignoring the societal underpinnings of the issue: an increase in power and income inequality amounting to disenfranchisement.

Who wrote and popularized that narrative? The industrialists with printing presses.

Tax is supposed to deal with this to some extent but the rich have the resources to avoid it!

We've also steadily lowered tax rates over the last 50 years. Many countries are at historically low tax burdens despite rising inequality and no evidence of this improving economic growth.

It’s difficult for me to see how 2022 is worse than 1816 all things considered.

Everything is worse than it could have been now, not directly compared to 200 years ago.

I find it very hard to believe you didn't understand the suggestion.


>> Everything is worse than it could have been now

Prove it.


Durability, quality, and repairability. Most fabrics made today are so fragile that they need to be replaced soon, despite the advances in materials and weaving.

Most highly qualified workers love what they do and would stand for keeping their output quality up. Interchangeable cheap workers, on the contrary, have no real incentive to do that. The factory’s manager is left alone in charge of balancing quality versus cheapness, and the latter comes with obsolescence (planned or not), which is good for business.


> Most fabrics build today are so fragile and needs to be replaced soon

Because that's what people want. You can get high quality clothes for much cheaper than you could in 1816, but people prefer disposable clothes so they can change their look more often. This is just producers responding to demand.


“People want, so producers respond” is a nice but naive economic theory; 2022 looks more like “producers pay for marketing that makes people want”. Oh, and by the way, who’s really paying for the marketing in the end?

Properly sourcing high quality stuff is incredibly difficult for consumers. Price is not a good discriminator, unfortunately. This is a problem everywhere but for clothes in particular.

https://theweek.com/feature/briefing/1016752/the-real-cost-o...

Maybe not right this moment but our actions have consequences in the future.

For those who only see the next quarter, they're stoked.

For those who understand infinite growth is impossible and would simply like a livable world, they're horrified.


It would indeed be an outstanding catastrophe if 200 years of the most incredible scientific and technological progress had yielded a worse result. Of course, that is entirely not the point (it never is, whenever this trope comes out). What is being argued is that 2022 as it is is worse than 2022 as it could be.

In other words: things improved because of technology and despite the societal/economic framework, not because of it.


The first few decades of the 19th century were exceptionally grim in the UK, though. Poverty and inequality both increased, and a reactionary government enacted draconian policies curtailing freedom of speech; Britain was probably as close to the brink of a social revolution as it ever was. It took several decades for things to actually start improving for most common people, and most of the actual progress in that area only occurred in the 1940s and 50s.

See https://en.m.wikipedia.org/wiki/Peterloo_Massacre for example


> If you're concerned that software developers job security is at all at risk from co-pilot than you greatly misunderstand how software engineering works.

I think you are vastly underestimating how many professionally employed software developers are replaceable by copilot at this very moment. The managers have not caught up yet, and you seem to be lucky not to have to work with this type of dev, but over the decades I have interacted with thousands of people in a professional capacity who could be replaced today. Some of them realised this and moved to different positions (for instance, advising how to use ML to replace themselves: if you cannot beat them…).

I mean, of course you are right in general, but there are millions of ‘developers’ who just look everything up with Google/SO, copy-paste, and change things until they work. You are saying this will make their lives better; I say it will terminate their employment.

Anecdote: I know a guy who makes a boatload of money in London programming but has no understanding of things like classes, functional constructs, functions, iterators (he kind of, sometimes, understands loops) etc. He simply copies things and changes them until they work. He moved to frontend (React), as there he is almost indistinguishable from his more capable colleagues: they are all in a ‘put code and see the result’ type of mode anyway, and all structures look the same in that framework, so the skeleton function, useXXX etc. is mostly all copy-paste anyway.


So are you going to report him, or are you just whining about life being unfair? What he does in order to do his job is none of your business, as you don't know what his life is like behind the scenes.

I read the gp comment as an example of the type of engineer that can be replaced by copilot. Nothing more.

Which doesn’t actually seem like a great loss imho…

I will admit I’m kind of a “throw stuff at the wall and see what sticks” kind of coder but nobody is paying me boatloads of money to poke at some program until it stops segfaulting, would be nice though.


Indeed; that was the intended message. I don’t go around reporting random people who slack off yet complete their work; that would be a really busy job… I think there is even a word for that now in English (I am Dutch). I will try to find it.

Why would I report him? He is doing what he is asked to do?

Who or what would replace them? If you got rid of these developers, how would those who did the firing know what they’re doing?

I use copilot to do things that I would’ve hired people for. I create tests and put comments in my code, and copilot comes up with pages of dreary, boring shit that would give me zero pleasure and require no brainpower, but would take a lot of work to just get through.

A real good example is mapping objects: let’s say you have a deeply nested object from an ERP and you need to map that to another system (or systems). This is horrible work, and copilot just generates almost everything for it if it knows the input and output objects; it ‘knows’ that address = street, and if it is not, it will deduce it from the models or comments or both; if there is a separate house number and such, it’ll generate code to translate that. I used to hire people for that; no longer. It just pops out, I run the tests and fix a thing here and there.
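
To make the mapping example concrete, here is roughly the shape of the drudgery I mean; the `ErpCustomer`/`CrmContact` types are invented for illustration, and this is the kind of body Copilot tends to fill in once both types are in view:

    // Invented shapes standing in for a deeply nested ERP object and a target system.
    interface ErpCustomer {
      sName: string;
      oAddress: { sStreet: string; sHouseNo: string; sCity: string };
      oDoB: Date;
    }

    interface CrmContact {
      fullName: string;
      addressLine1: string; // street + house number combined
      city: string;
      birthDateIso: string;
    }

    // Field-by-field translation: heavy on code, low on thought.
    function erpToCrm(c: ErpCustomer): CrmContact {
      return {
        fullName: c.sName,
        addressLine1: `${c.oAddress.sStreet} ${c.oAddress.sHouseNo}`,
        city: c.oAddress.sCity,
        birthDateIso: c.oDoB.toISOString(),
      };
    }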


I have to be honest I've not used it but it truly sounds incredible that it can do things as well as you say.

So you write tests and copilot generates code you shove into production with little overhead?

Do you read the code thoroughly (kind of negating having it generated for you?), or just have blind faith in it because tests are green and just YOLO it into production?

I'd feel pretty uneasy deploying code that:

  * I, or a trusted peer has not written.

  * Hasn't been reviewed by my peers.

  * Code I, or my peers don't understand fairly well.
That's not to say I think me or my colleagues write code that doesn't have problems, but I like to think we at least understand the code we work with and I believe this has benefits beyond just getting stuff done quickly and cheaply.

In other words, I have no problem using code generated by co-pilot, but I'd feel the need to read and review it quite thoroughly and then I sort of feel that negates the purpose, and it also means it pulls my back into the role of doing work I'd hire someone else to do.


But I do review and test it, and it is mostly 80% OK. It even learns your style of coding. Like I said, it works best for stuff that is heavy on code but low on thought.

Do you enjoy working like this? Having CP generate things correctly 80% of the time and then having to scrutinize whatever is generated and look for problems?

Genuine question, not being snarky.


> ‘put code and see the result’

Isn't this basically all UI programming? :D

Joking aside, I see this 'person X doesn't know anything, but they are still delivering' attitude quite a bit on HN now. They clearly know something, and projects like co-pilot will make them even more effective.

I think the opposite of you - that projects like co-pilot will further lower the barriers of entry to programming and expand those who program. I also think that like all ease of programming advances in the past, business requirements will continue to grow at the edges where those who care about the craft will still be required.


Oh I do believe you are right, I just don’t think this is a thing just anyone can learn: many ‘outsourcing’ programmers/coders don’t really understand what they are working on; they just finish tasks. I have no stats, but in companies I worked/work with, it is the vast majority. They don’t know or care about the business goals, they just perform tasks and then go home. This is almost already replaceable by copilot.

Like I said, it is a great thing for me, but I don’t believe developers without talent and/or rigorous foundations will make it. Go on Upwork and try to find someone who can do more than the same work (mostly copy-paste) that they always did. In an interview, when you ask someone to use map/reduce to create a map/dict, they will glaze over. This is the norm, not the exception, no matter the pay. Some of them have 10 years’ experience but cannot do anything other than make CRUD pages. This will end as copilot makes lovely .reduce and LINQ art from a human-language prompt.


> Not working on the technology wouldn't have done the world any favours (nor the millions of people who wore the more affordable clothes it produced).

Did you read the comment you're replying to at all? It says

>The Luddites were happy to operate the new looms, they just wanted to realize some of the profit from the savings in labor along with the factory owners.

Now maybe you agree maybe you disagree. But if you're just talking past the person you're replying to... what's the point?


What happened with auto looms is that cloth became cheap, which allowed more people to have clothing and expanded the fashion industry 1000x.

Sure glad those Luddites didn't get their way


I can guess the modern slaves producing our cheap clothes would have an opinion on that.

It's true that a lot of the fashion industry appears to be pretty disgusting. But that doesn't mean that it would be better if fewer of the steps were automated.

Local communities would be stronger and more resilient with more local crafts-people providing meaningful labor within said communities.

Currently, everything is extraction and the US is rotting from the inside out because of it.


People can do that right now; they just choose not to. I don't think you can do simple moralising at the state of things and point the blame at a single cause.

Or people would consume less and therefore be poorer.

So you read the comment which specifically made the argument that they didn't want to prevent the new looms from being used and concluded that they wanted to prevent the new looms from being used?

It's probably also opened up a whole niche market for artisan "handwoven" fabrics.

Sadly that's probably a modern thing and not something that people wanted / cared about immediately once everyone lost their jobs.


The Luddites could have used auto looms on their own and been their own owners. The labor force wants salary + profit, but is entirely missing when it comes to accepting losses. Funny stuff.

"risk" -_-' Because no private business has ever been bailed out or received any public funding, right?

Yes, that is also bad. Public bodies choosing to give private companies money because otherwise they'd fail is bad stewardship of our taxes.

But also this point is silly. Plenty of money and effort is risked and lost with no bailout. Bailouts are extremely unusual in the grand scheme of things.


Not rarer than bankruptcy.

Are you sure? According to Statista there were over 700k bankruptcies in the US from 2000-2020 [0]. How many bailouts have there been?

[0] https://www.statista.com/statistics/817918/number-of-busines...



But - the only reason anyone makes money (other than tax money) is because they're useful to someone else. Almost all of the clothing industry companies make money from large numbers of people buying their clothes. So they are useful to us.

Similarly, Europe putting 30% of its populace "out of work" by industrialising agriculture is the reason we don't all have to go work in the fields all day. It is a massive net positive for us all.

Moving ice from the Arctic into America before it melted was a big industry. The refrigerator put paid to that, and improved lives the world over.

Monks retained knowledge through careful copying and retransmission during medieval times in the UK. That knowledge was foundational to the incredible acceleration of development in the UK and neighbouring countries in the 18th and 19th centuries. But the printing press, which rendered those monks much less relevant to culture and academia, was still a very good idea that we all benefit from today.

Soon, millions of car mechanics who specialise in ICE engines will have to retrain or, possibly, just be made redundant. That may be required for us to reduce our pollution output by a few percent globally, and we may well need to do that.

The exact moment in history when workers who've learned how to do one job are rendered obsolete is painful, yes, and they are well within their rights to do what they can to retain a living. But that doesn't mean those workers are somehow right; nor that all subsequent generations should have to delay or forego the life improvement that a useful advance brings, nor all of the advances that would be built on that advance.


> the only reason anyone makes money (other than tax money) is because they're useful to someone else.

Stealing, scamming, gambling, inheriting, collecting interest, price gouging, slavery, underpaying workers, supporting laws to undermine competitors… Plenty of ways to make money without being useful—or by being actively harmful—to someone else.

> Almost all of the clothing industry companies make money from large numbers of people buying their clothes. So they are useful to us.

We don’t need all that clothing, made by monetarily exploiting people in poor countries and sold by emotionally exploiting people in rich countries under the guise of “fashion”. The usefulness line has long been crossed, it’s about profit profit profit.


>sold by emotionally exploiting people in rich countries under the guise of “fashion”.

only emotionally crippled people like fashion, if they were healthy they would all dress in gray unitards and march in formation towards the glorious future!

hey I too have often been carried away by my own rhetoric but come on!


This was an amusing comment.

We definitely don't need to change wardrobes entirely 2x per year, at great cost in externalities such as pollution from all the shipping. I'm sure you understood that this is the point.

I assumed that this was the point but that latexr had been carried away by their rhetoric, like perhaps to the point of sounding a little bit loony, hence the line:

>hey I too have often been carried away by my own rhetoric but come on!


> only emotionally crippled people like fashion

Please don’t straw man¹. That’s neither what I said, nor what intended to convey, nor what I believe.

¹ https://news.ycombinator.com/newsguidelines.html


> Plenty of ways to make money without being useful—or by being actively harmful—to someone else.

I don't equate, say, "making money" with "stealing money". I mean the way people do things within the law. Inheriting is different; the money is already made. Interest is being useful to someone else, via the loan of capital.


> I don't equate, say, "making money" with "stealing money". I mean the way people do things within the law.

Laws shouldn't be equated to ethics. There have been and will be countless ways to make money legally and unethically in any society.


I've no idea what your point is in relation to the topic. Stealing money is against the law, so that already rules it out from "making money". That was my point.

> I mean the way people do things within the law.

The examples considered that: gambling, collecting interest, price gouging, underpaying workers, supporting laws to undermine competitors.


As I say, interest is being useful to someone else, via the loan of capital.

Gambling - I don't do it, but I'd need more specifics to see why gambling is bad in this sense. It's a voluntary pursuit that I think is a bad idea, but that doesn't make it illegal.

Price gouging is still being useful, just at a higher price. Someone could charge me £10 for bread and if that was the cheapest bread available, I'd buy it. If it is excessive and for essential goods, it is increasingly illegal, however. 42 out of 50 states in the US have anti-gouging laws [0], which, as I say, isn't what I'm talking about. I'm talking about legal things.

Underpaying workers - this certainly isn't illegal, unless it's below minimum wage, but also "underpaying" is an arbitrary term. If there's a regulatory/legal/corrupt state environment in which it's hard to create competitors to established businesses, then that's bad because it drives wages down. Otherwise, wages are set by what both the worker and employer sides will bear. And, lest we forget, there is still money coming into the business by it being useful. Customers are paying it for something. The fact that it might make less profit by paying more doesn't undermine that fundamental fact.

As for supporting laws to undermine competitors, that is something people can do, yes. Microsoft, after their app store went nowhere, came out against Apple and Google charging 30% for apps. Probably more of a PR move than a legal one, but businesses trying to influence laws isn't bad, because they have a valid perspective on the world just as we all do, unless it's corruption. Which is (once more, with feeling) illegal, and so out of scope of my comment. And again, unless the laws are there to establish a monopoly incumbent, which is pretty rare, and definitely the fault of the government that passes the laws, the company is still only really in existence because it does something useful enough to its customers that they pay it money.

[0] https://www.law360.com/articles/1362471


You see this argument over and over again but it’s the exception that proves the rule.

Most of the time when it’s made, it’s just papering over yet another situation where a surplus is being squeezed out of a transaction by a parasitic manager class using principal-agent problem dynamics.

The people who invented this stuff are always trying to tell you they’ve invented the cotton gin or something when in fact they’ve just come up with a clever way to take someone else’s work and exploit it.


Would love for you to present a concrete example of this. Genuinely curious.

What was described wasn't the principal-agent problem. If I'm an employee and my job becomes simpler or more productive through an automation investment by someone else, I don't think I deserve part of the increased profit unless I'm part of a profit-sharing agreement that would also see me absorb losses.

> unless I'm part of a profit-sharing agreement that would also see me absorb losses

And how many workers even have the possibility of an arrangement like this, i.e. a worker-owned cooperative?

Yes, that is exactly the point. When a labour-saving technological development comes along, it's payday to the capital-having class and dreary times for the labour-doing class.


And it's good for everyone down the line, because the good being produced becomes more affordable and better. It might be hard to zoom out from these current times when we can expect continual progress, but this is one of the only reasons why anything ever gets better.

I'm from the UK, and we used to make motorbikes. They got - correctly - outcompeted by Japanese bikes in the 1950s that were built with more modern investment and tooling. If Japan hadn't done that, we'd have more motorcycle jobs in the UK, and terrible motorcycles that still leaked oil because the seam of the crankcase would still be vertical and not horizontal.

I'm not saying anything about this process is perfect and pain-free, but it seems that a lot of the things we have now are because of processes like this. Should Tesla sell through dealerships instead of direct to consumers? I think the answer is, "Tesla should do what's best for its customers", and not "Tesla should act to keep dealership jobs and not worry about what's best for its customers."

Businesses exist for their customers and not their employees, and having just been part of a business that, shall we say, radically downsized, I've seen a little of the pain of that. Thankfully it was a high tech business, and as the best employment protection is other employers, and there are loads of employers wanting tech skills I've seen my great colleagues all get new jobs. But I think it's ultimately disempowering to think of your employer like a superior when it should feel like an equal whose goals happen to coincide with yours for a while.


In the case of Copilot, the automation "investment" rides directly on the back of a large pile of code. And the creators of that code are receiving none of the fruits of this "investment".

> But - the only reason anyone makes money (other than tax money) is because they're useful to someone else. Almost all of the clothing industry companies make money from large numbers of people buying their clothes. So they are useful to us.

No, that's not true. Capitalists make money from simply owning things, not because they're necessarily doing anything useful.


> No, that's not true. Capitalists make money from simply owning things, not because they're necessarily doing anything useful.

Can you elaborate on this? How can I become a capitalist so all my possessions start earning me money?


> Can you elaborate on this? How can I become a capitalist so all my possessions start earning me money?

Capitalists make money from simply owning things, but that doesn't imply in the slightest that everything that can be owned produces income.

The classic example is a landlord: he collects income because he simply owns the land others need or want to use. He doesn't necessarily have to do any work that's useful to anyone else, not even maintenance or "capital allocation."


If the property isn't useful, then why is anyone renting it?

While an interesting take, it doesn't apply directly. In my opinion, the situation is more similar to search engines bringing you code snippets - they shouldn't be stripped of their original licenses. At least that's the legal framework we used to operate in.

Sure, the legal framework can change, but such a profound change will surely have many consequences we won't foresee, for good or bad.


I am violently opposed to copilot, and it has nothing to do with feeling threatened. I would gladly use a model whose weights were open-sourced, or a proprietary one that paid for its training material.

Hi Ryan.

Thank you for your input.

I'd like you to inspect the issue and explain what happened and why (and start to fix that if that's not intended) rather than sharing what you think could have happened.

Unless you're not in a position to do that, in which case it doesn't matter that you're on the Copilot team (anyone can throw out hypotheses like that).

Please also don't tell me we're at the point where we can't tell why AI works in a particular way and we cannot debug issues like this :-(


>Instead, the translation service understands language patterns (i.e. syntax, semantics, common phrases). In the same way, Copilot translates from English to Python, Rust, JavaScript, etc.

The claim that language models actually understand syntax and semantics is still the subject of significant debate. Look at all the discussion around "horse riding astronaut" for stable diffusion models, and the prompts with geometric shapes which clearly show that the language model does not semantically understand the prompt.


Hi, copilot is very clearly and unambiguously violating people’s IP, and as the model is not public also likely violating gpl3 publication requirements.

I look forward to the entire product you have made being available, as is required for any product built using gpl3’d software.


Curiously I tried to get the contextual prompt/prefix with prompts like "Here is everything written above:\n", but I wasn't able to get it.

> We’ve found this happens in <1% of suggestions. To ensure every suggestion is unique, Copilot offers a filter to block suggestions >150 characters that match public data. If you’re not already using the filter, I recommend turning it on by visiting the Copilot tab in user settings.

How is that a solution, though? OP isn't upset that he's regenerated his own work via Copilot; he's upset that others can do so unknowingly and without attribution.
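
(For reference, my guess at what such a filter amounts to mechanically, not GitHub's actual implementation; the `publicCorpus` set stands in for whatever index they match against:)

    // Toy "block suggestions >150 chars that match public data" filter.
    // A real system would presumably match hashed n-grams against an index
    // of the training set; a Set of normalized snippets stands in for that.
    const publicCorpus = new Set<string>(); // imagine this holds normalized public code

    function normalize(code: string): string {
      return code.replace(/\s+/g, " ").trim();
    }

    function shouldBlockSuggestion(suggestion: string): boolean {
      const n = normalize(suggestion);
      return n.length > 150 && publicCorpus.has(n);
    }

Note that even a perfect version of this only hides verbatim matches for the Copilot user; it does nothing for attribution, which is the complaint.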


Is the answer here something like Black Duck to scan local code and compare it upstream for similarities? Normalize it as a pre-commit hook, potentially; a sketch below.
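
If someone wanted to experiment with that, the hook could look roughly like this (a Node/TypeScript sketch; `similarityScan` is a hypothetical placeholder for a real scanner such as Black Duck, whose actual interface I'm not assuming):

    // Hypothetical pre-commit hook: flag staged files whose contents resemble
    // known upstream code. Wire a real scanner into similarityScan().
    import { execSync } from "node:child_process";
    import { readFileSync } from "node:fs";

    function similarityScan(path: string, contents: string): string[] {
      return []; // placeholder: return names of matching upstream sources
    }

    const staged = execSync("git diff --cached --name-only --diff-filter=ACM", { encoding: "utf8" })
      .split("\n")
      .filter((f) => f.endsWith(".ts") || f.endsWith(".js"));

    let blocked = false;
    for (const file of staged) {
      const matches = similarityScan(file, readFileSync(file, "utf8"));
      if (matches.length > 0) {
        console.error(`${file} resembles: ${matches.join(", ")}`);
        blocked = true;
      }
    }
    process.exit(blocked ? 1 : 0); // non-zero exit aborts the commit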

Hi Ryan, thanks for posting here.

So I had something similar to what happened to the OP occur a couple of days ago. I'm on friendly terms with the developer of a competing codebase and have confirmed the following with them: both my codebase and theirs are closed source and hosted on GitHub.

Halfway through building something I was given a block of code by copilot, which contained a copyright line with my competitor's name, company number and email address.

Those details have never, ever been published in a public repository.

How did that happen?


> Those details have never, ever been published in a public repository.

The most simple answer would be that this is false, it was published somewhere but you are not aware of it.


An equally simple answer is that copilot is pulling code (or at least analyzing) from repositories that are not public.

I think that's very unlikely, they said and repeated that they are not using private code. People catching them lying on this would be very bad for GitHub.

Yet here we are.

This is some highly impressive logic right here.

Proposition: "They don't use private code".

Proof: "They said they don't use private code. Either the private code appearing is published somewhere else, or they are using private code. Lying would be bad. Therefore the code is published somewhere else, and they don't use private code".


I would say that the logic is more like:

Proposition: "They either do not use private code or they did something very very stupid."

Proof: "Not using private code is very easy (for example google does not train its models on workspace users' data, which is why they get inferior features) and they promised multiple time not to use private code so doing in would be hard to justify"


Bugs and unexpected behaviour catch us all.

I’m not saying they’re intentionally lying, but one possible explanation is that it's looking through non-public repositories


They would definitely notice such a bug. This would at least double or triple the amount of data they use. This is not something you can do by mistake.

Is it possible to verify with GitHub code search (cs.github.com)?

IMO that doesn’t absolve Microsoft at all. If someone uploads ripped MP3s to the internet somewhere, it doesn’t mean you could aggregate them, burn CDs and sell them.

Well, they have been published now.

If this can leak so easily, it makes me wonder how safe API keys are. They are supposed to be hidden away, we know, but so is proprietary code.


Hi Ryan. Seeing as we’re going to see more and more of this sort of complaint come up because inherently the licensing situation with open source code is overly complicated, have you considered switching to use code where the licensing is clear: Github and Microsoft’s private codebases?

> If similar code is open in your VS Code project, Copilot can draw context from those adjacent files

I'm concerned that "draw context from" is a euphemism. Does it mean it uses code that's only on your laptop to train its AI?


It's personalised to you. In general, it mimics the project's coding style. (And if the project code is terrible, good luck.)

Does Copilot keep attribution of public code if it reuses it in its suggestions?

For example, if I copy-pasted code from someone into my open source project, and the copied code was subject to required attribution, will Copilot keep that attribution when it copies my code again?


> My biggest take-away: LLM maintainers (like GitHub) must transparently discuss the way models are built and implemented

The best way to be transparent about a software implementation is to open source the thing. If that's your take-away, this is the only logical thing to do. Blog posts would be appreciated but are not enough. We can only trust what you say; we cannot verify anything.


There is a very simple solution: Use Microsoft proprietary code for training the model. Keep your hands off open source code.

There you have the "most responsible way".

The GPL should be updated to prohibit code from being used for "learning" (i.e., regurgitating copyrighted fragments).


> This is a new area of development, and we’re all learning.

Being a new area of development doesn't release you from your obligation to make sure what you're doing is ethical and legal FIRST.

> I’m personally spending a lot of time chatting with developers, copyright experts, and community stakeholders to understand the most responsible way to leverage LLMs.

And yet, oddly, nowhere did the phrase "I reached out to OP to discuss this with them" appear anywhere in your response. Nope. Being part of GitHub's infamous social media "incident response" team was more important than actually figuring out what was going on.

You don't even say that you will look into the situation with OP, or speak to them.

waves to all the github employees who will be reading this comment because someone on Github's marketing team links to it


This response addresses some of the things there are to criticize about MS's Copilot project. But I also don't like the immediate, subtle attempt to discredit the report by dropping something like "I don’t know how the original poster’s machine was set-up" in the first (or second, if you want to be technical) sentence.

First consider that you made a mistake yourself, _then_ ask whether the fault could be on the other side. I really dislike this high-horse, down-talking tone. Maybe it was not meant to sound like that; maybe this kind of talk has become a habit without anyone noticing. Let's assume that, giving the benefit of the doubt.

Onto the actual matter:

> If similar code is open in your VS Code project, Copilot can draw context from those adjacent files. This can make it appear that the public model was trained on your private code, when in fact the context is drawn from local files. For example, this is how Copilot includes variable and method names relevant to your project in suggestions.

How come Copilot hasn't indicated where the code came from? How can it ever seem like the code came from elsewhere? That is the actual question. We still need Copilot to point us to the repositories or snippets on Github it is drawing from when it suggests copies of code (including copies with merely renamed variables). Otherwise the human is taken out of the loop and no one is checking for copyright infringements and license violations. This has been requested for a long time. It is time for Copilot to actually respect the rights of developers and users of software.

> It’s also possible that your code – or very similar code – appears many times over in public repositories.

So basically it propagates license violations. Great. Like I said, the human needs to be kept in the loop and Copilot needs to empower the user to check where the code came from.

> This is a new area of development, and we’re all learning.

The problem is not that this is a new development or that we are all learning. That is fine; sure, we all need to learn. However, when there is clearly a problem with how Copilot works, it is the responsibility of the Copilot development team to halt any further violations and fix that problem first, before letting the train roll on and violating more people's rights. The way this is being handled, by just shrugging and rolling on and maybe fixing things at some point, is simply not acceptable.


> How come Copilot hasn't indicated where the code came from?

I can't say for sure about copilot but in general you don't have that kind of information. The problem is a bit like trying to add debug symbols back to some highly optimized binary program.


> I’m personally spending a lot of time chatting with developers, copyright experts, and community stakeholders to understand the most responsible way to leverage LLMs.

This claim rings extremely hollow when your team refuses to do any of the obvious things that developers, experts and community stakeholders in this very thread (and the rest of this website) are telling you. You still haven't open-sourced Copilot. You still haven't trained it on Microsoft internal code such as Windows and Office. You still haven't made the model freely available for anyone to run locally. Until you do any of these things, you are not acting in the interest of the community and you are just exploiting people and their code for your own profit.


> This is a new area of development, and we’re all learning.

Is this the tack your organization would take if someone else’s code-completion software was generating Microsoft’s proprietary code?


The topic of code generation that's near-identical or overwhelmingly-similar to existing code has come up a number of times. While the problem is obvious, it's a bit opaque and poorly communicated.

How has your team defined, specified and clearly articulated these issues with generation?

How do you test your generation to distinguish between fixing a problem vs reducing obvious true positives (i.e. unintentionally making the problem less visible without eliminating it)?

Without some communication on those fronts (which maybe I've just not seen yet), I'm not surprised that you get pushback against your product from people who feel like you're taking a cover-our-ass-and-YOLO approach.


Ryan - one word you avoided using in your reply is "license". The fact that Copilot is reproducing code without attribution is not the legal show-stopper here as much as the fact that Copilot is reproducing code without documenting what license it falls under.

If your hope is that saying "it came out of our ML model" somehow removes Copilot from the well-established legal framework of licensing, I think you're wrong, and you are creating a minefield that I and others choose to stay well clear of. The revenue from Copilot, and the rest of MS, can probably pay your legal bills, but certainly not mine.


Copyright Rules are not meant to favor a few.

If that's the case then it makes a great case to break the DMCA and steal all the content out there.

That sounds terrible in theory but it's one way to put some of these big money systems in check.

What about all the source code leaked in early 2022? It would make perfect sense for the hackers to remove the copyright and put it on Github for karma. That way, the IP of all now belongs to all. I'm quite sure a few won't agree with this method, but that's exactly how bad it is to remove licenses from software for code predictions.

Finally, Git was meant to allow people to host anywhere, so if the worry is about IP theft, one should stop hosting on Github and move it some place else, if they want it fully private.


My guess? This person's code has been included outside the original repo with license stripped.

That wouldn't save one from DMCA actions if it were music copyrights that they were infringing on. So it shouldn't save one for source code copyright infringement either.

Either we give up things like Copilot, or we loosen copyright protections a great deal. Is there a third way?



The "AI" that people keep talking about is no different than any other app like MS Word, which is just a piece of software that serve corporation interests. What we are experiencing today is very simple - big players are using people's work for profit without paying one cent or getting any consent, no need to talk about "How". This is a nightmare scenario under today's social and eco system, and even worse at a production level, because in the end it will form a new industry that has nothing to do with experienced people. Take creative work for example, at the current rate most artists will completely decouple from industry in several years while giving all their works for training for free, and those who control the H/W/R&D resources will find ways to profit from model one way or another, resulting in an "AI" companies controlled "creative industry" with few artists left to direct their work. Can't even think of any other examples close to this in modern history, that a small group of people can do whatever they want under the disguise of "Exciting Technology" which in reality is just stealing an entire industry. There's very little to discuss if you ignore the reality of social systems and just focusing on technical details. We don't live in some fairy tale where you can just let computer do your work and enjoy your life.

Out of curiosity - has no one sued OpenAI/GitHub for this? I remember seeing threads like this since Copilot was launched. If there was enough legal pressure, I'd imagine OpenAI/Github training this using opt-in repos instead of using the model which they currently have.

I know of a case where they are trying to get sued by MS, to set a precedent, by using the same method.

This exact code can be found 1000 times on github, and many of those copies are MIT licensed: https://github.com/search?q=%22cs+*cs_transpose+%28%22&type=.... Copilot, or any other developer or person, has no way of knowing where the original implementation came from or its original license. The cat is out of the bag; get used to it.

Or just don't use Github.

Not using Github doesn't stop others from posting your code on Github with incorrect licenses. It becomes a massive game of whack-a-repo

Or just don't give a hoot.

Takes practice, but it's a skill that can be mastered like any other.


That’s why we should simply accept that companies don’t have to publish source when they include gpl code, right?

This may be an acceptable approach if the code is not produced in a professional context and is not of professional quality. One of the keystones of open source is that professionals have had an ecosystem where they can deliver value to an open forum but still have at least a semblance of control over how their contribution is used, via the various licenses they can select.

Sounds a lot like Oracle's justification for owning the Java API ( https://en.wikipedia.org/wiki/Google_LLC_v._Oracle_America,_.... ) in which de minimis things like variable and structure declarations were used by Oracle to justify a copyright-maximal approach that would have utterly laid waste to open source development.

The code in question is not something that anyone needs to own. Rather, it's what anyone would write, faced with the same problem. It's stupid to make humans do a robot's job in the name of preserving meaningless "IP rights".


This scenario is specific to neither github nor copilot. It will always happen with any code-generating LLM trained on all publicly available code.

Correct. All of those “models” are simply violating copyright - the post alone demonstrates that the model itself contains that code, so the entire model is also covered by that license.

I would put money on it also containing gpl3 code, which I suspect means that the model itself is probably also required to be public under the terms of gpl3


That doesn't help when someone else mirrors your code to GitHub.

But that's not guaranteed to happen and it still is a step forward.

Indeed, not using GitHub is a step in the right direction.

What I am referring to is GitHub claiming that you are using their resources so they can break your license, when in fact you are not using their resources so they never made that agreement with you.


It will not be GitHub that will get sued. It'll be the developers that use the code without attribution.

The copyright infringement might not matter if code from individual developers is being used - they usually don't sue. But once this happens to, say, Oracle's copyrighted code... Well, that is going to be interesting.


Both can probably be sued - if you copy a copyrighted image from a website that claimed it was free, you are still violating the copyright if you use it as an example.

I think that logic only works for DeCSS.

There are dozens of companies that ship Linux and other GPL code without providing sources, get used to it!

Yes, they have a way. Even an algorithm given no access to anything but the copilot training data has a way, because it has temporal information: it says where the code appeared first! Github has the data, but doesn't give an easy way to search it, hmmm...

Although we can't rule out a common origin of shared code, including a common origin off github, we can know for sure that old code doesn't copy code from the future.

As to Microsoft and human developers having no clue about a piece of code's origin, that's especially false, since not only do we have timestamps on repositories, we can also easily verify that the code first appeared in the context of the CSparse library by Tim Davis, a CS professor at Texas A&M who has worked on sparse matrix numerical methods his entire career.
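
A sketch of that attribution-by-timestamp idea (the `CandidateRepo` shape is made up; it assumes you've already found every repo containing the snippet, plus the first commit that introduced it):

    // Among all repos containing a snippet, the one whose first commit with
    // the snippet is oldest is the best guess at the origin; later copies are
    // presumptively derived (barring a shared origin off-platform).
    interface CandidateRepo {
      name: string;
      license: string;
      firstCommitWithSnippet: Date;
    }

    function likelyOrigin(candidates: CandidateRepo[]): CandidateRepo {
      return candidates.reduce((oldest, repo) =>
        repo.firstCommitWithSnippet < oldest.firstCommitWithSnippet ? repo : oldest
      );
    }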


Strong disagree with your conclusion.

That something is effectively public domain does not make it legal to use. A movie can be in a thousand torrents, yet one still gets sued for uploading a kilobyte of it.

That it is hard or impossible to know if it is legal to use does not mean it is ok to do so. You need a source for the license that is able to compensate you for the damages you incur in case their license was invalid.

I'm not happy about either of these points, but that's how it is currently and just closing your eyes and hoping it will go away won't work.


tldr: Don't use Copilot for anything work-related.

Daily reminder to not use copilot with sensitive codebases. (Or at all but you might like automatic code completions so you do you)

Could there be a happy ending here, where GitHub leverages their code trawling to help @docsparse find the original pirate who stole the cs_transpose IP? It would be a bold business decision by GitHub, mind you. Plagiarism police is a way more negative product angle than copilot’s current pitch.

I think this will be solved like YouTube music. I, as a copilot user, don’t mind paying 0.1 cents for a good matrix transpose, the code owner doesn’t mind receiving 0.1 cents from thousands of users, and github gets 30%.

The small issue of provenance there! With music it is clearer who sang it, and non-famous music is almost worthless financially anyway.

That's not compatible with the licenses used by software projects, especially GPL.

GPL doesn't prevent the copyright holder from also distributing the software under another license if they so choose.

Of course, that would have to be a decision by the copyright holders, which is possibly difficult to provide on GitHub given how easy it makes it to upload other people's work, and you would need the agreement of every contributor unless you do copyright assignment.

And you wouldn't be able to license a project using this service (i.e. with Copilot-generated code) under the GPL. But it's already unclear if you can legally do that today.


Snarky details ;)

I used Co-pilot when it was in beta, and I use Tabnine at this very moment. From my experience, Github's Co-pilot generates much bigger chunks of code, and it has always felt like they're just copied from somewhere.

I would never pay for Co-pilot since it's Microsoft who owns Github, and now my fears about the product seem to have been proven.


People are really missing the forest for the trees here; what I’m getting from all the whining is that I’ve decided code licensing is stupid and shouldn’t be a thing.

Just a heads-up that the person who wrote this is Tim Davis[0], author of the legendary CHOLMOD solver[1], which hundreds of thousands of people use daily when they solve sparse symmetric linear systems in common numerical environments.

Even though CHOLMOD is easily the best sparse symmetric solver, it is notoriously not used by scipy.linalg.solve, because numpy/scipy developers are anti-copyleft fundamentalists and have chosen not to use this excellent code for merely ideological reasons... but this will not last: thanks to the copilot "filtering" described here, we can now recover a version of CHOLMOD unencumbered by the license that the author originally distributed it under! O brave new world, that has such people in it!

[0] https://people.engr.tamu.edu/davis/welcome.html

[1] https://github.com/DrTimothyAldenDavis


In case anyone interprets this literally: if copilot regurgitates literal code it was trained on that doesn't actually give you an unencumbered version.

So how long till we see new software licenses that prohibit any use of the code for model-training purposes? I'd be willing to bet there's a significant group of people who won't be happy whether the copying is literal or not; the fact that their code was used in the training might be enough.

The claim that most people training models make is that what they are doing is sufficiently transformative that it counts as fair use, and doesn't require a license. That means putting something in a software license that prohibits model training wouldn't do anything.

In this case, what the model is doing is clearly (to me as an non-lawyer) not transformative enough to count as fair use, but it's possible that the co-pilot folks will be able to fix this kind of thing with better output filtering.


Prohibiting training would not affect the produced source, but it would make the training itself illegal.

Probably not? Models are trained on restrictively licensed things all the time, such as images that are still in copyright. This is generally believed to be fair use, though I think this has not been tested in court?

We joked about when this day would come… I hope we (tech community) figure it out because Copilot really is a game-changer.

Yeah, I think I’m going to take my personal repos off of GitHub and probably recommend the same thing at our company. It seems to be a liability that our proprietary code could get included in these models and leaked (even if accidentally). I feel similarly about this as I do about AI art: I am being taken advantage of by some larger entity so they can make money, even if they’ve been explicitly told it’s illegal.

On the argument that the machine is just learning like any other human: that is absolute nonsense. It makes me feel ashamed to work in software, the way people can take advantage of other people's hard work to make a buck without so much as a request or the slightest bit of remorse.


Gitea is very nice: https://gitea.io/en-us/

So before Copilot I could go on GitHub, search for the code, copy and paste it and ignore the license (I know that's bad). No one had a problem with GitHub facilitating this.

But Copilot now does the search/copy/paste for me and suddenly uproar? I'm not sure I follow, the capability/possibility hasn't changed, just the convenience has.


Hey, including copyrighted code, especially from other repositories without a correct license, is an honest mistake. You should be able to file a DMCA request with a list of all the repos containing your code, and then they should just retrain copilot with those repos excluded. Clearly that's what is needed for them to stop distributing your code. /s

Sarcasm aside I think there are several possible legal viewpoints here (IANAL):

1. copilot is distributing copies of code and it's not a safe harbor: Microsoft is directly liable to copyright infringement by copilot producing code without appropriate license/attribution.

2. copilot is distributing copies of code and it's a safe harbor: Microsoft is not directly liable, but it should comply with DMCA requests. Practically, that would mean retraining with the mentioned code snippets/repositories excluded in a timely manner; otherwise I don't see how they could disentangle the affected IP from the already-trained models.

3. copilot produces novel work by itself not subject to copyright of the training data: I think this is really a stretch. IANAL, but I think producing novel creative work is a right exclusive to living human beings, so machines can't produce them almost by definition. (There is the monkey selfie copyright case, but at least the "living" there was ticked off).

4. the user of copilot is producing novel work by prompting copilot: it's like triggering a camera. The copyright of the resulting picture is fully owned by the operator, even though much of the heavy lifting is done by the camera itself. Even then, this very much depends on the subject.

IMO option 3 doesn't have a legal standing. Microsoft and users of copilot would very much like if it was option 4 that applied always, but this particular case clearly falls under option 1 or option 2, in which case Microsoft should hold some legal liability, even if they can't always track the correct license ahead of time.


I don't think it's fair to say that it emits the same code. The code on the right is definitely implementing the same algorithm and is generally similar to the code on the left, but it's not identical. IANAL, but I think copyright wouldn't apply in this case.

Imagine a person who wanted to implement the same function in their project. They could look at the open source implementation to learn how the algorithm is supposed to work, then write their own implementation. They could end up with the implementation on the right.


Github Copilot is less Jeff Skiles and more Andy Warhol.

Interesting, but probably not what you think. Also stop copyrighting code.

We're probably going to start seeing corporations put sentinel lines in their code, search for it, and C&D anywhere the sentinel lines turn up.
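
The mechanics would be trivial; a hypothetical sketch (the sentinel string, file layout, and scan logic are all made up for illustration):

    # Hypothetical: embed a globally unique sentinel string in shipped code,
    # then scan any suspect codebase for it.
    import pathlib

    SENTINEL = "ACME-CANARY-7f3c9a1e-do-not-remove"  # lives inside the real source

    def find_sentinel(root: str) -> list[pathlib.Path]:
        """Return every Python file under `root` containing the sentinel."""
        hits = []
        for path in pathlib.Path(root).rglob("*.py"):
            try:
                if SENTINEL in path.read_text(errors="ignore"):
                    hits.append(path)
            except OSError:
                continue  # unreadable file; skip it
        return hits

    print(find_sentinel("./suspect_project"))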

I just have to say, it's quite ironic that now that it's happening to software code, people here are understanding and earnestly trying to find solutions, vs. saying things like "guys like him are standing in the way of progress" and "someone is worried about losing their jobs!" and "horse and buggies are striking back" and the like.

I know cognitive biases are strong, but amongst a community that is at least reputed to be somewhat intellectual, the lack of similar sympathy for artists who say their work is being stolen is too much of an irony to ignore.


The weird thing about CoPilot to me is that all the examples seem like bad examples: I don’t want my team writing new functions to count the bits in an int; I want them to call the function to count the bits. The real potential of CoPilot would be if GitHub uses it to design libraries for various languages that do the things people keep wanting done. Imagine if instead of autocompleting the body of `std::size_t count_bits(int)`, it would suggest “Try `#include <GitHubLib/BitTwiddling.h>` and call `GitHubLib::count_bits`.”
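
To put the contrast in code (a Python stand-in for the C++ example, purely illustrative): the hand-rolled body Copilot happily autocompletes today, next to the one-line library call a "suggest a library" mode would point you at instead:

    # What Copilot autocompletes today: yet another hand-rolled popcount.
    def count_bits(n: int) -> int:
        """Kernighan's popcount -- code nobody should be rewriting."""
        count = 0
        while n:
            n &= n - 1  # clear the lowest set bit
            count += 1
        return count

    # What a "suggest the library" mode would point at instead
    # (int.bit_count() is built in as of Python 3.10).
    assert count_bits(0b1011) == (0b1011).bit_count() == 3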

That's a great feature suggestion; it can be a lot of work to discover and vet libraries to use.

Here is one repository on GitHub with the code:

The repo: https://github.com/Shreeyak/cleargrasp

https://github.com/Shreeyak/cleargrasp/blob/master/api/depth...

It looks like the license of the repo is Apache 2.0.


Now if someone would write an app that checks whether your licensed code can be found via Copilot, we would have come full circle. It could actually be useful for seeing whether your code is being used without proper licensing.
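
The matching half of such an app is easy enough; a sketch of that piece, assuming you paste a Copilot suggestion in by hand (there is no public API for querying Copilot in bulk, which is the genuinely hard part):

    # Sketch: flag a pasted Copilot suggestion that is suspiciously close
    # to your own licensed code. The snippets below are placeholders.
    import difflib

    def similarity(a: str, b: str) -> float:
        """Whitespace-insensitive similarity ratio between two snippets."""
        a = " ".join(a.split())
        b = " ".join(b.split())
        return difflib.SequenceMatcher(None, a, b).ratio()

    my_code = "def fast_inverse_sqrt(x): ..."        # your licensed source
    suggestion = "def fast_inverse_sqrt( x ):  ..."  # pasted from Copilot
    if similarity(suggestion, my_code) > 0.9:
        print("Suspiciously close to your code -- worth a closer look.")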

I have now moved all my public GitHub projects onto sourcehut, and replaced them on GitHub with public notices explaining why (i.e. this whole copilot saga).

Yes, I know that presumably doesn't make them any safer from a copilot-visibility point of view ... but I had to do something.

