
Because... it's private code. Can the company be 100% certain there are no passwords, DB keys, or other company secrets in it? Can they be certain there's no personal employee data? Internal product names? A hundred other similar concerns with proprietary IP? Regardless of how the LLM transforms it, the individual bits of data are still there.

On the other hand, if the repo is already public on GitHub, then exposing it via an LLM introduces no new security risk.




I've seen passwords and other sensitive config data committed to public repos on GitHub. I wouldn't be even slightly surprised by people keeping sensitive IP or trade secrets there, on the assumption that if it's a private repo, it must be safe.
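It's easy to see how these get found: even trivial pattern matching turns up committed credentials. Below is a minimal sketch of the kind of scan attackers (and tools like gitleaks, or GitHub's own secret scanning) run against public repos; the patterns are an illustrative subset, not a production rule set.

```python
import re

# A few common shapes of accidentally committed secrets.
# Illustrative only: real scanners use hundreds of provider-specific
# rules plus entropy checks.
SECRET_PATTERNS = {
    "aws_access_key": re.compile(r"AKIA[0-9A-Z]{16}"),
    "github_token": re.compile(r"ghp_[A-Za-z0-9]{36}"),
    "generic_password": re.compile(r"(?i)password\s*[=:]\s*['\"][^'\"]{6,}['\"]"),
}

def scan_text(text):
    """Return (rule_name, match) pairs for anything that looks like a secret."""
    hits = []
    for name, pattern in SECRET_PATTERNS.items():
        for match in pattern.findall(text):
            hits.append((name, match))
    return hits
```

Even this toy version catches the classic `password = "..."` line sitting in a public config file.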

Why would they? It's a public repository; there's nothing confidential or private about it.

See, the whole repo is accessible only to the members of the team who are allowed to see the secret: basically the two folks who have root on the machine anyway. There's very limited use in encrypting the repo. There are no SSL keys or other secrets that would require tight security. It's basically our New Relic key and some other API keys for reporting services. Even if that repo were breached, you could only start sending fake data to those services.

I'm more concerned about someone hacking the machine than about someone hacking GitHub to access the repo and retrieve the New Relic key from there.


I think that's very unlikely; they have said repeatedly that they are not using private code. Being caught lying about this would be very bad for GitHub.

I imagine the risk here is more about embarrassment than a business threat. GitHub's endpoints are still authenticated, and their business is still in providing a service. Propping up, understanding, and maintaining the leaked codebase to use "for free" will probably never pay dividends. Another company referencing this codebase when designing their own software would just expose them to needless risk in the future.

I believe this isn't an existential threat to the company by any means.


If the URL to a private repo is not secure, then a whole bunch of people (including GitHub) have a big problem, and exposing jobmachine is the least of their worries ;-)

Agreed; GitHub documentation refers to repo “visibility,” not “security,” and that is an intentional distinction.

When we signed on with GH as a paying customer over a decade ago, they were quite clear that private repos should not be considered secure storage for secrets. It’s not encrypted at rest, and GitHub staff have access to it. It takes only a few clicks to go from private to public.


I think there are still two concerns here even if everything is intended to eventually become open source.

One, this doesn't address exactly how the repo was compromised. Likely one of the many folks with access had their credentials compromised, but until we know, there may be risk to other projects. Two, as the article mentions, it may not have had all passwords or API keys scrubbed.


I don't understand.

If you're suggesting that IP/code sitting on GitHub servers is somehow at risk of being divulged then I will disagree.


This is a vulnerability where an attacker would be able to add his SSH key to a private repository and pull proprietary source code he was not authorized to see. This is why we pay for GitHub rather than use a free account: we don't want people to be able to walk off with our company's intellectual property.

That being said, we don't have the resources to deploy a more secure alternative without hamstringing our development capabilities (e.g. no internet connectivity).


> So it shouldn't matter.

It definitely does matter that any secrets generated were already public:

* Emitting secrets from private repos would be a huge confidentiality issue (though really you shouldn't commit secrets to git at all), as it would take something that's private and exploitable and make it public

* Emitting secrets that are already public doesn't cause the confidentiality issue. Once a secret is out, it's out, and should be changed immediately. By the time it's in Copilot's training set, it'll have already been on search engines/archive sites/black-hat forums/etc.
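On the "shouldn't commit secrets to git at all" point: a common alternative is reading them from the environment (or a secrets manager) at runtime, so the repo never contains the value. A minimal sketch; the function and variable names are illustrative:

```python
import os

def get_api_key(name):
    """Read a secret from the environment instead of committing it to git.

    Fails loudly at startup if the variable is unset, rather than
    silently at request time.
    """
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"required secret {name!r} is not set")
    return value
```

The key itself then lives in the deployment environment (CI secrets, a vault, etc.), and rotating it never touches git history.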

Tangentially, GitHub also does some scanning to alert on accidentally committed secrets in repos: https://docs.github.com/en/code-security/secret-scanning/abo...

> 2. Already public unintentionally.

Right, but therefore already compromised and no longer confidential. Copilot isn't leaking any secrets, someone else did by making them public.

> AFAIK, any code that I produce is automatically copyrighted to me. This means if I write something in public and not provide a license, IT IS LEGALLY under the copyright protection provided to me by my country. At least that is the case in US and India which are home to a huge portion of OSS.

Essentially correct, to my understanding. If you're making it public, you'll generally also give some hosting/publishing/distribution rights to the services involved - as specified by their T&C.

> Reproducing it and remixing my work would be illegal

The US has the concept of fair use, which provides exceptions for "transformative" purposes. For example: copying and downscaling your image to use as a thumbnail, caching the webpage your work is on, or creating a parody of your work.

Consider Google Books for example, where Google scanned millions of copyrighted books and made them searchable (showing snippets). This was ruled fair use due to being transformative.

The question would be whether code generated by Copilot falls under this. Ultimately it's up to the courts to decide, but I'd lean in favor of "yes".

> The PII and secrets is enough to find out the license of a repo which would make it easier to prove whether they violated it or not. Don't tell me all OSS that copilot has trained on is only public domain stuff. Even ISC license needs attribution.

Fair use is about unlicensed usage, so if it's fair use then it doesn't need to abide by the terms of the licenses. Even if it's ruled not to be fair use, I think they could still train it on GitHub-hosted code due to the mentioned rights you give them by agreeing to GitHub's T&C.

> How does that change anything I mentioned? Very curious.

It changes your claim of impossibility, so now it's just a question of whether there's a violation.


I would have expected to see at least some of their source code on GitHub (https://github.com/Crypho), especially the client side, where the private key and passphrase are handled (as described at https://www.crypho.com/security.html). Without being able to inspect the source, why should anyone trust this company more than, say, Google or Facebook?

It's not about security: the kernel team has already explained that the code itself is not vulnerable. Everything is signed by Linus, and hacking GitHub does not a signed release make.

Loading it from an HSM to memory/keychain is probably fine too. It's certainly odd that it found its way to a repo, and it makes you wonder how that could even have happened, and what that indicates about their security practices in general.

GitHub is host to a large percentage of US tech IP. Pretty concerning if you extrapolate.


Plus I'm not sure many GitHub users will appreciate the mild security leak.

Managed git services suck at providing security that scales beyond a few devs. Most orgs that use GitHub are exposed to the risk of having their source code leaked by current or past employees.

I'm hoping this leak will have serious financial consequences and bring awareness to that.


Yeah I'm curious about this myself. Doesn't this create a pretty serious security situation? GitHub isn't a small startup anymore, I would assume they'd have some pretty serious security precautions in place.

The flaw is in having this "private" data in a public repo to begin with. If your data is private, don't put your project on GitHub.

That's not really surprising when they are importing to a new repo and they aren't sure there isn't anything confidential (or bad optics) in previous commits or messages.