If you don't opt out by Apr 24 GitHub will train on your private repos

by vmg12 316 comments 745 points
Read article View on HN

316 comments

[−] martinwoodward 49d ago
No we won’t. Details here https://github.blog/news-insights/company-news/updates-to-gi...

For users of Free, Pro and Pro+ Copilot, if you don’t opt out then we will start collecting usage data of Copilot for use in model training.

If you are a subscriber for Business or Pro we do not train on usage.

The blog post covers more details but we do not train on private repo data at rest, just interaction data with Copilot. If you don’t use Copilot this will not affect you. However you can still opt out now if you wish and that preference will be retained if you decide to start using Copilot in the future.

Hope that helps.

[−] qaadika 49d ago

>

https://github.blog/news-insights/company-news/updates-to-gi...

> Should you decide to participate in this program, the interaction data we may collect and leverage includes:

> - Outputs accepted or modified by you

> - Inputs sent to GitHub Copilot, including code snippets shown to the model

> - Code context surrounding your cursor position

> - Comments and documentation you write

> - File names, repository structure, and navigation patterns

> - Interactions with Copilot features (chat, inline suggestions, etc.)

> - Your feedback on suggestions (thumbs up/down ratings)

"should you decide to participate.."??? You didn't ask if I wanted to participate. You asked if I didn't.

I didn't get to decide to participate. I had to decide not to. You made me do work to prevent my privacy from being violated.

[−] jffry 49d ago
It's unnecessarily splitting hairs.

> interaction data—specifically inputs, outputs, code snippets, and associated context [...] will be used to train and improve our AI models

So using Copilot in a private repo, where lots of that repo will be used as context for Copilot, means GitHub will be using your private repo as training data when they were not before.

[−] munk-a 49d ago
The initial title and your reply are both too broad to be fully accurate. By April 24th Github will train on private repos (assuming a flag isn't set) but this change is limited to just non-Business/Pro users. So a number of private repos will be effected but it won't automatically affect all private repos (so my panic check on our corporate account wasn't necessary yet).

I am not certain if you're a spokesperson for github - but it's good to be careful in your language. Instead of "No we won't" a lead like "That isn't entirely accurate" would be more suitable. In the end both the original post title and your reply have ended up being misleading.

[−] andoando 49d ago
Thats still pretty bad. Its no longer private if all your code goes through LLM training set and is resurfable to everyone publicly.

Why would I ever use copilot on any code Id want to be kept private? Labling it a private repo and having a tiny clause in the TOS saying we can take your code and show it to everybody is just an upright lie

[−] layer8 49d ago
In the EU, opt-out is not a legally valid way to obtain the necessary consent. How do you plan to handle this?
[−] otterley 49d ago
Hey Martin, can you please work with Product to significantly clarify what is meant by the following language in the settings? Because right now it's nearly impossible for a layperson (or even an average programmer) to understand what this means:

""" Allow GitHub to use my data for AI model training

Allow GitHub to collect and use my Inputs, Outputs, and associated context to train and improve AI models. Read more in the Privacy Statement. """

If the reality is less scary than how it sounds, then the wording needs to be less scary-sounding. It may be that GitHub isn't training models on private repos, but the language certainly suggests that it is. The feedback we're seeing in this post is proof enough of that.

Finally, I read the Privacy Statement, and it's unclear what the applicable language is. "Inputs," "Outputs," and "Associated Context" are terms of art that have no matching definitions in the Statement. (The terms "Outputs" and "Associated Context" don't even appear in the Statement at all. Not even "train.") As an attorney I find this completely baffling.

[−] saghm 49d ago
Yes, you will. This is what the setting says on my account when I clicked the link:

> model training

> Allow GitHub to collect and use my Inputs, Outputs, and associated context to train and improve AI models. Read more in the Privacy Statement

Are you seriously trying to claim that the code isn't input, output, or associated context of Copilot operating on a private repo? What term do you think better applies to the code that's being read as input, used as context, and potentially produced as output?

[−] wewtyflakes 49d ago
If Copilot later adds a feature like "Scan your repo for vulnerabilities using Copilot ", then that would both fit your criteria, and the baiting outrage of the original poster, in one swoop! Of course, Microsoft would _never_ do that, right?
[−] edelbitter 49d ago

> If you don’t use Copilot this will not affect you.

How does this work for a private repository with access granted to additional contributors? Which setting is consulted then?

[−] daveguy 49d ago
Nice try. If you're training on "inputs" to Copilot then you are training on the private repos.

This suspect denial is why I will get my clients moved off of github.

[−] grepfru_it 49d ago
Back in my day someone would post a HN article to the internal slack in order to sway conversation in their favor. Glad to see its still happening! :D
[−] BoredPositron 49d ago
Yes you do? If a user uses any form of copilot in one of his repos except ofc enterprise, says so right in the blog post. These aktshually corporate technicality defense posts aren’t helping, they just end up making you personally look a bit fishy.
[−] ChrisArchitect 49d ago
[−] kepano 49d ago
I've been saying this since 2023

> If your data is stored in a database that a company can freely read and access (i.e. not end-to-end encrypted), the company will eventually update their ToS so they can use your data for AI training — the incentives are too strong to resist

https://news.ycombinator.com/item?id=37124188

[−] landl0rd 49d ago
This headline is false; it will not go take your private repos and dump them into a training dataset. Rather, GitHub will train on your copilot interactions with your private repos. If you do not use copilot, this makes no difference to you, though you should probably still turn it off.
[−] uberman 49d ago
If even one person in a repo does not disable this will copilot have full access to the repo? How can I determine if other members of my team have turned this off or not?
[−] munk-a 49d ago
The only setting I'm seeing is on a per-user basis. Does anyone know how to blanket disable training on an organizational basis?

Is there any information about how much information from an organization managed repo may be trained on if an individual user has this flag enabled? Will one leaky account cause all of our source code to be considered fair game?

[−] hedayet 49d ago
To Github's credit, they have been showing a banner consistently. To my discredit - I never bothered to read that banner until I saw this HN headline
[−] parsimo2010 49d ago
Jokes on them, my private repos are total dog dookie. If nobody but me can see the code then I don't have to worry about style, structure, comments, or any other best practices.

You don't want an LLM trained on my private repos. Trust me.

[−] SunshineTheCat 49d ago
RIP all the people who have been paying Github for years and never happen to see the notice.
[−] w10-1 49d ago
https://github.com/settings/copilot/features

The feature to opt out is at the bottom under privacy: "Allow GitHub to use my data for AI model training"

TIL: you cannot opt out of a copilot-pro subscription. How is it a subscription if I can't cancel?

(Honestly, who has time to evade all these traps? Or to migrate 150+ repo's on 6+ machines...)

[−] mxtbccagmailcom 49d ago
Time to put adversarial code into GitHub to pollute the training set?
[−] kristianp 49d ago
What's a good alternative for free private repos?
[−] sedatk 49d ago
I have an individual GitHub Copilot Pro subscription and also am a member of an Enterprise account that has one of its GitHub Copilot Business seats assigned to me. The opt-out setting doesn't appear on my individual profile anymore. However, I want to be able to use individual GitHub Copilot subscription for my individual work, and it seems like I can't do it anymore as Enterprise has taken over all my preferences. What a mess.
[−] maplethorpe 49d ago
What's the best way to poison my repos to sabotage LLM training? Asking for a friend.
[−] prmoustache 49d ago
While I understand the network effect of github for public project, I don't really understand why one would want to use it for private repos.

There are tons of git providers including free ones that include full gitlab/gitea/forgejo to get similar features to github and there is nothing more easy to self host or host on a vps with near zero maintenance.

[−] _pdp_ 49d ago
Rather than defending this absurd decision, GitHub could instantly win back trust by admitting they f*** up and reversing it entirely.

If they want to incentivise people to contribute their sources and copilot sessions, they could easily make it opt-in on a per-repository basis and provide some incentive, like an increased token quota.

This is not hard.

[−] jmward01 49d ago
They just lost my repos. I can not believe they snuck this in. My level of anger right now is far higher that I ever wanted to feel. I went to API access for anthropic, paying more in the process, to avoid them training on my code. And GH just -adds- this, without telling me? Without a prompt. They are dead to me.
[−] GMoromisato 49d ago
I'm sure this is just me, but I don't mind if AI trains on my public or private repos. I suspect my imagination is just not good enough to come up with downsides.

So far it's been a benefit because coding agents seems to understand my code and can follow my style.

I don't store client data (much less credentials) in my repos (public or private) so I'm not worried about data leaks. And I don't expect any of my clients to decide to replace me and vibe code their way to a solution.

I do worry (slightly) about large company competitors using AI to lower their prices and compete with me, but that's going to happen regardless of whether anyone trains on my code. And my own increases in efficiency due to AI have made up for that.

[−] jacamera 49d ago
Lots of hair splitting in the comments. The service is so unreliable at this point that I don’t trust them to not train on private repos even accidentally. You’re one vibe-coded PR away from having all your data scooped up regardless of any policy or intention.
[−] AndrewKemendo 49d ago
I started self hosting my own git on a digital ocean droplet with Gitea (1). It’s been unbelievably fantastic and trivially easy to manage experience and I can make them public and invite contrib ans do integrations … I see zero downsides

I see no reason to ever go back to holding my code elsewhere.

Don’t forget git is fairly new

When I first started doing production code it was pre-github so we used some other kind of repo management system

This is a perfect example of where the they’re starting to cannibalize their base and now we have the ability to get away from them entirely.

(1) https://about.gitea.com/

[−] bonestamp2 49d ago
Thanks for the heads up, I assumed they had already done this with my data.
[−] yonatan8070 49d ago
How do I opt out of this for my own private repos? I don't see anything related to this as I've got a ton of settings for Copilot itself (I have access to Copilot through my work org)
[−] endofreach 49d ago
How did people forget that github was purchased by that one company?
[−] livinglist 49d ago
Thanks for posting this, I was never made aware of this by GitHub..
[−] maxloh 49d ago
Context: https://github.com/orgs/community/discussions/188488

TLDR: As long as you aren't using Copilot, your code should be safe (according to GitHub).

  What data are you collecting?

  When an individual user has this setting enabled, the interaction data we may collect includes:

  - Outputs accepted or modified by the user
  - Inputs sent to GitHub Copilot, including code snippets shown to the model
  - Code context surrounding the user’s cursor position
  - Comment and documentation that the user wrote
  - File names, repository structure, and navigation patterns
  - Interactions with Copilot features including Chat and inline suggestions
[−] bsza 49d ago
I've been encrypting my private git repos for a while because I had suspected they were going to do something like this.

https://github.com/flolu/git-gcrypt

It's very easy to set up and integrates nicely into git. Obviously only works if you don't need Actions or anything that requires Github to know what's in your repo (duh).

[−] Esophagus4 49d ago
There’s a lot of furor in this thread, but people felt the same way when Google Street View came out. Eventually they worked through most of the thorny bits and people use Street View now.

I suspect MSFT is in a similar spot. If they don’t train on more data, they’ll be left behind by Anthropic/OAI. If they do, they’ll annoy a few diehards for a while, they’ll work through the kinks, then everyone will get used to it.

[−] rrgok 49d ago
I'm gonna put a license fee on all my repos. 10% of revenue if my private repos have been used for AI training. 5% on all my other repos.
[−] sethops1 49d ago
When Louis Rossmann started describing tech leadership as having a "rapist mentality" I brushed him off as being sensationalist. But actions like this make me think more and more he's right. The product managers pushing for changes like this are despicable scum.
[−] mrled 49d ago
I'm curious about specific consequences of this. I tend to think the importance of code secrecy has always been exaggerated (there are specific exceptions like hedge fund strategies and malware), even more so now in this post-Claude world. Does anyone have specific things they're trying to avoid by opting out of this?
[−] JonChesterfield 49d ago
Don't give your code to Microsoft if you don't want them to have your code.

This setting will make no difference to whether your code is fed into their training set. "Oops we accidentally ignored the private flag years ago and didn't realise, we are very sorry, we were trying to not do that".

[−] torben-friis 49d ago
How's the codeberg experience nowadays? I think it's finally time to switch for me.
[−] bolangi 49d ago
Hah, github can have my crap code. Anyone trained on it will be in for a world of hurt :-)
[−] tartoran 49d ago
If you opt out Github will probably still train on your private repo. Just migrate.
[−] shevy-java 49d ago
Microslop tries to make money off of our data on github. Not a big surprise though.
[−] Sohcahtoa82 49d ago
I wonder how effective it would be to sabotage the training by publishing deliberately bad code. A FizzBuzz with O(n^2) complexity. A function named "quicksort" that actually implements bogosort. A "filter_xss" function that's a no-op or just does something else entirely.

The possibilities are endless. I thought of this after remembering seeing a post a couple months ago about how it doesn't take a significant amount of bad data to poison an LLM's training.

[−] rakel_rakel 49d ago
I'm looking forward to the class action lawsuit, even if only to establish a precedent!

I don't have much hope, but I wish that ignoring software licensing and attribution at scale becomes harder than it currently seems.