While I appreciate the earnest defense of FOSS and the, in all fairness, totally warranted suspicion of Microsoft, given its history, I found the attitude of this article to be very sour and, actually, in bad faith. Let me address the three questions which they posed to MS:
> 1. What case law, if any, did you rely on in Microsoft & GitHub's public claim, stated by GitHub's (then) CEO, that: “(1) training ML systems on public data is fair use, (2) the output belongs to the operator, just like with a compiler”? In the interest of transparency and respect to the FOSS community, please also provide the community with your full legal analysis on why you believe that these statements are true.
I mean, I'd be floored if any corporate lawyer let anyone at [large company] answer this kind of question outside of an actual lawsuit. They are essentially asking the opposing team's lawyers to do all this work for them, for free. This is followed by an "obvious[ly]" correct (I'm being ironic) interpretation of the refusal to answer: that MS is wrong but just won't admit it. But go back and re-read the question. The question was architected to produce this impression if it wasn't answered. That's a sign of a bad faith question, rather than a question with intent to learn the answer.
> 2. If it is, as you claim, permissible to train the model (and allow users to generate code based on that model) on any code whatsoever and not be bound by any licensing terms, why did you choose to only train Copilot's model on FOSS? For example, why are your Microsoft Windows and Office codebases not in your training set?
Other commenters have discussed this one already. There is a perfectly reasonable and legitimate explanation here: They do not want to do anything that remotely risks exposing trade secrets, and that's a separate concern from potentially accidentally violating a license. Suppose the model were trained on all these public repos plus MS's private repos. Someone else could come along and train their own model on just the public code; now there are two code generators whose outputs can be compared to reveal secret information about MS's training set. This time, the article guesses well at the answer: MS cares more about itself than about others. Sure. Why would it be expected not to?
> 3. Can you provide a list of licenses, including names of copyright holders and/or names of Git repositories, that were in the training set used for Copilot? If not, why are you withholding this information from the community?
I think this question is bad faith too. It starts by asking "can you". Then, if the answer is "no, we can't", it reinterprets the answer as "no, we won't" ("withholding" is an intentional act). It is disingenuous to imply that someone who cannot do something is therefore intentionally refusing to do it. In the analysis of the lack of response, the article (finally) admits that it is speculating wildly, backpedals on the implied claim that MS is refusing to provide this information, and instead takes a different approach: MS scientists can't answer because they are not good scientists. But wait, here's the kicker:
> ... so they don't actually know the answer to whose copyrights they infringed and when and how.
Busted! The authors have essentially demonstrated the question is in bad faith by suggesting that the answer to the question, "Whose data did you use?", is the same as the answer to the question, "Whose copyright did you violate?", which is a logical connection made possible only by the underlying presupposition that MS is totally incorrect in its assertion about fair use in question 1. The framing of all these questions suggests to me that the authors were already firmly convinced of their guesses as to the answers/non-answers _at the time of posing the questions_.
If they actually waited for a whole year expecting a response, that's on them. I'm with MS on the decision not to engage here, even if I share all these qualms about Copilot.
1: What other kind of faith in Microsoft would be even remotely warranted?
2) "perfectly reasonable and legitimate explanation ... risks exposing trade secrets ... a separate concern from potentially accidentally violating a license"
2 a: Sure, they may be separate concerns, but Microsoft is acting -- and you, by arguing for them, are at least implicitly, and AFAICS explicitly, endorsing their viewpoint -- as if their interest obviously overrides everyone else's. Why should it? They're the ones who want to do this, so why shouldn't their code be the one exposed to any risk? If you want to test whether some newfangled house-building material is really as fire resistant as its manufacturers claim, you set fire to your own house, not your neighbour's. Also, there's only one Microsoft whose interests would be put at risk if they used their own code; but they chose to expose how many others?
2 b: For someone complaining about "bad faith" on the part of others, "potentially accidentally" is some mighty fine weasel wording. What's "accidental" about intentionally building a product and intentionally training it on a bunch of code written by others? (They didn't just randomly press some keys and say "Oops, let's see if it gets trained on our code now, or everybody else's", did they?)
3) "starts by asking 'can you'. Then, if the answer is 'no, we can't', reinterprets the answer as 'no, we won't' ('withholding' is an intentional act)"
3 a: The word "can" has several valid usages in English. If I say "Can you pass me the salt, please?" and you don't, then you are (assuming you have no severe physical handicap that's stopping you) intentionally withholding the salt from me.
3 b: Even if Microsoft is actually unable to provide the asked-for data, the question arises: how come? They built this product. Not building that traceability into it was their choice. Why did they choose not to?
If anyone is "busted" here, it seems to me that's you: Busted as a Microsoft shill.