I didn't find that reading largely correct but still often wrong code was a good experience for me, or that it added any efficiency.
It does do a very good job of intelligently synthesizing boilerplate for you, but whether it's Copilot or AlphaCode, these models still don't understand the fundamentals of coding in a causal sense: what effect a single instruction has on the space of program states.
Still, this is exciting technology, but again, whether such a machine learning model will ever materialize at all is a big "if".
I don't think it's possible for Copilot to improve on this problem. It doesn't actually understand the code; it's just statistical models all the way down. There's no way for Copilot to judge how good code is, only how frequently it has seen similar code. And frequency is not the same thing as quality.
This is not such a big problem in reality, because the output of Copilot can be filtered to exclude snippets too similar to the training data, or to any corpus of code you want to avoid. It's much easier to guarantee clean output that way than to train the model in the first place.
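Conceptually such a filter can be as simple as measuring n-gram overlap between a suggestion and a blocklist corpus. A rough sketch of what I mean (the tokenizer, shingle size, and threshold are all my assumptions, not how Copilot actually does it):

    # Toy "too similar to the corpus" filter -- illustrative only.

    def ngrams(code, n=8):
        """Return the set of n-token shingles in a piece of code."""
        tokens = code.split()
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def too_similar(suggestion, blocked_corpus, threshold=0.5):
        """Reject a suggestion whose shingles overlap too much with any blocked file."""
        sugg = ngrams(suggestion)
        if not sugg:
            return False
        for blocked in blocked_corpus:
            overlap = len(sugg & ngrams(blocked)) / len(sugg)
            if overlap >= threshold:
                return True
        return False

    # Usage: drop any completion that trips the filter.
    completions = ["def add(a, b):\n    return a + b"]
    clean = [c for c in completions if not too_similar(c, blocked_corpus=[])]

A real system would presumably use hashing or an index instead of scanning the whole corpus, but the principle is the same: filtering generated text is a much cheaper problem than generating it.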
I might have missed it today in your articles or comments here--it's been a hectic day--but has there been any study of just how different the code would be, given that the students are using the same text from the questions? Is there randomization intrinsic to Copilot, or is it just that minor variations in the textual input cause the code to be so different?
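By "randomization intrinsic" I mean something like temperature sampling, where the same prompt can produce different completions across runs. A toy sketch of that idea (the vocabulary and probabilities are made up, and I'm only assuming Copilot samples this way):

    import math
    import random

    # Toy next-token distribution for a single fixed prompt (made-up numbers).
    vocab = ["for", "while", "if", "return"]
    logits = [2.0, 1.5, 0.5, 0.1]

    def sample(logits, temperature=0.8):
        """Softmax sampling: a nonzero temperature makes repeated runs differ."""
        scaled = [l / temperature for l in logits]
        m = max(scaled)
        weights = [math.exp(s - m) for s in scaled]
        return random.choices(vocab, weights=weights, k=1)[0]

    # Same "prompt", several runs: the chosen token varies from run to run.
    print([sample(logits) for _ in range(5)])

If that's roughly what happens, two students pasting the identical question could still get different code even with no variation in the prompt.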
My wife taught CS and did catch cheaters pre-Copilot, and my first thought is that she would probably enter the test questions and print out a reference sheet of Copilot-generated results.
The thing is, the training data is "everything on GitHub". That contains quite a lot of student assignments that are poorly and incompletely done.
I don't know why anyone would trust Copilot for anything that isn't so trivial that it could be done with more deterministic tools.
I haven't used Copilot, but your experience sounds exactly like what I would expect. Since AI is based on prediction, it makes sense that broader predictions would be less accurate. I think stringing together the output of a lot of smaller predictions would yield better results. Which, at the end of the day, means that a human + AI will always be more productive than AI on its own, at least for the foreseeable future.
I find his comment on Copilot enlightening. He says it's not always right, but he still finds it useful. Kinda like that old saying in stats: "all models are wrong, but some models are useful". Language models are really helpful when you have enough experience to judge when they fail; I do feel sorry for novices who don't have that experience yet.
Well I'd hope that is what is going on in Copilot. It definitely does seem to be trained on my code to some extent, but it doesn't have anything I'd call a semantic understanding of it.