This is actually really similar to something I've been wanting to build for a long time. In my case, I've thought it would be useful to have a way to estimate how likely a given change is to break things, based on the history of breaking changes in the same file or area of the file. Basically, a riskiness score for each change. The score could be attached to each PR, giving reviewers a signal about which code deserves a bit of extra attention, and it could also highlight the risky changes as they're being deployed.
The tricky part would be tracking the same part of the code as it moves up and down due to insertions/deletions above it, which would trip up a naive algorithm based on line numbers.
Just doing it at the file level, like this does, might be good enough to be useful though.
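As a starting point, a minimal file-level sketch might look something like the following. It treats past commits whose messages mention "revert"/"fix" as a crude stand-in for "this area has broken before" (a hypothetical proxy; real labels would come from reverts or incident data) and scores a change by how often the files it touches appear in those commits.

    # Crude file-level risk score: how often have the files touched by a change
    # appeared in past commits that look like fixes/reverts? The "fix/revert"
    # keyword match is a hypothetical proxy for real breakage labels.
    import subprocess
    from collections import Counter

    def commits_with_files(repo):
        """Yield (commit subject, [changed files]) for each commit in the repo's history."""
        out = subprocess.run(
            ["git", "-C", repo, "log", "--name-only", "--pretty=format:%x1e%s"],
            capture_output=True, text=True, check=True,
        ).stdout
        for record in out.split("\x1e")[1:]:
            lines = [l for l in record.splitlines() if l.strip()]
            if lines:
                yield lines[0], lines[1:]

    def risk_score(repo, changed_files):
        breakage = Counter()   # file -> commits touching it that look like fixes
        touched = Counter()    # file -> total commits touching it
        for message, files in commits_with_files(repo):
            looks_like_fix = any(w in message.lower() for w in ("revert", "fix", "hotfix"))
            for f in files:
                touched[f] += 1
                if looks_like_fix:
                    breakage[f] += 1
        # Average the per-file "fraction of commits that were fixes" over the change.
        ratios = [breakage[f] / touched[f] for f in changed_files if touched[f]]
        return sum(ratios) / len(ratios) if ratios else 0.0

    print(risk_score(".", ["src/parser.py"]))  # e.g. 0.31 -> historically brittle area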
If I was doing an attack like this, I'd wait for an opportunity to slip it into a huge, mundane change.
For example, changing the repository from Mercurial to Git. Splitting or combining two repositories. Moving lots of files between directories. Running an autoformatter over the entire codebase. Something like that.
With every code review tool I've seen, there's /some/ way to get it to show thousands of trivial changes. And I don't know about you, but I can't promise I'd spot 1 evil change among 1000 trivial ones.
Absolutely. Codemonkey.ai analyzes every source code commit and correlates each with an issue tracking system to understand when defects are introduced, features added, etc.
We build a predictive model, using the prior commit history as a training set, to score the risk of new commits going forward. The model draws on more than 50 metrics that we derive from each commit.
With this model we can do the following:
* Correlate code commits/changes to features, stories, and work items to understand the quality and risk impact that those things have on the codebase
* Utilize clustering to provide insights around developer productivity and quality impact to the codebase
* Utilize clustering to help with sizing of future features based on historical actuals
These are just a few of the areas where we are using ML within Codemonkey.ai today. In the future our goal is to tie additional information from production environments into this model as well.
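For anyone curious about the general shape of this, it's essentially supervised learning over per-commit metrics. The sketch below is not our actual pipeline, and the features and labels are made-up stand-ins, but it shows the idea:

    # Minimal sketch of a commit-risk model: per-commit metrics as features,
    # "this commit was later linked to a defect" as the label. Features and
    # labels here are synthetic placeholders, not a real metric set.
    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(0)
    n = 2000
    X = np.column_stack([
        rng.poisson(40, n),     # lines added
        rng.poisson(15, n),     # lines deleted
        rng.poisson(3, n),      # files touched
        rng.integers(0, 24, n), # hour of day committed
    ])
    # Stand-in labels; in practice they come from correlating commits with the
    # issue tracker (e.g. a later defect fix touching the same code).
    y = (X[:, 0] + X[:, 1] + 10 * rng.standard_normal(n) > 70).astype(int)

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = GradientBoostingClassifier().fit(X_train, y_train)
    print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))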
Yep, it wouldn't be too surprising if it does already exist. It'd also potentially function as something like a more advanced diff tool that saves time for whoever uses it.
If it were mature and reliable enough, it could actually be used to predict which changes are likely to cause future problems, and the code reviewer(s) would then have to decide how to act on that information.
This is nonsense. The number of lines changed has absolutely nothing to do with the amount of risk.
A single-line change can easily break an entire system, and acting on this size fallacy is just as dangerous. Certain classes of changes should be grouped together so they can be reviewed with the context required.
I've had the unfortunate task of working in environments where I've been forced to artificially break up my PRs to suit some arbitrary rule, no doubt because an article like this got read at some point.
What ends up happening is that instead of having everything related to a specific changeset or feature in a single PR/commit, you now have it strewn across multiple PRs/commits, nonsensically, and with artificial boundaries most of the time. This makes review more difficult and creates additional work to keep the partial implementation working at each slice.
Process should not replace thinking. If a feature warrants it, who cares if the PR is 2,000 lines or 12. Not every task is small and not every task is large. As long as the scope is well defined, many times it does make sense to do tasks in whole rather than as a sum of parts.
We do it at the symbol level after statically analyzing each change, and everything in the monorepo daily. Our remedy for high-risk changes is to run more tests: client tests, not unit tests. Sometimes there are 100k client tests to pick from, so we rank them and run a small subset.
It is a hard problem. One interesting observation is that there is usually a culprit symbol or two in the culprit change, but its connectivity is very similar to the non-culprits in the same change.
Another observation is that the transitively modified callgraph after a change is pretty big; a depth of 50 is not unusual. It is hard to get many useful signals out of it beyond the amount of overlap between the symbols transitively affected by the change and those exercised by a test.
We found file level and build target level to be too coarse, but AST symbols are working.
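For the ranking step, a rough sketch (assuming the static analysis already gives you the transitively affected symbols of a change and a per-test symbol coverage map) could look like this:

    # Rank candidate client tests by how much their covered symbols overlap with
    # the symbols transitively affected by a change, and run only the top k.
    # Symbol extraction and coverage maps are assumed to exist already.
    def rank_tests(affected_symbols, test_coverage, k=100):
        """affected_symbols: set of symbol ids touched (transitively) by the change.
        test_coverage: dict mapping test name -> set of symbol ids it exercises."""
        def overlap(test):
            covered = test_coverage[test]
            # Jaccard-style score: favour tests focused on the affected area,
            # not just huge tests that cover everything.
            return len(covered & affected_symbols) / (len(covered | affected_symbols) or 1)
        return sorted(test_coverage, key=overlap, reverse=True)[:k]

    affected = {"Parser.parse", "Lexer.next_token", "AST.Node.__init__"}
    coverage = {
        "test_parser_end_to_end": {"Parser.parse", "Lexer.next_token", "AST.Node.__init__"},
        "test_lexer_unicode":     {"Lexer.next_token"},
        "test_http_client":       {"HttpClient.get", "Url.parse"},
    }
    print(rank_tests(affected, coverage, k=2))
    # ['test_parser_end_to_end', 'test_lexer_unicode']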
Sounds super cool! I’ll definitely show this to our devops guy in the company. However, it sounds like there might be a lot of edge cases that don’t involve any code changes, e.g. environment variable changes. Have you thought about that?
Sounds good, although it would have to be context aware. For example, code that often gets removed in a production environment might be dissimilar to code that is typically removed in dev or testing.
There are also other triggers of code removal and refactoring that are outside the code base, such as an organisation migrating to a different platform. An AI trained on a large public commit history could encourage a general shift towards already-established big players, punishing smaller organisations.
I think it's harder to safely do changes to development machines, because the coupling between components is greater. If you want to change, e.g., libxml or something like that, lots of processes that you don't know about might be affected, and all hell can break loose.
By contrast, I was generally nuking random userland processes, which no process (or user) would mourn or miss. I think that is a lot safer. There were cases where we would touch something important, like the CreateRemoteThread stuff, but that was a relatively small amount of our code, which rarely changed, and again, it had very little interaction with anyone else's code.
It's also possible that we _did_ create a lot of havoc, but I didn't know. I think that's less likely because I think we would have noticed the loss of revenue, but it's possible.
If there's tooling that can do this efficiently, it sounds like a win, since all changes in implementation details become explicit and can potentially be tracked more easily.
It helps a lot though. One of many small incremental changes, or one big monster change, which do you think is likely to break a system? And which is going to be easiest to diagnose and fix?
I wonder if there is a qualitative analysis of the commits, i.e. did it change a line of comments vs. introduce a new feature, or refactor and improve long-term viability, etc.
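Even a crude heuristic gets you part of the way. For example, a sketch that splits comment-only changes from code changes in a unified diff (purely illustrative; a real classifier would parse the language):

    # Heuristic split of a unified diff into comment-only vs. code changes.
    COMMENT_PREFIXES = ("#", "//", "*", "/*", '"""')

    def classify_diff(diff_text):
        changed = [
            line[1:].strip()
            for line in diff_text.splitlines()
            if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
        ]
        changed = [l for l in changed if l]  # ignore pure-whitespace changes
        if not changed:
            return "whitespace-only"
        if all(l.startswith(COMMENT_PREFIXES) for l in changed):
            return "comment-only"
        return "code-change"

    diff = "\n".join([
        "--- a/util.py",
        "+++ b/util.py",
        "@@ -1,3 +1,3 @@",
        "-# old explanation",
        "+# clearer explanation",
        " def f(x):",
    ])
    print(classify_diff(diff))  # comment-only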
In your example (change a function signature and change all of its callers in one commit), would it require code reviews and approvals from every affected project?
This is one of the reasons why we have infrastructure as code now, so system changes can be reviewed and tested just like application code, and more types of accidents can be reverted via source control :)
In this case I think a lot of people are wondering what they changed. Would it be possible to extract that info from the paths modified, filtered with some handwritten rules to classify the changes? For example, did they commit driver code for their own hardware, or did they touch other areas of the kernel? How much of each?
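Something like the sketch below would probably get most of the way there, assuming the handwritten rules are just path-prefix buckets (the prefixes are hypothetical examples for a kernel-style tree):

    # Bucket each modified path with a few handwritten prefix rules and report
    # how much of the change falls in each bucket. Prefixes are illustrative.
    from collections import Counter

    RULES = [  # (bucket, path prefix) - first match wins
        ("own driver",   "drivers/net/ethernet/acme/"),  # hypothetical vendor dir
        ("other driver", "drivers/"),
        ("core kernel",  "kernel/"),
        ("mm",           "mm/"),
        ("docs",         "Documentation/"),
    ]

    def classify_paths(paths):
        counts = Counter()
        for p in paths:
            bucket = next((name for name, prefix in RULES if p.startswith(prefix)), "other")
            counts[bucket] += 1
        total = sum(counts.values())
        return {b: f"{100 * c / total:.0f}%" for b, c in counts.items()}

    print(classify_paths([
        "drivers/net/ethernet/acme/main.c",
        "drivers/net/ethernet/acme/hw.h",
        "kernel/sched/core.c",
    ]))  # {'own driver': '67%', 'core kernel': '33%'}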