> The more predictable a work process becomes, the quicker people can do it too.
The most highly-voted question on Stack Overflow[1] relates to branch predictability. There's an almost 10× speed-up when the branch predictor of the asker's CPU is fully engaged.
Modern branch predictors are extremely powerful, with TAGE and perceptrons achieving 99.5%+ prediction accuracy.
Given the classic Stack Overflow branch predictor question, you are underselling things quite a lot: a factor of 6 on the older processors in question, and who knows on modern processors.
> Branches are often faster than branch-free methods, since they can speculate certain data dependencies away.
And often they're slower, since some branches are simply impossible to predict given the data :) So the processor ends up mispredicting most of the time, assuming that whatever happened last time will happen again.
It's about how sorting + random access can be faster than random access, because you introduce locality of reference. And in this case, it absolutely is faster to sort the data first and then do the work, even counting the sorting.
Edit: Aw crap, adrianN beat me to this a few screenfuls down, sorry! But, it is a different post, so maybe this is still helpful. :D
I think branch prediction on modern processors can handle a branch going either way, as long as it's biased one way or the other. If it's not biased, try to rewrite the branches as branch-free, vectorizable code (or pray that GCC takes care of it for you).
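A minimal sketch of what such a branch-free, vectorizable rewrite can look like (the threshold 128 and the function names are my own example, echoing the Stack Overflow question):

```cpp
#include <cstdint>
#include <vector>

// Branchy: on random data the `if` mispredicts constantly.
int64_t sum_branchy(const std::vector<uint8_t>& data) {
    int64_t sum = 0;
    for (uint8_t x : data)
        if (x >= 128) sum += x;
    return sum;
}

// Branch-free: the comparison becomes a 0/-1 mask, so there is no
// conditional jump to predict, and the loop vectorizes readily.
int64_t sum_branchless(const std::vector<uint8_t>& data) {
    int64_t sum = 0;
    for (uint8_t x : data)
        sum += x & -static_cast<int64_t>(x >= 128);
    return sum;
}
```

Both compute the same sum; only the second leaves the predictor nothing to guess.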
Branch prediction isn't a GCC thing, it's a processor thing.
I think you misunderstood me. I agree that the StackOverflow post has to do with branch prediction. I just meant that I don't think that's why the earlier poster thinks it's parallel to the situation described in the article.
Since this is the third recent item on HN featuring the impact of branch prediction on code performance, let's make sure we keep our heads on straight: a good algorithm will outperform a bad algorithm that's tuned for branch prediction.
Quicksort is infamously hard on branch predictors, yet it is still the standard against which all other sorting algorithms are compared. Some work is being done to improve Quicksort's branch prediction performance [1], but that work mostly focuses on choosing better pivots.
A perfect sorting algorithm, something like a very large sorting network, will necessarily make any processor's branch predictor give up and go home -- and that'll still be faster to execute than a naive algorithm tuned for branch prediction.
> Like [..] CPU Branch Prediction Idiocy
Can you elaborate on that, or point me to some sources? I thought branch prediction was a good thing for speed (until now).
He's talking about branch prediction because it's intimately tied to speculative execution. He's trying to explain why the standard 'branchy' version of binary search performs better than one might naively expect, and thus why the speedup from increasing MLP (memory-level parallelism) is less than one might expect.
When the processor reaches a conditional branch (the 'if' statement comparing the test value), rather than waiting for the actual results of the comparison to be known (and thus waiting for the test value to be loaded), it blindly guesses one of the branches and continues executing instructions as fast as it can. It doesn't commit the results of these instructions ('retire' the instructions) until the real answer is known, but crucially, it executes the loads sooner than they would otherwise be executed.
Most of these early loads are along the wrong path and are wasted, but half the time the first guess is right; half of the remaining time, the guess after that; and so on. The end result is that branchy binary search ends up roughly twice as fast as a branchless version that waits for each load to complete and issues only the loads that actually need to be made.
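To make the two versions concrete, here's a minimal sketch of branchy vs. branchless lower-bound search (my own illustration, not the exact code under discussion):

```cpp
#include <cstddef>
#include <vector>

// Branchy binary search: the compare is an unpredictable branch, but
// speculation lets the CPU start loads down the guessed path early.
std::size_t lower_bound_branchy(const std::vector<int>& a, int key) {
    std::size_t lo = 0, hi = a.size();
    while (lo < hi) {
        std::size_t mid = lo + (hi - lo) / 2;
        if (a[mid] < key) lo = mid + 1;  // guessed and speculated past
        else              hi = mid;
    }
    return lo;
}

// Branchless variant: the ?: typically compiles to a conditional move,
// so there is nothing to predict -- but each load's address depends on
// the previous load's result, so the loads serialize.
std::size_t lower_bound_branchless(const std::vector<int>& a, int key) {
    if (a.empty()) return 0;
    std::size_t base = 0, n = a.size();
    while (n > 1) {
        std::size_t half = n / 2;
        base += (a[base + half - 1] < key) ? half : 0;  // data dependency
        n -= half;
    }
    return base + (a[base] < key ? 1 : 0);
}
```

Both return the index of the first element not less than `key`; the difference is purely in how the hardware overlaps the memory accesses.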
> first, modern CPUs have invested a lot of silicon into branch prediction, so if a function pointer has been called recently it will likely be predicted correctly the next time and called quickly
Huh, TIL. Branch prediction is normally about predicting which branch an `if` would take. But apparently this applies to indirect jumps as well: https://stackoverflow.com/a/26240197/1082652
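A minimal sketch of such an indirect call site (names are my own). An indirect call is a branch whose *target address* must be predicted via the branch target buffer, not just a taken/not-taken bit:

```cpp
// A call site that keeps hitting the same target predicts well; one
// that alternates between many targets does not.
int add_one(int x) { return x + 1; }
int double_it(int x) { return x * 2; }

int apply_repeatedly(int (*fn)(int), int x, int times) {
    for (int i = 0; i < times; ++i)
        x = fn(x);  // same target every iteration: cheap after the first call
    return x;
}
```

If this loop were instead fed a different function pointer on each iteration, the target predictor would have a much harder time.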
The question I have, which I do not see answered after a quick glance of the top responses, is: What is the total time of sorting + good branch prediction loop, versus not sorting and bad branch prediction loop? Does the good branch prediction save more time than it cost to sort the array, on modern processors? What about old processors with shorter pipelines?
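Nobody can answer that without measuring on a specific machine, but the experiment is easy to sketch. This is my own hedged outline (threshold 128 as in the Stack Overflow question; timings will vary by machine). The key observation is that the sort is a one-time O(n log n) cost, while the mispredict penalty is paid on every pass over the data, so the answer depends heavily on how many passes you make:

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <utility>
#include <vector>

// The hot loop: with random bytes the `if` mispredicts roughly half the
// time; on sorted data it is nearly free.
int64_t conditional_sum(const std::vector<uint8_t>& data, int passes) {
    int64_t sum = 0;
    for (int p = 0; p < passes; ++p)
        for (uint8_t x : data)
            if (x >= 128) sum += x;
    return sum;
}

// Returns {unsorted_us, sort_plus_sorted_us}, charging the sort itself
// against the sorted run. Numbers are entirely machine-dependent.
std::pair<long long, long long> time_both(std::vector<uint8_t> data, int passes) {
    using clock = std::chrono::steady_clock;
    auto us = [](clock::time_point a, clock::time_point b) {
        return (long long)std::chrono::duration_cast<
            std::chrono::microseconds>(b - a).count();
    };
    auto t0 = clock::now();
    volatile int64_t s1 = conditional_sum(data, passes);  // volatile: keep the work
    auto t1 = clock::now();
    std::sort(data.begin(), data.end());
    volatile int64_t s2 = conditional_sum(data, passes);
    auto t2 = clock::now();
    (void)s1; (void)s2;
    return {us(t0, t1), us(t1, t2)};
}
```

Running `time_both` on a few hundred KB of random bytes with `passes` ranging from 1 upward would show where the crossover sits on a given processor.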
The predictors work by caching which branches were taken, and assuming the same outcome is likely to happen next time the branch at that address is checked. That particular branch is checked all the time, and unless you actually have >2GB strings the result is always the same.
Only branches with unpredictable runtime behavior are bad for performance, like in binary search, where the same branch is taken or not taken essentially at random depending on the data and the key being searched. Branches with stable runtime behavior are OK thanks to branch prediction. Examples of good branches: `while( i < 10000 )`, `if( string_length < INT_MAX )`
Technically the branch prediction benefit comes from partitioning the array around the critical value 128. Sorting is not necessary. Partitioning may be faster.
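A sketch of that idea using `std::partition` (my own example; the 128 threshold is the one from the Stack Overflow question):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// std::partition groups the elements below the threshold before those at
// or above it in O(n), without the full O(n log n) ordering of std::sort.
// That grouping alone is what makes the `x >= 128` branch predictable.
void group_by_threshold(std::vector<uint8_t>& data) {
    std::partition(data.begin(), data.end(),
                   [](uint8_t x) { return x < 128; });
}
```

After this, a loop over `data` with an `if (x >= 128)` test flips its outcome exactly once, which even a simple predictor handles well.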