Not really; there's no rounding problem with x << 1. x * 2, x+x, x<<1 are all equivalent. The real reason it doesn't use the shift instruction on x86 is because it takes 3 bytes to encode the instruction whereas adding a register to itself takes two. But like it says, for powerpc it does use a shift as there's no difference in instruction lengths on a RISC.
random aside: gcc usually uses the LEA instruction for this sort of thing, which lets you compute any combination of {1,2,4,8} * x + y + n in one cycle where x and y are registers and n is a constant number. So even with the LEA instruction you can use either lea eax, [eax*2] or lea eax, [eax+eax] to compute this. And so on. I think add eax, eax is most likely what it does though.
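A tiny illustration: multiply by a small constant is exactly where LEA shines, since a plain add can't do it in one go. With optimization on, gcc typically turns this into a single leal (%rdi,%rdi,4), %eax:
int times5(int x) {
    return x * 5;   /* 5*x = x + 4*x, which fits LEA's base + index*scale form */
}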
But all of these use a bit shift instead of the (I)DIV instruction. I wasn't saying compilers can't be stupid, but the AI-generated comment explicitly stated that the shift operation was more efficient than division, not that it would result in fewer instructions emitted.
midpoint_rep_lodsb_handwritten:
endbr64
movabsq $0x8000000000000000, %rax
xorq %rax, %rsi # convert two's complement to biased
xorq %rax, %rdi
addq %rsi, %rdi
rcrq $1, %rdi
xorq %rdi, %rax # convert back
ret
(sorry if I got the syntax wrong, AT&T is just horrible)
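Here's roughly the same trick in C for anyone allergic to the syntax (function name is mine): bias both operands so the signed range maps onto the unsigned range, add, halve while keeping the carried-out bit, then un-bias.
#include <stdint.h>
int64_t midpoint(int64_t a, int64_t b) {
    const uint64_t bias = 0x8000000000000000ull;
    uint64_t ua = (uint64_t)a ^ bias;          /* two's complement -> biased */
    uint64_t ub = (uint64_t)b ^ bias;
    uint64_t sum = ua + ub;
    uint64_t carry = sum < ua;                 /* the bit rcr rotates back in */
    uint64_t mid = (sum >> 1) | (carry << 63); /* the 65-bit sum, halved */
    return (int64_t)(mid ^ bias);              /* biased -> two's complement */
}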
Ah, the olden days of 8086. When xor ax, ax was faster than mov ax, 0 and so was xor ax, bx; xor bx, ax; xor ax, bx to swap instead of using a temporary.
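(In C the register-free swap is the classic three-XOR dance; just don't hand it two pointers to the same object, or it zeroes it out.)
void xor_swap(unsigned *a, unsigned *b) {
    *a ^= *b;   /* a = a^b */
    *b ^= *a;   /* b = b^(a^b) = old a */
    *a ^= *b;   /* a = (a^b)^(old a) = old b */
}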
Some processors also have ror and rol, rotate instructions that feed the shifted-out bits back in at the other end. The 8086 also had rotate-through-carry, rcr and rcl, which helps when implementing sign-extended right shifts and arbitrary precision math.
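A rough C sketch of what rcr buys you: shifting a double-width value right means carrying the dropped bit across limbs by hand, which rcr does in one instruction per limb.
#include <stdint.h>
/* Shift a 128-bit value held in two 64-bit limbs right by one bit. */
void shr128(uint64_t *hi, uint64_t *lo) {
    *lo = (*lo >> 1) | (*hi << 63);   /* low bit of hi becomes the top bit of lo */
    *hi >>= 1;
}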
At least on x86 processors, most idiomatic bit manipulation expressions like `word & -word` or `word & (word - 1)` compile to a single fast ALU instruction. __builtin_clz and similar are comparatively more expensive on many architectures.
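For example (with BMI1 these map onto the single blsi/blsr instructions):
#include <stdint.h>
uint64_t lowest_set(uint64_t word)   { return word & -word; }        /* isolate lowest set bit */
uint64_t clear_lowest(uint64_t word) { return word & (word - 1); }   /* clear lowest set bit */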
With a DSP core it often gets even better: the bit shift might be free. Say you have the instruction ACC = ACC + P << PM; that could compile down to a single instruction if you're not writing assembly directly.
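In C that step might look like the function below; the names and the separate pm parameter are just illustrative, and whether it all fuses into one MAC instruction depends entirely on the core and the compiler.
#include <stdint.h>
/* Hypothetical fixed-point MAC step: multiply, shift the product, accumulate. */
int64_t mac_step(int64_t acc, int32_t x, int32_t y, unsigned pm) {
    return acc + (((int64_t)x * y) << pm);
}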
Additionally if you're ever working directly with hardware, writing drivers, or firmware, bit-shift operations are often the best (and most common) way to get stuff done.
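Something like this, say, when building a value for a memory-mapped control register (the register layout and names here are invented for the example):
#include <stdint.h>
#define CTRL_ENABLE     (1u << 0)               /* hypothetical: bit 0 enables the block */
#define CTRL_DIV_SHIFT  4                       /* hypothetical: bits 4..7 hold a clock divider */
#define CTRL_DIV_MASK   (0xFu << CTRL_DIV_SHIFT)
uint32_t make_ctrl(uint32_t divider) {
    return CTRL_ENABLE | ((divider << CTRL_DIV_SHIFT) & CTRL_DIV_MASK);
}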
I think the most probable reason for this instruction is for calculating parity bits. This would need to be done fast so it makes sense that there would be a CPU instruction to do most of the work.
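For reference, a parity bit is just the XOR of every bit in the word, so even without a dedicated instruction it folds down in a handful of shifts (GCC/Clang also expose __builtin_parity for this):
#include <stdint.h>
unsigned parity32(uint32_t x) {
    x ^= x >> 16;   /* fold halves together; bit 0 accumulates the XOR of all bits */
    x ^= x >> 8;
    x ^= x >> 4;
    x ^= x >> 2;
    x ^= x >> 1;
    return x & 1u;
}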
Right, unpacking numbers like that is pretty efficient in AVX512.
Still, modern processors have BMI2. For some practical applications, the PEXT instruction is pretty comparable, here’s an example: https://stackoverflow.com/a/72106877/126995
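If you haven't used it: _pext_u64 gathers the bits selected by a mask down into the low end of the result, one instruction on BMI2 hardware. A small sketch (mask chosen arbitrarily; build with -mbmi2 or equivalent):
#include <stdint.h>
#include <immintrin.h>
uint64_t low_nibbles(uint64_t src) {
    const uint64_t mask = 0x0F0F0F0F0F0F0F0Full;  /* take the low nibble of every byte */
    return _pext_u64(src, mask);                  /* pack the selected bits contiguously */
}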
That only applies to adds, subtracts, and register moves. 16-bit Booleans, shifts/rotates, multiplies, and load/store still need to be done with multiple instructions.
It's an FPU instruction with some setup and teardown instructions associated.
I usually bow to Agner Fog on this:
"On Core2 65nm, FSQRT takes 9 to 69 cc's (with almost equal reciprocal throughput), depending on the value and precision bits. For comparison, FDIV takes 9 to 38 cc's (with almost equal reciprocal throughput), FMUL takes 5 (recipthroughput = 2) and FADD takes 3 (recipthroughput = 1). SSE performance is about equal, but looks faster because it can't do 80bit math. SSE has a super fast approximate reciprocal and approximate reciprocal sqrt though.
On Core2 45nm, division and square root got faster; FSQRT takes 6 to 20 cc's, FDIV takes 6 to 21 cc's, FADD and FMUL haven't changed. Once again SSE performance is about the same."
On x86 it's actually mixed: scalar shifts behave as you describe, but vectorised logical shifts flush to zero when the shift amount is greater than or equal to the element width!
So x86 actually has both behaviors in one box (three behaviors if you count the 32-bit and 64-bit scalar things you mentioned separately).
This is an example of where UB for simple operations actually helps even on a single hardware platform: it allows efficient vectorization.
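A tiny demo of the vector side with SSE2 intrinsics: a scalar shl by 32 on a 32-bit register masks the count down to 0 and leaves the value alone, while pslld zeroes every lane.
#include <stdio.h>
#include <emmintrin.h>
int main(void) {
    __m128i v = _mm_set1_epi32((int)0x80000001u);
    __m128i r = _mm_sll_epi32(v, _mm_cvtsi32_si128(32));        /* count >= element width */
    printf("lane 0: 0x%08x\n", (unsigned)_mm_cvtsi128_si32(r)); /* prints 0x00000000 */
    return 0;
}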
It's very impressive that the synthesizer came up with that instruction sequence. The technique used is too slow for general compiler optimization (tens of seconds to an hour for short programs) but useful for specialist problems.
This is worth trying on bit-pushing algorithms such as crypto and compression algorithms. Those use so much CPU time that the effort is justified. It might also be useful for generating code that uses MMX/SSE2/3/4 instructions.
To be fair, the choice of xor eax, eax is 'faster' than mov eax, 0. The xor option is only two bytes versus the five of mov, which can have I$ and decode width concerns. And back in the day, five bytes was pretty awkward for up to 32 bit data buses, where you'd need at least two cycles to even read the mov instruction. That's why existing code tended to use xor eax, eax in the first place, even before Intel started adding rename hardware to their processors.
The only issue is that for RISC, all these instructions are of equal length, so flipping them around would gain you very little, or more likely zero effect unless you are chasing some corner case thing like "XOR instruction value compresses slightly better than ADD because.."
a = v ^ 0x30303030lu // normalize to digits 0xXX0a0b0c
b = (a << 12) | a // combine into 0xXXbacb0c
idx = (b >> 12) & 0xfff // get bac
res = lookup[idx]
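In case it's useful, here's the same idea fleshed out into compilable C; the table contents and names are my guess at the intent (turning three packed ASCII digits into their numeric value via a 4096-entry table):
#include <stdint.h>
#include <stdio.h>
static uint16_t lookup[4096];   /* indexed by the digit nibbles packed as 0xbac */
static void build_lookup(void) {
    for (int a = 0; a <= 9; a++)
        for (int b = 0; b <= 9; b++)
            for (int c = 0; c <= 9; c++)
                lookup[(b << 8) | (a << 4) | c] = (uint16_t)(a * 100 + b * 10 + c);
}
static unsigned decode3(uint32_t v) {   /* v holds ASCII digits "abc" in its low three bytes */
    uint32_t x = v ^ 0x30303030u;       /* normalize to digits: 0xXX0a0b0c */
    uint32_t y = (x << 12) | x;         /* combine into 0xXXbacb0c */
    return lookup[(y >> 12) & 0xfff];   /* index by the 0xbac nibbles */
}
int main(void) {
    build_lookup();
    printf("%u\n", decode3(0x00313233));   /* '1','2','3' packed big-endian -> 123 */
    return 0;
}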