But all of these use a bit shift instead of the (I)DIV instruction. I wasn't saying compilers can't be stupid; my point is that the AI-generated comment explicitly claimed the shift operation was more efficient than division, not that it would result in fewer instructions being emitted.
midpoint_rep_lodsb_handwritten:
    endbr64
    movabsq $0x8000000000000000, %rax
    xorq    %rax, %rsi              # convert two's complement to biased
    xorq    %rax, %rdi
    addq    %rsi, %rdi              # 65-bit sum; top bit lands in CF
    rcrq    $1, %rdi                # rotate CF back in: biased average
    xorq    %rdi, %rax              # convert back; result in %rax
    ret
(sorry if I got the syntax wrong, AT&T is just horrible)
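For anyone who'd rather read it in C, here is a sketch of the same trick; the 65-bit intermediate that rcr absorbs in hardware is emulated here with unsigned __int128, a GCC/Clang extension:

    #include <stdint.h>

    /* XOR with the sign bit maps two's complement onto a biased
       (unsigned) scale; average there, then map back. */
    static inline int64_t midpoint_i64(int64_t a, int64_t b) {
        const uint64_t bias = 0x8000000000000000u;
        uint64_t ua = (uint64_t)a ^ bias;   /* two's complement -> biased */
        uint64_t ub = (uint64_t)b ^ bias;
        uint64_t avg = (uint64_t)(((unsigned __int128)ua + ub) >> 1);
        return (int64_t)(avg ^ bias);       /* biased -> two's complement */
    }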
I sometimes write optimized low-level code, but many of these books and internet articles are rather old. CPU architectures have converged on just two, AMD64 and ARM64. Both have tons of useful instructions like sign extension and integer division.
They also have less trivial instructions equivalent to some of these tricks, only faster. A couple of examples:
To turn off the rightmost 1-bit in a word, on AMD64 there's the BLSR instruction from the BMI1 set.
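As a sketch (assuming GCC/Clang with -mbmi on a BMI1-capable CPU), both the classic idiom and the explicit intrinsic typically compile to a single BLSR:

    #include <stdint.h>
    #include <immintrin.h>

    static inline uint64_t clear_lowest_idiom(uint64_t word) {
        return word & (word - 1);    /* the classic trick */
    }
    static inline uint64_t clear_lowest_blsr(uint64_t word) {
        return _blsr_u64(word);      /* explicit BMI1 intrinsic */
    }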
ARM CPUs have a fast instruction to reverse bits (RBIT). AMD64 does not, but it has a fast instruction to flip the order of bytes (BSWAP), which allows reversing bits faster than what's in that book.
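A hedged sketch of that approach: reverse the bits within each byte using masks and shifts, then let a single byte-order reversal finish the job (__builtin_bswap64 is a GCC/Clang builtin that compiles to BSWAP on x86-64):

    #include <stdint.h>

    static inline uint64_t reverse_bits64(uint64_t x) {
        /* swap adjacent bits, then 2-bit pairs, then nibbles in each byte */
        x = ((x & 0x5555555555555555u) << 1) | ((x >> 1) & 0x5555555555555555u);
        x = ((x & 0x3333333333333333u) << 2) | ((x >> 2) & 0x3333333333333333u);
        x = ((x & 0x0F0F0F0F0F0F0F0Fu) << 4) | ((x >> 4) & 0x0F0F0F0F0F0F0F0Fu);
        return __builtin_bswap64(x);  /* byte reversal finishes the job */
    }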
These tricks are only rarely useful now. For instance, SIMD instruction sets can't divide vectors of integers (a sketch of the workaround follows). Some older GPUs can't divide FP64 numbers, though most of them can multiply FP64, and all of them can divide FP32. For these exotic use cases, the tricks are still relevant.
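Here's a hedged sketch with SSE2 intrinsics: dividing eight uint16 lanes by 19 using the same multiply-by-magic trick scalar compilers use, where 55189 = ceil(2^20 / 19):

    #include <immintrin.h>

    /* Divide eight unsigned 16-bit lanes by 19 without a vector divide:
       multiply-high gives (x * magic) >> 16, then shift right 4 more,
       for a total shift of 20 matching the 2^20 in the magic. */
    static inline __m128i div19_u16(__m128i x) {
        const __m128i magic = _mm_set1_epi16((short)55189);  /* bit pattern is what matters */
        __m128i hi = _mm_mulhi_epu16(x, magic);
        return _mm_srli_epi16(hi, 4);
    }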
Ah, the olden days of the 8086, when xor ax, ax was faster than mov ax, 0, and swapping with xor ax, bx; xor bx, ax; xor ax, bx beat using a temporary register.
Some processors also have ror and rol instructions, which rotate the shifted-out bits back in at the other end. The 8086 also had rotate-through-carry instructions, rcr and rcl, which help implement sign-extended right shifts and arbitrary-precision math.
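Here's a C sketch of the multi-word case, with a variable standing in for the carry flag that a chain of rcr instructions would thread through (limb order is an assumption here: limbs[0] is least significant):

    #include <stdint.h>
    #include <stddef.h>

    /* Arbitrary-precision logical shift right by one bit. */
    static void shift_right_1(uint64_t *limbs, size_t n) {
        uint64_t carry = 0;
        for (size_t i = n; i-- > 0; ) {
            uint64_t out = limbs[i] & 1;                 /* bit shifted out */
            limbs[i] = (limbs[i] >> 1) | (carry << 63);  /* previous carry in */
            carry = out;                                 /* feeds the next word */
        }
    }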
At least on x86 processors, most idiomatic bit manipulation expressions like `word & -word` or `word & (word - 1)` compile to a single fast ALU instruction. __builtin_clz and similar are comparatively more expensive on many architectures.
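A quick sketch of both claims (single-instruction codegen assumes x86-64 with BMI; __builtin_clzll is a GCC/Clang builtin):

    #include <stdint.h>

    static inline uint64_t lowest_set_bit(uint64_t word) {
        return word & -word;                      /* BLSI with -mbmi */
    }
    static inline int leading_zeros(uint64_t word) {
        return word ? __builtin_clzll(word) : 64; /* builtin is UB on 0, hence the guard */
    }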
Ha, the AMD Am29000 had single-step MUL and DIV instructions, that is, they did a single addition/subtraction and shift; to actually divide two numbers you literally wrote a sequence of 32 identical (except for the very first/last one) MUL or DIV instructions: see [0], sections "7.1.6. Integer multiplication" and "7.1.7. Integer division" on pp. 203–207.
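For flavor, here's a C sketch of the shape of that computation: not the Am29000's exact register semantics, but one restoring-division step per iteration, which is what each of those 32 DIV instructions performed:

    #include <stdint.h>

    static uint32_t divstep_divide(uint32_t dividend, uint32_t divisor,
                                   uint32_t *remainder) {
        uint32_t rem = 0, quo = dividend;
        for (int i = 0; i < 32; i++) {       /* one iteration per DIV step */
            rem = (rem << 1) | (quo >> 31);  /* shift next dividend bit into rem */
            quo <<= 1;
            if (rem >= divisor) {            /* conditional subtract */
                rem -= divisor;
                quo |= 1;                    /* set the quotient bit */
            }
        }
        *remainder = rem;
        return quo;
    }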
No compiler (MSVC, gcc, icc) outputs the "bts" instruction for operations like:
bitfield[val / 32] |= 1u << (val % 32);
where bts could perform the divide and modulo implicitly.
Using the intrinsic provided significant performance improvements, and we got still more when the rest of the inner loop was rewritten purely in assembly.
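For illustration, a hedged sketch of both forms; the inline asm assumes GCC/Clang on x86-64, and MSVC exposes the same operation as the _bittestandset intrinsic:

    #include <stdint.h>

    /* Portable version: compilers emit the explicit shift/and/or sequence. */
    static inline void set_bit_c(uint32_t *bitfield, uint32_t val) {
        bitfield[val / 32] |= 1u << (val % 32);
    }

    /* BTS with a memory operand treats it as a bit string: the CPU itself
       splits val into a dword index and a bit index, so no explicit
       divide/modulo is needed. */
    static inline void set_bit_bts(uint32_t *bitfield, uint32_t val) {
        __asm__ volatile("btsl %1, (%0)"
                         : /* no outputs */
                         : "r"(bitfield), "r"(val)
                         : "cc", "memory");
    }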
> * Inefficient instructions are replaced with more efficient instructions. For example, for a simple x % 19, gcc will generate no fewer than 16 instructions instead of a single div/idiv. This is probably still faster, but it may still be detrimental if it's not in a hot path. It should be noted that gcc emits this even at -O0.
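For reference, the emitted sequence computes something like this sketch (unsigned case; the magic constant is ceil(2^68 / 19), and gcc's signed variant differs in detail):

    #include <stdint.h>

    /* x % 19 without a divide: recover the quotient via a multiply-high
       by the reciprocal, then subtract q * 19. Valid for all uint64_t x. */
    static inline uint64_t mod19(uint64_t x) {
        /* q = floor(x / 19): high 64 bits of x * magic, then >> 4. */
        uint64_t q = (uint64_t)(((unsigned __int128)x * 0xD79435E50D79435Fu) >> 64) >> 4;
        return x - q * 19;
    }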
That only applies to adds, subtracts, and register moves. 16-bit Boolean operations, shifts/rotates, multiplies, and loads/stores still need to be done with multiple instructions.
> Same instruction (JALR) used for both calls, returns and register-indirect branches (requires extra decode for branch prediction)
JALR for calls and returns uses the same opcode, but the two cases are still distinguishable: the standard calling convention encodes the difference in the link registers used (rd/rs1 being x1 or x5), which is exactly the hint the spec gives branch predictors, so there is no need for "extra logic" in decode or branch prediction.
The lack of "register + shifted" could easily be circumvented by adding an extension for "complex arithmetic instructions".
And macro-op fusion is a common technique that already exists in modern CPUs: the front end fuses short idiomatic instruction sequences into a single internal operation, so the extra instructions don't cost extra pipeline slots.
> Multiply and divide are part of the same extension
An extension can easily be partially supported in hardware (e.g. multiplication), with the remaining instructions (e.g. division) emulated in software.
> No atomic instructions in the base ISA. Multi-core microcontrollers are increasingly common
But some microcontrollers do not need atomics. And if you are designing a multi-core microcontroller that does, just include the A (atomics) extension.
Many of the criticisms made here are incorrect and stem more from a misunderstanding of RISC-V than from RISC-V design flaws.
Good question. The instruction set is so large and so non-orthogonal that it is difficult to know the "best way" to do things. I am always perplexed by the instruction selections compilers make, but the compiler writers are often pretty clever.
It’s mostly the immediates, I mean, how could I not mention the immediates. The other parts are indeed arranged fairly straightforwardly, but trying to read off a 5-bit register field crossing a byte boundary when the instruction is written in little endian—or even just a three-bit field somewhere in the middle to figure out what the instruction even is—is less than pleasant.
Again, I recognize this makes things simpler, not more difficult, for the hardware, and even a compiler’s emitter will spend very little code dealing with it. But it’s not easy on the eyes.
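For what it's worth, once the instruction word is loaded as a little-endian 32-bit integer, the fixed fields do fall out with plain shift-and-mask; it's reading hex dumps by eye that hurts. A sketch for the R-type fields:

    #include <stdint.h>

    /* Field extraction for a RISC-V R-type instruction word. */
    static inline uint32_t rv_opcode(uint32_t insn) { return insn & 0x7F; }          /* bits 6:0   */
    static inline uint32_t rv_rd(uint32_t insn)     { return (insn >> 7) & 0x1F; }   /* bits 11:7  */
    static inline uint32_t rv_funct3(uint32_t insn) { return (insn >> 12) & 0x7; }   /* bits 14:12 */
    static inline uint32_t rv_rs1(uint32_t insn)    { return (insn >> 15) & 0x1F; }  /* bits 19:15 */
    static inline uint32_t rv_rs2(uint32_t insn)    { return (insn >> 20) & 0x1F; }  /* bits 24:20 */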