
But all of these use a bit shift instead of the (I)DIV instruction. I wasn't saying compilers can't be stupid, but the AI-generated comment explicitly stated that the shift operation was more efficient than division, not that it would result in fewer instructions emitted.

    midpoint_rep_lodsb_handwritten:
        endbr64
        movq   $0x8000000000000000, %rax
        xorq   %rax, %rsi    ;convert two's complement to biased
        xorq   %rax, %rdi
        addq   %rsi, %rdi
        rcrq   $1, %rdi
        xorq   %rdi, %rax    ;convert back
        ret
(sorry if I got the syntax wrong, AT&T is just horrible)
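The same biased-midpoint idea can be written in portable C (function and variable names here are my own, not from the original):

```c
#include <stdint.h>

/* Bias both operands into unsigned space, keep the carry-out as the 65th
   bit of the sum, halve, and un-bias. Rounds toward negative infinity. */
static int64_t midpoint64(int64_t a, int64_t b) {
    const uint64_t bias = UINT64_C(0x8000000000000000);
    uint64_t ua = (uint64_t)a ^ bias;            /* two's complement -> biased */
    uint64_t ub = (uint64_t)b ^ bias;
    uint64_t sum = ua + ub;
    uint64_t carry = sum < ua;                   /* carry-out of the addition */
    uint64_t half = (sum >> 1) | (carry << 63);  /* the rcr by 1 */
    return (int64_t)(half ^ bias);               /* convert back */
}
```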



I sometimes write optimized low-level code, but many of these books and internet articles are rather old. CPU architectures have converged to just two, AMD64 and ARM64. They have tons of useful instructions like sign extension and integer division.

They also have less trivial instructions equivalent to some of these tricks, only faster. A couple of examples:

To turn off the rightmost 1-bit in a word, AMD64 has the BLSR instruction from the BMI1 set.

ARM CPUs have a fast instruction to reverse bits. AMD64 does not, but it has a fast instruction to flip the order of bytes, which allows reversing these bits faster than what's in that book.
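A sketch of that byte-swap approach: swap adjacent bits, bit pairs, and nibbles with masks, then let one byte swap (here the GCC/Clang `__builtin_bswap64` builtin) handle the remaining three steps at once:

```c
#include <stdint.h>

/* Reverse all 64 bits: three mask-and-shift steps within each byte,
   then a single BSWAP to reverse the byte order. */
static uint64_t reverse_bits64(uint64_t x) {
    x = ((x & 0x5555555555555555ULL) << 1) | ((x >> 1) & 0x5555555555555555ULL);
    x = ((x & 0x3333333333333333ULL) << 2) | ((x >> 2) & 0x3333333333333333ULL);
    x = ((x & 0x0F0F0F0F0F0F0F0FULL) << 4) | ((x >> 4) & 0x0F0F0F0F0F0F0F0FULL);
    return __builtin_bswap64(x);
}
```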

These days such tricks are only rarely useful. For instance, SIMD instructions can't divide vectors of integers, and some older GPUs can't divide FP64 numbers (though most of them can multiply FP64, and all of them can divide FP32); for exotic use cases like these, the tricks are still relevant.


It's one XOR, SHL, OR, SHR, and AND; on some archs the shifts might come free with another instruction. I'd expect it to be faster.

    a   = v ^ 0x30303030lu     // normalize to digits 0xXX0a0b0c
    b   = (a << 12) | a        // combine into 0xXXbacb0c
    idx = (b >> 12) & 0xfff    // get 0xbac
    res = lookup[idx]
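As runnable C this could look like the sketch below, assuming the goal is to turn three packed ASCII digits "abc" into a*100 + b*10 + c; the table name and layout are my guesses, not from the original:

```c
#include <stdint.h>

/* Table indexed by the packed nibbles 0xbac. */
static uint16_t lookup[4096];

static void init_lookup(void) {
    for (int a = 0; a < 10; a++)
        for (int b = 0; b < 10; b++)
            for (int c = 0; c < 10; c++)
                lookup[(b << 8) | (a << 4) | c] = (uint16_t)(a * 100 + b * 10 + c);
}

static unsigned digits3(uint32_t v) {
    uint32_t a = v ^ 0x30303030u;      /* normalize to digits 0xXX0a0b0c */
    uint32_t b = (a << 12) | a;        /* combine into 0xXXbacb0c */
    return lookup[(b >> 12) & 0xfff];  /* index is 0xbac */
}
```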


Haha these are really interesting.

It seems that programming with such a reduced instruction set requires one to be either a genius or a computer (i.e., generated code). I'm neither :(


Here's an incredibly minimalistic example:

    int f(double x, int y) {
        return x * y;
    }

    int g(int x, int y) {
        return x * y;
    }
These two functions differ by a single token, yet their assembly translations have only one common instruction, and this instruction is "ret":

    f:
        movapd      xmm1, xmm0
        pxor        xmm0, xmm0
        cvtsi2sd    xmm0, edi
        mulsd       xmm0, xmm1
        cvttsd2si   eax, xmm0
        ret
    g:
        mov         eax, edi
        imul        eax, esi
        ret

Ah, the olden days of 8086. When xor ax, ax was faster than mov ax, 0 and so was xor ax, bx; xor bx, ax; xor ax, bx to swap instead of using a temporary.
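The XOR swap written out, for illustration only: note it zeroes both values if the two pointers alias, and on modern CPUs a plain temporary is at least as fast.

```c
/* Classic 8086-era XOR swap; breaks if a == b (both become zero). */
static void xor_swap(unsigned *a, unsigned *b) {
    *a ^= *b;
    *b ^= *a;
    *a ^= *b;
}
```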

Some processors also have ror and rol instructions that rotate the shifted-out bits back in. The 8086 also had rotate-through-carry instructions, rcr and rcl, which aid in implementing sign-extended right shifts and arbitrary-precision math.
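There is no rotate operator in C, but the idiom below is recognized by gcc, clang, and MSVC, which all emit a single ror for it:

```c
#include <stdint.h>

/* Rotate right by n; masking the shift counts keeps them in range
   (shifting a 32-bit value by 32 is undefined behavior in C). */
static uint32_t rotr32(uint32_t x, unsigned n) {
    n &= 31;
    return (x >> n) | (x << ((32 - n) & 31));
}
```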


At least on x86 processors, most idiomatic bit manipulation expressions like `word & -word` or `word & (word - 1)` compile to a single fast ALU instruction. __builtin_clz and similar are comparatively more expensive on many architectures.
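The two idioms in question, written out; with -mbmi, gcc and clang compile these to single BLSI and BLSR instructions respectively:

```c
#include <stdint.h>

/* Isolate the lowest set bit (BLSI). */
static uint64_t lowest_set_bit(uint64_t w)   { return w & (0 - w); }

/* Clear the lowest set bit (BLSR). */
static uint64_t clear_lowest_bit(uint64_t w) { return w & (w - 1); }
```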

Ha, the AMD Am29000 had single-step MUL and DIV instructions: each performed a single addition/subtraction and shift, so to actually divide two numbers you literally wrote a sequence of 32 identical (except for the very first/last one) MUL or DIV instructions. See [0], sections "7.1.6. Integer multiplication" and "7.1.7. Integer division" on pp. 203–207.

[0] http://www.bitsavers.org/components/amd/Am29000/1987_Am29000...


In other words, two more instructions, instead of zero.

When compilers fail to use clever masking tricks to avoid branches, the result is frequently a 2x slower program.


No compiler (MSVC, gcc, icc) outputs the "bts" instruction for operations like bitfield[val / 32] |= 1 << (val % 32); which could implicitly perform the modulo.

Using the intrinsic provided significant performance improvements, and we got still more when the rest of the inner loop was rewritten purely in assembly.
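The operation in question, written out in plain C; a single BTS with a memory operand would do both the word indexing and the bit-offset modulo at once, but compilers instead emit an explicit shift/or sequence for this:

```c
#include <stdint.h>

/* Set bit `val` in an array of 32-bit words. */
static void set_bit(uint32_t *bitfield, unsigned val) {
    bitfield[val / 32] |= 1u << (val % 32);
}
```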


> (Itanium used 64-bit registers but 128-bit instructions; you packed two ops into an instruction)

Three ops.


> * Inefficient instructions are replaced with more efficient instructions. For example gcc will for a simple x % 19 generate no less than 16 instructions instead of a single div/idiv. This is probably still faster, but it may still be detrimental if it's not in a hot path. It should be noted that gcc emits this even at -O0.

Does it emit it at -Os?
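For reference, the transformation in question is the classic reciprocal multiply, sketched here in C (using the GCC/Clang __uint128_t extension to hold the wide product; the constant derivation is inlined rather than baked in as gcc would do):

```c
#include <stdint.h>

/* Strength-reduced x % 19: a multiply-high replaces the division.
   M and the shift 37 are chosen so q == x / 19 for every 32-bit x. */
static uint32_t mod19(uint32_t x) {
    const uint64_t M = (UINT64_C(1) << 37) / 19 + 1;      /* rounded reciprocal */
    uint32_t q = (uint32_t)(((__uint128_t)x * M) >> 37);  /* q = x / 19 */
    return x - q * 19;                                    /* x % 19 */
}
```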


That only applies to adds, subtracts, and register moves. 16-bit Booleans, shifts/rotates, multiplies, and load/store still need to be done with multiple instructions.

> Same instruction (JALR) used for both calls, returns and register-indirect branches (requires extra decode for branch prediction)

JALR for calls and returns uses the same opcode, but they are two different instructions; there is no need for extra logic in decode or in branch prediction.

The lack of "register + shifted register" addressing could easily be circumvented by adding an extension for complex arithmetic instructions.
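A concrete case of the missing addressing mode is plain array indexing; the C below needs a shift, an add, and a load on base RV64, where x86-64 folds all three into one scaled-index load:

```c
/* RV64 (no reg + shifted-reg addressing):   x86-64:
       slli a1, a1, 3                            mov rax, [rdi + rsi*8]
       add  a0, a0, a1
       ld   a0, 0(a0)                                                  */
static long load_elem(const long *p, long i) {
    return p[i];
}
```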

And macro-op fusion is a common solution that already exists in modern CPUs to increase the number of stages in the pipeline.

> Multiply and divide are part of the same extension

An extension can easily be partially supported in hardware (e.g. multiplication), leaving the other instructions emulated in software (e.g. division).

> No atomic instructions in the base ISA. Multi-core microcontrollers are increasingly common

But some microcontrollers do not need atomics. And if you are designing a microcontroller that does, just include the atomic extension.

Many of the criticisms made here are incorrect and are due more to a misunderstanding of RISC-V than to RISC-V design flaws.


Good question. The instruction set is so large and non-orthogonal that it's difficult to know the "best way" to do things. I'm always perplexed by the instruction selection compilers make, but the compiler writers are often pretty clever.

It’s mostly the immediates, I mean, how could I not mention the immediates. The other parts are indeed arranged fairly straightforwardly, but trying to read off a 5-bit register field crossing a byte boundary when the instruction is written in little endian—or even just a three-bit field somewhere in the middle to figure out what the instruction even is—is less than pleasant.

Again, I recognize this makes things simpler, not more difficult, for the hardware, and even a compiler’s emitter will spend very little code dealing with it. But it’s not easy on the eyes.
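As an example of the pain point, here is what reassembling the S-type (store) immediate looks like: imm[11:5] lives in bits 31:25 of the instruction word and imm[4:0] in bits 11:7, followed by a 12-bit sign extension.

```c
#include <stdint.h>

/* Decode the scattered S-type immediate from a 32-bit RISC-V
   instruction word. The final arithmetic right shift sign-extends
   from 12 bits (two's complement assumed, as on all real targets). */
static int32_t s_type_imm(uint32_t insn) {
    uint32_t imm = ((insn >> 25) << 5) | ((insn >> 7) & 0x1f);
    return (int32_t)(imm << 20) >> 20;
}
```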


x86 doesn't have a dedicated remainder instruction (div/idiv produce quotient and remainder together). Other instruction sets do. But yeah, I agree with your point. Some targets might not even have multiply.

That's a single microcoded instruction -- those are not typically any faster than a bunch of normal instructions.

This isn't the kind of instruction a compiler would emit for regular code.

This is the reason some of the esoteric x86 instructions actually perform slower on newer hardware. They are just there for backwards compatibility.