Not really; there's no rounding problem with x << 1. x * 2, x+x, x<<1 are all equivalent. The real reason it doesn't use the shift instruction on x86 is because it takes 3 bytes to encode the instruction whereas adding a register to itself takes two. But like it says, for powerpc it does use a shift as there's no difference in instruction lengths on a RISC.
random aside: gcc usually uses the LEA instruction for this sort of thing, which lets you compute any combination of {1,2,4,8} * x + y + n in one cycle where x and y are registers and n is a constant number. So even with the LEA instruction you can use either lea eax, [eax*2] or lea eax, [eax+eax] to compute this. And so on. I think add eax, eax is most likely what it does though.
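A tiny illustration: multiply by a small constant is exactly where LEA shines, since a plain add can't do it in one go. With optimization on, gcc typically turns this into a single leal (%rdi,%rdi,4), %eax:
int times5(int x) {
    return x * 5;   /* 5*x = x + 4*x, which fits LEA's base + index*scale form */
}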
But all of these use a bit shift instead of the (I)DIV instruction. I wasn't saying compilers can't be stupid, but the AI-generated comment explicitly stated that the shift operation was more efficient than division, not that it would result in fewer instructions emitted.
midpoint_rep_lodsb_handwritten:
endbr64
movabsq $0x8000000000000000, %rax
xorq %rax, %rsi # convert two's complement to biased
xorq %rax, %rdi
addq %rsi, %rdi
rcrq $1, %rdi
xorq %rdi, %rax # convert back
ret
(sorry if I got the syntax wrong, AT&T is just horrible)
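Here's roughly the same trick in C for anyone allergic to the syntax (function name is mine): bias both operands so the signed range maps onto the unsigned range, add, halve while keeping the carried-out bit, then un-bias.
#include <stdint.h>
int64_t midpoint(int64_t a, int64_t b) {
    const uint64_t bias = 0x8000000000000000ull;
    uint64_t ua = (uint64_t)a ^ bias;          /* two's complement -> biased */
    uint64_t ub = (uint64_t)b ^ bias;
    uint64_t sum = ua + ub;
    uint64_t carry = sum < ua;                 /* the bit rcr rotates back in */
    uint64_t mid = (sum >> 1) | (carry << 63); /* the 65-bit sum, halved */
    return (int64_t)(mid ^ bias);              /* biased -> two's complement */
}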
Ah, the olden days of 8086. When xor ax, ax was faster than mov ax, 0 and so was xor ax, bx; xor bx, ax; xor ax, bx to swap instead of using a temporary.
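(In C the register-free swap is the classic three-XOR dance; just don't hand it two pointers to the same object, or it zeroes it out.)
void xor_swap(unsigned *a, unsigned *b) {
    *a ^= *b;   /* a = a^b */
    *b ^= *a;   /* b = b^(a^b) = old a */
    *a ^= *b;   /* a = (a^b)^(old a) = old b */
}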
Some processors also have ror and rol, rotate instructions that feed the shifted-out bits back in at the other end. The 8086 also had rotate-through-carry, rcr and rcl, which helps when implementing sign-extended right shifts and arbitrary precision math.
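A rough C sketch of what rcr buys you: shifting a double-width value right means carrying the dropped bit across limbs by hand, which rcr does in one instruction per limb.
#include <stdint.h>
/* Shift a 128-bit value held in two 64-bit limbs right by one bit. */
void shr128(uint64_t *hi, uint64_t *lo) {
    *lo = (*lo >> 1) | (*hi << 63);   /* low bit of hi becomes the top bit of lo */
    *hi >>= 1;
}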
At least on x86 processors, most idiomatic bit manipulation expressions like `word & -word` or `word & (word - 1)` compile to a single fast ALU instruction. __builtin_clz and similar are comparatively more expensive on many architectures.
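For example (with BMI1 these map onto the single blsi/blsr instructions):
#include <stdint.h>
uint64_t lowest_set(uint64_t word)   { return word & -word; }        /* isolate lowest set bit */
uint64_t clear_lowest(uint64_t word) { return word & (word - 1); }   /* clear lowest set bit */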
With a DSP core it often gets even better: the bit shift might be free. Say you have the instruction ACC = ACC + P << PM; that could compile down to a single instruction if you're not writing assembly directly.
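In C that step might look like the function below; the names and the separate pm parameter are just illustrative, and whether it all fuses into one MAC instruction depends entirely on the core and the compiler.
#include <stdint.h>
/* Hypothetical fixed-point MAC step: multiply, shift the product, accumulate. */
int64_t mac_step(int64_t acc, int32_t x, int32_t y, unsigned pm) {
    return acc + (((int64_t)x * y) << pm);
}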
Additionally if you're ever working directly with hardware, writing drivers, or firmware, bit-shift operations are often the best (and most common) way to get stuff done.
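Something like this, say, when building a value for a memory-mapped control register (the register layout and names here are invented for the example):
#include <stdint.h>
#define CTRL_ENABLE     (1u << 0)               /* hypothetical: bit 0 enables the block */
#define CTRL_DIV_SHIFT  4                       /* hypothetical: bits 4..7 hold a clock divider */
#define CTRL_DIV_MASK   (0xFu << CTRL_DIV_SHIFT)
uint32_t make_ctrl(uint32_t divider) {
    return CTRL_ENABLE | ((divider << CTRL_DIV_SHIFT) & CTRL_DIV_MASK);
}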
I think the most probable reason for this instruction is for calculating parity bits. This would need to be done fast so it makes sense that there would be a CPU instruction to do most of the work.
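For reference, a parity bit is just the XOR of every bit in the word, so even without a dedicated instruction it folds down in a handful of shifts (GCC/Clang also expose __builtin_parity for this):
#include <stdint.h>
unsigned parity32(uint32_t x) {
    x ^= x >> 16;   /* fold halves together; bit 0 accumulates the XOR of all bits */
    x ^= x >> 8;
    x ^= x >> 4;
    x ^= x >> 2;
    x ^= x >> 1;
    return x & 1u;
}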
Right, unpacking numbers like that is pretty efficient in AVX512.
Still, modern processors have BMI2. For some practical applications, the PEXT instruction is pretty comparable, here’s an example: https://stackoverflow.com/a/72106877/126995
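If you haven't used it: _pext_u64 gathers the bits selected by a mask down into the low end of the result, one instruction on BMI2 hardware. A small sketch (mask chosen arbitrarily; build with -mbmi2 or equivalent):
#include <stdint.h>
#include <immintrin.h>
uint64_t low_nibbles(uint64_t src) {
    const uint64_t mask = 0x0F0F0F0F0F0F0F0Full;  /* take the low nibble of every byte */
    return _pext_u64(src, mask);                  /* pack the selected bits contiguously */
}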
That only applies to adds, subtracts, and register moves. 16-bit Booleans, shifts/rotates, multiplies, and load/store still need to be done with multiple instructions.
It's an FPU instruction with some setup and teardown instructions associated.
I usually bow to Agner Fog on this:
"On Core2 65nm, FSQRT takes 9 to 69 cc's (with almost equal reciprocal throughput), depending on the value and precision bits. For comparison, FDIV takes 9 to 38 cc's (with almost equal reciprocal throughput), FMUL takes 5 (recipthroughput = 2) and FADD takes 3 (recipthroughput = 1). SSE performance is about equal, but looks faster because it can't do 80bit math. SSE has a super fast approximate reciprocal and approximate reciprocal sqrt though.
On Core2 45nm, division and square root got faster; FSQRT takes 6 to 20 cc's, FDIV takes 6 to 21 cc's, FADD and FMUL haven't changed. Once again SSE performance is about the same."
On x86 it's actually mixed: scalar shifts behave as you describe, but vectorised logical shifts flush to zero when the shift amount is greater than or equal to the element width!
So x86 actually has both behaviors in one box (three behaviors if you count the 32-bit and 64-bit scalar things you mentioned separately).
This is an example of where UB for simple operations actually helps even on a single hardware platform: it allows efficient vectorization.
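A tiny demo of the vector side with SSE2 intrinsics: a scalar shl by 32 on a 32-bit register masks the count down to 0 and leaves the value alone, while pslld zeroes every lane.
#include <stdio.h>
#include <emmintrin.h>
int main(void) {
    __m128i v = _mm_set1_epi32((int)0x80000001u);
    __m128i r = _mm_sll_epi32(v, _mm_cvtsi32_si128(32));        /* count >= element width */
    printf("lane 0: 0x%08x\n", (unsigned)_mm_cvtsi128_si32(r)); /* prints 0x00000000 */
    return 0;
}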
It's very impressive that the synthesizer came up with that instruction sequence. The technique used is too slow for general compiler optimization (tens of seconds to an hour for short programs) but useful for specialist problems.
This is worth trying on bit-pushing algorithms such as crypto and compression algorithms. Those use so much CPU time that the effort is justified. It might also be useful for generating code that uses MMX/SSE2/3/4 instructions.
To be fair, the choice of xor eax, eax is 'faster' than mov eax, 0. The xor option is only two bytes versus the five of mov, which can have I$ and decode width concerns. And back in the day, five bytes was pretty awkward for up to 32 bit data buses, where you'd need at least two cycles to even read the mov instruction. That's why existing code tended to use xor eax, eax in the first place, even before Intel started adding rename hardware to their processors.
The only issue is that for RISC, all these instructions are of equal length, so flipping them around would gain you very little, or more likely zero effect unless you are chasing some corner case thing like "XOR instruction value compresses slightly better than ADD because.."
a = v ^ 0x30303030lu // normalize to digits 0xXX0a0b0c
b = (a << 12) | a // combine into 0xXXbacb0c
idx = (b >> 12) & 0xfff // get bac
res = lookup[idx]
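In case it's useful, here's the same idea fleshed out into compilable C; the table contents and names are my guess at the intent (turning three packed ASCII digits into their numeric value via a 4096-entry table):
#include <stdint.h>
#include <stdio.h>
static uint16_t lookup[4096];   /* indexed by the digit nibbles packed as 0xbac */
static void build_lookup(void) {
    for (int a = 0; a <= 9; a++)
        for (int b = 0; b <= 9; b++)
            for (int c = 0; c <= 9; c++)
                lookup[(b << 8) | (a << 4) | c] = (uint16_t)(a * 100 + b * 10 + c);
}
static unsigned decode3(uint32_t v) {   /* v holds ASCII digits "abc" in its low three bytes */
    uint32_t x = v ^ 0x30303030u;       /* normalize to digits: 0xXX0a0b0c */
    uint32_t y = (x << 12) | x;         /* combine into 0xXXbacb0c */
    return lookup[(y >> 12) & 0xfff];   /* index by the 0xbac nibbles */
}
int main(void) {
    build_lookup();
    printf("%u\n", decode3(0x00313233));   /* '1','2','3' packed big-endian -> 123 */
    return 0;
}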