Because the unit used to measure it is FLOPS [0], and most societies settled on base-10 as their numerical system of choice thousands of years ago. Furthermore, it is not possible to predict the exact number of cycles you'll need, since that depends on the architecture, the compiler, and many other factors. This is why FLOPS are the unit of choice.
Each jump in FLOPS by a factor of 10^3 poses significant problems in concurrency, IO, parallelism, storage, and existing compute infrastructure.
Managing racks is difficult, managing concurrent workloads is difficult, managing I/O and storage is difficult, designing compute infra like FPGAs/GPUs/CPUs for this is difficult, etc.
Not to take away from your point, but I'd argue that counting cycles is usually misleading even for small embedded systems now. It's very difficult to build a system where cycles aren't equally squishy these days.
Given that the point of the FLOPS unit is to compare processors, it does make more sense to count complex instructions as more than a single floating-point operation. If one CPU can multiply a 4x4 matrix by a vector in a single instruction that runs a million times per second, while another CPU needs ~32 instructions and so multiplies only 500k matrices per second yet retires 16 million instructions in that same second, it would be silly to compare instructions instead of multiplications and additions.
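To make that arithmetic concrete, here is a minimal sketch (the hypothetical `mat4_vec4` helper and the FLOP-counting convention are mine; the instruction counts are the ones from the paragraph above):

```c
/* A 4x4 matrix times a 4-vector: 16 multiplies and 16 adds as written,
   i.e. ~32 scalar floating-point operations per call. */
void mat4_vec4(const float m[4][4], const float v[4], float out[4]) {
    for (int i = 0; i < 4; i++) {
        float acc = 0.0f;
        for (int j = 0; j < 4; j++)
            acc += m[i][j] * v[j];  /* one multiply + one add per iteration */
        out[i] = acc;
    }
}
/* CPU A: 1,000,000 calls/s in one instruction each  -> ~32 MFLOPS on  1 MIPS.
   CPU B:   500,000 calls/s in ~32 instructions each -> ~16 MFLOPS on 16 MIPS.
   The FLOPS ratio (2:1) tracks the useful work done; the MIPS ratio (1:16) does not. */
```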
* MIPS was perhaps the integer equivalent of FLOPS; it is still used for modern microcontrollers because the 8051 at 12 MHz would only execute 1 MIPS (12 clocks per instruction). Modern 8051 chips have obviously sped up to 1 clock per instruction, but MIPS (and Dhrystone MIPS in particular) is still a common benchmark today.
* FLOPS is very difficult to calculate in theory because modern CPUs have vector units and multiple pipelines per core. You could have 3x AVX-512 instructions in flight in parallel on a single core of today's CPUs.
* FLOPS were traditionally 64-bit operations for the supercomputer community. Today, most flops are 32-bit, for video games. Finally, the deep learning / neural net guys have popularized 16-bit flops, and even 8-bit iops.
* 'The' flop is a misnomer because it's almost always the multiply-and-accumulate instruction: X = A + B * C. Which... is two operations per instruction (per shader/SIMD lane); see the sketch after this list. Eeehhh whatever. Who cares about these details?
* As 'Dhrystone' is the benchmark for MIPS, the benchmark for 64-bit flops is Linpack.
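As a concrete illustration of that multiply-accumulate counting convention, here is a minimal sketch using a real AVX2 FMA intrinsic (the `fma8` wrapper name is hypothetical, and it assumes an x86 CPU with FMA3 support; the FLOP count in the comment is the conventional vendor accounting):

```c
#include <immintrin.h>

/* Computes a * b + c across 8 single-precision lanes in one instruction.
   Conventionally counted as 8 lanes * 2 ops (multiply + add) = 16 FLOPs,
   even though only a single instruction is issued and retired. */
__m256 fma8(__m256 a, __m256 b, __m256 c) {
    return _mm256_fmadd_ps(a, b, c);
}
```

Multiply that per-instruction count by the number of FMA pipes per core, the core count, and the clock, and you get the peak figures that marketing quotes.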
Where F is the number of FPU front ends (SIMD and scalar). This is wrong because scalar math is often slower than SIMD, and compute kernels rarely run on the scalar pipeline.
Where Hz is, well, the clock rate, i.e. cycles per second (invert it to get seconds per cycle). This is wrong because stalls happen: memory transfers, cache misses, etc. It is also wrong because the clock rate is throttled and you are not always at the maximum boost clock.
Then multiply by 2 for FMA (fused multiply-add). This is wrong because, well, not every operation is a one-cycle FMA; division can take many cycles (>100). Also, scalar pipelines don't have FMA.
Ultimately all vendors use the same crappy calculation, so we are comparing apples to apples. Just rotten apples to rotten apples. It gives you an idealized best case you can optimize towards but never actually attain.
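For reference, a minimal sketch of that vendor-style calculation as I understand it (every number below, the port count, vector width, clock, and core count, is an illustrative assumption, not any particular chip):

```c
#include <stdio.h>

/* Back-of-envelope "theoretical peak" FLOPS in the spirit of the formula above. */
int main(void) {
    double cores      = 8;
    double fpu_pipes  = 2;      /* F: SIMD FMA issue ports per core         */
    double simd_lanes = 8;      /* FP32 lanes in a 256-bit vector register  */
    double fma_factor = 2;      /* multiply + add counted as two operations */
    double clock_hz   = 3.0e9;  /* boost clock, which you will not sustain  */

    double peak = cores * fpu_pipes * simd_lanes * fma_factor * clock_hz;
    printf("theoretical peak: %.0f GFLOPS\n", peak / 1e9);  /* 768 GFLOPS */
    return 0;
}
```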
Somewhere along the way, we seem to have forgotten that we have processors that can do 3 billion mathematical operations per second.
I'm currently trying to shame some people at work into acting on the fact that they wrote some code that should take about 1 µs per call but is instead taking over 100 µs. If you're stupid with cycles, then you get to be stupid with cores too, and then you get to be stupid with locking semantics.
As far as I can tell, the cycle count is intended to model the cost of cycles against your 100 kHz budget. I'm not sure it's worth the complexity.
Given that Notch seems to want to continuously operate one or more virtual CPUs for every user, if it were me, I'd favor making the execution model as simple and cheap as possible.
It's one instruction per cycle that gets 8 flops. And what are you even arguing? Assuming it's unthrottled FP32, that leaves a quad-core A15 at 2 GHz and ~7 watts over 5x less efficient.
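If I'm reading the numbers right, the back-of-envelope version is roughly this (the per-core flop rate, core count, clock, and wattage are the ones quoted above; the comparison baseline isn't spelled out here):

```c
#include <stdio.h>

/* Quad-core A15 as quoted: 8 FP32 flops per cycle per core at 2 GHz, ~7 W. */
int main(void) {
    double gflops = 4 * 8 * 2.0;  /* cores * flops/cycle * GHz */
    printf("%.0f GFLOPS peak, %.1f GFLOPS/W\n", gflops, gflops / 7.0);  /* 64, 9.1 */
    return 0;
}
```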
> x86 isn't RISC and not all operations take the same number of cycles.
Yes, but it doesn't say instructions per hash, it says cycles per hash. Unless some kind of frequency scaling is going on, the number of cycles per second should be very consistent. If bytes/hash is held constant and cycles/second is constant, then MiB/second and cycles/hash should be exactly inversely proportional. I don't understand why this is not the case in these tables.
> Also, the code might have different levels of possible parallelism and might impact the pipeline differently.
Again, this can affect the number of instructions being retired per cycle, but not the number of cycles per second. A cycle is a cycle, regardless of how much work is actually being accomplished in it.
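To spell out the inverse relationship being assumed, a small sketch (the block size, clock, and cycle count below are made-up illustrative numbers, not values from the tables in question):

```c
#include <stdio.h>

/* throughput = bytes/hash * cycles/second / cycles/hash */
int main(void) {
    double bytes_per_hash  = 64;     /* fixed input size     */
    double cycles_per_sec  = 3.0e9;  /* fixed clock          */
    double cycles_per_hash = 100;    /* what the table lists */

    double mib_per_sec = bytes_per_hash * cycles_per_sec
                         / cycles_per_hash / (1024.0 * 1024.0);
    printf("%.0f MiB/s\n", mib_per_sec);  /* ~1831 MiB/s; halving cycles/hash
                                             should exactly double this      */
    return 0;
}
```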
> Comp Sci education for the low level basics is often completely neglected nowadays. This should be freshman year stuff.
I'm a low-level junkie who lives in godbolt.org and Agner Fog's tables, writes JIT compilers, and does FPGA design for fun on the side. It's possible that I'm mistaken here, but I do have a fair amount of background in this.
> These 2 FLOPs/cycle are for SSE i.e. 4-wide vector math without any conditions or branches.
Using double precision SSE scalar ops, the Pentium III could execute one addition and one multiplication per cycle. (The throughput was the same with vector math because the Pentium III only had 64-bit SSE units. So a 128-bit packed multiply and a 128-bit packed add, four double precision operations, executed over two cycles.)
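The kind of code that approaches that one-add-plus-one-multiply-per-cycle figure looks roughly like this: independent work for the add pipe and the multiply pipe (an illustrative sketch; in practice you need several independent accumulators per pipe to cover instruction latency):

```c
/* Two independent dependency chains: a CPU with separate FP add and FP multiply
   pipes can overlap them, approaching 2 FLOPs/cycle even in scalar code. */
float add_mul_chains(int n) {
    float s = 0.0f, p = 1.0f;
    for (int i = 0; i < n; i++) {
        s += 1.000001f;  /* feeds the add pipe      */
        p *= 1.000001f;  /* feeds the multiply pipe */
    }
    return s + p;
}
```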
CPU makers add so many complicated features for optimization that it takes more knowledge than mere mortals have to actually use them. That you have to fall back on heuristics and measurement, as is usually and wisely advised, is a bit unsatisfying. That's sort of ironic, in a way.
I used to do assembly and count cycles, but now I wouldn't dare; it's hardcore stuff for compiler and library makers. It's like "don't do your own crypto (optimization)".
Everyone knows why it is so, though: we cannot solve the problem by throwing more gigahertz at it.
There's probably enough jitter in the memory system and in instruction parallelism that accurate cycle counting will still be challenging.
Also, you probably want some padding so that newer versions of the CPU can be used without too much worry. It's possible for cycle counts of some routines to increase, depending on how new chips implement things under the hood.
[says a guy who was counting cycles, in the 1980s :-)]
512-bit registers fit 8 doubles or 16 floats, and it's not as simple as 16 flops per cycle. They are probably counting fused multiply-adds, which are usually the highest flops-per-cycle instruction.
If an FMA can be done in one cycle, then we have 18 x 32 = 576 flops per cycle, so if it's clocked at e.g. 2.0 GHz, peak performance would be ~1.1 TFLOPS.
Edit: I see someone wrote exactly 8 hours before..
I am pretty sure they are talking about per-cycle performance, since they can do 33 operations per cycle. IIRC the peak performance of an Intel chip at the moment is 6 FLOPs per 2 cycles (or thereabouts).
Of course this is beyond ridiculous, since a 780 Ti can pull off 5 TFLOP/s at a little under a 1 GHz clock; 5,000 FLOPs per cycle is a little more than 33.
It seems like an interesting design, but comparing performance against what an x64 chip can do is a bit silly; you can't just pick numbers at random and call that the overall improvement.
>In other words, if a CPU runs at 1 MHz and has two units but where only one can run `fmul`, should you count the CPU as having one or two MFLOPS?
I don't think there's enough information to answer that question. For starters, even if the CPU was issued a long sequence of nothing but additions, it's possible that it might spend some fraction of that million cycles waiting on memory.
I don't think FLOPS are counted from first principles; I think they're measured empirically using benchmarks. It's possible one benchmark will yield 1.25 MFLOPS and another 1.8 MFLOPS. Split the difference and call it a day.
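A minimal sketch of what "measured empirically" means here: time a kernel with a known operation count and divide (the kernel and the 2-FLOPs-per-element convention are my own illustrative choices; real benchmarks like Linpack are far more careful about warm-up, vectorization, and memory effects):

```c
#include <stdio.h>
#include <time.h>

int main(void) {
    enum { N = 1 << 20, REPS = 100 };
    static float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = 1.0f; b[i] = 2.0f; c[i] = 3.0f; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < REPS; r++)
        for (int i = 0; i < N; i++)
            c[i] = a[i] * b[i] + c[i];  /* counted as 2 FLOPs per element */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs  = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double flops = 2.0 * (double)N * REPS;
    printf("%.1f MFLOPS (checksum %f)\n", flops / secs / 1e6, (double)c[0]);
    return 0;
}
```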
FLOPS is a measure of throughput. The timing difference between an `fmul` and an `fmadd` is often in latency.
A modern processor is typically superscalar, with multiple individually pipelined floating point units that can each be issued one instruction per cycle, regardless of whether that is an `fmadd`, an `fadd`, or something else.
If you issue an `fmadd` to a unit, you are prevented from issuing an `fadd` to it in that same cycle, and vice versa. That's one op, not two.
However, floating point units are sometimes heterogeneous. I think the question should rather be: should you count the throughput of the most favorable ops, or of arbitrary ops?
In other words, if a CPU runs at 1 MHz and has two units but where only one can run `fmul`, should you count the CPU as having one or two MFLOPS?
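One way to see why the answer depends on the instruction mix, using the hypothetical 1 MHz, two-unit machine from the question itself (this is just that scenario written out, not a real CPU):

```c
#include <stdio.h>

/* Hypothetical: 1 MHz clock, two FP units, only one of which can multiply. */
int main(void) {
    double hz = 1.0e6;
    printf("add-only stream:      %.1f MFLOPS\n", 2 * hz / 1e6);  /* both units usable   */
    printf("multiply-only stream: %.1f MFLOPS\n", 1 * hz / 1e6);  /* one unit usable     */
    printf("mixed stream:         %.1f MFLOPS\n", 2 * hz / 1e6);  /* 1 mul + 1 add/cycle */
    return 0;
}
```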
[0] - https://en.m.wikipedia.org/wiki/FLOPS