The efficient hardware types are handled by int_fast*_t. The legacy types can't be redefined outside their established ranges because that would break things that depend on them fitting into a known amount of memory.
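A minimal sketch of that difference; it just prints the widths, and the result for int_fast16_t varies by platform:

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        /* int16_t must be exactly 16 bits; int_fast16_t is whatever width the
         * implementation considers fastest (often 32 or 64 bits on modern CPUs). */
        printf("int16_t:      %zu bytes\n", sizeof(int16_t));
        printf("int_fast16_t: %zu bytes\n", sizeof(int_fast16_t));
        return 0;
    }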
Due to all that legacy, the instruction decoder has far fewer opportunities for optimization.
There may be a physical limit to the size of transistors, but the limit of computing performance is not ‘what Intel is doing’. There’s more to performance than instructions per second or clock frequencies. In fact, these don’t translate to the same performance at all, since ARM is RISC and x64 is CISC.
Fat pointer cost can be mitigated by doing pointer pre-unpacking in hardware. That kind of support fell out of favor when RISC started beating CISC architectures, but if dynamic dispatch became a significant performance issue, I could see it going into certain processors. We have happily returned to an era of processor experimentation.
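For anyone unfamiliar with the term, here is a rough sketch of what a fat pointer for dynamic dispatch can look like in C (the struct and function names are made up for the example, not from any real library):

    #include <stdio.h>

    /* One common shape of "fat pointer": the data pointer travels together with
     * a second word of metadata, here a vtable pointer for dynamic dispatch. */
    struct vtable {
        void (*print)(const void *self);
    };

    struct fat_ptr {
        const void          *data; /* the object itself */
        const struct vtable *vt;   /* metadata carried alongside it */
    };

    static void print_int(const void *self) { printf("%d\n", *(const int *)self); }
    static const struct vtable int_vt = { print_int };

    int main(void) {
        int x = 42;
        struct fat_ptr p = { &x, &int_vt };
        /* Dynamic dispatch: an extra indirection per call, which is the cost
         * the hardware support above would be trying to hide. */
        p.vt->print(p.data);
        return 0;
    }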
Let the hardware do what it's best at: being simple and running fast. Let the interpreter/compiler layer do what it's best at: flexibility.
Yeah, this is pretty much the opposite of what actually works in practice for general-purpose processors though – otherwise we'd all be using VLIW processors.
Lots of speed-sensitive programs also ship multiple implementations that they can choose between at run time, so they can more fully utilize a CPU without recompiling.
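For example, with GCC or Clang on x86 you can pick an implementation once at startup; the function names below are made up for the sketch, and the "AVX2" version is just a placeholder body:

    #include <stddef.h>
    #include <stdio.h>

    static float sum_scalar(const float *a, size_t n) {
        float s = 0.0f;
        for (size_t i = 0; i < n; i++) s += a[i];
        return s;
    }
    static float sum_avx2(const float *a, size_t n) {
        return sum_scalar(a, n); /* stand-in for a real AVX2 implementation */
    }

    typedef float (*sum_fn)(const float *, size_t);

    /* Choose once at run time based on what the CPU reports.
     * __builtin_cpu_supports is a GCC/Clang extension on x86. */
    static sum_fn pick_sum(void) {
        return __builtin_cpu_supports("avx2") ? sum_avx2 : sum_scalar;
    }

    int main(void) {
        float data[4] = {1, 2, 3, 4};
        printf("%f\n", pick_sum()(data, 4));
        return 0;
    }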
I haven't looked into it myself, but it could be that due to all this massaging you'd lose more on throughput than you gain on memory use. It's similar to doing 8 bit quantized stuff on general purpose CPUs. It's very hard to make it any faster than float32 due to all the futzing that has to be done before and after the (rather small) computation. In comparison, no futzing at all is needed for float32: load 4/8/16 things at a time (depending on the ISA), do your stuff, store 4/8/16 things at a time. Simple.
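To illustrate the float32 case, a sketch with AVX intrinsics (x86-only, compile with -mavx; n is assumed to be a multiple of 8 to keep it short):

    #include <stddef.h>
    #include <immintrin.h>

    /* The "no futzing" path: load 8 floats, do the work, store 8 floats. */
    static void scale_f32(float *dst, const float *src, size_t n, float k) {
        __m256 vk = _mm256_set1_ps(k);
        for (size_t i = 0; i < n; i += 8) {
            __m256 v = _mm256_loadu_ps(src + i);   /* load 8 at a time */
            v = _mm256_mul_ps(v, vk);              /* do your stuff */
            _mm256_storeu_ps(dst + i, v);          /* store 8 at a time */
        }
    }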
Backward compatibility. They could microcode all the lesser-used instructions, but the surface area of existing code is very large, and Intel and AMD care more about running existing code faster than about new code.
There is a reason that even the obsolete x87 floating-point stack still runs at near-optimal speed.
Also I don't think it is very expensive to maintain most rare instructions. The cost is primarily in encoding space, but until they support a different ISA (possibly as an alternate mode), they don't have an option.
There is also the "small" advantage that a very complex architecture is hard to implement, validate, and/or emulate, which gives an advantage over the competition.
I once did a stupid test using either an int or an unsigned as the for-loop variable; the performance hit was about 1%. Problem is, modern processors can walk, chew gum, and juggle all at the same time, which tends to negate a lot of simplistic optimizations.
Compiler writers tend to assume the processor is a dumb machine. But modern ones aren't; they do a lot of resource allocation and optimization on the fly, and they do it in hardware, in real time.
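The kind of micro-test described above looks roughly like this (a sketch, not the original benchmark); on an out-of-order core both loops tend to run at essentially the same speed:

    #include <stddef.h>

    long sum_signed(const long *a, int n) {
        long s = 0;
        for (int i = 0; i < n; i++)         /* signed loop counter */
            s += a[i];
        return s;
    }

    long sum_unsigned(const long *a, unsigned n) {
        long s = 0;
        for (unsigned i = 0; i < n; i++)    /* unsigned loop counter */
            s += a[i];
        return s;
    }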
I mean hard to implement efficiently; that's still very possibly true on non-tagged hardware without custom microcode. But maybe not so much that we'd really notice outside of micro-benchmarks and extreme situations.
Unfortunately, fast array bounds checks and fast integer overflow checks are anathema to speculative execution, so they aren't happening any time soon.
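For concreteness, this is roughly what such checks look like in software today; the function is just an illustration, using the GCC/Clang __builtin_add_overflow extension:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Reads a[i], adds x with an overflow check, and writes the sum to *out.
     * Each check adds a branch the CPU has to predict around. */
    bool checked_add_at(const int32_t *a, size_t len, size_t i,
                        int32_t x, int32_t *out) {
        if (i >= len)                               /* array bounds check */
            return false;
        if (__builtin_add_overflow(a[i], x, out))   /* integer overflow check */
            return false;
        return true;
    }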
I'm becoming convinced that we really need to just go back to the Alpha 21164 architecture and stamp out multiple copies with really fast interconnect.
For examples of CPUs that behave differently, look at RISC CPUs. 32-bit PowerPC, for example, would translate a long immediate load into an immediate 'load short into the high word and zero out the low word' plus a signed immediate addition (it would load $DEAE first, then add -$4111 to get $DEADBEEF).
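Roughly the same split written out in C, assuming the usual two's-complement narrowing when forming the low half:

    #include <inttypes.h>
    #include <stdio.h>

    int main(void) {
        /* lis-equivalent: put 0xDEAE into the high half-word */
        uint32_t hi = (uint32_t)0xDEAE << 16;          /* 0xDEAE0000 */
        /* addi-equivalent: add the sign-extended low half (0xBEEF == -0x4111) */
        int16_t  lo = (int16_t)0xBEEF;                 /* -0x4111 */
        uint32_t value = hi + (uint32_t)(int32_t)lo;   /* 0xDEADBEEF */
        printf("0x%08" PRIX32 "\n", value);
        return 0;
    }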
The list of problems is way longer, by the way. This code makes assumptions about pointer size (I don't think it will run on x64 with common ABIs).
There also is no guarantee that function pointers point to the memory where the function's code can be found (there could be a trampoline in-between, or a devious compiler writer could encrypt function addresses).
Neither is there a guarantee that functions defined sequentially get addresses that are laid out sequentially (there is no portable way to figure out the size of a function in bytes).
Finally, I don't think there is a guarantee that one can read a function's code (its memory could be marked 'execute only').
I guess those more familiar with the C standard will find more portability issues.
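A sketch of the kind of assumption being criticized (not the code under discussion); every line of it "works" on some platforms and is guaranteed by nothing in the standard:

    #include <stdint.h>
    #include <stdio.h>

    static void f(void) { /* ... */ }
    static void g(void) { /* ... */ }

    int main(void) {
        /* Assumes g is laid out immediately after f, that a function pointer
         * points at the code itself, and that converting a function pointer
         * to an integer is meaningful at all. */
        uintptr_t size_guess = (uintptr_t)&g - (uintptr_t)&f;
        printf("guessed size of f: %lu bytes\n", (unsigned long)size_guess);

        /* Also assumes the code bytes are readable, which they may not be
         * (execute-only memory). */
        const unsigned char *code = (const unsigned char *)(uintptr_t)&f;
        printf("first byte of f: 0x%02X\n", code[0]);
        return 0;
    }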
The way it's achieved may not matter much with a 4 GHz multi-core CPU running a multitasking OS, but having to deal with 16-bit pointers and segmented memory in a 4.77 MHz 8086/8 was a huge pain I felt in the flesh.
I feel like there's an underlying assumption/reality here that helps us: Hardware is designed for programs that are maximally efficient but also safe and correct. No one ever designs a new CPU instruction that's like "this multiplies two numbers super fast, but only if you're willing to accept undefined behavior 0.1% of the time." They only design hardware instructions that are possible to use in a safe, correct program. And so it's possible for a programming language to make progress on safety and correctness without necessarily compromising performance. The "laws of physics" as expressed in the way we build hardware allow for such a thing. That said, I wonder if other folks know stories about hardware designs that were fundamentally incorrect, and what happened to them?
That might be another reason why it is slow -- it is an old opcode that has to be supported for compatibility, but isn't prioritized in any of the pipelines in newer chip designs.
They have many opcodes that bloat the instruction set and need to be broken apart to fit into a superscalar design. This is overhead that takes up silicon space. It is especially painful with multi-core designs, since each core needs this.
They may have memory models that make guarantees which add little real security or which impede optimisation.
There is something about condition codes vs explicit checks that made speculative execution difficult. I don't recall the details about this one.
Unfortunately, you lose that advantage. In days of yore, people knew the implementation details of their hardware and would optimize around them. With microcode and the sheer size of certain instruction sets, it is very difficult to do the same thing, particularly now that the philosophy of programming/computing has shifted to accommodate the industry as an omnipresent actor in your execution environment, and furthermore one with even more rights to observable machine state than you, the owner of the damn thing.
Your point on inefficient code stands, but there is far more at play than mere "programmers aren't skilled enough" and "imagine the possibilities!"
I'll wager a true Mel would not be an easy thing to come by nowadays.