Compiler engineer here. In practice, compilers for higher-level languages often have a lot of difficulty getting anywhere close to the efficiency of comparable C code. If you take Python, for example, you have to do a lot of inlining to eliminate various abstractions. Inlining is actually non-trivial. Yes, inlining, by itself, is an easy program transformation, but knowing where to inline to get the best performance is very hard. If you inline too much, you increase code size, and you lose performance. You have to know precisely which inlining decisions will pay for themselves, and your large codebase might have tens of thousands of call sites and a call hierarchy 40 functions deep. Python also adds the fun little problem that you can redefine any function at run time, which makes it hard for the compiler to know which function you're actually going to be calling. To complicate things further, inlining decisions affect other optimizations. For example, if you inline foo into bar, do you then also inline into bar the functions that foo is calling? Do you unroll the loops from foo into bar? Etc.
Also, there's an aspect that I feel is constantly overlooked, and this is the way that objects are laid out in memory. In Python, JavaScript, Ruby, you have a lot of pointers to objects. You get a lot of pointer-chasing as a result. This is BAD for caches. Each object you touch, each pointer you dereference means pulling in a new cache line. In C, you can design very compact and flat data structures. You can use 32-bit floats, 8-bit, or even one-bit integers if you want. You can have a struct within a struct, with an array inside of it, all without any pointer-chasing.
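For illustration, a minimal sketch of the kind of flat layout C makes possible (struct names invented for the example):

    #include <stdint.h>

    /* Flat, cache-friendly layout: every field lives inline in one
       allocation, so iterating over 'points' walks contiguous memory. */
    struct point {
        float    x, y, z;    /* 32-bit floats */
        uint8_t  flags;      /* 8-bit integer */
        unsigned visible:1;  /* even a one-bit field */
    };

    struct mesh {
        struct point points[1024];  /* array nested inside a struct, */
        uint32_t     count;         /* reachable without a single dereference */
    };

In Python or JavaScript, each point would typically be a separate heap object reached through a pointer, each one potentially pulling in its own cache line.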
Modern CPUs are constrained by memory bandwidth, and it's very hard for any programming language to beat C on achievable memory efficiency. What's worse is that we have little to no academic compiler literature on automatically optimizing for memory efficient data layouts. It's an overlooked and understudied problem. You would have to prove that integer values lie within certain ranges (hard) and also do object inlining (collapsing objects into parents) which AFAIK is also hard and not done by any mainstream compiler.
So, yeah, keep thinking that a sufficiently-smart compiler will do everything for you. You will assuredly be very disappointed. Until we have strong AI, the sufficiently smart compiler is basically unobtainium. If you want efficiency, the best route is generally to have fewer layers of abstraction, or to rely only on compiler optimizations you know for certain will happen.
Inlining is often called the mother of all optimizations. On its own it doesn't gain you more than any other single optimization, but once it's done it enables much, much more.
There's a reason languages like C that default to static dispatch on value types are so much faster without much effort.
I'm glad this is being discussed, because for many years inlining functions was considered a performance panacea in C++, with no regard to the size of the object code that was being generated, or the effects inlining was having on instruction cache performance.
The C++ community has brought upon itself all kinds of complexity and long compile times, all in the name of performance which, in my mind, was always pretty suspect.
Force-inlining code is actually still quite a useful optimisation. We have found many instances where, after a few levels of inlining, the compiler gives up and just calls the function instead. Force-inlining certain low-level functions has actually improved performance by several percent in our codebase.
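For reference, the usual spellings (a sketch; 'rotl32' is just an invented example of a hot low-level helper, and whether forcing it pays off is something you'd confirm by profiling):

    #include <stdint.h>

    /* GCC/Clang use an attribute, MSVC a keyword. */
    #ifdef _MSC_VER
    #define FORCE_INLINE __forceinline
    #else
    #define FORCE_INLINE __attribute__((always_inline)) inline
    #endif

    static FORCE_INLINE uint32_t rotl32(uint32_t x, unsigned r)
    {
        /* '& 31' keeps both shift counts in range (avoids UB at r == 0) */
        return (x << (r & 31)) | (x >> (-r & 31));
    }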
To add to your good comment, doing inlining intelligently can get tricky in languages that use a lot of polymorphic functions, like OCaml.
To squeeze out performance, you have to detect when polymorphic functions are actually used on int, and decide whether you want to compile and store a special-cased version that avoids the boxing-unboxing shuffle. There is a balance to strike between compile time, program size, and the speed of the running code, which is fairly hard to get right.
No it's not. Except if you __force_inline__ everything, of course.
Inlining reduces the number of instructions in a lot of cases, especially when things are abstracted and factored, with lots of indirection, into small functions that call other small functions and so on. Consider an 'isEmpty' function, which dissolves to one CPU instruction once inlined, compared with a call/save-registers/compare/return sequence. Highly dynamic code (with most functions being virtual) tends to result in a fest of chained calls, jumping into functions that do very little work. Yes, the stack is usually hot and fast, but spending 80% of the instructions on stack management is still a big waste.
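A minimal sketch of that case (names invented for the example):

    #include <stddef.h>

    struct list {
        size_t len;
        /* ... */
    };

    static inline int is_empty(const struct list *l)
    {
        return l->len == 0;
    }

At an optimized call site, is_empty(l) typically melts into a single compare-with-zero feeding a branch, instead of a call / save-registers / compare / return sequence.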
Compilers already have good heuristics about when they should be inlining; chances are they are a lot better at it than you. They don't always inline, and that's not possible anyway.
My experience is that compilers do marvels with inlining decisions when there are lots of small functions they _can_ inline if they want to. It gives the compiler a lot of freedom. Lambdas are great for that as well.
Make sure you make as much compile-time information as possible available to the compiler, factor your code, don't write huge functions, and let the compiler do its magic. As a plus, you can have high-level abstractions, deep hierarchies, and still get excellent performance.
True, and the functions where inlining is most effective are small, non-recursive functions that don't lose any debug info from omitting the frame pointer.
> When calling a simple function like this within a large loop, would it make a noticeable difference in speed to inline the computation vs. having a function call?
Maybe. Inlining small functions can reduce cache load, and it means no call/ret instructions and no overhead from argument passing. Moreover, inlining allows further optimizations which can't be done without breaking function boundaries. It may be noticeable, and it may not; it depends on the loop.
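To give a feel for it, a hedged sketch (invented names): with the definition visible, compilers will normally inline 'scale', keep 'k' in a register, and often auto-vectorize the loop, none of which is possible across an opaque call into another translation unit (short of LTO).

    #include <stddef.h>

    static inline float scale(float x, float k)
    {
        return x * k;
    }

    void scale_all(float *a, size_t n, float k)
    {
        /* Once scale() is inlined, this loop is a candidate for
           unrolling and vectorization. */
        for (size_t i = 0; i < n; i++)
            a[i] = scale(a[i], k);
    }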
> what's the best practice for inlining a computation like this?
There are a lot of examples in the Linux kernel. Here's a random example from include/linux/list.h (the exact body varies across kernel versions, but it looks roughly like this):
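    /* newer kernels wrap the load in READ_ONCE() */
    static inline int list_empty(const struct list_head *head)
    {
        return head->next == head;
    }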
The 'static' keyword allows the compiler to emit no standalone (non-inlined) copy of the function at all, and it also allows such functions to be defined in header files. The compiler can't inline a call if it doesn't have the function definition at compile time; a declaration is not enough. So such functions are usually defined in headers, where 'static' becomes necessary (otherwise every *.c file including the header would produce a duplicate symbol). When the function is defined in a *.c file, 'static' can be omitted, but it's probably better not to.
With C++, such functions will be member functions in most cases, and if so, 'static' would be unneeded and wrong.
And you'll need to turn on optimizations when compiling; the compiler doesn't inline when it isn't optimizing.
Right, inlines are hard to beat, though guaranteed inlining still isn't part of the C standard ('inline' is only a hint, which is kind of mind boggling, I mean... honestly, what year is it?).
Still, the benefit of inlining diminishes as the size of the inlined function increases (you start paying penalties for extra cache lines full of code, and the percentage of time spent in function-call overhead shrinks in a hurry).
The linker can get into the inlining game, too, by the way, if the win is sufficiently big. I have scars to prove it. :-)
While aggressive inlining eliminates function call overhead, it can actually exacerbate instruction cache misses because it makes the executable larger.
True, but less so now than a decade ago. If it's a hot path (which anything `__forceinline` presumably is, assuming the author has a reasonable understanding of optimization), inlining is usually going to be a performance win, and often a cache/code-size win as well (the inlined code gets optimized in context, avoiding register spills).
Yeah, but the point is readability, not performance. If I have 50 functions that are only called once, it will be hard for someone who hasn't read the code to immediately realize that, which makes it more difficult to modify those functions. If they're all inlined, there's no issue.
Correct! Inlining obviously costs CPU and code-cache space, and it makes the receiving method bigger, so that method is less likely to be inlined itself. If the JIT ever has to deoptimize, the inlining effort was pretty much wasted.
You can do this in C#, F#, Rust and many other languages too with 10x better performance (in Rust in particular those usually get completely optimized away).
One of the biggest reservations I initially had about abstraction, when learning programming, was performance; I never considered it a mere detail. As such, I didn't litter my code with objects or pure functions, even when they made conceptual sense, unless I was convinced the overhead was small and the abstraction helpful.
I've found that languages like Java make assumptions about your memory, and languages like Prolog make assumptions about your data. My impression is that in Haskell, in-place modification is something the language implementation does for you as an optimization, whereas C gets reached for precisely when resources are constrained and abstraction layers are deemed too costly. Consequently, C is maligned for hiding faulty implementations around every corner, yet it remains one of the most used languages.
You're asking for something that's a bit awkward to find, because it requires a bunch of code in a loop putting pressure on the cache, and then someone noticing that inlining one thing vs. not makes all the difference.
The most likely people to be able to answer this one would be game devs or video codec hackers, at a guess.
I do know that inlining choices can have massive effects on executable size. I've seen more people complain about this kind of thing. It's most noticeable when controlling the inlining of a runtime library function in a language a bit more high level than Rust - I'm thinking of Delphi, with its managed strings, arrays, interfaces etc.