Faster CPython at PyCon, part one (lwn.net) similar stories update story
264 points by jwilk | karma 8094 | avg karma 2.47 2023-05-11 09:21:17 | hide | past | favorite | 170 comments




Nooooo if Python gets faster then all the Mojo demos are going to sound so much less impressive nooo

I don't know, looks like ~15% improvement [1]. Doubt the Mojo guys are going to lose much sleep (they claimed 35,000x improvement).

[1] https://speed.python.org/comparison/?exe=12%2BL%2B3.11%2C12%...


Somehow I'm still skeptical! "Strict superset of Python, all Python code works" combined with "several orders of magnitude faster" sounds like trying to have your cake and eat it, too.

I doubt that the Mojo developers have some sort of 'secret sauce' or 'special trick' that will get them there. And even if they have something, I don't see why the Python devs wouldn't just implement the same approach, considering they're currently trying to make Python faster.

I assume that (as long as Mojo wants to stick to its goal of being a strict superset of Python), there will be a lot of things that just cannot be 'fixed'. For example, I'd be surprised if Python's Global Interpreter Lock isn't entangled with the language in a few nasty ways that'd make it really difficult to replace or improve upon (while retaining compatibility).

Then again, the developer of Swift is working on it, right? I guess he's got the experience, at least.


Isn’t Mojo compiled, while Python needs to be interpreted?

Mojo offers both Just-In-Time and Ahead-Of-Time compilation models. Proper AOT compilation is going to offer a significant speedup, but probably not enough to get to their stated goals. And good luck carrying all of Python's dynamic features across the gap.

I think the point is that it's really hard to get those gains from "being compiled" when every function or method call, every use of an operator, everything is subject to multiple levels of dynamic dispatch and overrideability, where every variable can be any type and its type can change at any time and so has to be constantly checked.

A language that's designed to be compiled doesn't have these issues, since you can move most of the checking and resolution logic to compile-time. But doing that requires the language's semantics to be aligned to that goal, which Python's most certainly are not.

The achievements of Pypy are pretty incredible in the JIT space, but the immense effort that has gone in there definitely is a good reason to be skeptical of Mojo's claims of being both compiled and also a "superset" of Python.


So you want a system that figures out that the vast majority of function and method calls are not overridden, that these variables and fields are in practice always integers, etc.; generates some nice compiled code, and adds some interlock so that, if these assumptions do change, the code gets invalidated and replaced with new code (and maybe some data structures get patched).
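As a toy sketch of why those interlocks are needed (names here are made up for illustration), even a plain attribute load in Python can change meaning at runtime, so any compiled assumption needs a guard:

```python
# A toy illustration of why compiled code needs "guards": the meaning
# of even a plain attribute load can change at runtime in Python.

class Point:
    def __init__(self, x):
        self.x = x

def get_x(p):
    # A JIT that compiled this as "load the instance-dict slot for x"
    # must also emit a check that Point's behavior hasn't changed.
    return p.x

p = Point(1)
print(get_x(p))    # 1: a plain instance-dict load

# Patch the class: the same bytecode now runs arbitrary user code,
# because a data descriptor on the class shadows the instance dict.
Point.x = property(lambda self: self.__dict__["x"] * 100)
print(get_x(p))    # 100: any cached assumption is now stale
```

This is exactly the "assumptions do change, so invalidate the compiled code" scenario described above.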

I think that this kind of thing is in fact done with the high-performing Javascript engines used in major browsers. I imagine PyPy, being a JIT, is in a position to do this kind of thing. Perhaps Python is more difficult than Javascript, and/or perhaps a lot more effort has been put into the Javascript engines than into PyPy?


Python is considerably more dynamic than JavaScript. And I'm not trying to drag pypy, just saying that that's already the state of the art and there are good reasons that the authors of pypy didn't set out to do what Mojo apparently claims to have done.

> Python is considerably more dynamic than JavaScript

This is sometimes repeated but I don’t believe that is why Python is slow (nor do I think for most measures of “dynamic” it is even true). Which aspect of “dynamism” in particular are you concerned about that JavaScript lacks? The primary hindrance is keeping CPython extensions working while making Python fast. Add to that the hundreds of millions that went into V8 and other JavaScript implementation efforts.


For starters, consider that almost every operator in Python is a dynamic method call under the hood. This includes value equality and hashing, which also affects e.g. dict keys. Oh, and all built-in types can be extended and their behavior overridden.

And then there's descriptors: https://docs.python.org/3/howto/descriptor.html
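As a small sketch of that dynamism (hypothetical class, not from any library): overriding value equality and hashing means an ordinary dict lookup ends up running user code.

```python
# Every dict key operation dispatches dynamically to __hash__ and __eq__.

class Fuzzy:
    def __init__(self, n):
        self.n = n

    def __eq__(self, other):   # invoked by == and by dict lookups
        return abs(self.n - other.n) <= 1

    def __hash__(self):        # invoked for every key operation
        return 0               # collapse all keys into one bucket

d = {Fuzzy(10): "near ten"}
print(d[Fuzzy(11)])            # prints: near ten
```

The lookup of a key that was never inserted still succeeds, because the dict called our `__hash__` and `__eq__` rather than anything it could have resolved statically.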


Those are hardly unsolvable issues with a JIT as they are static for the vast majority of cases. Solutions have been proposed since the 90s with Self and Strongtalk[1].

Doing that, plus 100% compatibility with CPython extension API, while preserving some expectations of deterministic destruction in Python, are the challenges one would face.

[1]: http://www.strongtalk.org/


The point is that the part of mojo that is new allows you to write early bindings that can't change, immutable variables, values that are accessed without indirection, types, ownership annotations, etc, and uses all of that to do precisely what you are asking for.

If it can't do much until those extra annotations are in place, then I would imagine it's unlikely it'll ever get support from numpy, django, and the like. Uptake for mypy type annotations alone has been pretty slow.

>Uptake for mypy type annotations alone has been pretty slow.

Because they're annotations with bad checkers. They're mostly helpful as documentation.

TypeScript checking is on a whole different level.
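For what it's worth, the annotations themselves are inert at runtime; only external checkers act on them, which is part of why adoption is optional. A minimal sketch:

```python
# CPython stores annotations but never enforces them; a checker like
# mypy would flag this call, the interpreter happily runs it.

def add(a: int, b: int) -> int:
    return a + b

print(add("no", "t ints"))    # prints: not ints
print(add.__annotations__)    # the stored, unenforced annotations
```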


Python can be JIT compiled; so far most efforts have fallen short due to the CPython FFI API.

Plenty of other dynamic languages have proven the point already.


Well, they do have MLIR, which is probably the closest thing to "secret sauce" they've got. I'm excited by some of the performance-oriented features like tiling loops, but they'll need the multitude of optimization-hints that GCC and Clang have, too. They also have that parallel runtime, which seems similar to Cilk to me, and Python fundamentally can never have that as I understand it.

IDK about 35,000x but Python really is outrageously slow relative to other popular languages, even other interpreted ones. It's more comparable to something like Ruby than to something like JavaScript.

From their FAQ: "How compatible is Mojo with Python really? Mojo already supports many core features of Python including async/await, error handling, variadics, etc, but… it is still very early and missing many features - so today it isn’t very compatible. Mojo doesn’t even support classes yet!".

Overall, pure Python seems to be about 100x slower than what you can reasonably get with a compiled language and some hard work. It's about 10x slower than what you can get from JITs like Pypy and Javascript, when such comparisons make sense.

I agree; Mojo reminds me of Cython, but with more marketing and less compatibility with Python. Cython aspired to be a nearly 99% superset of Python, at least that's exactly what I pushed my postdocs and graduate students to make it be back in 2009 (e.g., there were weeks of Robert Bradshaw and Craig Citro pushing each other to get closures to fully work). Mojo seems to be a similar modern idea for doing the same sort of thing. It could be great for the ecosystem; time will tell. Cython could still also improve a lot too -- there is a major new 3.0 release just around the corner!: https://pypi.org/project/Cython/#history


I am a little skeptical as well, but I do think there are a lot of areas for improvement beyond the changes going into mainline. Note that while Mojo intends to support all of Python, they never claimed that it will still be fast if you make heavy use of all the dynamic features. The real limiting factors are how often those dynamic features are used, and how frequently the code needs to be checking whether they are used.

The fast CPython changes are still very much interpreter-centric, and are checking whether assumptions have changed on every small operation. It seems to me that if you are able to JIT large chunks of code, and then push all the JIT invalidation checks into the dynamic features that break your assumptions rather than into the happy path that is using the assumptions, you ought to be able to get much closer to Javascript levels of performance when dynamic features aren't being used.

Then support for those dynamic features becomes a fallback way of having a huge ecosystem from day one, even if it is only modestly faster than CPython.


The way I read it, Python code as it is won't see a huge bump; it's like 10x or something.

You have to use special syntax and refactoring to get the 1000x speedups. The secret sauce is essentially a whole new language within the language, which likely helps skip the GIL issues.


Do they claim that existing python code would get a 10x bump? That sounds too good to be true

...Swift and creator of LLVM and Clang.

That comparison seems quite cherry-picked. Unlikely that it is generalizable. In my tests with Mojo, it doesn’t seem to behave like Python at all so far. Once they add the dynamism necessary we can see where they land. I’m still optimistic (mostly since Python has absolutely garbage performance and there’s low-hanging fruit) but it’s no panacea. I feel their bet is not to have to run Python as is but to have Python developers travel some distance and adapt to some constraints. They just need enough momentum to make that transition happen, but it’s a transition to a distinct language, not a Python implementation. Sort of what Hack is to PHP.

Hack as a comparison is probably not a great thing for Mojo, since no one outside of Facebook uses Hack.

Safe to say no one outside of Modular uses Mojo either at the moment.

I don't buy the mojo hype.

The concept of compiling python has been tried again and again. It has its moments, but anything remotely important is glued in from compiled code anyways.


From a quick reading of the Mojo website, it sounds like gluing in compiled code is exactly what they're doing, except this time the separate compiled language happens to look sort of like Python a bit. For the actual Python code it still uses CPython, so that part doesn't get any faster.

Pyrex, Cython, mypyc, Codon... we've been down this road before plenty too.

If Mojo succeeds it will be purely based on quality of implementation, not a spark of genius.


The "spark of genius" might very well be a novel implementation strategy.

It leverages the Multi-Level Intermediate Representation (MLIR) https://mlir.llvm.org/ which is a follow-on from LLVM, with lots of lessons learned: https://www.hpcwire.com/2021/12/27/lessons-from-llvm-an-sc21...

Again, "compile to an IR" isn't exactly ground-breaking. The devil will be in the details.

Cython works and is widely used. Problem is that there aren't too many people working on it (from what I can tell), so the language has some very rough edges. It seems like it has mostly filled the niche of "making good wrappers of C, C++, and Fortran libraries".

Cython has always been advertised as a potential alternative to writing straight Python, and there are probably a decent number of people who do this. I work in computational science and don't personally know anyone that does. I use it myself, but it's usually a big lift because of the rough edges of the language and the sparse and low quality documentation.

If Cython had 10x as many people working on it, it could be turned into something significantly more useful. I imagine there's a smarter way of approaching the problem these days rather than compiling to C. I hope the Mojo guys pull off "modern and actually good Cython"!


Cython was, of course, "modern and actually good Pyrex."

In the end the way the industry works guarantees an endless stream of these before some combination of boredom and rentiership results in each getting abandoned. It's just a question of whether Mojo lasts 1, 3, or god willing 5 years on top.


Is it the way the industry works, or is it the way open source works? Pyrex had basically one person behind it (again, from what I can tell), and Cython currently has ~1 major person behind it. Not enough manpower to support such large endeavors.

Ideally, government or industry would get behind these projects and back them up. Evidently Microsoft is doing that for Python. For whatever reason, Cython and Pyrex both failed to attract that kind of attention. Hopefully it will be different with Mojo.

Here's to 5 years!


There's also Numba, which seems a lot more active and compiles directly to LLVM IR.

We use Cython a lot. Currently the two biggest annoyances are the lack of tooling, in particular a language server and a code formatter. Besides that, even though Cython looks a lot like Python, you need to have some familiarity with C or C++ to avoid shooting yourself in the foot and to check that the generated code is not suboptimal.

Cython's main benefit is very deep integration with Python (compared to eg. Rust and PyO3).


Cython is also not actually that much faster if you just compile vanilla Python code with it.

Expressly not the purpose of Cython.

Sure, but we're talking about Cython as a proof of concept for Mojo, for which substantial perf gains are supposed to come from compilation even before you do anything special.

Great implementation is genius.

But grueling and desperately dependent on broader trends.

I wouldn't want to bet on "lots of work for a chance to get incrementally better."


Mojo compiles "Python" to heterogeneous hardware, I don't believe that has been tried before.

It has.

There's Numba, CuPy, Jax and torch.compile. Arguably they are more like DSLs that happen to integrate into Python than regular Python.

Of course I don't know what Mojo will actually bring to the table since their documentation doesn't mention anything GPU specific, but the idea isn't completely novel.


It has mostly failed because, contrary to other dynamic languages going back to the early Lisp compilers, there is community resistance to JIT adoption and to the relevant refactoring of the C API in CPython.

In no way is Python more dynamic than Smalltalk, SELF or Common Lisp, which can at any given time redefine any object across the whole execution image and were/are mostly bootstrapped environments.


I've been rewriting Python->C for nearly 20 years now. The expected speedup is around 100x, or 1000x for numerical stuff or allocation-heavy work that can be done statically. Whenever you get 10,000x or above, it's because you've written a better algorithm. You can't generalize that. A 35k speedup is a cool demo but should be regarded as hype.

> The expected speedup is around 100x, or 1000x for numerical stuff

What if you stay in the realm of numpy?

What's the biggest offender that you see?


> What if you stay in the realm of numpy?

You mean, what if you're only doing matrix stuff? Then it's probably easier to let numpy do the heavy lifting. You'll probably take less than a 5x performance hit, if you're doing numpy right. And if you're doing matrix multiplication, numpy will end up faster because it's backed by a BLAS, which mortals such as myself know better than to compete with.

> What's the biggest offender that you see?

Umm... every line of Python? Member access. Function calls. Dictionaries that can fundamentally be mapped to int-indexed arrays. Reference counting. Tuple allocation.

One fun exercise is to take your vanilla python code, compile it in Cython with the -a flag to produce an HTML annotation. Click on the yellowest lines, and it shows you the gory details of what Cython does to emulate CPython. It's not exactly what CPython is doing (for example, Cython elides the virtual machine), but it's close enough to see where time is spent. Put the same code through the python disassembler "dis" to see what virtual machine operations are emitted, and paw through the main evaluation loop [1]; or take a guided walkthrough at [2].

[1] https://github.com/python/cpython/blob/v3.6.14/Python/ceval.... (note this is an old version, you can change that in the url)

[2] https://leanpub.com/insidethepythonvirtualmachine/read
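For example, a minimal sketch of the second exercise (the exact opcodes vary by CPython version): the standard `dis` module makes the per-operation dispatch visible.

```python
import dis

def dot(xs, ys):
    total = 0
    for x, y in zip(xs, ys):
        total += x * y
    return total

# Each multiply, in-place add, and iteration step below compiles to
# its own dynamically dispatched instruction in the evaluation loop.
dis.dis(dot)
```

Every one of those instructions goes through the big switch in ceval, which is exactly the overhead the annotated-Cython view shows from the other side.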


This guy cythons

Due to the possibility to fuse multiple operations in C++ (whereas you often have intermediate arrays in numpy), I routinely get 20x speedups when porting from numpy to C++. Good libraries like eigen help a lot.

What's an example of fusing operations?

Are you talking about combinations of operations that are used commonly enough to warrant Eigen methods that perform them at once in SIMD?


Probably that Eigen uses expression templates to avoid the needless creation of temporaries.

Most non-trivial numpy operations require temporaries, which require new allocations and copies. Eigen3's design lets you avoid these through clever compilation tricks while remaining high-level.

Sometimes numpy can elide those (e.g., that is why a += b is faster than a = a + b), but this is not possible in general. Sometimes people use monstrosities like einsum... but I find it more intuitive to just write in C or C++...

In addition to the time spent in allocation / gc / needless copying, the memory footprint can be higher by a factor of a few (or more...).
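A minimal sketch of the temporary-vs-in-place difference (assuming NumPy is available), using the buffer address to show where the allocation happens:

```python
import numpy as np

a = np.ones(4)
b = np.ones(4)

addr0 = a.__array_interface__["data"][0]   # a's buffer address

a = a + b          # allocates a fresh temporary, then rebinds `a`
addr1 = a.__array_interface__["data"][0]
print(addr1 != addr0)                      # True: a new buffer

a += b             # in-place ufunc: writes into the existing buffer
addr2 = a.__array_interface__["data"][0]
print(addr2 == addr1)                      # True: same buffer
```

In an expression like `a + b + c`, each intermediate is a full-size temporary, which is where the extra time and memory footprint come from.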


Yep, einsum is included in "doing numpy right." And for what it's worth, it's horrid to use and still won't get around cases like x -> cos(x). I haven't needed the power of eigen for a couple of years, but I appreciate the tip.

> numpy will end up faster because it's backed by a BLAS, which mortals such as myself know better than to compete with.

I'd like to dig a little here, for my own curiosity. How is this possible? I.e., beating C or Rust code using... arcane magic. It reminds me of how React was touted as fast; I couldn't figure out how a JavaScript lib could be faster than JavaScript.


BLAS uses low level routines that are difficult to replicate in C. Some of the stuff is written in FORTRAN so as to avoid aliasing issues inherent to C arrays. Some implementations use direct assembly operations. It is heavily optimized by people who really know what they're doing when it comes to floating point operations.

BLAS is a very well optimized library. I think a lot of it is in Fortran, which can be faster than C. It is very heavily used in scientific computing. BLAS also has methods that have been hand-tuned in assembly. It’s not magic, but the amount of work that has gone into it is not something you would probably want to replicate.

BLAS are incredibly well optimized by people doing their life's work on just matrix multiplication, hand-tuning their assembly, benchmarking it per platform to optimize cache use, etc -- they are incredible feats of software engineering. For the multiplication of large matrices (cubic time), the performance gains can quickly overwhelm the quadratic-time overhead of the scripting language.

You can get on the order of 10-30x speedup over NumPy by reducing the allocation of temporaries and fusing across operations. See:

Weld: A Common Runtime for High Performance Data Analytics

https://dspace.mit.edu/bitstream/handle/1721.1/137425/cidr_w...


> The expected speedup is around 100x, or 1000x for numerical stuff or allocation-heavy work that can be done statically. Whenever you get 10,000x or above, it's because you've written a better algorithm.

Anecdotally I recently rewrote a piece of Python code in Rust and got ~300x speedup, but let's be conservative and give it 100x. Now let's extrapolate from that. In native code you can use SIMD, and that can give you a 10x speedup, so now we're at 1000x. In native code you can also easily use multiple threads, so assuming a machine with a reasonably high number of cores, let's say 32 of them (because that's what I had for the last 4 years), we're now at 32000x speedup. So to me those are very realistic numbers, but of course assuming the problem you're solving can be sped up with SIMD and multiple threads, which is not always the case. So you're probably mostly right.


Python can use multiprocessing with a shared nothing architecture to use those 32 threads.

I was about to say the same thing.

Multiprocessing on Python works great and isn’t even very hard if you use say apply_async with a Pool.

Comparing single-threaded Python with multiprocessing in Language X is unfair if not disingenuous.
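A minimal sketch of that pattern (note the standard-library method is spelled `apply_async`):

```python
from multiprocessing import Pool

def square(n):       # must be a top-level function so it pickles
    return n * n

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        # Dispatch one call per item; each handle resolves in a
        # worker process, so the GIL of the parent is not contended.
        handles = [pool.apply_async(square, (n,)) for n in range(8)]
        results = [h.get() for h in handles]
    print(results)   # [0, 1, 4, 9, 16, 25, 36, 49]
```

The `__main__` guard matters on platforms that spawn fresh interpreters for workers, and everything crossing the process boundary must be picklable.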


> Multiprocessing on Python works great and isn’t even very hard if you use say apply_async with a Pool.

Multiprocessing works great if you don't really need a shared memory space for your task. If it's very loosely coupled, that's fine.

But if you use something that can benefit from real threading, Python clamps you to about 1.5-2.5 cores worth of throughput very often.


There's a serialization overhead both on dispatch and return that makes multiprocessing in Python unsuitable for some problems that would otherwise be solved well with threads in other languages.

Unless you don't need to change your code.

Hear me out… we can write bad python code to justify impressive speed boosts rewriting it in rust.

In this way we can justify rewriting stuff in rust to our bosses!

If we write decent python, and perhaps even replace 1 line to use pypy, the speedup won't be impressive and we won't get to play with rust!


The other languages are not taking/releasing a globally mutually exclusive GIL every time it crosses an API boundary and thus "shared nothing" in those languages is truly shared nothing. Additionally, Python's multiprocessing carries a lot of restrictions which makes it hard to pass more complex messages.

And each of these threads will still have the Python interpreter performance.

Nothing preventing something like Mojo to also use those same 32 threads but with 10-100x the performance instead.


Trivially parallelizable algorithms are definitely in the "not generally applicable" regime. But you're right, they're capable of hitting arbitrarily large, hardware-dependent speedups. And that's definitely something a sufficiently intelligent compiler should be able to capture through dependency analysis.

Note that I don't doubt the 35k speedup -- I've seen speedups into the millions -- I'm just saying there's no way that can be a representative speedup that users should expect to see.


I wrote a simple Monte Carlo implementation in Python 3.11 and Rust. Python managed 10 million checks in a certain timeframe, while Rust could perform 9 billion checks in the same timeframe. That's about a 900x speedup, if I'm not mistaken. I suspect Mojo's advertised speedup is through the same process, except on benchmarks that are not dominated by syscalls (RNG calls in this instance).

The one difference was the Rust one used a parallel iterator (rayon + one liner change), whereas I have found Python to be more pain than it's worth, for most usecases.
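For reference, a minimal pure-Python version of that kind of benchmark (a sketch, not the parent's actual code): every loop iteration pays interpreter dispatch, boxed-float arithmetic, and dynamic calls into `random`, which is exactly what the compiled version avoids.

```python
import random

def estimate_pi(samples, seed=0):
    """Monte Carlo pi: fraction of random points inside the unit circle."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:   # point fell inside the quarter circle
            hits += 1
    return 4 * hits / samples

print(estimate_pi(100_000))        # roughly 3.14
```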


Iirc, the 35kx number included parallelisation

Because they are targeting the GPU, write a GPGPU shader and call it from Python and you will get the same number.

Or use Jax or Taichi.


35,000x suddenly becomes 30,000x, not as impressive!

How does that reconcile with the benchmarks here, which say that Python 3.12 is currently somewhere between 5% slower to 5% faster?

https://github.com/faster-cpython/benchmarking-public


If it's an improvement that big, it needs to be with GPUs, and GPUs can be used normally with torch etc.

Meh, mypyc and codon both do the same thing as mojo. Codon seems to have similar speed improvements and the source code is available.

Do they compile to MLIR so when Alteryx writes a code generator you can target FPGAs, for example?

LLVM IR, so probably there are some flags somewhere. It also supports OpenMP and GPU compute through decorators.

JAX or torch.compile also do something similar to Mojo, right?

Once fused into super-instructions/adaptive instructions, does `dis` still allow users to inspect and manipulate the bytecode as if it were not optimized?

That'd be a very nice feature and I'm not sure it exists in other languages.


I haven't followed things recently but I know that at least initially they purposefully kept the bytecode intact so that existing debuggers and compilers and what not would still work. I believe they put the optimized bytecode in a different field, and accept this duplication as the cost of supporting the feature you requested.

Speaking of Python speed you gotta hand it to the core Python devs having some really dedicated numerical analysis folks. A really good gentle introduction talk https://www.youtube.com/watch?v=wiGkV37Kbxk

Python really has earned its place as the language of choice for scientific, research, and ML. I've stolen quite a few algos from CPython.


"Speaking of Python speed you gotta hand it to the core Python devs having some really dedicated numerical analysis folks."

Well, they are Python devs after all.


> LOAD_ATTR_ADAPTIVE operation can recognize that it is in the simple case

How can it do that? The `self` variable could be anything, or not?


If you define a .__getattr__ method, then "self.y" is no longer simple.

If it finds itself not in the simple case it specialized for, it reverts back to the unspecialized case. In JIT compiler terms, this is called a "guard".

But isn't this the same as what LOAD_ATTR also does? It first checks the simple case, and if that fails, it goes the complex fallback route. Why does this need a new op? Or why would the standard LOAD_ATTR not do this?

Or are there multiple versions of the compiled function, and it modifies the bytecode per each call of the function?


The issue is that there are many different "simple cases". The more checks for "fast paths" you add, the more overhead there is from individually checking for them.

Also consider that:

- this is more branch predictor friendly (a generic opcode encounters many different cases, while a specialized opcode is expected to almost never fail),

- the inline cache has limited space and is stateful (imagine it's a C union of cache structs for different fast paths). If your "generic" LOAD_ATTR found a simple case B, it'd still need to ensure that its inline cache was actually primed for simple case B. This is not the issue with specialized opcodes; the cache is set to correct state when the opcode gets specialized.
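On CPython 3.11+, the specialization is observable with `dis` (a sketch; the specialized opcode names are implementation details and vary by version):

```python
import dis
import sys

class C:
    def __init__(self):
        self.y = 1

def f(c):
    return c.y

c = C()
for _ in range(1000):   # warm up so the adaptive interpreter specializes
    f(c)

if sys.version_info >= (3, 11):
    # With adaptive=True, LOAD_ATTR may appear as a specialized form
    # such as LOAD_ATTR_INSTANCE_VALUE once the simple case is observed.
    dis.dis(f, adaptive=True)
else:
    dis.dis(f)
```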


Shameless plug: https://yakshalang.github.io/

This is my passion project. (Less activity these days as I'm sick)


I hope you feel better. This is a neat project and I’m looking forward to trying it out.

+1 same here

thank you

thank you

I'm impressed. Shameless plug to your plug: you may want to join the Handmade [0] community or attend our conferences [1]. You'd fit right in with us!

[0] https://handmade.network

[1] https://handmadecities.com


Nice. I'm already aware of handmade.network. I was planning on joining but lost track of it. handmadecities also looks interesting. Will take a look. Might take until end of month.

Python's real strength at this point is the ecosystem. If you want speed you wouldn't use Python. Having said that, it's better for the planet if default Python is more efficient. Everyone has seen Python code doing stuff Python code should not do at scale.

Thanks for the information, it has never been said before in HN when an improvement to Python happens and I am sure there is still someone out there thinking that Python is fast.

BTW from the PEP:

“Motivation

Python is widely acknowledged as slow”


There are a lot of people out there who interpret "Python is a slow language" as an attack, rather than just something that engineers need to know and account for at solution time.

There are also a non-trivial number of people out there who read about the constant stream of performance improvements and end up confusing that with absolute high performance; see also the people who think Javascript JITs mean Javascript is generally at C performance, because they've consumed too many microbenchmarks where it was competitive with C on this or that task, and got 500% faster on this and 800% faster on that, and don't know that in totality it's still on the order of 10x slower than C.

It is, unfortunately, something that still needs to be pointed out.


I am more of the opinion that people who truly need performance already know about the differences between Python and C, and my experience with slow Python code is almost always due to lack of basic CS knowledge rather than the language per se.

Maybe my comment was too mean, but every time something related to Python core enhancements appears, someone has to say that C is faster, and I am not sure someone interested in how the generated bytecode improves performance needs that reminder.


I write a lot of code in the "Python would be really chugging here, though it would get the job done if you put a bit of elbow grease into it, but Go chows right through this no problem with no special effort" space. I'm not generally in the "keeping 100 servers 90% occupied" space but it wouldn't take much more in percentage terms for me to fall off of Python. Python's slowness is definitely real, as well as its general difficulty in keeping multiple CPUs going. (Python has a lot of solutions to this problem precisely because none of them are really all that good, compared to a language that is 1. compiled and 2. has a solid threading story. It wouldn't have many solutions if one of them was an actual solution to the problem.)

It isn't just Python. I'm getting moved more into a PHP area right now, and there's a system in there that I need to analyze to see whether its performance problems are architectural, or if it's just not a task PHP should be doing and Go(/Rust/C#/Java/... Go is just the one of that set most in my and my team's toolbelt) would actually perform just fine. I don't know which it is yet, but based on what I know at the moment, both are plausible.

And I'm not a "hater". I just have a clear view of the tradeoffs involved. There are many tasks for which Python is more power than one knows what to do with. Computers have gotten pretty fast.


Last year we had a team switch from Java to Python for what should be a high-performance API, because "they're really speeding up Python 3 these days."

It's now at least 2x off its original perf target and only the trivial case is being handled correctly.

I think reminding people is good.


No need to get testy. I know it's widely known. The problem is, we still see Python in performance-sensitive areas at scale anyway. So I said it's a good thing.

I'm not trying to give anyone a wedgie by saying that. I thought I was just giving some justification for the optimization that OP describes.


Yes we all know python is slow. But that's not a virtue of the language, it's not a feature that anyone really wants. We live with it but increasing speed is always a good thing. Which is why even Python's creator is making it his priority right now.

I want productivity and speed, it turns out that most dynamic languages learned to embrace JIT compilers, only the Python community so far has been against them, and appears to rather write C code instead.

Most attempts thus far haven't failed because it isn't possible, rather because the community hasn't rallied around them.

Smalltalk, SELF, CL, Dylan, Lua have shown how it is done, Julia, JS and Ruby are following along, only Python keeps resisting it.


In practice, Python is very often fast enough. Mostly because hotspots are implemented in another language, e.g., if you use Python to multiply matrices, then something like numpy would use BLAS (C, Fortran) or similar under the hood. Your handwritten code in any language will have a hard time beating BLAS in performance.

I like tinkering with interpreters, and I do it a lot, but this is chump change compared to even a baseline compiler, especially one that makes use of type profiles. There is a pretty big investment in introducing a compiler tier, but I'm not convinced that it's less engineering effort than squeezing more out of an interpreter that fundamentally has dispatch and dynamic typing overhead. There aren't that many bytecode kinds in CPython, so it seems worth doing.

That's what we did in Skybison before it got wound down: https://github.com/facebookexperimental/skybison

Are there publicly-available performance results?

Kind of! In my fork I run microbenchmarks on each PR. So you can see on, for example, https://github.com/tekknolagi/skybison/pull/456, that the change had a 3.6% improvement on the compilation benchmark. If you expand further, you can see a comparison with CPython 3.8. Unfortunately Skybison is still on 3.8.

What's the difference between that and Cinder? I know they both come from Meta/IG, but I wonder if Cinder is an evolution of Skybison or more of a different approach?

Cinder and Skybison are different approaches. Skybison was a ground-up rebuild of the object model and interpreter, whereas Cinder is a JIT that lives on top of CPython.

Small improvements to cpython are multiplied out to the millions of machines cpython is installed on by default.

I prefer C without braces

When will such goodness come to JavaScript engines? I’m curious if there is some fundamental design flaws that made JavaScript difficult to optimize.

JavaScript engines have been using inline caches and specialization for over 10 years now. It's Python that's catching up to JavaScriptCore from 2007.
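To make the inline-cache idea concrete, here is a toy monomorphic inline cache for attribute access, written in plain Python purely as an illustration. Real engines like JavaScriptCore and CPython 3.11 cache at the bytecode level inside the interpreter; `make_cached_getter` is a hypothetical helper, not any engine's actual API.

```python
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

def make_cached_getter(name):
    """Toy inline cache: remember the class seen at this call site and
    skip the generic lookup while the class stays the same."""
    cached_class = None
    cached_getter = None

    def get(obj):
        nonlocal cached_class, cached_getter
        if type(obj) is cached_class:      # cache hit: fast path
            return cached_getter(obj)
        cached_class = type(obj)           # cache miss: (re)fill the cache
        cached_getter = lambda o: o.__dict__[name]
        return cached_getter(obj)

    return get

get_x = make_cached_getter("x")
p = Point(1, 2)
print(get_x(p), get_x(p))  # second call takes the cached fast path
```

The same trick generalizes to method calls and binary operators, which is essentially what CPython 3.11's specializing adaptive interpreter does.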

I wonder if the parent comment was made tongue-in-cheek :)

Or catching up to Smalltalk/SELF/CL from 1980's, actually.

I mean JavaScript is definitely gross to optimize, if you'd design a language for easy optimization and jitting it would look different. Python is just way worse, especially because of the large and rich CPython C API, which used invasive structures almost everywhere (it has been slowly moving away from that). The CPython people, GvR in particular, always considered it very important for the codebase to remain simple as well, and it is. CPython is a really simple compiler and straightforward bytecode interpreter.

Node.js has been on par with Java for many years now due to the Jit and tons of money google has pumped into that ecosystem: https://benchmarksgame-team.pages.debian.net/benchmarksgame/...

Java: 2.52s Node.js: 6.37s

I don't know if I'd call that on par.



Funny that you say this, the more common view I've heard is "we made JS way faster why can't we do that for python". I believe I heard Mark Shannon (one of the lead devs on Faster CPython) say this at one point.

I love me some Python, trust me. But this is just putting lipstick on a pig. The language needs to be rewritten from the ground up. If you’ve ever analyzed the machine code a simple python script spits out you’ll see how helplessly bloated it is.

Do you have thoughts around the concept of Python becoming a Rust macro based language? I read another comment about that being a potential future for Python but I don't know how feasible it is.

I've moved on to Go for things where I need to care about performance but like thinking about things in a similar way Python lets me think about them.


> If you’ve ever analyzed the machine code a simple python script spits out you’ll see how helplessly bloated it is.

How would one do that? I thought Python generated bytecode and interpreted that code? Where's the machine code? Do you mean disassembling the Python interpreter itself? In which case it would be gcc/llvm spitting out machine code, no?


You can make a Python-ish language that can run fast (or at least, a lot faster) but it cannot be Python: the dynamic nature of the language means many optimizing techniques just aren't available.

The unfortunate thing is that the vast majority of Python code in use today doesn't need all those super-dynamic features. So it would run just fine on a cut-down JITed interpreter. But there's always the corner cases.

In retrospect, Python 3 was a lost opportunity here: it could have broken more compatibility, enabled multi-core and JITs, and then the 10 year transition pain would actually have been worth it. But that's hindsight, of course.


[dead]

>the dynamic nature of the language means many optimizing techniques just aren't available.

This is something that seems like it should be true, but counter evidence exists that proves that it's not the case.

The first example would be V8, the JavaScript JIT compiler used in Chrome and NodeJs (and probably other things). V8 is many times faster than CPython in pretty much every situation.

The second, and even better, example is SBCL, a Common Lisp compiler. SBCL is quite a bit faster than CPython and V8; it's closer to the JVM in terms of performance in benchmarks that I have seen.

The third example would be some of the Scheme compilers, like Chez and Gambit, which are not far off from SBCL.

Maybe you could argue that JavaScript is not as dynamic as Python. I don't know JavaScript at all so maybe that is the case.

I'm pretty sure that Common Lisp and Scheme are not less dynamic though. I think Common Lisp is actually more dynamic but I don't have any way to measure this, so it's just my opinion.

So assuming these languages are as or more dynamic than Python, this seems to be proof that Python's dynamic-ness is not the reason for its poor performance!

The Lisp compilers are also much less widely used and have much less engineering power available!

I think these counter examples are pretty interesting and don't know exactly what to make of it. Python has more funding and more users to contribute to it (except in the case of V8), I guess until now they just haven't put any of that into performance.


Python is "stupid dynamic"; it exposes implementation details which can't work well under compilation, like accessing a caller's local variables.
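For the curious, this is what "accessing a caller's local variables" looks like in CPython. It works, but supporting it means the interpreter has to keep real frame objects around, which rules out many optimizations a compiler would otherwise apply.

```python
import sys

def snoop():
    # Reach one frame up the call stack and read the caller's locals.
    # Nothing was passed in, yet we can see the caller's variables.
    caller_locals = sys._getframe(1).f_locals
    return caller_locals["secret"]

def caller():
    secret = 42  # never passed to snoop(), yet snoop() can see it
    return snoop()

print(caller())  # 42
```

Note that `sys._getframe` is a CPython implementation detail (the leading underscore signals as much), but plenty of real-world code, including debuggers and test frameworks, relies on it.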

Yes, but Common Lisp is also "stupid dynamic"!

I don't think there's anything Python can do that Common Lisp can't in terms of dynamic-ness!

This is a quote from python.org:

>These languages are close to Python in their dynamic semantics, but so different in their approach to syntax that a comparison becomes almost a religious argument: is Lisp's lack of syntax an advantage or a disadvantage? It should be noted that Python has introspective capabilities similar to those of Lisp, and Python programs can construct and execute program fragments on the fly. Usually, real-world properties are decisive: Common Lisp is big (in every sense), and the Scheme world is fragmented between many incompatible versions, where Python has a single, free, compact implementation.

https://www.python.org/doc/essays/comparisons/

I believe there are dynamic things Common Lisp can do that python can't, like modifying and creating classes, inheritance and methods at runtime, even with effects propagating out to already existing class instances!


> Yes, but Common Lisp is also "stupid dynamic"!

It's not. Common Lisp was designed to enable Lisp applications to be delivered with reasonable performance, first in 1984, when an expensive computer might have had 1 to 10 Megabytes (!) of memory and a CPU with 8 Mhz / 1 Million instructions per second. You'll then see a bunch of different implementations, sometimes within the same running Lisp and able to use different execution modes in the same program:

* source interpreted Lisp -> a Lisp Interpreter executes the code from traversing the s-expressions of the source code -> this is usually slow to execute, but there are also very convenient debug features available

* compiled Lisp code -> a Lisp compiler (often incremental) compiles Lisp code to faster code: byte code for a VM, C code for a C compiler or machine code for a CPU. -> often this keeps a lot of the dynamic features

* optimized compiled Lisp code -> like above, but the code may contain optimization hints (like type declarations or other annotations) -> the compiler uses this provided information or infers its own to create optimized code.

For "optimized compiled Lisp code" the compiler may remove all or some of dynamic features (like late binding of functions, allowing data of generic types to be passed, runtime type checks, runtime dispatch, runtime overflow detection, removal of debug information, tail call optimization, ...). It may also inline code. The portions where such optimizations are applied span from certain parts of functions to whole programs.

Common Lisp also has normal function calls and generic function calls (CLOS) -> the latter are usually a lot slower and people are experimenting with ways to make it fast (-> by removing dynamism where possible).

So, speed in Common Lisp is not one thing, but a continuum. Typically one would run compiled code, where possible, and run optimized code only where necessary (-> in parts of the code). For example one could run a user interface in unoptimized very dynamic compiled code and certain numeric routines in optimized compiled code.

  CL-USER> (defun foo (a b)
             (declare (optimize (speed 3) (safety 0))
                      (fixnum a b))
             (the fixnum (+ a (the fixnum (* b 42)))))

  CL-USER> (disassemble #'foo)
  ; disassembly for FOO
  ; Size: 28 bytes. Origin: #x70068A0918                     ; FOO
  ; 18:       5C0580D2         MOVZ TMP, #42
  ; 1C:       6B7D1C9B         MUL R1, R1, TMP
  ; 20:       4A010B8B         ADD R0, R0, R1
  ; 24:       FB031AAA         MOV CSP, CFP
  ; 28:       5A7B40A9         LDP CFP, LR, [CFP]
  ; 2C:       BF0300F1         CMP NULL, #0
  ; 30:       C0035FD6         RET
  NIL
As you can see, with optimization instructions and type hints, the code gets compiled to tight machine code (here ARM64). Without those, the compiled code looks very different, much larger, with runtime type checks and generic arithmetic.

Nothing prevents a Python dynamic compiler to follow a similar approach though, specially now that type annotations are part of the language.

And in any case, there are the Smalltalk and SELF JITs as an example of highly dynamic environments, where anything goes.


With declarations which are promises from the programmer to the compiler (I promise this is true, on penalty of undefined behavior), you can fix a lot of "stupid dynamic".

Python could have a declaration which says, "this function/module doesn't participate in anything stupidly dynamic, like access to parent locals". If it calls some code which tries to access parent locals, the behavior is undefined.

That's kind of a bad thing because in Lisp I don't have to declare anything unsafe to a compiler just to have reasonably efficient local variables that can be optimized away and all that.


Type annotations are defined to be completely ignored by the interpreter.

So, as of today, they’re useless for optimization. That could be changed, but hasn’t been so far.
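A small demonstration of that: CPython records annotations but never checks them, so an "int" function happily concatenates strings at runtime.

```python
def add(a: int, b: int) -> int:
    return a + b

# The annotations are stored as metadata...
print(add.__annotations__)   # {'a': <class 'int'>, 'b': <class 'int'>, 'return': <class 'int'>}

# ...but the interpreter never enforces them:
print(add("py", "thon"))     # 'python' -- no type error at runtime
```

This is exactly why annotations, as specified today, can't be used directly as compiler promises the way Common Lisp declarations can: the language guarantees that violating them is not an error.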


I'm not exactly sure how anything you said supports that CL isn't a dynamic language.

When using SBCL for example, none of CLs dynamic features are restricted from the programmer in any way. So whether it's compiled to native code or not has no bearing at all on how dynamic the language is.

Can you explain to me why a Python compiler couldn't implement optimizations similar to SBCL?

>It's not.

Is Python more powerful than CL in some way that I am not aware of?


I tried to explain that the optimized version of Common Lisp is less dynamic than the non-optimized version. The speed advantage often comes because the compiled code is less dynamic, or not dynamic at all. Late binding, for example, makes code slower because of another indirection. An optimizing compiler can remove late binding. The code will be faster, but there might no longer be a runtime lookup of the function.

> When using SBCL for example, none of CLs dynamic features are restricted from the programmer in any way.

Sure, but it will be slower in benchmarks. The excellent benchmark numbers of SBCL are in part a result of being able to cleverly remove dynamic features.


Common Lisp keeps the stupid dynamic parts out of language areas that are connected to critical code execution paths. For instance, no aspect of lexical variables is dynamic. But you have dynamic variables, which are separate in such a way that a compiler can easily tell the difference.

I believe there are areas of CLOS which are stupid dynamic; but even there, the specification tries to tread carefully. Firstly, you don't have to use CLOS in a Lisp program; and if you need data structures with named slots, structs may suffice.

Importantly, Common Lisp keeps a kind of basic type versus class type separation in the language. You don't feel it because it's not obnoxious, like int versus Integer in Java. Built in basic object types like integers and strings all have a CLOS class in Common Lisp. But, the class of that class (the metaclass) is not the same as that of a class which the application defines with defclass. The Lisp compiler doesn't have to worry about silly monkey patching being perpetrated on a string or integer.
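Interestingly, CPython draws a somewhat similar line: user-defined classes are fully open to monkey patching, even retroactively for existing instances, but C-level built-in types like `int` and `str` reject it. A small illustration:

```python
class Greeter:
    def hello(self):
        return "hi"

g = Greeter()
Greeter.hello = lambda self: "patched"   # user classes are open
print(g.hello())                          # "patched", even for pre-existing instances

try:
    int.__add__ = lambda self, other: 0   # built-in types are closed
except TypeError as e:
    print("cannot patch int:", e)
```

So in principle a Python compiler can trust the behavior of `int` arithmetic the way a Lisp compiler can trust fixnums; what it cannot trust is anything touching user-defined classes or module globals, which is most of a typical program.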

In some areas of the language, it's clear that the designers were trying to avoid bringing in dynamic behavior that would interfere with performance. For instance, conditions are defined in such a way that they are "class-like" objects, but without the actual requirement that they be CLOS instances.

The meta-object protocol (MOP) was also kept out of the language. I'm not sure whether the MOP is "stupid dynamic" because it also seems to hold the keys to avoiding "stupid dynamic" in that if you don't like some particular dynamism in a given class, maybe you can design your own meta-class which avoids it. It might be possible using MOP to, say, have an object system where the inherited slots of a derived class are at the same offset in the underlying vector storage as in the parent class, so accesses can be optimized. Maybe you can ban multiple inheritance in that meta-class.


Those are interesting points, thank you for writing all of that out!

Just like Smalltalk and SELF, which expose everything and can change the whole image at any execution step.

Besides the sibling comment, Smalltalk and SELF JITs prove otherwise.

Ongoing Ruby efforts as well.


And yet, after 30 years of effort from various different, talented people and teams, with historical knowledge of eg, StrongTalk, and Self, and Common Lisp, and Java, and JavaScript, we've yet to see any major leaps in Python performance.

So, in my opinion, it's clearly not as simple as just taking all of that existing knowledge and a good team and just doing it, or one of the many efforts to do so would almost certainly have succeeded by now.


It would be cool if they'd put some resource into Pypy as well. Feels like that project is such a great achievement, and with just a bit more resource it would've been an amazing alternative. As it is, it takes a long time to get the new language features out, which always hamstrings it a little.

There's a limited number of volunteers with the experience to improve (C)Python, though, so the resources are better spent on improving the CPython implementation to such a point where PyPy is no longer necessary.

See: HHVM and PHP. HHVM is what lit the fire under the PHP team's collective butts to improve perf, and now most just run the mainline PHP interpreter (well, they did years ago when all this happened and I was still doing PHP)

Now there is a number of paid MS employees led by Guido van Rossum working on improving (C)Python.

Why don't they just replace CPython with PyPy, onboard the PyPy team, and make PyPy 100 percent compatible with CPython? Oh, I think NIH syndrome.

From what I remember, PyPy has some fundamental issues with using libraries with C extensions.

That was a decade ago; a lot has changed.

Because if you want full ecosystem compatibility with as much performance as you can squeeze given that constraint, it makes more sense to start with the constraint satisfied rather than try to work toward it.

> Oh I think NIH syndrome.

Neither CPython nor PyPy was invented at Microsoft.


Python was by Guido who's now Microsoft employee

There were initiatives to improve CPython at Google and Dropbox, both of which switched to Golang and van Rossum left.

We'll see if this is a pattern and the same happens at Microsoft.


I believe both Google and Dropbox had a lot of Python code powering their products that they wanted to make faster. I don't think Microsoft has many large 1st party uses of Python. I think they're investing in it largely to gain developer mind-share. So for Google and Dropbox "use another language" was an option, for Microsoft it's not.

the userbase of pypy is a tiny fraction of cpython, likely not worth it

PyPy is very much compatible with CPython; you can just change the interpreter and see it work (in many cases).

I think that's because when new codebases are written, pypy isn't on the latest version of the Python spec, so people choose CPython. If it got a little more support it could get to parity or n-1, and far more projects would start with it and stick with it.

I don't use it myself, I just see the relatively small difference in investment it would be between a great project that's slightly languishing in obscurity despite incredible talent and effort, and a genuine full alternative to CPython.


PyPy is now working towards 3.10; to reach 3.12 quickly they would only need a couple of full-time contributors, which many organizations could afford.

The future of PyPy is with HPy, and that is the direction CPython is going in, too.

Always keep in mind that you can't go faster than what the hardware allows, and sometimes than what the OS allows. We can do way better on some specific tasks, but for that you need to rewrite some algorithm implementations to take better advantage of the hardware. ML is a specific task where you can take advantage of the hardware to do better matrix multiplication. That does not mean you can get the same speedups overall. Mojo, for example.

Porting to WASM and getting C bindings to be compatible on steady well developed VMs will make Python faster.

How, specifically?

Because C code in web assembly can translate to native machine code and still interop with Python.

All this PyObject nonsense does nothing but slow down the runtime. V8 was built for performance from day 1. CPython is old, slow, and will always be outperformed by V8 regardless of the tweaks this team does.

I am not a compiler engineer, I can’t give you a specific, technical answer.


I was a compiler engineer, even was a committer to V8 itself via Qualcomm's codeaurora legal entity while a Qualcomm engineer, bringing up its backend code generation support to new Snapdragon processors with SIMD extensions about 14 years ago, with a coworker double checking the work on hardware targets. That said, I hope you are right, and it's fun to watch, as time will tell.

So basically following in the footsteps of the JVM and CLR with regard to specializing bytecodes; invokedynamic would be one example.

