GIL removal and the Faster CPython project (lwn.net)
339 points by signa11 | 2023-08-17 21:20:55 | 163 comments




What a great article from LWN. It was well worth reading. As someone who was excited about Sam Gross's NoGIL work when it was first posted here, I think I'm beginning to change my mind after reading this article and reflecting on my own personal experiences.

My experience is with writing backend systems in several different languages (including Python) at various volume/latency/throughput levels. I've basically worked on only two types of systems -

1. one that exposes some sort of an endpoint to the network - it accepts requests of some kind, does computation and other network requests and sends response of some kind (including long polling, ws etc).

2. reads a message from a "queue" (could be database, could be based on polling another api etc) and does computation/network calls and basically sends it to other queues.

Nothing else. Huge variance in specific requirements, but that's it. For the first type of system, latency matters more. For the second system, throughput matters more.

For the first type of system, I want to be able to spin up threads in response to requests, without worrying that an endpoint is too computationally heavy and might block others. I want to be able to share connections to databases in a shared pool. NoGIL would be useful here.

For the second type of system, I can't remember the last time where I wrote one where I had in-process parallelism/concurrency with shared resources (even in langs where there's no GIL). It would just get too confusing and hard to reason about. Any optimizations were mostly based on intelligent batching. For parallelism, you'd just have multiple _completely_ independent processes, probably across multiple machines.

I would absolutely be disappointed if NoGIL meant compromising on the quality of the second type of system here. In practice, most of my mental bandwidth today goes towards making the second type of system better.


To take advantage of NoGIL you don't necessarily need to use parallelism directly. But your web server or async task executor, say, can become more efficient at sharing context between threads.

As a hobbyist who uses python I don't think I'll be directly using concurrency in my code, but I'm betting that over time the standard library and popular external libraries will.

And that will lift everyone's code.


> For the second type of system, I can't remember the last time where I wrote one where I had in-process parallelism/concurrency with shared resources (even in langs where there's no GIL). It would just get too confusing and hard to reason about. Any optimizations were mostly based on intelligent batching. For parallelism, you'd just have multiple _completely_ independent processes, probably across multiple machines.

For myself, the prospect of no-GIL is interesting in that something like my Captain's Log application [0] could benefit from it. For example, I currently use a QThread to implement a JournalParser, which is basically the program's "engine": the parser constantly reads in game events from a player journal file generated by the game Elite: Dangerous (and Odyssey) and, depending on the particular event, fires off a related custom QSignal, which is then processed by whichever slot (receiving function) is listening for that Signal.

There are other places in that application where no GIL might be quite handy.

In other words, I can see where having no GIL can be useful for GUI applications like mine.

[0] https://captainslog.scarygliders.net/captains-log-2/


Your JournalParser sounds like it could be implemented using normal Python threads or an asyncio event loop without much of a performance problem. If I understand correctly, all it's doing is watching for events and posting a signal somewhere, so it doesn't sound like the kind of application that is CPU-bound.

I could. But I 100% take full advantage of Qt's signal and slot mechanism.

Also, in many ways, GUI applications written in Python are not so much CPU bound, but Python GIL bound. If you're writing a Python/Qt application, you have to take great care to ensure your GUI doesn't freeze when your program is performing, say, many database inserts; if you have some naive loop which performs some given operation, your nice Qt GUI will freeze right up until that operation is complete. Right now the solution is to perform such operations in, say, a QThread, and use Qt's signal/slot feature to blat a progress "report" to a handler in the `main` Python loop.
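To make that pattern concrete, here is a minimal sketch (assuming PySide6; PyQt is analogous, and do_insert is a hypothetical database helper) of a worker QThread that emits a progress signal consumed by a slot on the GUI thread:

    from PySide6.QtCore import QThread, Signal

    class InsertWorker(QThread):
        progress = Signal(int)              # emitted from the worker thread

        def __init__(self, rows):
            super().__init__()
            self.rows = rows

        def run(self):                      # runs off the GUI thread
            for i, row in enumerate(self.rows, 1):
                do_insert(row)              # hypothetical database helper
                if i % 100 == 0:
                    self.progress.emit(i)   # queued to the GUI thread's event loop

    # on the GUI thread:
    #   worker = InsertWorker(rows)
    #   worker.progress.connect(progress_bar.setValue)
    #   worker.start()
Cross-thread signal emissions are delivered through the receiver's event loop, which is what keeps the GUI responsive.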

So back to what I said - no-GIL is looking quite interesting to me. Whether or not Qt can take advantage of such will be a different matter.


I agree. UIs written in Python could benefit massively from noGIL - complex/computational UIs especially.

The GIL is a bottleneck in applications that are CPU bound, e.g. machine learning, so naturally the NoGIL project is not that interesting to people writing server applications.

Of course, one may argue that you probably should not write CPU bound programs in Python in the first place, but that's another story :)


A lot of Java server applications are multi-threaded, not CPU bound, and tend to do a lot of things that generally can't work in Python because of the GIL. It's too simplistic to think of this as something only of interest for CPU-bound stuff. A lot of what Java applications do is, of course, the thread-per-connection style of processing that older Java applications still use (more modern ones would use non-blocking IO and green threads). But there are also background threads doing useful work, or more complex requests that fork off asynchronous work across multiple CPUs and then aggregate the results back as the response. Java apps tend to have vastly more threads than CPU cores. The exception is when things are CPU bound; then you want to minimize the context switching and end up with a number of threads that is close to the number of CPU cores.

The GIL is not about the CPU but about enabling those kinds of things. With the current GIL in place it's very simple: as soon as you hit the global lock, everything stops until it is released. It doesn't matter how many CPU cores you have, they'll be idling while one of them holds the lock. There's barely any point in even trying to do that with the GIL in place. Forget about sharing data between threads. Mostly that's done via queues or databases in python. Removing the GIL will revolutionize a few things in key use cases for python:

- data processing & ETL

- event driven server systems

- machine learning and data science systems

They can all benefit from this, and that's the reason a lot of people are pushing for it. The short-term performance losses are not inherent to removing the GIL, but just a necessary evil while the Python developers deal with fixing the bottlenecks and a few decades' worth of technical debt.


I/O functions (may) internally release the GIL. If the GIL becomes a bottleneck, you are not I/O bound by definition.
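A quick way to see this (a minimal demo; time.sleep stands in for any blocking I/O call that releases the GIL internally):

    import threading
    import time

    def io_task():
        time.sleep(1)    # releases the GIL while waiting, like most blocking I/O

    start = time.perf_counter()
    threads = [threading.Thread(target=io_task) for _ in range(4)]
    for t in threads: t.start()
    for t in threads: t.join()
    print(f"{time.perf_counter() - start:.1f}s")   # ~1s, not ~4s: the waits overlap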

However, you are certainly right that not all server applications are I/O bound. I was a bit sloppy there.


>For the second type of system, I can't remember the last time where I wrote one where I had in-process parallelism/concurrency with shared resources (even in langs where there's no GIL). It would just get too confusing and hard to reason about. Any optimizations were mostly based on intelligent batching. For parallelism, you'd just have multiple _completely_ independent processes, probably across multiple machines.

Interestingly, I'm working on something like this right now and do have large shared resources, which meant I had to abandon a multiprocess strategy.

I don't see why it would be confusing though, provided the shared resources are read only.


For such applications, isn't the parallelism count usually static and limited? I think there will be good benefits for distributed-system frameworks for Python, I'd agree.

A couple of years ago, I implemented in process parallelism for a system I was maintaining at $JOB. I was happy the system was in Go and not Python. But it was an exception to the rule in my experience.


I can't imagine a more difficult test case for open source.

The decision is entirely reasonable: move forward in test mode, treat multiple interpreters as an interim experiment, but target being concurrent. The constraint of running old and new code in the same VM is a tall one. Surprisingly, LWN's summary says nothing about testing, which is largely unsolved and could lead to releases with unknown but serious bugs. Microsoft, Facebook/Meta, and Conda have stepped up with resources, and a super-majority of core contributors want to move forward, but it's unclear what happens if things get hard and more people are needed.

Meanwhile a crazy number of projects in academia and industry from web sites to big data to AI depend on Python. The potential to export costs to python developers might be measured in percentage of GDP.

It doesn't sound like the problems are even known yet. It might be fairer to commit to the Faster CPython approach of knowable improvement over the next 3+ years, but have the concurrent-Python promoters do more than just prototype. They should analyze the kinds of problems that could present themselves, how they could be detected, and what can be done about them. This should be reviewed by people with backgrounds in proving concurrency guarantees. Then the questions can be fairly presented to the steering committee, when the unknowns are at least identified.

There's not a lot of program-management scale decision-making in open-source, but most have been relatively simple questions of driving the market in a given direction (apache, eclipse, linux). This has real technical unknowns.

And I can't help feeling they should also address inter-language ABIs at the same time. A big issue is matching the expected execution model of C. Java's foreign-function and memory interfaces have been incubating for many years, and Swift is also getting better at wrapping C and C++, but FFIs are notoriously (and likely unnecessarily) difficult.


> The potential to export costs to python developers might be measured in percentage of GDP.

Of course, that's also true of the benefits.


The article is well written, and a good history of the whole ordeal. But please note it gives more weight to the "against the GIL" side of the story, and all that can go wrong.

It doesn't highlight enough the other side of the coin:

- Sam's work is very high quality, and he brought with the no-gil some unrelated perf improvements so that people don't feel like they lose too much perf.

- Sam played the open source game perfectly, and has been incredibly patient given what he is bringing to the table and how slow and flaccid the steering council's reaction was (without the community pushing on it, it would still be collecting dust).

- Sub-interpreters have yet to demonstrate any usefulness at all in Python - in fact, any serious metrics at all. This is the first attempt to be this well defined and measured.

- The community feedback showed great interest in this particular project.

- The steering council did conclude "We intend to accept PEP 703, although we’re still working on the acceptance details."

I'm not a no-gil enthusiast. I would be fine with it never being removed, and I think we should try sub-interpreters first.

But what's fair is fair.


What I'd like to see are the performance patches without the no-GIL part.

The performance patches have already been merged and there is ongoing work in the Faster CPython project.

You're already using a bunch of those in recent Python versions.

> I would be fine with it never being removed, and I think we should try sub-interpreters first.

There is a lot of work going on with sub-interpreters and the per-interpreter GIL is shipped in Python 3.12.

The results are very impressive: https://lwn.net/SubscriberLink/941090/8bcb029dbf548f26/, as good as one could have hoped I think.

It seems to me like the work on sub-interpreters will continue in parallel ;) to the work on free-threading.

Sub-interpreters and no-GIL have different use-cases though.


I'd be interested in introducing appropriate abstractions (ABI) and in a module for parallelism. Whereby one can swap out different implementations for sub-interpreters, green threads, GIL or even hardware or cloud vendor-optimised implementations. Not that dissimilar to Project Loom in Java.
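Something close to that already exists in the stdlib, for what it's worth: the concurrent.futures.Executor interface lets callers swap parallelism backends without touching the calling code (a sketch; the sub-interpreter backend is hypothetical here):

    from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

    def run_parallel(executor_cls, fn, items):
        # the caller picks the parallelism backend; the code stays the same
        with executor_cls() as pool:
            return list(pool.map(fn, items))

    # run_parallel(ThreadPoolExecutor, work, data)   # threads (GIL-bound today)
    # run_parallel(ProcessPoolExecutor, work, data)  # processes
    # a sub-interpreter or green-thread executor could slot in the same way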

Once no-GIL is integrated, what is the use case for sub-interpreters?

I have a use case for sandboxing (without a secure-sandboxing requirement) that it would help solve. But at last read, they discourage/disallow sharing modules across the subinterpreters and reloading all modules in each subinterpreter is not acceptable for my use case.

That's very interesting, thank you for the link. I've used Python since the 00s and I don't ever recall having come across sub-interpreters...

They were always available through the C API but not exposed at the Python level (and until 3.12 you were blocked by the shared GIL, so there was little point anyway).

Now that the work from Eric Snow has been merged, you can use https://pypi.org/project/interpreters-3-12/ to create one from Python code.


For now there is nothing impressive about it. The performance numbers they show off don't include data sharing, which is still very primitive.

Which means there is nothing subinterpreters do that can't be done with multiprocessing.

They show progress, and it's good. But I would wait until we can see gunicorn spawning WSGI interpreter workers and getting performance similar to the setup with regular workers before getting enthusiastic.


To further your point: I've used the technique of launching a separate Python process for performance-critical operations for years on different projects. I can run it with a completely different priority set as well.

So, why would people be against removal of the GIL?

Here are some possible reasons.

They use unmaintained C extensions that won't be updated.

They maintain C extensions that would be complex or painful to update.

They're concerned about fragmenting the ecosystem into GIL and No GIL.

They think that it will make single-threaded Python programs slower.

They have no interest in multithreaded Python and don't think the additional complexity is justified.


> They use unmaintained C extensions that won't be updated.

Well, that's on them.

> They're concerned about fragmenting the ecosystem into GIL and No GIL.

The same way we have a fragmented ecosystem of async vs. sync now? /s

> They think that it will make single-threaded Python programs slower.

Python is already horribly slow. Slightly slower single-threaded programs shouldn't be a problem. But the inability to run multi-threaded programs effectively in this day and age is very much a problem.


I haven't fully confirmed this, but my understanding is the proposal forks the ABI into two versions, which IIUC means any package with native extensions needs to release 2x as many wheel files (and possibly that I need two versions of the interpreter installed to build them, though I'm not sure). Seems like a mess.

PEP582 - “Python local packages directory” is my personal bugbear when it comes to Python annoyance: https://peps.python.org/pep-0582/

I believe PDM still supports it. I just hate virtualenvs for builds and deployments and wish Python could just do what JS does. It’s been proven that it can. https://discuss.python.org/t/pep-582-python-local-packages-d... Is the discussion thread. Also frustrating that “exactly one correct way to do something” seems to be one of the justifications thrown around for rejecting this when I’ve never found that to be true in Python.


Yeah, python is great but I wish package management was more standardized and simpler.

It’s more important to prioritize single-threaded performance because it’s much harder to improve by throwing money at the problem.

With multithreaded performance, you can just add another core to (more than) offset whatever overheads there are from using process-based parallelism.

I think that this entire GIL vs No-GIL dichotomy is misguided. The biggest problem people have with multiprocessing is that you can’t share memory. So add virtual processes with an explicit mechanism for memory sharing. Then you can keep all of your single-threaded optimizations like refcounting without barriers because the objects for one thread will stay in that thread.


Agree 100%.

If you need concurrency at the moment, you have already switched to using multiprocessing, so having no-GIL multithreading is useless.

The only issue with Python/multiprocessing, is that sometimes you don't want queues, but shared mutable state. And as you said, placing Python objects in shared memory at the moment is convoluted, restrictive, and suboptimal.

Fixing _that_ should be the objective. What Python needs is better support for placing native instances in shared memory.


> If you need concurrency at the moment, you have already switched to using multiprocessing, so having no-GIL multithreading is useless.

> The only issue with Python/multiprocessing, is that sometimes you don't want queues, but shared mutable state. And as you said, placing Python objects in shared memory at the moment is convoluted, restrictive, and suboptimal.

The PEP goes into the motivation behind this work, and using multiple processes does not magically solve all the issues:

> Multiprocessing, with communication via shared memory or UNIX sockets, adds much complexity and in effect rules out interacting with CUDA from different workers, severely restricting the design space.

> I reimplemented parts of HMMER, a standard method for multiple-sequence alignment. I chose this method because it stresses both single-thread performance (scoring) and multi-threaded performance (searching a database of sequences). The GIL became the bottleneck when using only eight threads. This is a method where the current popular implementations rely on 64 or even 128 threads per process. I tried moving to subprocesses but was blocked by the prohibitive IPC costs.

> NumPy does release the GIL in its inner loops (which do the heavy lifting), but that is not nearly enough. NumPy doesn’t offer a solution to utilize all CPU cores of a single machine well, and instead leaves that to Dask and other multiprocessing solutions. Those aren’t very efficient and are also more clumsy to use. That clumsiness comes mainly in the extra abstractions and layers the users need to concern themselves with when using, e.g., dask.array which wraps numpy.ndarray. It also shows up in oversubscription issues that the user must explicitly be aware of and manage via either environment variables or a third package, threadpoolctl. The main reason is that NumPy calls into BLAS for linear algebra - and those calls it has no control over, they do use all cores by default via either pthreads or OpenMP.

and it discusses the alternatives at https://peps.python.org/pep-0703/#alternatives.


You don’t need OS processes for multiprocessing. You can use threads in the same OS process. See: Erlang.

Would the work on sub-interpreters be interested for that then (https://lwn.net/SubscriberLink/941090/8bcb029dbf548f26/) ?

Yes, this is exactly what I'm talking about.

You don’t need to share objects. Have explicitly shared buffers instead. Python is a dynamic language so you can easily build proxy objects that are views into a shared buffer, and this allows you to keep all your single threaded performance because no objects are shared.

For example:

    # 'sharedbytes' is a hypothetical API; GiB and lockid are assumed defined
    buf = sharedbytes.alloc(1 * GiB)
    with buf.lock(lockid):
        buf[10:20] = b'\x2a' * 10    # fill ten bytes with the value 42
    message_other_process(buf.id)

    # other process
    def recv_buffer(bufid):
        buf = sharedbytes.get(bufid)
        with buf.lock(lockid):
            print(buf[10:20])
Most people don’t even need that and would be satisfied with just virtual processes and copying message passing between them.
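For what it's worth, the stdlib has already moved a little in this direction: since Python 3.8, multiprocessing.shared_memory gives you raw shared buffers between processes, though locking is left entirely to the caller. A minimal sketch:

    from multiprocessing import shared_memory

    # process A: create a named shared buffer
    shm = shared_memory.SharedMemory(create=True, size=1024)
    shm.buf[10:20] = bytes([42] * 10)
    # ...send shm.name to the other process, e.g. over a queue...

    # process B: attach to the same buffer by name
    other = shared_memory.SharedMemory(name=shm.name)
    print(bytes(other.buf[10:20]))
    other.close()

    shm.close()
    shm.unlink()    # free the segment once everyone is done
What's missing relative to the proposal above is exactly the per-buffer locking and the proxy objects.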

> so having no-GIL multithreading is useless.

Wrong. Processes pose hurdles, via the limits of IPC and of control between processes, that threaded applications don't have to bother with. There are ample examples of this in the PEP.

> What Python needs is better support for placing native instances in shared memory.

If that were the case, then threading wouldn't exist in the first place. Shared memory is still IPC. It still requires context switching into the kernel. It still poses problems that threads don't have.


You are talking nonsense...

> It still requires context switching into the Kernel.

There is no less context switching between two threads than between two processes.

> It still poses problems that threads don't have.

Uh, no, it's actually the same problem. If you see threads as different processes with the heap being mmap'd at the same location, then you're basically 99.9% right, up to some minor process-accounting metadata differences.


> There is no less context switching between two threads than between two processes.

I am not talking about switching context between the threads of execution. I am talking about a context switch into kernel code simply to access the shared memory. All interactions with shared memory require syscalls.

So no, I am not "talking nonsense", and yes, communication between multiple processes requires more context switches than communication between threads.


> I am talking about a context switch to kernel code simply to access the shared memory. All interactions with SM require syscalls.

Please enlighten me with the syscall you use to access shared memory... because there are none.

As far as the kernel is concerned, "memory" is just a set of pages mapped to a process at a specific address. These pages can be anonymous or named (meaning other processes can map them through that name).

There is no syscall, no context switch, involved to read or write memory, named or anonymous. Hell, that's even the whole purpose of memory mapped IO.

As mentioned earlier, there is also no real difference between a thread and a process as far as the kernel is concerned. A thread is just a special case of a process which shares its heap with its parent.

Even the word "thread" doesn't really exist for the Linux kernel. We just call these "lightweight processes", because these are just processes with a shared heap.


> Please enlighten me with the syscall you use to access a shared memory... because there are none.

Oh rly?

https://man7.org/linux/man-pages/man2/shmat.2.html

https://man7.org/linux/man-pages/man2/shmctl.2.html

https://man7.org/linux/man-pages/man2/shmdt.2.html

https://man7.org/linux/man-pages/man2/shmget.2.html

Pretty sure it says something about "System Calls Manual" at the top of all these man pages ;-)

And IO on the SHM once it's attached may look "free", but it isn't. The memory mapping incurs further overhead which simply doesn't exist for threads: a thread's heap space is the same address space as that of its siblings in the same process.


No but seriously, are you actually trying to understand and learn, or just stuck trying to google your way out of your mistakes?

1) You linked the man pages for SysV shared memory. Everyone switched to POSIX shared memory literally 20 years ago.

2) Furthermore, these man pages don't even support your argument. Did you actually read them? These are just for the mapping, not for actually accessing (read/write) the shared memory.

I would suggest, for your own future development, that you care less about trying to look knowledgeable on the internet, and more about actually being so.


> You linked the man pages for SysV shared memory. Everyone switched to POSIX shared memory literally 20 years ago.

Which is irrelevant, because there is barely any functional difference between the two. POSIX SHM uses a better API, providing a file-descriptor like object. That's all.

And yes, using that API also requires syscalls.

> Furthermore, these man pages don't even support your argument.

Wrong, they absolutely do. "Accessing" something doesn't just involve the IO, it also involves the setup. And this requires syscalls, in SysV as it does in POSIX.

> Did you actually read them? These are just for the mapping, not for actually accessing (read/write) the shared memory

I am well aware of that, hence my separate mentioning of IO later in my post. ;-)

And the argument in that post stands as solid as it did before. `mmap` MAPS memory. This mapping requires an address-translation overhead EVERY TIME THE MEMORY IS ACCESSED. This translation doesn't happen in userspace.

Now, there is a way around that, in principle: If I mmap with ANONYMOUS and SHARED, and then `fork()`, I could have the same mapping in the child process. The problem here is: This relies on the forking after the mmap(). Any newly created objects, if I manage to map them into the child process, will again require address mapping. Again, this isn't an issue in threads. As soon as I create a new object on the heap, it is available to every thread, under the same address.
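That fork-after-mmap pattern is easy to demonstrate from Python itself (POSIX only; a minimal sketch):

    import mmap
    import os

    buf = mmap.mmap(-1, 4096)    # anonymous mapping, MAP_SHARED by default
    pid = os.fork()              # child inherits the mapping at the same address
    if pid == 0:
        buf[0:5] = b'hello'      # plain memory write, no syscall involved
        os._exit(0)
    os.waitpid(pid, 0)
    print(buf[0:5])              # b'hello': the parent sees the child's write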

But hey, what do I know. But I think the people who are going to dedicate countless hours of their lives making actual parallel processing via multithreading possible in Python know why they are doing so. As do the people who implemented abstractions around threading in basically every major programming language ;-)


> Which is irrelevant, because there is barely any functional difference between the two

Well, it is relevant: it shows that you're not familiar with the subject and most likely googled for something that looked related and found outdated man pages.

> Wrong, they absolutely do. "Accessing" something doesn't just involve the IO, it also involves the setup

I think common sense would disagree, but if that's your line of arguing, sure.

> And the argument in that post stands as solid as it did before. `mmap` MAPS memory. This mapping requires an address-translation overhead EVERY TIME THE MEMORY IS ACCESSED. This translation doesn't happen in userspace.

How deep are you willing to dig your hole there?

Two messages ago you were arguing that accessing shared memory required a syscall and thus was slow. Now you've accepted it doesn't require a syscall, but argue it requires some magical dynamic translation by the kernel?

I'm sorry, but again that is just wrong. All the kernel does is inject the mappings when setting up the TLB/MMU at a context switch.

The virtual address translation happens without kernel intervention, through the MMU, in _exactly_ the same manner whether this is a named or anonymous memory mapping, whether it is a shared or heap page. In fact the MMU couldn't care less what these addresses are, it has no concept of the meaning of what it translates.


> it shows that you're not familiar with the subject and most likely googled for something that looked related and found outdated man pages

That is an assumption, for which I have yet to see proof :-)

> I think common sense would disagree

Please, do enlighten me: How do I access SHM without mapping it first, or otherwise setting it up?

> Now you've accepted it doesn't require a syscall

There was nothing to accept. If you actually read my post, you will see that I talked about the issues of IO and setup separately.

And my point still stands ;-) SHM requires syscalls that simply aren't needed when threads directly share one address space.


The fallacy here is assuming this is a necessary trade-off. You can have good single-threaded performance without a GIL. Rust has this notion of zero-cost abstractions, and it does threads just fine. Java does single-threaded stuff just fine as well. Lots of other languages do. This is not science fiction. The locking gets optimized away. Or you can choose to have code without locks at all. Or fine-grained/structured concurrency. Something being thread safe or not is basically just an API contract. And of course optimistic locking is also a thing. There are no good reasons why Python wouldn't be able to do similar things. But you need to get rid of the GIL first.

A lot of the performance reduction in Python without the GIL is basically just unaddressed technical debt. That should be fixable over time. Adding a lot of locks is a stopgap solution, and that indeed makes things slower. But the proper fix is probably rethinking how that stuff works internally in a lot of places, or having API contracts that document the thread safety or lack thereof.

And also having faster python runtimes and compilers actually enables re-implementing a lot of things that currently depend on native libraries in python. A lot of native code interactions are precisely where you need locks. Unless you change how that works. The point of removing the GIL is getting systematic about finding and fixing those things. It will get better over time.


> The fallacy here is assuming this is a necessary trade-off. You can have good single-threaded performance without a GIL.

It depends what you're doing. Classic example: I have a huge dataframe, 10s-100s of GB. I want to process it in multiple threads, with each thread handling a different part of it. If I just use multiprocessing, it has to copy the memory to the new processes and then copy the results back, which is super slow. Sure there are hacky work-arounds and alternative approaches, but in a language without the GIL I don't need to fiddle around with those, I can just do the equivalent of Pool.map() and it doesn't need to copy anything.


1) GIL or no GIL does not make a single visible difference in terms of your Python code, how it works, what it can or cannot do. It's only a matter of whether the interpreter internally uses locks or not.

2) There is no more copying in multiprocessing than in multithreading, with the very minor exception of the reference-counting structures used internally by the interpreter, which will get copied on write.

3) The problem you describe is not related to multiprocessing or multithreading; it's just a misunderstanding of the Pool.map() API. What you provide to Pool.map() is sent over a queue (and thus pickled) to worker threads or processes. You don't _have_ to use this queue, so long as your function has a way to access the variable you want to use. That is, the following code will have your subprocesses share, without copying, your 100GB dataframe (assuming a fork-based start method, as is the default on Linux):

    import multiprocessing

    import pandas as pd

    DF = pd.read_parquet('100GB.pq')

    def worker_process(id):
        # Do something with DF; forked children see it copy-on-write
        return 42

    with multiprocessing.Pool(10) as pool:
        pool.map(worker_process, [1, 2, 3, 4, 5])

I understand the desire for multi-threaded CPython scripts, but why not do a PHP5 -> PHP7 and focus on single-thread performance more, particularly around implementing a JIT? We know from PyPy (and other JIT-enabled languages like NodeJS) that performance improvements can be significant. I feel like that should be the primary focus for speedups, and have always wondered why the idea has never been more popular. Has a CPython JIT been ruled out for some reason I'm not familiar with?

You should look into the copy-and-patch efforts underway for Python[0]; an actual JIT will probably never exist, but I think copy-and-patch has a shot at being mainlined in the next few years, such that Python could dynamically choose to either run the interpreter or a copy-and-patch option.

0: https://github.com/faster-cpython/ideas/issues/588


> Has a CPython JIT been ruled out for some reason

There have been several attempts at integrating a JIT into CPython. Google made a stab at it back around version 2.6, and there was a second attempt around the 2->3 change to build a JIT on top of LLVM. However, none of these projects produced results that the core developers felt were good enough across the board, and they had problems with backwards compatibility and a lot of cross-platform LLVM issues (as LLVM was a pretty new project at the time), so they were dropped after Google stopped funding the work.

That being said, there is still work being done and people are hoping for at least some JITing in upcoming Python. Pyston is a project being worked on by Guido van Rossum himself, and both Microsoft and Instagram have their own JITed versions of Python they use internally. Some initial parts of Pyston are even scheduled to be included in Python 3.12 or 3.13.


PyPy breaks compatibility with CPython in significant ways; as far as I understand, the current no-gil proposal only got through because it has an automated fallback for modules that are not no-gil compatible.

In what way does PyPy break compatibility?

Many little ways and a few bigger ones.

The biggest one would be not using reference counting, which makes execution of cleanup code in __del__ somewhat non-deterministic, and makes tricks that rely on the exact reference count of an object outright impossible (for example, reusing "immutable" objects that are only referenced once).
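A tiny illustration of that difference (a sketch; CPython's behavior follows from refcounting, PyPy's timing depends on its GC):

    class Holder:
        def __del__(self):
            print("cleaned up")

    h = Holder()
    del h    # CPython: prints immediately, the refcount hit zero
             # PyPy: prints whenever the GC next runs, possibly only at exit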

I thought that the C-API still had significant differences but going by its documentation it has a compatibility layer that is almost complete if somewhat slow.

A full list with all the minor differences is here: https://doc.pypy.org/en/latest/cpython_differences.html


The faster-cpython folks seem to be working towards a JIT (https://github.com/faster-cpython/ideas/tree/main/3.13) and both pyston and cinder have JITs. So I don't think anyone has ruled one out.

Excellent write up as usual from LWN.

I love the Python community. It's really a leading light for open source software. And it shows what transparency and good governance can achieve.

Although I appreciate the engineering hours that Meta, Microsoft and others give, it's still pretty miserable compared to the value that the whole tech industry (and beyond, with data science) extracts from Python and other open source software.


We can all help to change that, within our own organisations!

I did my bit at JPMorgan 8 years ago, convincing the tech leadership team to sponsor PyCon UK, plus a recruiting stand and supporting a group of junior developers from all JPMorgan's UK locations to attend. I left JPM 5 years ago now, and they are still PyCon UK's headline sponsor.

Compared with the enormous benefit we got from Python and its open source ecosystem, it was a totally negligible cost.


They pretend to be transparent. All actual decisions are made in backrooms. The mailing lists are censored and the inner circle cannot be criticized.

Real contributors are exploited by those who work for the right corporations, do very little and go for any clerical position of power.

Do not be misled by the LWN articles, which are very kind and always namedrop the deciders. It is selective reporting.


> The mailing lists are censored and the inner circle cannot be criticized.

Well, HN isn't censored in this regard (unless you believe their influence extends here too). What specific criticisms do you have against "the inner circle"? And what evidence can you show for them?


> There was a question from Shannon about ""what people think is a acceptable slowdown for single-threaded code"" ... he had estimated an impact "in the 15-20% range"
Horror!

To me, the acceptable slowdown is exactly zero.

I can already use multiple cores by running multiple Python processes.

Any slowdown of single-process performance would be a terrible step backwards.

Python is already slow. It should look at PHP and see what it can take from PHP's book. PHP 7, which had no JIT, was already about 6x faster than Python. PHP 8 is even faster for some workloads. I'm not sure the overhead of a JIT makes sense, though. But PHP 7 is a good place to look for a performance benchmark.


> To me, the acceptable slowdown is exactly zero.

This has been the reason most of the previous GIL removal projects failed. GvR had a strict zero slowdown in single-thread performance rule for any nogil patches and none of them managed that. I wonder if the core team has decided to abandon that rule now that GvR is gone?


I saw comments that, if introduced together with the Faster CPython single-thread performance optimisations, nogil would end up no slower than the previous version. So it is the best time to implement it.

>I can already use multiple cores by running multiple Python processes.

Because you have terribly simple processes that don't need to do synchronization. Even a simple parallel map would have Python scream and kick, because the GIL prevents it from having any kind of reasonable performance, and your multiple processes can't handle it unless they all write to files.


> Horror! To me, the acceptable slowdown is exactly zero.

If you care about single-digit percentages of performance, why are you using Python in the first place? You're likely already losing 100x to more performant languages.


Because I care about elegance of code and developer productivity even more. And there is no other language which comes close to Python in these regards.

Elegance of code is subjective and developer productivity is more a trait of a developer than a language (which means it is inherently subjective, e.g. in my case Python is among the languages I was least productive in; I find myself much more productive in Scala, Kotlin, Rust and even R for statistics).

PHP doesn't have the same constraints around it as Python. With Python, you have long-running threads of execution, while this is rarely the case in PHP. PHP has no shared state outside of the interpreter, so it's inherently GIL-less.

PHP got threads late and as an extension. The issues with the GIL centre around the C extension API. PHP doesn't have to deal with that, so they were able to concentrate on making single threaded performance as good as possible. The GIL also blocks some potential avenues around making single threaded performance better, which is why the plan for this current attempt is to make it opt-in.

Also keep in mind that all this is happening concurrently with efforts around making single-threaded performance better, with the idea being that any loss of performance due to the GIL removal should be compensated for by other speed improvements.

If we could go back in time and make the C API more like that of, say, Lua, that would solve the issue, but we can't.


Multi-threading gets you programs that are, at the very most, 8x faster. Is it really worth it? If this makes it even slightly more difficult to develop Python for normal people then it will get forked, and that sucks. If it can be done without any impact then great, but I'm still not sure it's worth it.

> at the very most, 8x faster.

Where does that number come from? My current desktop has 32 logical cores and it has been a few years since I bought that one.


Maybe it's a reference to Amdahl's law?

If you can parallelise 90% of your code, you get only a 10x improvement even on an infinite number of cores.
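For reference, Amdahl's law gives the speedup for a parallel fraction p on n cores as 1 / ((1 - p) + p / n), which tops out at 1 / (1 - p) as n grows; a quick sanity check in Python:

    def speedup(p, n):
        # Amdahl's law: the serial fraction (1 - p) limits the total speedup
        return 1 / ((1 - p) + p / n)

    print(speedup(0.9, 8))      # ~4.7x on 8 cores
    print(speedup(0.9, 10**9))  # ~10x, the asymptotic limit for p = 0.9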


You mix up time and code. You don't have to parallelize 100% of your code to scale with cores; you only have to parallelize the code that your software spends most of its time executing, and that is often significantly less than 90% of the code.

> at the very most, 8x faster.

My 64 core server setup says otherwise.

> Is it really worth it?

Yes, very much so.

> If this makes it even slightly more difficult to develop Python for normal people

It won't. Pure python code will very likely not notice a difference.


>There was a question from Shannon about "what people think is a acceptable slowdown for single-threaded code". To a large extent, that question went unanswered in the thread, but he had estimated an impact "in the 15-20% range, but it could be more, depending on the impact on PEP 659".

So, let me get this straight: some of those working on the project to "make CPython faster" think it's acceptable to overnight make most existing Python code 15-20% slower?

I'd say max 5%, and that if the GIL removal was a benefit to other optimizations going forward (they say the opposite: the change will complicate and stall their other optimizations).

Meanwhile, Shannon had a fundraising proposal for him personally "speeding up CPython" 5x back in 2020. Now it's a whole team working on speeding up CPython with much larger corporate support, and it seems their targets are quite a bit smaller?


But consider that in perhaps 99% of current python code high performance (of the pure python code) is not the primary concern. And now for the times when performance does matter you can get a 20x (?) speedup without rewriting in another language.

I'd expect that in many of those cases where people care about performance, they already use multiprocessing (so a 20% single-thread hit from GIL-free CPython would actually be a slowdown for them).

And for those that do not use multiprocess, rewriting their currently single threaded Python code for no-gil shared-concurrency will indeed bring a speedup, but would still be a significant rewrite.

Also there are people using Python for things like web serving, where performance might have been secondary to convenience when they picked Python, but still a sudden 15-20% slowdown will seriously impact their server budgets (or ability to update versions)...


> But consider that in perhaps 99% of current python code high performance (of the pure python code) is not the primary concern.

Then why do we care about GIL removal at all?


I agree that the "faster CPython" project is way oversold. I have compared real world code with numerical and string operations without any network or disk accesses. 3.12-beta only uses 20-25% less time than 3.8.

That's the level of 2.7.

It seems that the old boys are desperate for some bullet-point features for the next release to impress their corporate masters. So they use the work of Sam Gross, but they will slowly take the credit over time.


There's a long history of underfunded, overpromising projects getting tossed aside, not even halfway there, when it comes to a faster Python.

In a PL-fantasy-league they would just heavily sponsor Lars Bak to create a team and work on a new Python interpreter!


Yeah, but this project is actually delivering stuff that's been shipped. That's perhaps why it's taking a bit longer. As far as I can tell, they haven't gotten to the boxing removal stuff yet where the really big wins probably are.

As for your fantasy, we already have PyPy.


>Yeah, but this project is actually delivering stuff that's been shipped.

The initial selling point was ~ 5x speedup over 5 years.

This has been going for over a year now, with far more people and resources plus the backing of a major industry player (MS), and it's been like 20%-30% improvements at best - and even those are in danger of being wiped out by the no-gil change.

>As for your fantasy, we already have PyPy.

That's nowhere near Lars Bak level fantasy results...


It was actually 5x in 4 years (1.5x per year).

As you said, we see 20-30% at best. And several important workloads see significant slowdowns (for example, coveragepy runs 30-50% slower).

The GIL is an academic problem which has little real world relevance.


> I'd say max 5%, and that if the GIL removal was a benefit to other optimizations going forward

Some of the recent performance improvements came out of the no-gil camp to show that there was significant room for improvement even with the gil removed.

> (they say the opposite: the change will complicate and stall their other optimizations).

So they won't actually be stopped, it will just take them a bit longer?


"A bit" could be the better part of a decade, which is an eternity in IT time. And some wont even be possible anymore, or way more difficult, and indeed, be stopped.

I want to see python with automatic and transparent concurrency.

I.e. the programmer writes:

    import glob

    data = []
    for file in glob.glob('*.jpg'):
        data.append(load_file(file))    # load_file: the user's own function
Then Python itself parallelizes that work. It should do that by identifying loops of pure functions and pushing them off to other threads. One should also be able to mark a function as pure. Python itself would be responsible for making everything appear as if the code ran serially - for example by buffering/reordering log statements.

Rationale: Python is designed to be easy and simple. Concurrency in general is not that. Python should only do concurrency in a way that can be made easy and simple.


Not sure if intentional: That's a super frustrating program for a traditional automatic parallelization compiler, e.g., polyhedral:

* glob is an i/o operation: impure

* load_file is an i/o operation: impure

* data.append serializes part of the execution order

---

Interestingly, GPT4 does pretty ok:

    from concurrent.futures import ThreadPoolExecutor
    import glob

    files = glob.glob('*.jpg')
    with ThreadPoolExecutor() as executor:
        data = list(executor.map(load_file, files))
---

Separately, if just going for concurrency for i/o, async/await is pretty amazing:

    tasks = [load_file(file) for file in glob.glob('*.jpg')]

    data = await asyncio.gather(*tasks)
A lot less program transformation is needed, as it's essentially sugar over promises, and it avoids much of the need to restructure code beyond the coloring problem. So I'm not so sure about the case for a compiler here.

> not sure if intentional.

It sure was! I believe any automatic parallelization should be able to deal with all these things.

With a combination of opportunistic execution (ie. guess which bits can be parallelized, and roll back if wrong), and clever heuristics to avoid rollbacks, it should be pretty do-able.

CPU manufacturers do similar stuff at a smaller scale with parallel execution of single-threaded code via Tomasulo's algorithm, and they get massive gains compared to non-superscalar CPUs. They even do similar stuff at medium scale with Hardware Lock Elision.


> Separately, if just going for concurrency for i/o, async/await is pretty amazing:

That's not async or concurrent. You're running a synchronous function in an asynchronous task which cancels out.


Terrible idea. Not only is it completely counter to the python zen koan of "explicit is better than implicit", you really don't want implicit concurrency without some sort of zero-cost abstraction and really strong guarantees against side effects.

> Rationale: Python is designed to be easy and simple. Concurrency in general is not that.

Python is way too dynamic for that. Furthermore, you have a syscall at the start of the loop, and one syscall per loop. Even if you did this in Rust, it probably would not give you the true results you wanted, if that file tree were manipulated in any sort of way during the loop, was a network FS, or any of a number of other assumptions were violated.

That's exactly why automatic concurrency is a bad idea. Now you have no idea if your code is executing in "simple mode" or "complex mode".


[Foreword: I agree that it would make no sense for any language to automatically parallelize loops, especially over filesystem mutations. However I am very happy manually doing it with one function so it's explicit when it happens and maintainers know to take it into it account.]

> Even if you did this in Rust, it probably would not give you the true results you wanted, if that file tree were manipulated in any sort of way during the loop, was a network FS, or any of a number of other assumptions were violated.

That's factually true but not really relevant.

Introducing your own parallelism barely adds to this problem. Your program cannot assume it is not parallel with another program affecting the same file tree in arbitrary ways.

The parallelism you add to your own code (say, with Rayon) doesn't change the set of operations that can occur to those files. If you design your own operations to be non-overlapping then parallelism is no concern, and if you didn't then that's a bug and you might get non-deterministic results even without parallelism.

For what it's worth, even for writing that kind of code, Rust's standard library has taken the recent wave of TOCTOU vulnerabilities much more seriously than contemporaries like C++ implementations. Good writeup here: https://www.reddit.com/r/cpp/comments/151cnlc/a_safety_cultu...

This of course only makes safer operations possible in more cases. Code still has to do what it can with the constraint that filesystem contents may change in arbitrary ways at arbitrary times. Adding your own parallelism does not add significantly to that. And if one is to add parallelism, it may as well be with something as high-level as Rayon and with all of the usual data race protection Rust provides, so that you do get to focus on filesystem concerns and not also on thread-safety in general.


Great exposition of the situation! I was in the no-GIL camp until a few years ago, when I started thinking of all the implications: I have tons of code that modifies global maps from all over the place, and I feel relieved not to have to manage threads like in C/C++.

What I hope will end up emerging as the solution is subinterpreters that are more intuitive to use. Honestly, I don't want to pass in source code, but rather a function or module entrypoint, and have the subinterpreter abstract away the instantiation boilerplate.

At least I hope everyone is keeping an open mind and is willing (and was prepared) to backtrack if something doesn't seem to be working.


Every time the GIL removal is discussed, I notice a tidal wave of negativity in the comments, which is completely surprising to me. I suspect some critics primarily deal with thread-parallel programs that focus on asynchronous data sharing and task signaling. In languages like C++, handling these with threads and locks is tricky, and you're all but guaranteed to produce subtle race conditions. I'm totally with you on this – I wouldn't want to manage this in Python either.

However, consider parallel data processing, a task I face so often in computer vision. In C++, I can slap a `pragma omp parallel for` onto a for loop processing a list of images/videos/meshes/csvs, increasing the speed of that section of the program by the number of CPU cores. I do this when the problem is well defined (one input, one output each), and it hasn't blown up in my face once. In Python, on the other hand, it feels cumbersome to spin up worker processes for such tasks, not to mention harder to debug. For this kind of work, removing the GIL would be a massive relief.
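For concreteness, here is a minimal sketch of that kind of loop in Python (process_image is a hypothetical one-input, one-output function). Today the pure-Python parts serialize on the GIL; on a free-threaded build the same code could use all cores, much like `pragma omp parallel for` does:

    from concurrent.futures import ThreadPoolExecutor
    import glob

    def process_image(path):
        ...    # hypothetical: decode, transform, write result; no shared state

    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(process_image, glob.glob('*.jpg')))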

It's good that people are highlighting the issues. But to add my weight to the scale, the GIL is the main barrier in Python to stand in my way to speed up my programs. So I'm very happy that they're investing their resources to build a proper way of doing thread-parallelism.


I want to let you know that I agree with you 100% and face the same issues. It is baffling to me how this has not been solved by now, especially with Python's dominance in computer vision and machine learning.

It's worth remembering that the first serious GIL-removal patch was submitted against Python 1.4, so it's not like people haven't been trying. There have also been (not 100% CPython-compatible) Python implementations without the GIL that people could have used, but none of them got any traction.

I suspect the negativity you're seeing comes mainly from two sources. First, people are terrified of another 2->3 situation. If removing the GIL ends up breaking existing libraries or causing significant single-core performance degradations, then that could do serious damage to Python, and no one wants to go through that again.

Secondly people who have been working on this for a long time get annoyed when people make removing the GIL sound easy and imply that the only reason it hasn't been done is because the developers are lazy, incompetent or haven't thought of it. Various people have been trying for literally decades, but so far no one has managed to come up with a solution that is both backwards compatible and performant.

Everybody agrees that removing the GIL would be fantastic. However most of the core developers aren't willing to sacrifice backwards compatibility with existing code to get there.


I guess I think of highly parallel CPU-bound work as being fundamentally against the ethos of Python, which is to be a scripting/glue language on top of native extensions. Thus anything (like GIL removal) that makes it harder to write native extensions is harmful.

Even if the GIL is removed, the task you’re talking about would be dozens of times slower in Python than in a better-suited language, so why not write it in C++ or Rust and then call into that from Python?

I could be wrong or missing something, and I’m not a Python developer so I don’t really have a dog in the fight, but I just wanted to try to explain where some of the criticism is coming from.


It's exactly the high-performance native extensions that require the GIL removal to allow for a full utilization of the machine. Although already successful as glue for these libraries, Python is ultimately hamstrung by the GIL on many-core machines. Using the GIL removal to start writing algorithms in interpreted Python is not the intention.

I warmly recommend reading the motivation section of the PEP. It is incredibly well written, quoting issues from PyTorch, scikit and NumPy:

https://peps.python.org/pep-0703/


That doesn't make sense to me, since native extensions can and do drop the GIL. Try calling `torch.set_num_threads(16)` and then doing some CPU-intensive operation on a large tensor; you will observe the Python process using 16 cores.

What am I missing?


I think you are missing that the world is a little messy. You are correct that native extensions can and do drop the GIL. But data needs to be loaded, and it needs to be preprocessed (does your image downsampling library drop the GIL? Does the video decoder drop the GIL? How do you find out? And if you need to do a little string processing on your data, well, then you're out of luck). Hardware devices need to be managed. Progress needs to be logged, and operations need to be debugged. There isn't a perfect kernel that can be run for every operation; you may have to run a little Python code before calling into your native code.

Again, I would encourage you to read the motivation section of the PEP, which shows a multitude of frustrations with the current state. The issue is subtle enough that I'm struggling to summarize it in a single sentence.


Preface: I rarely use Python. But to me, the logical, manageable thing seems to be a double implementation: one version with, one version without the GIL. Pick the one that suits you.

Computer vision often has tasks independent of the global state; pure functions, if you will. But many applications have different "state" pools all over the place. JIRA is written in Python. Their library for accessing their API alone is 8k lines of Python. I suspect JIRA's code base won't be all pure functions.

A large group (probably a comfortable majority) uses Python at a basic level. They may use numpy and matplotlib, but that's about it. They will notice a slowdown when the GIL is removed. Don't expect smiles.

More advanced users may use concurrent code. If you use an external C library, chances are it has to be updated to work with a no-GIL Python version. This won't go swimmingly. It puts the burden on the maintainers of those libs, and they may not be able or willing to fix the problem. That will lead to unhappy faces.

And then there will be the subtle bugs in complex systems. Python programmers are not really used to synchronization problems, so they'll slap locks all over the place, leading to slowdowns or deadlocks. More unhappy faces.

So I see the commitment to a single, no-GIL version as a serious threat to Python.


I am thinking about this. I think our distinction between I/O-bound tasks vs. compute-bound tasks helped design in some ways. However, as time goes by, the code you are dealing with can bounce between these two modes, and that is where all the limitations that seemed good for you start to be bad.

That's what I suspect is happening in the Python / scientific-computing community. They thought (and some people in this thread still think) that Python can be the driver that waits for compute-bound tasks to finish (effectively making the Python code I/O-bound, i.e. waiting for another language to finish). But over time, it changed. People write more code in the Python driver, and suddenly the very optimized C code is no longer the bottleneck; the Python code is.

We've seen this bouncing pattern enough times and really need to think through whether we want two different design patterns to deal with I/O-bound problems vs. compute-bound problems.


The best way to remove the GIL completely is to switch languages to Go; unfortunately, this is not always possible.

From one of the linked threads: [1]

> It might be good to remind the readers that the GIL removal has very little chance to break a Python-only codebase

Is this actually true? I was under the impression that some multi-threaded Python code relies on some operations being implicitly thread safe due to the GIL. For example, adding an item to the same list from two concurrent threads is never going to corrupt the list, simply because the threads never do that operation in parallel (the GIL prevents the threads from running in parallel). If you remove the GIL, suddenly you'll punish this kind of code, just like C++ quickly punishes concurrent mutation of an std::vector.

I'm not 100% sure of this, but I find this sentence a bit suspect.
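To make it concrete, here's the pattern I mean (a minimal sketch); today the GIL guarantees the list is never structurally corrupted:

    import threading

    items = []

    def worker():
        for i in range(100_000):
            items.append(i)  # effectively atomic under the GIL

    threads = [threading.Thread(target=worker) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    assert len(items) == 400_000  # holds today; the question is whether it still would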

[1] https://discuss.python.org/t/pep-703-making-the-global-inter...


Yes, this is a thing that was trivially protected by the GIL. There is the same issue with mutating the same map concurrently in Go, which will panic, for example.

PEP 703 goes over this in the "Container Thread-Safety" section (I think "container" here refers to objects that hold references to other objects, the things that are already special-cased in CPython to be managed specifically by the garbage collector):

> This PEP proposes using per-object locks to provide many of the same protections that the GIL provides. For example, every list, dictionary, and set will have an associated lightweight lock. All operations that modify the object must hold the object’s lock. Most operations that read from the object should acquire the object’s lock as well; the few read operations that can proceed without holding a lock are described below.

More information at https://peps.python.org/pep-0703/#container-thread-safety
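Conceptually, a per-object lock amounts to something like this (a toy sketch in Python; the real thing is a lightweight lock implemented in C on the object itself):

    import threading

    class LockedList:
        # Toy model of PEP 703's container locking: mutating operations
        # (and most reads) take a lock owned by the object.
        def __init__(self):
            self._lock = threading.Lock()
            self._items = []

        def append(self, item):
            with self._lock:
                self._items.append(item)

        def __getitem__(self, index):
            with self._lock:
                return self._items[index]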


Thank you! That makes sense, and it also explains why removing the GIL has a negative performance impact as discussed in other comments. Taking a lock every time a container is accessed is significant overhead, which is why languages like C++ don't make basic containers thread-safe.

Effectively the GIL is incurring that overhead on every data structure whether you need it or not.

But with it removed, you'll have to think about it in your designs more than currently. History shows us this is not easy.


> Effectively the GIL is incurring that overhead on every data structure whether you need it or not.

Not really. The GIL is taken and released quite infrequently (only when the Python interpreter decides it's time to do a context switch), whilst the new locks for each data structure are taken/released every time you do a basic operation on those data structures.

Holding a lock that is rarely taken/released incurs very little overhead.


Hm, that's a fair point; it's only some of the same overhead, because it's not per-operation.

Note that concurrency of containers is not specific to Python in any way.

For example, Java implements different versions of containers for single-threaded and multithreaded usage, because multithreaded containers have an obvious performance penalty:

https://docs.oracle.com/javase/8/docs/api/java/util/concurre...



> For example, Java implements different versions of containers for single-threaded and multithreaded usage, because multithreaded containers have an obvious performance penalty.

Very few codebases in Java are single-threaded; specialty frameworks like Netty are the exception, not the rule.

Likewise, there are not different containers for single-threaded and multithreaded usage; there are containers that have different strategies for dealing with multi-threaded usage.

Hashtable is the oldest and is notable for still being present and still fundamentally flawed. It will lock reads and writes, but does not provide a way to lock for a transaction, e.g. changes based on read data. As such, it fundamentally has race conditions you can't protect against.

HashMap and the rest of the Java 1.2 collections API set a slightly better pattern: they don't internally try to maintain safety, but provide mechanisms like synchronizedMap() to let you hold the monitor for the length of your transaction.

However, this could only be so good, because the monitors in Java are pretty fundamentally broken as well. A monitor is both part of an object's public API (e.g. "synchronized(foo) {...}") and part of its implementation (e.g. "public synchronized void bar() { ... }"). This means that external code can affect your internal operation if you leverage the monitor that you get by default through your "this" instance.

As such, a synchronized collection involves three monitors:

1. The monitor on the interface-implementing collection type itself, e.g. on the HashMap. This likely is never used.

2. The monitor on the object returned by the 'synchronizedXXX' wrapping method. This is used to protect transactional access, such as iterating through while removing items.

3. The monitor used as a mutex inside the object returned by the 'synchronizedXXX' wrapping method, protecting the integrity of the collection data type if it is used by multithreaded code which does not hold monitor #2. The code may have a race condition, but it won't put the collection itself into an inconsistent structural state.

The 'synchronizedXXX'-returned wrapper objects are pretty expensive, and if you can, you should just internalize those collections into a business object that does any needed synchronization itself.

ConcurrentHashMap and the like are largely lock-free, and are built with the idea that you can perform the changes needed through atomic operations rather than transactions. This isn't always true, but often is.

For a collection which is always held by a single thread, the atomic operation overhead may still cause a performance impact; after all, the atomic operations are still processor state synchronization points. It is also possible to beat ConcurrentHashMap with a regular HashMap in certain usage patterns in multithreaded environments, when you are properly protecting access to the HashMap yourself.

It might be challenging to find scenarios where ConcurrentHashMap doesn't beat the 'synchronizedMap()' wrapper, just because the wrapper implementation itself is really expensive.


indeed

not sure there's much they can do about this, other than protecting all the built-in data structure operations with mutexes, like java's original data structures (Hashtable, Vector, etc)

(but then how do you get a non-synchronized [] if you want one?)


I think you can make a non-thread-safe list by just making the same object without the lock described at https://peps.python.org/pep-0703/#container-thread-safety if you really want the maximum single-threaded performance.

Maybe it could be part of a non_threadsafe_containers module on PyPI.


You could fast-path it by checking whether the reference count of the list is `1` and avoid taking the lock in that case, I think.

that would work on x86 but not on an arch that can re-order loads (e.g. ARM)

I'm assuming it would be a fenced operation

but then there's still an advantage to using the fenceless version :)

(admittedly it's python so it's so slow it's probably not even measurable)


Yeah, for Python I feel like the difference between fenced vs unfenced doesn't matter. The primary cost is around your L3 cache getting slammed with contentious atomics but your L3 is already absolutely fucked if you're using Python.

Could you? What enforces only a single thread having access to a given reference? What about global variables?

If there's a refcount of 1, you can mutate the value safely, because no other thread could be trying to read/write it. And the only thread that can give it to others would be the one that's doing that mutation, so it can't suddenly change.

I'd assume that importing any sort of module-level variable would imply an increment of the counter, but unsure.
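A hedged sketch of the idea at the Python level (illustrative only; CPython would do this in C on the object itself, and the lock here is a stand-in for a hypothetical per-object lock):

    import sys
    import threading

    _lock = threading.Lock()  # stand-in for a hypothetical per-object lock

    def append_with_fast_path(lst, item):
        # sys.getrefcount counts its own argument plus this frame's `lst`,
        # so 2 means nothing else (and hence no other thread) can reach it.
        if sys.getrefcount(lst) == 2:
            lst.append(item)  # uncontended: skip the lock
        else:
            with _lock:
                lst.append(item)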


My impression was that that was exactly what they were going to do: replace the GIL with fine-grained locking on the objects themselves. I can't imagine they'd let multiple Python threads manipulate Python data structures concurrently; the interpreter would segfault immediately.

> (but then how do you get a non-synchronized [] if you want one?)

You don't. This is one of the reasons why using the GIL is higher performance for single-threaded use-cases: stuff like lists and dicts can be non-synchronized


> not sure there's much they can do about this, other than protecting all the built-in data structure operations with mutexes, like java's original data structures (Hashtable, Vector, etc)

However, this is fundamentally the incorrect approach, because Vector and Hashtable aren't protected from read-then-write race conditions.

Such internal locking guarantees that the collection stays structurally sound, but not that code accessing it is dealing with a single consistent state until it finishes.
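In Python terms the distinction looks like this (a hedged sketch):

    import threading

    counts = {}
    lock = threading.Lock()

    def racy_increment(key):
        # Even if every individual dict operation is internally locked,
        # two threads can both read the same old value here and one
        # update is lost: the race spans the read *and* the write.
        counts[key] = counts.get(key, 0) + 1

    def safe_increment(key):
        # The lock must cover the whole read-then-write transaction.
        with lock:
            counts[key] = counts.get(key, 0) + 1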


Wouldn’t a naive solution just be to have a flag that by default enables the GIL, and when disabled, a warning gets printed?

Nope. Because users would just ignore or silence the warning, continue on, get subtly incorrect, difficult-to-reproduce behavior, submit tickets/issues, and just cause a lot of dead weight overall.

It's not remotely a trivial problem, and I assure you just about every naive solution has been considered and rejected.


List append would probably end up guarded by an instance specific lock. At least it is in some of the nogil concept code.


Quit Python 5 years ago, never looked back. If you want to write some glue code to put pieces together, Golang is like 100x better.

Great.

And when there's a thread about Go I'm sure someone will tell us they don't use it and a different language is 100x better, too.


That language is called C and I use/read/enjoy/teach/love/worship it everyday.

Getting better is a process, you have to take it one step at a time. /s

Is there any chance that Copilot can help transform old-style C extensions into new-style API code? (Since the transformation is probably going to be very similar for all those extensions.)

Thank you to everyone involved in this project and thank you lwn for the article.

I am a beginner in this space, but my hobby is parallelism, coroutines, async and multithreading.

I read about the subinterpreter approach, and it makes me think of actors. You destroy the reference in the source interpreter when you send data to the other interpreter, like message passing. But it should be an O(1) message pass and transfer of refcounting responsibilities. The interpreters would be separate, so you wouldn't have to worry about a dictionary being updated in two threads, because they never touch the same objects.

After reading https://blog.redplanetlabs.com/2023/08/15/how-we-reduced-the... I am thinking about how to parallelise the creation of relationships.

What's a good practice for parallelising behaviour? If I want to fan out a message to 20 million collections, how can I use parallelisation to make it faster?
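For what it's worth, one common pattern today (under the multiprocessing model) is to batch the fan-out so per-task overhead doesn't dominate. A hedged sketch, where deliver() is a hypothetical stand-in for the real per-collection work:

    from multiprocessing import Pool

    def deliver(chunk):
        # Hypothetical per-batch work: apply the message to one batch
        # of collections; what that means depends on your data model.
        return len(chunk)

    def fan_out(collection_ids, workers=8, chunk_size=50_000):
        # Hand each worker a large chunk; 20 million tiny tasks would
        # drown in serialization and scheduling overhead.
        chunks = [collection_ids[i:i + chunk_size]
                  for i in range(0, len(collection_ids), chunk_size)]
        with Pool(workers) as pool:
            return pool.map(deliver, chunks)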


A few questions:

If single-threaded performance is degraded, couldn't these people just use an old version of Python?

With such incredible increases in multithreaded performance, I imagine this is basically infrastructure-tier importance, like something the US government should be funding. Would throwing a billion dollars at it solve it in ~1 year? Or is this going to take 3 years regardless?


I am quite impressed by how Sam Gross pushed for it. After he got a positive but non-committal response from the steering council, it would've been easy to just sit back and hope for some movement. The fact that he disagreed and pressured them is to his immense credit.

As an application developer, I am curious about where this leads and what it empowers.

I hope to see something akin to the Project Loom approach in Java whereby suitable parallelisation abstractions are native, and one configures concrete implementations. Examples off the top of my head are:

1/ Light threads, with a default scheduler implementation provided by CPython, not dissimilar to Golang's. Orthogonal to adjunct libraries or potential language features like channels.

2/ A no-op implementation that keeps the GIL, which could be the default or even an interpreter switch, as discussed in the article.

3/ A stable ABI, so one can replace the CPython implementation with an alternative. Written in Rust or whatever, it could be tailored to particular hardware or a cloud vendor's environment.

4/ Sub-interpreters, where parallelism pays to be far more coarse-grained, could enable Actor-like architectures where spawned processes are new interpreters and the Actor System is a graph of interpreters.

These envisioned use cases offer significant advantages, and I eagerly anticipate this endeavour's evolution. Thanks to all those painstakingly putting so much energy into this.

Edit: spelling and formatting


At my job we get pretty far just using things like einhorn to run a bunch of similar Python processes under the same port; this mostly works around the single-threaded performance problem.

Very exciting. Sam Gross took a very bold move with his "I don't need a yes/no right now, but I do need to know what acceptance looks like, and this issue has been languishing." [0] That interaction could have gone a lot of different ways (especially as a neurospicy engineer myself and knowing the overall spice level in the computer world) but it sounds like that was the exact kick the council needed.

It is still a long and winding path (double-digit engineer-years) to get a no-GIL Python, but at least there exists a proper path from the looks of it.

The hardest part by far will be ensuring the correctness of all existing codebases. It's one thing to say you don't want a 2->3 repeat. It's another altogether if you were to claim no breaking changes, but fear of bugs resulted in folks avoiding the upgrade in practice.

Making GIL/no-GIL even a compile-time switch will absolutely increase the maintenance cost. But I think in the long run, all this effort will be worth it, as I would claim the GIL is a lightning rod for Python criticism. Just peruse any HN thread about Python and parallelism to see what I mean. Maybe it's because it's the one thing folks can point directly to and say "this is why Python isn't as fast as it could be" without understanding the decades of context. It's kind of the Final Boss of Chesterton's Fences in that regard.

[0] https://github.com/python/steering-council/issues/188#issuec...


I'm not quite as optimistic as your take. The decision by the steering council seems wishy-washy to me. It's still possible for a multi-year effort to remove the GIL to get rejected even under these new guidelines.

This is very exciting and I'm thankful that they're actively avoiding another disruptive transition like the one from Python 2 to 3.

My private opinion, but I think removing the GIL is a mistake. Not that many applications will benefit strongly from this; most will suffer a performance penalty. This will take years of attention/effort which could be spent more wisely.

It kind of seems like Python suffers from an insecurity / inferiority complex about the GIL. I would instead go the "JavaScript route" of fully embracing the single-threaded model. Yes, some applications will remain hard / impossible, but I'd argue Python is not a good fit for apps requiring high performance / high scalability anyway. Being somewhat specialized and not supporting every use case isn't necessarily bad.


It may be too late to go with the JavaScript model.

Why?

There’s a lot of deployed Python code that uses multithreading. The genie is out of the bottle, so to speak.

The JavaScript model is, basically, the sub-interpreter model. You have multiple independent heaps. Given independent heaps, you write single-threaded code and don't worry about locking, because no other thread can modify objects in your heap anyway, outside of special objects like SharedArrayBuffer.

Maybe you could enable GIL-free, single-threaded, sub-interpreter Python as a compile-time option. But it would break such a large amount of code out there. It would be a very difficult transition, practically speaking.


The PEP goes into details about the motivations and why the single thread model has limitations: https://peps.python.org/pep-0703/#motivation.

> Yes, some applications will remain hard / impossible, but I'd argue Python is not a good fit for apps requiring high performance / high scalability anyway

People are already trying to use Python for parallel tasks. Forcing people to use another language does not help them much.


Supporting Unicode from the beginning probably would have been a good idea too, but it is what it is. Happy to see them taking this on.

I always thought a great JIT that's compatible with Cython and major projects with C extensions like numpy and scipy would be a more worthwhile effort. A lot of the data-intensive tasks in Python can be run in parallel processes easily, so removing the GIL doesn't seem like a major benefit?

> not a good fit for apps requiring high performance / high scalability anyway

It's way too late to artificially limit Python's reach like this. It's already heavily used in data science and machine learning. People need performance right now for things that they are already doing in Python. We can't go back to a time when that isn't true. Now it's one of the most common use cases for Python. People are building careers on this.


Those people/use-cases don't care about the GIL. When you're doing anything numerical you're calling into NumPy, PyTorch or Polars anyway, and those libraries make use of multiple threads and drop the GIL.

So yes, you can do matmuls to your heart's content in one Python thread and wait for file I/O in another thread without ever running into issues with the GIL. I'd even say if your Python interpreter isn't bored waiting for stuff to happen, you're doing it wrong.
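Concretely, the pattern looks something like this (a minimal sketch; NumPy releases the GIL inside large native operations, and the file path is just a placeholder):

    import threading
    import numpy as np

    a = np.random.rand(4000, 4000)
    b = np.random.rand(4000, 4000)
    result = {}

    def crunch():
        # The BLAS-backed matmul runs with the GIL released, so it
        # executes concurrently with the I/O below.
        result["c"] = a @ b

    t = threading.Thread(target=crunch)
    t.start()
    with open("/etc/hostname", "rb") as f:  # placeholder I/O
        data = f.read()
    t.join()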


> Those people/use-cases don't care about the GIL.

This is not true. The primary funding and motivation for the GIL removal work comes from the numerical computing community. The PEP (https://peps.python.org/pep-0703/) contains direct quotes from folks working on numpy, scipy, PyTorch, scikit-learn, etc. and also practitioners from places like Meta, DeepMind and so on, describing the practical constraints that the GIL places on many workloads.


This doesn't solve the data science and machine learning problems, though. You will still have the problem that optimized code paths will be hundreds to thousands to even millions of times faster (on a GPU) than any Python could be, and falling back to pure Python, accidentally or otherwise, will still incur those slowdowns, even if the resulting pure Python code is embarrassingly parallel and that is fully perfectly exploited.

Personally, I wouldn't be surprised if all this work gets done on GIL-removal, and the end result is that if you run 10 threads in Python you max out at roughly a 3x performance improvement in your pure Python code (and often get less, I do mean that as a max not an average), with all the memory traffic it is doing. Honestly, even if it did perfectly parallelize up to 32 cores, a completely impossible absurdity, it would still be noticeably inferior performance to what many languages will give you on one core. I can't help but think that if you're sitting here waiting with bated breath for the GIL-ectomy to improve your Python performance, that is a sign you should be rewriting your code right now in any number of languages that are simply faster. I very, very strongly suspect that this is going to result in very very disappointing speedups when it is done.


The most obvious performance win for free threading in Python is where you have large read-only, immortal data that is referenced by all threads. The "memory traffic" you mention is a major issue for multiprocessing approaches, and sub-interpreters don't provide access to shared memory in this way, either.

The no-gil project has already been benchmarked to scale basically linearly in this scenario, so your hand-waving predictions are already known to be inaccurate. https://www.backblaze.com/blog/python-gil-vs-nogil-boost-i-o...
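The shape of that workload, as a hedged sketch: under the GIL these threads serialize on the pure-Python lookups, while a no-GIL build could run them in parallel without copying the shared structure (which a multiprocessing approach would have to do):

    import threading

    # Large read-only data shared by all threads, in-process and copy-free.
    index = {i: i * i for i in range(1_000_000)}
    results = [0] * 4

    def lookup(slot):
        # Pure-Python reads: serialized by the GIL today, parallel without it.
        results[slot] = sum(index[k] for k in range(slot, 1_000_000, 4))

    threads = [threading.Thread(target=lookup, args=(n,)) for n in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()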


I'm not talking about the performance that no-GIL is going to get on little bits and pieces of code. I'm talking about the real-world performance it is likely to get on real code. I work in languages that don't have anywhere near the overhead and ceremony that Python does, and when you're writing multithreaded code for the explicit purpose of optimizing some particular bit of slow code, it is quite common to find you completely top out at 3x or 4x performance no matter how many cores you throw at it, because you bottleneck on something other than CPU anyhow. Microbenchmarks in those languages can also get perfectly parallel speedups but that's not what you get in real code.

Again, if you're sitting on slow Python code and hoping for the GIL-ectomy to save you, you should just go write it in a different language right now. A perfectly parallel Python will still be slower on a 32 core machine than other languages can be on a single core... and by some margin!

The competition I'm benchmarking Python's performance work against isn't other Python approaches; it's other languages entirely. Subinterpreters are just a concession to the enormous pile of other preconditions Python (and dynamic scripting languages) brings with it and have no bearing on how other languages work; they're a hack around that problem. They're not a solution any other language would consider without those requirements weighing them down.


The only hope is for Mojo to deliver on a Python-compatible language that compiles straight to the GPU accelerator.

Dynamic/scripting languages will always be slower and will most likely never be suitable for this use case; native libraries will always win, and there will be no point in writing these in pure Python. I concur with the parent that I don't believe there's a strong justification here. Personally, I would focus on making Python a top-quality single-threaded experience, performance included.

We have been in the age of common multi-processors for over a decade; it's time to embrace it, especially if it comes with little or no cost thanks to great interpreter engineers. I will personally embrace it.

> If there's one lesson we've learned from the Python 2 to 3 transition, it's that it would have been very beneficial if Python 2 and 3 code could coexist in the same Python interpreter. We blew it that time, and it set us back by about a decade.

A very characteristic thing for a core Python developer to say: it's stupid, baseless, but gets a standing ovation.

The reason Python 3 wasn't catching on is that it offered no tangible benefits. Had it been, say, 10x faster than Python 2, there would've been a reason. Had it offered sensible concurrency, there would've been a reason. Had it offered some way to automatically verify programs, or a bunch of other useful things you might expect from a language, there would've been a reason to switch. But Python 3 was and still is a worthless change. It just gives you the ability to do the same things you could already do, but in a slightly different way. If the CPython project hadn't refused to support Python 2, it would still be alive and well.

I've lived for a while with a language that had to support two versions at the same time: ActionScript. After AS3 was released, the SWF format and Flash Player changed to support executing both at the same time. The cases where both were used at the same time were very rare, and they were never part of an upgrade process; usually it was because someone had to embed code they were under contractual obligation to execute (such as showing ads). Nobody partially upgraded their projects.

Now, even though Python didn't provide a way to run two versions of the language at the same time, it wouldn't have been hard to do on your own if you really wanted it: say, modify the multiprocessing module to launch Python 2 from Python 3 or Python 3 from Python 2, or create a native module embedding a Python interpreter of a different version. There are other ways as well... but nobody was doing that. Nobody needed that.

There's no evidence anyone wanted to run both versions of Python at the same time. But instead of facing the music and accepting that Python 3 wasn't wanted, these people will pat themselves on the back, tell each other how great their project is, and claim it doesn't succeed for any reason other than their own incompetence.


The 2/3 split was intended to allow Python to get better in the future. Python 3.11 is tremendously better than 2.7, but you couldn't have got to that point without the 2/3 split first.
