
At my job we get pretty far just using things like einhorn to run a bunch of similar Python processes under the same port - this mostly works around the single-threaded performance problem.



I think a lot of this complexity can be avoided by just writing single threaded python and using GNU parallel for running it on multiple cores. You can even trivially distribute the work across a cluster that way.
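
A rough sketch of that workflow, assuming a hypothetical single-threaded work.py that handles one input file per invocation; GNU parallel then fans the inputs out across cores (or across machines with --sshlogin):

    # work.py - a single-threaded worker: one input file in, one result line out.
    # Fan it out across cores with GNU parallel, e.g.:
    #   parallel -j 8 python work.py {} ::: data/*.csv
    # File names and paths here are hypothetical.
    import sys

    def process(path):
        # stand-in for the real CPU-bound work; here, just count lines
        with open(path) as f:
            return sum(1 for _ in f)

    if __name__ == "__main__":
        print(process(sys.argv[1]))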

I like the idea of exploring other models. Shared data and threads is not the only approach to parallelism.

And, usually, when you are stuck on a hard problem, it often pays to take a step back and make sure the problem you are failing to solve is one you actually should be solving. The problem we want to solve is not how to get rid of the GIL, nor how to improve Python's performance with threads, but how to use Python more effectively on multi-core/thread architectures and gain performance from that.

This is not a problem only Python has. The machines I work on most of the time (a Core i5 laptop and an Atom netbook) rarely experience loads larger than 2. There are simply not enough threads to keep them busy.

That's not to say they never get slow - they do - but I'd like to emphasize that the limiting factor here is that we are not extracting parallelism from the software already written. We stand to gain a lot from that.


I’m not saying that performance is the same across these types of solutions, or that the dev work is never different.

I’m saying that the penalties (including development effort) for work co-ordinated between processes compared to threads can vary from nearly zero (sometimes even net better) to terrible depending on the nature of the workload. Threads are very convenient programmatically, but also have some serious drawbacks in certain scenarios.

For instance, fork()’d sub-processes do a good job of avoiding segfaulting/crashing/deadlocking everything all at once if there is a memory issue or other problem while doing the work. It’s also very difficult in native threading models to do per-work resource management or quotas (like a maximum memory footprint, or a max number of open sockets or open files), since everything is grouped together at the OS level and it’s a lot of work to reinvent that yourself, which you’d need to do with threads. Also, the same shared-memory convenience can cause some pretty crazy memory corruption in otherwise isolated areas of your system if you’re not careful, which is not possible with separate processes.
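
As a sketch of the per-work quota point (names and limits here are made up): because the OS accounts for address space per process, each forked worker can get its own hard cap with a one-line resource.setrlimit call in an initializer, something the OS does not track per thread:

    # Sketch: per-worker memory quotas, POSIX-only (fork + RLIMIT_AS).
    # A runaway allocation raises MemoryError in one worker instead of
    # taking the whole service down.
    import multiprocessing as mp
    import resource

    MAX_BYTES = 1 << 30  # hypothetical 1 GiB cap per worker

    def limit_memory():
        resource.setrlimit(resource.RLIMIT_AS, (MAX_BYTES, MAX_BYTES))

    def handle(job):
        return len(str(job))  # the real work would go here

    if __name__ == "__main__":
        ctx = mp.get_context("fork")
        with ctx.Pool(processes=4, initializer=limit_memory) as pool:
            print(pool.map(handle, range(100)))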

I do wish the python concurrency model was better. Even with its warts, it is still possible to do a LOT with it in a performant way. Some workloads are definitely not worth the trouble now, however.


You're forgetting multiprocessing. That gets round this by running multiple Python interpreters. The big problem is that passing objects is dog slow, as everything has to be pickled in either direction.
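
A rough way to see that cost (illustrative sketch, not a benchmark; sizes are arbitrary):

    # Sketch: every object crossing the process boundary is pickled on one
    # side and unpickled on the other, so large arguments and results pay twice.
    import pickle
    import time
    from multiprocessing import Pool

    def identity(x):
        return x  # the result gets pickled again on the way back

    if __name__ == "__main__":
        big = {i: str(i) * 20 for i in range(200_000)}  # tens of MB once pickled

        t0 = time.perf_counter()
        blob = pickle.dumps(big)
        print(f"pickle alone: {time.perf_counter() - t0:.2f}s, {len(blob) / 1e6:.0f} MB")

        with Pool(2) as pool:
            t0 = time.perf_counter()
            pool.apply(identity, (big,))  # serialize out, deserialize, and back again
            print(f"round trip through a worker: {time.perf_counter() - t0:.2f}s")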

I'm one of those with embarrassingly-parallel CPU-bound workloads. The multiprocessing module works, but the extra bookkeeping and plumbing compared to an actually-parallel multithreading implementation is a pain in the butt. That said, the multiprocessing speedup is still both faster and easier to get than porting to another language.

> I may have missed something but I couldn’t figure out how to get the multi-threaded performance out of Python

Multiprocessing. The answer is to use the python multiprocessing module, or to spin up multiple processes behind wsgi or whatever.

> Historically I’ve written several services that load up some big datastructure (10s or 100s of GB), then expose an HTTP API on top of it.

Use the python multiprocessing module. If you've already written it with the threading module, it is a drop-in replacement. Your data structure will live in shared memory and can be accessed by all processes concurrently without incurring the wrath of the GIL.

Obviously this does not fix the issue of Python just being super slow in general. It just lets you max out all your CPU cores instead of having just one core at 100% all the time.
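
A minimal sketch of that pattern, assuming the fork start method so the structure loaded before the pool is created is inherited by the workers rather than pickled per task (names here are made up):

    # Sketch: swap a thread pool for a process pool while keeping a big
    # read-mostly structure visible to every worker. With the "fork" start
    # method (POSIX-only) the children inherit BIG_TABLE from the parent.
    import multiprocessing as mp
    from concurrent.futures import ProcessPoolExecutor  # was: ThreadPoolExecutor

    BIG_TABLE = {i: i * i for i in range(1_000_000)}  # stand-in for the real data, loaded once

    def lookup(key):
        # runs in a worker process, reading the inherited copy of BIG_TABLE
        return BIG_TABLE.get(key, -1)

    if __name__ == "__main__":
        ctx = mp.get_context("fork")
        with ProcessPoolExecutor(max_workers=4, mp_context=ctx) as pool:
            print(list(pool.map(lookup, range(10))))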


And the python people would just point to multiprocessing...which works pretty well.

If Python is slow and you need concurrency, and you have a long-ish timeframe in mind, do yourself a favor and use one of C++/Java/Rust/Go/Erlang/Elixir/(other newer programming language). In particular, don’t rely on multiprocessing.

I’ve been developing and maintaining a largish batch processing Python project where performance is a key feature, and it’s been very frustrating. I should have rewritten it when I had the chance, but I’m slowly outsourcing critical pieces to C++ and Rust libraries. Multiprocessing has been a source of subtle portability errors as it has changed over different minor versions.


This is extremely exciting stuff. Thank you for Python and working on improving its performance.

I use Python threads for some IO heavy work with Popen and C and Java for my true multithreading work.


what i was thinking as well... i love these performance improvement posts, but at the same time i had to wonder what kind of choice it was to reach for python in the first place if the task was to do a lot of heavy concurrent task management???

python doesn't scale well with multiple threads/cores[1]. the high level motivation is to fix that. since the GIL is the main source of the problems, that has to go.

[1] the best current solution is to use the multiprocessing package which runs a completely separate python instance on each core, but obviously that doesn't support simple shared memory access (you can do it, but it's not "natural").
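
For example, the not-quite-natural route looks something like this sketch using multiprocessing.shared_memory (Python 3.8+): you get raw bytes shared across processes without pickling, but you manage the layout and the cleanup yourself.

    # Sketch: raw byte-level sharing across processes.
    from multiprocessing import Process, shared_memory

    def worker(name, size):
        shm = shared_memory.SharedMemory(name=name)  # attach by name
        print(bytes(shm.buf[:size]).decode())
        shm.close()

    if __name__ == "__main__":
        payload = b"hello from the parent process"
        shm = shared_memory.SharedMemory(create=True, size=len(payload))
        shm.buf[:len(payload)] = payload  # manual byte layout: the "not natural" part
        p = Process(target=worker, args=(shm.name, len(payload)))
        p.start()
        p.join()
        shm.close()
        shm.unlink()  # the creator owns cleanup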


There is a JIT being built for Python, so performance is being attacked from two avenues: single-threaded and multithreaded.

You could probably get decent performance for a similar application written in another language (than Python) using 175 threads. 175 threads is not that big of a deal; the OS can manage it pretty well. It's only when you start talking about thousands of individual connections and thousands of threads that you need to worry. Python sucks at this even at a low number of threads (GIL).

A few questions:

If single threaded performance is degraded, couldn't these people use an old version of python?

With such incredible increases in multithreaded performance, I imagine this is basically infrastructure-tier importance. Like the US government should be funding it. Would throwing a billion dollars at it solve it in ~1 year? Or is this going to take 3 years regardless?


Multiprocessing is great as a first-pass parallelization, but I've found debugging it to be very hard, especially for junior employees.

It seems much easier to follow when you can push everything to horizontally scaled single processes for languages like Python.


Many people still don't understand you can run millions of threads on Linux on a beefy server. The bottleneck with python will always be the CPU doing something with those requests.

This is a great analysis; thanks for writing.

I have also been working on running multiple Python interpreters in the same process by isolating them in different namespaces using `dlmopen` [1]. The high-level objective is to receive requests for some compute-intensive operations from a TCP/HTTP server and dispatch them to different workers. In this case, a thin C++ shim receives the requests and dispatches them to one of the Python interpreters in a namespace. This eliminates contention for the GIL amongst the interpreters and can exploit parallelism by running each interpreter on a different set of cores. The data obtained from the request does not need to be copied into the interpreter because everything is in the same address space; similarly, the output produced by the Python interpreter is passed back to the server without any copies.

[1] https://www.man7.org/linux/man-pages/man3/dlmopen.3.html


If single-threaded perf is important, you’ve already lost by using Python. You’re only ever going to get ok-ish performance, or slightly more ok-ish.

The standard Python workaround to the GIL is multiprocessing. Multiprocessing is basically a library which spins up a distributed system running locally - it forks your process, and runs workers in the forks. The parent process then communicates with the child processes via unix pipes, TCP, or some such method, allowing multiple cores to be used.
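
A minimal sketch of those mechanics with an explicit pipe (illustrative only; multiprocessing.Pool hides all of this behind map/apply):

    # Sketch: the parent forks a worker and exchanges pickled messages with
    # it over a Connection pair.
    from multiprocessing import Process, Pipe

    def worker(conn):
        while True:
            job = conn.recv()     # unpickle a job from the parent
            if job is None:       # sentinel: shut down
                break
            conn.send(job * job)  # pickle the result back

    if __name__ == "__main__":
        parent_end, child_end = Pipe()
        p = Process(target=worker, args=(child_end,))
        p.start()
        for n in range(5):
            parent_end.send(n)
            print(parent_end.recv())
        parent_end.send(None)
        p.join()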

Multiprocessing is the way to do parallelism. Deviating from that should be an exception -- for example, shared memory maps could be used to transfer select data objects instantaneously between the processes instead of serializing/deserializing over a pipe, and only those while still retaining separate process images. Threads were practically invented as a compensation for systems with heavy process image overhead.
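
One way that can look in practice is an anonymous mmap created before the fork, so both sides see the same bytes without any pickling; a rough POSIX-only sketch:

    # Sketch: an anonymous mmap created before fork() is shared between
    # parent and child, so bytes written by one are visible to the other
    # without going through pickle or a pipe.
    import mmap
    import multiprocessing as mp

    SIZE = 1024
    buf = mmap.mmap(-1, SIZE)  # anonymous, MAP_SHARED by default on Unix

    def producer():
        buf.seek(0)
        buf.write(b"written directly into the shared map")

    if __name__ == "__main__":
        ctx = mp.get_context("fork")
        p = ctx.Process(target=producer)
        p.start()
        p.join()
        buf.seek(0)
        print(buf.read(64).rstrip(b"\x00").decode())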

I think Python is very Unix in this regard. And that's not a bad thing per se. Unix and Linux can do multiprocessing very efficiently.

