The elephant in the room is that the DOE National Laboratories have a huge amount of code tied up in MPI, and they continue to spend millions of dollars on both hardware and software to support this infrastructure. If you look at the top 500 list:
http://www.top500.org/lists/2014/11/
Four out of the ten computers are owned by DOE. That's a pretty significant investment, so they're going to be reluctant to change over to a different system. And, to be clear, a different software setup could be used on these systems, but they were almost certainly purchased with the idea that their existing MPI codes would work well on them. Hell, MPICH was partially authored by Argonne:
http://www.mcs.anl.gov/project/mpich-high-performance-portab...
so they have a vested interest in seeing this community stay consistent.
Now, on the technical merits, is it possible to do better? Of course. That being said, part of the reason that DOE invested so heavily in this infrastructure is that they often solve physics problems posed as PDE formulations. Here, we're basically using a finite element, finite difference, or finite volume method, and it turns out there's quite a bit of experience writing these codes with MPI. Certainly, GPUs have made a big impact on things like finite difference codes, but you still have to distribute the data for these problems across a cluster of computers because they require too much memory to store locally. Right now, this can be done in a moderately straightforward way with MPI. More specifically, people end up using DOE libraries like PETSc or Trilinos to do it for them, and those are built on MPI. It's not perfect, but it works and scales well. Thus far, I've not seen anything that improves upon this enough to convince these teams to abandon their MPI infrastructure.
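To make the "moderately straightforward" part concrete, here is a minimal sketch of the kind of pattern that libraries like PETSc and Trilinos hide behind their distributed objects: a 1-D domain decomposition of an explicit diffusion step with ghost-cell (halo) exchange. The grid size, step count, and coefficient are made up for illustration; this is not taken from any real DOE code.

    /* Minimal sketch: 1-D domain decomposition of an explicit diffusion step
     * with halo exchange. Sizes and the update loop are illustrative only. */
    #include <mpi.h>
    #include <stdlib.h>

    #define NLOCAL 1024   /* interior points owned by this rank (assumed) */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* u[0] and u[NLOCAL+1] are ghost cells filled from the neighbors. */
        double *u    = calloc(NLOCAL + 2, sizeof(double));
        double *unew = calloc(NLOCAL + 2, sizeof(double));
        int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
        int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

        for (int step = 0; step < 100; ++step) {
            /* Exchange halos: send my boundary values, receive the neighbors'. */
            MPI_Sendrecv(&u[1],        1, MPI_DOUBLE, left,  0,
                         &u[NLOCAL+1], 1, MPI_DOUBLE, right, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Sendrecv(&u[NLOCAL],   1, MPI_DOUBLE, right, 1,
                         &u[0],        1, MPI_DOUBLE, left,  1,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);

            /* Explicit diffusion update on the interior points. */
            for (int i = 1; i <= NLOCAL; ++i)
                unew[i] = u[i] + 0.1 * (u[i-1] - 2.0 * u[i] + u[i+1]);
            double *tmp = u; u = unew; unew = tmp;
        }

        free(u); free(unew);
        MPI_Finalize();
        return 0;
    }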
Again, this is not to say that this setup is perfect. I also believe that this setup has caused a certain amount of stagnation (read: a huge amount) in the HPC community, and that's bad. However, in order to convince DOE that there's something better than MPI, someone has to put together scalable codes that vastly outperform the existing ones (or are vastly easier to use, write, or maintain) on the problems they care about. Very specifically, these are PDE discretizations of continuum mechanics problems using finite difference, finite element, or finite volume methods in 3-D. The 1-D diffusion problem in the article is nice, but 3-D is a pain in the ass, everyone knows it, and you will not get even a casual glance with anything shy of a 3-D problem. That sucks and is not fair, but that's the reality of the community.
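For a sense of why 3-D is the pain point, here is a hedged sketch of just the first bit of bookkeeping a 3-D decomposition needs: building a Cartesian communicator and finding the face neighbors. A real code still has to post exchanges in all three directions, usually with derived datatypes for the non-contiguous faces; everything here is illustrative.

    /* Sketch: set up a 3-D process grid and look up neighbor ranks.
     * The process grid factorization is left to MPI; nothing here is from
     * any particular production code. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int size;
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int dims[3] = {0, 0, 0};          /* let MPI pick a 3-D factorization */
        MPI_Dims_create(size, 3, dims);
        int periods[3] = {0, 0, 0};
        MPI_Comm cart;
        MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1, &cart);

        int lo[3], hi[3];                 /* face neighbors along x, y, z */
        for (int d = 0; d < 3; ++d)
            MPI_Cart_shift(cart, d, 1, &lo[d], &hi[d]);

        int rank;
        MPI_Comm_rank(cart, &rank);
        if (rank == 0)
            printf("process grid: %d x %d x %d\n", dims[0], dims[1], dims[2]);

        MPI_Comm_free(&cart);
        MPI_Finalize();
        return 0;
    }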
By the way, the oil industry basically mirrors DOE's sentiment. They're huge consumers of the same technology and work on the same sorts of problems. If someone is curious, check out reverse time migration or full waveform inversion. There are billions of dollars tied up in these two problems, and they have a huge amount of MPI code behind them. If someone can solve these problems better using a new technology, there's a huge amount of money in it. So far, no one has done it, because it's a huge investment and hard.
I worked at a supercomputing facility for a few years. The codes are typically decades old, maintained by hundreds of people over the years. By and large, they understand their performance profiles, and are working to squeeze as much out of the code as they can.
In addition, the performance engineers tend to be employed by the facilities, not the computational scientists. They're the ones who do a bunch of legwork of profiling the existing code on their new platform, and figuring out how to squeeze any machine-specific performance out of the code.
A lot of these codes are time-marching PDE solvers that do a bunch of matrix math to advance the simulation, so the kernel of the code is responsible for a vast majority of the time spent during a job. So it's not necessarily a huge chunk of code that needs to be tuned to wring better performance out of the machine.
The parallel communication they do is also to an API, not an ABI - the supercomputing vendors drop their optimizations into the build of the library for their machine, to take advantage of network-specific optimizations for various communication patterns. If you express your communication using the most specific function available (doing a collective all-to-all explicitly, say, rather than building your own all-to-all out of point-to-point primitives), the MPI build can insert optimized code for those cases.
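As a small, hypothetical illustration of that point: expressing the communication as the collective it really is, rather than hand-rolling it out of sends and receives, leaves the door open for the vendor's MPI build to substitute a network-specific all-to-all algorithm. Buffer contents and sizes below are made up.

    /* Sketch: prefer the collective over a hand-rolled equivalent. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int *sendbuf = malloc(size * sizeof(int));
        int *recvbuf = malloc(size * sizeof(int));
        for (int i = 0; i < size; ++i)
            sendbuf[i] = rank * size + i;   /* one element destined for each rank */

        /* One collective call; the implementation can map this onto whatever
         * all-to-all algorithm suits the interconnect. A hand-rolled loop of
         * pairwise sendrecvs would be opaque to the library and get none of
         * those optimizations. */
        MPI_Alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, MPI_COMM_WORLD);

        free(sendbuf); free(recvbuf);
        MPI_Finalize();
        return 0;
    }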
There's some misalignment because the facility will be in the top 500 for a few years, while the code lives on and on and on. If your supercomputer architecture is really out of left field (https://en.wikipedia.org/wiki/Roadrunner_(supercomputer)) it's not going to be super worth it for people to try to run on it without porting support from the facility.
John: You make some solid points. Most, though, seem to support the idea that more research investment and emphasis is badly needed for HPC programming models. From an application perspective, one sees such a dominant reliance on MPI+X primarily because the value proposition typically just isn't there yet for alternatives (at least in the areas where I work, where we have undertaken new developments recently and done fairly exhaustive evaluations of the available options). Though the coding can be somewhat tedious and rigid, in the end these shortcomings have been outweighed by the typical practical considerations -- low risk, extremely portable, tight control over locality and thus performance, etc. It's obviously not all or nothing - as you say, we could choose something even lower level and possibly get better performance, but seen from the perspective of near- and mid-term application goals, it's hard to make a different choice unless explicitly tasked with doing so.
Again I'm not sure if I agree or disagree with this. My hatred of MPI is only outweighed by the fact that I can use it... and my code works.
I think a large part of the inertia behind MPI is legacy code. Often the most complex part of an HPC scientific code is the parallel portion and the abstractions required to implement it (domain decomposition, halo exchange, etc.). I can't imagine there are too many grad students out there who are eager to rewrite a scientific code in a new language that is unproven and requires developing a skill set that is not yet useful in industry (who in industry has ever heard of Chapel or Spark??). Not to mention that rewriting legacy code delays getting results. It's just a terrible situation to be in.
I'm helping to organize a workshop about alternatives to MPI in HPC. You can see a sample of past programs at [1].
But you're right: today, MPI is dominant. I suspect this will change, if only because HPC and data center computing are converging, and the (much larger) data center side of things is very much not based on MPI (e.g., see TensorFlow). Personally, I find it reassuring that many of the dominant solutions from the data center side have more in common with the upstart HPC solutions than they do with the HPC status quo. I'd like to think that at some point we'll be able to bring around the rest of the HPC community too.
I never got into high performance scientific computing, but I believe the stuff that was done in my department at university was all MPI based and required very high interconnect speeds (like with Infiniband). It looks like your offering is much more standard, what's the thinking there, or am I just wrong/out of date?
I am not familiar with that particular one but I have used other supercomputers and those people are not waiting for better hardware, they are trying to squeeze the best performance they can out of what they have.
The end result mostly depends on the balance between scientists and engineers on the development team; it oscillates between "this is Python because the scientists working on the code only know that, but we are using MPI to at least use several cores" and "we have a direct line to the hardware vendors to help us write the best software possible for this thing".
That's an excellent essay. I agree with about every point. I worked with MPI over ten years ago. It seemed like an assembly language for cluster computing: something for raw efficiency, or to hold us over until we got a real language. I was playing with Cilk and other parallel tools then. They were much better, but saw little adoption or optimization.
The examples given in Spark and Chapel show how much better it can be. Such methods deserve much more investment. The bad thing is that the DOE labs are clinging so hard to their existing tools when they're in the best position to invest in new ones via the huge grants they have. They built supercomputing before anyone had heard of the cloud. Their wisdom, combined with today's datacenter engineers, could result in anything from great leaps in efficiency to something revolutionary. Yet they act elitist and value backward compatibility over everything else. They're starting to look like those mainframe programmers squeezing every ounce of productivity they can out of COBOL.
I think the change will have to come from outside. A group that works both sides with funding to do new projects. They either (a) improve MPI usage for new workloads or (b) get the alternatives up to MPI performance for HPC workloads. Then, they open the tool up to all kinds of new projects in both HPC and cloud-centered research (esp cloud compute clusters). Maybe the two sides will accidentally start sharing with each other when contributing to the same projects. Not much hope past that.
This is… amazingly misinformed. I am assuming you've never done scientific simulation work if you think that. Physical simulations in many fields get better due to increases in compute, memory, and bandwidth faster than they do from algorithmic improvements (there are only so many algorithmic improvements one can make to a PDE solver). And certain problems simply can't be simulated until a certain amount of compute (and, more importantly, memory) is available.
And while some of the time the entire cluster will be given to a single large scale project, most of the time it will be acting as a massive GPU farm for all sorts of research. A win-win for everyone.
If only there was a standardized, robust, widely available cross-platform Message Passing Interface that could do this.
I don't grok why people outside of HPC seem to be shunning MPI. The shared-nothing memory model and asynchronous nature of MPI makes it very similar in spirit to a lot of the current web dev tech, AFAICT.
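For anyone who hasn't seen it, a minimal sketch of that shared-nothing, message-passing style: each rank owns its data and exchanges messages with neighbors using non-blocking calls, much like asynchronous message passing between services. The ring pattern and payload are just for illustration.

    /* Sketch: shared-nothing ranks passing messages around a ring with
     * non-blocking sends/receives, overlapping communication with work. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int next = (rank + 1) % size;
        int prev = (rank + size - 1) % size;
        int outgoing = rank, incoming = -1;

        MPI_Request reqs[2];
        MPI_Irecv(&incoming, 1, MPI_INT, prev, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(&outgoing, 1, MPI_INT, next, 0, MPI_COMM_WORLD, &reqs[1]);

        /* Overlap useful local work here, then wait for both messages. */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

        printf("rank %d received %d from rank %d\n", rank, incoming, prev);
        MPI_Finalize();
        return 0;
    }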
These things get run in phases. In phase 1 they insist on novel architectures, rethinking of basic premises, required involvement by academics.
By the end of phase 3, when they actually procure machines, it's mostly 'just give me one of what you're already shipping, and it better run MPI well'.
DOE exascale was supposed to be fundamentally different, because by the time you got there all the incremental improvements in power and latency management weren't nearly enough anymore... my guess is that they'll just make big GPU clusters and call it a day.
It's probably not that much money in the scheme of things, but the sad (or good, depending on your perspective) thing is that HPC got so far out of the mainstream that when corporations finally got around to worrying about scaling, they just did their own thing. So I guess they can thank DARPA/DOE for InfiniBand? Not really?
First of all, this is from 2007; since then the number of Open MPI deployments has increased steadily, and the latest release is from the 4th of May 2010, so to declare it dead really is a bit premature. It looks like MPI is alive and kicking.
Appistry apparently makes its money by selling cloud management software. This is fine and good but does not overlap 100% with the main field of application for MPI, which is large scale number crunching in the scientific world.
This includes large scale simulations of all kinds, including fluids, particle systems, molecular and sub-molecular simulations.
While it is possible to run these on a different architecture by porting them, the amount of code that is written for and around MPI means that MPI is here to stay for a long long time. After all, rewriting all that stuff would take ages and cost a fortune.
Probably MPI will be around just a little longer than FORTRAN.
MPI shines mostly in the context of tightly-coupled applications operating on multiple nodes with a fast, reliable network. Most implementations are well-optimized for operating on extremely high-bandwidth, low-latency networks like InfiniBand or Cray's Gemini. They can also use remote direct memory access (RDMA) to do cool stuff like let CUDA applications directly access GPU memory across the network (GPUDirect).
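A hedged sketch of the GPUDirect point: with a CUDA-aware MPI build (recent Open MPI or MVAPICH2 builds, for example), a device pointer can be handed straight to MPI and the library moves the data over RDMA without staging it through host memory. This assumes CUDA and a CUDA-aware MPI are available; the buffer size is made up.

    /* Sketch: passing a GPU buffer directly to MPI (requires CUDA-aware MPI). */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int n = 1 << 20;
        double *d_buf;                                  /* buffer in GPU memory */
        cudaMalloc((void **)&d_buf, n * sizeof(double));
        cudaMemset(d_buf, 0, n * sizeof(double));

        /* Rank 0 ships its device buffer to rank 1. With a CUDA-aware MPI,
         * no explicit cudaMemcpy to host memory is needed. Run with >= 2 ranks. */
        if (size >= 2) {
            if (rank == 0)
                MPI_Send(d_buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            else if (rank == 1)
                MPI_Recv(d_buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }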
Scientific applications which involve a lot of interprocess communication -- for example, large CFD simulations -- can get a lot of benefits, especially since the popular MPI implementations are highly tuned for those types of apps. Also there's a ridiculous amount of legacy code, which leads to the usual lock-in effects.
Too bad he didn't talk about whether GPGPU will kill MPI too. I don't know enough to say.
I'm not familiar with the HPC space but I thought a lot of new work, at least in machine learning, was migrating to GPGPU instead of traditional CPUs. The compute per $ or per watt payoff is too large to ignore.