The elephant in the room is that the DOE National Laboratories have a huge amount of code tied up in MPI, and they continue to spend millions of dollars on both hardware and software to support this infrastructure. If you look at the top 500 list:
http://www.top500.org/lists/2014/11/
Four out of the ten computers are owned by DOE. That's a pretty significant investment, so they're going to be reluctant to change over to a different system. And, to be clear, a different software setup could be used on these systems, but they were almost certainly purchased with the idea that their existing MPI codes would work well on them. Hell, MPICH was partially authored by Argonne:
http://www.mcs.anl.gov/project/mpich-high-performance-portab...
so they have a vested interest in seeing this community stay consistent.
Now, on the technical merits, is it possible to do better? Of course. That being said, part of the reason that DOE invested so heavily in this infrastructure is that they often solve physics problems posed as PDE formulations. Here, we're basically using a finite element, finite difference, or finite volume method, and it turns out there's quite a bit of experience writing these codes with MPI. Certainly, GPUs have made a big impact on things like finite difference codes, but you still have to distribute the data for these problems across a cluster of computers because they require too much memory to store locally. Right now, this can be done in a moderately straightforward way with MPI. More specifically, people end up using DOE libraries like PETSc or Trilinos to do it for them, and those are built on MPI. It's not perfect, but it works and scales well. Thus far, I've not seen anything that improves upon this enough to convince these teams to abandon their MPI infrastructure.
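To make the "moderately straightforward" part concrete, here is a minimal sketch of the kind of pattern that libraries like PETSc and Trilinos hide behind their distributed objects: a 1-D domain decomposition of an explicit diffusion step with ghost-cell (halo) exchange. The grid size, step count, and coefficient are made up for illustration; this is not taken from any real DOE code.

    /* Minimal sketch: 1-D domain decomposition of an explicit diffusion step
     * with halo exchange. Sizes and the update loop are illustrative only. */
    #include <mpi.h>
    #include <stdlib.h>

    #define NLOCAL 1024   /* interior points owned by this rank (assumed) */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* u[0] and u[NLOCAL+1] are ghost cells filled from the neighbors. */
        double *u    = calloc(NLOCAL + 2, sizeof(double));
        double *unew = calloc(NLOCAL + 2, sizeof(double));
        int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
        int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

        for (int step = 0; step < 100; ++step) {
            /* Exchange halos: send my boundary values, receive the neighbors'. */
            MPI_Sendrecv(&u[1],        1, MPI_DOUBLE, left,  0,
                         &u[NLOCAL+1], 1, MPI_DOUBLE, right, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Sendrecv(&u[NLOCAL],   1, MPI_DOUBLE, right, 1,
                         &u[0],        1, MPI_DOUBLE, left,  1,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);

            /* Explicit diffusion update on the interior points. */
            for (int i = 1; i <= NLOCAL; ++i)
                unew[i] = u[i] + 0.1 * (u[i-1] - 2.0 * u[i] + u[i+1]);
            double *tmp = u; u = unew; unew = tmp;
        }

        free(u); free(unew);
        MPI_Finalize();
        return 0;
    }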
Again, this is not to say that this setup is perfect. I also believe that this setup has caused a certain amount of stagnation (read: a huge amount) in the HPC community, and that's bad. However, in order to convince DOE that there's something better than MPI, someone has to put together scalable codes that vastly outperform the existing ones (or are vastly easier to use, write, or maintain) on the problems they care about. Very specifically, these are PDE discretizations of continuum mechanics problems using finite difference, finite element, or finite volume methods in 3-D. The 1-D diffusion problem in the article is nice, but 3-D is a pain in the ass, everyone knows it, and you will not get even a casual glance with anything shy of a 3-D problem. That sucks and is not fair, but that's the reality of the community.
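For a sense of why 3-D is the pain point, here is a hedged sketch of just the first bit of bookkeeping a 3-D decomposition needs: building a Cartesian communicator and finding the face neighbors. A real code still has to post exchanges in all three directions, usually with derived datatypes for the non-contiguous faces; everything here is illustrative.

    /* Sketch: set up a 3-D process grid and look up neighbor ranks.
     * The process grid factorization is left to MPI; nothing here is from
     * any particular production code. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int size;
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int dims[3] = {0, 0, 0};          /* let MPI pick a 3-D factorization */
        MPI_Dims_create(size, 3, dims);
        int periods[3] = {0, 0, 0};
        MPI_Comm cart;
        MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1, &cart);

        int lo[3], hi[3];                 /* face neighbors along x, y, z */
        for (int d = 0; d < 3; ++d)
            MPI_Cart_shift(cart, d, 1, &lo[d], &hi[d]);

        int rank;
        MPI_Comm_rank(cart, &rank);
        if (rank == 0)
            printf("process grid: %d x %d x %d\n", dims[0], dims[1], dims[2]);

        MPI_Comm_free(&cart);
        MPI_Finalize();
        return 0;
    }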
By the way, the oil industry basically mirrors DOE's sentiment. They're huge consumers of the same technology and work on the same sorts of problems. If someone is curious, check out reverse time migration or full waveform inversion. There are billions of dollars tied up in these two problems, and they have a huge amount of MPI code behind them. If someone can solve these problems better using a new technology, there's a huge amount of money in it. So far, no one has done it, because it's a huge investment and hard.
I worked at a supercomputing facility for a few years. The codes are typically decades old, maintained by hundreds of people over the years. By and large, they understand their performance profiles, and are working to squeeze as much out of the code as they can.
In addition, the performance engineers tend to be employed by the facilities, not the computational scientists. They're the ones who do a bunch of legwork of profiling the existing code on their new platform, and figuring out how to squeeze any machine-specific performance out of the code.
A lot of these codes are time-marching PDE solvers that do a bunch of matrix math to advance the simulation, so the kernel of the code is responsible for a vast majority of the time spent during a job. So it's not necessarily a huge chunk of code that needs to be tuned to wring better performance out of the machine.
The parallel communication they do is also to an API, not an ABI - the supercomputing vendors drop their optimizations into the build of the library for their machine, to take advantage of network-specific optimizations for various communication patterns. If you express your communication using the most specific function available (doing a collective all-to-all explicitly, say, rather than building your own all-to-all out of point-to-point primitives), the MPI build can insert optimized code for those cases.
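As a small, hypothetical illustration of that point: expressing the communication as the collective it really is, rather than hand-rolling it out of sends and receives, leaves the door open for the vendor's MPI build to substitute a network-specific all-to-all algorithm. Buffer contents and sizes below are made up.

    /* Sketch: prefer the collective over a hand-rolled equivalent. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int *sendbuf = malloc(size * sizeof(int));
        int *recvbuf = malloc(size * sizeof(int));
        for (int i = 0; i < size; ++i)
            sendbuf[i] = rank * size + i;   /* one element destined for each rank */

        /* One collective call; the implementation can map this onto whatever
         * all-to-all algorithm suits the interconnect. A hand-rolled loop of
         * pairwise sendrecvs would be opaque to the library and get none of
         * those optimizations. */
        MPI_Alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, MPI_COMM_WORLD);

        free(sendbuf); free(recvbuf);
        MPI_Finalize();
        return 0;
    }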
There's some misalignment because the facility will be in the top 500 for a few years, while the code lives on and on and on. If your supercomputer architecture is really out of left field (https://en.wikipedia.org/wiki/Roadrunner_(supercomputer)) it's not going to be super worth it for people to try to run on it without porting support from the facility.
John: You make some solid points. Most, though, seem to support the idea that more research investment and emphasis is badly needed for HPC programming models. From an application perspective, one sees such a dominant reliance on MPI+X primarily because the value proposition typically just isn't there yet for alternatives (at least in the areas where I work, where we have undertaken new developments recently and done fairly exhaustive evaluations of the available options). Though the coding can be somewhat tedious and rigid, in the end these shortcomings have been outweighed by the typical practical considerations -- low risk, extremely portable, tight control over locality and thus performance, etc. It's obviously not all or nothing - as you say, we could choose something even lower level and possibly get better performance, but seen from the perspective of near- and mid-term application goals, it's hard to make a different choice unless explicitly tasked with doing so.
Again I'm not sure if I agree or disagree with this. My hatred of MPI is only outweighed by the fact that I can use it... and my code works.
I think a large part of the inertia behind MPI is legacy code. Often the most complex part of an HPC scientific code is the parallel portion and the abstractions required to implement it (domain decomposition, halo exchange, etc.). I can't imagine there are too many grad students out there who are eager to rewrite a scientific code in a new language that is unproven and requires developing a skill set that is not yet useful in industry (who in industry has ever heard of Chapel or Spark??). Not to mention that rewriting legacy code delays getting results. It's just a terrible situation to be in.
I'm helping to organize a workshop about alternatives to MPI in HPC. You can see a sample of past programs at [1].
But you're right: today, MPI is dominant. I suspect this will change, if only because HPC and data center computing are converging, and the (much larger) data center side of things is very much not based on MPI (e.g., see TensorFlow). Personally, I find it reassuring that many of the dominant solutions from the data center side have more in common with the upstart HPC solutions than they do with the HPC status quo. I'd like to think that at some point we'll be able to bring around the rest of the HPC community too.
I never got into high performance scientific computing, but I believe the stuff that was done in my department at university was all MPI based and required very high interconnect speeds (like with Infiniband). It looks like your offering is much more standard, what's the thinking there, or am I just wrong/out of date?
I am not familiar with that particular one but I have used other supercomputers and those people are not waiting for better hardware, they are trying to squeeze the best performance they can out of what they have.
The end result mostly depends on the balance between scientists and engineers on the development team; it oscillates between "this is Python because the scientists working on the code only know that, but we are using MPI to at least use several cores" and "we have a direct line to the hardware vendors to help us write the best software possible for this thing".
That's an excellent essay. I agree with about every point. I worked with MPI over ten years ago. It seemed like an assembly language for cluster computing: something for raw efficiency, or to hold us over until we got a real language. I was playing with Cilk and other parallel tools then. They were much better, but saw little adoption or optimization.
The examples given in Spark and Chapel show how much better it can be. Such methods deserve much more investment. The bad thing is that the DOE labs are clinging so hard to their existing tools when they're in the best position to invest in new ones via the huge grants they have. They built supercomputing before anyone had heard of the cloud. Their wisdom, combined with today's datacenter engineers, could result in anything from great leaps in efficiency to something revolutionary. Yet they act elitist and value backward compatibility over everything else. They're starting to look like those mainframe programmers squeezing every ounce of productivity they can out of COBOL.
I think the change will have to come from outside. A group that works both sides with funding to do new projects. They either (a) improve MPI usage for new workloads or (b) get the alternatives up to MPI performance for HPC workloads. Then, they open the tool up to all kinds of new projects in both HPC and cloud-centered research (esp cloud compute clusters). Maybe the two sides will accidentally start sharing with each other when contributing to the same projects. Not much hope past that.
This is… amazingly misinformed. I am assuming you've never done scientific simulation work if you think that. Physical simulations in many fields get better due to increases in compute, memory, and bandwidth faster than they do from algorithmic improvements (there are only so many algorithmic improvements one can make to a PDE solver). And certain problems simply can't be simulated until a certain amount of compute (and, more importantly, memory) is available.
And while some of the time the entire cluster will be given to a single large scale project, most of the time it will be acting as a massive GPU farm for all sorts of research. A win-win for everyone.
If only there was a standardized, robust, widely available cross-platform Message Passing Interface that could do this.
I don't grok why people outside of HPC seem to be shunning MPI. The shared-nothing memory model and asynchronous nature of MPI makes it very similar in spirit to a lot of the current web dev tech, AFAICT.
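For anyone who hasn't seen it, a minimal sketch of that shared-nothing, message-passing style: each rank owns its data and exchanges messages with neighbors using non-blocking calls, much like asynchronous message passing between services. The ring pattern and payload are just for illustration.

    /* Sketch: shared-nothing ranks passing messages around a ring with
     * non-blocking sends/receives, overlapping communication with work. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int next = (rank + 1) % size;
        int prev = (rank + size - 1) % size;
        int outgoing = rank, incoming = -1;

        MPI_Request reqs[2];
        MPI_Irecv(&incoming, 1, MPI_INT, prev, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(&outgoing, 1, MPI_INT, next, 0, MPI_COMM_WORLD, &reqs[1]);

        /* Overlap useful local work here, then wait for both messages. */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

        printf("rank %d received %d from rank %d\n", rank, incoming, prev);
        MPI_Finalize();
        return 0;
    }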
These things get run in phases. In phase 1 they insist on novel architectures, rethinking of basic premises, required involvement by academics.
By the end of phase 3, when they actually procure machines, it's mostly 'just give me one of what you're already shipping, and it better run MPI well'.
DOE exascale was supposed to be fundamentally different, because by the time you got there all the incremental improvements in power and latency management weren't nearly enough anymore... my guess is that they'll just make big GPU clusters and call it a day.
It's probably not that much money in the scheme of things, but the sad (or good, depending on your perspective) thing is that HPC got so far out of the mainstream that when corporations finally got around to worrying about scaling, they just did their own thing. So I guess they can thank DARPA/DOE for InfiniBand? Not really?
First of all, this is from 2007; since then the number of Open MPI deployments has increased steadily, and the latest release is from the 4th of May 2010, so to declare it dead really is a bit premature. It looks like MPI is alive and kicking.
Appistry apparently makes its money by selling cloud management software. This is fine and good but does not overlap 100% with the main field of application for MPI, which is large scale number crunching in the scientific world.
This includes large scale simulations of all kinds, including fluids, particle systems, molecular and sub-molecular simulations.
While it is possible to run these on a different architecture by porting them, the amount of code that is written for and around MPI means that MPI is here to stay for a long long time. After all, rewriting all that stuff would take ages and cost a fortune.
Probably MPI will be around just a little longer than FORTRAN.
MPI shines mostly in the context of tightly-coupled applications operating on multiple nodes with a fast, reliable network. Most implementations are well-optimized for operating on extremely high-bandwidth, low-latency networks like InfiniBand or Cray's Gemini. They can also use remote direct memory access (RDMA) to do cool stuff like let CUDA applications directly access GPU memory across the network (GPUDirect).
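A hedged sketch of the GPUDirect point: with a CUDA-aware MPI build (recent Open MPI or MVAPICH2 builds, for example), a device pointer can be handed straight to MPI and the library moves the data over RDMA without staging it through host memory. This assumes CUDA and a CUDA-aware MPI are available; the buffer size is made up.

    /* Sketch: passing a GPU buffer directly to MPI (requires CUDA-aware MPI). */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int n = 1 << 20;
        double *d_buf;                                  /* buffer in GPU memory */
        cudaMalloc((void **)&d_buf, n * sizeof(double));
        cudaMemset(d_buf, 0, n * sizeof(double));

        /* Rank 0 ships its device buffer to rank 1. With a CUDA-aware MPI,
         * no explicit cudaMemcpy to host memory is needed. Run with >= 2 ranks. */
        if (size >= 2) {
            if (rank == 0)
                MPI_Send(d_buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            else if (rank == 1)
                MPI_Recv(d_buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }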
Scientific applications which involve a lot of interprocess communication -- for example, large CFD simulations -- can get a lot of benefits, especially since the popular MPI implementations are highly tuned for those types of apps. Also there's a ridiculous amount of legacy code, which leads to the usual lock-in effects.
Too bad he didn't talk about whether GPGPU will kill MPI too. I don't know enough to say.
I'm not familiar with the HPC space but I thought a lot of new work, at least in machine learning, was migrating to GPGPU instead of traditional CPUs. The compute per $ or per watt payoff is too large to ignore.