
The real money shot is here: https://github.com/flame/blis/blob/master/docs/Performance.m...

It seems that the selling point is that BLIS does multi-core quite well. I am especially impressed that it does as well as the highly optimized Intel MKL on Intel CPUs.

I do not see the selling point of BLIS-specific APIs, though. The whole point of having an open BLAS API standard is that numerical libraries should be drop-in replaceable, so when a new library (such as BLIS here) comes along, one could just re-link the library and reap the performance gain immediately.
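To make that concrete: code written against the standard (C)BLAS interface doesn't change at all when you swap implementations; only the link line does. A minimal sketch (the header and library names are the usual ones, but they depend on how your BLAS was packaged):

    #include <stdio.h>
    #include <cblas.h>   /* standard C interface to BLAS */

    int main(void)
    {
        /* C = alpha*A*B + beta*C with 2x2 column-major matrices. */
        double A[4] = {1, 2, 3, 4};   /* A = [1 3; 2 4] */
        double B[4] = {5, 6, 7, 8};   /* B = [5 7; 6 8] */
        double C[4] = {0};

        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    2, 2, 2, 1.0, A, 2, B, 2, 0.0, C, 2);

        printf("%g %g\n%g %g\n", C[0], C[2], C[1], C[3]);
        return 0;
    }

Link it against the reference BLAS, OpenBLAS, MKL, or BLIS (assuming BLIS was configured with its CBLAS compatibility layer) and the source stays identical; roughly, cc demo.c -lopenblas versus cc demo.c -lblis.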

What is interesting is that numerical algebra work, by nature, is mostly embarrassingly parallel, so it should not be too difficult to write multi-core implementations. And yet, BLIS here performs so much better than some other industry-leading implementations on multi-core configurations. So the question is not why BLIS does so well; the question is why some other implementations do so poorly.




It's interesting: it does very well in multicore and with complex numbers. The issue is that there's no way we'll rewrite all our code littered with calls to BLAS and LAPACK to use a different API. It looks like they have a BLAS compatibility layer though; I hope it's good.

It even has a nice, friendly licence.


The advantage of MKL is typically greatly over-rated anyhow, and I don't see why one should care. BLIS and OpenBLAS have good tuned x86 BLAS implementations, and run infinitely faster than MKL on ARM and POWER, for instance, and if you're interested in small matrices on x86_64, there's libxsmm. (I know MKL has more than BLAS, but I don't know what there is that doesn't have free rough equivalents.) BLIS performance: https://github.com/flame/blis/blob/master/docs/Performance.m...

The mythology surrounding the Intel tools and libraries really ought to die. It's bizarre seeing people decide they must use MKL rather than the linear algebra libraries that AMD has been working hard to optimize for their hardware (and possibly other hardware, incidentally). The same goes for compiler code generation.

Free BLAS implementations are pretty much on a par with MKL, at least for large-dimension level 3 in BLIS's case, even on Haswell. For small matrices MKL only became fast after libxsmm showed the way. (I don't know about libxsmm on current AMD hardware, but it's free software you can work on if necessary, like AMD have done with BLIS.) OpenBLAS and BLIS are also "infinitely" better performing than MKL in general, in the sense that they run on CPU architectures MKL doesn't support at all (and BLIS's plain C gets about 75% of the hand-written DGEMM kernel's performance).

The differences between the implementations are comparable with the noise in typical HPC jobs, even if performance were entirely dominated by, say, DGEMM (and getting close to peak floating-point throughput is atypical). On the other hand, you can see a factor of several difference in MPI performance in some cases.


BLIS is an interesting new direction in that regard: https://github.com/flame/blis

>The BLAS-like Library Instantiation Software (BLIS) framework is a new infrastructure for rapidly instantiating Basic Linear Algebra Subprograms (BLAS) functionality. Its fundamental innovation is that virtually all computation within level-2 (matrix-vector) and level-3 (matrix-matrix) BLAS operations can be expressed and optimized in terms of very simple kernels.
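To give a flavour of what those "very simple kernels" look like, here is a rough plain-C sketch of a BLIS-style GEMM micro-kernel: it computes a tiny MR x NR block of C += alpha*A*B from packed panels of A and B. The MR/NR values and the packing layout are made up for illustration; the real kernels are per-architecture and vectorized.

    #define MR 4
    #define NR 4

    /* Illustrative micro-kernel: C_block = alpha*A_panel*B_panel + beta*C_block,
     * where A_panel is MR x k (packed so each length-MR column is contiguous)
     * and B_panel is k x NR (packed so each length-NR row is contiguous).
     * The framework supplies the packing and the loops around this kernel. */
    static void dgemm_ukernel_sketch(int k, double alpha,
                                     const double *a, const double *b,
                                     double beta,
                                     double *c, int rs_c, int cs_c)
    {
        double ab[MR * NR] = {0};

        /* Rank-1 updates over the k dimension: this tiny loop nest is the
         * only part that needs hand optimization per architecture. */
        for (int p = 0; p < k; ++p)
            for (int j = 0; j < NR; ++j)
                for (int i = 0; i < MR; ++i)
                    ab[i + j * MR] += a[p * MR + i] * b[p * NR + j];

        /* Scale and write back into the strided MR x NR block of C. */
        for (int j = 0; j < NR; ++j)
            for (int i = 0; i < MR; ++i)
                c[i * rs_c + j * cs_c] =
                    beta * c[i * rs_c + j * cs_c] + alpha * ab[i + j * MR];
    }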


The BLAS & LAPACK subset of the Intel Math Kernel Library (MKL) API is very well implemented by open-source projects such as OpenBLAS and BLIS:

https://github.com/flame/blis

Both are well optimized for AMD CPUs.


> numerical algebra work [...] is mostly embarrassingly parallel

It's the exact opposite: most numerical linear algebra is _not_ embarrassingly parallel, and it takes quite an effort to code properly.

That is why BLAS/LAPACK is popular and there are few competing implementations.
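A small example of the kind of dependency involved: forward substitution (solving L x = b with L lower-triangular) computes each x[i] from all the previous ones, so the outer loop cannot simply be split across cores the way an element-wise operation can. A sketch:

    /* Solve L x = b for x, with L lower-triangular, n x n, column-major.
     * Each x[i] depends on x[0..i-1], so the outer loop is inherently
     * sequential; only the inner dot product is easy to vectorize. */
    void forward_subst(int n, const double *L, const double *b, double *x)
    {
        for (int i = 0; i < n; ++i) {
            double s = b[i];
            for (int j = 0; j < i; ++j)
                s -= L[i + j * n] * x[j];   /* uses already-computed x[j] */
            x[i] = s / L[i + i * n];
        }
    }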


> In most programming languages, the linear algebra handling for these kinds of standard operations is performed by underlying libraries called the BLAS and LAPACK library. Most open source projects use an implementation called OpenBLAS, a C implementation of BLAS/LAPACK which does many of the tricks required for getting much higher performance than "simple" codes by using CPU-specialized kernels based on the sizes of the CPU's caches. Open source projects like R and SciPy also ship with OpenBLAS because of its generally good performance and open licensing, though it's known that OpenBLAS is handily outperformed by Intel MKL which is a vendor-optimized BLAS/LAPACK implementation for Intel CPUs (which works on AMD CPUs as well).

The reality is more complex than this. Most open-source software doesn't ship with any assumptions about a particular BLAS/LAPACK implementation at all, and on HPC systems you are generally expected to choose one as appropriate and compile your code against it. It is generally only when you download a precompiled version that you're given a particular implementation, but that doesn't mean you can't use another one if you compile from source, since the BLAS and LAPACK libraries just present a standard API. Generally, for performance reasons, you want to compile specifically for your platform, because precompiled wheels from Conda, PyPI, etc. will leave performance on the table.

On forward-thinking cluster teams these days, sysadmins use tools like Spack and EasyBuild, and software is made available to users either directly or by request, so it's usual to log into a cluster and have multiple implementations available to choose from and compile your code against. More often than not, though, it's still on you to build against the dependencies you need. It's a worthwhile exercise in HPC to try different implementations and check the performance characteristics of your code on the particular machine.


Not arguing with that, but I think the jury is out on whether they need to be hand vectorized. With recent GCC, generic C for BLIS' DGEMM gives about 2/3 the performance of the hand-coded version on Haswell, and it may be somewhat pessimized by hand-unrolling rather than letting the compiler do it. The remaining difference is thought to be mainly from prefetching, but I haven't verified that. (Details are scattered in the BLIS issue tracker earlier this year.)

For information of anyone who doesn't know about performance of level 3 BLAS: It doesn't come just from vectorization, but also cache use with levels of blocking (and prefetch). See the material under https://github.com/flame/blis. Other levels -- not matrix-matrix -- are less amenable to fancy optimization, with lower arithmetic density to pit against memory bandwidth, and BLIS mostly hasn't bothered with them, though OpenBLAS has.
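To illustrate the blocking idea, here is a crude single-level cache-blocked C += A*B in plain C. The block size is made up, and real level-3 code blocks for several cache levels, packs the blocks into contiguous buffers, and hands the innermost block to a vectorized micro-kernel; this sketch only shows why blocking helps cache reuse at all:

    #define NB 64   /* illustrative block size, not tuned for any cache */

    static int imin(int a, int b) { return a < b ? a : b; }

    /* C += A*B, all n x n, column-major. Working on NB x NB blocks keeps
     * the current blocks of A, B and C resident in cache while they are
     * reused, instead of streaming whole rows/columns from memory. */
    void dgemm_blocked(int n, const double *A, const double *B, double *C)
    {
        for (int jj = 0; jj < n; jj += NB)
            for (int pp = 0; pp < n; pp += NB)
                for (int ii = 0; ii < n; ii += NB)
                    for (int j = jj; j < imin(jj + NB, n); ++j)
                        for (int p = pp; p < imin(pp + NB, n); ++p) {
                            double bpj = B[p + j * n];
                            for (int i = ii; i < imin(ii + NB, n); ++i)
                                C[i + j * n] += A[i + p * n] * bpj;
                        }
    }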


Some time ago I compared different BLAS implementations (OpenBLAS, MKL, ACML, etc.) on different Intel CPU architectures, in case somebody is interested in the differences between them:

http://stackoverflow.com/questions/5260068/multithreaded-bla...


BLIS is a framework for instantiating (among other things) a BLAS. The exciting bit is that they distilled it down to a couple of compute kernels: you provide the inner loops for a matrix multiplication (the micro-kernel), and the framework turns that into a numerical library.

AMD did use it to create their BLAS library though.

Also, side note: when I've heard "reference BLAS," it has been used in the opposite way. Netlib BLAS is the reference BLAS: its performance is basically bad, but it defines the functionality. AMD used BLIS to create a tuned vendor BLAS.


BLAS is a very well-optimized library. I think a lot of it is in Fortran, which can be faster than C. It is very heavily used in scientific computing. BLAS implementations also have routines that have been hand-tuned in assembly. It's not magic, but the amount of work that has gone into it is not something you would probably want to replicate.

Well, I worked with BLAS as well, but it was just something we linked in (i.e., it's a stable library with a stable interface and a limited set of functions). None of this should be surprising; in my opinion, computers mainly exist to do linear algebra efficiently.

BLAS implementations are incredibly well optimized by people doing their life's work on just matrix multiplication, hand-tuning assembly, benchmarking per platform to optimize cache use, and so on; they are incredible feats of software engineering. For the multiplication of large matrices (cubic time), the performance gains can quickly overwhelm the quadratic-time overhead of the scripting language.

Depends on the field kinda. For a really extreme case, something like big dense linear algebra is going to just use BLAS calls anyway. For big enough matrices, all of the flops are coming from the library, the cost of calling from Python vs the cost of calling from C or Fortran is amortized anyway, and probably most people won’t beat a tuned BLAS however much developer time they throw at it.

It makes more sense to tune libraries heavily; it's less of a trade-off and more an allocation of the finite low-level tuning resources to high-impact libraries.

Anyway it turns out that the only way to get most people to link to a good BLAS is to use Numpy and distribute it through some giant Anaconda Rube Goldberg machine, so, I dunno, people are weird.


Also, the major point is that BLAS plays little to no role here. Algorithms which just hit BLAS are already very suboptimal. There's a tearing step which reduces the problem to many subproblems, which are then more optimally handled by pure Julia numerical linear algebra libraries that greatly outperform OpenBLAS in the regime they are in:

https://github.com/YingboMa/RecursiveFactorization.jl#perfor...

And there are hooks in the differential equation solvers to not use OpenBLAS in many cases for this reason:

https://github.com/SciML/DiffEqBase.jl/blob/master/src/linea...

Instead, what this comes out to is more of a deconstructed KLU, except instead of passing everything to a single sparse linear solve, you can do semi-independent nonlinear solves which then spawn parallel jobs of small semi-dense linear solves handled by these pure Julia linear algebra libraries.

And that's only a small fraction of the details. But at the end of the day, if someone is thinking "BLAS", they are already about an order of magnitude behind on speed. The algorithms to do this effectively are much more complex than that.


I don't disagree, but where are those techniques presented in the article? It seems like she exploits the particular shape of her matrix to align better with cache. No BLAS library is going to figure that out.

I am not trying to say that a simple 50+ year old matrix solver is somehow competitive with existing BLAS libraries. But I disagreed with its portrayal in the article, which associated the block with NumPy performance. Give that to a 2024 Fortran compiler, and it's going to get enough right to produce reasonable machine code.


The compiler is responsible for the implementation of the matmul intrinsic. Most of them have a "decent for small matrices and easy to inline" version by default, plus an option to convert it to a call to BLAS under certain conditions (like larger matrices); then the performance is that of your BLAS library. The assumption (quite good in my experience) is that people use the intrinsic naturally because it is easy and natural, but turn to BLAS calls when performance becomes important, which is not the case for the vast majority of matrix multiplications, even in high-performance computing.

There is no universal optimal algorithm.


Yeah. Fast compared to what? BLAS? On which platform? The README is a bit short for something that claims to be "significantly faster than most mathematics libraries out there".

"The API itself is not the reason for its performance"

It eliminates many performance mistakes by making it so that when you do something dumb it's generally immediately apparent. This mostly boils down to nonsensical or slow operations usually being impossible to express, and any memory copying being explicit.

I can only tell you my own experience. Going through the steps of transforming a few "textbook algorithms" into BLAS made me think about the problems much more clearly and in a way I would have missed using a higher abstraction.

All the examples you give (MATLAB, Eigen, Armadillo, Julia) make it really easy to write really bad code (BLAS under the hood is kinda irrelevant). But you're totally right that they're useful. If you want to call out to code that does an SVD or something simple that exists in a library, then they're generally just fine and you can't mess it up too much.

I use BLAS from a REPL in Clojure (thanks to Neanderthal), so it's very painless. It sounds like you're using it in a compile/run loop from C/C++, which sounds like not much fun.

