
The BLAS & LAPACK subset of the Intel Math Kernel Library (MKL) API is very well implemented in open-source projects such as OpenBLAS and BLIS:

https://github.com/flame/blis

Both are well optimized for AMD CPUs.




Had to look it up:

> Intel oneAPI Math Kernel Library (Intel oneMKL; formerly Intel Math Kernel Library or Intel MKL), is a library of optimized math routines for science, engineering, and financial applications. Core math functions include BLAS, LAPACK, ScaLAPACK, sparse solvers, fast Fourier transforms, and vector math.

https://en.m.wikipedia.org/wiki/Math_Kernel_Library


Thanks for the links. If anyone is wondering about some of the hoops that need to be jumped through to make it work, here's another guide [1].

One question in case you or anyone else knows: what's the story behind AMD's apparent lack of math library development? Years ago, AMD had ACML as their high-performance BLAS competitor to MKL. Eventually it hit end of life and was succeeded by AOCL [2]. I've not tried it, but I'm sure it's fine. That said, Intel has done steady, consistent work on MKL and added a huge amount of really important functionality, such as its sparse libraries. When it works, AMD has benefited from this work as well, but I've been surprised that they haven't made similar investments.

Also, in case anyone is wondering, ARM's competing library is called the Arm Performance Libraries. I'm not sure how well it works, and it's only available under a commercial license. I just went to check, and pricing is not immediately available. All that said, it looks to be dense BLAS/LAPACK along with FFT, with no sparse support.

[1] https://www.pugetsystems.com/labs/hpc/How-To-Use-MKL-with-AM...

[2] https://developer.amd.com/amd-aocl/


The mythology surrounding the Intel tools and libraries really ought to die. It's bizarre seeing people decide they must use MKL rather than the linear algebra libraries that AMD has been working hard to optimize for their hardware (and possibly other hardware incidentally). The same goes for compiler code generation.

Free BLAS implementations are pretty much on a par with MKL, at least for large-dimension level-3 operations in BLIS's case, even on Haswell. For small matrices, MKL only became fast after libxsmm showed the way. (I don't know about libxsmm on current AMD hardware, but it's free software you can work on if necessary, as AMD have done with BLIS.) OpenBLAS and BLIS perform infinitely better than MKL in general, in the sense that they can run on all CPU architectures (and BLIS's plain C kernels get about 75% of the hand-written DGEMM kernel's performance).

The differences between the implementations are comparable with the noise in typical HPC jobs, even if performance were entirely dominated by, say, DGEMM (and getting close to peak floating-point throughput is atypical). On the other hand, you can see a several-fold difference in MPI performance in some cases.


Yes, starting with the MKL 2020.01 release. The Wikipedia page has more information and references:

https://en.wikipedia.org/wiki/Math_Kernel_Library#Performanc...

This is quite bad, since a lot of software relies on Intel MKL as the default BLAS implementation (e.g. PyTorch binaries).


The statement you were responding to is only referring to the Intel MKL, though. There are many other BLAS libraries. Were you making a more general statement about some set of BLAS implementations? Or about the BLAS interface in general, perhaps?

The expectation in the HPC community is that an interested vendor will provide their own BLAS/LAPACK implementation (MKL is a BLAS/LAPACK implementation, along with a bunch of other stuff) that is well tuned for their hardware. Such libraries aren't just tuned for an architecture; they might be tuned for a given generation or even for particular SKUs.
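
To make the "many implementations behind one interface" point concrete, here's a minimal sketch of a call through the CBLAS interface. The file name and link lines are illustrative and depend on your installation; the point is that the source is identical no matter whose BLAS you use, and only the link line changes.

    /* dgemm_demo.c -- calling the standard CBLAS interface.
     * The same source links against MKL, OpenBLAS, BLIS (built with its
     * CBLAS layer) or AOCL-BLAS; roughly:
     *   cc dgemm_demo.c -lopenblas       (OpenBLAS)
     *   cc dgemm_demo.c -lmkl_rt         (MKL single dynamic library)
     *   cc dgemm_demo.c -lblis           (BLIS configured with CBLAS)
     * Exact flags and header name vary by install (MKL ships mkl_cblas.h).
     */
    #include <stdio.h>
    #include <cblas.h>

    int main(void) {
        /* C = 1.0 * A * B + 0.0 * C, all 2x2, row-major. */
        double A[] = {1, 2, 3, 4};
        double B[] = {5, 6, 7, 8};
        double C[] = {0, 0, 0, 0};

        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    2, 2, 2,     /* m, n, k       */
                    1.0, A, 2,   /* alpha, A, lda */
                    B, 2,        /*        B, ldb */
                    0.0, C, 2);  /* beta,  C, ldc */

        printf("%g %g\n%g %g\n", C[0], C[1], C[2], C[3]);
        return 0;
    }

Swapping implementations is then purely a link-time (or even LD_PRELOAD-time) decision; the call sites never change.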

Intel's Math Kernel Library has optimized versions of BLAS, a sparse solver, etc. I'd like to see a comparison using ICC/MKL.

It's been their official BLAS [1] since 2015, when they moved away from their proprietary ACML implementation [2].

[1]https://developer.amd.com/amd-aocl/blas-library/

[2] https://developer.amd.com/open-source-strikes-again-accelera...


It’s interesting: it does very well in multicore and with complex numbers. The issue is that there’s no way we’ll rewrite all our code littered with calls to BLAS and LAPACK to use a different API. It looks like they have a BLAS compatibility layer, though; I hope it’s good.
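
For what it’s worth, the call sites such a compatibility layer has to serve are just the plain Fortran-style BLAS symbols, so nothing in the calling code changes. A small hedged sketch of that kind of legacy call (prototype written out by hand here; assumes the usual LP64 32-bit-integer convention):

    /* Legacy-style call into the Fortran BLAS symbol ddot_ -- the kind of call
     * old code is full of. A BLAS compatibility layer only has to export these
     * symbols with the reference semantics; the caller never knows which
     * implementation it got. Note the pass-by-reference Fortran conventions.
     */
    #include <stdio.h>

    /* Hand-written prototype; real code usually pulls this from a header. */
    extern double ddot_(const int *n, const double *x, const int *incx,
                        const double *y, const int *incy);

    int main(void) {
        int n = 3, inc = 1;
        double x[] = {1.0, 2.0, 3.0};
        double y[] = {4.0, 5.0, 6.0};
        /* 1*4 + 2*5 + 3*6 = 32 */
        printf("dot = %g\n", ddot_(&n, x, &inc, y, &inc));
        return 0;
    }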

It even has a nice, friendly licence.


> In most programming languages, the linear algebra handling for these kinds of standard operations is performed by underlying libraries called the BLAS and LAPACK library. Most open source projects use an implementation called OpenBLAS, a C implementation of BLAS/LAPACK which does many of the tricks required for getting much higher performance than "simple" codes by using CPU-specialized kernels based on the sizes of the CPU's caches. Open source projects like R and SciPy also ship with OpenBLAS because of its generally good performance and open licensing, though it's known that OpenBLAS is handily outperformed by Intel MKL which is a vendor-optimized BLAS/LAPACK implementation for Intel CPUs (which works on AMD CPUs as well).

Much of this is more complicated than it’s presented here. Most open source software doesn’t ship with any assumptions about a particular BLAS/LAPACK implementation at all, and on HPC systems you are generally expected to choose one as appropriate and compile your code against it. It is generally only when you download a precompiled version that you’re given a particular implementation, but that doesn’t mean you can’t use another one if you compile from source, since the BLAS and LAPACK libraries just present a standard API. Generally, for performance reasons, you want to compile specifically for your platform, because precompiled wheels from Conda, PyPI, etc. will leave performance on the table.

On forward-thinking cluster teams these days, sysadmins use tools like Spack and EasyBuild, and software is made available to users either directly or by request, so it’s usual to log into a cluster and have multiple implementations available to choose from and compile your code against. More often than not, though, it’s still on you to compile against the dependencies you need. It’s a worthwhile exercise in HPC to try different implementations and check the performance characteristics of your code on the particular machine.
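
As a sketch of that exercise, a crude single-call DGEMM timer like the one below can be rebuilt against each BLAS on the machine and compared. The link lines are illustrative and installation-dependent; pin the thread counts first (e.g. OPENBLAS_NUM_THREADS, MKL_NUM_THREADS, BLIS_NUM_THREADS) so the comparison is on equal footing.

    /* blas_bench.c -- crude DGEMM timing sketch for comparing BLAS builds.
     * Rebuild against each implementation available, e.g.
     *   cc -O2 blas_bench.c -lopenblas
     *   cc -O2 blas_bench.c -lmkl_rt
     * and compare the reported GFLOP/s.
     */
    #define _POSIX_C_SOURCE 199309L
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <cblas.h>

    int main(void) {
        const int n = 2000;   /* one DGEMM of this size costs ~2*n^3 flops */
        double *A = malloc(sizeof *A * n * n);
        double *B = malloc(sizeof *B * n * n);
        double *C = malloc(sizeof *C * n * n);
        if (!A || !B || !C) return 1;

        for (long i = 0; i < (long)n * n; i++) {
            A[i] = rand() / (double)RAND_MAX;
            B[i] = rand() / (double)RAND_MAX;
            C[i] = 0.0;
        }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, A, n, B, n, 0.0, C, n);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        printf("n=%d  %.3f s  %.1f GFLOP/s\n", n, s, 2.0 * n * n * n / s / 1e9);

        free(A); free(B); free(C);
        return 0;
    }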


The advantage of MKL is typically greatly overrated anyhow, and I don't see why one should care. BLIS and OpenBLAS have well-tuned x86 BLAS implementations, and run infinitely faster than MKL on ARM and POWER, for instance, since MKL doesn't run there at all; if you're interested in small matrices on x86_64, there's libxsmm. (I know MKL has more than BLAS, but I don't know what it has that lacks rough free equivalents.) BLIS performance: https://github.com/flame/blis/blob/master/docs/Performance.m...

Intel hired Mr. Goto a while ago; he wrote GotoBLAS (from which OpenBLAS is derived), so this one dude is responsible for a ton of FLOPs.

The thing that made GotoBLAS good was the hand-tuned assembly kernels, so we can be reasonably sure that MKL has hand-tuned assembly kernels at this point.
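
To make "hand-tuned kernel" concrete, here is a rough, unoptimized sketch of the register-blocked micro-kernel at the heart of a GotoBLAS/BLIS-style DGEMM, written with AVX2/FMA intrinsics rather than assembly. This is not any library's actual kernel; real ones are unrolled, prefetched, and paired with panel packing and multi-level cache blocking.

    /* Sketch of a 4x4 DGEMM micro-kernel using AVX2 + FMA intrinsics.
     * a: packed 4 x k panel of A (column p stored at a[4*p .. 4*p+3])
     * b: packed k x 4 panel of B (row p stored at b[4*p .. 4*p+3])
     * c: 4x4 column-major tile of C with leading dimension ldc (>= 4)
     * Computes C += A * B for this tile.
     * Compile with something like: cc -O2 -mavx2 -mfma -c ukernel.c
     */
    #include <immintrin.h>

    void dgemm_ukernel_4x4(int k, const double *a, const double *b,
                           double *c, int ldc)
    {
        /* Keep one YMM register per column of the 4x4 C tile. */
        __m256d c0 = _mm256_loadu_pd(c + 0 * ldc);
        __m256d c1 = _mm256_loadu_pd(c + 1 * ldc);
        __m256d c2 = _mm256_loadu_pd(c + 2 * ldc);
        __m256d c3 = _mm256_loadu_pd(c + 3 * ldc);

        for (int p = 0; p < k; p++) {
            __m256d ap = _mm256_loadu_pd(a + 4 * p);    /* A[0..3][p] */
            __m256d b0 = _mm256_set1_pd(b[4 * p + 0]);  /* B[p][0]    */
            __m256d b1 = _mm256_set1_pd(b[4 * p + 1]);
            __m256d b2 = _mm256_set1_pd(b[4 * p + 2]);
            __m256d b3 = _mm256_set1_pd(b[4 * p + 3]);

            c0 = _mm256_fmadd_pd(ap, b0, c0);  /* C[:,0] += A[:,p] * B[p][0] */
            c1 = _mm256_fmadd_pd(ap, b1, c1);
            c2 = _mm256_fmadd_pd(ap, b2, c2);
            c3 = _mm256_fmadd_pd(ap, b3, c3);
        }

        _mm256_storeu_pd(c + 0 * ldc, c0);
        _mm256_storeu_pd(c + 1 * ldc, c1);
        _mm256_storeu_pd(c + 2 * ldc, c2);
        _mm256_storeu_pd(c + 3 * ldc, c3);
    }

Keeping a small tile of C resident in vector registers while streaming packed panels of A and B through it is what the hand tuning buys; everything above this loop is cache blocking and packing.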

I think the tongue-in-cheek comment is actually really good. It is a reminder that BLAS is more like a linear algebra API than a particular library.


Intel did a great thing for people interested in ML and numerical research by making their MKL library and compiler free and cross-platform compatible. Even today, on my AMD Zen 3 Ryzen machine, Intel's MKL-linked NumPy and PyTorch are, for some operations, 10x (yes, really ten times) faster than the next best alternative (OpenBLAS etc.). I was shocked to discover how much of a difference MKL makes for CPU workloads. This is mostly because it makes use of AVX2 CPU extensions, which make certain matrix operations a lot faster.

At first glance, it looks like the way they're optimizing the BLAS/LAPACK implementations is by making them CPU-architecture specific - the same game that Intel MKL plays. That's probably also why they reach the same performance as MKL.

Good to see they aren't reinventing the wheel, and openly expressing inspiration from NumPy is also a nice touch.


To save others from searching for it:

> ATLAS (Automatically Tuned Linear Algebra Software) provides highly optimized Linear Algebra kernels for arbitrary cache-based architectures. ATLAS provides ANSI C and Fortran77 interfaces for the entire BLAS API, and a small portion of the LAPACK API.

https://sourceforge.net/projects/math-atlas/


The real money shot is here: https://github.com/flame/blis/blob/master/docs/Performance.m...

It seems that the selling point is that BLIS does multi-core quite well. I am especially impressed that it does as well as the highly optimized Intel MKL on Intel CPUs.

I do not see the selling point of BLIS-specific APIs, though. The whole point of having an open BLAS API standard is that numerical libraries should be drop-in replaceable, so when a new library (such as BLIS here) comes along, one could just re-link the library and reap the performance gain immediately.

What is interesting is that numerical linear algebra work is, by nature, mostly embarrassingly parallel, so it should not be too difficult to write multi-core implementations. And yet BLIS performs so much better here than some other industry-leading implementations in multi-core configurations. So the question is not why BLIS does so well; the question is why some other implementations do so poorly.
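
As a toy illustration of the embarrassingly-parallel part (my sketch, not BLIS's actual scheme): different column blocks of C are completely independent, so the outermost loop of a blocked multiply parallelizes trivially with OpenMP. The hard-won performance lives elsewhere, in the packing and micro-kernels.

    /* Naive illustration: parallelizing column-major C += A*B over
     * independent column blocks of C. Each block is written by exactly one
     * thread, so no synchronization is needed.
     * Compile with something like: cc -O2 -fopenmp -c gemm_par.c
     */
    #include <stddef.h>

    void gemm_blocked_parallel(int m, int n, int k,
                               const double *A, int lda,
                               const double *B, int ldb,
                               double *C, int ldc)
    {
        const int NB = 64;  /* width of a column block of C */

        #pragma omp parallel for schedule(dynamic)
        for (int jb = 0; jb < n; jb += NB) {
            int jmax = jb + NB < n ? jb + NB : n;
            for (int j = jb; j < jmax; j++)
                for (int p = 0; p < k; p++) {
                    double bpj = B[p + (size_t)j * ldb];
                    for (int i = 0; i < m; i++)
                        C[i + (size_t)j * ldc] += A[i + (size_t)p * lda] * bpj;
                }
        }
    }

This will still be nowhere near library speed, which is rather the point: the thread-level decomposition is the easy part, and the single-core kernel and memory-hierarchy work are where implementations diverge.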


The Intel Math Kernel Library is decent, and very fast. It has disadvantages: it's not optimized for AMD processors and it's not free (though the student version is cheap, and the free trial does not really expire).

Some time ago I compared different BLAS implementations (OpenBLAS, MKL, ACML, etc.) on different Intel CPU architectures, in case somebody is interested in the differences between them:

http://stackoverflow.com/questions/5260068/multithreaded-bla...


AMD has a compiler (C, C++, Fortran) and various math libs: libm for the basics, BLIS/libFLAME for BLAS/LAPACK.
