Not to mention operations like complex blends (Darken, Multiply, Color Dodge, Hue, Luminosity, etc.) and many SIMD accelerations. It also has Node, Python, and C interfaces. Though I haven't found an excuse to use it yet. Surely there's got to be some excuse... ;)
I understood most of the information presented, but I'm missing some basic technical point about what the AMX actually does. Is it a specialized module plus libraries that's very good at dot products, for example? Matrix inversions?
Yes, by using f2c. For BLAS and LAPACK, this is a big effort and doesn't run out of the box. The cblas libraries are too old and don't contain new functions.
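In case it helps anyone, here's a minimal sketch of what calling an f2c-translated BLAS routine from C ends up looking like. Everything below is illustrative: the integer type is an assumption (f2c's header typedefs its own `integer`, often `long int`), and depending on translation options the routine may also expect trailing string-length arguments.

    /* Hedged sketch: calling a Fortran/f2c-style dgemm_ from C.
       All arguments are passed by pointer and matrices are column-major.
       Assumes 32-bit integers; some f2c builds also append ftnlen
       string-length arguments for the char* parameters. */
    extern void dgemm_(const char *transa, const char *transb,
                       const int *m, const int *n, const int *k,
                       const double *alpha, const double *a, const int *lda,
                       const double *b, const int *ldb,
                       const double *beta, double *c, const int *ldc);

    int main(void) {
        double A[4] = {1, 3, 2, 4};  /* [[1,2],[3,4]] column-major */
        double B[4] = {5, 7, 6, 8};  /* [[5,6],[7,8]] column-major */
        double C[4] = {0};
        int two = 2;
        double one = 1.0, zero = 0.0;
        dgemm_("N", "N", &two, &two, &two, &one, A, &two, B, &two, &zero, C, &two);
        return 0;
    }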
Thanks for the links. If anyone is wondering about some of the hoops that need to be jumped through to make it work, here's another guide [1].
One question, in case you or anyone else knows: what's the story behind AMD's apparent lack of math library development? Years ago, AMD had ACML as their high-performance BLAS competitor to MKL. Eventually, it hit end of life and became AOCL [2]. I've not tried it, but I'm sure it's fine. That said, Intel has done steady, consistent work on MKL and added a huge amount of really important functionality, such as its sparse libraries. When MKL works on AMD hardware, AMD benefits from that work too, but I've been surprised that they haven't made similar investments.
Also, in case anyone is wondering, ARM's competing library is called the Arm Performance Libraries. I'm not sure how well it works, and it's only available under a commercial license. I just went to check, and pricing is not immediately available. All that said, it looks to be dense BLAS/LAPACK along with FFT, and no sparse support.
Small addition: matrix multiplications (and other operations implemented through BLAS) do use the M1's AMX matrix co-processor (through the Apple Accelerate framework).
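For anyone who wants to try it, here's a minimal sketch of a BLAS call through Accelerate. Whether it actually runs on the AMX units is up to Accelerate's internals, and presumably depends on matrix size and OS version:

    // Hedged sketch: DGEMM via Apple's Accelerate framework.
    // Build with: clang gemm.c -framework Accelerate
    // Accelerate decides internally whether the AMX co-processor is used.
    #include <Accelerate/Accelerate.h>
    #include <stdio.h>

    int main(void) {
        double A[4] = {1, 2, 3, 4};  // 2x2, row-major
        double B[4] = {5, 6, 7, 8};
        double C[4] = {0};
        // C = 1.0 * A * B + 0.0 * C
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    2, 2, 2, 1.0, A, 2, B, 2, 0.0, C, 2);
        printf("%g %g\n%g %g\n", C[0], C[1], C[2], C[3]);
        return 0;
    }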
Very cool! BTW, it's not mentioned in the readme, so I assume it's only for running full-precision models, or do quantized GGML/GPTQ/etc. models also work with it?
I wonder if they'll somehow include the AMX 'instruction' (or whatever it is) in BLIS kernels. GEMM isn't everything, but it is a pretty important building block in linear algebra. (That's the big observation behind these fancy tile-based BLAS implementations; see the sketch below.)
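Here's a scalar sketch of that blocking idea, with an illustrative tile size (this shows the general shape, not BLIS's actual kernel): the innermost loops form the dense microkernel that a matrix unit like AMX would replace.

    // Hedged sketch of a blocked GEMM: C += A * B, n x n, row-major.
    // TILE is an arbitrary choice; real libraries tune blocks per cache level.
    #include <stddef.h>

    #define TILE 64

    void gemm_blocked(size_t n, const double *A, const double *B, double *C) {
        for (size_t ii = 0; ii < n; ii += TILE)
            for (size_t kk = 0; kk < n; kk += TILE)
                for (size_t jj = 0; jj < n; jj += TILE)
                    // microkernel: the part a matrix co-processor would take over
                    for (size_t i = ii; i < ii + TILE && i < n; i++)
                        for (size_t k = kk; k < kk + TILE && k < n; k++) {
                            double a = A[i * n + k];
                            for (size_t j = jj; j < jj + TILE && j < n; j++)
                                C[i * n + j] += a * B[k * n + j];
                        }
    }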
Scale, and only scale. As with most extensions, there's negative value if you're doing a small number of calculations, but it pays off for larger workloads like ML training.
We've been calculating with matrices since time immemorial, but AMX has only recently become a thing. Intel just introduced their own AMX instructions, too.
Confusingly, there are two mechanisms for doing matrix operations on the new Apple hardware: AMX (https://github.com/corsix/amx) and the ANE (Apple Neural Engine), which is enabled by CoreML. This code does not run on the Neural Engine, but the author has a branch of his whisper.cpp project which uses it, here: https://github.com/ggerganov/whisper.cpp/pull/566 - so it may not be long before we see it applied here as well. All of this is to say that it could actually get significantly faster if some of this work were handed off to the ANE via CoreML.
You just run MKL from the oneAPI distribution, and it gives decent performance on EPYC2, but basically only for double precision; I don't remember if that includes complex.
ACML was never competitive with Goto/OpenBLAS in my comparisons on a variety of Opterons. It's been discarded, and AMD now uses a somewhat enhanced version of BLIS.
BLIS is similar to, and sometimes better than, ARMPL on aarch64, e.g. on ThunderX2.
You can find the engine used here [1], the API built around it here [2], and its WASM port here [3]; the WebAssembly matrix multiplication optimizations are here [4].
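For a flavor of what those WASM optimizations tend to look like (an illustrative sketch, not the linked code), here's a matmul inner loop using the wasm_simd128.h intrinsics, built with something like emcc -O3 -msimd128:

    // Hedged sketch: C += A * B with 128-bit WASM SIMD, row-major floats,
    // n assumed to be a multiple of 4 for simplicity.
    #include <wasm_simd128.h>
    #include <stddef.h>

    void matmul_simd(size_t n, const float *A, const float *B, float *C) {
        for (size_t i = 0; i < n; i++)
            for (size_t k = 0; k < n; k++) {
                v128_t a = wasm_f32x4_splat(A[i * n + k]);  // broadcast A[i][k]
                for (size_t j = 0; j < n; j += 4) {
                    v128_t b = wasm_v128_load(&B[k * n + j]);
                    v128_t c = wasm_v128_load(&C[i * n + j]);
                    wasm_v128_store(&C[i * n + j],
                                    wasm_f32x4_add(c, wasm_f32x4_mul(a, b)));
                }
            }
    }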
It’s interesting: it does very well in multicore and with complex numbers. The issue is that there’s no way we’ll rewrite all our code littered with calls to BLAS and LAPACK to use a different API. It looks like they have a BLAS compatibility layer, though; I hope it’s good.
Intel did a great thing for people interested in ML and numerics research by making their MKL library and compiler free and cross-platform. Even today, on my AMD Zen 3 Ryzen machine, Intel's MKL-linked numpy and pytorch are in some operations 10x (yes, that is really ten times) faster than the next best alternative (OpenBLAS, etc.). I was shocked to discover how much of a difference MKL makes for CPU workloads. This is mostly because it makes use of AVX2 CPU extensions, which make certain matrix operations a lot faster.
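If you want to reproduce that kind of comparison at the C level, here's a hedged sketch of a DGEMM timing harness: build it once against MKL (e.g. -lmkl_rt) and once against OpenBLAS (-lopenblas) and compare the reported GFLOP/s. The matrix size is an arbitrary choice.

    // Hedged sketch: time one DGEMM and report GFLOP/s.
    // The cblas.h interface is the same across MKL and OpenBLAS.
    #include <cblas.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void) {
        const int n = 2048;  // arbitrary size, big enough to amortize overhead
        double *A = malloc(sizeof(double) * n * n);
        double *B = malloc(sizeof(double) * n * n);
        double *C = malloc(sizeof(double) * n * n);
        for (long i = 0; i < (long)n * n; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, A, n, B, n, 0.0, C, n);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%.2f GFLOP/s\n", 2.0 * n * n * n / secs / 1e9);  // GEMM = 2*n^3 flops
        free(A); free(B); free(C);
        return 0;
    }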