Not to mention operations like complex blends (Darken, Multiply, Color Dodge, Hue, Luminosity, etc.) and many SIMD accelerations. It also has Node, Python, and C interfaces. Though I haven't found an excuse to use it yet. Surely there's got to be some excuse... ;)
I understood most of the information presented, but I'm missing some basic technical point about what the AMX actually does. Is it a specialized module plus libraries that's very good at dot products, for example? Matrix inversions?
Yes, by using f2c. For BLAS and LAPACK, this is a big effort and doesn't run out of the box. The cblas libraries are too old and don't contain new functions.
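In case it helps anyone, here's a minimal sketch of what calling an f2c-translated BLAS routine from C ends up looking like. Everything below is illustrative: the integer type is an assumption (f2c's header typedefs its own `integer`, often `long int`), and depending on translation options the routine may also expect trailing string-length arguments.

    /* Hedged sketch: calling a Fortran/f2c-style dgemm_ from C.
       All arguments are passed by pointer and matrices are column-major.
       Assumes 32-bit integers; some f2c builds also append ftnlen
       string-length arguments for the char* parameters. */
    extern void dgemm_(const char *transa, const char *transb,
                       const int *m, const int *n, const int *k,
                       const double *alpha, const double *a, const int *lda,
                       const double *b, const int *ldb,
                       const double *beta, double *c, const int *ldc);

    int main(void) {
        double A[4] = {1, 3, 2, 4};  /* [[1,2],[3,4]] column-major */
        double B[4] = {5, 7, 6, 8};  /* [[5,6],[7,8]] column-major */
        double C[4] = {0};
        int two = 2;
        double one = 1.0, zero = 0.0;
        dgemm_("N", "N", &two, &two, &two, &one, A, &two, B, &two, &zero, C, &two);
        return 0;
    }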
Thanks for the links. If anyone is wondering about some of the hoops that need to be jumped through to make it work, here's another guide [1].
One question, in case you or anyone else knows: what's the story behind AMD's apparent lack of math library development? Years ago, AMD had ACML as their high-performance BLAS competitor to MKL. Eventually, it hit end of life and became AOCL [2]. I've not tried it, but I'm sure it's fine. That said, Intel has done steady, consistent work on MKL and added a huge amount of really important functionality, such as its sparse libraries. When MKL works on AMD hardware, AMD benefits from that work too, but I've been surprised that they haven't made similar investments.
Also, in case anyone is wondering, ARM's competing library is called the Arm Performance Libraries. I'm not sure how well it works, and it's only available under a commercial license. I just went to check, and pricing is not immediately available. All that said, it looks to be dense BLAS/LAPACK along with FFT, and no sparse support.
Small addition: matrix multiplications (and other operations implemented through BLAS) do use the M1's AMX matrix co-processor (through the Apple Accelerate framework).
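For anyone who wants to try it, here's a minimal sketch of a BLAS call through Accelerate. Whether it actually runs on the AMX units is up to Accelerate's internals, and presumably depends on matrix size and OS version:

    // Hedged sketch: DGEMM via Apple's Accelerate framework.
    // Build with: clang gemm.c -framework Accelerate
    // Accelerate decides internally whether the AMX co-processor is used.
    #include <Accelerate/Accelerate.h>
    #include <stdio.h>

    int main(void) {
        double A[4] = {1, 2, 3, 4};  // 2x2, row-major
        double B[4] = {5, 6, 7, 8};
        double C[4] = {0};
        // C = 1.0 * A * B + 0.0 * C
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    2, 2, 2, 1.0, A, 2, B, 2, 0.0, C, 2);
        printf("%g %g\n%g %g\n", C[0], C[1], C[2], C[3]);
        return 0;
    }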
Very cool! BTW, it's not mentioned in the readme, so I assume it's only for running full-precision models, or do quantized GGML/GPTQ/etc. models also work with it?
I wonder if they'll somehow include the AMX 'instruction' (or whatever it is) in BLIS kernels. GEMM isn't everything, but it is a pretty important building block in linear algebra. (That's the big observation behind these fancy tile-based BLAS implementations; see the sketch below.)
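Here's a scalar sketch of that blocking idea, with an illustrative tile size (this shows the general shape, not BLIS's actual kernel): the innermost loops form the dense microkernel that a matrix unit like AMX would replace.

    // Hedged sketch of a blocked GEMM: C += A * B, n x n, row-major.
    // TILE is an arbitrary choice; real libraries tune blocks per cache level.
    #include <stddef.h>

    #define TILE 64

    void gemm_blocked(size_t n, const double *A, const double *B, double *C) {
        for (size_t ii = 0; ii < n; ii += TILE)
            for (size_t kk = 0; kk < n; kk += TILE)
                for (size_t jj = 0; jj < n; jj += TILE)
                    // microkernel: the part a matrix co-processor would take over
                    for (size_t i = ii; i < ii + TILE && i < n; i++)
                        for (size_t k = kk; k < kk + TILE && k < n; k++) {
                            double a = A[i * n + k];
                            for (size_t j = jj; j < jj + TILE && j < n; j++)
                                C[i * n + j] += a * B[k * n + j];
                        }
    }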
Scale, and only scale. As with most extensions, there's negative value if you're doing a small number of calculations, but it pays off for larger workloads like ML training.
We've been calculating with matrices since time immemorial, but AMX has only recently become a thing. Intel just introduced their own AMX instructions, too.
Confusingly, there are two mechanisms for doing matrix operations on the new Apple hardware: AMX (https://github.com/corsix/amx) and the ANE (Apple Neural Engine), which is enabled by CoreML. This code does not run on the Neural Engine, but the author has a branch of his whisper.cpp project which uses it, here: https://github.com/ggerganov/whisper.cpp/pull/566 - so it may not be long before we see it applied here as well. All of this is to say that it could actually get significantly faster if some of this work were handed off to the ANE via CoreML.
You just run MKL from the oneAPI distribution, and it gives decent performance on EPYC2, but basically only for double precision; I don't remember if that includes complex.
ACML was never competitive with Goto/OpenBLAS in my comparisons on a variety of Opterons. It's been discarded, and AMD now uses a somewhat enhanced version of BLIS.
BLIS is similar to, and sometimes better than, ARMPL on aarch64, e.g. on ThunderX2.
You can find the engine used here [1], the API built around it here [2], and its WASM port here [3]; the WebAssembly matrix multiplication optimizations are here [4].
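For a flavor of what those WASM optimizations tend to look like (an illustrative sketch, not the linked code), here's a matmul inner loop using the wasm_simd128.h intrinsics, built with something like emcc -O3 -msimd128:

    // Hedged sketch: C += A * B with 128-bit WASM SIMD, row-major floats,
    // n assumed to be a multiple of 4 for simplicity.
    #include <wasm_simd128.h>
    #include <stddef.h>

    void matmul_simd(size_t n, const float *A, const float *B, float *C) {
        for (size_t i = 0; i < n; i++)
            for (size_t k = 0; k < n; k++) {
                v128_t a = wasm_f32x4_splat(A[i * n + k]);  // broadcast A[i][k]
                for (size_t j = 0; j < n; j += 4) {
                    v128_t b = wasm_v128_load(&B[k * n + j]);
                    v128_t c = wasm_v128_load(&C[i * n + j]);
                    wasm_v128_store(&C[i * n + j],
                                    wasm_f32x4_add(c, wasm_f32x4_mul(a, b)));
                }
            }
    }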
It’s interesting: it does very well in multicore and with complex numbers. The issue is that there’s no way we’ll rewrite all our code littered with calls to BLAS and LAPACK to use a different API. It looks like they have a BLAS compatibility layer, though; I hope it’s good.
Intel did a great thing for people interested in ML and numerics research by making their MKL library and compiler free and cross-platform. Even today, on my AMD Zen 3 Ryzen machine, Intel's MKL-linked numpy and pytorch are in some operations 10x (yes, that is really ten times) faster than the next best alternative (OpenBLAS, etc.). I was shocked to discover how much of a difference MKL makes for CPU workloads. This is mostly because it makes use of AVX2 CPU extensions, which make certain matrix operations a lot faster.
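If you want to reproduce that kind of comparison at the C level, here's a hedged sketch of a DGEMM timing harness: build it once against MKL (e.g. -lmkl_rt) and once against OpenBLAS (-lopenblas) and compare the reported GFLOP/s. The matrix size is an arbitrary choice.

    // Hedged sketch: time one DGEMM and report GFLOP/s.
    // The cblas.h interface is the same across MKL and OpenBLAS.
    #include <cblas.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void) {
        const int n = 2048;  // arbitrary size, big enough to amortize overhead
        double *A = malloc(sizeof(double) * n * n);
        double *B = malloc(sizeof(double) * n * n);
        double *C = malloc(sizeof(double) * n * n);
        for (long i = 0; i < (long)n * n; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, A, n, B, n, 0.0, C, n);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%.2f GFLOP/s\n", 2.0 * n * n * n / secs / 1e9);  // GEMM = 2*n^3 flops
        free(A); free(B); free(C);
        return 0;
    }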