
> RDNA3 does have WMMA which should at least get the foot in the door for things like AI/ML(-weighted TAAU) upscaling.

Do keep in mind that RDNA3 WMMA is very slow, running at the same theoretical TFLOPS as the shader cores (which are doubled from RDNA2 thanks to the limited dual-issue support that WMMA can exploit). Nvidia tensor cores and Intel XMX run closer to 4:1 or 8:1 relative to their vector throughput.
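
To put rough numbers on that, using the commonly quoted peak figures (approximate, and ignoring sparsity and lower-precision formats, which only widen the Nvidia/Intel ratio further):

    \text{RX 7900 XTX:}\quad \frac{\text{FP16 WMMA}}{\text{FP32 vector}} \approx \frac{123\ \text{TFLOPS}}{61\ \text{TFLOPS}} \approx 2{:}1

    \text{RTX 4090:}\quad \frac{\text{FP16 tensor (dense)}}{\text{FP32 vector}} \approx \frac{330\ \text{TFLOPS}}{83\ \text{TFLOPS}} \approx 4{:}1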

> It just feels obviously wrong to do wave-8 right out of the gate

That's because Alchemist isn't a true first-generation product; it's a scaled-up version of Intel's (relatively mature at this point) integrated graphics architecture. This means Alchemist has suffered large growing pains (Gen12 was not really designed to be used in products bigger than maybe 128 EUs; the A770 is 512 EUs). It also has some separation anxiety, as seen in its need for ReBAR. I also recall, very early in Alchemist's life (pre-release), a driver optimization with a huge benefit that turned out to be a fix for the wrong memory region being used: all memory is the same on an iGPU, but not on a dGPU.




> But why would game programmers care about shader core latency??? I seriously don't understand.

Well, I don't know per se. What I can say is that the various improvements AMD made to RDNA did the following:

1. Barely increased TFLOPs -- Especially compared to CDNA, it is clear that RDNA has fewer FLOPs

2. Despite #1, improved gaming performance dramatically

--------

When we look at RDNA, we can see that many, many latency numbers improved (though throughput numbers, like TFLOPs, aren't that much better than Vega 7). It's clear that the RDNA team did some kind of analysis of the kinds of shaders video game programmers actually write, and tailored RDNA to match them better.

> I've never seen any literature that complained about load/store access latency in the shader core. It's just so low level...

Those are just things I've noticed about the RDNA architecture. Maybe I'm latching onto the wrong things here, but... it's clear that RDNA was aimed at the gaming workload.

Perhaps modern shaders are no longer just brute-force vertex/pixel style shaders, but are instead doing far more complex things. These more complicated shaders could be more latency bound rather than TFLOPs bound.
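
As a toy illustration of the distinction (plain C++ rather than real shader code, purely to show the shape of the two workloads):

    // Latency-bound: each iteration depends on the previous load, so the loop
    // runs at the speed of the load latency, not of ALU or memory throughput.
    float latency_bound(const float* data, const int* next, int n) {
      float acc = 0.0f;
      int i = 0;
      for (int k = 0; k < n; ++k) {
        acc += data[i];
        i = next[i];   // pointer-chase: nothing to overlap
      }
      return acc;
    }

    // Throughput-bound: every element is independent, so the hardware can keep
    // many loads/FMAs in flight and the limit is raw FLOPs and bandwidth.
    void throughput_bound(const float* a, const float* b, float* out, int n) {
      for (int k = 0; k < n; ++k)
        out[k] = a[k] * b[k] + 1.0f;
    }

A shader full of dependent fetches and branchy material logic looks a lot more like the first loop than the second.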


> So to enable RDNA1, they'd need to ship RDNA1 code in all their libraries

RDNA1? More like three binary slices: Navi10 (5700 XT), Navi12 (AWS G4ad) and Navi14 (5500 XT) each require separate binaries!
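
For illustration (a hedged sketch; the gfx target names are from memory, so double-check them): a HIP fat binary only contains code objects for the architectures listed at build time, e.g. --offload-arch=gfx1010 for Navi10, gfx1011 for Navi12, gfx1012 for Navi14, and you can query which ISA the installed card reports:

    #include <hip/hip_runtime.h>
    #include <cstdio>

    int main() {
      hipDeviceProp_t prop;
      if (hipGetDeviceProperties(&prop, 0) != hipSuccess) return 1;
      // Prints something like "gfx1010" on an RX 5700 XT. A library whose fat
      // binary lacks a code object for this ISA simply fails to launch kernels.
      printf("device 0 ISA: %s\n", prop.gcnArchName);
      return 0;
    }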

> I do believe that running oneAPI on AMD is possible, but it still needs HIP/ROCm?

Yes: oneAPI/SYCL runtimes for AMD GPUs sit on top of an underlying HIP/ROCm implementation.

> Wonder if it would be possible to bake an L0 backend for AMD

Yes. But why would anybody not named AMD do that? It's AMD's hardware so AMD has to support it. OSS/hobbyists can only do so much.

> Frankly I wish AMD and Intel just started working together more on this stuff

Why? AMD truly does not care about GPGPU APIs for the masses. For their management it's a useless additional expense so they haven't been doing it.

A chunk of the community has wanted to treat AMD as an NV alternative for this, but AMD is not selling the same product: they treat their gaming GPU line as gaming-centred, with bare-minimum support for other markets (if any), while NV caters to a much wider audience.

That's how the market ended up where it is; per the Q3 2022 discrete GPU market share report: NVIDIA at 88%, AMD at 8%, followed by Intel at 4%.


> Even the discrete-ish Iris Xe graphics are surprisingly fast

Anyone have any idea how well this stacks up against RDNA2? I'd love for it to be close enough that I don't have to worry much about it, but from what I hear AMD is still significantly ahead.


> And even then, the HIP C++ compiler library is bundled as part of Orochi instead of being part of the app. This means that your app using Orochi will not run on a future GPU gen unless it's updated against a newer Orochi runtime.

Ugh. Leave it to AMD to make something that technically works but is an absolute nightmare.

IIRC this machine code nonsense is also the reason that GPU support is such an issue for AMD: to 'support' a chip, they need to bake binaries for that chip in all libraries. So to enable RDNA1, they'd need to ship RDNA1 code in all their libraries, which would make the install size balloon to crazy levels. At least Intel got it right.

I do believe that running oneAPI on AMD is possible, but it still needs HIP/ROCm? Wonder if it would be possible to bake an L0 backend for AMD that just uses SPIR-V like the Intel stuff does, side-stepping this issue entirely.

Frankly I wish AMD and Intel just started working together more on this stuff. Both of them stand to gain from a cross-vendor standard that works well.
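
To make that concrete, here is a minimal, standard SYCL 2020 sketch (nothing vendor-specific; whether it runs from a portable SPIR-V module or a pre-baked per-gfx code object is entirely a property of the backend underneath):

    #include <sycl/sycl.hpp>
    #include <vector>
    #include <cstdio>

    int main() {
      sycl::queue q;  // default selector: whatever device/backend the runtime finds
      std::vector<float> a(1024, 1.0f), b(1024, 2.0f), c(1024, 0.0f);
      {
        sycl::buffer A(a), B(b), C(c);
        q.submit([&](sycl::handler& h) {
          sycl::accessor pa(A, h, sycl::read_only);
          sycl::accessor pb(B, h, sycl::read_only);
          sycl::accessor pc(C, h, sycl::write_only);
          h.parallel_for(sycl::range<1>(a.size()),
                         [=](sycl::id<1> i) { pc[i] = pa[i] + pb[i]; });
        });
      }  // buffer destruction copies results back to c
      printf("c[0] = %f\n", c[0]);  // expect 3.0
      return 0;
    }

On Intel's Level Zero backend the kernel ships as SPIR-V and gets JIT-compiled for whatever device shows up; the hypothetical AMD L0-style backend discussed above would let the same binary do that on Radeon, instead of carrying a separate code object for every supported gfx arch.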


> many algos currently seem bottlenecked by how fast you can load data into your GFX card

All of them are, but that's a PCIe issue, and horizontal scaling doesn't fix it (unless you use NVLink or similar, AFAIK, but then you run into the fact that current horizontal-scaling schemes aren't very effective at increasing model accuracy anyway).
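
Roughly, with ballpark figures (order-of-magnitude, not spec-sheet values):

    \text{PCIe 4.0 x16} \approx 32\ \text{GB/s} \;\ll\; \text{NVLink (A100)} \approx 600\ \text{GB/s} \;\ll\; \text{on-package HBM2e} \approx 2\ \text{TB/s}

so however many cards you add, the host link stays the narrow end of the funnel.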

> It's difficult to say what "most applications are"

Nevermind "most applications"--so far, all I've heard is one, that being the absolute bleeding edge of RNN research, assuming you're using a huge softmax instead of an alternative.

My point remains: Multi-GPU is way down the list when it comes to features a DL framework should have. Because very very few people need it.

kajecounterhack, do you use multi-GPU for your DL work? If so, how often?


> The CPUs are also using the previous-gen graphics architecture, RDNA2

The faster graphics are reserved for the APUs; the graphics here are just for basic display support.


> because DLSS 2.0 is the big differentiator

Is it really, though? Consoles have been doing upscaling without it for years, and one has to assume they're still going to be innovating on that front on RDNA 2.0 with the new generation, too.

The DLSS 2.0 mode where it's used as a super-sampling replacement is kinda interesting, but realistically TXAA in most engines is also pretty solid. It seems like a fairly minor extra feature at the end of the day as a result... Cool, but not game changing at all.

EDIT: AMD did mention something called "Super Resolution" as part of their FidelityFX suite, which sounds like a DLSS competitor, but there are no real details yet. And of course the actual image results are what matter most.


>the equivalent PC GPUs will be available too.

And RDNA2 should have scary performance, on the GPU power budgets that are standard these days.


> AMD diverged their GPU architecture development into separate CDNA and RDNA lines specialized for compute and graphics respectively.

Ooooh, is that why the consumer cards don't do compute? Yikes, I was hoping that was just a bit of misguided segmentation but this sounds like a high level architecture problem, like an interstate without an on ramp. Oof.


> I don't think these caches are entirely /not worth discussing/

Hmmm... NVidia's approach is clearly to just punch through the memory-bandwidth problem with GDDR6X (2 bits per clock tick, since it uses 4-level PAM4 encoding).

That's the thing, NVidia isn't really pushing memory bandwidth or size limits on the L2 or even L1 cache IMO. Even the 128kB L1$ per SM is only roughly the size of the SM's register space. Their most interesting move really is GDDR6x, which is the brute-force way to solve that problem.

--------

AMD's "Infinity Cache" on RDNA2 is 128MB of L3$, but AMD is using only standard GDDR6 (1-bit per clock tick transferred). AMD's RDNA2 is very strange: L0, L1, L2, and L3 caches, when AMD GCN was just L1 and L2 layers of cache.

That "infinity cache" is worth talking about I guess... its large enough to be relevant in a number of gaming situations.

------

I guess AMD and NVidia are both using HBM at the high end for 1TBps to 2TBps bandwidths. But those chips aren't in the consumer realm anymore. The ultimate brute force solution: spend more money.

You're right in that the L1 and L2 caches (and L0 and L3 caches of AMD) probably do affect performance in real ways.


> This could work if AMD invested a ton in software like Intel has with OneAPI.

Not my area of expertise, but I get the same feeling regarding their GPUs. Everything is built using CUDA, so AMD is out for ML/AI, etc.


> The sad part is that, by all appearances, the goal of this work was not to add functionality for NVIDIA GPUs in particular.

I don't see how that can be true given that it is designed to speed up ML, and only nVidia graphics cards are used for ML. Approximately nobody uses Intel or AMD.

Maybe they meant it wasn't written by an nVidia employee? But it's clearly intended to be used with nVidia GPUs in particular.


> However, I'm suspicious that they implemented an entirely different algorithm on the FPGA, and didn't measure the performance of that algorithm on GPUs.

I agree this was a bit suspicious. It may be that the different algorithm they used for the FPGA would also have done well on a GPU -- or, perhaps more likely, that if they had spent a similar amount of effort rethinking the algorithm just for the GPU, ending up with a third, GPU-specialised approach, it would have done dramatically better.

Pragmatically, it seems like they chose the GPU for their application anyway - so they had already decided the GPU was the overall winner without needing to improve it.


> but the 80GB A100s server GPUs definitely are

I'm sure LLMs were considered, like many other ML use cases, but that the A100 was intended specifically for LLMs? I'm not so sure.

The A100 was released the same year as GPT-3, and it wasn't until GPT-3 went live that people really started paying attention. And designing and producing a GPU surely takes longer than a couple of months.


> I feel like AMD for some reason doesn't understand that support for GPGPU on the consumer stack is important to eventually getting adoption in the enterprise space.

I made an attempt a few years ago at getting into the machine learning world and discovered very quickly that I was facing a choice: either AMD and open-source drivers, or Nvidia and cutting-edge software. Learning the algorithms on my own PC just didn't seem feasible with an AMD GPU: running them on the CPU was too slow, and I never found a way to make things work on the GPU with the resources I could locate. I gave up on AI, since open-source software is more important to me.

Maybe I just lacked the right mindset, but the difficulties were so extreme it seems plausible that most people with an AMD GPU never learned to use ROCm at the amateur level and there wasn't a sufficiently good strategy to get it adopted at the professional level without an amateur community.

Low-level GPU programming is really hard: OpenGL, CUDA, OpenCL, whatever. I'm still on the lookout for an introductory resource on how I'm meant to use an AMD GPU to multiply 2 matrices together. I literally can't find one with a 5-minute Google search; I get a blank article [0]. Maybe it is just blank for me? For OpenGL I can just go play around [1]. There needs to be a strong community of advanced people explaining what to do, which means big amateur support.

So, in short, I strongly agree. I think AMD's (poor) support of the consumer stack for GPGPU hurts them a lot more than they realise. OpenCL doesn't seem to cut it. The lack of libraries and community indicates that the amateur -> professional feeder pipeline is broken and nobody is learning to use their platform, even for fun.

[0] https://developer.amd.com/article/a-heterogeneous-accelerate...

[1] https://www.shadertoy.com/
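
(For anyone stuck at the same point: below is roughly the minimal HIP version, a naive sketch that assumes ROCm/hipcc is installed and a supported card; real code would just call rocBLAS.)

    #include <hip/hip_runtime.h>
    #include <vector>
    #include <cstdio>

    // Naive N x N matrix multiply: each thread computes one element of C = A * B.
    __global__ void matmul(const float* A, const float* B, float* C, int N) {
      int row = blockIdx.y * blockDim.y + threadIdx.y;
      int col = blockIdx.x * blockDim.x + threadIdx.x;
      if (row < N && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < N; ++k) acc += A[row * N + k] * B[k * N + col];
        C[row * N + col] = acc;
      }
    }

    int main() {
      const int N = 256;
      const size_t bytes = N * N * sizeof(float);
      std::vector<float> hA(N * N, 1.0f), hB(N * N, 2.0f), hC(N * N);
      float *dA, *dB, *dC;
      hipMalloc((void**)&dA, bytes);
      hipMalloc((void**)&dB, bytes);
      hipMalloc((void**)&dC, bytes);
      hipMemcpy(dA, hA.data(), bytes, hipMemcpyHostToDevice);
      hipMemcpy(dB, hB.data(), bytes, hipMemcpyHostToDevice);
      dim3 block(16, 16), grid((N + 15) / 16, (N + 15) / 16);
      matmul<<<grid, block>>>(dA, dB, dC, N);  // hipcc accepts the CUDA-style launch
      hipMemcpy(hC.data(), dC, bytes, hipMemcpyDeviceToHost);
      printf("C[0][0] = %f (expect %d)\n", hC[0], 2 * N);  // 1*2 summed N times
      hipFree(dA); hipFree(dB); hipFree(dC);
      return 0;
    }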


> So it's not like they had to start absolutely from scratch here

No, and actually in some respects that's not a good thing either. Their existing iGPU driver was designed with the assumption that GPU and CPU memory are pretty much fungible, and that the CPU is pretty "close" in terms of latency, just across the ringbus. It wasn't even PCIe-attached, like how AMD does it; it was directly on the ringbus like another core.

Now you need to take that legacy codebase and refactor it to have a conception of where data lives and how computation needs to proceed in order to feed the GPU with the minimum number of trips across the bus. Is that easier than writing a clean driver from scratch, and pulling in specific bits that you need? ....

One of their recent raytracing bugs was literally down to one line of code in an allocator that was missing a flag to allocate the space in GPU memory instead of CPU memory; a one-line change produced a 100x speedup in raytracing performance.

https://www.phoronix.com/news/Intel-Vulkan-RT-100x-Improve
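
(Not Intel's actual code, just a generic Vulkan sketch of why a single flag matters: on an iGPU every memory type is effectively local, so forgetting to ask for DEVICE_LOCAL costs nothing, while on a dGPU the same omission parks the data in system RAM on the far side of the bus.)

    #include <vulkan/vulkan.h>
    #include <stdexcept>

    // Pick a memory type index for an allocation that should live in VRAM.
    uint32_t pickMemoryType(VkPhysicalDevice phys, uint32_t typeBits,
                            VkMemoryPropertyFlags wanted) {
      VkPhysicalDeviceMemoryProperties props;
      vkGetPhysicalDeviceMemoryProperties(phys, &props);
      for (uint32_t i = 0; i < props.memoryTypeCount; ++i) {
        if ((typeBits & (1u << i)) &&
            (props.memoryTypes[i].propertyFlags & wanted) == wanted)
          return i;
      }
      throw std::runtime_error("no suitable memory type");
    }

    // e.g. for acceleration-structure storage you want
    //   pickMemoryType(phys, reqs.memoryTypeBits, VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT);
    // asking only for VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT instead means every ray
    // traversal reads the structure across the bus.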

It is most likely much easier to do what AMD did and go from discrete to integrated than the other way around... again, they don't have a tightly-coupled design like Intel's; their iGPU is literally just a PCIe client that happens to be on the same die.

(Also, AMD paid the penalty like 15 years ago: there were Terascale-based APUs, and the GCN driver was developed as a dGPU/iGPU hybrid architecture from day 1. They never had to backport GCN itself from dGPU; that was all done in the Terascale days.)


> AMD GPUs have near zero support in major ML frameworks. There are some things coming out with Rocm and other niche things, but most people in ML already have enough work dealing with model and framework problems that using experimental AMD support is probably a no go.

_If_ AMD has made a sufficiently powerful GPU, that will add a lot of incentive for ML frameworks to support it. But it's going to have to be a big difference, I imagine.

Given how active AMD is in open source work, I'm a little surprised they haven't been throwing developers at ML frameworks.


> It doesn't beat RTX 4090 when it comes to actual LLM inference speed

Sure, whisper.cpp is not an LLM. The 4090 can't even do inference at all on anything over 24GB, while ASi can chug through it even if slightly slower.

I wonder if with https://github.com/tinygrad/open-gpu-kernel-modules (the 4090 P2P patches) it might become a lot faster to split a too-large model across multiple 4090s and still outperform ASi (at least until someone at Apple does an MLX LLM).


> (it is weirdly difficult to get a good tutorial on how to do matrix multiplication on an AMD GPU; every so often I look for one and have I think literally never found an example).

There are some blogs on GPUOpen: MFMA on MI100/200 https://gpuopen.com/learn/amd-lab-notes/amd-lab-notes-matrix...

WMMA on Navi3 https://gpuopen.com/learn/wmma_on_rdna3/

