
With GPUDirect and active-wait kernels, you can get tightly controlled latency and saturate PCIe bandwidth without touching main memory. GPUDirect Storage if you need to write to (or read from) disk.
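A minimal sketch of the active-wait half of that, using a flag in pinned, device-mapped host memory as the doorbell. In a real GPUDirect setup the peer device would write the flag and the data directly; names and sizes here are illustrative.

    #include <cuda_runtime.h>

    __global__ void active_wait(volatile int *doorbell, const float *in, float *out, int n)
    {
        // One thread per block spins on the doorbell; the rest wait at the barrier.
        if (threadIdx.x == 0) {
            while (*doorbell == 0) { /* busy-wait on PCIe-visible host memory */ }
        }
        __syncthreads();
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i] * 2.0f;   // placeholder "work"
    }

    int main()
    {
        const int n = 1 << 20;
        int *flag_h, *flag_d;
        float *in_d, *out_d;

        cudaHostAlloc((void **)&flag_h, sizeof(int), cudaHostAllocMapped);  // pinned + mapped
        *flag_h = 0;
        cudaHostGetDevicePointer((void **)&flag_d, flag_h, 0);
        cudaMalloc((void **)&in_d,  n * sizeof(float));
        cudaMalloc((void **)&out_d, n * sizeof(float));

        // Launch once; the kernel idles on the doorbell instead of exiting.
        active_wait<<<(n + 255) / 256, 256>>>(flag_d, in_d, out_d, n);

        // ... producer fills in_d here (cudaMemcpy, or a P2P DMA from a NIC/NVMe) ...

        __sync_synchronize();   // full fence: data must be visible before the doorbell
        *flag_h = 1;            // ring the doorbell; the kernel starts processing

        cudaDeviceSynchronize();
        cudaFreeHost(flag_h);
        cudaFree(in_d);
        cudaFree(out_d);
        return 0;
    }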



Or use the PCIe mod and an InfiniBand card for low latency and high throughput.

This is not specific to dGPUs; it could apply to any PCIe device. Emphasis on "theoretically" too.

On the device (a dGPU here), it is possible to route memory accesses that target part of the internal address space to the PCIe controller. In turn, the PCIe controller can translate such a memory access into a PCIe request (read or write) in the separate PCIe address space, with some address translation.

This PCIe request goes to the PCIe host (the CPU in a dGPU scenario). Here too, the host PCIe controller can map the PCIe request, addressed in the PCIe address space, into the host address space. From there it can go to host memory (usually after IOMMU filtering and address translation). And all of this happens again in reverse for the return trip to the device in the case of a read.

So latency would be rather high, but it is technically possible. In most applications such transfers are offloaded to a DMA engine in the PCIe controller that copies between the PCIe and local address spaces, but a processing core can certainly do a direct access without DMA if all the address mappings are suitably configured.
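A small CUDA sketch of that last point: the same pinned host buffer can either be moved by the DMA engine with cudaMemcpy, or dereferenced directly over PCIe by the GPU's cores once it is mapped into the device address space (buffer size and the toy kernel are illustrative, not tuned).

    #include <cuda_runtime.h>

    __global__ void sum_direct(const float *host_mapped, float *result, int n)
    {
        // Every load here is issued by a GPU core straight across PCIe into
        // host memory: no DMA engine, no staging copy, but high per-access latency.
        float acc = 0.0f;
        for (int i = threadIdx.x; i < n; i += blockDim.x)
            acc += host_mapped[i];
        atomicAdd(result, acc);
    }

    int main()
    {
        const int n = 1 << 20;
        float *buf_h, *buf_dev_view, *staged_d, *result_d;

        // Pinned host buffer, mapped into the GPU's address space.
        cudaHostAlloc((void **)&buf_h, n * sizeof(float), cudaHostAllocMapped);
        cudaHostGetDevicePointer((void **)&buf_dev_view, buf_h, 0);
        for (int i = 0; i < n; i++) buf_h[i] = 1.0f;

        cudaMalloc((void **)&result_d, sizeof(float));
        cudaMemset(result_d, 0, sizeof(float));

        // Option A: the DMA engine does a bulk copy into device memory first.
        cudaMalloc((void **)&staged_d, n * sizeof(float));
        cudaMemcpy(staged_d, buf_h, n * sizeof(float), cudaMemcpyHostToDevice);

        // Option B: the cores dereference host memory directly, no copy at all.
        sum_direct<<<1, 256>>>(buf_dev_view, result_d, n);
        cudaDeviceSynchronize();

        float result_h = 0.0f;
        cudaMemcpy(&result_h, result_d, sizeof(float), cudaMemcpyDeviceToHost);

        cudaFreeHost(buf_h);
        cudaFree(staged_d);
        cudaFree(result_d);
        return 0;
    }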


PCI-Express bus sharing, and custom Radeon firmware - or it could use DMA to main memory and copy it in the background.

Not really my area of expertise by any means, but I got a bit excited by this talk[1]. The idea is that a PCIe switch can be used to send data directly from one PCIe device to another without hitting the host controller. NICs are mentioned as a use case.

While the talk is centered around RISC-V, there's nothing RISC-V specific about this; it just enables wimpy cores to control multi-Gbps traffic.

[1]: https://www.youtube.com/watch?v=LDOlqgUZtHE (Accelerating Computational Storage Over NVMe with RISC V)


Xilinx provides software drivers and IP for PCIe DMA and memory-mapped interfaces. These are fairly easy to integrate (probably not the best for latency, though - I've developed my own, but I had a specific use case: low latency without caring about bandwidth).
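For reference, the data path through Xilinx's XDMA driver ends up looking roughly like this from user space (device-node names follow the dma_ip_drivers convention; the card-side address and transfer size are placeholders for whatever your FPGA design expects).

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t len = 4096;
        const off_t card_addr = 0x0;            // AXI address inside the FPGA design
        uint8_t *buf = (uint8_t *)malloc(len);
        for (size_t i = 0; i < len; i++) buf[i] = (uint8_t)i;

        // Host-to-card channel 0; the driver's DMA engine moves the data.
        int h2c = open("/dev/xdma0_h2c_0", O_WRONLY);
        if (h2c < 0) { perror("open h2c"); return 1; }
        if (pwrite(h2c, buf, len, card_addr) != (ssize_t)len)
            perror("pwrite");

        // Card-to-host channel 0 reads it back as a quick sanity check.
        int c2h = open("/dev/xdma0_c2h_0", O_RDONLY);
        if (c2h < 0) { perror("open c2h"); return 1; }
        if (pread(c2h, buf, len, card_addr) != (ssize_t)len)
            perror("pread");

        close(h2c);
        close(c2h);
        free(buf);
        return 0;
    }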

PCIe latency is a few orders of magnitude lower than NAND flash read latency, so the extra round trip to the CPU's PCIe root complex doesn't matter.

And even better, you can get access to the other PCIe devices--in particular the SATA devices. Though it might be tough to talk to devices that the main processor is actively using. Even if you somehow disable interrupts on the device, the potential for conflicts (leading to data corruption) seems amazingly high.

The analog parts are the slow parts.

PCIe costs you less than a microsecond of latency. A good SSD has 60 microseconds of latency. You're not going to notice any difference from moving the controller.


Yes, through the PCIe bus, not directly. You don't want to have that latency. You want a unified model, like Intel GPUs that can access main memory, or the FPGA being another endpoint in AMD's Infinity Fabric architecture. That also exists in SoC FPGA boards, but not in the mid- or high-performance segments.

The issue is not lack of raw bandwidth, it's getting the hardware, software, drivers and OS "to do the right thing".

PCIe gives you the building blocks of posted and non-posted transactions but doesn't help you use them effectively. There is no coordinated or designated DMA subsystem to help move data between the root complex ("host") and end-point ("device").

So, if you have to design a new PCIe end-point (target, in original PCI terms) using an FPGA or ASIC, actually sustaining PCIe throughput in either "direction" isn't trivial.

Posted transactions ("writes") are 'fire and forget' and non-posted transactions ("reads") have a request/acknowledgement system, flow-control, etc.

If you can get your "system" to use ONLY posted writes (fire and forget) with a large enough MPS (Max Payload Size), usually >128 bytes, then you can get to 80%-95% of theoretical throughput (1).
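A back-of-the-envelope version of that claim, assuming roughly 24 bytes of combined TLP header, DLLP and framing overhead per posted write (the exact figure depends on the PCIe generation, 3DW vs 4DW headers, and whether ECRC is enabled).

    #include <cstdio>

    int main()
    {
        // Per-TLP overhead: ~12-16 B header + ~6 B sequence/LCRC + framing.
        const double overhead_bytes = 24.0;   // rough midpoint
        const int payloads[] = {64, 128, 256, 512};

        for (int p : payloads) {
            double efficiency = p / (p + overhead_bytes);
            printf("MPS %4d B -> ~%.0f%% of link bandwidth\n", p, 100.0 * efficiency);
        }
        return 0;
    }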

The real difficulty is when you need to do a PCIe 'read', which breaks down into a read request (MRd) and a Completion with Data (CplD). The 'read' results in a lot of back-and-forth traffic, and tracking the MRds/CplDs becomes a challenge (2).

Often an end-point can use 'posted writes' to blast data to the PCIe root complex (usually the CPU/host), maximizing throughput, since a host usually has hundreds of megabytes of RAM to use for buffers. Unfortunately, to transfer data from the root complex (host) to the end-point (device), the host usually has the device's DMA controller initiate a 'read' from the host's memory, which results in these split transactions, since end-points don't often carry hundreds of MB of RAM. This also means bespoke drivers, tying into the OS PCIe subsystems, and hopefully not losing any MSI-X interrupts.

To reiterate: in the modern "Intel way" the CPU houses the PCIe root complex but does not house ANY DMA controller. So getting "DMA" working means each PCIe end-point's implementation has some kind of DMA "controller", each different from every other end-point's, rather than Intel having spec'd out an "optional" centralized DMA controller in the root complex.

1: https://cdrdv2-public.intel.com/666650/an456-683541-666650.p...

2: https://www.intel.com/content/www/us/en/docs/programmable/68...
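To make that "every end-point rolls its own DMA controller" point concrete, the per-device plumbing usually boils down to some variation of a descriptor ring like the one below (a purely hypothetical layout; every vendor's differs).

    #include <stdint.h>

    // Hypothetical scatter-gather descriptor the end-point's DMA engine walks.
    // The device fetches these from host memory (non-posted reads), then issues
    // MRd requests for the payload and raises MSI-X when an entry completes.
    struct dma_descriptor {
        uint64_t host_addr;     // PCIe/bus address of the host buffer (IOMMU-mapped)
        uint64_t device_addr;   // target address in the end-point's local memory
        uint32_t length;        // transfer size in bytes
        uint32_t flags;         // direction, interrupt-on-completion, chain bit...
    };

    struct dma_ring {
        struct dma_descriptor desc[256]; // lives in host memory, IOMMU-mapped
        uint32_t head;                   // driver writes new descriptors here
        uint32_t tail;                   // device advances this as transfers complete
    };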


PCIe devices like GPUs can access system memory.

Integrated GPUs also access system memory via the same bus as the CPU.

It’s not really a new technique. Apple just shipped a highly integrated unit with large memory bandwidth.


Given that PCIe allows data to be piped directly from one device to another without going through the host CPU[1][2], I guess it might make sense to just have the GPU read blocks straight from the NVMe (or even NVMe-of[3]) rather than having the CPU do a lot of work.

[1]: https://nvmexpress.org/wp-content/uploads/Enabling-the-NVMe-...

[2]: https://lwn.net/Articles/767281/

[3]: https://www.nvmexpress.org/wp-content/uploads/NVMe_Over_Fabr...
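For what it's worth, that is roughly what NVIDIA's cuFile (GPUDirect Storage) API exposes. A hedged sketch; the file path and sizes are placeholders, and whether the transfer is truly peer-to-peer or staged through a CPU bounce buffer depends on the platform and driver support.

    #include <cuda_runtime.h>
    #include <cufile.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main()
    {
        const size_t len = 1 << 20;            // 1 MiB read, placeholder
        void *gpu_buf;
        cudaMalloc(&gpu_buf, len);

        cuFileDriverOpen();

        int fd = open("/data/blob.bin", O_RDONLY | O_DIRECT);  // placeholder path
        CUfileDescr_t descr = {};
        descr.handle.fd = fd;
        descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;

        CUfileHandle_t handle;
        cuFileHandleRegister(&handle, &descr);
        cuFileBufRegister(gpu_buf, len, 0);    // optional: pins the GPU buffer

        // Read file offset 0 straight into GPU memory; with a capable
        // NVMe + switch topology this never lands in host RAM.
        ssize_t n = cuFileRead(handle, gpu_buf, len, 0, 0);
        (void)n;

        cuFileBufDeregister(gpu_buf);
        cuFileHandleDeregister(handle);
        close(fd);
        cuFileDriverClose();
        cudaFree(gpu_buf);
        return 0;
    }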


PCIe provides communication but isn't intended to provide memory coherency. There's a lot of work that goes on in figuring out which cache(s) have a copy of which cache line and figuring out how to resolve conflicting access needs.

With a PCIe switch that has the NVMe drives and the GPU behind it, the drives are actually presented straight to Windows, even.

https://www.youtube.com/watch?v=-fEjoJO4lEM

In principle they could be used with an API like DirectStorage or CUDA GPUDirect RDMA (which dates back to Kepler), and in that case they would never need to talk to the CPU, given appropriate software support. But it's not going to be presented as GPU memory ever; it's going to work like a block storage device you can do RDMA requests against, most likely.

https://docs.nvidia.com/cuda/gpudirect-rdma/

https://developer.download.nvidia.com/video/gputechconf/gtc/...

Now again, like others I'm not exactly clear why there isn't better support for exactly these block-storage RDMA use cases in ML training... I hear things like "ChatGPT was trained on a cluster of A100s with InfiniBand RDMA interconnects to allow larger model size" and I don't see why RDMA to another GPU is fundamentally different from RDMA to a block-storage device. Why can't I have a very large model that lives on, say, Optane 905P PCIe drives, with the GPU working on a portion at a time and doing 1 GB blits of the model? Why is that fundamentally different from when a GPU needs to pull something from VRAM on another GPU? Yeah, it's slower than onboard VRAM, but so is going across a PCIe link to your InfiniBand card - that's usually PCIe 2.0/3.0/4.0 x8 per PHY.


Agree with you, but a quick clarification: PCIe IOMMUs exist now, so a PCIe device doesn't automatically get unrestricted DMA access to main memory.

The whole "map the PCIE device into userspace process memory" thing is called DPDK (https://www.dpdk.org/)

My understanding is that PCIe devices incur a very high latency for getting data to and from them, due to the PCIe round trip itself and the cost of setting up a DMA descriptor. Sadly, this article doesn't look into that.

Can someone comment on what sort of latency would be involved in something like this? For example, the latency to send data to the PCIe device, the latency to react to and manipulate the data in a warp (NVIDIA?), and the latency to send the data back to the host system?
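One way to put a number on it for a given system, rather than trusting anyone's figure: a ping-pong through pinned, device-mapped host memory measures the host-to-GPU-and-back round trip over PCIe. A rough sketch; it keeps one kernel resident so kernel-launch overhead is excluded from the measurement.

    #include <chrono>
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void pong(volatile int *ping, volatile int *pong_flag, int iters)
    {
        // Single thread: wait for the host's ping, answer with a pong, repeat.
        for (int i = 1; i <= iters; i++) {
            while (*ping != i) { }
            *pong_flag = i;
            __threadfence_system();
        }
    }

    int main()
    {
        const int iters = 1000;
        volatile int *ping_h, *pong_h;
        int *ping_d, *pong_d;
        cudaHostAlloc((void **)&ping_h, sizeof(int), cudaHostAllocMapped);
        cudaHostAlloc((void **)&pong_h, sizeof(int), cudaHostAllocMapped);
        *ping_h = 0; *pong_h = 0;
        cudaHostGetDevicePointer((void **)&ping_d, (void *)ping_h, 0);
        cudaHostGetDevicePointer((void **)&pong_d, (void *)pong_h, 0);

        pong<<<1, 1>>>(ping_d, pong_d, iters);

        auto t0 = std::chrono::steady_clock::now();
        for (int i = 1; i <= iters; i++) {
            *ping_h = i;                  // host -> device over PCIe
            while (*pong_h != i) { }      // device -> host over PCIe
        }
        auto t1 = std::chrono::steady_clock::now();
        cudaDeviceSynchronize();

        double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
        printf("round trip: ~%.2f us\n", us / iters);
        return 0;
    }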


Hm, without a consumer equivalent to GPUDirect Storage (PCIe P2P DMA), this is not as cool as it could be. It still has to bounce to and from the CPU for no good reason.

I was under the impression that PCIe was perfectly capable of sending notifications from one device to another in a somewhat efficient manner. Having said that, this is not my area of expertise, and I do see that if your main concern is to feed the GPU, then blocking a thread might be the optimal solution. I assume that MSI would be too much overhead and might involve some context switching to service the interrupt from the kernel, etc., to allow for asynchronous completion? Also, is it possible to have overlapping memory regions between a high-speed networking card and the GPU's input buffer, which in effect means the CPU just has to tell the GPU to start reading once the network card is done receiving?
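On the overlapping-memory-regions question: that is roughly what GPUDirect RDMA gives you. A hedged sketch, assuming an NVIDIA GPU, a verbs-capable NIC, and the nvidia-peermem kernel module loaded (error handling and the actual receive path omitted).

    #include <cuda_runtime.h>
    #include <infiniband/verbs.h>

    int main()
    {
        // GPU buffer that the NIC will DMA received data into directly.
        void *gpu_buf;
        const size_t len = 1 << 20;
        cudaMalloc(&gpu_buf, len);

        // Open the first RDMA-capable device and a protection domain.
        int num = 0;
        struct ibv_device **devs = ibv_get_device_list(&num);
        if (!devs || num == 0) return 1;
        struct ibv_context *ctx = ibv_open_device(devs[0]);
        struct ibv_pd *pd = ibv_alloc_pd(ctx);

        // With nvidia-peermem loaded, ibv_reg_mr accepts the device pointer and
        // the NIC gets DMA-able PCIe addresses for it: incoming data lands in
        // VRAM, and the CPU only has to tell the GPU when the transfer is done.
        struct ibv_mr *mr = ibv_reg_mr(pd, gpu_buf, len,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_WRITE);

        // ... post receive work requests using mr->lkey, then poll/notify ...

        ibv_dereg_mr(mr);
        ibv_dealloc_pd(pd);
        ibv_close_device(ctx);
        ibv_free_device_list(devs);
        cudaFree(gpu_buf);
        return 0;
    }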

Having said that, I don't believe this is a major concern for most application developers - in cases where you flood the GPU with a firehose of data to compute on, you probably also don't care what other processes run on the machine, or whether your architectural decisions end up making people's laps uncomfortably hot. I also do not believe that the future of all I/O is just memcpy and atomics - we can already do that today, and it doesn't really bring any speed advantage in the general case. I think the future of I/O is memcpy, atomics, and a good signaling mechanism to flag I/O task completion without costly context switches and with as little extraneous memory allocation as possible. Moreover, the future of consumer computing will probably not rely on PCIe at all and will instead have the GPU and the CPU share all of their memory. And hey, maybe NVIDIA will add some cool ARM companion cores to their biggest chips, slap some DDR5 slots on their cards, and sell self-contained solutions, sidestepping PCIe entirely, at least for feeding data from the CPU to the GPU.

