
Fixed large BAR exists in some older accelerator cards, e.g. IIRC the MI50/MI60 from AMD (the data center variants of the Radeon VII, the first GPUs with PCIe 4.0, also famous for dominating memory bandwidth until the RTX 40-series took that claim back: 16GB of HBM2 delivering 1TB/s of memory bandwidth).

It's notably not compatible with some legacy boot processes and, IIRC, with 32-bit kernels in general, so consumer cards had to wait for resizable BAR to get the benefits of large BAR. The main benefit is a direct, flat memory mapping of VRAM, so CPUs and PCIe peers can read and write all of VRAM directly, without dancing through a command interface with doorbell registers. AFAIK it also lets a GPU talk directly to NICs and NVMe drives by running the driver in GPU code (I'm not sure how/if they let you properly interact with doorbell registers, but polled io_uring as an ABI would be no problem, and I wouldn't be surprised if some NIC firmware already allows offloading this).
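
To make the flat-mapping part concrete, here's a minimal sketch of what "CPUs can directly read and write VRAM" looks like from userspace on Linux. It assumes root, a kernel that exposes the BAR via sysfs, and a purely hypothetical device at 0000:03:00.0; nothing about it is specific to any vendor's driver:

  /* sketch: mmap a PCI device's BAR0 via sysfs and peek at it.
     The device address below is an example; find yours with lspci -D. */
  #include <fcntl.h>
  #include <inttypes.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <sys/mman.h>
  #include <unistd.h>

  int main(void)
  {
      const char *path = "/sys/bus/pci/devices/0000:03:00.0/resource0";
      int fd = open(path, O_RDWR | O_SYNC);
      if (fd < 0) { perror("open"); return 1; }

      size_t len = 4096;  /* map one page; with large BAR this window could cover all of VRAM */
      volatile uint32_t *bar = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                    MAP_SHARED, fd, 0);
      if (bar == MAP_FAILED) { perror("mmap"); return 1; }

      printf("first dword of BAR0: 0x%08" PRIx32 "\n", bar[0]);

      munmap((void *)bar, len);
      close(fd);
      return 0;
  }

Peer-to-peer is the same idea, except it's another device's DMA engine that gets handed a bus address inside that window instead of the CPU touching it.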




PCIe resizable BAR also works only on Intel hardware, despite being a standard PCIe feature that's been available for ~5 years now (since Zen 1).

PCI devices have always been able to read and write to the shared address space (subject to IOMMU); most frequently used for DMA to system RAM, but not limited to it.

So, poking around to configure the device to put the whole VRAM in the address space is reasonable, subject to support for resizable BAR or just having a fixed BAR that's large enough. And telling one card to read/write from an address that happens to be mapped to a different card's VRAM is also reasonable.

I'd be interested to know if PCIe switching capacity will be a bottleneck, or if it'll just be the point-to-point links and VRAM that bottleneck. Saving a bounce through system RAM should help in either case, though.


In PCI, BAR is Base Address Register, a register in the PCI device's configuration space which defines where in the machine's physical memory address space that particular window of memory and/or I/O will be mapped (a single device can have several BARs; for instance, a simple graphics card could have one for its control registers and one for the framebuffer). So "BAR space" would be shorthand for "the region of the physical memory address space which can be used to map the PCI devices' memory through their Base Address Registers". The size of this region is limited, and graphics cards in particular tend to have somewhat large BARs.

(See for yourself in your machine: run "lspci -v", the lines starting with "Memory at ..." or "I/O ports at ..." are the BARs.)
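
If you're curious where those "Memory at ..." lines come from, here's a rough sketch (Linux sysfs, hypothetical device address 0000:03:00.0 again) that reads BAR0 straight out of configuration space and decodes the low bits the way lspci does:

  /* sketch: read and decode BAR0 from PCI configuration space.
     BAR0 sits at offset 0x10; bit 0 = I/O vs memory, bits 2:1 = 32/64-bit,
     bit 3 = prefetchable. Device address is just an example. */
  #include <fcntl.h>
  #include <inttypes.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
      const char *cfg = "/sys/bus/pci/devices/0000:03:00.0/config";
      int fd = open(cfg, O_RDONLY);
      if (fd < 0) { perror("open"); return 1; }

      uint32_t bar0;
      if (pread(fd, &bar0, sizeof bar0, 0x10) != sizeof bar0) {
          perror("pread");
          return 1;
      }

      if (bar0 & 0x1) {
          printf("BAR0: I/O ports at 0x%" PRIx32 "\n", bar0 & ~0x3u);
      } else {
          int is64 = ((bar0 >> 1) & 0x3) == 0x2;   /* type field: 10b = 64-bit */
          printf("BAR0: memory at 0x%" PRIx32 " (%s, %sprefetchable)\n",
                 bar0 & ~0xfu, is64 ? "64-bit" : "32-bit",
                 (bar0 & 0x8) ? "" : "non-");
      }
      close(fd);
      return 0;
  }

For a 64-bit BAR the upper half of the address lives in the next register, and the size is discovered by writing all-ones and reading back which bits stick, which is why the kernel owns this dance and you normally just read lspci's output.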


Fascinating stuff, thanks for the details! I had no idea that they made PCIe accelerator based configurations now.

I don't think that really fits together. PCIe (and PCI) hotplug has existed for a while and topology changes aren't new either. ExpressCard for example has done this, as has PCMCIA. Older RDMA buses did this too, as do backplane-based industrial PCs, of which there are a really large number.

I suspect that the end-user smoothness based on 'the user is not required to know everything', the kind that makes the likes of Apple implement bus pausing for dynamic topology assignment and BAR adjustment, is not available in Linux land because there simply isn't a big enough overlap of people to make this a hot topic.

You need:

  1. A person who understands how this works
  2. A person who understands what they want to use it for
  3. A person who understands what person 1 has to do so person 2 can use it
Usually you get 1 or 2, sometimes both, but almost never 3 except at places like Canonical, Red Hat, SUSE etc., because it's too much of an analyst role and not enough of an "I need it for myself and I can build it" role.

Similar but different problems exist in other software areas, like when person "3" is split into "business", "end-user" and "licensing" with competing interests. That's where you get the NT kernel, which gets split into arbitrary partitions where, based on an integer in the configuration, it may or may not be willing to address your RAM.

Same goes for the old "aperture size" for GPU memory transfers, and later on BAR resize support. It was never 'hard' to implement, it's just that IBVs didn't bother and mainboard manufacturers didn't care. Yet it was always available, and even TianoCore EDK2, Apple's own EFI, and (for some reason) the BIOS and UEFI from Quanta and Supermicro all supported it just fine. Same goes for KMS and non-blinking GPU switchovers, where AMD, Nvidia and Intel used to constantly sell it as 'impossible' and we all just accepted that. Yet KMS, MobileFramebuffer (and the old AppleTV Gen 1), and even VesaFB showed that it's totally possible, and it's just everyone using the same joke sample implementation from the vendor that's causing it.

Another example would be VESA, where DisplayPort topology changes on the control channel side do similar things to PCIe bus pausing. The display controllers and bus drivers should pause on hot plug to let the host decide on the new topology, but implementing that costs time and effort, and you need some in-depth knowledge on both the hardware and software side, so lots of companies don't bother. Result: some host+display combinations only work after restarting either end to force it to re-discover the current topology. This even happens in 1:1 topology scenarios, where a simple GPU driver update might restart the host bus and the display ignores it and simply stops receiving data until the watchdog timer restarts the embedded processor, causing the screen to blink. It's just dumb low-quality choices and corner-cutting that cause this.


PCIe devices like GPUs can access system memory.

Integrated GPUs also access system memory via the same bus as the CPU.

It’s not really a new technique. Apple just shipped a highly integrated unit with large memory bandwidth.


Didn't stop AMD from disabling PCI Express Resizable BAR support, hidden behind the marketing-wank "Smart Access Memory (TM)" name, on older "chipsets" despite both the memory and PCIe controllers being built INTO THE CPU ....

https://www.amd.com/en/technologies/smart-access-memory

Or that time X470 was going to support PCIe 4.0, but then it was made X570-exclusive.


PCIe does this in hardware

I could never get a solid answer whether that was presented as memory to the GPU or just as a PCIe switch with NVMe drives hanging off one side and the GPU on another.

I think most (if not all) modern PCIe GPUs still support this. If I am not mistaken, this is what is preventing them from working on ARM right now (Raspberry Pi 4, Apple M1, etc.), because the drivers expect specific BIOS features.

There's also the I/O ports (PCI has three address spaces: configuration, memory, and I/O, with the I/O space being a legacy x86 thing which does not exist on ARM and most other CPU architectures), which some cards might require; and the PCI address window (BAR space), which might be too small for some cards (usually GPUs; see for instance https://www.jeffgeerling.com/blog/2020/external-gpus-and-ras... for a couple of examples where it was too small).
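
If you want to check how big a window a given card actually asks for, the sysfs "resource" file lists one start/end/flags line per region. Here's a rough sketch (the device address is once more just an example) that prints the size of each BAR:

  /* sketch: print the size of each BAR from the sysfs "resource" file,
     which has one "start end flags" line per region; the first six lines
     correspond to BAR0-BAR5. Example device address. */
  #include <inttypes.h>
  #include <stdint.h>
  #include <stdio.h>

  int main(void)
  {
      const char *path = "/sys/bus/pci/devices/0000:03:00.0/resource";
      FILE *f = fopen(path, "r");
      if (!f) { perror("fopen"); return 1; }

      uint64_t start, end, flags;
      for (int i = 0; i < 6 && fscanf(f, "%" SCNx64 " %" SCNx64 " %" SCNx64,
                                      &start, &end, &flags) == 3; i++) {
          if (end > start)  /* unused BARs read as all zeros */
              printf("BAR %d: %" PRIu64 " bytes\n", i, end - start + 1);
      }
      fclose(f);
      return 0;
  }

On a board or SoC whose host bridge only decodes a small window, BARs bigger than that window simply fail to be assigned, which is roughly the failure mode described in the linked post.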

Historically, cases had a bracket at the front to support full-length cards. I even remember I once had a reference AMD card that had an odd extension so that it would be supported by the forward full-length brackets.

I have to admit I have not seen that front bracket for a long time. Some server chassis have a bar across the top to support large cards. This would be great, except graphics card manufacturers like to exceed the PCI spec for height; that bar had to be removed on my last two builds. Nowadays I just put my case horizontal and pray.


They prefer for the GPU cores to be doing math rather than displaying images. That's why you can get away with the GPU not being in a full-lane-width slot. Back then the little PCIe bandwidth did more than just send out an HDMI signal as well, as most of it went to SDI or even older analog component BNC outputs. No GPU I've ever seen has had those connectors.

With GPUDirect and active-wait kernels, you can get tight, controlled latency and saturate PCIe bandwidth without touching main memory. GPUDirect Storage if you need to write to (or read from) disk.

Not really my area of expertise by any means, but I got a bit excited by this talk[1]. The idea is that a PCIe switch can be used to send data directly from PCIe device to PCIe device without hitting the host controller. NICs are mentioned as a use case.

While the talk is centered around RISC-V, there's nothing RISC-V specific about this; it just enables wimpy cores to control multi-Gbps traffic.

[1]: https://www.youtube.com/watch?v=LDOlqgUZtHE (Accelerating Computational Storage Over NVMe with RISC V)


Why didn’t they release GPUs for those PCIe slots? I just don’t get why they couldn’t do a simple thing instead of AR/VR.

The fact that it sticks out on the side made me realize these Framework dongles are basically neo-PCMCIA cards lol

I wonder why there was never a standardized PCIe 'card' format? Seems like the perfect thing for extra storage with NVMe being so common these days.


>high-speed I/O such as PCIe

Back in the years we're talking about, that would be AGP (https://en.m.wikipedia.org/wiki/Accelerated_Graphics_Port).


I've been thinking about this a lot recently: PCIe slots. They need to be deprecated. We should be using mini-SAS-style breakout cables and reduce motherboard sizes.

It's particularly a problem with huge top-of-the-line GPUs like the 7900 XTX and the 4090. They are so long and so heavy that they sag. To work around it, we have kickstands and brackets added on the far end (opposite the external slot) to prop them up. Vertical mounts exist, but they use a very wide ribbon cable, the same width as the slot, that gets in the way of lots of stuff.

Why aren't we innovating here? Big GPUs are so big they will often block all the other slots on the board anyway. Manufacturers are shifting PCIe lanes to M.2 on platforms without many lanes. The slots need to go, or remain only for legacy use.

It'll help with things like this too. The 4060 is using a full slot that it doesn't need, so those wasted lanes would then be available for an M.2 card. IMHO all this should be modular, like a less polished USB-C/Thunderbolt interface. Mini-SAS comes to mind, but I know the server market is already pushing PCIe over breakout cables.

ATX feels so antiquated as a form factor right now.

