For those who don't want to jump to the Twitter feed: we should expect a follow-up, since there should be a push that addresses the data loss and lets people attempt speed enhancements without losing data.
Good, wholesome, internet collaboration right here.....
Articles like this are my very favorite kind. The scientific approach combined with the never-ending quest for more efficient, higher-performance software!
This article had me all excited until they got to the benchmarks. Hopefully these are just some design deficiencies and not something terrible, like running BPF bytecode being just plain slower than a couple of context switches.
I really really wanted to prove SOCKMAP rocks. Oh, well. Not just yet. In one of the linked videos the SOCKMAP authors say they didn't optimize part of the code. So there definitely is plenty of room.
I don't see any fundamental reason why SOCKMAP wouldn't be fastest. It's just a matter of putting an effort into it.
For example, we noticed the poor splice(2) performance is due to a spinlock. We need to figure out just why this happens, but it doesn't seem like a major problem. Probably just a trivial regression.
I really appreciate the effort you put into doing the benchmarks. Reading the first half of your article I was excited but also slightly dreading that I'd have to go to the trouble to create and run the benchmarks to see if it was worth the effort.
I see the opinion, but the kernel developer in me thinks that perhaps it would be better for everyone if the answer to such narrow, performance-focused use cases was not "let's develop an exotic kernel interface that doesn't really work" but rather "go buy hardware or do it in userland".
Any reason this article completely skipped over userspace tools like DPDK and Netmap (there are many others), considering Cloudflare uses a custom version of nginx on the DPDK stack?
Apples and oranges. The most important problem with DPDK and Netmap is that they require a dedicated network card. We don't have a network card to spare for each of our applications. Read on more:
Also, using DPDK or Netmap kills the usual tooling (starting with basics like tcpdump). We very much like iptables, XDP, conntrack, SYN cookies and other technologies deeply embedded in the Linux kernel. Doing DDoS mitigation once on the Linux kernel is hard enough. We don't want to redo the logic for each of the possible kernel bypass technologies:
XDP is sort of that for the network subsystem. (Though it doesn't allow you to implement drivers, "just" protocols.) Next step could be filesystem stuff, then libvirt drivers, etc.
BPF is not that versatile, but at least it has a stable ABI.
I guess you're getting downvotes because you posted this without explanation, but wasm actually makes a lot of sense for safer kernel extensions, as it allows for sandboxing code running as ring 0.
Not if that’s the write syscall, since it can return with fewer bytes than requested having been written. The full C code detects this case and crashes, which looks totally wrong.
The Python-ish example should, at the very least, use sendall. But even that is potentially suboptimal.
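To illustrate the short-write point, here is a minimal sketch of the loop that `sendall` performs for you, assuming a blocking stream socket. The names `send_all` and `echo_demo` are made up for this example and are not from the article.

```python
import socket


def send_all(sock: socket.socket, data: bytes) -> int:
    """Call send() repeatedly until every byte has been handed to the kernel."""
    view = memoryview(data)
    total = 0
    while total < len(view):
        # send() may accept fewer bytes than offered; continue from the offset.
        sent = sock.send(view[total:])
        if sent == 0:
            raise ConnectionError("socket connection broken")
        total += sent
    return total


def echo_demo(payload: bytes) -> bytes:
    """Push payload through a local socket pair and read it back.

    Assumption: payload is small enough to fit in the socket buffer,
    because nothing reads from the peer while we are still writing.
    """
    a, b = socket.socketpair()
    try:
        send_all(a, payload)
        a.shutdown(socket.SHUT_WR)  # signal EOF to the reader
        chunks = []
        while True:
            chunk = b.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
        return b"".join(chunks)
    finally:
        a.close()
        b.close()
```

A short write is treated here as a normal condition to retry, not as an error to crash on. In real code `sock.sendall(data)` does the same job.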
How is backpressure handled in the SOCKMAP solution? The description sounds like it just attaches all incoming buffers to the outgoing side of the socket, irrespective of how much data is already buffered. That could be bad if it means the memory usage is unbounded.
The other solutions definitely all would block if the send buffer is full.
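A rough sketch of what "the send buffer is full" looks like from userspace, assuming a local socket pair whose peer never reads. `fill_send_buffer` is a hypothetical helper; non-blocking mode is used only so the demo returns instead of sleeping in `send()` the way a blocking socket would.

```python
import socket


def fill_send_buffer(chunk_size: int = 4096) -> int:
    """Write to a socket whose peer never reads.

    Returns the number of bytes the kernel accepted before refusing more.
    """
    writer, reader = socket.socketpair()
    writer.setblocking(False)
    queued = 0
    try:
        while True:
            try:
                queued += writer.send(b"\0" * chunk_size)
            except BlockingIOError:
                # Buffer is full: this is the kernel applying backpressure.
                return queued
    finally:
        writer.close()
        reader.close()
```

The returned figure depends on the kernel's buffer settings, so it is not shown here. The point is only that the amount is bounded, which is the property the question above asks about for SOCKMAP.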
Apart from that question it was interesting to learn that io_submit actually works on sockets too. I definitely need to read more about this one.
Thanks for all the articles! I read the one about io_submit in the meantime. I had hoped it would be as async as the API suggests, so it seems a bit disappointing for my use case. If I schedule multiple writes via io_submit, do you know whether it would block until all sockets have written something, or only until at least one has?
I guess the fact that one has to guess and experiment to find out how those APIs behave is what makes me most nervous, and it would keep me from using them. An async IO system should be very well-behaved and not block at arbitrary points.
https://twitter.com/majek04/status/1097485987054346240