SOCKMAP – TCP splicing of the future (blog.cloudflare.com)
125 points by jgrahamc | 2019-02-18 | 24 comments




Thomas Graf, a Cilium founder, made some comments on Twitter around this:

https://twitter.com/majek04/status/1097485987054346240


For those who don't want to jump to the Twitter thread: we should expect a follow-up, with a push to address the data loss and to allow further speed improvements without risking data loss.

Good, wholesome, internet collaboration right here.....


Articles like this are my favorite kind: the scientific approach combined with the never-ending quest for more efficient and higher-performing software!

I love everything about it!

<3


This article had me all excited until it got to the benchmarks. Hopefully it's just some design deficiencies and not something terrible, like running BPF bytecode simply being slower than a couple of context switches.

I really really wanted to prove SOCKMAP rocks. Oh, well. Not just yet. In one of the linked videos the SOCKMAP authors say they didn't optimize part of the code. So there definitely is plenty of room.

I don't see any fundamental reason why SOCKMAP wouldn't be the fastest. It's just a matter of putting the effort into it.

For example, we noticed the poor splice(2) performance is due to a spinlock. We need to figure out just why this happens, but it doesn't seem like a major problem. Probably just a trivial regression.
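
For readers who haven't seen the splice(2) pattern being benchmarked here: forwarding data between two TCP sockets has to go through a pipe, because at least one end of every splice() call must be a pipe. A minimal sketch, with a hypothetical helper name and error handling trimmed:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    /* Hypothetical helper: forward bytes from one connected socket to
     * another via splice(2). The pipe is only a kernel-side buffer; the
     * payload never has to be copied into userspace. */
    int splice_forward(int sd_in, int sd_out)
    {
        int pfd[2];
        if (pipe(pfd) < 0)
            return -1;

        for (;;) {
            /* Socket -> pipe. */
            ssize_t n = splice(sd_in, NULL, pfd[1], NULL, 65536,
                               SPLICE_F_MOVE | SPLICE_F_MORE);
            if (n <= 0)
                break;

            /* Pipe -> socket: drain everything just spliced in. */
            while (n > 0) {
                ssize_t m = splice(pfd[0], NULL, sd_out, NULL, n,
                                   SPLICE_F_MOVE | SPLICE_F_MORE);
                if (m <= 0)
                    goto out;
                n -= m;
            }
        }
    out:
        close(pfd[0]);
        close(pfd[1]);
        return 0;
    }

The appeal is that the byte shuffling stays entirely in the kernel; userspace only drives the loop.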


I really appreciate the effort you put into the benchmarks. Reading the first half of your article I was excited, but also slightly dreading that I'd have to go to the trouble of creating and running the benchmarks myself to see if it was worth the effort.

For optimising things like this I can recommend using ARM's ITM tracing.

Then you can get a view of every clock cycle and what is going on in the CPU, the caches, the hardware peripherals, the DMA controller etc.

You can start from a goal like "I want a TCP packet forwarded in under 5k clock cycles" and keep hacking code till you've met it.

Other software-based tracing tools don't tend to be as good, and they frequently rely on statistical techniques which hide what is really going on.


I'm surprised that for something this simple there isn't an easy-to-use approach through netmap and friends.

We specifically want to splice userspace TCP sockets. The unix way. I'm also not a fan of custom TCP/IP stacks.

https://blog.cloudflare.com/why-we-use-the-linux-kernels-tcp...


I see the opinion, but the kernel developer in me thinks that perhaps it would be better for everyone if the answer to such narrow, performance-focused use cases was not "let's develop an exotic kernel interface that doesn't really work" but rather "go buy hardware or do it in userland".

Any reason this article completely skipped over userspace tools like DPDK and Netmap (there are many others)? Especially considering Cloudflare uses a custom version of nginx with the DPDK stack?

Apples and oranges. The most important problem with DPDK and Netmap is that they require a dedicated network card. We don't have a network card to spare for each of our applications. Read more:

https://blog.cloudflare.com/kernel-bypass/

https://blog.cloudflare.com/single-rx-queue-kernel-bypass-wi...

Also, using DPDK and Netmap kills the usual tooling (starting with basics like tcpdump). We very much like iptables, XDP, conntrack, SYN cookies and other technologies deeply embedded in the Linux kernel. Doing DDoS mitigation once, on the Linux kernel, is hard enough. We don't want to redo that logic for each of the possible kernel-bypass technologies:

https://blog.cloudflare.com/why-we-use-the-linux-kernels-tcp...

https://blog.cloudflare.com/syn-packet-handling-in-the-wild/


Feels like we're heading toward a future where half of userspace will recompile itself as BPF and upload itself into kernel space to be executed there.

Jokes aside, I wonder how much time will pass until people start writing kernel drivers in BPF.

XDP is sort of that for the network subsystem. (Though it doesn't allow you to implement drivers, "just" protocols.) Next step could be filesystem stuff, then libvirt drivers, etc.

But BPF is not that versatile, though at least it has a stable ABI.
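
To make the XDP point concrete, here is roughly what the smallest possible XDP program looks like (libbpf-style section names; it just hands every packet to the normal stack):

    /* Minimal XDP program: runs in the driver receive path, before the
     * kernel allocates an skb. XDP_PASS sends the packet on to the
     * regular network stack; XDP_DROP would discard it right there. */
    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    SEC("xdp")
    int xdp_pass_all(struct xdp_md *ctx)
    {
        return XDP_PASS;
    }

    char _license[] SEC("license") = "GPL";

Built with clang -O2 -target bpf and attached with something like "ip link set dev eth0 xdp obj prog.o sec xdp", which is about as close to "userspace uploading itself into the kernel" as it gets today.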


Let's ask the BSD folks who vetted Lua.

WebAssembly.

I guess you're getting downvotes because you posted this without explanation, but wasm actually makes a lot of sense for safer kernel extensions, as it allows sandboxing code that runs in ring 0.


> The naive TCP echo server would look like:

    while data:
        data = read(sd, 4096)
        write(sd, data)
Not if that’s the write syscall, since it can return with fewer bytes than requested having been written. The full C code detects this case and crashes, which looks totally wrong.

The Python-ish example should, at the very least, use sendall. But even that is potentially suboptimal.
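
For reference, the usual C fix is the same idea as sendall: keep calling write() on the unwritten remainder instead of treating a short write as an error. A rough sketch (hypothetical helper, no error reporting beyond the return value):

    #include <unistd.h>

    /* Echo loop that tolerates short writes: write(2) on a socket may
     * accept fewer bytes than requested, so keep writing the rest. */
    int echo_loop(int sd)
    {
        char buf[4096];

        for (;;) {
            ssize_t n = read(sd, buf, sizeof(buf));
            if (n <= 0)
                return (int)n;          /* EOF or error */

            ssize_t off = 0;
            while (off < n) {
                ssize_t m = write(sd, buf + off, (size_t)(n - off));
                if (m < 0)
                    return -1;
                off += m;
            }
        }
    }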


How is backpressure handled in the SOCKMAP solution? The description sounds like it just attaches all incoming buffers to the outgoing side of the socket, irrespective of how much data is already buffered. That could be bad if it means memory usage is unbounded.

The other solutions definitely all would block if the send buffer is full.

Apart from that question it was interesting to learn that io_submit actually works on sockets too. I definitely need to read more about this one.


There are a couple of interesting questions:

- backpressure

- what if both userspace and SOCKMAP are doing read() on a socket?

- can one use SOCKMAP to splice only a selected amount of data, and pick up the rest with read()?

- what are the parser/verdict program semantics, and what can they do? (A rough sketch of their shape follows below.)

- what is the SK_MSG abstraction, and how do you benefit from it?

and more... some of this is discussed:

http://vger.kernel.org/lpc_net2018_talks/ktls_bpf_paper.pdf
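
As referenced in the list above, here is a rough sketch of the shape of the parser/verdict pair that gets attached to a SOCKMAP (libbpf-style section names; illustrative only, not the article's exact code):

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    /* The sockets to splice between get inserted into this map from
     * userspace with bpf_map_update_elem(). */
    struct {
        __uint(type, BPF_MAP_TYPE_SOCKMAP);
        __uint(max_entries, 2);
        __type(key, __u32);
        __type(value, __u32);
    } sock_map SEC(".maps");

    /* Stream parser: tell the kernel how many bytes make up one
     * "message". Returning skb->len treats everything as one unit. */
    SEC("sk_skb/stream_parser")
    int parser(struct __sk_buff *skb)
    {
        return skb->len;
    }

    /* Stream verdict: decide where the message goes. Here: redirect it
     * to whichever socket is stored under key 0 in the map. */
    SEC("sk_skb/stream_verdict")
    int verdict(struct __sk_buff *skb)
    {
        return bpf_sk_redirect_map(skb, &sock_map, 0, 0);
    }

    char _license[] SEC("license") = "GPL";

A real proxy would pick the map key from the flow (for example, based on which socket the data arrived on) rather than hard-coding 0.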

On io_submit, we wrote about it here:

https://blog.cloudflare.com/io_submit-the-epoll-alternative-...

TLDR: it allows for batching, and with IOCB_CMD_POLL it can be used as an epoll alternative.
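
A hedged sketch of what that looks like at the syscall level: glibc has no wrappers for the Linux AIO syscalls, so everything goes through syscall(2). IOCB_CMD_POLL needs a reasonably recent kernel, and wait_readable is a hypothetical helper:

    #include <linux/aio_abi.h>
    #include <poll.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <time.h>
    #include <unistd.h>

    /* Thin wrappers over the raw Linux AIO syscalls. */
    static long io_setup(unsigned nr, aio_context_t *ctx)
    {
        return syscall(__NR_io_setup, nr, ctx);
    }

    static long io_submit(aio_context_t ctx, long nr, struct iocb **iocbs)
    {
        return syscall(__NR_io_submit, ctx, nr, iocbs);
    }

    static long io_getevents(aio_context_t ctx, long min_nr, long nr,
                             struct io_event *events, struct timespec *timeout)
    {
        return syscall(__NR_io_getevents, ctx, min_nr, nr, events, timeout);
    }

    /* Block until a socket is readable, epoll-style. Several iocbs could
     * be handed to a single io_submit() call - that is the batching the
     * TLDR refers to. ctx comes from an earlier io_setup(128, &ctx). */
    static int wait_readable(aio_context_t ctx, int sd)
    {
        struct iocb cb;
        struct iocb *list[1] = { &cb };
        struct io_event ev[1];

        memset(&cb, 0, sizeof(cb));
        cb.aio_lio_opcode = IOCB_CMD_POLL;
        cb.aio_fildes = sd;
        cb.aio_buf = POLLIN;            /* events of interest */

        if (io_submit(ctx, 1, list) != 1)
            return -1;
        return (int)io_getevents(ctx, 1, 1, ev, NULL);
    }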


Thanks for all the articles! I read the one about io_submit in the meantime. I actually hoped it would be as async as the API suggests, so it seems a bit disappointing for my use case. Do you know whether, if I schedule multiple writes via io_submit, it blocks until all sockets have written something, or only until at least one has?

I guess the fact that one has to guess and experiment to find out how those APIs behave is what makes me most nervous, and would probably keep me from using them. An async I/O system should be very well-behaved and not block at arbitrary points.

