SOCKMAP – TCP splicing of the future (blog.cloudflare.com)
125 points by jgrahamc | 2019-02-18 | 24 comments




Thomas Graf, a Cilium founder, made some comments on Twitter around this:

https://twitter.com/majek04/status/1097485987054346240


For those who don't want to jump to the Twitter thread: we should expect a follow-up, with a push to address the data loss and to allow further speed improvements without risking data loss.

Good, wholesome, internet collaboration right here.....


Articles like this are my favorite kind: the scientific approach combined with the never-ending quest for more efficient and higher-performing software!

I love everything about it!

<3


This article had me all excited until it got to the benchmarks. Hopefully it's just some design deficiencies and not something terrible, like running BPF bytecode simply being slower than a couple of context switches.

I really really wanted to prove SOCKMAP rocks. Oh, well. Not just yet. In one of the linked videos the SOCKMAP authors say they didn't optimize part of the code. So there definitely is plenty of room.

I don't see any fundamental reason why SOCKMAP wouldn't be the fastest. It's just a matter of putting the effort into it.

For example, we noticed the poor splice(2) performance is due to a spinlock. We need to figure out just why this happens, but it doesn't seem like a major problem. Probably just a trivial regression.
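
For readers who haven't seen the splice(2) pattern being benchmarked here: forwarding data between two TCP sockets has to go through a pipe, because at least one end of every splice() call must be a pipe. A minimal sketch, with a hypothetical helper name and error handling trimmed:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    /* Hypothetical helper: forward bytes from one connected socket to
     * another via splice(2). The pipe is only a kernel-side buffer; the
     * payload never has to be copied into userspace. */
    int splice_forward(int sd_in, int sd_out)
    {
        int pfd[2];
        if (pipe(pfd) < 0)
            return -1;

        for (;;) {
            /* Socket -> pipe. */
            ssize_t n = splice(sd_in, NULL, pfd[1], NULL, 65536,
                               SPLICE_F_MOVE | SPLICE_F_MORE);
            if (n <= 0)
                break;

            /* Pipe -> socket: drain everything just spliced in. */
            while (n > 0) {
                ssize_t m = splice(pfd[0], NULL, sd_out, NULL, n,
                                   SPLICE_F_MOVE | SPLICE_F_MORE);
                if (m <= 0)
                    goto out;
                n -= m;
            }
        }
    out:
        close(pfd[0]);
        close(pfd[1]);
        return 0;
    }

The appeal is that the byte shuffling stays entirely in the kernel; userspace only drives the loop.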


I really appreciate the effort you put into the benchmarks. Reading the first half of your article I was excited, but also slightly dreading that I'd have to go to the trouble of creating and running the benchmarks myself to see if it was worth the effort.

For optimising things like this I can recommend using ARM's ITM tracing.

Then you can get a view of every clock cycle and what is going on in the CPU, the caches, the hardware peripherals, the DMA controller etc.

You can start from a goal like "I want a TCP packet forwarded in under 5k clock cycles" and keep hacking code till you've met it.

Other software-based tracing tools don't tend to be as good, and they frequently rely on statistical techniques which hide what is really going on.


I'm surprised that for something this simple there isn't an easy-to-use approach through netmap and friends.

We specifically want to splice userspace TCP sockets. The unix way. I'm also not a fan of custom TCP/IP stacks.

https://blog.cloudflare.com/why-we-use-the-linux-kernels-tcp...


I see the opinion, but the kernel developer in me thinks that perhaps it would be better for everyone if the answer to such narrow, performance-focused use cases was not "let's develop an exotic kernel interface that doesn't really work" but rather "go buy hardware or do it in userland".

Any reason this article completely skipped over userspace tools like DPDK and Netmap (there are many others)? Especially considering Cloudflare uses a custom version of nginx with the DPDK stack?

Apples and oranges. The most important problem with DPDK and Netmap is that they require a dedicated network card. We don't have a network card to spare for each of our applications. Read more:

https://blog.cloudflare.com/kernel-bypass/

https://blog.cloudflare.com/single-rx-queue-kernel-bypass-wi...

Also, using DPDK and Netmap kills the usual tooling (starting with basics like tcpdump). We very much like iptables, XDP, conntrack, SYN cookies and other technologies deeply embedded in the Linux kernel. Doing DDoS mitigation once, on the Linux kernel, is hard enough. We don't want to redo that logic for each of the possible kernel-bypass technologies:

https://blog.cloudflare.com/why-we-use-the-linux-kernels-tcp...

https://blog.cloudflare.com/syn-packet-handling-in-the-wild/


Feels like we're heading toward a future where half of userspace will recompile itself as BPF and upload itself into kernel space to be executed there.

Jokes aside, I wonder how much time will pass until people start writing kernel drivers in BPF.

XDP is sort of that for the network subsystem. (Though it doesn't allow you to implement drivers, "just" protocols.) Next step could be filesystem stuff, then libvirt drivers, etc.

But BPF is not that versatile, though at least it has a stable ABI.
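
To make the XDP point concrete, here is roughly what the smallest possible XDP program looks like (libbpf-style section names; it just hands every packet to the normal stack):

    /* Minimal XDP program: runs in the driver receive path, before the
     * kernel allocates an skb. XDP_PASS sends the packet on to the
     * regular network stack; XDP_DROP would discard it right there. */
    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    SEC("xdp")
    int xdp_pass_all(struct xdp_md *ctx)
    {
        return XDP_PASS;
    }

    char _license[] SEC("license") = "GPL";

Built with clang -O2 -target bpf and attached with something like "ip link set dev eth0 xdp obj prog.o sec xdp", which is about as close to "userspace uploading itself into the kernel" as it gets today.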


Let's ask the BSD folks who vetted Lua.

WebAssembly.

I guess you're getting downvotes because you posted this without explanation, but wasm actually makes a lot of sense for safer kernel extensions, as it allows sandboxing code that runs in ring 0.


> The naive TCP echo server would look like:

    while data:
        data = read(sd, 4096)
        write(sd, data)
Not if that’s the write syscall, since it can return with fewer bytes than requested having been written. The full C code detects this case and crashes, which looks totally wrong.

The Python-ish example should, at the very least, use sendall. But even that is potentially suboptimal.
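
For reference, the usual C fix is the same idea as sendall: keep calling write() on the unwritten remainder instead of treating a short write as an error. A rough sketch (hypothetical helper, no error reporting beyond the return value):

    #include <unistd.h>

    /* Echo loop that tolerates short writes: write(2) on a socket may
     * accept fewer bytes than requested, so keep writing the rest. */
    int echo_loop(int sd)
    {
        char buf[4096];

        for (;;) {
            ssize_t n = read(sd, buf, sizeof(buf));
            if (n <= 0)
                return (int)n;          /* EOF or error */

            ssize_t off = 0;
            while (off < n) {
                ssize_t m = write(sd, buf + off, (size_t)(n - off));
                if (m < 0)
                    return -1;
                off += m;
            }
        }
    }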


How is backpressure handled in the SOCKMAP solution? The description sounds like it just attaches all incoming buffers to the outgoing side of the socket, irrespective of how much data is already buffered. That could be bad if it means memory usage is unbounded.

The other solutions definitely all would block if the send buffer is full.

Apart from that question it was interesting to learn that io_submit actually works on sockets too. I definitely need to read more about this one.


There are a couple of interesting questions:

- backpressure

- what if both userspace and SOCKMAP are doing read() on a socket?

- can one use SOCKMAP to splice only a selected amount of data, and pick up the rest with read()?

- what are the parser/verdict program semantics, and what can they do? (A rough sketch of their shape follows below.)

- what is the SK_MSG abstraction, and how do you benefit from it?

and more... some of this is discussed:

http://vger.kernel.org/lpc_net2018_talks/ktls_bpf_paper.pdf
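
As referenced in the list above, here is a rough sketch of the shape of the parser/verdict pair that gets attached to a SOCKMAP (libbpf-style section names; illustrative only, not the article's exact code):

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    /* The sockets to splice between get inserted into this map from
     * userspace with bpf_map_update_elem(). */
    struct {
        __uint(type, BPF_MAP_TYPE_SOCKMAP);
        __uint(max_entries, 2);
        __type(key, __u32);
        __type(value, __u32);
    } sock_map SEC(".maps");

    /* Stream parser: tell the kernel how many bytes make up one
     * "message". Returning skb->len treats everything as one unit. */
    SEC("sk_skb/stream_parser")
    int parser(struct __sk_buff *skb)
    {
        return skb->len;
    }

    /* Stream verdict: decide where the message goes. Here: redirect it
     * to whichever socket is stored under key 0 in the map. */
    SEC("sk_skb/stream_verdict")
    int verdict(struct __sk_buff *skb)
    {
        return bpf_sk_redirect_map(skb, &sock_map, 0, 0);
    }

    char _license[] SEC("license") = "GPL";

A real proxy would pick the map key from the flow (for example, based on which socket the data arrived on) rather than hard-coding 0.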

On io_submit, we wrote about it here:

https://blog.cloudflare.com/io_submit-the-epoll-alternative-...

TLDR: it allows for batching, and with IOCB_CMD_POLL it can be used as an epoll alternative.
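
A hedged sketch of what that looks like at the syscall level: glibc has no wrappers for the Linux AIO syscalls, so everything goes through syscall(2). IOCB_CMD_POLL needs a reasonably recent kernel, and wait_readable is a hypothetical helper:

    #include <linux/aio_abi.h>
    #include <poll.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <time.h>
    #include <unistd.h>

    /* Thin wrappers over the raw Linux AIO syscalls. */
    static long io_setup(unsigned nr, aio_context_t *ctx)
    {
        return syscall(__NR_io_setup, nr, ctx);
    }

    static long io_submit(aio_context_t ctx, long nr, struct iocb **iocbs)
    {
        return syscall(__NR_io_submit, ctx, nr, iocbs);
    }

    static long io_getevents(aio_context_t ctx, long min_nr, long nr,
                             struct io_event *events, struct timespec *timeout)
    {
        return syscall(__NR_io_getevents, ctx, min_nr, nr, events, timeout);
    }

    /* Block until a socket is readable, epoll-style. Several iocbs could
     * be handed to a single io_submit() call - that is the batching the
     * TLDR refers to. ctx comes from an earlier io_setup(128, &ctx). */
    static int wait_readable(aio_context_t ctx, int sd)
    {
        struct iocb cb;
        struct iocb *list[1] = { &cb };
        struct io_event ev[1];

        memset(&cb, 0, sizeof(cb));
        cb.aio_lio_opcode = IOCB_CMD_POLL;
        cb.aio_fildes = sd;
        cb.aio_buf = POLLIN;            /* events of interest */

        if (io_submit(ctx, 1, list) != 1)
            return -1;
        return (int)io_getevents(ctx, 1, 1, ev, NULL);
    }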


Thanks for all the articles! I read the one about io_submit in the meantime. I actually hoped it would be as async as the API suggests, so it seems a bit disappointing for my use case. Do you know whether, if I schedule multiple writes via io_submit, it blocks until all sockets have written something, or only until at least one has?

I guess the fact that one has to guess and experiment to find out how those APIs behave is what makes me most nervous, and would probably keep me from using them. An async I/O system should be very well-behaved and not block at arbitrary points.

