I call BS on the benchmarks AND the theoretical analysis. Every time I read those HTTP/X benchmarks, people don't mention TCP's congestion control and just ignore it in their analysis. Well, you can't, at least not if you claim "realistic conditions". Congestion control introduces additional round trips, depending on your link parameters (bandwidth, latency, OS configuration like initcwnd, etc.), and limits your bandwidth at the transport layer. And depending on those link parameters, 6 parallel TCP connections might achieve higher bandwidth on a cold start, because the aggregate window growth during TCP slow start is superior to that of the single TCP connection used by HTTP/2.
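To put rough numbers on that (my own back-of-envelope figures, not the author's): assuming classic slow start that doubles the cwnd every RTT from an initial window of 10 MSS, and ignoring loss and receive windows, six cold connections start with six times the aggregate window of a single cold connection:

    # Rough slow-start sketch: aggregate congestion window per RTT round,
    # assuming the cwnd doubles every RTT from an initial window of 10 MSS
    # (ignores packet loss, receive windows, and real pacing).
    MSS = 1460          # bytes; a typical Ethernet MSS
    INITCWND = 10       # segments; the common modern default

    def aggregate_cwnd_bytes(connections, rtt_round):
        """Aggregate congestion window (bytes) across connections after rtt_round RTTs."""
        return connections * INITCWND * MSS * 2 ** rtt_round

    for rtt in range(4):
        one = aggregate_cwnd_bytes(1, rtt)
        six = aggregate_cwnd_bytes(6, rtt)
        print(f"after {rtt} RTTs: 1 conn ~{one // 1024} KiB in flight, 6 conns ~{six // 1024} KiB")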
Additionally, the most common error people make while benchmarking (and I assume the author did too) is to ignore the congestion control's caching of the cwnd in your OS's TCP stack. That is, once the cwnd is raised from the usual 10 MSS (~14.6 KB for most OSes and setups), your OS will cache the value and reuse the larger cwnd as the initcwnd when you open a new TCP socket to the same host. So if you do 100 benchmark runs, you will have one "real" run, and 99 will reuse the cwnd cache and produce unrealistic results. Given that the author didn't mention TCP congestion control at all and didn't mention any tweaks to the TCP metrics cache (you can disable it via /proc/sys/net/ipv4/tcp_no_metrics_save on Linux, or flush it with `ip tcp_metrics flush`), I assume all the measured numbers are BS.
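For what it's worth, here's a minimal sketch of what I mean by controlling the metrics cache between runs on Linux (needs root; assumes iproute2 is installed and the sysctl lives at the usual path):

    # Sketch: disable cwnd/metrics caching and flush any cached entries
    # before each benchmark run (Linux, run as root).
    import subprocess

    def reset_tcp_metrics_cache():
        # 1 = don't save per-destination metrics (cwnd, ssthresh, rtt) on close
        with open("/proc/sys/net/ipv4/tcp_no_metrics_save", "w") as f:
            f.write("1\n")
        # Flush anything already cached from earlier connections
        subprocess.run(["ip", "tcp_metrics", "flush"], check=True)

    reset_tcp_metrics_cache()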
It’s also a specious argument anyway. The six-connection limit isn’t purely artificial: opening and tracking TCP connection state is expensive, and it happens entirely in the kernel. There’s a very real cap on how many TCP connections a machine can serve before the kernel starts barfing, and that cap is substantially lower than the number of multiplexed streams you can push over a single TCP connection.
You’re also completely ignoring the TCP slow start process, which you can bet your bottom dollar will prevent six TCP streams from beating six multiplexed streams over a single TCP connection when measuring latency from the first connection.
There are a number of issues at play here: RTT (round-trip time, i.e. ping/latency), window sizes, packet loss, and initcwnd (TCP's initial window).
Initial window size: not relevant AFAICS, since I'm not talking about connection startup behavior.
RTT, window size: if the bandwidth-delay product is large, you obviously need a large window (>>65 KB). Thankfully, recent TCP stacks support TCP window scaling.
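As a worked example (my numbers, nothing from the article): a 100 Mbit/s path with an 80 ms RTT already needs a window far beyond the unscaled 64 KB limit:

    # Bandwidth-delay product: the window you need to keep the pipe full.
    def bdp_bytes(bandwidth_bits_per_s, rtt_s):
        return bandwidth_bits_per_s * rtt_s / 8

    # Example: 100 Mbit/s link, 80 ms RTT
    bdp = bdp_bytes(100e6, 0.080)
    print(f"BDP ~ {bdp / 1024:.0f} KiB")   # ~977 KiB, far beyond 64 KiB without window scaling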
Packet loss: you need relatively large buffers (by the standards of traditional TCP) and a sane scheme for recovering from packet loss (e.g., SACK), but I don't see why this is a show stopper on modern TCP stacks.
I'm not super familiar with the SPDY work, but from what I recall, it primarily addresses connection startup behavior, rather than steady-state behavior.
Those incremental gains don't seem much better than what the Linux TCP improvements deliver each year, especially if you turn on state-of-the-art congestion control / bufferbloat-aware algorithms.
Also, TCP Fast Open is ridiculously old at this point, and I can't see why mainstream equipment still wouldn't support it on average.
Based on our large scale experiments, we are pursuing efforts in the IETF to standardize TCP's initial congestion window to at least ten segments. Preliminary experiments with even higher initial windows show indications of benefiting latency further while keeping any costs to a modest level. Future work should focus on eliminating the initial congestion window as a manifest constant to scale to even larger network speeds and Web page sizes.
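If you want to play with this yourself on Linux, the initial window can be raised per route with iproute2. A sketch, where the gateway and interface are placeholders you'd take from `ip route show`:

    # Sketch: raise the initial congestion window on the default route (Linux, root).
    # The gateway and device below are placeholders; read them from `ip route show` first.
    import subprocess

    subprocess.run(
        ["ip", "route", "change", "default",
         "via", "192.0.2.1", "dev", "eth0",   # placeholder gateway/interface
         "initcwnd", "10"],
        check=True,
    )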
Because running multiple TCP connections in parallel plays havoc with TCP congestion control, and it also plays poorly with the TCP slow-start logic. Every TCP connection begins its congestion window from scratch, so it starts small, and fetching many moderately sized or large resources (think images) will cost you many round trips you didn't need to spend.
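To illustrate the round-trip cost (my own numbers, assuming the cwnd doubles every RTT from 10 MSS and ignoring loss): even a single moderately sized image takes several RTTs on a cold connection, and every extra parallel connection pays that ramp-up again.

    # How many RTTs does slow start need to deliver `size` bytes on a cold connection?
    # Assumes the cwnd doubles each RTT from an initial window of 10 MSS, no loss.
    MSS = 1460
    INITCWND = 10

    def rtts_to_deliver(size_bytes):
        cwnd, sent, rtts = INITCWND * MSS, 0, 0
        while sent < size_bytes:
            sent += cwnd
            cwnd *= 2
            rtts += 1
        return rtts

    for size in (14_600, 100_000, 1_000_000):
        print(f"{size:>9} bytes: {rtts_to_deliver(size)} RTT(s) of data transfer")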
Ah yes. I'd say that's more a problem with TCP in general than in using multiple connections. The assumption that "1 TCP connection == 1 share of bandwidth" is at best a useful first approximation. I don't know that N TCP streams eating N times their "fair share" is really any different a problem than some-important-interactive-application being given equal bandwidth with some-irrelevant-background-download. (Though it might be a worse problem.)
I'd love to live to see the day that something actually better than TCP (that addresses these and other issues) dethrones it, but given how long IPv6 took to gain traction, I wouldn't be surprised if it didn't happen in my lifetime.
With window scaling, TCP can use a window of up to 1 GB. That's enough to sustain roughly 8 Gbit/s even at a 1 s RTT, and both of those figures are ridiculous for a single long-haul flow. In practice you probably almost always just parallelize into multiple flows anyway.
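Spelling out the arithmetic behind that (my numbers): with the maximum window scale factor of 14 the window tops out at 65,535 * 2^14 bytes, which at a 1 s RTT works out to roughly 8.6 Gbit/s for a single flow:

    # Maximum throughput of a single TCP flow limited only by the window.
    max_window = 65535 * 2**14          # ~1 GiB with the maximum window scale factor (14)
    rtt = 1.0                           # seconds; deliberately extreme
    throughput_gbps = max_window * 8 / rtt / 1e9
    print(f"max window ~{max_window / 2**30:.2f} GiB -> ~{throughput_gbps:.1f} Gbit/s at {rtt}s RTT")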
Only for apps that don't try to utilize maximum throughput. Skype, YouTube, most web browsing, most mail use.
But those that do - like large file transfers over ftp/sftp or a very large email, for example - will cause the meltdown described in this article.
There are some TCP stacks that use RTT rather than packet loss as their congestion signal; those fare well under a TCP-over-TCP regime (but have other problems).
True. Minimal TCP with a 1 MSS window may be easy, but proper congestion control with fast recovery, F-RTO, tail loss probe, SACK, etc. is much harder. Miss one of these aspects and you get a TCP that takes minutes to recover from a lost packet in some obscure case. It took years to debug the Linux TCP stack. Even the BSD stack is still way behind.
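If you're curious what your own Linux box has enabled, most of these mechanisms are exposed as sysctls. A small sketch that just reads whichever of them exist (paths and meanings vary a bit across kernel versions):

    # Sketch: report a few Linux loss-recovery sysctls, skipping ones this kernel lacks.
    import os

    SYSCTLS = {
        "SACK": "/proc/sys/net/ipv4/tcp_sack",
        "F-RTO": "/proc/sys/net/ipv4/tcp_frto",
        "RACK recovery": "/proc/sys/net/ipv4/tcp_recovery",
    }

    for name, path in SYSCTLS.items():
        if os.path.exists(path):
            with open(path) as f:
                print(f"{name}: {f.read().strip()}")
        else:
            print(f"{name}: sysctl not present on this kernel")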
I don't disagree, but it makes it an apples-and-oranges comparison. It introduces the variables of how Linux vs. Windows deals with TCP (notwithstanding that Linux 2.x vs. 3.x might have some internal IPv4 changes, but your recommendation is to upgrade anyway, so that's fine), but also changes in the web server. It seems like the changes are hard-coded into the compiled kernel, so there's no way to simply change configuration flags?
That said, thanks for the post, and I'll definitely be tcpdumping in the upcoming week and reading some more about slow start!
Maybe testing with net.ipv4.tcp_slow_start_after_idle set to 0 vs 1 would make a cleaner comparison?
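Something like this before each run would at least make the idle-restart behaviour explicit instead of whatever the distro default happens to be (Linux, needs root; a sketch, not from the article):

    # Sketch: pin net.ipv4.tcp_slow_start_after_idle to a known value before a benchmark run.
    def set_slow_start_after_idle(enabled: bool):
        with open("/proc/sys/net/ipv4/tcp_slow_start_after_idle", "w") as f:
            f.write("1\n" if enabled else "0\n")

    set_slow_start_after_idle(False)   # run A: keep the cwnd across idle periods
    # ... run benchmark ...
    set_slow_start_after_idle(True)    # run B: kernel default, collapse the cwnd after idle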
> It also seems like the protocol intentionally runs slower than possible so as not to create buffer pressure on the receiving side, if I'm understanding this quick description properly: "then cruising at the estimated bandwidth to utilize the pipe without creating excess queue".
> Then this line just scares me: "Occasionally, on an as-needed basis, it sends significantly slower to probe for RTT (PROBE_RTT mode)."
I haven't read the proposal, but I think the reason for this is that they're comparing RTT during load with idle RTT to determine packet queuing, but the idle RTT may change over time.
Depending on how accurate you want to be, it could be as simple as this: after some amount of time, or some count of full data packets sent from the socket buffer in response to ACKs moving the window, leave a small gap before the next packet and then resume sending. If that packet is ACKed faster than the rest, the idle RTT is shorter than the under-load RTT, which means you should slow down in general (to optimize latency). If the RTT is the same for the after-gap packet, then the loaded RTT is close to idle and you can keep going at the current rate.
(I probably wouldn't implement it like I described it. With TCP timestamps we have pretty continuous RTT measurements, so some sort of min/max/average/stddev over the last N packets to drive the congestion window from all measurements, plus a mechanism to add a small gap for the PROBE_RTT, would make more sense: any low-RTT response should inform the system, not just one that comes in response to a probe.)
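Here's a toy sketch of the heuristic I'm describing, just to make it concrete. This is not BBR and not kernel code; all the names and thresholds are made up for illustration:

    # Toy sketch of the probing idea above: track recent RTT samples, occasionally
    # leave a short sending gap, and compare the post-gap RTT with the loaded RTT.
    # Purely illustrative; real stacks (and BBR itself) do far more than this.
    from collections import deque

    class RttProbe:
        def __init__(self, window=64, probe_every=1000):
            self.samples = deque(maxlen=window)   # recent RTTs under load (seconds)
            self.packets_since_probe = 0
            self.probe_every = probe_every

        def on_ack(self, rtt):
            self.samples.append(rtt)
            self.packets_since_probe += 1

        def should_probe(self):
            # Time to leave a small gap and send one packet into an empty(ish) queue
            return self.packets_since_probe >= self.probe_every

        def on_probe_ack(self, probe_rtt, slack=1.10):
            self.packets_since_probe = 0
            loaded_rtt = min(self.samples) if self.samples else probe_rtt
            # If the post-gap packet came back clearly faster, we were queueing:
            # back off a little; otherwise keep sending at the current rate.
            return "slow down" if probe_rtt * slack < loaded_rtt else "keep rate"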
No, TCP performance goes up for the connection in question. Latencies are lower because the packets arrive earlier. Latency (for all protocols) goes up on the whole, though, due to the backlog.
Balancing these requirements against each other is a really hard problem. TCP slow start is (well, was; cf. this article) an early attempt at an auto-tuning solution. But it isn't the only part of the problem, nor is it an optimal (or even "good") solution to its part of the problem. Its defaults are very badly tuned for modern networks (though they'd be a lot better if everyone were using jumbograms...).
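The jumbogram aside is easy to quantify (my numbers): with the same 10-segment initial window, the first flight carries roughly six times as much data with a 9000-byte MTU as with a standard 1500-byte one:

    # First-flight size under a 10-segment initial window, standard vs jumbo frames.
    INITCWND = 10
    for label, mss in (("1500-byte MTU", 1460), ("9000-byte jumbo MTU", 8960)):
        print(f"{label}: initial flight ~{INITCWND * mss / 1024:.1f} KiB")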