
This paper seems unreservedly in favor of larger windows. http://research.google.com/pubs/pub36640.html

Based on our large scale experiments, we are pursuing efforts in the IETF to standardize TCP’s initial congestion window to at least ten segments. Preliminary experiments with even higher initial windows show indications of benefiting latency further while keeping any costs to a modest level. Future work should focus on eliminating the initial congestion window as a manifest constant to scale to even larger network speeds and Web page sizes.




There are a number of issues at play here: RTT (round-trip time, i.e. ping/latency), window sizes, packet loss, and initcwnd (TCP's initial congestion window).

Initial window size: not relevant AFAICS, I'm not talking about connection startup behavior.

RTT, Window size: if the bandwidth-delay product is large, obviously you need a large window size (>>65K). Thankfully, recent TCP stacks support TCP window scaling.
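(Rough sketch of the arithmetic, with made-up link numbers: the bandwidth-delay product is just bandwidth times RTT, and it's the amount of data that has to be in flight, i.e. the window you need, to keep the pipe full.)

    # Back-of-the-envelope bandwidth-delay product. The link numbers below are
    # illustrative assumptions, not anything measured in this thread.
    def bdp_bytes(bandwidth_bps, rtt_seconds):
        """Bytes that must be in flight to keep the link busy."""
        return bandwidth_bps / 8 * rtt_seconds

    # e.g. a 1 Gbps path with an 80 ms RTT
    print(bdp_bytes(1_000_000_000, 0.080))  # 10,000,000 bytes -- far above 65,535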

Packet loss: you need relatively large buffers (by the standards of traditional TCP) and a sane scheme for recovering from packet loss (e.g., SACK), but I don't see why this is a show stopper on modern TCP stacks.

I'm not super familiar with the SPDY work, but from what I recall, it primarily addresses connection startup behavior, rather than steady-state behavior.


There are all sorts of latency problems caused by the congestion window size (and how it gets reset). Because of how the algorithm works, unless you're sending a continuous stream of data (which allows the congestion window to grow), the window gets reset to its initial size, which can mean waiting for an extra ACK round trip before you get the whole message.

While it's not that big a deal if your users are local to you, if they're on a different continent, each extra round trip can easily add 100 ms.

I used to do TCP/IP tuning for low-latency trading applications (sometimes you need to use a third-party data protocol, so you can't just use UDP); this sort of stuff used to bite us all the time.

If latency is important, it's worth sitting down with tcpdump and seeing how your website loads (i.e. how many packets, how many ACKs, etc.), as often there are ways of tweaking connection settings (either via socket options or kernel settings) that can result in higher performance.

(Try setting tcp_slow_start_after_idle to 0 if you're using a recent Linux kernel; this won't give you a bigger initial window, but it means that once your window has grown it won't get reset straight away if there's a gap between data sends.)
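(For the curious, a minimal Python sketch of how you'd check and flip that knob on Linux; the /proc path is the standard sysctl location, writing to it needs root, and you'd normally persist the change via sysctl.conf instead.)

    # Minimal sketch (Linux only): inspect, and optionally disable, the
    # "slow start after idle" reset described above.
    # 1 = reset cwnd after an idle period (the default), 0 = keep the grown window.
    KNOB = "/proc/sys/net/ipv4/tcp_slow_start_after_idle"

    with open(KNOB) as f:
        print("current value:", f.read().strip())

    # Uncomment to disable the reset (requires root):
    # with open(KNOB, "w") as f:
    #     f.write("0\n")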


Is there any reason why we can't have a TCP slow start initial window of 100 packets or higher?

I could easily see 95% of the internet fitting in a 150 KB first page load.


Sure; I'm not suggesting dropping TCP entirely, just that e.g. a 1MB request / response is not going to be legitimate, and so you can simply not implement a lot of TCP's complexity (e.g. window scaling)

TCP Windowing

TCP throughput is limited by window size divided by round-trip time (i.e. how much you can send before having to wait for an ACK). With RFC 1323, you can specify a window size well above the 65,535-byte limit that would otherwise exist; with scaling you can have a window of up to about a gigabyte. With an RTT of 11 ms, you need a window of roughly 200 KB to saturate a 150 Mbps link.
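(Quick check of those figures in Python; the formula is just the one stated above, throughput capped at window / RTT.)

    # Throughput ceiling = window / RTT.
    def window_for(bandwidth_bps, rtt_s):
        return bandwidth_bps / 8 * rtt_s        # bytes that must be in flight

    def max_throughput_mbps(window_bytes, rtt_s):
        return window_bytes * 8 / rtt_s / 1e6

    print(window_for(150e6, 0.011))             # ~206,000 bytes, i.e. roughly 200 KB
    print(max_throughput_mbps(65_535, 0.011))   # ~47.7 Mbps without window scaling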

yes, they also have a bigger initial TCP window than the default

Right, when I said "the TCP implementation", I mean the implementation of TCP that doesn't allow large enough window sizes. As you correctly point out, there are other TCP implementations that don't have this problem.

My understanding is that window scaling and SACK enable TCP to detect losses within large windows too. The only limitation is that most congestion controllers throttle back when they detect packet loss; newer latency-based controllers don't suffer from that problem.

Do you have any information on Windows' TCP algorithms and why they are so bad, or what kinds of problems they cause?

This is true, but my understanding is that TCP's congestion control mechanism is far from ideal and is the root cause of its performance problems.

True. A minimal TCP with a 1 MSS window may be easy, but proper congestion control with fast recovery, F-RTO, tail loss probe, SACK, etc. is much harder. Miss one of these aspects and you get a TCP that takes minutes to recover from a lost packet in some obscure case. It took years to debug the Linux TCP stack, and even the BSD stack is still way behind.

I call BS on the benchmarks AND the theoretical analysis. Every time I read these HTTP/x benchmarks, people fail to mention TCP's congestion control and just ignore it in their analysis. You can't, at least not if you claim "realistic conditions". Congestion control introduces additional round trips, depending on your link parameters (bandwidth, latency, OS configuration such as initcwnd, etc.), and limits your bandwidth at the transport layer. And depending on those link parameters, 6 parallel TCP connections may achieve higher bandwidth on a cold start, because their aggregate window growth during TCP slow start is faster than that of the single TCP connection used by HTTP/2.
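(A toy model of that last point, under heavy simplifying assumptions -- no loss, no handshake cost, a fixed 1460-byte MSS, and a receive window that is never the bottleneck -- just to illustrate why N parallel connections ramp up their aggregate window faster than one:)

    MSS = 1460  # assumed segment size

    def rtts_to_deliver(total_bytes, initcwnd=10, connections=1):
        """RTT rounds of idealized slow start needed to push total_bytes."""
        sent, cwnd, rtts = 0, initcwnd * connections, 0
        while sent < total_bytes:
            sent += cwnd * MSS
            cwnd *= 2              # every connection doubles its window each RTT
            rtts += 1
        return rtts

    print(rtts_to_deliver(150_000, connections=1))  # 4 RTTs in this toy model
    print(rtts_to_deliver(150_000, connections=6))  # 2 RTTs in this toy model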

Additionally, the most common error people make while benchmarking (and I assume the author did too) is to ignore the congestion control's caching of the cwnd in your OS's TCP stack. That is, once the cwnd has grown from the usual 10 MSS (~14.6 KB for most OSes and setups), your OS will cache the value and reuse the larger cwnd as the initcwnd when you open a new TCP socket to the same host. So if you do 100 benchmark runs, you get one "real" run, and the other 99 reuse the cwnd cache and produce unrealistic results. Given that the author didn't mention TCP congestion control at all and didn't mention any tweaks to the TCP metrics cache (you can disable or reset it via /proc/sys/net/ipv4/tcp_no_metrics_save on Linux), I assume all the measured numbers are BS.
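(A small Linux-only sketch of how you'd check for that before benchmarking; the sysctl is the one named above, and flushing the cache between runs via iproute2 is the other standard option.)

    # Sketch (Linux only): is the kernel caching per-destination TCP metrics,
    # including cwnd, between benchmark runs?
    KNOB = "/proc/sys/net/ipv4/tcp_no_metrics_save"

    with open(KNOB) as f:
        print("tcp_no_metrics_save =", f.read().strip(),
              "(0 means cached metrics can inflate later runs)")

    # With root, either disable caching for the duration of the benchmark:
    #   echo 1 > /proc/sys/net/ipv4/tcp_no_metrics_save
    # or flush the cache between runs:
    #   ip tcp_metrics flush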


Even so, TCP windows generally don't start at the maximum size, so I'm really not buying that ratio of bytes ACKed to sent.

One could disprove this explanation with a packet capture; just add the remaining window.


Back then tcp window scaling wasn’t widely supported

This neglects the fact that IP imposes a fixed per-packet overhead. More packets means lower efficiency.

Using larger packets reduces the overhead as a percentage of traffic, so is desirable regardless of the TCP window size.

Higher efficiency means higher bandwidth, which means lower latency for small transfers. Waiting for two 10-byte packets yields latency equal to the higher of the two latencies. Waiting for one 20-byte packet is, on average, lower latency (because you take one sample instead of worst-of-two).
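(Rough numbers, assuming 40 bytes of IPv4 + TCP headers with no options and ignoring link-layer framing:)

    # Per-packet header overhead as a fraction of what goes on the wire.
    HEADERS = 40  # assumed IPv4 (20) + TCP (20) with no options

    def overhead_fraction(payload_bytes):
        return HEADERS / (HEADERS + payload_bytes)

    for payload in (10, 100, 1460):
        print(payload, "bytes ->", f"{overhead_fraction(payload):.1%} overhead")
    # 10 bytes   -> 80.0% overhead
    # 100 bytes  -> 28.6% overhead
    # 1460 bytes -> 2.7% overhead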


In the past (and even today) some servers didn't enable TCP window scaling, which limited you to a couple of megabits per second on, say, a 100 ms path, not because of any deliberate blocking, but because you couldn't have more than 64 KB in flight at any one time.

This was mostly fixed over 20 years ago when window scaling became the default, but companies like Signiant actually ran demos where they deliberately turned off window scaling to show how much faster their proprietary file transfer protocols were.
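(Sanity-checking that figure: the ceiling is window / RTT, so an unscaled 64 KB window over a 100 ms RTT tops out at about 5 Mbps.)

    window_bytes = 65_535   # maximum in-flight data without window scaling
    rtt_s = 0.100
    print(window_bytes * 8 / rtt_s / 1e6, "Mbps")   # ~5.2 Mbps ceiling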


Interestingly, this is still relevant today when doing large data transfers over TCP between hosts with long RTTs: typical TCP congestion control implementations (e.g. BIC, http://en.wikipedia.org/wiki/BIC_TCP) use ACKs, which come back one RTT later, to update their window size, and hence their windows grow much more slowly than between two hosts with the same capacity but a smaller RTT.

CUBIC (http://en.wikipedia.org/wiki/CUBIC_TCP) and H-TCP (http://smakd.potaroo.net/ietf/all-ids/draft-leith-tcp-htcp-0...) are two congestion control methods that avoid this by growing the congestion window as a function of elapsed time rather than per RTT (and are recommended for large data transfers across high-capacity links with high RTT). On Linux you can check your TCP congestion control algorithm with "sysctl net.ipv4.tcp_congestion_control".
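(The same check done programmatically, reading the standard /proc locations on Linux:)

    # Sketch (Linux only): current and available congestion control algorithms.
    def read(path):
        with open(path) as f:
            return f.read().strip()

    print("current  :", read("/proc/sys/net/ipv4/tcp_congestion_control"))
    print("available:", read("/proc/sys/net/ipv4/tcp_available_congestion_control"))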


No, TCP performance goes up for the connection in question, and its latency is lower because the packets arrive earlier. Latency (for all protocols) goes up on the whole, though, due to the backlog.

Balancing these requirements against each other is a really hard problem. TCP slow start is (well, was, cf. this article) an early attempt at an auto-tuning solution. But it isn't the only part of the problem, nor is it an optimal (or even "good") solution to its part of the problem. Its defaults are very badly tuned for modern networks (though they'd be a lot better if everyone were using jumbograms...).

