TCP and mobile IP

Posted by Marc 'Zugschlus' Haber on Friday, September 4. 2009

Steinar H. Gunderson, sesse, has written an interesting article about TCP performance. I didn't find your blog's comment function, so I am commenting with a trackback. (note: which didn't work either, "The auto-discovered trackback URI does not match our target URI")

I frequently use mobile internet from a moving train, using several of the German GSM/UMTS network operators. As you have written, this frequently causes packet loss which is not only not caused by congestion, but also sends the congestion avoidance algorithms down the wrong path.

For example, when the train passes through the 3575 m long Distelrasentunnel between Frankfurt and Fulda, my network link is broken for like two minutes. Passing through other parts of Germany sometimes gives me a ping response of hundreds of thousands of microseconds by virtue of the rather huge send buffer the UMTS equipment has.

In these circumstances, ssh sessions frequently take tens of minutes to notice that the network is back, and only then is the session usable again. Frequently, it doesn't come up again before an hour has passed. And I have not found a way to work around this. Can you explain what's happening here, and do you have any ideas to solve the issue?

Comments

Aaron M. Ucko on Friday, September 4. 2009:

Good question; I don't have an answer for you, other than to suggest running screen, tmux, or the like on the remote host and reconnecting as necessary. (To be sure, that's not the greatest solution because resuming login sessions can interact poorly with X11 or agent forwarding, but it still beats losing them altogether, and supplies other useful features as well.)

In general, I've been under the impression that the internet has grown less friendly to long-lived TCP sessions over the years, with the bulk of users just caring about HTTP. :-/

Marc 'Zugschlus' Haber on Friday, September 4. 2009:

yes, of course, screen is the obvious answer, but that makes some of my processes unnecessarily clumsy, and I lose konsole's excellent scrollback capabilities since screen interferes. I'd rather have the network stack handle the situation gracefully.

Marius Gedminas on Friday, September 4. 2009:

I put this into my ~/.screenrc in order to force screen to let my terminals maintain scrollback the way I'm used to:

# gimme back my scrollback in xterm!
termcapinfo xterm|xterms|xs ti=\E7\E[?47l

What it does is change the terminal init sequence, removing the "switch to the alternate (scrollback-less) screen buffer" bit that's in the standard terminfo entry for xterm.

And here's the rest of my .screenrc; without this I find screen too obnoxious to be usable:

startup_message off

# I can't live without my ^a in bash/readline
defescape ^bb
escape ^bb

# probably don't need this any more, since I use the native xterm scrollback
defscrollback 1000

Marc 'Zugschlus' Haber on Sunday, September 6. 2009:

That's nice, I'll use it in the future. It's however a pity that TCP cannot be tuned.

Leo 'costela' Antunes on Friday, September 4. 2009:

This isn't exactly an answer to your question, but it might help in case your usage is mostly wireless based: switching the default Linux congestion management algorithm to TCP Veno. It's theoretically optimized for wireless links and should interpret lost packets in a way less damaging to throughput.

I haven't tested this specific algorithm, but I did play a bit with other high-throughput ones and the difference in performance was noticeable, though not particularly shocking.

This article is a nice overview of the different algorithms available: http://linuxgazette.net/135/pfeiffer.html

(you might have to recompile your kernel, my standard Debian kernel is showing only cubic and reno as available algorithms...)

Marc 'Zugschlus' Haber on Friday, September 4. 2009:

That is important information, rebuilding my kernel to try.

Leo 'costela' Antunes on Friday, September 4. 2009:

oops, I take it back, the right algorithms are there as modules, they just weren't loaded... so: no need to recompile!
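For anyone who wants to check which algorithms their running kernel currently offers, here is a minimal Python sketch. It assumes a Linux system with the usual procfs sysctl files; the function names are made up for illustration. Note that `try_set_congestion` only affects sockets the script itself creates. To change what ssh uses, the module would presumably have to be loaded (e.g. `modprobe tcp_veno`) and the system-wide default changed via the `net.ipv4.tcp_congestion_control` sysctl.

```python
import socket
from pathlib import Path

# Standard Linux procfs locations for the congestion control sysctls.
AVAILABLE = Path("/proc/sys/net/ipv4/tcp_available_congestion_control")
CURRENT = Path("/proc/sys/net/ipv4/tcp_congestion_control")


def parse_algorithms(text):
    """Split the whitespace-separated sysctl value into a list of names."""
    return text.split()


def read_algorithms(path=AVAILABLE):
    """Return the algorithms listed in `path`, or [] if it cannot be read."""
    try:
        return parse_algorithms(path.read_text())
    except OSError:
        return []


def try_set_congestion(sock, name):
    """Ask the kernel to use `name` for this one socket; report success."""
    opt = getattr(socket, "TCP_CONGESTION", None)  # Linux-only constant
    if opt is None:
        return False
    try:
        sock.setsockopt(socket.IPPROTO_TCP, opt, name.encode("ascii"))
        return True
    except OSError:
        # Typically: the module is not loaded, or not permitted for this user.
        return False


if __name__ == "__main__":
    print("available:", read_algorithms())
    print("default:  ", read_algorithms(CURRENT))
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        print("veno on this socket:", try_set_congestion(s, "veno"))
```

If "veno" is missing from the first list, that matches the situation described above: the module exists but has not been loaded yet.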

XANi on Friday, September 4. 2009:

ConnectTimeout 30
ServerAliveInterval 30

try adding that to ssh conf

Marc 'Zugschlus' Haber on Friday, September 4. 2009:

I have had that in my config for years. It doesn't help in these situations. Actually, I think it makes things worse, since without ServerAliveInterval idle sessions might not even notice the outage.

Josip Rodin on Friday, September 4. 2009:

The basic property of standard TCP is that it thinks packet loss is caused by congestion, whereas in wireless environments that is rarely so; there it's caused by actual data loss, meaning these two are pretty much incompatible from the get-go. TCP sees loss and then starts trying to resend data at reduced speed, which ends up with more and more packets being lost, making it think that you're on a very slow link, rather than just an unstable but relatively fast link (which is what modern 3G networks are like). This is a documented fact, or so I was told in my university classes a year ago (wow, actual real-world use for those...).

My actual experience with ssh sessions on a train (Zagreb-Vinkovci, FWIW) was shaped by those factoids and by common sense. I was usually able to learn quickly that I was about to enter a short period of downtime, seeing a part of the route where a BSS handover was likely (outside of populated areas etc.), so I was forced to learn that, rather than trying to improve TCP congestion avoidance, it's best to avoid triggering it altogether. So I do two things:

First, use ssh only for processes where I trigger output changes, e.g. mutt. That way you retain control over when and how much new data comes (or tries to come) over the link. For example if you're running any form of instant messaging (like ssh session with irssi in it), you can't control how much new data will try to be pushed to your screen or when it will happen, so it's inherently incompatible with semi-random blackout periods.

Second, go idle on the active ssh connection as soon as possible when the link goes down; it's best if I don't send a single byte over a dead link. As soon as the link comes back up, start up mtr (or have an mtr running all the time if it won't eat up your quota) and wait for proof that actual IP connectivity has come back up. Then resume the SSH session.
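The "wait for proof that connectivity is back" step could be scripted roughly like this. It is only a sketch: instead of mtr's probes it swaps in a plain TCP connection attempt to the remote host, and the helper names are invented for the example. A fresh connection attempt has its own timers, so it does not depend on the stalled session's backed-off retransmits.

```python
import socket
import time


def link_is_up(host, port, timeout=3.0):
    """Return True if a fresh TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and failed name lookups.
        return False


def wait_for_link(host, port, interval=5.0, max_attempts=None):
    """Probe until the link answers; return the number of probes used.

    Returns None if max_attempts probes all failed.
    """
    attempt = 0
    while max_attempts is None or attempt < max_attempts:
        attempt += 1
        if link_is_up(host, port):
            return attempt
        time.sleep(interval)
    return None
```

Calling something like `wait_for_link("shell.example.org", 22)` (a placeholder host) in a second terminal would block until the ssh port answers again, at which point it should be safe to start typing in the idle session.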

Hope that helps.

Marc 'Zugschlus' Haber on Sunday, September 6. 2009:

It is strangely disturbing that the TCP definitions and implementations weren't changed to cater for changed environments. And I absolutely hate the idea of having to adapt my work methods depending on what kind of network link I am on.

Maurice Massar on Wednesday, September 9. 2009:

A coworker told me that setting net.ipv4.tcp_low_latency on both sides would help. I have no idea what this option does, and Documentation/networking/ip-sysctl.txt is quite unhelpful.

Andreas Ferber on Monday, September 28. 2009:

tcp_low_latency shouldn't have any effect on this problem, since AFAIK it doesn't have any effect on the retransmission timeout (RTO), which is the main factor causing those long recovery times.

What you really want to change is TCP_RTO_MAX, which is 120 seconds by default on Linux. Lowering it forces the stack to do retransmits earlier, so connections will recover faster once connectivity is restored. Unfortunately, patches adding a sysctl to change it at runtime have repeatedly been dropped, so you have to patch include/net/tcp.h and recompile the kernel.

Additionally, you might want to increase net.ipv4.tcp_retries2 (the maximum retry count for established connections) to avoid the kernel breaking connections earlier due to shorter RTO. Note that this only affects the sending direction, so the kernel has to be modified on both ends of the ssh connection.
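A rough sketch of the arithmetic behind this, in Python. It models the RTO as simply doubling after every unanswered retransmit until it hits the cap, and assumes it starts at 0.2 seconds, which I believe is Linux's minimum. On a real UMTS link the starting value depends on the measured round-trip time and will usually be higher, so treat the numbers as illustrative only. All function names are made up.

```python
def backoff_intervals(retries, rto_initial=0.2, rto_max=120.0):
    """Waits between transmissions: doubling from rto_initial, capped at rto_max."""
    return [min(rto_initial * 2 ** i, rto_max) for i in range(retries + 1)]


def give_up_after(retries, rto_initial=0.2, rto_max=120.0):
    """Seconds of unanswered retransmits before the connection is dropped."""
    return sum(backoff_intervals(retries, rto_initial, rto_max))


def wait_after_outage(outage, rto_initial=0.2, rto_max=120.0):
    """Seconds between the link returning and the next retransmit firing."""
    t, rto = 0.0, rto_initial
    while t < outage:
        t += rto                      # next retransmit fires at time t
        rto = min(rto * 2, rto_max)   # back off, but never beyond the cap
    return t - outage


if __name__ == "__main__":
    for cap in (120.0, 30.0):
        print(f"max RTO {cap:5.0f} s: "
              f"gives up after {give_up_after(15, rto_max=cap):6.1f} s, "
              f"waits {wait_after_outage(120, rto_max=cap):5.1f} s after a 120 s outage")
```

Under these assumptions, 15 retries with the 120 second cap add up to about 924.6 seconds, which agrees with the hypothetical timeout the kernel documentation quotes for the default tcp_retries2. With a cap of 30 seconds the same retry count is used up after about 291 seconds, which is why the retry count needs to go up when the cap goes down. In exchange, the idle wait after a two-minute outage shrinks from roughly 85 seconds to roughly 21 seconds in this model.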

A negative side effect is that the TCP stack behaves worse if a connection really has a very high latency (more spurious retransmits). It might be unwise to reduce the max RTO too much on a public server. The RFCs mention 60 seconds as a lower limit.

A completely different solution might be using rocks (http://freshmeat.net/projects/rocks). However, this project hasn't been maintained for quite some time now, and I don't know if it is still usable today.

Marc 'Zugschlus' Haber on Tuesday, September 29. 2009:

That's really insightful. Even if it doesn't offer a viable solution for me, I have learned a lot. Thanks!

Erik J on Thursday, October 15. 2009:

No, Rocks isn't usable anymore; it crashes all the time. It is meant to handle a different problem, though: Rocks is there to let you keep a TCP connection even when your IP address changes.
