Detective Work and Dropped TCP Connections


I had problems with TCP connections (mostly long-lasting ssh sessions) getting dropped on my ADSL line. In the end, I found that the problem had two different roots. The detective work behind establishing them is, I believe, interesting. It also shows how accessible source code, and the will to use it, can be a tremendous boost to difficult system administration problems.

The Idle ssh Connections

The first problem involved idle ssh connections getting disconnected after some time. I knew that the cause was my router clearing NAT entries of idle connections, and ensured that the sshd KeepAlive option was set. This did not solve the problem. However, I notice that not all hosts I used were dropping their idle connections, so I started by tracing packets to a host that dropped and one that didn't:

windump -i 6 host or host
windump: listening on \Device\NPF_{688215B7-A2BE-4953-BC81-114456AEE710}
18:20:52.424148 IP > . ack 3065 637357 win 58400
18:20:52.424168 IP > . ack 1 win 62972 (DF)
18:30:52.742158 IP > . ack 1 win 58400
18:30:52.742188 IP > . ack 1 win 62972 (DF)
18:40:52.992644 IP > . ack 1 win 58400
18:40:52.992669 IP > . ack 1 win 62972 (DF)
18:50:53.247049 IP > . ack 1 win 58400
18:50:53.247066 IP > . ack 1 win 62972 (DF)
74742 packets received by filter
0 packets dropped by kernel
As you can see, freefall was sending keep-alive packets, but istlab wasn't.

Next step: examine the sshd source to see how the KeepAlive option is implemented:

$ grep KeepAlive *.c
readconf.c:     { "keepalive", oKeepAlives },
Examine readconf.c
        case oKeepAlives:
                intptr = &options->keepalives;
                goto parse_flag;
Again, grep for the variable:
$ grep keepalives *.c
sshd.c: if (options.keepalives &&
Examine the source:
        if (options.keepalives &&
            setsockopt(sock_in, SOL_SOCKET, SO_KEEPALIVE, &on,
            sizeof(on)) < 0)
                error("setsockopt SO_KEEPALIVE: %.100s", strerror(errno));
See how the kernel implements SO_KEEPALIVE:
$ grep SO_KEEPALIVE */*.c
netinet/tcp_timer.c:         tp->t_inpcb->inp_socket->so_options & SO_KEEPALIVE) &&
Examine tcp_timer.c:
        if ((always_keepalive ||
             tp->t_inpcb->inp_socket->so_options & SO_KEEPALIVE) &&
            tp->t_state <= TCPS_CLOSING) {
                if ((ticks - tp->t_rcvtime) >= tcp_keepidle + tcp_maxidle)
                        goto dropit;
Look for tcp_keepidle:
int     tcp_keepidle;
    &tcp_keepidle, 0, sysctl_msec_to_ticks, "I", "");
Aha, a sysctrl variable! Examine it on both systems:
istlab$ sysctl net.inet.tcp.keepidle
net.inet.tcp.keepidle: 7200000
freefall$ sysctl net.inet.tcp.keepidle
net.inet.tcp.keepidle: 600000
Presto, problem found. As always, the resolution is a lot simpler than the location of the problem. A new value for the variable and a new setting in /etc/sysctrl.conf, solved the problem for now and forever.

Of course, as Dave Thomas and Andy Hunt say:

coding ain't done till all tests run
windump -i 6 host or host
windump: listening on \Device\NPF_{688215B7-A2BE-4953-BC81-114456AEE710}
21:09:11.331698 IP > . ack 3065 651769 win 58400
21:09:11.331708 IP > . ack 1 wi n 63080 (DF)
21:10:00.749281 IP > . ack 37848 16799 win 58400
21:10:00.749298 IP > . ack 1 win 63780 (DF)


  • Observe.
  • Use the Source Luke!

The Dropped ppp Link

The other problem involved the ADSL PPP connection dropping every six hours. The suggestion offered by my ISPs helpdesk (at three different instances of the problem) was to reboot the SpeedTouch 530 rooter, because it was getting stuck (again, and again). They claimed that small routers tend to crash and often require rebooting. The persistent nature of the problem, and the fact that the connection was dropped after approximatelly six hours convinced me that the problem was more complicated.

I let the router operate without reboot for about a day, and observed the log. The router was picking a new global IP address after the link wend down, which happened 1h after the router got a new internal DHCP address (the same one). See the following three groups:

19:00:17 PPP link up (PPPoA_1) []
18:59:58 PPP link down (PPPoA_1) []
17:59:58 DHCP lease ip-address bound to intf eth0
17:59:58 DHCP intf eth0 renews lease ip-address

12:51:38 PPP link up (PPPoA_1) []
12:51:20 PPP link down (PPPoA_1) []
11:59:58 DHCP lease ip-address bound to intf eth0
11:59:58 DHCP intf eth0 renews lease ip-address

07:00:19 PPP link up (PPPoA_1) []
07:00:00 PPP link down (PPPoA_1) []
06:00:00 DHCP lease ip-address bound to intf eth0
06:00:00 DHCP intf eth0 renews lease ip-address

00:00:22 PPP link up (PPPoA_1) []
00:00:16 xDSL linestate up (downstream: 448 kbit/s, upstream: 160 kbit/s)
00:00:02 DHCP Auto DHCP: server detected on LAN, own dhcp server disabled
00:00:02 DHCP lease ip-address bound to intf eth0
00:00:02 DHCP ( set on intf eth0: ok.
Consistently, one hour after the DHCP lease was obtained the link wend down and up again.
"Once is happenstance. Twice is coincidence. Three times is enemy action." (Auric Goldfinger).
So I assumed the problem was a router bug: "a new internal DHCP lease? Let's have fun restarting the external connection."

I increased the DHCP lease time for the router to 3 years, and the problem went away.


  • Observe and gather the facts, keeping notes.
  • Use hour/day/week multiples for DHCP/DNS/etc leases and timeouts. They can make time dependent problems stand out, in situations where a round decimal number of seconds wouldn't.

Comments   Toot! Share

Last modified: Monday, August 23, 2004 11:12 am

Creative Commons Licence BY NC

Unless otherwise expressly stated, all original material on this page created by Diomidis Spinellis is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.