blog dds: 2004-08-23 — Detective Work and Dropped TCP Connections

I had problems with TCP connections (mostly long-lasting ssh sessions) getting dropped on my ADSL line. In the end, I found that the problem had two different roots. The detective work behind establishing them is, I believe, interesting. It also shows how accessible source code, and the will to use it, can be a tremendous boost to difficult system administration problems.

The Idle ssh Connections

The first problem involved idle ssh connections getting disconnected after some time. I knew that the cause was my router clearing NAT entries of idle connections, and ensured that the sshd KeepAlive option was set. This did not solve the problem. However, I notice that not all hosts I used were dropping their idle connections, so I started by tracing packets to a host that dropped and one that didn't:

windump -i 6 host freefall.freebsd.org or host istlab.dmst.aueb.gr
windump: listening on \Device\NPF_{688215B7-A2BE-4953-BC81-114456AEE710}
18:20:52.424148 IP freefall.freebsd.org.22 > eagle.spinellis.gr.4316: . ack 3065 637357 win 58400
18:20:52.424168 IP eagle.spinellis.gr.4316 > freefall.freebsd.org.22: . ack 1 win 62972 (DF)
18:30:52.742158 IP freefall.freebsd.org.22 > eagle.spinellis.gr.4316: . ack 1 win 58400
18:30:52.742188 IP eagle.spinellis.gr.4316 > freefall.freebsd.org.22: . ack 1 win 62972 (DF)
18:40:52.992644 IP freefall.freebsd.org.22 > eagle.spinellis.gr.4316: . ack 1 win 58400
18:40:52.992669 IP eagle.spinellis.gr.4316 > freefall.freebsd.org.22: . ack 1 win 62972 (DF)
18:50:53.247049 IP freefall.freebsd.org.22 > eagle.spinellis.gr.4316: . ack 1 win 58400
18:50:53.247066 IP eagle.spinellis.gr.4316 > freefall.freebsd.org.22: . ack 1 win 62972 (DF)
74742 packets received by filter
0 packets dropped by kernel

As you can see, freefall was sending keep-alive packets, but istlab wasn't.

Next step: examine the sshd source to see how the KeepAlive option is implemented:


$ grep KeepAlive *.c
readconf.c:     { "keepalive", oKeepAlives },

Examine readconf.c


        case oKeepAlives:
                intptr = &options->keepalives;
                goto parse_flag;

Again, grep for the variable:


$ grep keepalives *.c
sshd.c: if (options.keepalives &&

Examine the source:


        if (options.keepalives &amp;&amp;
            setsockopt(sock_in, SOL_SOCKET, SO_KEEPALIVE, &amp;on,
            sizeof(on)) < 0)
                error("setsockopt SO_KEEPALIVE: %.100s", strerror(errno));

See how the kernel implements SO_KEEPALIVE:


$ grep SO_KEEPALIVE */*.c
netinet/tcp_timer.c:         tp->t_inpcb->inp_socket->so_options & SO_KEEPALIVE) &&

Examine tcp_timer.c:


        if ((always_keepalive ||
             tp->t_inpcb->inp_socket->so_options & SO_KEEPALIVE) &&
            tp->t_state <= TCPS_CLOSING) {
                if ((ticks - tp->t_rcvtime) >= tcp_keepidle + tcp_maxidle)
                        goto dropit;

Look for tcp_keepidle:


int     tcp_keepidle;
SYSCTL_PROC(_net_inet_tcp, TCPCTL_KEEPIDLE, keepidle, CTLTYPE_INT|CTLFLAG_RW,
    &tcp_keepidle, 0, sysctl_msec_to_ticks, "I", "");

Aha, a sysctrl variable! Examine it on both systems:


istlab$ sysctl net.inet.tcp.keepidle
net.inet.tcp.keepidle: 7200000


freefall$ sysctl net.inet.tcp.keepidle
net.inet.tcp.keepidle: 600000

Presto, problem found. As always, the resolution is a lot simpler than the location of the problem. A new value for the variable and a new setting in /etc/sysctrl.conf, solved the problem for now and forever.

Of course, as Dave Thomas and Andy Hunt say:

coding ain't done till all tests run

windump -i 6 host freefall.freebsd.org or host istlab.dmst.aueb.gr
windump: listening on \Device\NPF_{688215B7-A2BE-4953-BC81-114456AEE710}
21:09:11.331698 IP freefall.freebsd.org.22 > eagle.spinellis.gr.4316: . ack 3065 651769 win 58400
21:09:11.331708 IP eagle.spinellis.gr.4316 > freefall.freebsd.org.22: . ack 1 wi n 63080 (DF)
21:10:00.749281 IP istlab.dmst.aueb.gr.22 > eagle.spinellis.gr.4527: . ack 37848 16799 win 58400
21:10:00.749298 IP eagle.spinellis.gr.4527 > istlab.dmst.aueb.gr.22: . ack 1 win 63780 (DF)

QED.

Moral

Observe.
Use the Source Luke!

The Dropped ppp Link

The other problem involved the ADSL PPP connection dropping every six hours. The suggestion offered by my ISPs helpdesk (at three different instances of the problem) was to reboot the SpeedTouch 530 rooter, because it was getting stuck (again, and again). They claimed that small routers tend to crash and often require rebooting. The persistent nature of the problem, and the fact that the connection was dropped after approximatelly six hours convinced me that the problem was more complicated.

I let the router operate without reboot for about a day, and observed the log. The router was picking a new global IP address after the link wend down, which happened 1h after the router got a new internal DHCP address (the same one). See the following three groups:


19:00:17 PPP link up (PPPoA_1) [194.219.141.49]
18:59:58 PPP link down (PPPoA_1) [212.54.221.194]
17:59:58 DHCP lease ip-address 192.168.136.30 bound to intf eth0
17:59:58 DHCP intf eth0 renews lease ip-address 192.168.136.30

12:51:38 PPP link up (PPPoA_1) [212.54.221.194]
12:51:20 PPP link down (PPPoA_1) [194.219.73.233]
11:59:58 DHCP lease ip-address 192.168.136.30 bound to intf eth0
11:59:58 DHCP intf eth0 renews lease ip-address 192.168.136.30

07:00:19 PPP link up (PPPoA_1) [194.219.73.233]
07:00:00 PPP link down (PPPoA_1) [194.219.73.150]
06:00:00 DHCP lease ip-address 192.168.136.30 bound to intf eth0
06:00:00 DHCP intf eth0 renews lease ip-address 192.168.136.30

00:00:22 PPP link up (PPPoA_1) [194.219.73.150]
00:00:16 xDSL linestate up (downstream: 448 kbit/s, upstream: 160 kbit/s)
00:00:02 DHCP Auto DHCP: server detected on LAN, own dhcp server disabled
00:00:02 DHCP lease ip-address 192.168.136.30 bound to intf eth0
00:00:02 DHCP 192.168.136.30 (255.255.255.0) set on intf eth0: ok.

Consistently, one hour after the DHCP lease was obtained the link wend down and up again.

"Once is happenstance. Twice is coincidence. Three times is enemy action." (Auric Goldfinger).

So I assumed the problem was a router bug: "a new internal DHCP lease? Let's have fun restarting the external connection."

I increased the DHCP lease time for the router to 3 years, and the problem went away.

Morals

Observe and gather the facts, keeping notes.
Use hour/day/week multiples for DHCP/DNS/etc leases and timeouts. They can make time dependent problems stand out, in situations where a round decimal number of seconds wouldn't.

Comments Post Toot! Tweet Share