System administration stories: The Revolt
Can a small embedded system the size of a paperback lead a group of machines into revolt? Apparently yes.
This week the power company (DEI/PPC) graced us with a power failure
Sep 3 02:53:19 spiti upsmon: UPS spiti-ups@localhost on battery
which lasted more than the UPS batteries could hold
Sep 3 03:20:07 spiti upsmon: UPS spiti-ups@localhost battery is critical
so the main system providing DHCP, DNS, mail, and bootp services was shut down
Sep 3 03:20:07 spiti upsmon: Executing automatic power-fail shutdown
until power came back, almost an hour later
Sep 3 03:42:48 spiti /kernel: FreeBSD 4.10-STABLE #4: Tue Aug 31 02:41:28 EEST
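The timestamps in the log lines above are enough to work out the timeline. The following is only a rough sketch: it assumes all four entries fall on the same day, and it treats the first kernel boot message as the moment power returned, which is slightly late. The helper name `parse_syslog_time` is made up for this illustration.

```python
from datetime import datetime

def parse_syslog_time(stamp: str) -> datetime:
    """Parse a classic syslog timestamp such as 'Sep 3 02:53:19'.

    Syslog omits the year, so strptime falls back to 1900; that is
    harmless here because only differences within one day are used.
    """
    return datetime.strptime(stamp, "%b %d %H:%M:%S")

on_battery = parse_syslog_time("Sep 3 02:53:19")   # UPS switches to battery
critical = parse_syslog_time("Sep 3 03:20:07")     # battery critical, shutdown
booted = parse_syslog_time("Sep 3 03:42:48")       # first kernel boot message

battery_runtime = critical - on_battery   # how long the UPS held the load
total_outage = booted - on_battery        # power failure until reboot

print("UPS held for", battery_runtime)
print("Server back after", total_outage)
```

By this estimate the UPS carried the server for a little under 27 minutes, and the server was back roughly 49 minutes after the power failed.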
In the morning I found that all diskless machines (the DNARD Shark and a 133MHz Pentium MP3 player), the wavelan bridge, and the SpeedTouch ADSL router had disappeared from the network. The link lights on the hub were lit, but the machines would not respond to pings. Thinking the hub had failed, I started patching them together, to no avail. I suspected a bad network card on the server, but the same problem occurred when pinging from other machines as well.
Suddenly the solution dawned on me like a flash. The ADSL router contains an embedded DHCP server, which, helpfully, is automatically disabled if it finds another one on the network. When the power came back, the ADSL router was running long before the normal server had a chance to boot. Finding no other DHCP server, its own started distributing IP addresses to the diskless machines from its 10.0.0.* pool. The server, having a 192.168.* address, was thus unable to reach the revolting group. Rebooting each and every diskless machine solved the problem.
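A quick way to spot such a revolt is to check which clients hold an address outside the server's network. The sketch below uses Python's standard ipaddress module. The prefix lengths are assumptions (the story only gives 10.0.0.* and 192.168.*), and the host addresses are invented for illustration.

```python
import ipaddress

# Assumed prefix lengths; only 10.0.0.* and 192.168.* are known.
ROUTER_POOL = ipaddress.ip_network("10.0.0.0/24")
SERVER_LAN = ipaddress.ip_network("192.168.0.0/16")

def same_subnet(host: str, network: ipaddress.IPv4Network) -> bool:
    """Return True if the host address lies inside the given network."""
    return ipaddress.ip_address(host) in network

def stray_clients(leases: dict, expected=SERVER_LAN) -> list:
    """Names of clients whose address is outside the expected network."""
    return sorted(name for name, addr in leases.items()
                  if not same_subnet(addr, expected))

# Hypothetical leases after the power came back.
leases = {
    "shark": "10.0.0.2",        # lease from the router's pool
    "mp3player": "10.0.0.3",    # lease from the router's pool
    "laptop": "192.168.1.20",   # lease from the normal server
}

print(stray_clients(leases))
```

Any machine listed here got its lease from the wrong DHCP server and cannot be reached from the 192.168.* side without routing between the two networks.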
You might ask, why was the router configured as a DHCP server? There is an interesting and simple answer to that. When I installed the SpeedTouch 530 router and tried to disable the built-in DHCP server I found that the corresponding command
dhcp server config state=disabled
would crash the router. I left it at that, believing that the router's auto DHCP setting was adequate, but apparently it isn't. This is the second bug I have hit on this router in less than a month.
Following this incident, and unsuccessful attempts to get support from Thomson and the local PTT (OTE) that sold me the device, I was able to set up a workaround by adding the following lines to a configuration file I uploaded:
[ dhcp.ini ]
config autodhcp=on scantime=10 state=disabled trace=off