blog dds: 2020-03-06 — What can software developers learn from the Soviet Moon Landing Program?

In the twentieth-century space race between the Soviet Union and the United States, the former started way ahead. In 1957 it launched the first artificial satellite, Sputnik 1, and in 1961 it put Yuri Gagarin in orbit around the Earth as the first human in space. Yet, when it came to landing a person on the Moon it flopped spectacularly, abandoning its N1 rocket and Soyuz spacecraft program after a series of fiery failures. It turns out that the problems of the Soviet program’s N1 rocket (one of which caused one of the largest artificial non-nuclear explosions in human history) offer some important lessons to software developers.

The 105-meter-tall N1 was a huge heavy-lift launch vehicle designed for crewed travel to the Moon and beyond. Its first stage, Block A, continues to be the most powerful rocket stage ever built. It was powered by 30 NK-15 engines arranged in two rings: the main ring of 24 at the outer edge of the booster and the core propulsion system of 6 inner engines. The engines were the first ever staged combustion cycle engines. The control system was primarily based on differential throttling of the engines of the outer ring for pitch and yaw. The block’s 45 MN thrust far exceeded the Saturn V’s 34 MN.

The N1 rocket on the Baikonur launch pad

The rocket’s operational history can be summarized as a series of failed launch attempts.

On February 21st, 1969, a few seconds into the first launch, a transient voltage caused KORD, the rocket’s analogue computer, to shut down Engine #12. After this happened, the KORD shut off Engine #24 to maintain symmetrical thrust. At T+6 seconds, self-reinforcing oscillation in the #2 engine tore several components off their mounts and started a propellant leak. At T+25 seconds, further vibrations ruptured a fuel line and caused kerosene propellant to spill into the aft section of the booster. When it came into contact with the leaking gas, a fire started. The fire then burned through wiring in the power supply, causing electrical arcing which was picked up by sensors and interpreted by the KORD as a pressurization problem in the turbopumps. The KORD responded by issuing a general command to shut down the entire first stage at T+68 seconds into launch. Investigators discovered the remains of the rocket 32 miles (52 kilometers) from the launch pad.

On the second launch, on July 3rd, 1969, just before liftoff, the liquid oxygen turbopump in the #8 engine exploded. The resultant shock wave severed surrounding propellant lines and started a fire from leaking fuel. For a few moments, the rocket lifted into the night sky. The fire damaged various components in the thrust section, leading to the engines gradually being shut down between T+10 and T+12 seconds. As soon as the rocket cleared the tower, there was a flash of light, and debris could be seen falling from the bottom of the first stage. All the engines instantly shut down except engine #18. This caused the N1 to lean over at a 45-degree angle and drop back onto the launch pad. The nearly 2300 tons of propellant on board triggered a massive blast and shock wave that shattered windows across the launch complex and sent debris flying as far as 10 kilometers from the explosion’s center. Launch crews were permitted outside half an hour after the accident and encountered droplets of unburned kerosene still raining down from the sky. The launch complex was thoroughly leveled by the blast. It was later photographed by American satellites, disclosing that the Soviet Union was building a Moon rocket.

It took 18 months to rebuild the launch pad, so the third launch happened on June 26th, 1971. Soon after lift-off, due to unexpected eddies and counter-currents at the base of the first stage, the N1 experienced an uncontrolled roll beyond the capability of the control system to compensate. KORD sensed an abnormal situation and sent a shutdown command to the first stage. However, to avoid catastrophic accidents such as that of the preceding launch, the guidance program had since been modified to prevent shutdowns until 50 seconds into launch. The roll, which had initially been 6° per second, began rapidly accelerating. At T+39 seconds, the booster was rolling at nearly 40° per second, and at T+48 seconds, the vehicle disintegrated from structural loads. Finally, at T+50 seconds, the cutoff command to the first stage was unblocked and the engines immediately shut down. Despite this, the first and second stages still had enough momentum to travel for some distance before falling to earth about 15 kilometers from the launch complex and blasting a 15-meter-deep (50-foot) crater in the steppe.

On the fourth attempt, on November 23rd, 1972, the start and lift-off went well. At T+90 seconds, a programmed shutdown of the core propulsion system (the six center engines) was performed to reduce structural stress on the booster. Because of excessive dynamic loads caused by a hydraulic shock wave when the six engines were shut down abruptly, lines for feeding fuel and oxidizer to the core propulsion system burst, spilling fuel and oxidizer onto the shut-down but still hot engines. This started a fire in the booster’s boattail. In addition, the #4 engine exploded, probably due to a failed turbopump. The first stage broke up starting at T+107 seconds. The upper stages were ejected from the stack and crashed into the steppe.

Mercifully, the Soviet N1-L3 crewed Moon landing program was canceled in May 1974 and no further launches took place.

If you’re a software developer and have read this far, I’m sure you’re wondering: how could these failures have happened? How come the failure causes were not discovered during testing? Could it be that the rocket was being launched without ever being fully tested, relying on actual launches to discover and iron out problems?

It turns out that this was exactly the case. Although various component tests were carried out, it seems that the rocket’s stages were certified for flight tests based on individual component tests. Unlike the US Kennedy Space Center Launch Complex 39, the N1’s Baikonur launch complex could not be reached by heavy barge. To allow transport by rail, all of the rocket’s stages had to be shipped in pieces and assembled at the launch site. This led to difficulties in testing.

In addition, the NK-15 engines had a number of valves that, to save weight, were activated by pyrotechnics rather than hydraulic or mechanical means. Once shut, the valves could not be re-opened. This meant that the first stage’s engines were only test-fired individually and the entire cluster of 30 engines was never static test fired as a unit. Even so, it seems that only two out of every batch of six engines were tested. As a result, the complex and destructive vibrational modes (which ripped apart propellant lines and turbines), as well as exhaust plume and fluid dynamic problems (causing vehicle roll, vacuum cavitation, and other problems), were not discovered and worked out before flight.

What lessons can software developers learn from this sad story?

First, build software components that can be easily tested. Just as Baikonur’s poor transport links and the pyrotechnic valves prevented the full testing of the N1, large methods, components with complex dependencies and interactions, and reliance on non-deterministic inputs or events can hinder the software’s testability. Implementing small, loosely coupled components allows them to be easily unit-tested in isolation. Abstracting non-deterministic inputs and events into components that can be instrumented to provide test data can help create realistic test scenarios.
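As a minimal sketch of abstracting a non-deterministic input, consider time. The names below (`SessionTimer`, `FakeClock`, `demo`) are hypothetical and exist only for illustration. The component receives its clock as a parameter instead of reading the system clock directly, so a test can drive time by hand.

```python
import time
from typing import Callable, Tuple


class SessionTimer:
    """Decides whether a session has expired.

    The clock is injected rather than read directly, so tests can
    substitute a deterministic source of time. (Hypothetical example.)
    """

    def __init__(self, timeout_s: float,
                 clock: Callable[[], float] = time.monotonic):
        self._timeout_s = timeout_s
        self._clock = clock
        self._started_at = clock()

    def expired(self) -> bool:
        # Expired once at least timeout_s seconds have elapsed.
        return self._clock() - self._started_at >= self._timeout_s


class FakeClock:
    """Test double: time advances only when the test says so."""

    def __init__(self, now: float = 0.0):
        self.now = now

    def __call__(self) -> float:
        return self.now

    def advance(self, seconds: float) -> None:
        self.now += seconds


def demo() -> Tuple[bool, bool]:
    """Checks expiry before and after 30 simulated seconds."""
    clock = FakeClock()
    timer = SessionTimer(timeout_s=30.0, clock=clock)
    before = timer.expired()
    clock.advance(30.0)  # no real waiting involved
    after = timer.expired()
    return before, after
```

In production code the default `time.monotonic` is used. In tests the fake clock makes the timeout scenario repeatable and instantaneous.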

Second, don’t skimp on integration testing. The N1’s developers carried out various ground tests, including strength, vacuum, high temperature, and pressure integrity tests; testing of mechanical and pyrotechnic separation and docking systems; and investigation of gas-dynamic processes during launch and stage separation. However, problems associated with the complex interactions of the rocket’s systems weren’t discovered, because these were apparently never tested as a whole. Similarly, for your software, apart from unit tests, you want to run integration tests that will verify its operation in a realistic environment. This means you must have a test environment that mirrors the production one as closely as possible, stress load generators, and test automation (possibly in a continuous integration environment) that allows you to easily run all tests whenever needed.
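To make the distinction concrete, here is a small sketch of an integration test. The `OrderStore` and `Billing` classes are hypothetical, and an in-memory SQLite database stands in for a production-like one. Rather than replacing the store with a mock, the test wires the real components together and checks the result of a full round trip.

```python
import sqlite3
import unittest


class OrderStore:
    """Persists orders in SQLite (hypothetical storage component)."""

    def __init__(self, conn: sqlite3.Connection):
        self._conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS orders "
            "(id INTEGER PRIMARY KEY, cents INTEGER NOT NULL)"
        )

    def add(self, cents: int) -> int:
        cur = self._conn.execute(
            "INSERT INTO orders (cents) VALUES (?)", (cents,)
        )
        self._conn.commit()
        return cur.lastrowid

    def total_cents(self) -> int:
        row = self._conn.execute(
            "SELECT COALESCE(SUM(cents), 0) FROM orders"
        ).fetchone()
        return row[0]


class Billing:
    """Business logic that depends on the store (hypothetical)."""

    def __init__(self, store: OrderStore):
        self._store = store

    def place_order(self, price: float) -> int:
        # Store integer cents to avoid accumulating float errors.
        return self._store.add(round(price * 100))

    def revenue(self) -> float:
        return self._store.total_cents() / 100


class BillingIntegrationTest(unittest.TestCase):
    """Exercises Billing and OrderStore together against a real database."""

    def setUp(self):
        self.conn = sqlite3.connect(":memory:")
        self.billing = Billing(OrderStore(self.conn))

    def tearDown(self):
        self.conn.close()

    def test_orders_round_trip_through_database(self):
        self.billing.place_order(19.99)
        self.billing.place_order(0.01)
        self.assertEqual(self.billing.revenue(), 20.0)
```

Saved in a file, such a test can be run with `python -m unittest`, which makes it straightforward to include in an automated build. A closer mirror of production would use the same database engine and configuration as the deployed system.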

—/—

If you found this post interesting, see also the related one Lessons from Space in which Henry Spencer and I discuss what positive lessons software developers can learn from the much more successful Soviet (and now Russian) Soyuz program.

Note: Material for this post related to the N1 Rocket was derived or excerpted from the corresponding Wikipedia article, which is available under the Creative Commons Attribution-ShareAlike License.
