blog dds: 2017-08-15 — Debugging in Practice: dgsh Issue 85

Fixing an insidious bug in the new Unix directed graph shell dgsh allowed me to demonstrate in practice 10 of the 66 principles, techniques, and tools I describe in the book Effective Debugging. Almost all steps are documented in the corresponding issue and commits. Here’s a detailed retrospective.

In the following description I list the titles of the corresponding book sections in parentheses.

Four participants handled the bug through a GitHub issue. (Handle All Problems through an Issue-Tracking System)
We tried to reproduce the problem on diverse systems under various settings. (Diversify Your Build and Execution Environment) This allowed us to find that the problem occurred after sourcing some shell initialization files.
Then, in a series of successive iterations, I cut down the 65-line shell script that triggered the problem and the 8-line interactive script that demonstrated it into a single two-line script. It was thus easy to run the script with a single command. (Enable the Efficient Reproduction of the Problem) Removing the second statement of the first line (changing true || false into true) allowed me to obtain very compact traces of the correctly running and the failing executions. (Minimize the Differences between a Working Example and the Failing Code)
I then used the Unix strace command to record the system calls of the working and the failing script. (Trace the Code’s Execution)
With the two traces at hand I used the Unix grep command to find and display the execve calls of the working and the failing invocation. (Analyze Debug Data with Unix Command-Line Tools) This showed me that the problem was associated with a wrong executable program search path being used. (Find the Difference between a Known Good System and a Failing One) However, I still did not know why the wrong path was used.
When I ran dgsh with debug output enabled, it recorded in the output of the failing system a key line: a variable that was supposed to be temporarily set to false was never set back to its previous value. (Use the Software’s Debugging Facilities)
At that point I was ready to fix the bug. However, before fixing it, I constructed and added a test case that exercised the bug and ran the program’s tests to ensure the test failed. (Find the Fault by Constructing a Test Case) I then corrected the code, and ran all tests again to verify that the bug was fixed and that I had not introduced another fault in the system.
After committing the changes and the test, I removed the various log files and test scripts in order to get a clean output from git status. (Houseclean Before and After Debugging)
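
The trace-and-compare steps above (strace, then grep for execve, then comparing the two runs) can be sketched roughly as follows. This is a minimal illustration and not the exact commands used for the issue: the file names good.trace, bad.trace, good.sh, and bad.sh, the helper function names, and the way dgsh is invoked on a script are all assumptions made for the example.

```shell
#!/bin/sh
# Sketch: compare the execve calls of a working and a failing run.
# All file names and function names here are hypothetical.

# Print only the execve lines of an strace log, dropping the leading
# PID column ("1234 " or "[pid 1234] ") that strace -f adds, so that
# the two runs can be compared line by line.
execve_calls() {
    grep 'execve(' "$1" | sed -E 's/^(\[pid +[0-9]+\] |[0-9]+ +)//'
}

# Show the execve lines that differ between two trace files.
# diff exits with status 1 when the files differ.
compare_execve() {
    execve_calls "$1" > "$1.execve"
    execve_calls "$2" > "$2.execve"
    diff "$1.execve" "$2.execve"
}

# Possible use (needs strace; the dgsh invocation is an assumption):
#   strace -f -o good.trace dgsh good.sh
#   strace -f -o bad.trace  dgsh bad.sh
#   compare_execve good.trace bad.trace
```

In a comparison like this, a program looked up through a different directory in the failing run shows up as a differing execve line, which is the kind of evidence that points to a wrong search path.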