An Embarrassing Failure
My colleague Georgios Gousios and I are studying the impact of software engineering research in practice. As part of our research, we identified award-winning and highly-cited papers, and asked their authors to complete an online survey. Each survey was personalized with the author's name and the paper's title and publication venue. After completing a trial and a pilot run, I decided to contact the large number of remaining authors. This is when things started going horribly wrong.
What went wrong
While the script I had written to send the more than 100 emails was still executing, I checked my in-box to look for bounced emails. Indeed, there were a few bounces and a couple of automatic responses. Amazingly, an author response was already there. Its content caused my heart to skip a beat.
Hi Diomidis and Georgios, Happy to participate in the survey, but it seems there is some matching problem, as I am not co-author of the first paper that shows up in the survey. Cheers,
I clicked on the link, and indeed a survey came up with the wrong author and paper.
Had I just demonstrated my sloppiness and incompetence to the brightest minds in software engineering? My hope was that this was just a one-off data entry problem. To check, I quickly opened the table with the author details. Worryingly, these were correct. A minute later another email appeared in my in-box.
Thank you for the invitation. The included link is for work by a different author. Thx.
Two minutes later I got another similar one.
Dear Diomidis, The link takes me to a link that addresses me as "[...]" and ask me about "[...]" a 1995 TOSEM paper I didn't write. Best,
By that time all emails had been sent, and I was frantically trying to understand what was going on.
- I clicked on the provided links myself, and got the correct survey forms.
- Could it be a web browser incompatibility issue? I tried a different browser, but the correct page still came up.
- Could the issue be associated with the client operating system? I fetched the page from my mobile phone, and it was correct. I tried a different desktop operating system, and still the correct page appeared.
- Could the network be playing tricks on me? I logged in to a host in the US, and fetched the page with curl. The correct page came up.
- Maybe the web server was delivering a few of the requested pages incorrectly? I wrote a small shell script to compare each local file with the web server's response. All responses were correct.

for i in * ; do
    curl -s "https://istlab.dmst.aueb.gr/s/$i/" | diff - "./$i/index.html"
done
- Could the fault be due to a race condition? I fired two loops continuously requesting different pages and comparing the results. No problem here either.
- Could the hashing algorithm I was using to shorten the links be causing collisions? I looked at my code, and there was a check against that possibility.
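A check along the lines of the race-condition test can be sketched as follows. This is a minimal, hypothetical reconstruction: fetch_page stands in for the curl call against the live server, and the two sample pages are made up for illustration.

```shell
# Hypothetical sketch of the race-condition check: two concurrent loops
# each fetch one page repeatedly and compare it with the local copy.
mkdir -p pageA pageB
echo "survey A" > pageA/index.html
echo "survey B" > pageB/index.html

fetch_page() {
    # Stand-in for: curl -s "https://istlab.dmst.aueb.gr/s/$1/"
    cat "./$1/index.html"
}

hammer() {
    for i in $(seq 1 50) ; do
        fetch_page "$1" | diff -q - "./$1/index.html" || echo "mismatch: $1"
    done
}

# Run the two loops concurrently and wait for both to finish.
hammer pageA &
hammer pageB &
wait
```

If the server ever returned the wrong page while both loops were hammering it, the diff would report a mismatch; silence means the responses stayed consistent under concurrent load.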
At that time another email appeared in my in-box.
Diomidis, I checked the links you provided. Two of the papers were not written by me. I received the [...] SIGSOFT Most Influential Paper Award (although I don't remember what paper it was for but I think it was in the Int. Software Engineering Conference). It is not included. I also wrote a paper that has been cited [...] times (by Google Scholar), reprinted many times, and even translated into braille and sound recordings for the blind. That paper is not included and neither are my papers that have had the most influence on practice (including several on a technique that had 300 people attend a users meeting last spring from 24 countries and has other user meetings around the world). So I'm unsure that your protocol is going to find the most influential papers in industry (as opposed to those cited by researchers but never really used on real projects, i.e., having practical impact).
It really hurt. Here I was, wasting the time of prominent software engineering researchers with what was probably a software bug.
At that point I stopped looking for the fault, and devoted my time to damage control. I responded to the emails, providing an alternative way to reach the correct survey page. I also notified my co-author, Georgios, to get his view, because I was at my wits' end.
An hour later the situation entered the Twilight Zone. One of the authors wrote to tell me that the one link he had originally received correctly pointed to a paper written with a deceased co-author, and that this paper was missing from the corrective email I had sent.
Diomidis, Now you left out the one that was correct originally :-). It was the paper with [...] on [...]. [...] is deceased so only I will be able to answer the survey for that one.
That email ruled out another possibility: the fault could not have been caused by mixed-up web server responses, because the deceased co-author was definitely not trying to complete the survey at the same time.
It was getting late, Georgios was not responding (it was Sunday evening), and I was getting nowhere. Thankfully, the complaints regarding the wrong survey had stopped arriving, and I was also getting emails from authors with other questions about the survey, so at least some of the sent links had been received correctly. I reasoned that maybe the tests I ran had somehow corrected the problem by refreshing a cache in the web server or operating system. Consequently, I decided to follow the advice I give in my book Effective Debugging and sleep on the problem.
It was an uneasy sleep. I woke up at about 6am, before my alarm rang, with the answer to the mystery clear in my head.
The survey email sending program worked by creating, for each publication, a short URL link that pointed to a redirection page with the full details of the corresponding publication and author. I did this to prevent the mangling of the long URLs by email clients and servers. The redirection pages were generated locally, and once all emails were sent, a separate command uploaded them to the web server. Due to the large number of emails, these started arriving before the sending program had uploaded the redirection pages to the server. Consequently, authors clicking on the redirection link were receiving data from a July 2017 test run. (I verified this by retrieving the older files from backup storage.) The test run had generated redirection pages for all authors, whereas the actual run excluded the authors that had already received emails from the trial and pilot runs. Therefore, the shortened URLs, which were generated by encrypting a counter, were out of sync, resulting in links that were incorrect until the correct pages were uploaded to the server. This was also the reason why the problem didn't surface in the previous runs.
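The failure mode can be illustrated with a minimal sketch. The author names below are made up, and a bare counter stands in for the encrypted one; the real scheme differed, but the mapping problem is the same.

```shell
# Minimal sketch of the failure mode: short links are derived from a
# running counter, so the counter-to-author mapping depends on which
# authors a given run includes.
make_links() {
    counter=0
    for author in "$@" ; do
        counter=$((counter + 1))
        echo "/s/$counter -> $author"   # stand-in for encrypting the counter
    done
}

echo "Test run (all authors):"
make_links alice bob carol

echo "Actual run (bob was already contacted in the pilot):"
make_links alice carol
```

Link /s/2 points to bob in the test run but to carol in the actual run: until the fresh pages were uploaded, the server kept serving the stale test-run mapping.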
Thankfully, the problem affected only four respondents; a different fault could have had much worse results. What lessons can we learn from this?
- End-to-end testing cannot guarantee the correctness of a process. Code reviews and pair programming are valuable additional safeguards. They should be used, even when writing code for research purposes, when failures can affect many people. In my case a second pair of relatively competent eyes went over the code, but did not brutally scrutinize every element of the process.
- Review both the code and the scripts associated with its operations.
- Execute risky processes in small batches, leaving ample time between them to let any problems surface.
- Clean up after test runs, or, better, execute them in a separate throwaway environment. This would not have prevented the problem, but would have resulted in a much easier-to-track failure mode.
- Publishing post-mortem investigations of our failures should be standard practice for people working in production and computing academics alike. This can help us all improve by avoiding past mistakes.
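The small-batches advice above can be sketched as follows. Everything here is a hypothetical placeholder: send_survey_email stands in for the real mailer, and the batch size and delay are arbitrary.

```shell
# Hypothetical batching sketch: send in batches of ten, pausing between
# batches so any problems can surface before the next one goes out.
send_survey_email() {
    echo "sent: $1"    # placeholder for the real mailer
}

batch_send() {
    n=0
    while read -r recipient ; do
        send_survey_email "$recipient"
        n=$((n + 1))
        if [ $((n % 10)) -eq 0 ] ; then
            sleep "${BATCH_DELAY:-600}"   # ten minutes between batches
        fi
    done
}

# Example invocation with two made-up recipients.
printf 'alice@example.org\nbob@example.org\n' | batch_send
```

Had the original mailing gone out this way, the first complaints would have arrived while most of the batch was still unsent.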
A note on the survey
If you've performed software engineering research that has had a direct impact on software development practice, or if you would like to identify such a paper, please help us by completing this survey.