How to Perform Set Operations on Terabyte Files

The Unix sort command can efficiently handle files of arbitrary size (think of terabytes). It does this by loading into main memory all the data that can fit into it (say 16GB), sorting that data efficiently using an O(N log N) algorithm, writing each sorted chunk to a temporary file, and then merging the sorted chunks at a linear O(N) cost. If the number of sorted chunks is higher than the number of files (typically more than 1000) that the merge operation can keep open simultaneously, then sort will recursively merge-sort intermediate merged files. Once you have at hand sorted files with unique elements, you can efficiently perform set operations on them through linear complexity O(N) operations. Here is how to do it.
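As a rough illustration of the idea, here is a minimal sketch using the standard sort and comm commands on two tiny sample files. The file names, the sample contents, and the three helper functions are made up for this example; the full post may take a different route.

```shell
#!/bin/sh
# Sketch under stated assumptions: a.txt and b.txt are hypothetical inputs
# with one element per line; the set_* helpers are illustrative names.
export LC_ALL=C    # make sort and comm agree on byte-order collation

workdir=$(mktemp -d)
printf '%s\n' pear apple fig apple > "$workdir/a.txt"
printf '%s\n' fig kiwi apple > "$workdir/b.txt"

# Turn each file into a set: sorted, with duplicate lines removed.
sort -u "$workdir/a.txt" > "$workdir/a.set"
sort -u "$workdir/b.txt" > "$workdir/b.set"

# Each operation makes a single linear pass over its already-sorted inputs.
set_union()        { sort -m -u "$1" "$2"; }   # merge without re-sorting
set_intersection() { comm -12 "$1" "$2"; }     # lines present in both files
set_difference()   { comm -23 "$1" "$2"; }     # lines only in the first file

echo "Union:";        set_union        "$workdir/a.set" "$workdir/b.set"
echo "Intersection:"; set_intersection "$workdir/a.set" "$workdir/b.set"
echo "Difference:";   set_difference   "$workdir/a.set" "$workdir/b.set"
```

The expensive step is the initial sort -u of each file. Setting LC_ALL=C matters because comm expects its inputs to be ordered by the same collation rules that sort used.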

The Shoemaker's Children Go Barefoot

Earlier today I submitted the camera-ready version of a technical briefing on mining Git repositories, which Georgios Gousios and I will be presenting at the 2018 International Conference on Software Engineering. I was struck by the complexity and inefficiency of the administrative process.

Reviving the 1973 Unix Programmer's Manual

The 1973 Fourth Edition of the Unix Programmer's Manual doesn't seem to be available online in typeset form. This is how I managed to recreate it from its source code.

How I Recovered my Firefox Tab Groups

When I quit and restarted Firefox today I received an unwelcome shock. All my tab groups, which I maintained using the Tab Groups by Quicksaver plugin, were gone! This happened because Firefox had upgraded itself to Firefox Quantum (57), whose API does not maintain backward compatibility with the one used by the plugin. Although I knew the plugin would one day stop working, I thought there would be some last-minute warning and a chance to export the tab groups.

An Embarrassing Failure

My colleague Georgios Gousios and I are studying the impact of software engineering research in practice. As part of our research, we identified award-winning and highly-cited papers, and asked their authors to complete an online survey. Each survey was personalized with the author's name and the paper's title and publication venue. After completing a trial and a pilot run, I decided to contact the large number of remaining authors. This is when things started going horribly wrong.

Who are the Publishers of Computer Science Research?

To answer this question, I downloaded the DBLP database and used the DOI publisher prefix of each publication to determine its publisher. I grouped the 3.4 million entries by publisher and joined the numeric prefixes with the publisher names available in the list of Crossref members. Based on these data, here is a pie chart of the major publishers of computer science research papers.
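The grouping step can be sketched with a short pipeline. This is only an illustration: it assumes the DOIs have already been extracted one per line, count_by_prefix is a hypothetical helper name, and the DOI suffixes below are made up. Joining the numeric prefixes with the Crossref publisher names is left out.

```shell
#!/bin/sh
# Sketch: count publications per DOI publisher prefix.
# Assumption: input is one DOI per line; the prefix is the part before
# the first slash (for example 10.1145).
count_by_prefix() {
    cut -d/ -f1 |                 # keep only the publisher prefix
    sort |                        # bring identical prefixes together
    uniq -c |                     # count each run of identical prefixes
    sort -rn |                    # most frequent prefix first
    awk '{ print $2, $1 }'        # normalize output to "prefix count"
}

# Made-up sample DOIs, for illustration only.
printf '%s\n' \
    10.1145/example.1 10.1109/example.2 10.1145/example.3 \
    10.1007/example.4 10.1109/example.5 10.1145/example.6 |
count_by_prefix
```

On the sample input this prints one line per prefix, with the most frequent prefix first.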

The Origins of Malloc

The 1973 Fourth Edition Unix kernel source code contains two routines, malloc and mfree, that manage the dynamic allocation and release of main memory blocks for in-memory processes and of contiguous disk swap area blocks for swapped-out processes. Their implementation and history can teach us many things regarding modern computing.

Of BOOL and stdbool

The C99 standard has added to the C programming language a Boolean type, _Bool, and the bool alias for it. How well does this type interoperate with the Windows SDK BOOL type? The answer is, not at all well, and here's the complete story.

Debugging in Practice: dgsh Issue 85

Fixing an insidious bug in the new Unix directed graph shell dgsh allowed me to demonstrate in practice 10 of the 66 principles, techniques, and tools I describe in the book Effective Debugging. Almost all steps are documented in the corresponding issue and commits. Here's a detailed retrospective.

Display Git's and Current Directory on Terminal Bar

I typically have more than ten windows open on my desktop and rely on their names to select them. Being a command-line aficionado, most of them are terminals. I have them configured to display the current directory by setting the bash PROMPT_COMMAND environment variable to 'printf "\033]0;%s:%s\007" "${HOSTNAME%%.*}" "${PWD/#$HOME/~}"'. The problem is that the directory I'm often in has a generic name, such as src or doc, so the terminal's name isn't very useful.
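One possible way to make such titles more informative is to add the name of the enclosing Git repository. The sketch below is an assumption about how this could be done, not necessarily the approach taken in the full post; terminal_title and its two helper variables are hypothetical names.

```shell
# Sketch only: terminal_title is a hypothetical helper, not from the post.
# It prints "host:repo:dir" inside a Git work tree and "host:dir" elsewhere.
terminal_title() {
    # Abbreviate $HOME to ~, as the original PROMPT_COMMAND does.
    case $PWD in
        "$HOME"*) title_dir="~${PWD#"$HOME"}" ;;
        *)        title_dir=$PWD ;;
    esac
    # Top-level directory of the enclosing Git work tree; empty if none.
    title_top=$(git rev-parse --show-toplevel 2>/dev/null)
    if [ -n "$title_top" ]; then
        printf '%s:%s:%s' "${HOSTNAME%%.*}" "${title_top##*/}" "$title_dir"
    else
        printf '%s:%s' "${HOSTNAME%%.*}" "$title_dir"
    fi
}

# In ~/.bashrc: set the terminal title before each prompt is displayed.
PROMPT_COMMAND='printf "\033]0;%s\007" "$(terminal_title)"'
```

With this in place, a terminal sitting in a generic src directory would also show the name of the repository that contains it.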

