How to Perform Set Operations on Terabyte Files
The Unix sort command can efficiently handle files of arbitrary size (think of terabytes). It does this by loading into main memory as much data as can fit (say 16GB), sorting that data efficiently using an O(N log N) algorithm, and then merge-sorting the sorted chunks at a linear O(N) cost. If the number of sorted chunks exceeds the number of file descriptors that the merge operation can keep open simultaneously (a limit that is typically above 1000), then sort will recursively merge-sort intermediate merged files. Once you have sorted files with unique elements at hand, you can efficiently perform set operations on them through linear complexity O(N) operations. Here is how to do it.
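As a rough illustration of why sorted, duplicate-free inputs allow linear-time set operations, here is a minimal Python sketch of a single-pass merge. It is not necessarily the method the full post describes. The function names (merge_sets, intersection, union, difference) are hypothetical, and the sketch assumes both inputs are sorted in the same order that Python uses to compare the items (for text lines, a byte-wise order such as the one sort produces under LC_ALL=C).

```python
_END = object()  # sentinel marking an exhausted input


def merge_sets(a, b):
    """Walk two sorted iterables of unique items in one pass.

    Yields (item, in_a, in_b) tuples in sorted order. Each input item is
    read exactly once, so the cost is linear in the total input size.
    """
    ia, ib = iter(a), iter(b)
    x, y = next(ia, _END), next(ib, _END)
    while x is not _END and y is not _END:
        if x == y:
            yield x, True, True
            x, y = next(ia, _END), next(ib, _END)
        elif x < y:
            yield x, True, False
            x = next(ia, _END)
        else:
            yield y, False, True
            y = next(ib, _END)
    # At most one of the inputs still has items left.
    while x is not _END:
        yield x, True, False
        x = next(ia, _END)
    while y is not _END:
        yield y, False, True
        y = next(ib, _END)


def intersection(a, b):
    return (v for v, in_a, in_b in merge_sets(a, b) if in_a and in_b)


def union(a, b):
    return (v for v, _, _ in merge_sets(a, b))


def difference(a, b):
    """Items of a that do not appear in b."""
    return (v for v, in_a, in_b in merge_sets(a, b) if in_a and not in_b)


if __name__ == "__main__":
    first = ["apple", "cherry", "kiwi"]
    second = ["banana", "cherry", "kiwi", "lemon"]
    print(list(intersection(first, second)))
    print(list(union(first, second)))
    print(list(difference(first, second)))
```

Because the functions are generators that hold only one item per input at a time, they can be fed open file objects directly, so memory use stays constant regardless of file size.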
Monitor Process Progress on Unix
I often run file-processing commands that take many hours to
finish, and I therefore need a way to monitor their progress.
The Perkin-Elmer/Concurrent OS32 system I worked on for a couple
of years back in 1993 (don't ask)
had a facility that displayed for any executing
command the percentage of work that was completed.
When I first saw this facility working on the programs I maintained,
I couldn't believe my eyes, because I was sure that those rusty
Cobol programs didn't contain any functionality to monitor their progress.
Open and Closed Source Kernels Go Head to Head
Earlier today I presented at the
30th International Conference on Software Engineering a
research paper comparing the
code quality of Linux, Windows (its
research kernel distribution),
For the comparison I parsed multiple configurations of these systems (more than ten million lines), and stored the results in four databases, where I could run SQL queries on them. This amounted to 8GB of data, 160 million records.
(I've made the databases and the SQL queries available
The areas I examined were file organization, code structure, code style, preprocessing, and data organization.
To my surprise there was no clear winner or loser, but there were interesting differences in specific areas.
The Treacherous Power of Extended Regular Expressions
I wanted to filter out lines containing the word "line" or a double quote
from a 1GB file.
This can be easily specified as an extended regular expression,
but it turns out that I got more than I bargained for.
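For illustration only, the filter described above could be sketched in Python as follows. The pattern and the function name drop_matching are assumptions, not the author's actual command. The sketch reads "filter out" as dropping the matching lines, and it matches "line" as a plain substring, so a word such as "outline" also counts.

```python
import re

# Alternation: either the literal text "line" or a double-quote character.
PATTERN = re.compile(r'line|"')


def drop_matching(lines):
    """Lazily yield only the lines that do NOT match PATTERN."""
    return (text for text in lines if not PATTERN.search(text))


if __name__ == "__main__":
    sample = ['a line here', 'say "hi"', 'clean text']
    for text in drop_matching(sample):
        print(text)
```

The generator processes one line at a time, so it can be applied to an open file without loading the whole file into memory.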
What Can System Administrators Learn from Programmers?
Although we often hear about program bugs and techniques to get
rid of them, we seldom see a similar focus in the field of system
This is unfortunate, because increasingly the reliability of an IT system
depends as much on the software comprising the system as on the support
infrastructure hosting it.
Code Reading Example: the Linux Kernel Load Calculation
A colleague's Linux machine was exhibiting a very high load value,
for no obvious reason.
I wanted to make him point the kernel debugger on the routine calculating
It has been more than 7 years since the last time I worked on a Linux
so I had to find my way around from first principles.
This is an annotated and slightly edited version of what I did.