Optimal Peripheral Access Using Pipe-based Double-buffering

Diomidis D. Spinellis
<dds@sena.gr>

When working with slow peripherals such as tape drives and CD-R recorders, careful tuning of the access patterns and process architecture can result in substantial increases in performance. One of the responsibilities of an operating system (OS) is to optimise the utilisation of a computer system by the appropriate scheduling of tasks. Thus, when a process waits for a peripheral to complete an operation, the OS can schedule other processes that are waiting to run. Although scheduling happens behind the scenes of normal system operation and is transparent to the user, in some cases one can reduce execution time by carefully arranging interdependent tasks. With ever-increasing processor speeds and relatively stagnant peripheral access times, such optimisations can deliver impressive results. In the following paragraphs I will describe and explain how I reduced the overhead of tape access by almost 30% simply by adding two dd(1) commands to our backup pipeline.

The Problem

The problem I wanted to solve was to minimise the actual (wall-clock) time taken by our backup process, which currently needs 8 hours for a full backup. I thought I could achieve this by making our tape drive stream and by increasing the parallelism of the backup process. Our LAN consists of Windows 95/98/NT and Unix machines. We perform our backups on a machine running Linux (2.0.36). At backup time the Linux machine mounts the disks of all Windows hosts using the SMB file system and runs a tar(1)-based backup process. The process is responsible for administering incremental and full backups, maintaining backup quota, measuring the backup set size, comparing the files after backup, and logging the backup results. The heart of the process is essentially the following command:

tar czf /dev/tape -b 1000 /smb

The command creates a tar file on /dev/tape from the contents of /smb using gzip compression and a blocking factor of 1000. The blocking factor is supposed to make tar write to the tape device in chunks of 1000 blocks, i.e. of 0.5MB. Accessing the tape in large blocks is mandatory for some devices and, according to some tar manuals, can be more efficient than access in smaller chunks. Tape drives typically read or write data at a fixed rate; if that rate cannot be maintained by the rest of the system, the drive has to read or write some data and then stop and backtrack to a point where the next block can be written. When this happens the performance of the drive deteriorates drastically and we say that the drive is not streaming. Writing in large blocks can help the drive to stream.
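Before blaming the software it is worth checking how the drive itself is set up. On Linux the mt(1) command (from the mt-st package, an assumption for this sketch) can report the drive's current block size and, where the drive supports it, switch it to variable-block mode:

% mt -f /dev/tape status       # report the drive's block size and mode
% mt -f /dev/tape setblk 0     # switch to variable-block mode (drive permitting)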

In our environment the drive consistently refused to stream, and increasing the blocking factor had absolutely no effect. Going through the source code of the tar we are using (GNU tar 1.11.8 with a small local modification to handle the lack of stable inode numbers in SMB filesystems), I noticed that when compression is specified tar ignores the blocking factor. That explained the lack of streaming and prompted me to examine the matter further.
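A quick way to confirm this behaviour without reading the source is to trace the sizes of the write(2) calls the archiver actually issues. The sketch below assumes Linux strace(1) and writes the archive to a scratch file rather than the real tape; a histogram of the traced write sizes shows immediately whether 0.5MB blocks ever reach the output:

# Trace every write() made by tar and the gzip it spawns, archiving to a
# scratch file instead of the tape.
% strace -f -e trace=write -o /tmp/writes.log \
    tar czf /tmp/scratch.tar.gz -b 1000 /usr/src/linux
# Histogram of write sizes: the last field of each traced line is write()'s
# return value, i.e. the number of bytes actually written.
% awk '{ print $NF }' /tmp/writes.log | sort -n | uniq -c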

Measurements and the Solution

A golden rule of performance optimisation is to profile the process to be optimised. I therefore first measured the time it takes for tar and gzip to process a medium-sized directory:

% time sh -c 'tar cf - /usr/src/linux | gzip -c >/dev/null'
375.40user 39.38system 7:02.18elapsed 98%CPU

This measurement establishes our performance bottom line. No matter how much we improve tape access, we will never be able to make the process finish in less than about 7 minutes. The next step was to measure the overhead of writing to the tape:

% time sh -c 'tar cf - /usr/src/linux | gzip -c >/dev/tape'
389.81user 37.34system 10:19.96elapsed 68%CPU

So we now know that writing to the tape increases the execution time by about 3 minutes, or 47%. It is this increase that we want to fight. As our full backup takes about 8 hours, optimising tape access could result in significant time savings. The measurement also tells us that the CPU is idle about 30% of the time; our task mix is therefore I/O-bound, and we have CPU power to spare to optimise it. As a first try I piped the gzip output through dd(1), specifying a large (0.5MB) output block size. This causes dd to accumulate data from its input until it has 0.5MB available and then write that data to the tape in one large chunk.

% time sh -c 'tar cf - /usr/src/linux | gzip -c |
  dd obs=512k of=/dev/tape'
27979+1 records in
27+1 records out
392.65user 41.55system 11:28.28elapsed 63%CPU

This time I could hear and see the tape drive stream; a beautiful sound and sight. Embarrassingly, the execution time increased by more than one minute (11%), while CPU utilisation decreased. Streaming the tape drive did not appear to help. An analysis of the situation was now in order.

Apparently, on the system I am using, write commands to the tape device are processed synchronously, i.e. the process issuing the write system call is suspended until the data has been written to the device. In the first case involving the tape drive, while data was being written by gzip to the tape drive (and gzip was waiting), tar could continue to process some files, writing its output to the pipe buffer connecting the two processes. In the second case, where dd was added, dd took longer to write its 0.5MB block to the tape drive. During that time the pipe buffer between gzip and dd filled up, blocking gzip. That, in turn, made the pipe buffer between tar and gzip fill up, blocking tar as well. This explanation of the timing anomaly made me realise the problem's root cause and think about a possible solution.
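The buffer behind each pipe is small; on the Linux 2.0 kernel used here it is a single page (typically 4KB), so it fills up almost as soon as the downstream process stops reading. The following sketch (the sizes and the 30-second sleep are arbitrary) makes the blocking visible: sleep never reads its end of the pipe, so the writer blocks after one pipe buffer's worth of data, and its completion message appears only when sleep exits and the pipe is torn down.

# dd tries to push 1MB into a pipe that nobody reads; it blocks once the
# kernel pipe buffer is full.  "writer done" is printed only after the
# 30-second sleep exits and closes the read end -- proof that the writer
# spent the whole time blocked, just like gzip and tar in the pipeline above.
% time sh -c '(dd if=/dev/zero bs=1k count=1000; echo writer done 1>&2) | sleep 30'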

In order to optimise the backup process I would need a way to prevent the pipeline stalls while data was being written to the tape drive. The addition of another dd process after gzip could provide this buffer:

% tar cf - /usr/src/linux | gzip -c |
  dd obs=512k | dd obs=512k of=/dev/tape

In the pipeline above gzip writes data to the first dd process. As soon as this process accumulates 0.5MB it writes it to the second dd process using a single write call. As soon as the second dd process accumulates 0.5MB (i.e. immediately after the first process writes it) it will issue a write call to send the data to the tape drive. While data is being written to the tape the second dd process is blocked; however, the first one is still running, as it has written its block to the second process and is now again accumulating the next block to write. Therefore tar and gzip can continue gathering and compressing data at the same time as data is being written to the tape drive. Thus, the two dd processes effectively implement a double-buffering mechanism. The time measurements prove the validity of this approach:

% time sh -c 'tar cf - /usr/src/linux | gzip -c |
  dd obs=512k | dd obs=512k of=/dev/tape'
27979+1 records in
27+1 records out
27979+1 records in
27+1 records out
364.34user 48.57system 9:23.22elapsed 73%CPU

Our change resulted in a 28% decrease in the tape overhead, giving a 10% decrease in overall execution time. Depending on the performance difference between the CPU and the peripheral in question, the execution time can decrease even more.
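If one block of slack turns out to be too little for a particular device, the same trick scales in the obvious way: every additional intermediate dd stage contributes one more 0.5MB buffer in front of the process that writes to the tape. This is only a sketch; whether the extra stage pays for its copying overhead depends on how bursty the device and the compressor are:

# Two intermediate dd processes hold one 0.5MB block each, roughly doubling
# the slack available while the final dd is blocked writing to the tape.
% tar cf - /usr/src/linux | gzip -c |
  dd obs=512k | dd obs=512k | dd obs=512k of=/dev/tape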

Conclusions

As I mentioned in the beginning, one of the responsibilities of an OS is to optimise the utilisation of a computer system by the appropriate scheduling of tasks. In our initial case, while the system was waiting for the tape drive to write the data, it could indeed run other, unrelated processes such as sendmail or httpd. However, the operation of the processes in the backup pipeline could not be parallelised beyond the limits imposed by the small buffer provided by the implementation of pipes under Unix. Soon after one process in the pipeline blocked, the whole pipeline stalled. Adding an additional dd process before the process writing to the tape provided extra breathing space to the pipeline and increased the opportunity for parallel operation. The same approach can be used in all cases where a pipeline consists of a mixture of I/O-bound and CPU-bound tasks.
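For instance, the same double buffer helps in the restore direction, where the tape becomes the producer and gzip and tar the CPU-bound consumers. The sketch below is illustrative only: the read block size must match the 0.5MB blocks written above, and /restore is a made-up target directory.

# Read 0.5MB tape blocks and keep an extra block of slack between the drive
# and the decompressor, so the drive can keep streaming while gzip and tar
# are busy.
% dd ibs=512k obs=512k if=/dev/tape | dd obs=512k | gzip -dc | tar xf - -C /restore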