How to Perform Set Operations on Terabyte Files

 

The Unix sort command can efficiently handle files of arbitrary size (think of terabytes). It does this by loading into main memory all the data that can fit into it (say 16GB), sorting that data efficiently using an O(N log N) algorithm, and then merge-sorting the chunks with a linear complexity O(N) cost. If the number of sorted chunks is higher than the number of file descriptors that the merge operation can simultaneously keep open (typically more than 1000), then sort will recursively merge-sort intermediate merged files. Once you have at hand sorted files with unique elements, you can efficiently perform set operations with them through linear complexity O(N) operations. Here is how to do it.

Create sorted sets

To create a sorted set out of a file containing line records pass it through sort | uniq or, more efficiently, through sort -u. As an example, these are two such sets, one containing a set of mammals and one a set of sea animals.

$ cat mammals
cow
dog
dolphin
tiger
whale

$ cat sea-animals
dolphin
shark
trout
whale

Set union

To find the union of the two sets run them through sort -um. This will merge-sort the specified files and output each line only once. The following example shows the animals in both files.

$ sort -mu mammals sea-animals
cow
dog
dolphin
shark
tiger
trout
whale

Set intersection

To find the intersection of the two sets run them through comm -12. The comm command will identify common and non-common elements in the two specified sorted files. With the specified flags it will not output elements only in the first set (1) or only in the second set (2), but just elements common to both sets. The following example shows sea animals that are also mammals.

$ comm -12 mammals sea-animals
dolphin
whale

Set difference

To find the difference between the two sets run them through comm -23. As we saw, this will find common and non-common elements in the two files. With the specified flags it will not output elements only in set 2, nor elements common to both sets, but only elements that exist in set 1. The following example shows mammals that are not sea animals.

$ comm -23 mammals sea-animals
cow
dog
tiger

Similarly, you can also find sea animals that are not mammals.

$ comm -13 mammals sea-animals
shark
trout

Comments   Toot! Share


Last modified: Tuesday, April 3, 2018 8:44 pm

Creative Commons Licence BY NC

Unless otherwise expressly stated, all original material on this page created by Diomidis Spinellis is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.