How to Perform Set Operations on Terabyte Files
The Unix sort command can efficiently handle files of arbitrary size (think of terabytes). It does this by loading into main memory all the data that can fit into it (say 16GB), sorting that data efficiently using an O(N log N) algorithm, and then merge-sorting the chunks with a linear complexity O(N) cost. If the number of sorted chunks is higher than the number of file descriptors that the merge operation can simultaneously keep open (typically more than 1000), then sort will recursively merge-sort intermediate merged files. Once you have at hand sorted files with unique elements, you can efficiently perform set operations with them through linear complexity O(N) operations. Here is how to do it.
Create sorted sets
To create a sorted set out of a file containing line records pass it through
sort | uniq or, more efficiently, through
sort -u. As an example, these are two such sets, one containing a set of mammals and one a set of sea animals.
$ cat mammals cow dog dolphin tiger whale $ cat sea-animals dolphin shark trout whale
To find the union of the two sets run them through
sort -um. This will merge-sort the specified files and output each line only once. The following example shows the animals in both files.
$ sort -mu mammals sea-animals cow dog dolphin shark tiger trout whale
To find the intersection of the two sets run them through
comm -12. The comm command will identify common and non-common elements in the two specified sorted files. With the specified flags it will not output elements only in the first set (1) or only in the second set (2), but just elements common to both sets. The following example shows sea animals that are also mammals.
$ comm -12 mammals sea-animals dolphin whale
To find the difference between the two sets run them through
comm -23. As we saw, this will find common and non-common elements in the two files. With the specified flags it will not output elements only in set 2, nor elements common to both sets, but only elements that exist in set 1. The following example shows mammals that are not sea animals.
$ comm -23 mammals sea-animals cow dog tiger
Similarly, you can also find sea animals that are not mammals.
$ comm -13 mammals sea-animals shark trout