
2018.04.03

How to Perform Set Operations on Terabyte Files

The Unix sort command can efficiently handle files of arbitrary size (think terabytes). It does this by loading into main memory as much data as will fit (say, 16GB), sorting each such chunk with an efficient O(N log N) algorithm, and then merging the sorted chunks at a linear, O(N), cost. If the number of sorted chunks exceeds the number of file descriptors the merge operation can keep open simultaneously (typically more than 1000), sort will recursively merge intermediate merged files. Once you have sorted files with unique elements at hand, you can perform set operations on them in linear, O(N), time. Here is how to do it.
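As a minimal sketch of one way to do this with standard POSIX tools (the input names a.txt and b.txt are placeholders), the sort and comm commands can express the common set operations; comm walks both sorted inputs in a single linear pass:

    # Produce sorted files containing unique elements
    sort -u a.txt -o a-sorted.txt
    sort -u b.txt -o b-sorted.txt

    # Union: merge the two already-sorted files, keeping each element once
    sort -m -u a-sorted.txt b-sorted.txt >union.txt

    # Intersection: elements appearing in both files
    comm -12 a-sorted.txt b-sorted.txt >intersection.txt

    # Difference: elements appearing only in the first file
    comm -23 a-sorted.txt b-sorted.txt >difference.txt

The -m flag tells sort that its inputs are already sorted, so it merges them in a single O(N) pass rather than sorting from scratch.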

