1. Context

Twice during my career, I was given the project to compare two large CSV files (Comma Separated Values: text files corresponding to either an Excel file export or an export of database tables):

  • In the first case, it was a comparison between trades entered in a customized software package in an independent country P1, and those entered in the same software package customized differently in a P2 country. It was necessary to primarily transcode the information then to recognize those who were absent from the Paris system and those who had been modified in order to update the data of the Paris database. Each file was about 300Mb and the computers had about 1Gb of RAM. The development os expected to be done in the next 3 days maximum. (The principle: we want the result for yesterday with a run time inferior to 8 hours).
  • In the second case, it was necessary to reconcile flow files between two different software packages (the old one which had to be decommissioned and the replacement one) that ran in competition in order to make non-regressions. Each file was 2.5Gb and the goal was to compare them daily in less than 2 hours. Development needed to be operational in less than two weeks.

To be able to compare them, the first step (and the most complex step) consists of sorting the 2 files: by sorting them we can decrease the number of lines read as well as the number of comparison. Unfortunately, sorting the file requires loading it, and when it is large it becomes complicated.

2. Technical choices

2.1 Language

The choice of language strongly depends on the context:

  • If you have sufficient development time, use a compiled language like C / C ++ or just in time compiler like C# that will provide you with excellent performance and fine memory management.
  • If your need requires a short development time you have to choose a scripting language such as Perl or Python.

For example, in this post, we will use the Python language in order to obtain the most compact and clear code possible, allowing you to appropriate the method and then quickly transpose it into another language.

2.2 Algorithm

The choice of the algorithm will condition the performance as well as the feasibility. Indeed « loading a CSV file » of 2.5Gb in RAM is not possible unless you have an exceptional memory capacity prohibiting the storage of the file in memory. Give it multiple attempts and experiment and you will find that the memory consumption is almost 10 times greater as we cut lines in vector of values.

Here is a small example:

def read_csv( filename ):
    results = []
    with open( filename, 'r' ) as file:
        for line in file:
            line    = line.rstrip()
            columns = line.split( ';' )
            results.append( columns )
    return results
path    = "C:\\personnel\\python\\blog\\compare_csv"
before  = psutil.virtual_memory()
results = read_csv( path + "\\sample1.csv" )
after   = psutil.virtual_memory()
print( "psutil.size =", after.used - before.used )

If we use a csv file of 39Mo, we get 341Mo in memory.

The solution is to use an external merge sorting: we divide the large file into several files of the same size that we sort independently from each other and rebuild the file by merging the sorted files. The details of the method can be found on these following sites:

Divide to conquer

The first part of the algorithm is to divide the main file into sub-files of equivalent size (we will rely on a number of lines of the original file) that are sorted independently of each other. We then cut the filename. In the temporary_path directory, we place the cut files. Each file will contain rows_count rows and will be sorted using the ‘sort’ function of the lists and then pass the function get_key (this function extracts from the initial file columns a tuple serving as a sort key) in the key argument.

def split_file( filename, temporary_radix, rows_count, get_key ):
    """ Split file
        Split input file in several sorted files of rows_count lines.
    
        Parameters:
            filename: input full path name
            temporary_radix: full path radix to create splitted sorted files
            get_key: the function to extract sorting key from columns
        
        Return:
            number of generated files
    """
    # local variables
    file_counter = 0
    
    # write sub file
    def write_file( filename, rows ):
        rows.sort( key = get_key )
        with open( filename, 'w' ) as output_file:
            for r in rows:
                print( *r, sep = ';', end = '\n', file = output_file )
    
    # loop
    with open( filename, "r" ) as input_file:
        rows = []
        for line in input_file:
            line    = line.rstrip()
            columns = line.split( ';' )
            rows.append( columns )
            if len( rows ) == rows_count:
                write_file( temporary_radix % ( 0, file_counter ), rows )
                rows.clear()
                file_counter += 1
        if len( rows ) > 0:
            write_file( temporary_radix % ( 0, file_counter ), rows )
            file_counter += 1
    return file_counter

Merging in order to win

The second part of the algorithm consists in merging the 2 by 2 files, making sure it keeps sorting. We will read one line at a time in each file, compare them and write the smallest one in the result file and replace the line written by the next one of the same original file. The risk is famine (we have consumed all the lines of one of the two files but not in the other), it will then be required to evolve the comparison function to be able to handle empty lines: we shift the line (well defined) to the output file.

def fusion( temporary_radix, depth, id1, id2, new_id, get_key ):
    """ File fusion
        F(id1) + F(id2) -> F(new_id)
    
        Parameters:
            temporary_radix: full path radix to read/write temporary files
            depth: depth
            id1: id of 1st file ("%s\\temp_%d_%d.csv" % ( temporary_path, depth, id1 ))
            id2: id of 2nd file ("%s\\temp_%d_%d.csv" % ( temporary_path, depth, id2 ))
            new_id: id of the new file ("%s\\temp_%d_%d.csv" % ( temporary_path, depth+1, new_id ))
            get_key: function to extract the columns from the line
    """
    # local variables
    columns1 = None
    columns2 = None
    name1    = temporary_radix % (     depth,    id1 )
    name2    = temporary_radix % (     depth,    id2 )
    name_o   = temporary_radix % ( depth + 1, new_id )
    
    # compare
    def compare():
        if None == columns1:
            return 1
        elif None == columns2 or get_key( columns1 ) < get_key( columns2 ):
            return -1
        else:
            return 1
    
    # open the files
    with open( name1,  "r" ) as inFile1, \
         open( name2,  "r" ) as inFile2, \
         open( name_o, "w" ) as outFile:
        while True:
            # read lines
            if not columns1:
                line = inFile1.readline().rstrip()
                if 0 != len( line ):
                    columns1 = line.split( ';' )
            if not columns2:
                line = inFile2.readline().rstrip()
                if 0 != len( line ):
                    columns2 = line.split( ';' )
            if not columns1 and not columns2:
                break
            # compare
            if compare() < 0:
                print( *columns1, sep = ';', end = '\n', file = outFile )
                columns1 = None
            else:
                print( *columns2, sep = ';', end = '\n', file = outFile )
                columns2 = None
    # finalize
    unlink( name1 )
    unlink( name2 )

5. Sorting or Nothing

The purpose of this function is to organize the cut and then merge until you get a sorted file. Once cut, we will merge the files 2 by 2: at each step we divide the number of files by 2. We must continue until there is more than one file. And in the case where at one step there is an odd number of files, there will be a file to copy from step i to i + 1.

def sort_file( input_filename, \
               sorted_filename, \
               rows_count, \
               temporary_radix = "C:\\Temp\\temp_%d_%d.csv", \
               get_key         = lambda c: c ):
    """Sort file
    
       Parameters:
           input_filename: input file name (file to sort)
           sorted_filename: output file name (sorted file)
           temporary_radix: full path radix name for temporary files ({path}\\temp_{depth}_{id}.csv)
           rows_count: number of rows per file in splitting phasis
           get_key: key function to sort
    """
    # initialize
    count   = split_file( input_filename, temporary_radix, rows_count, get_key )
    depth   = 0
    
    # main loop: keep 2 files, join them into one sorted file until all files are consumed
    while 1 < count:
        n = 0
        for i in range( 0, count, 2 ):
            j = i + 1
            if count == j:
                rename( temporary_radix % (     depth, i ), \
                        temporary_radix % ( depth + 1, n ) )
            else:
                fusion( temporary_radix, depth, i, j, n, get_key )
            n += 1
        depth += 1
        count  = n
    
    # rename temporary file into attempted file
    rename( temporary_radix % ( depth, 0 ), sorted_filename )

Here is an example :

get_key = lambda c: (c[1],)
sort_file( "C:\\personnel\\python\\blog\\compare_csv\\sample1.csv", \
           "C:\\personnel\\python\\blog\\compare_csv\\sample1_sorted.csv", \
           10000, \
           "C:\\personnel\\python\\blog\\compare_csv\\temp\\temp_%d_%d.csv", \
           get_key )

Conclusion

The method makes it possible to replace memory allocation / management problems by file accesses. The performance will therefore depend on 2 parameters:

  • Disk access time: avoid using a network drive to manage temporary files, prefer a local hard drive.
  • The number of lines when cutting: The lower the number, the more the files are important and therefore the more the disk access will be expensive.

It is necessary to calibrate the algorithm according to the machine used (amount of memory and hard disk) to execute the operation and the format of the CSV file. This will make an efficient choice of the number of lines for cutting.