It might be better to sort both file1 and file2 first. That way you can simply iterate through the file you want to filter, removing any lines that match the other file. Since both files are sorted, you can filter in a single pass instead of making one pass per line of file2.
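As a rough sketch of that idea, the helper below sorts both inputs into temporary files and lets comm do the single-pass comparison. The function name filter_lines and the temporary file names are made up for illustration, and the C locale is an assumption chosen so that sort and comm agree on ordering.

```shell
# Hypothetical helper: print the lines of "$1" that do not appear in "$2".
filter_lines() {
    tmpdir=$(mktemp -d) || return 1
    # sort and comm must agree on collation, so force the C locale for both.
    LC_ALL=C sort "$1" > "$tmpdir/a.sorted"
    LC_ALL=C sort "$2" > "$tmpdir/b.sorted"
    # -2 drops lines found only in the second file, -3 drops lines common to
    # both, which leaves only the lines unique to the first file.
    LC_ALL=C comm -23 "$tmpdir/a.sorted" "$tmpdir/b.sorted"
    rm -rf "$tmpdir"
}
```

It would be called as filter_lines file1 file2 > filtered.txt. Note that the output comes back in sorted order, not in the original order of file1, and that duplicate lines are matched one-for-one rather than all at once.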
L7Sqr
Practically a Master Poster
657 posts since Feb 2011
Reputation Points: 201
Solved Threads: 124
comm expects sorted files, so your output is invalid if you are not sorting your input beforehand.
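One way to confirm that a file is already in the order comm needs is sort -c, which only checks the order and does not rewrite anything. This is a small sketch: the function name is_sorted is hypothetical, and it assumes comm will later be run under the same LC_ALL=C locale.

```shell
# Hypothetical helper: succeed (exit status 0) only if "$1" is already sorted.
is_sorted() {
    # -c checks the order without producing output; the diagnostic that sort
    # prints for an unsorted file is discarded.
    LC_ALL=C sort -c "$1" 2>/dev/null
}
```

For example: is_sorted file2.txt || echo "file2.txt needs sorting first".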
As far as speed goes, you likely won't get much faster than that. The problem is that you will have to load both files into memory (or portions of them) and read through all the lines of the longest file. Just reading each line of a 40 million-line file takes time. To get an idea of how long it should take, you can run the following two tests:
cat file2.txt > /dev/null : how long it takes to read the file
cat file2.txt > tmpfile.txt : how long it takes to read and write the file
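The two baseline tests can be wrapped up so that they report elapsed time. This is only a sketch: the function name time_io is made up, and it assumes a date that supports +%s (GNU and BSD both do), which gives whole-second resolution. That is coarse, but good enough for a file this large.

```shell
# Hypothetical helper: time a read-only pass and a read-and-write pass.
# $1 is the file to read, $2 is where the copy is written.
time_io() {
    start=$(date +%s)
    cat "$1" > /dev/null
    end=$(date +%s)
    echo "read only: $((end - start)) s"

    start=$(date +%s)
    cat "$1" > "$2"
    end=$(date +%s)
    echo "read and write: $((end - start)) s"
}
```

Run it as time_io file2.txt tmpfile.txt. Any filtering approach has to do at least this much I/O, so these numbers are a floor for what to expect. Keep in mind that a second run may be faster if the file is still in the operating system's cache.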
On top of that, I'd suggest that you try to store the files already sorted. Running sort on a file 40 million lines long is going to take a while.
L7Sqr