Hi all,

Can anyone suggest an approach for the requirement below that uses less memory and CPU time?

I have used the command below, and it seems it is not efficient for huge files.

fgrep -v -f file2 file1 >file3
This will output file3 containing all lines from file1 that are not in file2.


file1 contains a list of 2 lakh (200,000) 10-digit numbers, and
file2 varies between roughly 8-10 crore (80-100 million) 10-digit numbers.

I want to extract the numbers from file1 that are not in file2.

Please guide me on which command would be good, or whether a Perl script would be helpful.
I am new to Perl, but I can manage.
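
Since file1 (2 lakh lines) is far smaller than file2, one common Perl approach is to load file1 into a hash and stream file2 past it once, so neither file needs sorting and only the small file is held in memory. A minimal sketch, assuming the file names from this post and one 10-digit number per line:

#!/usr/bin/perl
use strict;
use warnings;

# Load the small file into a hash: one key per 10-digit number.
my %keep;
open my $small, '<', 'file1' or die "cannot open file1: $!";
while (my $line = <$small>) {
    chomp $line;
    $keep{$line} = 1;
}
close $small;

# Stream the huge file once, discarding every number it contains.
open my $big, '<', 'file2' or die "cannot open file2: $!";
while (my $line = <$big>) {
    chomp $line;
    delete $keep{$line};
    last unless %keep;    # stop early if every number has been ruled out
}
close $big;

# Whatever is left in the hash was never seen in file2.
open my $out, '>', 'file3' or die "cannot open file3: $!";
print {$out} "$_\n" for sort keys %keep;
close $out;

Only the 2-lakh-line file sits in memory, and file2 is read exactly once from start to finish.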

It might be better for you to sort each of file1 and file2 first. That way you can simply iterate through the file you want to filter, removing any matches found in the other file. Since the two will be sorted, you can filter in a single pass instead of rescanning file2 for every line of file1, as sketched below.

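In shell terms, that single-pass merge over sorted input is exactly what comm does. A minimal sketch, assuming the file names from the question and one number per line:

sort file1 > file1.sorted
sort file2 > file2.sorted
comm -23 file1.sorted file2.sorted > file3

The -23 suppresses the lines unique to file2 and the lines common to both files, leaving only the lines unique to file1.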

Thanks for your reply..

I found one solution using the comm command:
comm -23 file1.txt file2.txt >> outfile.txt
(I guess it sorts internally.)

file1.txt contains 40 million lines (10-digit numbers),
and file2.txt contains 1 lakh (100,000) lines (10-digit numbers).

It is taking 15-20 seconds to execute ... I need to run it on a file of 160 million lines.

It would be a great help if anyone could suggest a better option.

Thanks!!

comm expects sorted files, so your output is invalid unless you sort your input beforehand.
As far as the speed goes, you won't likely get much faster than that. The problem is that you have to load both files into memory (or portions of them) and read through all the lines of the longer file. Just reading each line of a 40-million-line file takes time. To get an idea of how long it should take, you can run the following two tests:

cat file2.txt > /dev/null (how long it takes just to read the file)
cat file2.txt > tmpfile.txt (how long it takes to read and write the file)

On top of that, I'd suggest that you try to store the files already sorted. Running sort on a file 40 million lines long is going to take a while.
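
If new numbers arrive in batches, one possible way to keep the big file permanently sorted (file names here are placeholders) is to sort only each incoming batch and merge it in with sort -m, which is much cheaper than re-sorting the whole file:

sort newbatch.txt > newbatch.sorted
sort -m file2.sorted newbatch.sorted > file2.merged && mv file2.merged file2.sorted

sort -m assumes its inputs are already individually sorted and only interleaves them, so it runs in a single pass over both files.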

Thanks for the info!!
I thought comm would sort and then process.

I will check it.

I have also tried
fgrep -v -f bigfile smallfile > outfile
It is taking 1 minute.

And the output is the same for both commands.
(The smallfile is already sorted.)
