Comparing two large files in Unix/Perl

Question

ajai.solinfi 0 Newbie Poster

13 Years Ago

Hi all,

Can anyone suggest an approach for the requirement below that uses less memory and CPU time?

I have used the command below, but it does not seem efficient for huge files.

fgrep -v -f file2 file1 >file3
This will output file3 containing all lines from file1 that are not in file2.

file1 contains a list of 2 lakh (200,000) 10-digit numbers, and
file2 contains a list of roughly 8-10 crore (80-100 million) 10-digit numbers.

I want to extract the numbers from file1 that are not in file2.

Please advise which command would work well; a Perl script would also be helpful.
I am new to Perl, but I can manage.

perl shell-scripting unix

All 4 Replies

L7Sqr 227 Practically a Master Poster

13 Years Ago

It might be better for you to sort each of file1 and file2 first. That way you can simply iterate through the file you want to filter, removing any matches found in the other file. Since both will be sorted, you can filter in a single pass over each file instead of making one pass per line of file2.
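
A minimal sketch of that single-pass idea, assuming one number per line and byte-order (C locale) sorting. The sample data, the workdir variable, and the .sorted file names are hypothetical and only there to make the example runnable; awk is used here just to spell out the walk over the two sorted files.

```shell
# Hypothetical sample data; the real files would hold millions of 10-digit numbers.
workdir=$(mktemp -d)
printf '%s\n' 9000000003 9000000001 9000000002 > "$workdir/file1"
printf '%s\n' 9000000002 9000000005 9000000004 > "$workdir/file2"

# Sort both inputs in the C locale so sort and awk agree on the ordering.
LC_ALL=C sort "$workdir/file1" > "$workdir/file1.sorted"
LC_ALL=C sort "$workdir/file2" > "$workdir/file2.sorted"

# Walk both sorted files once: advance through file2 while its current line
# is smaller, then print the file1 line only if file2 has no equal line.
# Appending "" forces string comparison, matching the sort order used above.
merged=$(LC_ALL=C awk -v other="$workdir/file2.sorted" '
    BEGIN { have = ((getline o < other) > 0) }
    {
        while (have && (o "") < ($0 "")) have = ((getline o < other) > 0)
        if (!have || (o "") != ($0 "")) print
    }' "$workdir/file1.sorted")
printf '%s\n' "$merged"
```

Each file is read exactly once after sorting, so the filtering cost grows with the combined length of the two files rather than with their product.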

ajai.solinfi 0 Newbie Poster · Answer 1 · 2012-02-03T18:01:35+00:00

Thanks for your reply.

I found one solution using the comm command:
comm -23 file1.txt file2.txt >> outfile.txt
(I guess it sorts internally.)

file1.txt contains 40 million lines (10-digit numbers)
and file2 contains 1 lakh (100,000) lines (10-digit numbers).

It takes 15-20 seconds to execute, and I need to run it on a file of 160 million lines.

It would be a great help if anyone could suggest a better option.

Thanks!!

L7Sqr 227 Practically a Master Poster · Answer 2 · 2012-02-03T19:28:04+00:00

comm expects sorted files, so your output is invalid if you are not sorting your input beforehand.
As for speed, you likely won't get much faster than that. The problem is that you have to load both files into memory (or portions of them) and read through all the lines of the longest file. Just reading each line of a 40 million-line file takes time. To get an idea of how long it should take, you can run the following two tests:

cat file2.txt > /dev/null : how long it takes to read the file
cat file2.txt > tmpfile.txt : how long it takes to read and write the file

On top of that, I'd suggest that you try to store the files already sorted. Running sort on a file 40 million lines long is going to take a while.
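
As a rough sketch of the sort-once-then-comm workflow, assuming one number per line: the sample data, the workdir variable, and the .sorted file names below are made up for illustration. LC_ALL=C is set on both commands so that sort and comm use the same ordering.

```shell
# Hypothetical sample data standing in for the real multi-million-line files.
workdir=$(mktemp -d)
printf '%s\n' 9000000003 9000000001 9000000002 > "$workdir/file1.txt"
printf '%s\n' 9000000002 9000000004 > "$workdir/file2.txt"

# Sort once and keep the sorted copies so later runs can skip this step.
LC_ALL=C sort -o "$workdir/file1.sorted" "$workdir/file1.txt"
LC_ALL=C sort -o "$workdir/file2.sorted" "$workdir/file2.txt"

# -2 drops lines found only in file2, -3 drops lines common to both,
# leaving the lines that appear only in file1.
LC_ALL=C comm -23 "$workdir/file1.sorted" "$workdir/file2.sorted" > "$workdir/outfile.txt"
cat "$workdir/outfile.txt"
```

If the files are stored already sorted, only the comm step has to run each time.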

ajai.solinfi 0 Newbie Poster · Answer 3 · 2012-02-03T20:12:45+00:00

Thanks for the info!!
I thought comm would sort and then process.

I will check it.

I have also tried
fgrep -v -f bigfile smallfile > outfile
It takes 1 minute.

The output file is the same for both commands.
(The smallfile is already sorted.)
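
One more possibility, offered only as a sketch and not something tested on files of this size: keep just the small file in memory with awk and stream the big file past it, so neither file has to be sorted. The sample data and the workdir variable are hypothetical, and the script assumes the small file is not empty (the NR == FNR test would otherwise mistake the big file for the first one). Unlike fgrep without -x, this compares whole lines, which gives the same result when every line is a 10-digit number.

```shell
# Hypothetical sample data; smallfile is the list to filter, bigfile the list to exclude.
workdir=$(mktemp -d)
printf '%s\n' 9000000004 9000000001 9000000002 9000000003 > "$workdir/smallfile"
printf '%s\n' 9000000003 9000000007 9000000001 9000000009 > "$workdir/bigfile"

awk '
    # First file (smallfile): remember every line and its original position.
    NR == FNR { order[++n] = $0; keep[$0] = 1; next }
    # Second file (bigfile): drop any remembered line that also appears here.
    ($0 in keep) { delete keep[$0] }
    # Print the surviving smallfile lines in their original order.
    END { for (i = 1; i <= n; i++) if (order[i] in keep) print order[i] }
' "$workdir/smallfile" "$workdir/bigfile" > "$workdir/outfile"
cat "$workdir/outfile"
```

Memory use depends on the size of smallfile only, while bigfile is read once from start to end.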