How do I find the difference between two files and write only the newer information to a third file?

For example:

There are two files, file1 and file2.
file1
apple
mango
pear

file2
apple
mango
pear
cat
and
dog

I need a file with
fileout
cat
and
dog

How can I achieve this with diff and patch?
Is there any other way?

Note: the files have millions of records.

All 6 Replies

I assume you are using Linux or a similar *nix operating system? If so, look at the man page for the 'diff' command. It will do what you want. The purpose of the 'patch' command is to apply diffs to an original file. I believe that with the correct options, diff can do what you want without using 'patch'.

Do note that if you are serious about "millions of records", then you might be better off storing the data in a Hadoop database and using MapReduce to process it.

And if you are using a Windows OS, the console command FC will do it, and you can also redirect the output to a file using the standard output redirector.

First of all, I apologize for the delay.

Yes, I'm working on a UNIX box.

I'm not able to figure out a solution using the diff command.

However, I came up with a solution that keeps only the second column of the comm output (the lines unique to the second file), like the following:

comm -13 "${FILE1}" "${FILE2}" > "${DELTA}"

This solves my problem, but I am skeptical about it processing millions of records. Can somebody give a thumbs up for this, or should I go with rubberman's suggestion of Hadoop databases?
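One thing worth checking with this approach: comm expects both inputs to be sorted, and file2 in the original example is not. Here is a minimal sketch on the sample data from the question. The temporary directory and the `.sorted` file names are assumptions made for illustration, and the result comes out in sorted order rather than in the original file order.

```shell
#!/bin/sh
# Sketch: lines present in file2 but not in file1, using sort + comm.
# File names and the temporary directory are illustrative assumptions.
export LC_ALL=C   # make sort and comm agree on the collation order

WORK=$(mktemp -d)
printf '%s\n' apple mango pear > "$WORK/file1"
printf '%s\n' apple mango pear cat and dog > "$WORK/file2"

# comm expects both inputs to be sorted
sort "$WORK/file1" > "$WORK/file1.sorted"
sort "$WORK/file2" > "$WORK/file2.sorted"

# -1 suppresses lines unique to file1, -3 suppresses lines common to both,
# so only the lines unique to file2 remain
comm -13 "$WORK/file1.sorted" "$WORK/file2.sorted" > "$WORK/fileout"

cat "$WORK/fileout"
```

On the sample data this should leave and, cat, and dog in fileout. If the files are already kept in sorted order, the two sort steps can be skipped.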

Be aware that your solution using comm will only include the rows that are found only in the second file. If there are rows in the first file that do not exist in the second, you will fail to find them. So this is not a solution to your problem as expressed in your original post. There, you specified that you wanted to see the difference between the files.
If you only wanted to find the 'new' rows in file2, you are OK. If you need all differences, also execute this command: comm -23 "${FILE1}" "${FILE2}" >> "${DELTA}".
If you want to use diff, you will need to pipe the results through grep and cut to output only the original rows and that will be quite a bit slower.
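A rough sketch of what that diff pipeline might look like on the sample data from the question. It assumes diff's default output format, where lines found only in the second file are prefixed with "> ". The temporary directory is an assumption for illustration.

```shell
#!/bin/sh
# Sketch: extract the lines diff reports as added in file2.
# The temporary directory and file names are illustrative assumptions.
WORK=$(mktemp -d)
printf '%s\n' apple mango pear > "$WORK/file1"
printf '%s\n' apple mango pear cat and dog > "$WORK/file2"

# In diff's default output, lines found only in the second file start with "> ".
# grep keeps those lines; cut drops the two-character "> " prefix.
# diff exits with status 1 when the files differ, which is expected here.
diff "$WORK/file1" "$WORK/file2" | grep '^> ' | cut -c3- > "$WORK/fileout"

cat "$WORK/fileout"
```

Unlike comm, this keeps the rows in their original order (cat, and, dog). Keep in mind that diff compares by position, so a row that merely moved within file2 can also show up as added.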
I am not familiar with Hadoop, but I don't think a database solution will be faster than comm, all things being equal. I assume that you must compare both files in their entirety each time and that you cannot separate out the incremental changes. That implies that the rows of both files will have to be dropped and reloaded every time before comparing them, and that there will be no indexing on the rows unless new indexes are built each time.
If you need to know how any solution will scale - build some test files and try it!
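One possible way to run such a test is sketched below. The record format, the default sizes `N` and `EXTRA`, and the file names are all made up for illustration; real data will have different line lengths and ordering, so treat the timing only as a rough indication.

```shell
#!/bin/sh
# Sketch: build two large test files and time the sort + comm approach.
# N, EXTRA, the record format, and the file names are illustrative assumptions.
export LC_ALL=C
N=${N:-1000000}       # records in the first file
EXTRA=${EXTRA:-1000}  # additional records in the second file
WORK=$(mktemp -d)

# Generate zero-padded records so both files share the first N lines
awk -v n="$N" 'BEGIN { for (i = 1; i <= n; i++) printf "record%09d\n", i }' > "$WORK/big1"
awk -v n="$((N + EXTRA))" 'BEGIN { for (i = 1; i <= n; i++) printf "record%09d\n", i }' > "$WORK/big2"

START=$(date +%s)
sort "$WORK/big1" > "$WORK/big1.sorted"
sort "$WORK/big2" > "$WORK/big2.sorted"
comm -13 "$WORK/big1.sorted" "$WORK/big2.sorted" > "$WORK/delta"
END=$(date +%s)

echo "new records: $(wc -l < "$WORK/delta")"
echo "elapsed seconds: $((END - START))"
```

Raising `N` (for example `N=10000000 sh test.sh`, where test.sh is a hypothetical name for this script) gives a feel for how the run time grows with file size.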

Thanks All!

Yes, all4n. I require only the records present in the second file, just like a literal subtraction (file2 - file1).
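For that kind of subtraction there is also an awk-based alternative that nobody in the thread proposed, sketched here as one more option. It does not need sorted input and keeps the order of file2, but it holds every line of file1 in memory, which may or may not be acceptable for millions of records. The temporary directory is an assumption for illustration, and the sketch assumes file1 is not empty.

```shell
#!/bin/sh
# Sketch: file2 minus file1 using an awk lookup table (not from the thread).
# The temporary directory and file names are illustrative assumptions.
WORK=$(mktemp -d)
printf '%s\n' apple mango pear > "$WORK/file1"
printf '%s\n' apple mango pear cat and dog > "$WORK/file2"

# While reading the first file (NR == FNR), remember each line.
# While reading the second file, print only the lines never seen in the first.
awk 'NR == FNR { seen[$0] = 1; next } !($0 in seen)' \
    "$WORK/file1" "$WORK/file2" > "$WORK/fileout"

cat "$WORK/fileout"
```

On the sample data this should write cat, and, dog to fileout, in that order.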
