Find matching values between 2 text files

Question

leo002000 2 Junior Poster in Training

14 Years Ago

Hi all,

Basically I am attempting to determine whether the first 2 columns (tab delimited file) from file1.txt match the first 2 columns of file2.txt. If they do I want to output the entire row of file1.txt to outputfile.txt...

I can do this using vbs but due to the amount of data that this is going to have to trawl through it's not an option. Can anyone please help?

Many thanks :)

perl

2 Contributors
9 Replies
239 Views
2 Days Discussion Span
Latest Post 14 Years Ago Latest Post by mitchems

mitchems 12 Posting Whiz in Training

14 Years Ago

So let's be clear... you have 2 tab delimited text files. What do they look like? Like this?

file1.txt

1 2
3 4
5 6
7 8

file2.txt
2 1
3 4
5 6
8 7

And you want to see if BOTH values in BOTH columns match? That is, in this example the output would be:

outputfile.txt
3 4
5 6

Correct?

mitchems 12 Posting Whiz in Training

14 Years Ago

I copied the contents of your file1.txt and file2.txt over mine and converted them to tab delimited files (the tabs don't come through when you paste them here). I ran my original program against the two files and I got the outputfile that you wished. The code I originally wrote should work on either numeric or text (that's why I used "eq" instead of "=="). Make sure you have real tab-delimited files and try it again.

mitchems 12 Posting Whiz in Training

14 Years Ago

Here is the code that I believe will work for your purposes:

use strict;
use warnings;
open(FILE1,"<file1.txt");
open(FILE2,"<file2.txt");
open (OUT,">outputfile.txt");
my @arr1=<FILE1>;
my @arr2=<FILE2>;
close FILE1;
close FILE2;
my $x=0;
my %chash;
for (@arr2){
	chomp;
	my($col1,$col2,$rest)=split(/\t/);
	my $ckey="$col1$col2";
	$chash{$ckey}=1; #this puts the first two columns of file2 in a hash
}

for (@arr1){
	chomp;
	my($col1,$col2,$rest)=split(/\t/);
	$col1="" if(!$col1);
	$col2="" if(!$col2);
	my $ckey="$col1$col2";
	if($chash{$ckey}){
		print OUT "$_\n";
	}
}
close OUT;

Yes, I know I reuse variable names. Scope allows me to.

Here are my source files:

file1.txt

Supercont	1.9	1479890	T 2 @ 1,3,5
Supercont	1.9	1459998	G 4 @.. 1,2,6
Supercont	1.3	1234876	C 6 @... 1,8
Supercont	1.4	1234878	C 7 @... 1,8
Supercont	1.7	1454876	C 8 @... 1,8
Supercont	1.12	1564876	C 9 @... 1,8

file2.txt

Supercont	1.9	1479890 
Supercont	1.33	2343434
Supercont	1.4	0099879

outputfile.txt

Supercont	1.9	1479890	T 2 @ 1,3,5
Supercont	1.9	1459998	G 4 @.. 1,2,6
Supercont	1.4	1234878	C 7 @... 1,8

They are all tab-delimited.

leo002000 commented: 3 +3

Reply to this topic

Be a part of the DaniWeb community

We're a friendly, industry-focused community of developers, IT pros, digital marketers, and technology enthusiasts meeting, networking, learning, and sharing knowledge.

mitchems 12 Posting Whiz in Training · Answer 1 · 2010-08-18T20:31:10+00:00

OK, I created 2 text files.

file1.txt

file2.txt

Code that matches the first two columns:

use strict;
use warnings;
open(FILE1,"<file1.txt");
open(FILE2,"<file2.txt");
open (OUT,">outputfile.txt");
my @arr1=<FILE1>;
my @arr2=<FILE2>;
my $x=0;
for (@arr1){
	chomp;
	my($col1,$col2,$rest)=split(/\t/);
	my($col21,$col22,$rest2)=split(/\t/,$arr2[$x]);
	 $col22=~s/(\n|\r)//g;
	 if($col1 eq $col21 && $col2 eq $col22){
	 	print OUT "$_\n";
	 }
	$x++;
}
close FILE1;
close FILE2;
close OUT;

outputfile.txt

3       4
5       6
9       1

Is that what you want? Also, are there any more columns in these files? If so, which entire row do you wish to print to the file? The one from file1 or file2 or both?

leo002000 2 Junior Poster in Training · Answer 2 · 2010-08-19T02:10:44+00:00

Many thanks for the time you have spent on this. Your assumption/logic is correct. I can get this to work on purely numerical data (as you have detailed below) but when I attempt to use them on my actual text files which look like this... I get empty outputfile.txt

Also I would like to print all the columns from file1.txt as shown in my example below (outputfile.txt)

file1.txt

Supercont 1.9 1479890 T 2 @ 1,3,5
Supercont 1.9 1459998 G 4 @.. 1,2,6
Supercont 1.8 1234876 C 6 @... 1,8

File2.txt

Supercont 1.9 1479890
Supercont 1.33 2343434
Supercont 1.4 0099879

Outputfile.txt

Supercont 1.9 1479890 T 2 @ 1,3,5

Many thanks for your help :)

leo002000 2 Junior Poster in Training · Answer 3 · 2010-08-19T03:26:00+00:00

Yes your 100% correct. If if I was looking to find a match anywhere in the 2 columns, so say for example...

file1.txt

1 6

2 8

file2.txt

3 9

1 6

outputfile

1 6

and also how would I print the other columns associated with the match from file1.txt? say columns 1,2,3,4 which obviously I am not showing here...

Many thanks

:)

mitchems 12 Posting Whiz in Training · Answer 4 · 2010-08-19T20:52:52+00:00

One more quick clarification, so you said anywhere in the two columns? So, if I got this right you want to check to see if the two columns of any of the rows in file1 are located anywhere in file2? If so, you want to print out the line from file1? Right?

leo002000 2 Junior Poster in Training · Answer 5 · 2010-08-20T03:33:24+00:00

Mike you are an absolute gent... It works a treat. Would it be possible to put in some additional commenting, so that I am able to get some of it into my head (particularly) the hashing bit...

Thanks a million

mitchems 12 Posting Whiz in Training · Answer 6 · 2010-08-20T20:52:27+00:00

Ok, the hashing bit. What I did is I read through the second file, which you are using as a reference to match against the elements in the first file.

my %chash; #defines a hash in scope that the later loop can see
for (@arr2){
	chomp;
	my($col1,$col2,$rest)=split(/\t/); #splits the row into 2 columns
	my $ckey="$col1$col2"; #concats the first two columns into one variable
	$chash{$ckey}=1; #sets the value of the hash to 1 for that key
        #example: $chash{Supercont1.9}=1 so it points the key Supercont1.9 to 1
}

Then I roll through file1 by traversing the @arr1 and concat the first two columns:

for (@arr1){
	chomp;
	my($col1,$col2,$rest)=split(/\t/);
	$col1="" if(!$col1); # I do this so warnings will not complain
	$col2="" if(!$col2); # I do this so warnings will not complain
	my $ckey="$col1$col2"; #concat the first two columns of file1
	if($chash{$ckey}){ #if the value exists (set to 1) it was in file2
		print OUT "$_\n"; #print the entire row from file1 to outputfile
	}
#if the value does NOT exist in the hash, do nothing and go on to the next row
}

I hope that's clearer. Look into hashes in Perl. They are very powerful and can point to anything - values, other hashes, subroutines, arrays and more. IMO hashes are one of the most useful elements of Perl.