Hi all,

Basically I am attempting to determine whether the first 2 columns (tab delimited file) from file1.txt match the first 2 columns of file2.txt. If they do I want to output the entire row of file1.txt to outputfile.txt...

I can do this using vbs but due to the amount of data that this is going to have to trawl through it's not an option. Can anyone please help?

Many thanks :)

So let's be clear... you have 2 tab delimited text files. What do they look like? Like this?

file1.txt

1 2
3 4
5 6
7 8

file2.txt
2 1
3 4
5 6
8 7

And you want to see if BOTH values in BOTH columns match? That is, in this example the output would be:

outputfile.txt
3 4
5 6

Correct?

OK, I created 2 text files.

file1.txt

1	2
3	4
5	6
7	8
9	1

file2.txt

2	1
3	4
5	6
8	7
9	1

Code that matches the first two columns:

use strict;
use warnings;
open(FILE1,"<file1.txt");
open(FILE2,"<file2.txt");
open (OUT,">outputfile.txt");
my @arr1=<FILE1>;
my @arr2=<FILE2>;
my $x=0;
for (@arr1){
	chomp;
	my($col1,$col2,$rest)=split(/\t/);
	my($col21,$col22,$rest2)=split(/\t/,$arr2[$x]);
	 $col22=~s/(\n|\r)//g;
	 if($col1 eq $col21 && $col2 eq $col22){
	 	print OUT "$_\n";
	 }
	$x++;
}
close FILE1;
close FILE2;
close OUT;

outputfile.txt

3       4
5       6
9       1

Is that what you want? Also, are there any more columns in these files? If so, which entire row do you wish to print to the file? The one from file1 or file2 or both?

Many thanks for the time you have spent on this. Your assumption/logic is correct. I can get this to work on purely numerical data (as you have detailed below) but when I attempt to use them on my actual text files which look like this... I get empty outputfile.txt

Also I would like to print all the columns from file1.txt as shown in my example below (outputfile.txt)

file1.txt

Supercont 1.9 1479890 T 2 @ 1,3,5
Supercont 1.9 1459998 G 4 @.. 1,2,6
Supercont 1.8 1234876 C 6 @... 1,8


File2.txt

Supercont 1.9 1479890
Supercont 1.33 2343434
Supercont 1.4 0099879

Outputfile.txt

Supercont 1.9 1479890 T 2 @ 1,3,5


Many thanks for your help :)

I copied the contents of your file1.txt and file2.txt over mine and converted them to tab delimited files (the tabs don't come through when you paste them here). I ran my original program against the two files and I got the outputfile that you wished. The code I originally wrote should work on either numeric or text (that's why I used "eq" instead of "=="). Make sure you have real tab-delimited files and try it again.

Yes your 100% correct. If if I was looking to find a match anywhere in the 2 columns, so say for example...

file1.txt

1 6

2 8


file2.txt

3 9

1 6


outputfile

1 6


and also how would I print the other columns associated with the match from file1.txt? say columns 1,2,3,4 which obviously I am not showing here...

Many thanks

:)

Edited 6 Years Ago by leo002000: n/a

One more quick clarification, so you said anywhere in the two columns? So, if I got this right you want to check to see if the two columns of any of the rows in file1 are located anywhere in file2? If so, you want to print out the line from file1? Right?

Here is the code that I believe will work for your purposes:

use strict;
use warnings;
open(FILE1,"<file1.txt");
open(FILE2,"<file2.txt");
open (OUT,">outputfile.txt");
my @arr1=<FILE1>;
my @arr2=<FILE2>;
close FILE1;
close FILE2;
my $x=0;
my %chash;
for (@arr2){
	chomp;
	my($col1,$col2,$rest)=split(/\t/);
	my $ckey="$col1$col2";
	$chash{$ckey}=1; #this puts the first two columns of file2 in a hash
}

for (@arr1){
	chomp;
	my($col1,$col2,$rest)=split(/\t/);
	$col1="" if(!$col1);
	$col2="" if(!$col2);
	my $ckey="$col1$col2";
	if($chash{$ckey}){
		print OUT "$_\n";
	}
}
close OUT;

Yes, I know I reuse variable names. Scope allows me to.

Here are my source files:

file1.txt

Supercont	1.9	1479890	T 2 @ 1,3,5
Supercont	1.9	1459998	G 4 @.. 1,2,6
Supercont	1.3	1234876	C 6 @... 1,8
Supercont	1.4	1234878	C 7 @... 1,8
Supercont	1.7	1454876	C 8 @... 1,8
Supercont	1.12	1564876	C 9 @... 1,8

file2.txt

Supercont	1.9	1479890 
Supercont	1.33	2343434
Supercont	1.4	0099879

outputfile.txt

Supercont	1.9	1479890	T 2 @ 1,3,5
Supercont	1.9	1459998	G 4 @.. 1,2,6
Supercont	1.4	1234878	C 7 @... 1,8

They are all tab-delimited.

Comments

Mike you are an absolute gent... It works a treat. Would it be possible to put in some additional commenting, so that I am able to get some of it into my head (particularly) the hashing bit...

Thanks a million

Ok, the hashing bit. What I did is I read through the second file, which you are using as a reference to match against the elements in the first file.

my %chash; #defines a hash in scope that the later loop can see
for (@arr2){
	chomp;
	my($col1,$col2,$rest)=split(/\t/); #splits the row into 2 columns
	my $ckey="$col1$col2"; #concats the first two columns into one variable
	$chash{$ckey}=1; #sets the value of the hash to 1 for that key
        #example: $chash{Supercont1.9}=1 so it points the key Supercont1.9 to 1
}

Then I roll through file1 by traversing the @arr1 and concat the first two columns:

for (@arr1){
	chomp;
	my($col1,$col2,$rest)=split(/\t/);
	$col1="" if(!$col1); # I do this so warnings will not complain
	$col2="" if(!$col2); # I do this so warnings will not complain
	my $ckey="$col1$col2"; #concat the first two columns of file1
	if($chash{$ckey}){ #if the value exists (set to 1) it was in file2
		print OUT "$_\n"; #print the entire row from file1 to outputfile
	}
#if the value does NOT exist in the hash, do nothing and go on to the next row
}

I hope that's clearer. Look into hashes in Perl. They are very powerful and can point to anything - values, other hashes, subroutines, arrays and more. IMO hashes are one of the most useful elements of Perl.

This question has already been answered. Start a new discussion instead.