I have two tab-delimited data sets, Hdata.txt and Edata.txt. Both contain information about two groups of people, and the first column in each contains the last name of the individual. I wrote a Perl program to compare them and print out the individuals that are in BOTH data sets. Both of my files are huge and to run this program it takes more than an hour so I was wondering if there is a better way to do this.

I appreciate your help,
Thanks

Here is the program

#!/usr/bin/perl -w

 
open(Hdata, "Hdata.txt") || die "Can not open the file1\n";
open(Edata, "Edata.txt") || die "Can not open the file2\n";
open(outdata, ">Match.txt") || die "Can not open the file3\n";

# Reads the data #

@H=<Hdata>;
@E=<Edata>;


for($j=1; $j<=$#H; $j++){

 for($k=1; $k<=$#E; $k++){


  $l1=$H[$j];
  chomp $l1;
  @line1 = split(/\t/,$l1);

  $l2=$E[$k];
  chomp $l2;
  @line2 = split(/\t/,$l2);


  $flag=0;

for($i=1; $i<=$#line1; $i++){ 

  if ($line1[$i] ne $line2[$i]){

  $flag=1;
  last;
  
  } # end if 


}# end for i

if ($flag==0){

print outdata "$line1[0]\t$line2[0]\n";

}#end if

 

} # end for k
} # end for j


close(Hdata);
close(Edata);
close(outdata);

> Both of my files are huge and to run this program it takes more than an hour
> so I was wondering if there is a better way to do this.
So you run it, then go have lunch, or find something else to do.

The basic combination algorithm you have (every line of one file compared against every line of the other) means you're not going to get your 1 hour down to, say, 3 seconds. Maybe 10 minutes, so you can go and get coffee as opposed to lunch.

You have three nested for loops, and that's never going to be that quick over large files.

One immediate suggestion would be to change

for($j=1; $j<=$#H; $j++){

    for($k=1; $k<=$#E; $k++){
        $l1=$H[$j];
        chomp $l1;
        @line1 = split(/\t/,$l1);

into

for($j=1; $j<=$#H; $j++){
    $l1=$H[$j];
    chomp $l1;
    @line1 = split(/\t/,$l1);

    for($k=1; $k<=$#E; $k++){

@line1 doesn't depend on $k and isn't modified inside the inner loop, so there's no point chomping and splitting it on every iteration.

You might then consider chomping and splitting the whole data set, then running the comparisons.
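
As a rough sketch of that idea (not a tested drop-in replacement), each file can be chomped and split once into an array of rows, and the comparison loops then work on the pre-split rows. The helper names load_rows and find_matches and the sample data are made up for illustration; the sketch assumes, as your loops starting at 1 suggest, that row 0 is a header line, and it keeps your rule that two rows match when every column after the first is equal.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read a tab-delimited file handle once and return a reference to an
# array of rows; each row is a reference to an array of fields.
sub load_rows {
    my ($fh) = @_;
    my @rows;
    while (my $line = <$fh>) {
        chomp $line;
        push @rows, [ split /\t/, $line ];
    }
    return \@rows;
}

# Same comparison as the original program: rows match when every column
# after the first is equal. Row 0 is skipped (assumed to be a header).
sub find_matches {
    my ($h_rows, $e_rows) = @_;
    my @matches;
    for my $j (1 .. $#$h_rows) {
        my $h = $h_rows->[$j];
        for my $k (1 .. $#$e_rows) {
            my $e    = $e_rows->[$k];
            my $same = 1;
            for my $i (1 .. $#$h) {
                # A missing field in the E row is treated as an empty string.
                my $e_val = defined $e->[$i] ? $e->[$i] : '';
                if ($h->[$i] ne $e_val) {
                    $same = 0;
                    last;
                }
            }
            push @matches, "$h->[0]\t$e->[0]" if $same;
        }
    }
    return \@matches;
}

# Hypothetical sample data standing in for Hdata.txt and Edata.txt,
# read through in-memory file handles so the sketch runs on its own.
my $h_text = "Name\tCity\tYear\nSmith\tLeeds\t1970\nJones\tYork\t1982\n";
my $e_text = "Name\tCity\tYear\nBrown\tYork\t1982\nGreen\tBath\t1990\n";

open(my $h_fh, '<', \$h_text) or die "Cannot open H data: $!\n";
open(my $e_fh, '<', \$e_text) or die "Cannot open E data: $!\n";
my $matches = find_matches(load_rows($h_fh), load_rows($e_fh));
close($h_fh);
close($e_fh);

print "$_\n" for @$matches;    # prints "Jones<TAB>Brown"
```

For the real files you would open "Hdata.txt" and "Edata.txt" instead of the in-memory strings. Note that this still compares every H row against every E row; it only removes the repeated chomp and split work from the inner loop.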


How often do you need to run it?
- 10 times a day?
- Once a week?

How many people use it?
- Only you?
- A department of hundreds?

In other words, what's the long-term payoff in cumulative time saved compared to the effort it takes to make it better?

Hi,

Is it possible to dump the two files into a multi-dimensional array? If yes, how can that be done? If this can be done, then I do not need to chomp and split every time.

Thanks
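
For what it's worth, Perl's usual way to get a two-dimensional structure is an array whose elements are references to arrays. The snippet below is only a small sketch of building and indexing one; the @lines sample data is hypothetical and stands in for what you would get from reading one of your files.

```perl
use strict;
use warnings;

# Hypothetical three-line sample standing in for one of the data files.
my @lines = ("Name\tCity\n", "Smith\tLeeds\n", "Jones\tYork\n");

# Each element of @rows is a reference to an array holding the fields
# of one line, so @rows behaves like a two-dimensional array.
my @rows;
for my $line (@lines) {
    chomp(my $copy = $line);
    push @rows, [ split /\t/, $copy ];
}

print $rows[1][0], "\n";        # row 1, column 0: Smith
print $rows[2][1], "\n";        # row 2, column 1: York
print scalar(@rows), "\n";      # number of rows: 3
print $#{ $rows[1] }, "\n";     # last column index of row 1: 1
```

Once both files are held this way, the comparison loops can use $rows[$j][$i] directly without any further chomp or split.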
