matching two files

Question

SNN 0 Newbie Poster

16 Years Ago

I have two tab-delimited data sets, Hdata.txt and Edata.txt. Both contain information about two groups of people, and the first column in each holds the individual's last name. I wrote a Perl program to compare them and print out the individuals that are in BOTH data sets. Both of my files are huge and the program takes more than an hour to run, so I was wondering if there is a better way to do this.

I appreciate your help,
Thanks

Here is the program

#!/usr/bin/perl -w

open(Hdata, "Hdata.txt") || die "Can not open the file1\n";
open(Edata, "Edata.txt") || die "Can not open the file2\n";
open(outdata, ">Match.txt") || die "Can not open the file3\n";

# Reads the data #

@H=<Hdata>;
@E=<Edata>;

for($j=1; $j<=$#H; $j++){

    for($k=1; $k<=$#E; $k++){

        $l1=$H[$j];
        chomp $l1;
        @line1 = split(/\t/,$l1);

        $l2=$E[$k];
        chomp $l2;
        @line2 = split(/\t/,$l2);

        $flag=0;

        for($i=1; $i<=$#line1; $i++){

            if ($line1[$i] ne $line2[$i]){
                $flag=1;
                last;
            } # end if

        } # end for i

        if ($flag==0){
            print outdata "$line1[0]\t$line2[0]\n";
        } # end if

    } # end for k
} # end for j

close(Hdata);
close(Edata);
close(outdata);


All 3 Replies

Salem 5,265 Posting Sage

16 Years Ago

> Both of my files are huge and to run this program it takes more than an hour
> so I was wondering if there is a better way to do this.
So you run it, then go have lunch, or find something else to do.

The basic combination algorithm you have means you're not going to get your 1 hour down to, say, 3 seconds. Maybe 10 minutes, so you can go get coffee as opposed to lunch.

You have 3 nested for loops; that's never going to be quick over large files.

One immediate suggestion would be to change

for($j=1; $j<=$#H; $j++){

    for($k=1; $k<=$#E; $k++){
        $l1=$H[$j];
        chomp $l1;
        @line1 = split(/\t/,$l1);

into

for($j=1; $j<=$#H; $j++){
    $l1=$H[$j];
    chomp $l1;
    @line1 = split(/\t/,$l1);

    for($k=1; $k<=$#E; $k++){

@line1 doesn't depend on $k and isn't modified inside the inner loop, so there's no point chomping and splitting it on every iteration.

You might then consider chomping and splitting the whole data set, then running the comparisons.
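As a rough sketch of that idea (not the original poster's code), each data set can be split once into an array of array references, and the comparison loops then run over the pre-split rows. The sub names split_rows and find_matches and the sample rows are made up for illustration. The sketch assumes, as the original loops do, that row 0 is skipped and that columns after the first are the ones compared.

```perl
use strict;
use warnings;

# Split every line once, up front, into an array of array references.
# $rows->[$r][$c] is then column $c of row $r.
sub split_rows {
    my @lines = @_;          # copy, so the caller's array is not modified
    chomp @lines;
    return [ map { [ split /\t/, $_ ] } @lines ];
}

# Same test as the original program: two rows match when every column
# after the first is equal. Row 0 is skipped, as in the original loops.
sub find_matches {
    my ($h_rows, $e_rows) = @_;
    my @matches;
    for my $j (1 .. $#{$h_rows}) {
        my $h = $h_rows->[$j];
        for my $k (1 .. $#{$e_rows}) {
            my $e = $e_rows->[$k];
            my $same = 1;
            for my $i (1 .. $#{$h}) {
                # A missing column in $e is treated as an empty string.
                if ($h->[$i] ne ($e->[$i] // '')) {
                    $same = 0;
                    last;
                }
            }
            push @matches, "$h->[0]\t$e->[0]" if $same;
        }
    }
    return @matches;
}

# Hypothetical sample data standing in for Hdata.txt and Edata.txt.
my @H = ("name\tcity\tage\n", "Smith\tLeeds\t40\n", "Jones\tYork\t31\n");
my @E = ("name\tcity\tage\n", "Smyth\tLeeds\t40\n", "Brown\tBath\t25\n");

print "$_\n" for find_matches(split_rows(@H), split_rows(@E));
```

With real files, the lines read by @H=<Hdata> and @E=<Edata> would be passed to split_rows in place of the sample arrays. This removes the repeated chomp and split, but the number of row comparisons is unchanged.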

How often do you need to run it?
- 10 times a day?
- Once a week?

How many people use it?
- Only you?
- A department of hundreds?

In other words, what's the long-term payoff in cumulative time saved compared to the effort it takes to make it better?


KevinADC 192 Practically a Posting Shark · Answer 1 · 2008-08-13T03:24:25+00:00


> Both of my files are huge

How big are they?

SNN 0 Newbie Poster · Answer 2 · 2008-08-13T23:54:40+00:00

Hi,

Is it possible to dump the two files into a multidimensional array? If yes, how can that be done? If this can be done, then I do not need to chomp and split every time.

Thanks
