Hi,

I need help getting a Perl program to work. I have two files, file1 and file2. The contents of file2 are to be searched using the contents of file1.

File2: 2 tab-delimited columns

XM:1120002  complex-solution 
MM:0999111  blue-green solution
UX:1020022  activity unknown, (simple/complex?)

File1: (one column of space-separated strings)

XM:1120002 MM:0999111 UX:1020022 
UX:1020022 XM:1120002
XM:1120002

Output required: 2 tab-delimited columns
Date:
The following were found in file2:

XM:1120002 MM:0999111 UX:1020022    complex-solution;blue-green solution;activity unknown, (simple/complex?)  
UX:1020022 XM:1120002           complex-solution;blue-green solution;activity unknown
XM:1120002              complex-solution

The code is as follows:

use warnings;
use strict;

my $time = scalar localtime(); # get date of search

my $heading = qq(Date of search: $time\nThe Following Matches were Found in File 2:\n);

my %hash1;
my %hash2;

my $file2 = 'file2.txt'; # file 2: file with two columns to be searched with contents of file1.
open my $fh, '<', $file2 or die "can't open $file2:$!";
while (<$fh>) {
    chomp;
    my ($ID, $value) = split /\s+/, $_;
    $hash1{$ID} = $value;
}
 close $fh;

my $file1 = 'file1.txt'; # file1: one column of probe names for searching the second file.

open my $fh_new, '>', 'ouput.txt' or die "can't open file:$!"; # Open output file for writing

my $i=(); 
open $fh, '<', $file1 or die "can't open this file:$!";  #Open query file for reading
print $fh_new $heading;
while ( defined( my $line = <$fh> ) ) {
    chomp $line;
    my ($checked_word[$i]) = split /\s+/,$line;
    $hash2{$checked_word[$i]} = $checked_word[$i];
    if ( $line ) {
        $checked_word[$i] = $line;
        exists $hash{$checked_word[$i}
        ? print $fh_new $checked_word[$i], "\t", $hash{$checked_word}[$i], $/
        : print $fh_new $checked_word, "\t", "######", $/;
    }
}
close $fh or die "can't close file:$!";
close $fh_new or die "can't close file:$!";

I will appreciate any help! Thanks

Hi perly,
Yep, just like what you have done, but with a few modifications.
The code below should solve your problem. I hope this helps.

#!/usr/bin/perl
use warnings;
use strict;
use Carp qw(croak);

# both file names are required
croak "usage: <perl_script> <file1> <file2>" unless @ARGV == 2;

my $time = scalar localtime();    # get date of search

print
  qq(Date of search: $time\nThe Following Matches were Found in  File 2:\n\n);

my ( $file1, $file2 ) = @ARGV;
my $file_data = {};               # ID => description, built from file2

open my $fh, '<', $file2 or croak "can't open $file2: $!";
while ( defined( my $line = <$fh> ) ) {
    $line =~ s/\s+\z//;           # drop the newline and any trailing whitespace
    next unless length $line;     # skip blank lines
    my ( $rec_name, $rec_description ) = split /\s+/, $line, 2;
    $file_data->{$rec_name} = $rec_description;
}
close $fh or croak "can't close $file2: $!";

open $fh, '<', $file1 or croak "can't open $file1: $!";
while ( defined( my $line = <$fh> ) ) {
    $line =~ s/\s+\z//;
    next unless length $line;
    # look up each ID on the line and join the descriptions with ';'
    my $msg = join ';', map { $file_data->{$_} } split ' ', $line;
    print $line, ' ', $msg, "\n";
}
close $fh or croak "can't close $file1: $!";

file1 and file2 have to be given on the command line, like so:

perl_script.pl file1.txt file2.txt

Otherwise the script won't run, and you get a message on how it should be used.

Edited 4 Years Ago by 2teez

Hello Perly again,
The previous code I showed solved the problem, but I think it is kind of noisy. For example, why should I call the open function twice, only to end up doing almost the same thing each time? So, I came up with another one.
Put your open call in a subroutine, call it twice, and use CODEREFs to do what you want with each file, passing both the file name and the CODEREF as the subroutine's parameters. Bingo! More organised and less noisy!
To run this, just type the name of your Perl script on the CLI; you don't need to specify file1.txt and file2.txt, because the file names are written into the script.

#!/usr/bin/perl
use warnings;
use strict;
use Carp qw(croak);

my $time = scalar localtime();    # get date of search
print
  qq(Date of search: $time\nThe Following Matches were Found in  File 2:\n\n);

my $file_data = {};               # a hash ref. to use

# read_file subroutine to read in the file
# and a subroutine ref to work on the file

read_file( 'file2.txt', \&file_handler2 );
read_file( 'file1.txt', \&file_handler1 );

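sub read_file {
    my ( $filename, $code_ref ) = @_;

    open my $fh, '<', $filename or croak "can't open $filename: $!";
    while (<$fh>) {
        s/^\s+|\s+$//g;       # trim both ends; /g is needed to apply both alternatives
        next if $_ eq q{};    # skip blank lines
        $code_ref->($_);
    }
    close $fh or croak "can't close $filename: $!";
    return;
}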

sub file_handler1 {
    my ($line) = @_;
    my $msg = q{};
    foreach my $data ( split /\s+/, $line ) {
        $msg .= sprintf "%s;", $file_data->{$data};
    }
    $msg =~ s/;$//;
    print $line, ' ', $msg, $/;
}

sub file_handler2 {
    my ( $rec_name, $rec_description ) = split /\s+/, $_[0], 2;
    return $file_data->{$rec_name} = $rec_description;
}

OUTPUT:

Date of search: Sat Jul  7 19:51:02 2012
The Following Matches were Found in  File 2:

XM:1120002 MM:0999111 UX:1020022 complex-solution;blue-green solution;activity unknown, (simple/complex?)
UX:1020022 XM:1120002 activity unknown, (simple/complex?);complex-solution
XM:1120002 complex-solution
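
One case the sample data does not exercise: if a line in file1 contains an ID that has no entry in file2, `$file_data->{$data}` is undef and `use warnings` reports "Use of uninitialized value". Below is a minimal sketch of one way to guard the lookup. The helper name `describe_line`, the ID `ZZ:0000000` and the choice of `######` as a placeholder (borrowed from the original post) are assumptions for illustration, not part of the script above.

```perl
use strict;
use warnings;

# Hypothetical helper: look up every ID on a line and join the descriptions.
# IDs missing from the lookup table get a placeholder instead of undef.
sub describe_line {
    my ( $line, $lookup, $placeholder ) = @_;
    my @descriptions;
    foreach my $id ( split ' ', $line ) {
        push @descriptions,
          exists $lookup->{$id} ? $lookup->{$id} : $placeholder;
    }
    return join ';', @descriptions;
}

# Sample table taken from the thread's file2
my $file_data = {
    'XM:1120002' => 'complex-solution',
    'MM:0999111' => 'blue-green solution',
};

# ZZ:0000000 is a made-up ID that is not in the table
print describe_line( 'XM:1120002 ZZ:0000000', $file_data, '######' ), "\n";
# prints: complex-solution;######
```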

I hope this helps

Edited 4 Years Ago by 2teez

Hi 2teez,

Thanks for the two codes. They both worked. For some reason, my output was like this:



    XM:1120002 MM:0999111 UX:1020022 complex-solution;blue-green solution;ac
    tivity unknown, (simple/complex?)
    UX:1020022 XM:1120002 activity unknown, (simple/complex?);complex-solution
    XM:1120002 complex-solution

I also tried, unsuccessfully, to write the output to a file and to have the columns separated by a tab. The amended code is as follows:



    use warnings;
    use strict;
    use Carp qw(croak);

    my $time = scalar localtime(); # get date of search
    print
    qq(Date of search: $time\nThe Following Matches were Found in File 2:\n\n);

    my $file_data = {}; # a hash ref. to use

    # read_file subroutine to read in the file
    # and a subroutine ref to work on the file
    read_file( 'file2.txt', \&file_handler2 );
    read_file( 'file1.txt', \&file_handler1 );
    sub read_file {
    my ( $filename, $code_ref ) = @_;
    open my $fh, '<', $filename or croak "can't open file: $0 - $!";
           while (<$fh>) {
                s/^\s+|\s+$//;
                $code_ref->($_);
            }
            return;
    }

    #open my $fh_new, '>', 'ouput.txt' or die "can't open file:$!"; 

    sub file_handler1 {
            my ($line) = @_;
            my $msg = q{};
            foreach my $data ( split /\s+/, $line ) {
                $msg .= sprintf "%s;", $file_data->{$data};
                }
                $msg =~ s/;$//;
                print $line, ' ',"\t", $msg, "\n", $/;
    }
    #print my $fh_new $line, ' ',"\t", $msg, "\n", $/;

    sub file_handler2 {
                my ( $rec_name, $rec_description ) = split /\s+/, $_[0], 2;
                return $file_data->{$rec_name} = $rec_description;
    }
    #close $fh_new or croak "can't close file:$!";


Kindly explain the following code lines:

1. split /\s+/, $_[0], 2

2. my $msg = q{};

3. $file_data->{$rec_name}


The initial confusion I had was how to search file2 with the multiple line elements in File1. I will be happy for some explanation.

Thank you!

Edited 4 Years Ago by perly: corrections

Hi Perly,

> Thanks for the two codes. They both worked. For some reason,
> my output was like this:
> XM:1120002 MM:0999111 UX:1020022 complex-solution;blue-green solution;ac
> tivity unknown, (simple/complex?)
> UX:1020022 XM:1120002 activity unknown, (simple/complex?);complex-solution
> XM:1120002 complex-solution

Both codes work because the second is just a refined version of the first.

The output you got is what you specified in the original post.
If you need to separate the columns with a tab or write the output to another file, all you need to do is modify the subroutine called **file_handler1**, like so:



    sub file_handler1 {
        my ($line) = @_;
        open my $fh, '>', 'output.txt' or croak "can't open file: $0 - $!";
        print {$fh} $heading, $/;
        $msg .= sprintf "\n%s\t", $line;
        foreach my $data ( split /\s+/, $line ) {
            $msg .= sprintf "\t%s", $file_data->{$data};
        }
        print {$fh} $msg, $/;
    }


Please note that the variable $msg is now being used as a global variable, i.e. it is not declared within the subroutine, and that a new variable, $heading, is now being used as well. Please check the complete script below.

> Kindly explain the following code lines:
> 1. split /\s+/, $_[0], 2
> 2. my $msg = q{};
> 3. $file_data->{$rec_name}


1. I'm sure you know the "split" function. If not, run:


    perldoc -f split    


   What may be confusing are the $_[0] and the number '2'.
   I used $_[0] to get the first element of the parameter list passed to the subroutine, since parameters are passed to a subroutine as a flat list (in @_) in Perl.
   So, $_[0] could also have been written as:
   a.) shift @_; OR my $line = shift;
   b.) my ($line) = @_;
   The number '2' is the LIMIT argument of the "split" function, which defines the maximum number of fields the expression is split into. With a LIMIT of 2, the line is split only at the first run of whitespace, so the description keeps its internal spaces.

2. my $msg = q{}; is just another way of writing my $msg = ''; i.e. it assigns an empty string. q{} is the single-quote operator, just as qq{} is the double-quote operator. See


    perldoc perlop 


for more info.

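3. Instead of using a hash, which of course would also work, we used a hash reference. $file_data->{$rec_name} dereferences $file_data and accesses the value stored under the key $rec_name; here it stores the description of each file2 line under its name. The subroutine then returns the assigned value.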
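
The three points above can be tried out in isolation. This is only a small sketch using one sample line from file2; the variable names `$record`, `@all_fields` and `@two_fields` are made up for the illustration.

```perl
use strict;
use warnings;

# 1. split with and without LIMIT on a sample file2 line
my $record = "MM:0999111\tblue-green solution";

my @all_fields = split /\s+/, $record;       # splits at every run of whitespace
my @two_fields = split /\s+/, $record, 2;    # splits at the first run only

print scalar(@all_fields), "\n";    # 3: the description itself got split
print $two_fields[1], "\n";         # blue-green solution

# 2. q{} is an empty single-quoted string, identical to ''
my $msg = q{};
print "msg is empty\n" if $msg eq '';

# 3. A hash reference: -> dereferences it before the {key} lookup
my $file_data = {};
$file_data->{ $two_fields[0] } = $two_fields[1];
print $file_data->{'MM:0999111'}, "\n";    # blue-green solution
```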

> The initial confusion I had was how to search file2 with the multiple line elements in File1. I will be happy for some explanation.

For me, file2 is the key to handling file1. How?
Since file2 has all the names contained in file1, each tied to a particular description, we use a hash or hash ref. to store a key and value for each line of file2, then use that to resolve file1. That is, for each line of file1, we split it into a list and use each element as a key in the hash, then print the result.
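
The same two steps, reduced to a sketch that holds the sample data from the original post in memory instead of reading files (the array names and `%description_for` are assumptions for this example only):

```perl
use strict;
use warnings;

# Sample data from the original post, held in memory instead of files
my @file2_lines = (
    "XM:1120002\tcomplex-solution",
    "MM:0999111\tblue-green solution",
    "UX:1020022\tactivity unknown, (simple/complex?)",
);
my @file1_lines = ( 'XM:1120002 MM:0999111 UX:1020022', 'UX:1020022 XM:1120002' );

# Step 1: file2 becomes a hash, ID => description
my %description_for;
foreach my $record (@file2_lines) {
    my ( $id, $description ) = split /\s+/, $record, 2;
    $description_for{$id} = $description;
}

# Step 2: each ID on a file1 line is used as a key into that hash
foreach my $line (@file1_lines) {
    my @found = map { $description_for{$_} } split /\s+/, $line;
    print $line, "\t", join( ';', @found ), "\n";
}
```

Reading file2 first matters: the hash has to be complete before any line of file1 is looked up.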

The script below separates the columns with a tab and prints the output to a file called **output.txt**




    #!/usr/bin/perl
    use warnings;
    use strict;
    use Carp qw(croak);

    my $time = scalar localtime();    # get date of search
    my $heading =
      qq(Date of search: $time\nThe Following Matches were Found in  File 2:\n);

    my $file_data = {};               # a hash ref. to use
    my $msg       = qq{   };

    # read_file subroutine to read in the file
    # and a subroutine ref to work on the file

    read_file( 'file2.txt', \&file_handler2 );
    read_file( 'file1.txt', \&file_handler1 );

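    sub read_file {
        my ( $filename, $code_ref ) = @_;

        open my $fh, '<', $filename or croak "can't open $filename: $!";
        while (<$fh>) {
            s/^\s+|\s+$//g;       # trim both ends; /g is needed to apply both alternatives
            next if $_ eq q{};    # skip blank lines
            $code_ref->($_);
        }
        close $fh or croak "can't close $filename: $!";
        return;
    }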

    sub file_handler1 {
        my ($line) = @_;
        open my $fh, '>', 'output.txt' or croak "can't open file: $0 - $!";
        print {$fh} $heading, $/;
        $msg .= sprintf "\n%s\t", $line;
        foreach my $data ( split /\s+/, $line ) {
            $msg .= sprintf "\t%s", $file_data->{$data};
        }
        print {$fh} $msg, $/;
    }

    sub file_handler2 {
        my ( $rec_name, $rec_description ) = split /\s+/, $_[0], 2;
        return $file_data->{$rec_name} = $rec_description;
    }




I hope this helps.
Please, mark this as solved if satisfied.

Edited 4 Years Ago by 2teez

Hi 2teez,

I really appreciate your explanations and the new code. The code works with the correct output on small and medium-sized files, but it hangs with large files (file2 ~ 35000 rows and file1 ~ 7000 rows). I'm sorry, I forgot to mention that there may be more than 3 strings per line in file1, sometimes up to 23.

Hi Perly,

> The code works with the correct output on small and medium-sized files, but it hangs with large files (file2 ~ 35000 rows and file1 ~ 7000rows)

Which of the codes: the first, the second, or the modified second? And when you say it hangs, is it the execution of the code or the opening of the generated output file?

I should also mention that you could have the same desired result with code number 1 and 2, by doing this from the CLI:

    perl_script.pl file1.txt file2.txt > output.txt    # code 1

    perl_script.pl > output.txt   # code 2

Please, also include a

    return;

just after

     print {$fh} $msg, $/;

in file_handler1 subroutine
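
A side note on the large-file run: in the modified second code, file_handler1 reopens output.txt with '>' for every line of file1 and rewrites the whole accumulated $msg each time, so the amount written grows roughly with the square of the number of lines. Whether that is what made the run look hung is a guess, but it is cheap to avoid. A minimal sketch, assuming the output is opened once and the handler is a closure over that filehandle; `make_line_writer` is a hypothetical name, and the in-memory handle only keeps the example self-contained:

```perl
use strict;
use warnings;

# Lookup table as built from file2 (sample data from the thread)
my $file_data = {
    'XM:1120002' => 'complex-solution',
    'UX:1020022' => 'activity unknown, (simple/complex?)',
};

# Hypothetical factory: returns a handler that writes to an already-open handle
sub make_line_writer {
    my ( $out_fh, $lookup ) = @_;
    return sub {
        my ($line) = @_;
        my @found = map { $lookup->{$_} } split /\s+/, $line;
        print {$out_fh} $line, "\t", join( "\t", @found ), "\n";
        return;
    };
}

# Open the output once. An in-memory handle is used here;
# put 'output.txt' in place of \$report to write a real file.
my $report = q{};
open my $out_fh, '>', \$report or die "can't open output: $!";
print {$out_fh} "The Following Matches were Found in File 2:\n";

my $write_line = make_line_writer( $out_fh, $file_data );
$write_line->($_) for ( 'UX:1020022 XM:1120002', 'XM:1120002' );

close $out_fh or die "can't close output: $!";
print $report;
```

In the script above, the returned code reference would be passed as the second argument to read_file for file1.txt, in place of \&file_handler1.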

Hope this helps

Edited 4 Years Ago by 2teez

That helped a lot. Thanks. I got an output in a file. Unfortunately, there is still one problem. The output is truncated. For example I get something like:

XM:1120002 MM:0999111 UX:1020022    complex-solution    blue-green  activity

instead of :

XM:1120002 MM:0999111 UX:1020022    complex-solution    blue-green solution activity unknown, (simple/complex?) 

The program seems to output the first string only for every search string.

Edited 4 Years Ago by perly: corrections

Hi Perly,
Are you sure your editor doesn't have its word wrap enabled?
Test with the raw data you gave in the original post and see. If you are not getting the correct output, then you have altered the script somewhere; otherwise you should get something like this:

XM:1120002 MM:0999111 UX:1020022 complex-solution blue-green solution activity unknown, (simple/complex?) 

Hi 2teez,

You're right!!! I think I inadvertently altered something; I turned off word wrap, downloaded a fresh copy of the modified code 2 (with return; added), and it worked perfectly. Thanks for your patience and help.

Edited 4 Years Ago by perly: corrections
