Search a file containing multiple headers and data

Question

perly 0 Newbie Poster

12 Years Ago

Hi,

I have a file containing multiple-headed data (input file 1), and also a second file containing elements for searching the first file.

input file 1:

    UROPA
    sseD    1.2.3.3.3   crimson
    ddsU    2.1.4.1.2   green
    SAMEL
    aadH    7.4.1.1.1   blue
    uuoG    10.1.2.3.4  white
    MOONA
    gmaL    3.4.1.6.7   red
    oolJ    9.1.1.4.1   yellow

input file 2:

    sseD
    aadH
    oolJ
    gmaL

My required output:

    UROPA:
    sseD

    SAMEL:
    aadH

    MOONA:
    oolJ
    gmaL

I have the following code which is not doing what I wanted:

    use strict;
    use warnings;
    my $time = scalar localtime();
    my $heading = qq(Date of search: $time\nThe following were found:\n);
    my %hash;
    my %hash1;
    my %hash2;
    my $input1 = 'input1.txt';
    #open input file1 for reading
    open my $FH1, '<', $input1 or die "can't open $input1:$!";
    while (<$FH1>) 
    {
        chomp;
        if (/^[A-Z]/)
        {
                my $Header;
                $hash{$Header} = $_;
                next;
        }   
        my ($Col1, $col2, $Col3) = split /\t/, $_;  
        $hash1{$Col1}=$Col2;
        $hash2{$Col1}=$Col3;
    } 
    close $FH1;
    my $input2 = 'input2.txt'; #read input file2
    #open output file
    open my $OUTPUT, '>', 'outfile.txt' or die "can't open outfile.txt:$!";
    #Open input file2
    open my $FH2, '<', $input2 or die "Can't open $input2:$!";
    print $OUTPUT $heading;
    while( defined( my $line = <$FH2> ) )
    {
        chomp $line;
        print $OUTPUT $hash{$Header};
        my $matchLine;
        if ($line)
        {           
                $matchLine = $line;         
                exists $hash1{$matchLine}, $hash2{$matchLine}
                ? print $OUTPUT $matchLine, "\t", $hash1{$matchLine}, "\t", $hash2{$matchLine}, $/
                : print $OUTPUT $matchLine, "\t", "none", "\t", "none", $/;

        }   
    }
    close $OUTPUT or die "Can't close $OUTPUT:$!";
    close $FH2 or die "Can't close $FH2:$!";

Can someone please point me in the right direction. Thanks

file-system perl seo

4 Contributors
10 Replies
266 Views
5 Days Discussion Span
Latest Post 12 Years Ago Latest Post by 2teez

All 10 Replies

d5e5 109 Master Poster

12 Years Ago

Look closely at lines 16 - 17.

                my $Header;
                $hash{$Header} = $_;

You declare a lexical variable called $Header and assign no value to it, so its value is undefined. Then you use $Header (whose value is undefined) as a key to $hash. In effect you are assigning all the header values to one key (key = undef) in your hash. Each value replaces the previous value because you always use the same undefined key.

Edited 12 Years Ago by d5e5

d5e5 109 Master Poster

12 Years Ago

You could also try the following:

#!/usr/bin/perl
use strict; 
use warnings; 

open my $fh, '<', 'input1.csv' or die "Failed to open input1.csv $!";
my $save_header;
while (<$fh>){
    my @flds = split;
    if (@flds == 1){
        s/\s//g;
        print "\n" if $save_header and $_ ne $save_header;
        print "$_:\n";
        $save_header = $_;
    }
    else{
        print "$flds[0]\n" if findit($flds[0]);
    }
}

sub findit {
    my $word = shift;
    open my $fh, '<', 'input2.csv' or die "Failed to open input2.csv $!";   
    while (<$fh>) {
        return(1) if index($_,$word) > -1;
    }
}

d5e5 109 Master Poster

12 Years Ago

my $save_header; #define g̶l̶o̶b̶a̶l̶ lexical variable whose scope includes entire script

my @flds = split;#r̶e̶a̶d̶ ̶f̶i̶l̶e̶ ̶c̶o̶n̶t̶e̶n̶t̶s̶ ̶i̶n̶t̶o̶ ̶a̶n̶ ̶a̶r̶r̶a̶y̶ Equivalent to my(@flds) = split(' ', $_, 0);

print "\n" if $save_header and $_ ne $save_header; #print newline i̶n̶ ̶b̶o̶t̶h̶ ̶c̶a̶s̶e̶s̶ before printing a new header (blank line between groups)

$save_header = $_;#assign the value of the current header ($_) to a variable so we can compare in the future to know if we are about to start another header group (i.e. whether we want a blank line separating previous group from current group.

sub findit {#sub routine -opens file and returns m̶a̶t̶c̶h̶e̶d̶ ̶e̶l̶e̶m̶e̶n̶t̶s̶ true (non-zero value) if parameter found in file, false (0) if not found in file.

return(1) if index($_,$word) > -1;# c̶h̶e̶c̶k̶ ̶i̶f̶ ̶s̶e̶a̶r̶c̶h̶ ̶w̶o̶r̶d̶ ̶f̶r̶o̶m̶ ̶i̶n̶p̶u̶t̶ ̶f̶i̶l̶e̶2̶.̶t̶x̶t̶ ̶m̶a̶t̶c̶h̶e̶s̶ ̶a̶ ̶p̶a̶r̶t̶i̶c̶u̶l̶a̶r̶ ̶i̶n̶d̶e̶x̶ ̶e̶l̶e̶m̶e̶n̶t̶ ̶i̶n̶ ̶i̶n̶p̶u̶t̶ ̶f̶i̶l̶e̶1̶ If the word is NOT found in the current record the index will give a value of -1. If the word is found, the index will give a number indicating where in the current record the word was found. I can't explain it better than http://perldoc.perl.org/functions/index.html

Edited 12 Years Ago by d5e5

Reply to this topic

Be a part of the DaniWeb community

We're a friendly, industry-focused community of developers, IT pros, digital marketers, and technology enthusiasts meeting, networking, learning, and sharing knowledge.

perly 0 Newbie Poster · Answer 1 · 2012-09-03T18:14:46+00:00

Thanks for your comment, d5e5.

I have tried to ammend the code (below) as I understood your comments. But, still I have some errrors.

    use strict;
    use warnings;

    my $time = scalar localtime();
    my $heading = qq(Date of search: $time\nThe following were found:\n);

    my %hash;
    my %hash1;
    my %hash2;


    my $input1 = 'input1.txt';

    #open input file1 for reading
    open my $FH1, '<', $input1 or die "can't open $input1:$!";
    while (defined(my $line = <$FH1>)) {
        chomp;
        if (/^[A-Z]/){
                my $Header = $line;
                $hash{$Header} = $Header;
                next;
        }

        my ($Col1, $Col2, $Col3) = split /\t/, $line;   
        $hash1{$Col1}=$Col2;
        $hash2{$Col1}=$Col3;
    } 
    close $FH1;

    my $input2 = 'input2.txt'; #read input file2


    #open output file
    open my $OUTPUT, '>', 'outfile.txt' or die "can't open outfile.txt:$!";

    #Open input file2
    open my $FH2, '<', $input2 or die "Can't open $input2:$!";
    print $OUTPUT $heading;
    while( defined( my $line = <$FH2> ) ){
        chomp $line;
        #my $Header;

        my $matchLine;
        if ($line){

                $matchLine = $line; 
                exists $hash{$matchLine}, $hash1{$matchLine}, $hash2{$matchLine}
                ? print $OUTPUT $matchLine, "\t", $hash{$matchLine}, $hash1{$matchLine}, "\t", $hash2{$matchLine}, $/
                : print $OUTPUT $matchLine, "\t", "none", "\t", "none", "\t", "none", $/;

        }


    }
    close $OUTPUT or die "Can't close $OUTPUT:$!";
    close $FH2 or die "Can't close $FH2:$!";

Regards.

Kwetal · Answer 2 · 2012-09-03T20:03:18+00:00

Sounds like a homework assignment. Here are some tips.

It seems that the order in which the all-caps header words (UROPA, SAMEL, MOONA) appear is important. Use an array to store these words in the order they appear.
```
$headerwords = [
      'UROPA',
      'SAMEL',
      'MOONA'
    ];
```
When reading input2, you have to look up under which header word the colour code (oolJ, aadH) appeared earlier in input1. Therefore, when processing input1, use a hash table to map colour codes to header words.
```
$color2hdr = {
      'uuoG' => 'SAMEL',
      'ddsU' => 'UROPA',
      'gmaL' => 'MOONA',
      'sseD' => 'UROPA',
      'oolJ' => 'MOONA',
      'aadH' => 'SAMEL'
    };
```

You have to output below each header word which colour codes were hit. Use another hash to remember per header the colour codes that appear in input2 by looking them up in %color2hdr.

$hits = {
      'MOONA' => [
                   'oolJ',
                   'gmaL'
                 ],
      'SAMEL' => [
                   'aadH'
                 ],
      'UROPA' => [
                   'sseD'
                 ]
    };

When you have your data structures right, the program itself is trivial.

perly 0 Newbie Poster · Answer 3 · 2012-09-04T03:41:48+00:00

Thanks for your contribution, Kwetal. This is not at all a homework assignment as you mentioned. I appreciate the approach in your code snippets, however, since I will be dealing with very large files (apologies, I did not mention that the input file contents that I showed are only minute parts of the real files), I feel that it may not be practical for my need.

2teez 43 Posting Whiz · Answer 4 · 2012-09-05T08:00:50+00:00

This work perfectly, but you can modify it for your use.

#!/usr/bin/perl
use warnings;
use strict;
use Tie::File;

my $time    = scalar localtime();
my $heading = qq(Date of search: $time\nThe following were found:\n);
my %header;
my $input1 = 'input1.txt';
my $head   = "";

#open input file1 for reading
open my $FH1, '<', $input1 or die "can't open $input1:$!";
while (<$FH1>) {
    s/^\s+|\s+$//;
    if (/^(?<header>[A-Z]+?)$/) {
        $head = $+{header};
    }
    else {
        push @{ $header{$head} }, $_;
    }
}
close $FH1 or die "can't close file: $!";

tie my @tied_file, 'Tie::File', 'input2.txt' or die "can't tie file: $!";

open my $OUTPUT, '>', 'outfile.txt' or die "can't open outfile.txt:$!";

print $OUTPUT $heading;

foreach my $line ( keys %header ) {
    my @matched_line;
    foreach my $value ( @{ $header{$line} } ) {
        my ($col1) = split / /, $value;
        foreach (@tied_file) {
            s/^\s+|\s+$//;
            if ( $_ eq $col1 ) {
                push @matched_line, $value;
            }
        }
    }
    print $OUTPUT $line, ":\n", @matched_line, $/;
}

close $OUTPUT or die "Can't close $OUTPUT:$!";

untie @tied_file;

perly 0 Newbie Poster · Answer 5 · 2012-09-06T06:25:51+00:00

I appreciate your codes, 2teez and d5e5. The two codes worked!!! I have read the codes and tried to comment lines with what I understand them to mean and also indicate not clear about lines that I did not understand. I would be glag if you could help check how far I understand the two codes.

2teez code:

    #!/usr/bin/perl
    use warnings;
    use strict;
    use Tie::File;
    my $time = scalar localtime();
    my $heading = qq(Date of search: $time\nThe following were found:\n);
    my %header;
    my $input1 = 'input1c.txt';
    my $head = "";

    #open input file1 for reading
    open my $FH1, '<', $input1 or die "can't open $input1:$!";
    while (<$FH1>) {
        s/^\s+|\s+$//; ###substitute any number of whitespace (at the beginning/ or at the end) with nothing
        if (/^(?<header>[A-Z]+?)$/) {###??? not clear
        $head = $+{header}; ##increment header?
        }
        else {
            push @{ $header{$head} }, $_; #not clear - push the hash key and value for header or non-headers into an array???
        }
    }
    close $FH1 or die "can't close file: $!";

    tie my @tied_file, 'Tie::File', 'input2.txt' or die "can't tie file: $!";

    open my $OUTPUT, '>', 'outfile.txt' or die "can't open outfile.txt:$!";
    print $OUTPUT $heading;

    foreach my $line ( keys %header ) {
        my @matched_line;
            foreach my $value ( @{ $header{$line} } ) {### not clear
                my ($col1) = split / /, $value;#split  each line in file1 with a space; saved as variable $col1
                foreach (@tied_file) {#input2 loop
                    s/^\s+|\s+$//; ###substitute any number of whitespace (at the beginning/ or at the end of eac line) with nothing
                    if ( $_ eq $col1 ) {#check if there is a match in file1(col1) and file2 elements
                    push @matched_line, $value; #push matched value into array
                    }
                }
            }
    print $OUTPUT $line, ":\n", @matched_line, $/;
    }
    close $OUTPUT or die "Can't close $OUTPUT:$!";
    untie @tied_file;

d5e5 code:

    #!/usr/bin/perl
    use strict;
    use warnings;

    #open input file1 for reading
    open my $fh, '<', 'input1.txt' or die "Failed to open input1.txt $!";
    my $save_header; #define global variable save_header
    while (<$fh>){
    my @flds = split;#read file contents into an array
    if (@flds == 1){ #check for elements in the file
        s/\s//g; #carry out global substitution of all whitespace charcters with nothing
        print "\n" if $save_header and $_ ne $save_header; #print newline in both cases
        print "$_:\n";#print header followed by colon and newline
        $save_header = $_; ###???? not clear
        }
        else{
            print "$flds[0]\n" if findit($flds[0]); #print search word if it is first index element i.e., col 1.
        }
    }

    sub findit {#sub routine -opens file and return matched elements if true
    my $word = shift;
    open my $fh, '<', 'input2.txt' or die "Failed to open input2.csv $!";
    while (<$fh>) {
        return(1) if index($_,$word) > -1;# check if search word from input file2.txt matches a particular index element in input file1
        }
    }

Thank you!

perly 0 Newbie Poster · Answer 6 · 2012-09-08T00:46:47+00:00

Thanks for the explanations, d5e5, and many thanks to you both, 2teez and d5e5, for the codes. I really appreaciate your help!

2teez 43 Posting Whiz · Answer 7 · 2012-09-08T07:01:49+00:00

Hi Perly,
Sorry, I didn't do this on time. However, I think is not too late.
Concerning the portion of the script I gave above, that is not clear. Let me quickly explain and then refer you to documentations that could ground you in your understanding of theses things.

if (/^(?<header>[A-Z]+?)$/) {
        $head = $+{header}; 
        }
        else {
            push @{ $header{$head} }, $_;
        }

Using capture group name as one could use capture group numbering like $1, $2, etc, if the $_ matches lines of strings that are capital letters, using (?<haeder>...) this is capture group name. In the place of 'header', you could use any name, which is then stored in a hash %+ automatically and you could refer to it later as $+{...}. You can do the same thing like so,

        if (/^([A-Z]+?)$/) {
                $head = $1; 
                }
                else {
                    push @{ $header{$head} }, $_;
                }

And it will work. Then we assign the capture name or number to variable $head. Else, using the captured named as key to the hash $header, if $_ is NOT captical, get the line and use as value.

foreach my $value ( @{ $header{$line} } ) {
                my ($col1) = split / /, $value;
                ...

Iterate over the array variables stored in the hash $header as values, then using a split function, take only the first value, after spliting on "space", then compare.
Please, you might have to read perlretut doing, 'perldoc perlretut' will really help.
Hope this helps.

Search a file containing multiple headers and data

Recommended Answers Collapse Answers

All 10 Replies

Recommended Answers