help with b-tree using two files and hash/key lookoop - Page 2

Question

d5e5 109 Master Poster

13 Years Ago

When we update the code is it possible change line 76
next unless defined $g_other and $g_other eq 'gene'
to also run if $g_other eq ' ' as there are additional queries which should be examined which have nothing in the column $g_other column.

I moved the following statement from the read_genedata sub to the nearest_gene sub.

next unless defined $g_other and $g_other eq 'gene'
                  or
                    defined $g_id and $g_id =~ m/^JG/
                    and !defined $g_other;#Read only 'gene' records

Also I modified the script to read the gene data into a hash of arrays instead of a plain array so the nearest_gene sub can look up only the gene records for a specific scaffold which serves as a key to the hash. Looking up data by key in a hash should be faster than reading through an array to find the genes for that scaffold.

Please try the following. On my laptop it takes roughly half an hour and creates an output file called 'hairpin-nearest-gene.txt' containing 8026 lines.

#!/usr/bin/perl;
#comparepremiRNAtocoding-pulloverlap3.pl;
use strict;
use warnings;
use List::Util qw[min max];
use Benchmark;

my $t0 = Benchmark->new;

#You can delete the following line. My testing setup works better without command-line arguments
@ARGV = qw(cab-ALL.LU-premiRNAs.withLOCATION.Vmatch-Nvmr-EST.list bab-All.LuCoding.061511) if @ARGV == 0;
#@ARGV = qw(dani-sampleHairpindata.txt dani-sampleGenedata.txt) if @ARGV == 0;
my $hairpin_filename = $ARGV[0];
my $gene_filename = $ARGV[1];

#Save output to a file instead of STDOUT
#because we are printing debug info to STDOUT
my $outfilename = 'hairpin-nearest-gene.txt';

my %genedata;#Hash of genedata arrays for each scaffold
$| = 1;#Flush print buffer

print "Reading $gene_filename into hash of arrays. This could take a while.\n";
read_genedata();#Call subroutine to read selected records from file into hash

my $t1 = Benchmark->new;
my $td = timediff($t1, $t0);
print "Reading gene file into hash of arrays took:",timestr($td),"\n";

print "Reading $hairpin_filename into hash. This could take a while.\n";
my %h_hash = ();
read_hairpindata();
my $t2 = Benchmark->new;
$td = timediff($t2, $t1);
print "Building the hairpin-to-nearest-gene hash took:",timestr($td),"\n";

#We can access the data for each hairpin id in each scaffold location
#by looping through %h_hash
my $header = 'Hairpin_ID, hairpin other, outside/inside/overlap,'
               . 'gene ID, gene other, distance from hairpin, scaffold,'
               . 'hairpin start, hairpin stop, hairpin sequence';#Heading line
               
#Because we're printing debug info (time spent, record counts) to screen
#Let's open an output file for the hairpin-nearest-gene results
print "Writing output to $outfilename. This could take a while.\n";

open my $fh_output, '>', $outfilename or die "Failed to open $outfilename: $!";
print $fh_output $header, "\n";
        
foreach my $s(keys %h_hash){
    foreach my $h(keys %{$h_hash{$s}}){
        #dereference the genedata array
        my ($g_id, $g_scaffold, $g_start, $g_stop, $g_other, $dist) = @{$h_hash{$s}{$h}{nearest_gene_data}};
        
        #Dereference array reference and join array elements
        my $hairpin_other = join '', @{$h_hash{$s}{$h}{other}};
        my $hairpin_start = $h_hash{$s}{$h}{start};
        my $hairpin_stop = $h_hash{$s}{$h}{stop};
        my $hairpin_seq = $h_hash{$s}{$h}{sequence_read};
        
        if (defined $g_id){
            my $oio = calc_rel_pos($hairpin_start, $hairpin_stop, $g_start, $g_stop);
            print $fh_output "$h,$hairpin_other,$oio,$g_id,$g_other,$dist,$s,$hairpin_start,$hairpin_stop,$hairpin_seq\n";
        }
        else{
            print $fh_output "$h,$hairpin_other,,,,,$s,$hairpin_start,$hairpin_stop,$hairpin_seq\n";
        }
    }
}

my $t3 = Benchmark->new;
$td = timediff($t3, $t2);
print "Writing results to $outfilename took:",timestr($td),"\n";

$td = timediff($t3, $t0);
print "Total tile taken running this script:", timestr($td), "\n";

#Subroutines start here------------------------------------------
sub read_genedata{
    #loop through genedata file and add required records into hash
    open(my $fh, '<', $gene_filename)||die "open $gene_filename failed: $!";
    while (<$fh>){
        next if m/^\s*$/;#Skip blank lines
        chomp;
        my ($g_id, $g_scaffold, $g_start, $g_stop, $g_other) = split /\s+/, $_, 6;
        #Moved the following to nearest_gene sub
        #next unless defined $g_other and $g_other eq 'gene'
        #          or
        #            defined $g_id and $g_id =~ m/^JG/
        #            and !defined $g_other;#Read only 'gene' records
        push @{$genedata{$g_scaffold}}, [$g_id, $g_scaffold, $g_start, $g_stop, $g_other];
    }
}

sub read_hairpindata{
    open(my $fh, '<', $hairpin_filename)|| die "open $hairpin_filename failed: $!";

    #Read entire hairpindata file into a hash
    while(<$fh>){
        chomp;
        my @cols = split/\t/;#split by tab
        next if scalar @cols < 6;#Skip blank or incomplete lines
        
        my $scaffold_loc = $cols[1];#make column 2 (index 1) the scaffold location key (i.e. the scaffolds);
        my $hairpin_id = $cols[0];#hairpin_id also needs to be a key so you can look up data later
        $h_hash{$scaffold_loc}{$hairpin_id}{start} = $cols[2];
        $h_hash{$scaffold_loc}{$hairpin_id}{stop} = $cols[3];
        $h_hash{$scaffold_loc}{$hairpin_id}{other} = [@cols[4..14]];
        $h_hash{$scaffold_loc}{$hairpin_id}{sequence_read} = $cols[15];
        $h_hash{$scaffold_loc}{$hairpin_id}{nearest_gene_data} = nearest_gene($scaffold_loc, $hairpin_id);
    }
}

sub nearest_gene{
    my ($s, $h) = @_; #scaffold location and hairpin_id received as arguments
    
    my $save_min = 999999;#Big number for comparing distances
    my @gene_rec;
    
    foreach (@{$genedata{$s}}){
        chomp;

        my ($g_id, $g_scaffold, $g_start, $g_stop, $g_other) = @$_;
                #Moved the following statement from read_genedata sub
                next unless defined $g_other and $g_other eq 'gene'
                  or
                    defined $g_id and $g_id =~ m/^JG/
                    and !defined $g_other;#Read only 'gene' records
        
        my $min = min( abs($h_hash{$s}{$h}{start} - $g_start),
                    abs($h_hash{$s}{$h}{stop} - $g_start),
                    abs($h_hash{$s}{$h}{start} - $g_stop),
                    abs($h_hash{$s}{$h}{stop} - $g_stop) );

        if ($min < $save_min){
            $save_min = $min;
            @gene_rec = ($g_id, $g_scaffold, $g_start, $g_stop, $g_other,$save_min);
        }
    }
    return \@gene_rec;
}

sub calc_rel_pos{
    my ($h_start, $h_stop, $g_start, $g_stop) = @_;
    
    if (is_between($h_start, $g_start, $g_stop)
        && is_between($h_stop, $g_start, $g_stop)){
        return 'inside';
    }
    
    if (!is_between($h_start, $g_start, $g_stop)
    && is_between($h_stop, $g_start, $g_stop)
    ||
    is_between($h_start, $g_start, $g_stop)
    && !is_between($h_stop, $g_start, $g_stop)){
    return 'overlap';
    }
    
    if (!is_between($h_start, $g_start, $g_stop)
    && !is_between($h_stop, $g_start, $g_stop)){
        return 'outside';
    }
}

sub is_between{
    my ($x, $lower_limit, $upper_limit) = @_;
    return ( $x >= $lower_limit && $x <= $upper_limit );
}

d5e5 109 Master Poster

13 Years Ago

For each line in the gene data file we create a key value pair in the %genedata hash. The key is the gene scaffold and the value is a reference to an array containing the line of gene data that we split by spaces.

So line 120 iterates through this array. To de-reference this array we have to look it up in the hash and wrap it in dereferencing syntax resulting in @{$genedata{$s}}

d5e5 109 Master Poster

13 Years Ago

There should by a line for every hairpin id but not every line in the hairpin data file if there are multiple lines having the same hairpin id. I was assuming hairpin id can be a unique key so it can be the key to the %h_hash. If hairpin ids aren't unique then the hash will contain data for only the last line of each bunch of lines for a particular hairpin_id. If hairpin ids aren't unique, I guess I still don't understand what hairpins represent.

As for an explanation of lines 120 - 123: the following has to dereference an array reference referring to an array of array references. That's why it looks complicated. The complex syntax has nothing to do with its being in a subroutine. Each scaffold key in the hash contains a reference to an array of references to arrays containing the split gene data lines for that scaffold.

An iteration repeatedly puts the current element in the $_ variable by default (if no other variable is specified). In this case $_ contains an array reference so we must dereference it to make it an array again, like so: @$_

foreach (@{$genedata{$s}}){
        chomp;

        my ($g_id, $g_scaffold, $g_start, $g_stop, $g_other) = @$_;

d5e5 109 Master Poster

13 Years Ago

Each hairpin id should is unique - that's why I couldn't figure out why there were so few lines.

Sorry, my mistake. I just loaded the hairpin-nearest-gene.txt containing the output from yesterday's test run into a text editor and scrolled to the end and it shows there are 168026 lines. I don't know where I got the impression there were only 8026. Maybe too much Benadryl. It looks like the hairpin ids in the input are unique, so that's good. You can see the output at http://d5e5.fyels.com/dab

Reply to this topic

Be a part of the DaniWeb community

We're a friendly, industry-focused community of developers, IT pros, digital marketers, and technology enthusiasts meeting, networking, learning, and sharing knowledge.

bio-grad 0 Junior Poster in Training · Answer 1 · 2011-08-21T10:49:40+00:00

Hi,
I was out all day today so I was able to let it run last night and today (so I haven't tried the new code). It finished! There are 168028 lines in the file. I will need to go over the output and double-check to be sure, but I think it worked :)

I did get some error messages though. However, they were more numerous than the terminal will show me. Anyhow, here are the one's which I have:

Use of uninitialized value $s in string eq at Desktop/8-19-11run.pl line 105, <$fh> line 167869.

Use of uninitialized value $scaffold_loc in hash element at Desktop/8-19-11run.pl line 89, <$fh> line 167869.

Use of uninitialized value $hairpin_id in hash element at Desktop/8-19-11run.pl line 89, <$fh> line 167869.

Use of uninitialized value in join or string at Desktop/8-19-11run.pl line 54.

Use of uninitialized value $hairpin_start in concatenation (.) or string at Desktop/8-19-11run.pl line 64.

Use of uninitialized value $hairpin_stop in concatenation (.) or string at Desktop/8-19-11run.pl line 64.

Use of uninitialized value $hairpin_seq in concatenation (.) or string at Desktop/8-19-11run.pl line 64.

d5e5 109 Master Poster · Answer 2 · 2011-08-21T23:36:38+00:00

Hi,
I was out all day today so I was able to let it run last night and today (so I haven't tried the new code). It finished! There are 168028 lines in the file. I will need to go over the output and double-check to be sure, but I think it worked :)
I did get some error messages though. However, they were more numerous than the terminal will show me. Anyhow, here are the one's which I have:
Use of uninitialized value $s in string eq at Desktop/8-19-11run.pl line 105, <$fh> line 167869.
Use of uninitialized value $scaffold_loc in hash element at Desktop/8-19-11run.pl line 89, <$fh> line 167869.
Use of uninitialized value $hairpin_id in hash element at Desktop/8-19-11run.pl line 89, <$fh> line 167869.
Use of uninitialized value in join or string at Desktop/8-19-11run.pl line 54.
Use of uninitialized value $hairpin_start in concatenation (.) or string at Desktop/8-19-11run.pl line 64.
Use of uninitialized value $hairpin_stop in concatenation (.) or string at Desktop/8-19-11run.pl line 64.
Use of uninitialized value $hairpin_seq in concatenation (.) or string at Desktop/8-19-11run.pl line 64.

That's great that the program finally completed. Apparently looking up the nearest gene for each hairpin is taking a lot of time. I'll see if I can speed up that part of the script.

When I look at line 167869 of the hairpin file I see there is a blank line which probably caused the non-fatal error messages. I can modify the program to check for and skip blank lines in the hairpin file which hopefully will avoid those errors.

***Update***
I just noticed that the last script I sent you had an error when it tries to print to the output file, so please don't bother running it as it will take several hours and then fail to create the output file. I'll send you a corrected version as soon as I see if there is any way to speed it up.

bio-grad 0 Junior Poster in Training · Answer 3 · 2011-08-22T00:17:21+00:00

When we update the code is it possible change line 76

next unless defined $g_other and $g_other eq 'gene'

to also run if $g_other eq ' ' as there are additional queries which should be examined which have nothing in the column $g_other column.

d5e5 109 Master Poster · Answer 4 · 2011-08-22T02:11:43+00:00

When we update the code is it possible change line 76
next unless defined $g_other and $g_other eq 'gene'
to also run if $g_other eq ' ' as there are additional queries which should be examined which have nothing in the column $g_other column.

I'm not sure about that. I've been trying to modify the script to save the gene data in a more suitable data structure that would allow faster look up of the genes for each scaffold when determining nearest gene, but so far I'm not getting good results.

Maybe we would need to consider a radical change to the program design, such as reading the gene data into a database table. Then you could access it with SQL queries that could include or filter out rows according to values in the columns, and it might run faster. Do you have SQLite for Perl?

Meanwhile, here's the corrected version of the last script I posted. It skips blank lines in the hairpin file to avoid the error messages you were getting.

#!/usr/bin/perl;
#comparepremiRNAtocoding-pulloverlap3.pl;
use strict;
use warnings;
use List::Util qw[min max];
use Benchmark;

my $t0 = Benchmark->new;

#You can delete the following line. My testing setup works better without command-line arguments
#@ARGV = qw(cab-ALL.LU-premiRNAs.withLOCATION.Vmatch-Nvmr-EST.list bab-All.LuCoding.061511) if @ARGV == 0;
@ARGV = qw(dani-sampleHairpindata.txt dani-sampleGenedata.txt) if @ARGV == 0;
my $hairpin_filename = $ARGV[0];
my $gene_filename = $ARGV[1];
my @genedata;
$| = 1;#Flush print buffer

read_genedata();#Call subroutine to read selected records from file into array

my $t1 = Benchmark->new;
my $td = timediff($t1, $t0);
print "Reading gene file into array took:",timestr($td),"\n";

@genedata = sort by_scaffold @genedata;
my $t2 = Benchmark->new;
$td = timediff($t2, $t1);
print "Sorting the gene array took:",timestr($td),"\n";

my %h_hash = ();
read_hairpindata();
my $t3 = Benchmark->new;
$td = timediff($t2, $t0);
print "Building the hairpin-to-nearest-gene hash took:",timestr($td),"\n";

#We can access the data for each hairpin id in each scaffold location
#by looping through %h_hash
my $header = 'Hairpin_ID, hairpin other, outside/inside/overlap,'
               . 'gene ID, gene other, distance from hairpin, scaffold,'
               . 'hairpin start, hairpin stop, hairpin sequence';#Heading line
               
#Because we're printing debug info (time spent, record counts) to screen
#Let's open an output file for the hairpin-nearest-gene results
my $outfilename = 'hairpin-nearest-gene.txt';
open my $fh_output, '>', $outfilename or die "Failed to open $outfilename: $!";
print $fh_output $header, "\n";
        
foreach my $s(keys %h_hash){
    foreach my $h(keys %{$h_hash{$s}}){
        #dereference the genedata array
        my ($g_id, $g_scaffold, $g_start, $g_stop, $g_other, $dist) = @{$h_hash{$s}{$h}{nearest_gene_data}};
        
        #Dereference array reference and join array elements
        my $hairpin_other = join '', @{$h_hash{$s}{$h}{other}};
        my $hairpin_start = $h_hash{$s}{$h}{start};
        my $hairpin_stop = $h_hash{$s}{$h}{stop};
        my $hairpin_seq = $h_hash{$s}{$h}{sequence_read};
        
        if (defined $g_id){
            my $oio = calc_rel_pos($hairpin_start, $hairpin_stop, $g_start, $g_stop);
            print $fh_output "$h,$hairpin_other,$oio,$g_id,$g_other,$dist,$s,$hairpin_start,$hairpin_stop,$hairpin_seq\n";
        }
        else{
            print $fh_output "$h,$hairpin_other,,,,,$s,$hairpin_start,$hairpin_stop,$hairpin_seq\n";
        }
    }
}

sub read_genedata{
    #loop through genedata file and push required records into array
    my $count = 0;
    open(my $fh, '<', $gene_filename)||die "open $gene_filename failed: $!";
    while (<$fh>){
        next if m/^\s*$/;#Skip blank lines
        chomp;
        my ($g_id, $g_scaffold, $g_start, $g_stop, $g_other) = split /\s+/, $_, 6;
        next unless defined $g_other and $g_other eq 'gene'
                  or
                    defined $g_id and $g_id =~ m/^JG/
                    and !defined $g_other;#Read only 'gene' records
        $count++;
        print "$count records saved to genedata array\n";
        push @genedata, $_;
    }
}

sub read_hairpindata{
    open(my $fh, '<', $hairpin_filename)|| die "open $hairpin_filename failed: $!";
my $count;
    #Read entire hairpindata file into a hash
    while(<$fh>){
        chomp;
        my @cols = split/\t/;#split by tab
        next if scalar @cols < 6;#Skip blank or incomplete lines
        
        my $scaffold_loc = $cols[1];#make column 2 (index 1) the scaffold location key (i.e. the scaffolds);
        my $hairpin_id = $cols[0];#hairpin_id also needs to be a key so you can look up data later
        $h_hash{$scaffold_loc}{$hairpin_id}{start} = $cols[2];
        $h_hash{$scaffold_loc}{$hairpin_id}{stop} = $cols[3];
        $h_hash{$scaffold_loc}{$hairpin_id}{other} = [@cols[4..14]];
        $h_hash{$scaffold_loc}{$hairpin_id}{sequence_read} = $cols[15];
        $h_hash{$scaffold_loc}{$hairpin_id}{nearest_gene_data} = nearest_gene($scaffold_loc, $hairpin_id);
        $count++;
        print "$count hairpin added to hash\n";
    }
}

sub nearest_gene{
    my ($s, $h) = @_; #scaffold location and hairpin_id received as arguments
    
    my $save_min = 999999;#Big number for comparing distances
    my @gene_rec;
    my @genes = grep {
        my ($g_id, $g_scaffold, $g_start, $g_stop, $g_other) = split /\s+/, $_, 6;
        $g_scaffold eq $s;
    } @genedata;
    
    foreach (@genes){
        chomp;
        my ($g_id, $g_scaffold, $g_start, $g_stop, $g_other) = split /\s+/, $_, 6;
        my $min = min( abs($h_hash{$s}{$h}{start} - $g_start),
                    abs($h_hash{$s}{$h}{stop} - $g_start),
                    abs($h_hash{$s}{$h}{start} - $g_stop),
                    abs($h_hash{$s}{$h}{stop} - $g_stop) );

        if ($min < $save_min){
            $save_min = $min;
            @gene_rec = ($g_id, $g_scaffold, $g_start, $g_stop, $g_other,$save_min);
        }
    }
    return \@gene_rec;
}

sub calc_rel_pos{
    my ($h_start, $h_stop, $g_start, $g_stop) = @_;
    
    if (is_between($h_start, $g_start, $g_stop)
        && is_between($h_stop, $g_start, $g_stop)){
        return 'inside';
    }
    
    if (!is_between($h_start, $g_start, $g_stop)
    && is_between($h_stop, $g_start, $g_stop)
    ||
    is_between($h_start, $g_start, $g_stop)
    && !is_between($h_stop, $g_start, $g_stop)){
    return 'overlap';
    }
    
    if (!is_between($h_start, $g_start, $g_stop)
    && !is_between($h_stop, $g_start, $g_stop)){
        return 'outside';
    }
}

sub is_between{
    my ($x, $lower_limit, $upper_limit) = @_;
    return ( $x >= $lower_limit && $x <= $upper_limit );
}

sub by_scaffold{
    my ($id_a, $scaffold_a) = split /\s+/, $a;
    my ($id_b, $scaffold_b) = split /\s+/, $b;
    $scaffold_a cmp $scaffold_b;
}

bio-grad 0 Junior Poster in Training · Answer 5 · 2011-08-22T06:59:47+00:00

Thanks so much for the help. Yes, I have SQLite, but I've never used it. What are the advantages to using SQLite? Is it worth the extra time involved to re-write the script?

d5e5 109 Master Poster · Answer 6 · 2011-08-23T21:17:37+00:00

Relational database management software can load not only the data but, if you design your database well, some of the important relationships between data elements. Because the gene data file is rather large, and we have to load the entire file into an array in memory and then loop through the array from the start to find and test rows having the same scaffold as the hairpin, and do this for each hairpin, the script takes a long time. Database software can select subsets (AKA result sets) from the data without loading all the data (together, at one time) into memory, and may do it more efficiently if you first build indexes on the columns by which your selecting.

Another advantage of relational databases is you can do your data processing in a series of steps. In the first program you can load the data files into database tables and build the appropriate indexes. In another step you can build a table associating each hairpin id with the id of the nearest gene. Then you can write one or more programs that query this data in various ways without having to repeatedly load, sort and otherwise massage the original data. That way you can build the database once and then develop and test various queries that shouldn't take as long to run as the job that builds, indexes and links the data.

The downside is I'm not an expert in SQLite and have only played with it because it's free and easy to install. For now I'm just mentioning it as an option in case scanning the gene data array takes too much time, or if you use the same script on an even larger gene data file and get a 'memory full' error or unacceptably long run time.

I don't know if it is worth it to rewrite the program. That would depend on how often and long you need to use it, and how many other queries you need to run on the same data.

bio-grad 0 Junior Poster in Training · Answer 7 · 2011-08-24T05:16:42+00:00

I'll check this out with the data first thing in the morning. I'm curious as to why you saw so little output (~8000 lines) as there should be a line for every hairpin listed in the hairpin file.

Also, would you mind explaining line 120-123. You are referencing, yes? I guess I'm confused about referencing in a sub...

Thanks again. Although learning more about database management, I don't think its necessary for what I'm doing and what I have left to do with the data.

I'll be in touch again tomorrow.

bio-grad 0 Junior Poster in Training · Answer 8 · 2011-08-24T22:37:52+00:00

Each hairpin id should is unique - that's why I couldn't figure out why there were so few lines.

bio-grad 0 Junior Poster in Training · Answer 9 · 2011-09-01T19:34:13+00:00

sorry - i just saw your most recent post. Let met take a look at the data and get back to you. I think you've pretty much solved this one, so thank you.