954,523 Members — Technology Publication meets Social Media
Username:
Password:
Lost login information?
Have something to say? Contribute New Article Reply to this Article

help with b-tree using two files and hash/key lookoop

Hi,

So I am modifying some code from discussion thread: http://www.daniweb.com/software-development/perl/threads/323871

I am still very much a newbie, but am trying to get a better handle on hashes - so if I've messed mine up please tell me ;).

I am attaching 3 files. One is an image of the scenario I am evaluating and the other two are the data files to be use by this code.

While looking at the image - we have two sets of data -
1) gene id [c0], scaffold location [c1], start [c2], stop [c3]and other info [c4]
2) hairpin ids [h0], scaffold locations [h1], start [h2], stop [h3],other info [h4], and sequence read [h15].

Note: c2 may be > c3 or < c3; likewise h2 may be > h3 or < h3 as these sequences have directionality and can read right to left or left to right.

I want to:
find every occurrence where h0 eq c0 (ie. scaffold1000 = scaffold1000) with the hairpin scaffold being the $key

when the hairpin scaffold $key matches any gene scaffold location [c1], I need it to test for conditions (ie. if h2<=c2 && h3>c2 etc.). However, there will be numerous genes with a matching scaffold for a single hairpin and I only want the output for the gene located closest to the hairpin start and stop, hence the need for a b-tree.

I'm thinking I should run the b-tree to locate the closes gene before testing the conditions, but I'm not sure. If there is another way to run this I'm open to suggestions.

Thanks everyone!

#!/usr/bin/perl;
#comparepremiRNAtocoding-pulloverlap3.pl;
use strict;
use warnings;
use FileHandle;
use Data::Dumper;


open(H,"<$ARGV[0]")||die"open h failed";
open(C,"<$ARGV[1]")||die"open c failed";

my %h_hash=();
while(<H>)  # first load h into hash of arrays
{
  chomp;
  my $key=(split/\t/)[1];   # split by tab and make column 2 (index 1) the key (i.e. the scaffolds);
  push @{$h_hash{$key}},[split(/\t/)];
}
close H;
while(<C>) # then sequentially read this file and do the compares on the fly
{
  chomp;
  my @c=split/\t/;    #  split by tab
  foreach my $i ( 0 .. $#{ $h_hash{$c[1]} } ) # for each array of this type (scaffold)
	{  
	   if (  (($h_hash{$c[1]}[$i][2] <= $c[2] #outside hits
		&& $h_hash{$c[1]}[$i][3] <= $c[2] 
		&& $h_hash{$c[1]}[$i][2] <= $c[3] 
		&& $h_hash{$c[1]}[$i][3] <= $c[3])) 
	      || (($h_hash{$c[1]}[$i][2] >= $c[2]
		&& $h_hash{$c[1]}[$i][3] >= $c[2]  
		&& $h_hash{$c[1]}[$i][2] >= $c[3] 
		&& $h_hash{$c[1]}[$i][3] >= $c[3]))  )
    		{ 
# 		need a B-tree for every $key[1] (aka scaffold) so that for every hairpin array I can which coding "region" is nearest- calculate distance between $h_hash{$c[1]}[$i][2] and $h_hash{$c[1]}[$i][3] to $c[2] and $c[3] foreach $i ( 0 .. $#{ $h_hash{$c[1]} } ) (i.e. for each $c[2] and $c[3] where $c[1] eq $key[1]) 
# such as: 
#$h_hash{$c[1]}[$i][2] - $c[2] = |$a|
#$h_hash{$c[1]}[$i][2] - $c[3] = |$b|
#$h_hash{$c[1]}[$i][3] - $c[2] = |$d|
#$h_hash{$c[1]}[$i][3] - $c[3] = |$e|
# for all $c[2] and $c[3] where $key[1] == $c[1]
# sort ($a <==> $b) and select the smallest value - return scalar to be used in print output
				
			
		   {print "$h_hash{$c[1]}[$i][0]."\t".$h_hast{$c[1]}[$i][4]."\t outside"."\t".$c[0]."\t".$c[1]."\t".$h_hash{$c[1]}[$i][2]."\t".$h_hash{$c[1]}[$i][3]."\t".$h_hash{$c[1]}[$i][15]."\n"";}
		}

	 else if      (  (($h_hash{$c[1]}[$i][2] <= $c[2] #inside hits:
			&& $h_hash{$c[1]}[$i][3] <= $c[2] 
			&& $h_hash{$c[1]}[$i][2] >= $c[3] 
			&& $h_hash{$c[1]}[$i][3] >= $c[3])) 
		      || (($h_hash{$c[1]}[$i][2] >= $c[2]
			&& $h_hash{$c[1]}[$i][3] >= $c[2]  
			&& $h_hash{$c[1]}[$i][2] <= $c[3] 
			&& $h_hash{$c[1]}[$i][3] <= $c[3]))  )
               	{ print "$h_hash{$c[1]}[$i][0]."\t".$h_hast{$c[1]}[$i][4]."\t inside"."\t".$c[0]."\t".$c[1]."\t".$h_hash{$c[1]}[$i][2]."\t".$h_hash{$c[1]}[$i][3]."\t".$h_hash{$c[1]}[$i][15]."\n""}

	else  (
		{ print "$h_hash{$c[1]}[$i][0]."\t".$h_hast{$c[1]}[$i][4]."\t overlap"."\t".$c[0]."\t".$c[1]."\t".$h_hash{$c[1]}[$i][2]."\t".$h_hash{$c[1]}[$i][3]."\t".$h_hash{$c[1]}[$i][15]."\n"";}
  }
}
close C;
exit 0;
Attachments dani-sampleimage.txt (0.32KB) dani-sampleGenedata.txt (0.94KB) dani-sampleHairpindata.txt (1.06KB)
bio-grad
Junior Poster in Training
50 posts since Oct 2010
Reputation Points: 10
Solved Threads: 1
 

perhaps, rather than a b-tree, someone could help me put together a method of finding the "gene" with the minimum distance to the "hairpin" by doing something like

my @dist = ( abs($h_hash{$c[1]}[$i][2] - $c[2]), abs($h_hash{$c[1]}[$i][3] - $c[2]), abs($h_hash{$c[1]}[$i][2] - $c[3]), abs($h_hash{$c[1]}[$i][3] - $c[3]) );

@dist = sort {$a <==> $b} @dist;
$min = $dist[0]; # this will find the distance to a single gene and would need to be done again for every array with a matching key (scaffold) and then I would want to find the min of each of these.

I just don't know how to reference back into the hash once the entry with the minimum distance from the hairpin is determined.

bio-grad
Junior Poster in Training
50 posts since Oct 2010
Reputation Points: 10
Solved Threads: 1
 

perhaps, rather than a b-tree, someone could help me put together a method of finding the "gene" with the minimum distance to the "hairpin" by doing something like

my @dist = ( abs($h_hash{$c[1]}[$i][2] - $c[2]), abs($h_hash{$c[1]}[$i][3] - $c[2]), abs($h_hash{$c[1]}[$i][2] - $c[3]), abs($h_hash{$c[1]}[$i][3] - $c[3]) );

@dist = sort {$a <==> $b} @dist; $min = $dist[0]; # this will find the distance to a single gene and would need to be done again for every array with a matching key (scaffold) and then I would want to find the min of each of these.

I just don't know how to reference back into the hash once the entry with the minimum distance from the hairpin is determined.

Due to my unfamiliarity with the b-tree concept, I would favour your second idea for finding the minimum distance. I don't have any more computer time today but will try to find a way tomorrow.

What I might suggest (but haven't worked out yet) is using the Tie:File module to treat the Genedata file as an array that you can loop through repeatedly. As you find the nearest gene for each hairpin you could save the index of the Genedata array in the hairpin hash, or in another hash that links hairpin with nearest gene. I think that's promising but won't know if it works until I try it tomorrow.

d5e5
Practically a Posting Shark
810 posts since Sep 2009
Reputation Points: 159
Solved Threads: 159
 

Hi d5e5!

Thanks for checking my post. Your suggestions sounds promising. I am trying to work on it with another coworker tomorrow and will post if we come up with something. How does the rest of the code look? I believe this was originally some of your code from another project you helped me with last year. I've modified it a good bit since then, and I really appreciate your help once again.

bio-grad
Junior Poster in Training
50 posts since Oct 2010
Reputation Points: 10
Solved Threads: 1
 

One quick question and then I hope to have some time this afternoon: I see your program loads a module called FileHandle but doesn't seem to do anything with any of that module's properties or methods. I'm not familiar with the FileHandle module. Is it an important part of your solution, or can I ignore it for now?

Avoid using single-letter bare words such as 'H' and 'C' when opening files. It may not have caused you a problem yet but it seems like fingernails on chalkboard to Perl programmers.
# Bareword filehandles such as SARAN are package globals. Use lexical filehandles.
# Prefer the three-argument form of open, especially if the filename is not hardcoded.
# Include the filename in the error message. ~Comment by Sinan Ünür on StackOverflow

The data structure in your hash looks unnecessarily complex but I'll have to experiment with that this afternoon before recommending a change. In principle, with the right data structures you can refer to any hairpin scaffold data by its key. I'll post more this afternoon.

d5e5
Practically a Posting Shark
810 posts since Sep 2009
Reputation Points: 159
Solved Threads: 159
 

As I mentioned before, this is some old code someone helped me with. I believe you can ignore FileHandle for now and thanks for the fyi on the single-letter words when opening files... btw, can you tell me why we have call a file by two things (ie. H, $hairpin)?

bio-grad
Junior Poster in Training
50 posts since Oct 2010
Reputation Points: 10
Solved Threads: 1
 
As I mentioned before, this is some old code someone helped me with. I believe you can ignore FileHandle for now and thanks for the fyi on the single-letter words when opening files... btw, can you tell me why we have call a file by two things (ie. H, $hairpin)?

I'm not sure I understand the question but the open statement associates a file handle with a file name. The file name consists of the path and name of the file. The file handle represents a reference to some location in memory for internal use by perl. I'm not a computer scientist so that may not be exactly correct.

The following is obviously not complete, but I modified the structure of the hairpin hash so that the hairpin id can serve as a key within the scaffold location hash. The goal is to be able to store and look up the saved data for each hairpin in the scaffold. Let me know if this structure seems better or worse than what you had.

#!/usr/bin/perl;
#comparepremiRNAtocoding-pulloverlap3.pl;
use strict;
use warnings;
use Data::Dumper;

#You can delete the following line. My testing setup works better without command-line arguments
@ARGV = qw(dani-sampleHairpindata.txt dani-sampleGenedata.txt) if @ARGV == 0;

my $hairpin_filename = $ARGV[0];
my $gene_filename = $ARGV[1];

#open a file for reading and associate it with a lexical filehandle
open(my $fh, '<', $hairpin_filename)||die "open $hairpin_filename failed: $!";

my %h_hash=();
while(<$fh>)  # first load h into hash of arrays
{
  chomp;
  my @cols = split/\t/;#split by tab
  my $scaffold_loc = $cols[1];#make column 2 (index 1) the scaffold location key (i.e. the scaffolds);
  my $hairpin_id = $cols[0];#hairpin_id also needs to be a key so you can look up data later
  $h_hash{$scaffold_loc}{$hairpin_id} = [@cols];
}
close $fh;
print Dumper(\%h_hash);#Dump hairpin data structure

#Now if we want to lookup the data for, for example hairpin E2 in scaffold3
#we can do it as follows:
print join ', ', @{$h_hash{scaffold3}{E2}}, "\n";

#Open another file. We *could* associate it with a different filehandle
#but since $fh has been closed, we can use it again.
open($fh, '<', $gene_filename)||die "open $gene_filename failed: $!";

while(<$fh>){#Read genedata file line by line
    #calculate distance of gene to each hairpin
}
close $fh;
d5e5
Practically a Posting Shark
810 posts since Sep 2009
Reputation Points: 159
Solved Threads: 159
 

I may need to see more of the code to see where you are going with this, but so far I'm confused as to why the hairpin id becomes a new hash key rather than a gene id as the idea is to compare the distances between a series of gene start and stops on a scaffold to the hairpin on that scaffold and then determine the gene with the minimum distance to the hairpin start or stop. I would have thought that the gene id that is found to be the min would become a key that could then be used to pull the data from that gene id to be printed in the output. But I'm a total newbie, so I will defer to you and wait to see more code. Thanks again for the help.
Btw, I won't be able to check back in until tomorrow evening EST.

bio-grad
Junior Poster in Training
50 posts since Oct 2010
Reputation Points: 10
Solved Threads: 1
 
I may need to see more of the code to see where you are going with this, but so far I'm confused as to why the hairpin id becomes a new hash key rather than a gene id as the idea is to compare the distances between a series of gene start and stops on a scaffold to the hairpin on that scaffold and then determine the gene with the minimum distance to the hairpin start or stop. I would have thought that the gene id that is found to be the min would become a key that could then be used to pull the data from that gene id to be printed in the output. But I'm a total newbie, so I will defer to you and wait to see more code. Thanks again for the help. Btw, I won't be able to check back in until tomorrow evening EST.


You're right, the script I posted hasn't done anything with the gene data yet. It seemed strange to me that the hairpin hash had only scaffold as key, as scaffold is not unique. I may still be confused. I'll have another look on the weekend.

d5e5
Practically a Posting Shark
810 posts since Sep 2009
Reputation Points: 159
Solved Threads: 159
 
I may need to see more of the code to see where you are going with this, but so far I'm confused as to why the hairpin id becomes a new hash key rather than a gene id as the idea is to compare the distances between a series of gene start and stops on a scaffold to the hairpin on that scaffold and then determine the gene with the minimum distance to the hairpin start or stop. I would have thought that the gene id that is found to be the min would become a key that could then be used to pull the data from that gene id to be printed in the output. But I'm a total newbie, so I will defer to you and wait to see more code. Thanks again for the help. Btw, I won't be able to check back in until tomorrow evening EST.


The following reads the gene data into an array, then builds the hash of hairpin data, which also includes a reference to the row of data from the genedata file that is nearest to each hairpin. So far it just dumps the contents of the hash linking hairpins to the nearest gene data record. Please let me know if I'm on the right track.

#!/usr/bin/perl;
#comparepremiRNAtocoding-pulloverlap3.pl;
use strict;
use warnings;
use List::Util qw[min max];
use Data::Dumper;

#You can delete the following line. My testing setup works better without command-line arguments
@ARGV = qw(dani-sampleHairpindata.txt dani-sampleGenedata.txt) if @ARGV == 0;

my $hairpin_filename = $ARGV[0];
my $gene_filename = $ARGV[1];

my @genedata = read_genedata();
chomp @genedata;

my %h_hash = ();
read_hairpindata();

print Dumper(\%h_hash);#Dump hairpin data structure

sub read_genedata{   
    open(my $fh, '<', $gene_filename)||die "open $gene_filename failed: $!";
    
    #Read entire genedata file into an array
    return <$fh>;
}

sub read_hairpindata{
    open(my $fh, '<', $hairpin_filename)||die "open $hairpin_filename failed: $!";
    
    #Read entire hairpindata file into a hash
    while(<$fh>){
        chomp;
        my @cols = split/\t/;#split by tab
        my $scaffold_loc = $cols[1];#make column 2 (index 1) the scaffold location key (i.e. the scaffolds);
        my $hairpin_id = $cols[0];#hairpin_id also needs to be a key so you can look up data later
        $h_hash{$scaffold_loc}{$hairpin_id}{start} = $cols[2];
        $h_hash{$scaffold_loc}{$hairpin_id}{stop} = $cols[3];
        $h_hash{$scaffold_loc}{$hairpin_id}{other} = [@cols[4..14]];
        $h_hash{$scaffold_loc}{$hairpin_id}{sequence_read} = $cols[15];
        $h_hash{$scaffold_loc}{$hairpin_id}{nearest_gene_data} = nearest_gene($scaffold_loc, $hairpin_id);
    }
}

sub nearest_gene{
    my ($s, $h) = @_; #scaffold location and hairpin_id received as arguments
    return "Gene not found for $s $h" if not exists $h_hash{$s}{$h};
    
    my $save_min = 999999;#Big number for comparing distances
    my @gene_rec;
    
    foreach (@genedata){
        chomp;
        my @c = split;
        my $min = min( abs($h_hash{$s}{$h}{start} - $c[2]),
                    abs($h_hash{$s}{$h}{stop} - $c[2]),
                    abs($h_hash{$s}{$h}{start} - $c[3]),
                    abs($h_hash{$s}{$h}{stop} - $c[3]) );
        if ($min < $save_min){
            $save_min = $min;
            @gene_rec = @c;
        }
    }
    return \@gene_rec;
}
d5e5
Practically a Posting Shark
810 posts since Sep 2009
Reputation Points: 159
Solved Threads: 159
 

I had to modify the function called nearest_gene to consider only genes whose scaffold location matched that of the hairpin. I also modified the variable names for the column data for readability. Please replace the corresponding subroutine in the above script with the following:

sub nearest_gene{
    my ($s, $h) = @_; #scaffold location and hairpin_id received as arguments
    
    my $save_min = 999999;#Big number for comparing distances
    my @gene_rec;
    
    foreach (@genedata){
        chomp;
        my @c = split;
        my ($g_id, $g_scaffold, $g_start, $g_stop, $g_other) = split /\s+/, $_, 5;
        #Added the following statement to test only genes with matching scaffold loc
        next unless $c[1] eq $s;
        
        my $min = min( abs($h_hash{$s}{$h}{start} - $g_start),
                    abs($h_hash{$s}{$h}{stop} - $g_start),
                    abs($h_hash{$s}{$h}{start} - $g_stop),
                    abs($h_hash{$s}{$h}{stop} - $g_stop) );
        if ($min < $save_min){
            $save_min = $min;
            @gene_rec = ($g_id, $g_scaffold, $g_start, $g_stop, $g_other);
        }
    }
    return \@gene_rec;
}
d5e5
Practically a Posting Shark
810 posts since Sep 2009
Reputation Points: 159
Solved Threads: 159
 

I read through the code and there are some bits I'm not following that maybe you could explain? line 17-20 - I don't understand what is happening here. I see you open %h_hash, but I don't understand what is happening next. Is read_hairpindata being loaded into h_hash?

Also, I don't quite follow the work flow as I'm not used to subs calling on other subs. Can you give me a basic run through?

Anyhow, I think you are on the right track. I'll check back in a bit later today.

Thanks again!

bio-grad
Junior Poster in Training
50 posts since Oct 2010
Reputation Points: 10
Solved Threads: 1
 

I read through the code and there are some bits I'm not following that maybe you could explain? line 17-20 - I don't understand what is happening here. I see you open %h_hash, but I don't understand what is happening next. Is read_hairpindata being loaded into h_hash?

Also, I don't quite follow the work flow as I'm not used to subs calling on other subs. Can you give me a basic run through?

Anyhow, I think you are on the right track. I'll check back in a bit later today.

Thanks again!

One of the differences between the read_genedata function, or subroutine ('function' and 'subroutine' mean the same thing in Perl), and the read_hairpindata function is read_genedata returns a list of all the lines in the genedata file which the calling statement assigns to the @genedata array.

Whereas the read_hairpindata function does not return any data. Instead ofreturning something it does something. What it does is add data elements (in the form of hash and array references) to the %h_hash which we have to declare outside the read_hairpindata subroutine so we can later access it from outside the scope of that subroutine... so that's what's going on in lines 17 through 20:Line 17 declares the %h_hash wherin we want to save the hairpin and corresponding nearest gene data.
line 18 calls the read_hairpindata subroutine in a void context meaning it doesn't assign the result to any variable, because the function doesn't return anything, but instead adds data to the %h_hash variable, thereby making it into a complex data structure which links each hairpin to the line of genedata that is the nearest to that hairpin.
Line 20 prints a dump of what's in %h_hash for our own information while we're writing and testing the program. Once we know exactly what data elements we want to retrieve from this hash, and figure out how to convert the references back into useable text, numbers, lists, or whatever, then we will remove line 20 because we won't need to see a data dump any more.
Remember: hashes can really only contain pairs of scalar elements, so to store arrays or other hashes in a hash we have to store references to them. We can later de-reference these references to retrieve what they refer to.

Note that line 42:
$h_hash{$scaffold_loc}{$hairpin_id}{nearest_gene_data} = nearest_gene($scaffold_loc, $hairpin_id);
within the read_hairpindata function calls the nearest_gene function and passes it the scaffold location and hairpin id of the hairpin for which the function will try to return the element in the @genedata array calculated to have the least distance from the hairpin.

One thingI don't understand is why the genedata file has many lines of data -- and therefore many pairs of stop and start locations -- for most genes. Since gene_id is not unique, I'm not sure which line I'm supposed to save for that gene, so I return the line whose start and stop locations have the least distance from the hairpin.

d5e5
Practically a Posting Shark
810 posts since Sep 2009
Reputation Points: 159
Solved Threads: 159
 

Thank you for explaining the hash system and the referencing. I'll have to look into this a bit further so I can use this setup again down the road.

As for the gene file...The gene file has lots of data. This is because of the biology of the thing it is referencing. Essentially this data set tells me the boundaries of the gene (genes: Code1, JG12345 etc.)and which elements are present at which locations within genes (these are identified with the "transcript_id" prefix or ".t" suffix in the ID column).

I am determining which hairpins lie "outside" of a gene, "overlap" a gene, or are "inside" a gene. The element locations within the genes only become relevant when a hairpin is determined to be "inside" the gene. When it is within the boundaries of the gene the most critical pieces of information are the lines indicating 5'UTR, intron, CDS, or 3'UTR in the info column. If a hairpin location falls within one of these regions it is important and my plan is to output the gene info column with the hairpin information so that I know hairpinX resides within the 5'UTR of gene Code1.

If the hairpin lies distant to a gene then the elements within the gene become irrelevant and I was planning to delete them as redundant entries for any hit that was found to be "outside" a gene. In this case I just need to know that hairpinX resides 324 bases away from Code2.

This is why the original code is set to determine if the hairpin is within a gene, if it is not, then it determines the distance between it and genes nearby to find the closest one. If neither of the above scenarios work, then the hairpin is said to overlap a gene (ie. it lies both inside and outside of a gene) and only the gene ID is needed.

Does this help or am I being unclear?

bio-grad
Junior Poster in Training
50 posts since Oct 2010
Reputation Points: 10
Solved Threads: 1
 

Thanks for the explanation. I gather that some of the records in the genedata file may be used only later if certain conditions are met. I think this means that the function that searches for the nearest gene should calculate the distance from hairpin only for those records where the 'other' value is 'gene'. Is that correct? If so we could modify the script to test only the 'gene' rows. I also included an example of how to loop through and access the data stored in the %h_hash. The script still doesn't determine if the hairpin is inside, outside or overlapping the nearest gene, but knowing how to access the data elements in %h_hash is a step in that direction.

#!/usr/bin/perl;
#comparepremiRNAtocoding-pulloverlap3.pl;
use strict;
use warnings;
use List::Util qw[min max];
use Data::Dumper;

#You can delete the following line. My testing setup works better without command-line arguments
@ARGV = qw(dani-sampleHairpindata.txt dani-sampleGenedata.txt) if @ARGV == 0;

my $hairpin_filename = $ARGV[0];
my $gene_filename = $ARGV[1];

my @genedata = read_genedata();
chomp @genedata;

my %h_hash = ();
read_hairpindata();

#print Dumper(\%h_hash);#Dump hairpin data structure
#We can access the data for each hairpin id in each scaffold location
#by looping through %h_hash
foreach my $s(keys %h_hash){
    print "Scaffold location $s\n";
    
    foreach my $h(keys %{$h_hash{$s}}){
        my ($g_id, $g_scaffold, $g_start, $g_stop, $g_other) = $h_hash{$s}{$h}{nearest_gene_data}[0];
        my $nearest_gene = 'Not Found';
        $nearest_gene = $g_id if defined $g_id;
        print "Nearest Gene to Hairpin id $h is $nearest_gene\n";
    }
    print "\n";
}

sub read_genedata{   
    open(my $fh, '<', $gene_filename)||die "open $gene_filename failed: $!";
    
    #Read entire genedata file into an array
    return <$fh>;
}

sub read_hairpindata{
    open(my $fh, '<', $hairpin_filename)||die "open $hairpin_filename failed: $!";
    
    #Read entire hairpindata file into a hash
    while(<$fh>){
        chomp;
        my @cols = split/\t/;#split by tab
        my $scaffold_loc = $cols[1];#make column 2 (index 1) the scaffold location key (i.e. the scaffolds);
        my $hairpin_id = $cols[0];#hairpin_id also needs to be a key so you can look up data later
        $h_hash{$scaffold_loc}{$hairpin_id}{start} = $cols[2];
        $h_hash{$scaffold_loc}{$hairpin_id}{stop} = $cols[3];
        $h_hash{$scaffold_loc}{$hairpin_id}{other} = [@cols[4..14]];
        $h_hash{$scaffold_loc}{$hairpin_id}{sequence_read} = $cols[15];
        $h_hash{$scaffold_loc}{$hairpin_id}{nearest_gene_data} = nearest_gene($scaffold_loc, $hairpin_id);
    }
}

sub nearest_gene{
    my ($s, $h) = @_; #scaffold location and hairpin_id received as arguments
    
    my $save_min = 999999;#Big number for comparing distances
    my @gene_rec;
    
    foreach (@genedata){
        chomp;
        my @c = split;
        my ($g_id, $g_scaffold, $g_start, $g_stop, $g_other) = split /\s+/, $_, 5;
        #Added the following statement to test only genes with matching scaffold loc
        next unless ($c[1] eq $s) and ($c[4] eq 'gene');#Test only when other info contains 'gene'
        
        my $min = min( abs($h_hash{$s}{$h}{start} - $g_start),
                    abs($h_hash{$s}{$h}{stop} - $g_start),
                    abs($h_hash{$s}{$h}{start} - $g_stop),
                    abs($h_hash{$s}{$h}{stop} - $g_stop) );
        if ($min < $save_min){
            $save_min = $min;
            @gene_rec = ($g_id, $g_scaffold, $g_start, $g_stop, $g_other);
        }
    }
    return \@gene_rec;
}


Sorry, I didn't sleep much last night. A family crisis came up which may keep me away from the computer tomorrow and Wednesday at least, so if I don't answer right away that's why.

d5e5
Practically a Posting Shark
810 posts since Sep 2009
Reputation Points: 159
Solved Threads: 159
 

Wow, thanks so much. As for the function that searches for the nearest gene, it should calculate the distance from hairpin only for those records where the 'other' value is 'gene' or empty (in the cases where the ID begins with JG, the 'other' value is empty as there is not as much data for these genes. I'll look this over and try to incorporate into my original script which tests for the "inside" "outside" "overlap" and get back to you after I see how it goes.

I hope all goes well as you deal with your "crisis". Those are never a picnic. Take care.

bio-grad
Junior Poster in Training
50 posts since Oct 2010
Reputation Points: 10
Solved Threads: 1
 

Update - The code works :) Thanks.

Once I've figured out how to access the values in the hash I can use the values to calculate the outside/inside/overlap status. I'm still working through that by trying to figure out how to incorporate $save_min into the output. I tried adding it to the @gene_rec and then including it in line 27, so I could reference it in the print statement, but I keep getting an error that it is uninitialized. I also want to access other data that is part of @gene_rec, but when I tried to reference them from the print statement, I received the same error as with $save_min which I had added to line 27 and in line 65.

Am I understanding things correctly that @gene_rec data can be accessed through $h_hash{$s}{$h}{nearest_gene_data} with $h_hash{$s}{$h}{nearest_gene_data}[0] being the $g_id and $h_hash{$s}{$h}{nearest_gene_data}[4] being the $g_other?

Ideally the output would look like this:

Hairpin_ID, hairpin other, "outside/inside/overlap", gene ID, gene other, distance from hairpin, scaffold, hairpin start, hairpin stop, hairpin sequence

bio-grad
Junior Poster in Training
50 posts since Oct 2010
Reputation Points: 10
Solved Threads: 1
 

Since you have added a value to the array I think you need to increase the limit argument for the split from 5 to 6.

Try modifying line 69 from:
my ($g_id, $g_scaffold, $g_start, $g_stop, $g_other) = split /\s+/, $_, 5;

to: my ($g_id, $g_scaffold, $g_start, $g_stop, $g_other) = split /\s+/, $_, 6; Am I understanding things correctly that @gene_rec data can be accessed through $h_hash{$s}{$h}{nearest_gene_data} with $h_hash{$s}{$h}{nearest_gene_data}[0] being the $g_id and $h_hash{$s}{$h}{nearest_gene_data}[4] being the $g_other?Yes, that part is correct as far as I can tell.

d5e5
Practically a Posting Shark
810 posts since Sep 2009
Reputation Points: 159
Solved Threads: 159
 

Sorry, I found my mistake on line 27. See the comment and the modified statement (now line 28) of the following:

#!/usr/bin/perl;
#comparepremiRNAtocoding-pulloverlap3.pl;
use strict;
use warnings;
use List::Util qw[min max];
use Data::Dumper;

#You can delete the following line. My testing setup works better without command-line arguments
@ARGV = qw(dani-sampleHairpindata.txt dani-sampleGenedata.txt) if @ARGV == 0;

my $hairpin_filename = $ARGV[0];
my $gene_filename = $ARGV[1];

my @genedata = read_genedata();
chomp @genedata;

my %h_hash = ();
read_hairpindata();

#print Dumper(\%h_hash);#Dump hairpin data structure
#We can access the data for each hairpin id in each scaffold location
#by looping through %h_hash
foreach my $s(keys %h_hash){
    print "Scaffold location $s\n";
    
    foreach my $h(keys %{$h_hash{$s}}){
        #Modified following statement to properly dereference the array
        my ($g_id, $g_scaffold, $g_start, $g_stop, $g_other, $dist) = @{$h_hash{$s}{$h}{nearest_gene_data}};
        my $nearest_gene = 'Not Found';
        $nearest_gene = $g_id if defined $g_id;
        print "Nearest Gene to Hairpin id $h is $nearest_gene, distance: $dist\n";
    }
    print "\n";
}

sub read_genedata{   
    open(my $fh, '<', $gene_filename)||die "open $gene_filename failed: $!";
    
    #Read entire genedata file into an array
    return <$fh>;
}

sub read_hairpindata{
    open(my $fh, '<', $hairpin_filename)||die "open $hairpin_filename failed: $!";
    
    #Read entire hairpindata file into a hash
    while(<$fh>){
        chomp;
        my @cols = split/\t/;#split by tab
        my $scaffold_loc = $cols[1];#make column 2 (index 1) the scaffold location key (i.e. the scaffolds);
        my $hairpin_id = $cols[0];#hairpin_id also needs to be a key so you can look up data later
        $h_hash{$scaffold_loc}{$hairpin_id}{start} = $cols[2];
        $h_hash{$scaffold_loc}{$hairpin_id}{stop} = $cols[3];
        $h_hash{$scaffold_loc}{$hairpin_id}{other} = [@cols[4..14]];
        $h_hash{$scaffold_loc}{$hairpin_id}{sequence_read} = $cols[15];
        $h_hash{$scaffold_loc}{$hairpin_id}{nearest_gene_data} = nearest_gene($scaffold_loc, $hairpin_id);
    }
}

sub nearest_gene{
    my ($s, $h) = @_; #scaffold location and hairpin_id received as arguments
    
    my $save_min = 999999;#Big number for comparing distances
    my @gene_rec;
    
    foreach (@genedata){
        chomp;
        my @c = split;
        my ($g_id, $g_scaffold, $g_start, $g_stop, $g_other) = split /\s+/, $_, 6;
        #Added the following statement to test only genes with matching scaffold loc
        next unless ($c[1] eq $s) and ($c[4] eq 'gene');#Test only when other info contains 'gene'
        
        my $min = min( abs($h_hash{$s}{$h}{start} - $g_start),
                    abs($h_hash{$s}{$h}{stop} - $g_start),
                    abs($h_hash{$s}{$h}{start} - $g_stop),
                    abs($h_hash{$s}{$h}{stop} - $g_stop) );
        if ($min < $save_min){
            $save_min = $min;
            @gene_rec = ($g_id, $g_scaffold, $g_start, $g_stop, $g_other,$save_min);
        }
    }
    return \@gene_rec;
}


I still get one uninitialized value error at line 31 so the script needs more work.

d5e5
Practically a Posting Shark
810 posts since Sep 2009
Reputation Points: 159
Solved Threads: 159
 

seems to me that perhaps the trouble lies with $dist when in fact there is no gene present on the scaffold. therefor maybe we could modify line 29 to read:

$nearest_gene = $g_id if defined $g_id && $dist>0;

and then

else print "Nearest Gene to Hairpin id $h is $nearest_gene\n";


would that work? I tried to put it in, but I seem to be messing up my parenthesis or something with the else statement.

bio-grad
Junior Poster in Training
50 posts since Oct 2010
Reputation Points: 10
Solved Threads: 1
 

This question has already been solved

Post: Markdown Syntax: Formatting Help
You
 
View similar articles that have also been tagged: