d5e5 109 Master Poster

When I tried to download your attached ref1.txt I got an error message from Daniweb saying "/tmp/Xfx9+ApI.part could not be saved, because the source file could not be read" so I can't see the data.

#!/usr/bin/perl
use strict;
use warnings;

while(my $rec = <DATA>){
    chomp($rec);
    $rec = reverse($rec);
    $rec =~ tr/ATGC/TACG/;
    print "$rec\n";
}
__DATA__
GCTCCTTGGGAAATATAGATCAAATATAGTTCATCGTTTAACTAAACCCG
TCCTTGGGAAATATAGATCAAATATAGTTCATCGTTTAACTAAACCCGGA
CCTTGGGAAATATAGATCAAATATAGTTCATCGTTTAACTAAACCCGGAC

You already know how to reverse a text string. To replace A with T, T with A, etc. you could use the transliteration function $rec =~ tr/ATGC/TACG/; Since I don't get the same output you want, I may have misunderstood the question.

d5e5 109 Master Poster

A table is not a MySQL data type so a function cannot return a table. You can create a view instead.

d5e5 109 Master Poster

Your loop runs while($run eq "yes") but when the user enters yes at the prompt then $run contains yes with a newline character at the end. Try doing a chomp on the $run after assigning the STDIN to it.

print "Send more strings? yes/no";
$run = <STDIN>;
chomp($run);#Remove the newline character from input
d5e5 109 Master Poster

Suppose file.txt contains the following:

CTTA TAAC GACC CCCG CCGA CACG GCAG TGAG CGCA GCAG CGAC GCGT GGCT CTTG TAAT 
AACC AATG CGCT TGCG AAAT CAGC TAGC CCAT TTGA TAAA GTAA GGGC TCGA GAGG ATTT 
GGCA TTAA GCAC GGCT TGTG CCTA CCTC TGGT TTCC GTGT CTAC ACAG TAGT CGGC TGTC 
TATC TGTT CGTC CGAC CGCT
#!/usr/bin/perl
#print_files_in_subdirs.pl
use strict;
use warnings;

my $input_filename = 'file.txt';
my $data = slurp_file($input_filename);

$data =~ s/\s//g;#Remove all space, newline, etc.
$data =~ s/(\w{50})/$1\n/g;

print $data;

sub slurp_file{
    my $filename = shift;
    local $/=undef;
    open my $fh, $filename or die "Couldn't open file: $!";
    my $string = <$fh>;
    return $string;
}

Output:

CTTATAACGACCCCCGCCGACACGGCAGTGAGCGCAGCAGCGACGCGTGG
CTCTTGTAATAACCAATGCGCTTGCGAAATCAGCTAGCCCATTTGATAAA
GTAAGGGCTCGAGAGGATTTGGCATTAAGCACGGCTTGTGCCTACCTCTG
GTTTCCGTGTCTACACAGTAGTCGGCTGTCTATCTGTTCGTCCGACCGCT
d5e5 109 Master Poster

The default iterator variable in a foreach loop if no other variable is supplied.

from perldoc perlvar.

If we supply another variable, such as $fn, in the loop condition then we won't need $_. I just used it out of habit. The foreach loop can work just as well as follows.

foreach my $fn (@files) {
        next if $fn eq "." or $fn eq "..";
        
        if (-d $fn) {
            push @d, $fn;
        }
        else{
            print "$fn\n";
        }
    }
d5e5 109 Master Poster
#!/usr/bin/perl
#print_files_in_subdirs.pl
use strict;
use warnings;
my $startdir = '/home/david/Programming/data';
print_files($startdir);

sub print_files {
    #This subroutine calls itself for each subdirectory it finds.
    my $dir = $_[0];
    my @files = <$dir/*>;
    my @d;
    foreach (@files) {
        next if $_ eq "." or $_ eq "..";
        my $fn = $_;
        if (-d $fn) {
            push @d, $fn;
        }
        else{
            print "$fn\n";
        }
    }
    if (scalar @d == 0) { #If no directories found, $dir is lowest dir in this branch
        return;
    }
    foreach (@d) {
        print_files($_);
    }
}
Mouche commented: Works for me. +6
d5e5 109 Master Poster

For example, you could do it as follows:

#!/usr/bin/perl;
use strict;
use warnings;

my %dict;
my $dictfilename = 'words-english.txt';

open my $fh, '<', $dictfilename or die "Failed to open $dictfilename: $!";
while (my $word = <$fh>){
    chomp $word; #Remove end-of-line character
    my $upper_word = uc($word);#Convert to upper case
    $dict{$upper_word} = undef; #Add word as key to hash (no value associated)
}
close $fh;

my $permsfilename = 'permutations.txt';
open $fh, '<', $permsfilename or die "Failed to open $permsfilename: $!";
while (my $word = <$fh>){
    chomp $word; #Remove end-of-line character
    my $upper_word = uc($word);#Convert to upper case
    if (exists $dict{$upper_word}){
        print "Yes, $word is in the dictionary.\n";
    }
    else{
        print "No, $word is NOT in the dictionary.\n";
    }
}
close $fh;

(Today I learned that 'eta' is an English word naming the seventh letter of the Greek alphabet.)

d5e5 109 Master Poster

I wrote c program that outputs all permutations of a word to a txt file.

ate
aet
tae
tea
eat
eta

I also have a txt file of all the words in the dictionary. I would like to take the first entry in permutations.txt and search dictionary.txt to see if its a valid word then the second entry ect...

I have a little experience using perl, but only for parsing one file. Would it be a good fit for what i'm trying to do. Anybody have some suggestions or point me in the right direction?

You can do this easily in Perl. A hash in Perl works sort of like a dictionary in that it consists of unique keys associated with values, so read all the words from your dictionary file into memory as keys to a hash. Since you don't care about definitions of the words you don't need to associate any values with the keys in your hash, so make the values undef . Having all your dictionary words stored as keys in a hash allows you to test if any given word exists in the hash much more easily than having to look for it in the dictionary file. For an introduction to using hashes in Perl see http://www.perltutorial.org/perl-hash.aspx

All your sample data words are in lower case letters but, in case future data include mixed-case words, make sure to convert every word to either upper or lower case …

d5e5 109 Master Poster

During the first 'foreach' loop, you split each column into $pt and $ps...

I split each row and assign the first two elements of the resulting list to the variables $pt (program title) and $ps (program session).

...and then assign the values of $programs{$pt} into $ps

No, I assign the content of $ps to the $programs{$pt} entry in the %programs hash. A hash contains one or more key-value pairs, where the key must be unique for the hash. This allows us to look up any value in the hash by specifying its key. The key of this particular hash entry is $pt and the value is $ps.

if (exists $programs{$pt}){ isn't that mean you're trying to check the program session whether it exists in the archive file?

No, I'm checking to see if there is an entry in the %programs hash having that program title as key. Remember that the %program hash contains all data read from program.tsv so in effect I'm checking to see if that program title (from the archive file) exists in the program file.

I can't understand the difference between $programs{$pt} in if (exists $programs{$pt}){ and in print $fh3 "$_\t$programs{$pt}\n";

There is no difference. I check if $programs{$pt} exists before trying to print $programs{$pt} because if I try to print a hash value for a key that doesn't exist in the hash perl will give me an error message. For an introduction to using hashes in Perl see http://www.perltutorial.org/perl-hash.aspx

d5e5 109 Master Poster

Because you only need to read each record from the archive file once, you don't need to save it into an array. Just iterate through it.

Because each program title should occur no more than once in the program file, you can read it into a hash for more efficient lookup than having to iterate through an array each time you want to search for a program title. The following works for me:

#!/usr/bin/perl
#match_archive.pl
use strict;
use warnings;

my %programs;#Hash to save program title => program session for each program rec
open my $fh1, '<', 'program.tsv' or die "Failed to open program.tsv: $!";
foreach (<$fh1>){
    chomp;#Remove newline character from end of line
    my ($ps, $pt) = split(/\t/);
    $programs{$pt} = $ps;
}
close $fh1;

open my $fh2, '<', 'archive.tsv' or die "Failed to open archive.tsv: $!";
open my $fh3, '>', 'combine.tsv' or die "Failed to open combine.tsv: $!";#Output
foreach (<$fh2>){
    chomp;#Remove newline character from end of line
    my $pt = (split(/\t/))[1];#Paper title
    if (exists $programs{$pt}){
        print $fh3 "$_\t$programs{$pt}\n";
    }
    else{
        print $fh3 "$_\t*NA*\n";
    }
}
red711 commented: useful suggestions, nice and efficient codes. +0
d5e5 109 Master Poster

I'll have another look when I get some time but the first thing I'd suggest:

use strict;
use warnings;

belong in every non-trivial Perl script. Add them to your script and declare any undeclared variables, such as @sesname, @title1, @all, @title2 and $i.

Another general suggestion: when opening files, use lexical filehandles instead of bare words and test for success or failure to open by including an or die "blah blah: $!" clause.

open my $fh1, '<', 'archive.tsv' or die "Failed to open archive.tsv: $!";
open my $fh2, '<', 'program.tsv' or die "Failed to open program.tsv: $!";
open my $fh3, '>', 'combine.tsv' or die "Failed to open combine.tsv: $!";
d5e5 109 Master Poster

I haven't tried this, but have a look at selectall_arrayref
Looks like @result = @{ $dbh->selectall_arrayref($sql, { Slice => {} }) }; puts all the rows of the result set into an array which you can then loop through as many times as you want.

d5e5 109 Master Poster

I don't think you can re-use a result set with DBI base on this quote from this tutorial:

Cursors are used for sequential fetching operations: records are fetched in the order in which they are stored within the result set. Currently, records cannot be skipped over or randomly accessed. Furthermore, once a row addressed by a cursor has been fetched, it is ``forgotten'' by the cursor. That is, cursors cannot step backwards through a result set.

I think you could do what you want by moving the $sth->execute; into the start of your outer loop. Really you would be re-querying the database each time you iterate through that loop.

An alternative would be to copy each array into an array of arrays and iterate through that. Make sure you copy and save each array the array reference points to, instead of copying the same array ref multiple times, or you will end up with only the last row's data saved.

d5e5 109 Master Poster

You are welcome. Please don't forget to mark this thread 'solved'.

d5e5 109 Master Poster

Data_1.txt

Num Posi
1 2
2 5
3 4

Data_2.txt

Num Star_posi End_posi
1 1 10
2 15 18
3 26 30
#!/usr/bin/perl;
use strict;
use warnings;

my $filename1 = 'Data_1.txt';
my $filename2 = 'Data_2.txt';
my $filename3 = 'output.txt';
my %ranges;

open my $fh, '<', $filename2 or die "Failed to open $filename2: $!";
while (<$fh>){
    next unless m/^\d/; #skip unless line starts with number
    chomp;
    my ($num, $start, $end) = split;
    $ranges{$num}{Star_posi} = $start;
    $ranges{$num}{End_posi} = $end;
}
close $fh;

open $fh, '<', $filename1 or die "Failed to open $filename1: $!";
open my $fho, '>', $filename3 or die "Failed to open $filename3: $!";#Open file for output
while (<$fh>){
    next unless m/^\d/; #skip unless line starts with number
    chomp;
    my ($num, $posi) = split;
    my $range_num = search_ranges($posi);
    if ($range_num){
        #print to output file instead of STDOUT
        print $fho "Data1 record number $num found "
            . "at position $posi found in $ranges{$range_num}{Star_posi} "
            . "and $ranges{$range_num}{End_posi} \n";
    }
    else{
        print "Data1 record number $num not found in Data2\n";
    }
}

sub search_ranges{
    my $pos = shift;
    foreach (keys %ranges){
        return $_ if $pos >= $ranges{$_}{Star_posi}
                    and $pos <= $ranges{$_}{End_posi};
    }
}

output.txt

Data1 record number 1 found at position 2 found in 1 and 10 
Data1 record number 2 found at position 5 found in 1 and 10 
Data1 record number 3 found at position 4 found in 1 and 10
d5e5 109 Master Poster

sorry I prepared the Print code in below.

print "at position  $posi found in $ranges{$num}{Star_posi} and $ranges{$num}{End_posi} \n";

But I have one problem. EX:

Data 1:                  Data2: 
      Num   Posi           Num    star      end
       1     2              1      1         10
       2     5              2      15        18
       3     4              3      26        30

How coould I find Num (3) posi (4) between 1 and 10?

I'm confused by your second example. It looks like your two data sets are side by side in one input file. Is that correct? I assumed from reading your original question that you had two separate input files.

d5e5 109 Master Poster
#!/usr/bin/perl;
use strict;
use warnings;

my $filename1 = 'Data_1.txt';
my $filename2 = 'Data_2.txt';
my %ranges;

open my $fh, '<', $filename2 or die "Failed to open $filename2: $!";
while (<$fh>){
    next unless m/^\d/; #skip unless line starts with number
    chomp;
    my ($num, $start, $end) = split;
    $ranges{$num}{Star_posi} = $start;
    $ranges{$num}{End_posi} = $end;
}
close $fh;

open $fh, '<', $filename1 or die "Failed to open $filename1: $!";
while (<$fh>){
    next unless m/^\d/; #skip unless line starts with number
    chomp;
    my ($num, $posi) = split;
    my $range_num = search_ranges($posi);
    if ($range_num){
        print "Data1 record number $num found in Data2 record number $range_num\n";
    }
    else{
        print "Data1 record number $num not found in Data2\n";
    }
}

sub search_ranges{
    my $pos = shift;
    foreach (keys %ranges){
        return $_ if $pos >= $ranges{$_}{Star_posi}
                    and $pos <= $ranges{$_}{End_posi};
    }
}

Outputs:

Data1 record number 1 found in Data2 record number 1
Data1 record number 2 found in Data2 record number 1
Data1 record number 3 found in Data2 record number 1
Data1 record number 4 found in Data2 record number 2
Data1 record number 5 not found in Data2
d5e5 109 Master Poster

Each hairpin id should is unique - that's why I couldn't figure out why there were so few lines.

Sorry, my mistake. I just loaded the hairpin-nearest-gene.txt containing the output from yesterday's test run into a text editor and scrolled to the end and it shows there are 168026 lines. I don't know where I got the impression there were only 8026. Maybe too much Benadryl. It looks like the hairpin ids in the input are unique, so that's good. You can see the output at http://d5e5.fyels.com/dab

d5e5 109 Master Poster

There should by a line for every hairpin id but not every line in the hairpin data file if there are multiple lines having the same hairpin id. I was assuming hairpin id can be a unique key so it can be the key to the %h_hash. If hairpin ids aren't unique then the hash will contain data for only the last line of each bunch of lines for a particular hairpin_id. If hairpin ids aren't unique, I guess I still don't understand what hairpins represent.

As for an explanation of lines 120 - 123: the following has to dereference an array reference referring to an array of array references. That's why it looks complicated. The complex syntax has nothing to do with its being in a subroutine. Each scaffold key in the hash contains a reference to an array of references to arrays containing the split gene data lines for that scaffold.

An iteration repeatedly puts the current element in the $_ variable by default (if no other variable is specified). In this case $_ contains an array reference so we must dereference it to make it an array again, like so: @$_

foreach (@{$genedata{$s}}){
        chomp;

        my ($g_id, $g_scaffold, $g_start, $g_stop, $g_other) = @$_;
d5e5 109 Master Poster

For each line in the gene data file we create a key value pair in the %genedata hash. The key is the gene scaffold and the value is a reference to an array containing the line of gene data that we split by spaces.

So line 120 iterates through this array. To de-reference this array we have to look it up in the hash and wrap it in dereferencing syntax resulting in @{$genedata{$s}}

d5e5 109 Master Poster

When we update the code is it possible change line 76

next unless defined $g_other and $g_other eq 'gene'

to also run if $g_other eq ' ' as there are additional queries which should be examined which have nothing in the column $g_other column.

I moved the following statement from the read_genedata sub to the nearest_gene sub.

next unless defined $g_other and $g_other eq 'gene'
                  or
                    defined $g_id and $g_id =~ m/^JG/
                    and !defined $g_other;#Read only 'gene' records

Also I modified the script to read the gene data into a hash of arrays instead of a plain array so the nearest_gene sub can look up only the gene records for a specific scaffold which serves as a key to the hash. Looking up data by key in a hash should be faster than reading through an array to find the genes for that scaffold.

Please try the following. On my laptop it takes roughly half an hour and creates an output file called 'hairpin-nearest-gene.txt' containing 8026 lines.

#!/usr/bin/perl;
#comparepremiRNAtocoding-pulloverlap3.pl;
use strict;
use warnings;
use List::Util qw[min max];
use Benchmark;

my $t0 = Benchmark->new;

#You can delete the following line. My testing setup works better without command-line arguments
@ARGV = qw(cab-ALL.LU-premiRNAs.withLOCATION.Vmatch-Nvmr-EST.list bab-All.LuCoding.061511) if @ARGV == 0;
#@ARGV = qw(dani-sampleHairpindata.txt dani-sampleGenedata.txt) if @ARGV == 0;
my $hairpin_filename = $ARGV[0];
my $gene_filename = $ARGV[1];

#Save output to a file instead of STDOUT
#because we are printing debug info to STDOUT
my $outfilename = 'hairpin-nearest-gene.txt';

my %genedata;#Hash of genedata arrays for each …
d5e5 109 Master Poster

Relational database management software can load not only the data but, if you design your database well, some of the important relationships between data elements. Because the gene data file is rather large, and we have to load the entire file into an array in memory and then loop through the array from the start to find and test rows having the same scaffold as the hairpin, and do this for each hairpin, the script takes a long time. Database software can select subsets (AKA result sets) from the data without loading all the data (together, at one time) into memory, and may do it more efficiently if you first build indexes on the columns by which your selecting.

Another advantage of relational databases is you can do your data processing in a series of steps. In the first program you can load the data files into database tables and build the appropriate indexes. In another step you can build a table associating each hairpin id with the id of the nearest gene. Then you can write one or more programs that query this data in various ways without having to repeatedly load, sort and otherwise massage the original data. That way you can build the database once and then develop and test various queries that shouldn't take as long to run as the job that builds, indexes and links the data.

The downside is I'm not an expert in SQLite and have only played with it because …

d5e5 109 Master Poster

When we update the code is it possible change line 76

next unless defined $g_other and $g_other eq 'gene'

to also run if $g_other eq ' ' as there are additional queries which should be examined which have nothing in the column $g_other column.

I'm not sure about that. I've been trying to modify the script to save the gene data in a more suitable data structure that would allow faster look up of the genes for each scaffold when determining nearest gene, but so far I'm not getting good results.

Maybe we would need to consider a radical change to the program design, such as reading the gene data into a database table. Then you could access it with SQL queries that could include or filter out rows according to values in the columns, and it might run faster. Do you have SQLite for Perl?

Meanwhile, here's the corrected version of the last script I posted. It skips blank lines in the hairpin file to avoid the error messages you were getting.

#!/usr/bin/perl;
#comparepremiRNAtocoding-pulloverlap3.pl;
use strict;
use warnings;
use List::Util qw[min max];
use Benchmark;

my $t0 = Benchmark->new;

#You can delete the following line. My testing setup works better without command-line arguments
#@ARGV = qw(cab-ALL.LU-premiRNAs.withLOCATION.Vmatch-Nvmr-EST.list bab-All.LuCoding.061511) if @ARGV == 0;
@ARGV = qw(dani-sampleHairpindata.txt dani-sampleGenedata.txt) if @ARGV == 0;
my $hairpin_filename = $ARGV[0];
my $gene_filename = $ARGV[1];
my @genedata;
$| = 1;#Flush print buffer

read_genedata();#Call subroutine to read selected records from file into array

my $t1 …
d5e5 109 Master Poster

Hi,
I was out all day today so I was able to let it run last night and today (so I haven't tried the new code). It finished! There are 168028 lines in the file. I will need to go over the output and double-check to be sure, but I think it worked :)

I did get some error messages though. However, they were more numerous than the terminal will show me. Anyhow, here are the one's which I have:

Use of uninitialized value $s in string eq at Desktop/8-19-11run.pl line 105, <$fh> line 167869.

Use of uninitialized value $scaffold_loc in hash element at Desktop/8-19-11run.pl line 89, <$fh> line 167869.

Use of uninitialized value $hairpin_id in hash element at Desktop/8-19-11run.pl line 89, <$fh> line 167869.

Use of uninitialized value in join or string at Desktop/8-19-11run.pl line 54.

Use of uninitialized value $hairpin_start in concatenation (.) or string at Desktop/8-19-11run.pl line 64.

Use of uninitialized value $hairpin_stop in concatenation (.) or string at Desktop/8-19-11run.pl line 64.

Use of uninitialized value $hairpin_seq in concatenation (.) or string at Desktop/8-19-11run.pl line 64.

That's great that the program finally completed. Apparently looking up the nearest gene for each hairpin is taking a lot of time. I'll see if I can speed up that part of the script.

When I look at line 167869 of the hairpin file I see there is a blank line which probably caused the …

d5e5 109 Master Poster

I downloaded the two files and tested the program. It seemed to take forever so I made a few more changes that hopefully will speed it up a little more. Also I have it print debugging info to the screen while it's running so we have some idea what it's doing.

I see there are over 168,000 lines of data in the hairpin file. If it took one second to find the nearest gene for each of those lines that would total about 46 hours! Fortunately it seems to run faster than that on my laptop, adding about 5 to 10 hairpins to the hash every second so hopefully it should finish running after about 5 to 10 hours. That's just a guess because I can't leave my laptop running that long (it's on dining room table, no desk).

If it's still not fast enough, would you consider loading the data into a relational database, such as SQLite? That would require writing a Perl program to load the data into tables in the database, and then require significant changes to the program(s) that read the database and contain the logic to determine nearest gene, etc. SQLite is easy to install and use with Perl. see http://search.cpan.org/~msergeant/DBD-SQLite-0.31/lib/DBD/SQLite.pm

Meanwhile here's the latest, and hopefully the fastest, version of our Perl script that may take about 5 to 10 hours to produce an output file from those two text files you uploaded to fyels.com

#!/usr/bin/perl; …
d5e5 109 Master Poster

I'll upload the files to fyels when I get my computer with the data up and running this afternoon.

OK. Remember to post the link to the data on fyels or I won't be able to find it.

Meanwhile I tweaked the program to make it more efficient but since it took less than a second to run the test data I can't tell if it's really any faster than before. Here's the modified program. Note that I use the grep command to filter the genedata before searching for the nearest gene. Plus I sort the genedata by scaffold so I can stop looping through the data after looping through the matching scaffold data. In theory that should save a little time.

#!/usr/bin/perl;
#comparepremiRNAtocoding-pulloverlap3.pl;
use strict;
use warnings;
use List::Util qw[min max];
use Benchmark;

my $t0 = Benchmark->new;

#You can delete the following line. My testing setup works better without command-line arguments
@ARGV = qw(dani-sampleHairpindata.txt dani-sampleGenedata.txt) if @ARGV == 0;

my $hairpin_filename = $ARGV[0];
my $gene_filename = $ARGV[1];
my @genedata = read_genedata();

@genedata = grep {m/^.+$/} @genedata;#Filter array lines to keep only non-blank lines

#Filter array and keep only lines whose other info
#contains 'gene' or gene id begins with JG and other info is empty
@genedata = grep {my ($g_id, $g_scaffold, $g_start, $g_stop, $g_other) = split /\s+/, $_, 6;
                    defined $g_other and $g_other eq 'gene'
                  or
                    defined $g_id and $g_id =~ m/^JG/
                    and !defined $g_other                
                    } @genedata;

chomp @genedata;

@genedata = sort by_scaffold @genedata;

my %h_hash = …
d5e5 109 Master Poster

A good first step is to find data that will cause the problem so we can modify the program and test until the problem no longer occurs. Either there are some data in your files that the program doesn't handle properly, or else one or both of the files are so large that the program takes an unacceptably long time to load the data into memory. Can you post the data files that take forever to process? If they are too large to attach as txt files to your post, maybe you can upload them to a free file-sharing service such as http://fyels.com/ which gives you a link to the data allowing anyone to download it.

d5e5 109 Master Poster

Consider using Tie::File.

Tie::File represents a regular text file as a Perl array. Each element in the array corresponds to a record in the file. The first line of the file is element 0 of the array; the second line is element 1, and so on.

The file is not loaded into memory, so this will work even for gigantic files.

~from Tie::File documentation.

d5e5 109 Master Poster

I don't know what you mean by "there can actually be many files for each record" but I think you want an array of hash references. See http://perldoc.perl.org/perldsc.html

#!/usr/bin/perl;
use strict;
use warnings;
use Data::Dumper;

my @AoH;#Array of hash references
while(my $line = <DATA>){
    chomp($line);
    my ($id, $filename, $size) = split/\s+/, $line;
    push @AoH, {id => $id,
                filename => $filename,
                size => $size};
}

foreach(@AoH){
    print "id is $$_{id}, filename is $$_{filename}, size is $$_{size}\n";
}

__DATA__
1 logfilename 346202741018308
2 logfilename 0261512802421464
3 logfilename 612262297692848
4 logfilename 3268022049187
5 logfilename 888755246426701
6 logfilename 378923465017564
7 logfilename 314595896654591
8 logfilename 827978858998073
9 logfilename 943690319071184

This gives the following output:

id is 1, filename is logfilename, size is 346202741018308
id is 2, filename is logfilename, size is 0261512802421464
id is 3, filename is logfilename, size is 612262297692848
id is 4, filename is logfilename, size is 3268022049187
id is 5, filename is logfilename, size is 888755246426701
id is 6, filename is logfilename, size is 378923465017564
id is 7, filename is logfilename, size is 314595896654591
id is 8, filename is logfilename, size is 827978858998073
id is 9, filename is logfilename, size is 943690319071184
k_manimuthu commented: Nice example for Array of hashes +5
d5e5 109 Master Poster

I've been running this on the real data since yesterday and it's still running. I'll let you know if it works once I get a chance to examine the output. Thanks again.

I didn't expect it to run so long. Unfortunately my wife and I have to travel for the rest of the week and I won't have a computer. Hopefully we'll be back by Friday. Let me know if the script still needs work and I'll have another look at it on the weekend.

d5e5 109 Master Poster

Does the following work for you?

#!/usr/bin/perl;
#comparepremiRNAtocoding-pulloverlap3.pl;
use strict;
use warnings;
use List::Util qw[min max];
use Data::Dumper;

#You can delete the following line. My testing setup works better without command-line arguments
@ARGV = qw(dani-sampleHairpindata.txt dani-sampleGenedata.txt) if @ARGV == 0;

my $hairpin_filename = $ARGV[0];
my $gene_filename = $ARGV[1];

my @genedata = read_genedata();
chomp @genedata;

my %h_hash = ();
read_hairpindata();

#print Dumper(\%h_hash);#Dump hairpin data structure
#We can access the data for each hairpin id in each scaffold location
#by looping through %h_hash
my $header = 'Hairpin_ID, hairpin other, outside/inside/overlap,'
               . 'gene ID, gene other, distance from hairpin, scaffold,'
               . 'hairpin start, hairpin stop, hairpin sequence';#Heading line
               
print $header, "\n";
        
foreach my $s(keys %h_hash){
    foreach my $h(keys %{$h_hash{$s}}){
        #dereference the genedata array
        my ($g_id, $g_scaffold, $g_start, $g_stop, $g_other, $dist) = @{$h_hash{$s}{$h}{nearest_gene_data}};
        
        #Dereference array reference and join array elements
        my $hairpin_other = join '', @{$h_hash{$s}{$h}{other}};
        my $hairpin_start = $h_hash{$s}{$h}{start};
        my $hairpin_stop = $h_hash{$s}{$h}{stop};
        my $hairpin_seq = $h_hash{$s}{$h}{sequence_read};
        
        if (defined $g_id){
            my $oio = calc_rel_pos($hairpin_start, $hairpin_stop, $g_start, $g_stop);
            print "$h,$hairpin_other,$oio,$g_id,$g_other,$dist,$s,$hairpin_start,$hairpin_stop,$hairpin_seq\n";
        }
        else{
            print "$h,$hairpin_other,,,,,$s,$hairpin_start,$hairpin_stop,$hairpin_seq\n";
        }
    }
}

sub read_genedata{   
    open(my $fh, '<', $gene_filename)||die "open $gene_filename failed: $!";
    
    #Read entire genedata file into an array
    return <$fh>;
}

sub read_hairpindata{
    open(my $fh, '<', $hairpin_filename)||die "open $hairpin_filename failed: $!";
    
    #Read entire hairpindata file into a hash
    while(<$fh>){
        chomp;
        my @cols = split/\t/;#split by tab
        my $scaffold_loc = $cols[1];#make column 2 (index 1) the scaffold location key (i.e. the scaffolds);
        my $hairpin_id = $cols[0];#hairpin_id also needs to be a key so you can look up data …
d5e5 109 Master Poster

seems to me that perhaps the trouble lies with $dist when in fact there is no gene present on the scaffold. therefor maybe we could modify line 29 to read:

$nearest_gene = $g_id if defined $g_id && $dist>0;

and then

else print "Nearest Gene to Hairpin id $h is $nearest_gene\n";

would that work? I tried to put it in, but I seem to be messing up my parenthesis or something with the else statement.

I think you have identified the problem of the undefined distance when no gene found correctly but your proposed solution won't work because $nearest_gene = $g_id if defined $g_id && $dist>0; contains a post-condition if followed by a semi-colon which ends the statement so the following else condition is orphaned (doesn't follow an if block). See http://perldoc.perl.org/perlintro.html and scroll down to

However, there is a clever way of making your one-line conditional blocks more English like:

1. # the traditional way
2. if ($zippy) {
3. print "Yow!";
4. }
5.
6. # the Perlish post-condition way
7. print "Yow!" if $zippy;
8. print "We have no bananas" unless $bananas;

We can try replacing the lines 21 through 34 with the following

#We can access the data for each hairpin id in each scaffold location
#by looping through %h_hash
foreach my $s(keys %h_hash){
    print "Scaffold location $s\n";
    
    foreach my $h(keys %{$h_hash{$s}}){
        #Modified following statement to properly dereference the array
        my ($g_id, $g_scaffold, $g_start, $g_stop, $g_other, $dist) = @{$h_hash{$s}{$h}{nearest_gene_data}}; …
d5e5 109 Master Poster

Sorry, I found my mistake on line 27. See the comment and the modified statement (now line 28) of the following:

#!/usr/bin/perl;
#comparepremiRNAtocoding-pulloverlap3.pl;
use strict;
use warnings;
use List::Util qw[min max];
use Data::Dumper;

#You can delete the following line. My testing setup works better without command-line arguments
@ARGV = qw(dani-sampleHairpindata.txt dani-sampleGenedata.txt) if @ARGV == 0;

my $hairpin_filename = $ARGV[0];
my $gene_filename = $ARGV[1];

my @genedata = read_genedata();
chomp @genedata;

my %h_hash = ();
read_hairpindata();

#print Dumper(\%h_hash);#Dump hairpin data structure
#We can access the data for each hairpin id in each scaffold location
#by looping through %h_hash
foreach my $s(keys %h_hash){
    print "Scaffold location $s\n";
    
    foreach my $h(keys %{$h_hash{$s}}){
        #Modified following statement to properly dereference the array
        my ($g_id, $g_scaffold, $g_start, $g_stop, $g_other, $dist) = @{$h_hash{$s}{$h}{nearest_gene_data}};
        my $nearest_gene = 'Not Found';
        $nearest_gene = $g_id if defined $g_id;
        print "Nearest Gene to Hairpin id $h is $nearest_gene, distance: $dist\n";
    }
    print "\n";
}

sub read_genedata{   
    open(my $fh, '<', $gene_filename)||die "open $gene_filename failed: $!";
    
    #Read entire genedata file into an array
    return <$fh>;
}

sub read_hairpindata{
    open(my $fh, '<', $hairpin_filename)||die "open $hairpin_filename failed: $!";
    
    #Read entire hairpindata file into a hash
    while(<$fh>){
        chomp;
        my @cols = split/\t/;#split by tab
        my $scaffold_loc = $cols[1];#make column 2 (index 1) the scaffold location key (i.e. the scaffolds);
        my $hairpin_id = $cols[0];#hairpin_id also needs to be a key so you can look up data later
        $h_hash{$scaffold_loc}{$hairpin_id}{start} = $cols[2];
        $h_hash{$scaffold_loc}{$hairpin_id}{stop} = $cols[3];
        $h_hash{$scaffold_loc}{$hairpin_id}{other} = [@cols[4..14]];
        $h_hash{$scaffold_loc}{$hairpin_id}{sequence_read} = $cols[15];
        $h_hash{$scaffold_loc}{$hairpin_id}{nearest_gene_data} = nearest_gene($scaffold_loc, $hairpin_id);
    }
}

sub nearest_gene{
    my …
d5e5 109 Master Poster

Since you have added a value to the array I think you need to increase the limit argument for the split from 5 to 6.

Try modifying line 69 from: my ($g_id, $g_scaffold, $g_start, $g_stop, $g_other) = split /\s+/, $_, 5; to: my ($g_id, $g_scaffold, $g_start, $g_stop, $g_other) = split /\s+/, $_, 6;

Am I understanding things correctly that @gene_rec data can be accessed through $h_hash{$s}{$h}{nearest_gene_data} with $h_hash{$s}{$h}{nearest_gene_data}[0] being the $g_id and $h_hash{$s}{$h}{nearest_gene_data}[4] being the $g_other?

Yes, that part is correct as far as I can tell.

d5e5 109 Master Poster

Your script has not declared variables named $chr, $m_start, $m_end, $d_start, and $d_end. You must declare them and assign values to them. See an example of this error and its remedy. If you don't assign values to them you will get a different error message saying values for those variables are undefined so you can't use them for concatenation in the following statement: print "$chr\t$m_start\t$m_end\t$d_start\t$d_end\n";

d5e5 109 Master Poster

I've been programming a web crawler for a while, I'm almost done, it works perfectly but when it crawls vbulletin forums i get weird urls

example:

forum/index.php?phpsessid=oed7fqnm9ikhqq9jvbt23lo8e4
index.php/topic,5583.0.html?phpsessid=93f6a28f192c8cc8b035688cf8b5e06d

obviously this is being causes by php session IDs

what can I do to stop this?
I tried using cookies with HTTP::Cookies but the problem persists.

thanks

If the phpsessid always occurs at the end of the url you could remove it with a regex substitution like this:

#!/usr/bin/perl;
use strict;
use warnings;

my $url = 'index.php/topic,5583.0.html?phpsessid=93f6a28f192c8cc8b035688cf8b5e06d';

$url =~ s/\?phpsessid=\w+$//;

print $url;
d5e5 109 Master Poster

Thanks for the explanation. I gather that some of the records in the genedata file may be used only later if certain conditions are met. I think this means that the function that searches for the nearest gene should calculate the distance from hairpin only for those records where the 'other' value is 'gene'. Is that correct? If so we could modify the script to test only the 'gene' rows. I also included an example of how to loop through and access the data stored in the %h_hash. The script still doesn't determine if the hairpin is inside, outside or overlapping the nearest gene, but knowing how to access the data elements in %h_hash is a step in that direction.

#!/usr/bin/perl;
#comparepremiRNAtocoding-pulloverlap3.pl;
use strict;
use warnings;
use List::Util qw[min max];
use Data::Dumper;

#You can delete the following line. My testing setup works better without command-line arguments
@ARGV = qw(dani-sampleHairpindata.txt dani-sampleGenedata.txt) if @ARGV == 0;

my $hairpin_filename = $ARGV[0];
my $gene_filename = $ARGV[1];

my @genedata = read_genedata();
chomp @genedata;

my %h_hash = ();
read_hairpindata();

#print Dumper(\%h_hash);#Dump hairpin data structure
#We can access the data for each hairpin id in each scaffold location
#by looping through %h_hash
foreach my $s(keys %h_hash){
    print "Scaffold location $s\n";
    
    foreach my $h(keys %{$h_hash{$s}}){
        my ($g_id, $g_scaffold, $g_start, $g_stop, $g_other) = $h_hash{$s}{$h}{nearest_gene_data}[0];
        my $nearest_gene = 'Not Found';
        $nearest_gene = $g_id if defined $g_id;
        print "Nearest Gene to Hairpin id $h is $nearest_gene\n";
    }
    print "\n";
}

sub read_genedata{   
    open(my $fh, '<', $gene_filename)||die "open $gene_filename failed: $!"; …
d5e5 109 Master Poster

I read through the code and there are some bits I'm not following that maybe you could explain? line 17-20 - I don't understand what is happening here. I see you open %h_hash, but I don't understand what is happening next. Is read_hairpindata being loaded into h_hash?

Also, I don't quite follow the work flow as I'm not used to subs calling on other subs. Can you give me a basic run through?

Anyhow, I think you are on the right track. I'll check back in a bit later today.

Thanks again!

One of the differences between the read_genedata function, or subroutine ('function' and 'subroutine' mean the same thing in Perl), and the read_hairpindata function is read_genedata returns a list of all the lines in the genedata file which the calling statement assigns to the @genedata array.

Whereas the read_hairpindata function does not return any data. Instead of returning something it does something. What it does is add data elements (in the form of hash and array references) to the %h_hash which we have to declare outside the read_hairpindata subroutine so we can later access it from outside the scope of that subroutine... so that's what's going on in lines 17 through 20:

  • Line 17 declares the %h_hash wherin we want to save the hairpin and corresponding nearest gene data.
  • line 18 calls the read_hairpindata subroutine in a void context meaning it doesn't assign the result to any variable, because the function doesn't return anything, but instead …
d5e5 109 Master Poster

Try moving the statement that executes $sth2 into the outer loop, something like the following (not tested):

#!/usr/bin/perl
use strict;
use warnings;

use DBI;

my $user = 'root';
my $password = '1004';

# connect to the database
my $dbh = DBI->connect("dbi:mysql:mirvar", $user, $password) || die "Failed connect DB : $!\n";

my $sql;

$sql = "select d.chr, m.Start as m_start, m.End as m_end, d.Start as d_start, d.End as d_end from mirna as m, dbsnp as d limit 10000";
my $sql1 = "select * from mirna limit 10";
my $sql2 = "select * from dbsnp limit 10";


my $sth = $dbh->prepare($sql);
my $sth1 = $dbh->prepare($sql1);
my $sth2 = $dbh->prepare($sql2);

$sth1->execute || die "Error! : $sql\n";

$sth->execute || die "Error! : $sql\n";

while ( my $line1 = $sth1->fetchrow_hashref() ) {
    $sth2->execute || die "Error! : $sql\n";#Moved into loop
    while ( my $line2 = $sth2->fetchrow_hashref() ) {
        if ( $line2->{'Start'} <= $line1->{'Start'} && $line1->{'End'} <= $line2->{'End'} ) {
                print "test";
        }
    }
}
d5e5 109 Master Poster

I had to modify the function called nearest_gene to consider only genes whose scaffold location matched that of the hairpin. I also modified the variable names for the column data for readability. Please replace the corresponding subroutine in the above script with the following:

sub nearest_gene{
    my ($s, $h) = @_; #scaffold location and hairpin_id received as arguments
    
    my $save_min = 999999;#Big number for comparing distances
    my @gene_rec;
    
    foreach (@genedata){
        chomp;
        my @c = split;
        my ($g_id, $g_scaffold, $g_start, $g_stop, $g_other) = split /\s+/, $_, 5;
        #Added the following statement to test only genes with matching scaffold loc
        next unless $c[1] eq $s;
        
        my $min = min( abs($h_hash{$s}{$h}{start} - $g_start),
                    abs($h_hash{$s}{$h}{stop} - $g_start),
                    abs($h_hash{$s}{$h}{start} - $g_stop),
                    abs($h_hash{$s}{$h}{stop} - $g_stop) );
        if ($min < $save_min){
            $save_min = $min;
            @gene_rec = ($g_id, $g_scaffold, $g_start, $g_stop, $g_other);
        }
    }
    return \@gene_rec;
}
d5e5 109 Master Poster

I may need to see more of the code to see where you are going with this, but so far I'm confused as to why the hairpin id becomes a new hash key rather than a gene id as the idea is to compare the distances between a series of gene start and stops on a scaffold to the hairpin on that scaffold and then determine the gene with the minimum distance to the hairpin start or stop. I would have thought that the gene id that is found to be the min would become a key that could then be used to pull the data from that gene id to be printed in the output. But I'm a total newbie, so I will defer to you and wait to see more code. Thanks again for the help.
Btw, I won't be able to check back in until tomorrow evening EST.

The following reads the gene data into an array, then builds the hash of hairpin data, which also includes a reference to the row of data from the genedata file that is nearest to each hairpin. So far it just dumps the contents of the hash linking hairpins to the nearest gene data record. Please let me know if I'm on the right track.

#!/usr/bin/perl;
#comparepremiRNAtocoding-pulloverlap3.pl;
use strict;
use warnings;
use List::Util qw[min max];
use Data::Dumper;

#You can delete the following line. My testing setup works better without command-line arguments
@ARGV = qw(dani-sampleHairpindata.txt dani-sampleGenedata.txt) if @ARGV == …
d5e5 109 Master Poster

I may need to see more of the code to see where you are going with this, but so far I'm confused as to why the hairpin id becomes a new hash key rather than a gene id as the idea is to compare the distances between a series of gene start and stops on a scaffold to the hairpin on that scaffold and then determine the gene with the minimum distance to the hairpin start or stop. I would have thought that the gene id that is found to be the min would become a key that could then be used to pull the data from that gene id to be printed in the output. But I'm a total newbie, so I will defer to you and wait to see more code. Thanks again for the help.
Btw, I won't be able to check back in until tomorrow evening EST.

You're right, the script I posted hasn't done anything with the gene data yet. It seemed strange to me that the hairpin hash had only scaffold as key, as scaffold is not unique. I may still be confused. I'll have another look on the weekend.

d5e5 109 Master Poster
#!/usr/bin/perl;
use strict;
use warnings;

while (<DATA>){
    my @whitespacegroups = m/\w(\s+)\w/g;
    foreach my $whitespacegroup(@whitespacegroups){
        my $count = length $whitespacegroup;
        print "$count spaces\t";
    }
    print "\n";
}
__DATA__
I wake     up in          the                                morning
What       time?
Always     wakeup      at                    6am.

Output is:

1 spaces	5 spaces	1 spaces	10 spaces	32 spaces	
7 spaces	
5 spaces	6 spaces	20 spaces
d5e5 109 Master Poster

As I mentioned before, this is some old code someone helped me with. I believe you can ignore FileHandle for now and thanks for the fyi on the single-letter words when opening files... btw, can you tell me why we have call a file by two things (ie. H, $hairpin)?

I'm not sure I understand the question but the open statement associates a file handle with a file name. The file name consists of the path and name of the file. The file handle represents a reference to some location in memory for internal use by perl. I'm not a computer scientist so that may not be exactly correct.

The following is obviously not complete, but I modified the structure of the hairpin hash so that the hairpin id can serve as a key within the scaffold location hash. The goal is to be able to store and look up the saved data for each hairpin in the scaffold. Let me know if this structure seems better or worse than what you had.

#!/usr/bin/perl;
#comparepremiRNAtocoding-pulloverlap3.pl;
use strict;
use warnings;
use Data::Dumper;

#You can delete the following line. My testing setup works better without command-line arguments
@ARGV = qw(dani-sampleHairpindata.txt dani-sampleGenedata.txt) if @ARGV == 0;

my $hairpin_filename = $ARGV[0];
my $gene_filename = $ARGV[1];

#open a file for reading and associate it with a lexical filehandle
open(my $fh, '<', $hairpin_filename)||die "open $hairpin_filename failed: $!";

my %h_hash=();
while(<$fh>)  # first load h into hash of arrays
{
  chomp;
  my @cols = split/\t/;#split …
d5e5 109 Master Poster

One quick question and then I hope to have some time this afternoon: I see your program loads a module called FileHandle but doesn't seem to do anything with any of that module's properties or methods. I'm not familiar with the FileHandle module. Is it an important part of your solution, or can I ignore it for now?

Avoid using single-letter bare words such as 'H' and 'C' when opening files. It may not have caused you a problem yet but it seems like fingernails on chalkboard to Perl programmers.

# Bareword filehandles such as SARAN are package globals. Use lexical filehandles.
# Prefer the three-argument form of open, especially if the filename is not hardcoded.
# Include the filename in the error message.

~Comment by Sinan Ünür on StackOverflow

The data structure in your hash looks unnecessarily complex but I'll have to experiment with that this afternoon before recommending a change. In principle, with the right data structures you can refer to any hairpin scaffold data by its key. I'll post more this afternoon.

d5e5 109 Master Poster

perhaps, rather than a b-tree, someone could help me put together a method of finding the "gene" with the minimum distance to the "hairpin" by doing something like

my @dist = ( abs($h_hash{$c[1]}[$i][2] - $c[2]), abs($h_hash{$c[1]}[$i][3] - $c[2]), abs($h_hash{$c[1]}[$i][2] - $c[3]), abs($h_hash{$c[1]}[$i][3] - $c[3]) );

@dist = sort {$a <==> $b} @dist;
$min = $dist[0]; # this will find the distance to a single gene and would need to be done again for every array with a matching key (scaffold) and then I would want to find the min of each of these.

I just don't know how to reference back into the hash once the entry with the minimum distance from the hairpin is determined.

Due to my unfamiliarity with the b-tree concept, I would favour your second idea for finding the minimum distance. I don't have any more computer time today but will try to find a way tomorrow.

What I might suggest (but haven't worked out yet) is using the Tie:File module to treat the Genedata file as an array that you can loop through repeatedly. As you find the nearest gene for each hairpin you could save the index of the Genedata array in the hairpin hash, or in another hash that links hairpin with nearest gene. I think that's promising but won't know if it works until I try it tomorrow.

d5e5 109 Master Poster

Please attach your data as *.txt file(s) instead of images or *.pdf files. Instead of having to look at the images and manually enter your data into MySQL we could copy and paste it to try and reproduce the error.

d5e5 109 Master Poster

We need to know what delimiter character you use to separate the columns in your file. When you post the data without wrapping it in [code]

[/code] tags all duplicate spaces or tabs become single spaces so we can't see some of the delimiters. If you put only one single space between the MODEL value and the BOUGHT value then the BOUGHT value appears in the OPTIONS column. Also any value that contains spaces, such as '4 years ago' will split into three separate columns if you use a single space as the delimiting character. Instead, use a delimiter character that will never appear in the values, such as tilda ~. Then your file would look like the following (without the line numbers, of course):

BRAND~VERSION~MODEL~OPTIONS~BOUGHT
toyota~lxi~2007~~4 years ago
nissan~mxi~2008~~3 years ago

Note that the two consecutive delimiters (~~) indicate an empty column.

If you already have proper delimiters that we can't see, post your data wrapped in [code]

[/code] tags or else attach them to your reply as a '.txt' file attachment.

d5e5 109 Master Poster

Great Thanks, D5E. it working fine here

You're welcome. Please don't forget to mark this thread solved.

d5e5 109 Master Poster

That works perfectly!! Thanks a lot for your help!!

You're welcome. Please don't forget to mark this topic 'solved'.