d5e5 109 Master Poster

When we update the code is it possible change line 76

next unless defined $g_other and $g_other eq 'gene'

to also run if $g_other eq ' ' as there are additional queries which should be examined which have nothing in the column $g_other column.

I'm not sure about that. I've been trying to modify the script to save the gene data in a more suitable data structure that would allow faster look up of the genes for each scaffold when determining nearest gene, but so far I'm not getting good results.

Maybe we would need to consider a radical change to the program design, such as reading the gene data into a database table. Then you could access it with SQL queries that could include or filter out rows according to values in the columns, and it might run faster. Do you have SQLite for Perl?

Meanwhile, here's the corrected version of the last script I posted. It skips blank lines in the hairpin file to avoid the error messages you were getting.

#!/usr/bin/perl;
#comparepremiRNAtocoding-pulloverlap3.pl;
use strict;
use warnings;
use List::Util qw[min max];
use Benchmark;

my $t0 = Benchmark->new;

#You can delete the following line. My testing setup works better without command-line arguments
#@ARGV = qw(cab-ALL.LU-premiRNAs.withLOCATION.Vmatch-Nvmr-EST.list bab-All.LuCoding.061511) if @ARGV == 0;
@ARGV = qw(dani-sampleHairpindata.txt dani-sampleGenedata.txt) if @ARGV == 0;
my $hairpin_filename = $ARGV[0];
my $gene_filename = $ARGV[1];
my @genedata;
$| = 1;#Flush print buffer

read_genedata();#Call subroutine to read selected records from file into array

my $t1 …
d5e5 109 Master Poster

Hi,
I was out all day today so I was able to let it run last night and today (so I haven't tried the new code). It finished! There are 168028 lines in the file. I will need to go over the output and double-check to be sure, but I think it worked :)

I did get some error messages though. However, they were more numerous than the terminal will show me. Anyhow, here are the one's which I have:

Use of uninitialized value $s in string eq at Desktop/8-19-11run.pl line 105, <$fh> line 167869.

Use of uninitialized value $scaffold_loc in hash element at Desktop/8-19-11run.pl line 89, <$fh> line 167869.

Use of uninitialized value $hairpin_id in hash element at Desktop/8-19-11run.pl line 89, <$fh> line 167869.

Use of uninitialized value in join or string at Desktop/8-19-11run.pl line 54.

Use of uninitialized value $hairpin_start in concatenation (.) or string at Desktop/8-19-11run.pl line 64.

Use of uninitialized value $hairpin_stop in concatenation (.) or string at Desktop/8-19-11run.pl line 64.

Use of uninitialized value $hairpin_seq in concatenation (.) or string at Desktop/8-19-11run.pl line 64.

That's great that the program finally completed. Apparently looking up the nearest gene for each hairpin is taking a lot of time. I'll see if I can speed up that part of the script.

When I look at line 167869 of the hairpin file I see there is a blank line which probably caused the …

d5e5 109 Master Poster

I downloaded the two files and tested the program. It seemed to take forever so I made a few more changes that hopefully will speed it up a little more. Also I have it print debugging info to the screen while it's running so we have some idea what it's doing.

I see there are over 168,000 lines of data in the hairpin file. If it took one second to find the nearest gene for each of those lines that would total about 46 hours! Fortunately it seems to run faster than that on my laptop, adding about 5 to 10 hairpins to the hash every second so hopefully it should finish running after about 5 to 10 hours. That's just a guess because I can't leave my laptop running that long (it's on dining room table, no desk).

If it's still not fast enough, would you consider loading the data into a relational database, such as SQLite? That would require writing a Perl program to load the data into tables in the database, and then require significant changes to the program(s) that read the database and contain the logic to determine nearest gene, etc. SQLite is easy to install and use with Perl. see http://search.cpan.org/~msergeant/DBD-SQLite-0.31/lib/DBD/SQLite.pm

Meanwhile here's the latest, and hopefully the fastest, version of our Perl script that may take about 5 to 10 hours to produce an output file from those two text files you uploaded to fyels.com

#!/usr/bin/perl; …
d5e5 109 Master Poster

I'll upload the files to fyels when I get my computer with the data up and running this afternoon.

OK. Remember to post the link to the data on fyels or I won't be able to find it.

Meanwhile I tweaked the program to make it more efficient but since it took less than a second to run the test data I can't tell if it's really any faster than before. Here's the modified program. Note that I use the grep command to filter the genedata before searching for the nearest gene. Plus I sort the genedata by scaffold so I can stop looping through the data after looping through the matching scaffold data. In theory that should save a little time.

#!/usr/bin/perl;
#comparepremiRNAtocoding-pulloverlap3.pl;
use strict;
use warnings;
use List::Util qw[min max];
use Benchmark;

my $t0 = Benchmark->new;

#You can delete the following line. My testing setup works better without command-line arguments
@ARGV = qw(dani-sampleHairpindata.txt dani-sampleGenedata.txt) if @ARGV == 0;

my $hairpin_filename = $ARGV[0];
my $gene_filename = $ARGV[1];
my @genedata = read_genedata();

@genedata = grep {m/^.+$/} @genedata;#Filter array lines to keep only non-blank lines

#Filter array and keep only lines whose other info
#contains 'gene' or gene id begins with JG and other info is empty
@genedata = grep {my ($g_id, $g_scaffold, $g_start, $g_stop, $g_other) = split /\s+/, $_, 6;
                    defined $g_other and $g_other eq 'gene'
                  or
                    defined $g_id and $g_id =~ m/^JG/
                    and !defined $g_other                
                    } @genedata;

chomp @genedata;

@genedata = sort by_scaffold @genedata;

my %h_hash = …
d5e5 109 Master Poster

A good first step is to find data that will cause the problem so we can modify the program and test until the problem no longer occurs. Either there are some data in your files that the program doesn't handle properly, or else one or both of the files are so large that the program takes an unacceptably long time to load the data into memory. Can you post the data files that take forever to process? If they are too large to attach as txt files to your post, maybe you can upload them to a free file-sharing service such as http://fyels.com/ which gives you a link to the data allowing anyone to download it.

d5e5 109 Master Poster

I've been running this on the real data since yesterday and it's still running. I'll let you know if it works once I get a chance to examine the output. Thanks again.

I didn't expect it to run so long. Unfortunately my wife and I have to travel for the rest of the week and I won't have a computer. Hopefully we'll be back by Friday. Let me know if the script still needs work and I'll have another look at it on the weekend.

d5e5 109 Master Poster

Does the following work for you?

#!/usr/bin/perl;
#comparepremiRNAtocoding-pulloverlap3.pl;
use strict;
use warnings;
use List::Util qw[min max];
use Data::Dumper;

#You can delete the following line. My testing setup works better without command-line arguments
@ARGV = qw(dani-sampleHairpindata.txt dani-sampleGenedata.txt) if @ARGV == 0;

my $hairpin_filename = $ARGV[0];
my $gene_filename = $ARGV[1];

my @genedata = read_genedata();
chomp @genedata;

my %h_hash = ();
read_hairpindata();

#print Dumper(\%h_hash);#Dump hairpin data structure
#We can access the data for each hairpin id in each scaffold location
#by looping through %h_hash
my $header = 'Hairpin_ID, hairpin other, outside/inside/overlap,'
               . 'gene ID, gene other, distance from hairpin, scaffold,'
               . 'hairpin start, hairpin stop, hairpin sequence';#Heading line
               
print $header, "\n";
        
foreach my $s(keys %h_hash){
    foreach my $h(keys %{$h_hash{$s}}){
        #dereference the genedata array
        my ($g_id, $g_scaffold, $g_start, $g_stop, $g_other, $dist) = @{$h_hash{$s}{$h}{nearest_gene_data}};
        
        #Dereference array reference and join array elements
        my $hairpin_other = join '', @{$h_hash{$s}{$h}{other}};
        my $hairpin_start = $h_hash{$s}{$h}{start};
        my $hairpin_stop = $h_hash{$s}{$h}{stop};
        my $hairpin_seq = $h_hash{$s}{$h}{sequence_read};
        
        if (defined $g_id){
            my $oio = calc_rel_pos($hairpin_start, $hairpin_stop, $g_start, $g_stop);
            print "$h,$hairpin_other,$oio,$g_id,$g_other,$dist,$s,$hairpin_start,$hairpin_stop,$hairpin_seq\n";
        }
        else{
            print "$h,$hairpin_other,,,,,$s,$hairpin_start,$hairpin_stop,$hairpin_seq\n";
        }
    }
}

sub read_genedata{   
    open(my $fh, '<', $gene_filename)||die "open $gene_filename failed: $!";
    
    #Read entire genedata file into an array
    return <$fh>;
}

sub read_hairpindata{
    open(my $fh, '<', $hairpin_filename)||die "open $hairpin_filename failed: $!";
    
    #Read entire hairpindata file into a hash
    while(<$fh>){
        chomp;
        my @cols = split/\t/;#split by tab
        my $scaffold_loc = $cols[1];#make column 2 (index 1) the scaffold location key (i.e. the scaffolds);
        my $hairpin_id = $cols[0];#hairpin_id also needs to be a key so you can look up data …
d5e5 109 Master Poster

seems to me that perhaps the trouble lies with $dist when in fact there is no gene present on the scaffold. therefor maybe we could modify line 29 to read:

$nearest_gene = $g_id if defined $g_id && $dist>0;

and then

else print "Nearest Gene to Hairpin id $h is $nearest_gene\n";

would that work? I tried to put it in, but I seem to be messing up my parenthesis or something with the else statement.

I think you have identified the problem of the undefined distance when no gene found correctly but your proposed solution won't work because $nearest_gene = $g_id if defined $g_id && $dist>0; contains a post-condition if followed by a semi-colon which ends the statement so the following else condition is orphaned (doesn't follow an if block). See http://perldoc.perl.org/perlintro.html and scroll down to

However, there is a clever way of making your one-line conditional blocks more English like:

1. # the traditional way
2. if ($zippy) {
3. print "Yow!";
4. }
5.
6. # the Perlish post-condition way
7. print "Yow!" if $zippy;
8. print "We have no bananas" unless $bananas;

We can try replacing the lines 21 through 34 with the following

#We can access the data for each hairpin id in each scaffold location
#by looping through %h_hash
foreach my $s(keys %h_hash){
    print "Scaffold location $s\n";
    
    foreach my $h(keys %{$h_hash{$s}}){
        #Modified following statement to properly dereference the array
        my ($g_id, $g_scaffold, $g_start, $g_stop, $g_other, $dist) = @{$h_hash{$s}{$h}{nearest_gene_data}}; …
d5e5 109 Master Poster

Sorry, I found my mistake on line 27. See the comment and the modified statement (now line 28) of the following:

#!/usr/bin/perl;
#comparepremiRNAtocoding-pulloverlap3.pl;
use strict;
use warnings;
use List::Util qw[min max];
use Data::Dumper;

#You can delete the following line. My testing setup works better without command-line arguments
@ARGV = qw(dani-sampleHairpindata.txt dani-sampleGenedata.txt) if @ARGV == 0;

my $hairpin_filename = $ARGV[0];
my $gene_filename = $ARGV[1];

my @genedata = read_genedata();
chomp @genedata;

my %h_hash = ();
read_hairpindata();

#print Dumper(\%h_hash);#Dump hairpin data structure
#We can access the data for each hairpin id in each scaffold location
#by looping through %h_hash
foreach my $s(keys %h_hash){
    print "Scaffold location $s\n";
    
    foreach my $h(keys %{$h_hash{$s}}){
        #Modified following statement to properly dereference the array
        my ($g_id, $g_scaffold, $g_start, $g_stop, $g_other, $dist) = @{$h_hash{$s}{$h}{nearest_gene_data}};
        my $nearest_gene = 'Not Found';
        $nearest_gene = $g_id if defined $g_id;
        print "Nearest Gene to Hairpin id $h is $nearest_gene, distance: $dist\n";
    }
    print "\n";
}

sub read_genedata{   
    open(my $fh, '<', $gene_filename)||die "open $gene_filename failed: $!";
    
    #Read entire genedata file into an array
    return <$fh>;
}

sub read_hairpindata{
    open(my $fh, '<', $hairpin_filename)||die "open $hairpin_filename failed: $!";
    
    #Read entire hairpindata file into a hash
    while(<$fh>){
        chomp;
        my @cols = split/\t/;#split by tab
        my $scaffold_loc = $cols[1];#make column 2 (index 1) the scaffold location key (i.e. the scaffolds);
        my $hairpin_id = $cols[0];#hairpin_id also needs to be a key so you can look up data later
        $h_hash{$scaffold_loc}{$hairpin_id}{start} = $cols[2];
        $h_hash{$scaffold_loc}{$hairpin_id}{stop} = $cols[3];
        $h_hash{$scaffold_loc}{$hairpin_id}{other} = [@cols[4..14]];
        $h_hash{$scaffold_loc}{$hairpin_id}{sequence_read} = $cols[15];
        $h_hash{$scaffold_loc}{$hairpin_id}{nearest_gene_data} = nearest_gene($scaffold_loc, $hairpin_id);
    }
}

sub nearest_gene{
    my …
d5e5 109 Master Poster

Since you have added a value to the array I think you need to increase the limit argument for the split from 5 to 6.

Try modifying line 69 from: my ($g_id, $g_scaffold, $g_start, $g_stop, $g_other) = split /\s+/, $_, 5; to: my ($g_id, $g_scaffold, $g_start, $g_stop, $g_other) = split /\s+/, $_, 6;

Am I understanding things correctly that @gene_rec data can be accessed through $h_hash{$s}{$h}{nearest_gene_data} with $h_hash{$s}{$h}{nearest_gene_data}[0] being the $g_id and $h_hash{$s}{$h}{nearest_gene_data}[4] being the $g_other?

Yes, that part is correct as far as I can tell.

d5e5 109 Master Poster

Thanks for the explanation. I gather that some of the records in the genedata file may be used only later if certain conditions are met. I think this means that the function that searches for the nearest gene should calculate the distance from hairpin only for those records where the 'other' value is 'gene'. Is that correct? If so we could modify the script to test only the 'gene' rows. I also included an example of how to loop through and access the data stored in the %h_hash. The script still doesn't determine if the hairpin is inside, outside or overlapping the nearest gene, but knowing how to access the data elements in %h_hash is a step in that direction.

#!/usr/bin/perl;
#comparepremiRNAtocoding-pulloverlap3.pl;
use strict;
use warnings;
use List::Util qw[min max];
use Data::Dumper;

#You can delete the following line. My testing setup works better without command-line arguments
@ARGV = qw(dani-sampleHairpindata.txt dani-sampleGenedata.txt) if @ARGV == 0;

my $hairpin_filename = $ARGV[0];
my $gene_filename = $ARGV[1];

my @genedata = read_genedata();
chomp @genedata;

my %h_hash = ();
read_hairpindata();

#print Dumper(\%h_hash);#Dump hairpin data structure
#We can access the data for each hairpin id in each scaffold location
#by looping through %h_hash
foreach my $s(keys %h_hash){
    print "Scaffold location $s\n";
    
    foreach my $h(keys %{$h_hash{$s}}){
        my ($g_id, $g_scaffold, $g_start, $g_stop, $g_other) = $h_hash{$s}{$h}{nearest_gene_data}[0];
        my $nearest_gene = 'Not Found';
        $nearest_gene = $g_id if defined $g_id;
        print "Nearest Gene to Hairpin id $h is $nearest_gene\n";
    }
    print "\n";
}

sub read_genedata{   
    open(my $fh, '<', $gene_filename)||die "open $gene_filename failed: $!"; …
d5e5 109 Master Poster

I read through the code and there are some bits I'm not following that maybe you could explain? line 17-20 - I don't understand what is happening here. I see you open %h_hash, but I don't understand what is happening next. Is read_hairpindata being loaded into h_hash?

Also, I don't quite follow the work flow as I'm not used to subs calling on other subs. Can you give me a basic run through?

Anyhow, I think you are on the right track. I'll check back in a bit later today.

Thanks again!

One of the differences between the read_genedata function, or subroutine ('function' and 'subroutine' mean the same thing in Perl), and the read_hairpindata function is read_genedata returns a list of all the lines in the genedata file which the calling statement assigns to the @genedata array.

Whereas the read_hairpindata function does not return any data. Instead of returning something it does something. What it does is add data elements (in the form of hash and array references) to the %h_hash which we have to declare outside the read_hairpindata subroutine so we can later access it from outside the scope of that subroutine... so that's what's going on in lines 17 through 20:

  • Line 17 declares the %h_hash wherin we want to save the hairpin and corresponding nearest gene data.
  • line 18 calls the read_hairpindata subroutine in a void context meaning it doesn't assign the result to any variable, because the function doesn't return anything, but instead …
d5e5 109 Master Poster

I had to modify the function called nearest_gene to consider only genes whose scaffold location matched that of the hairpin. I also modified the variable names for the column data for readability. Please replace the corresponding subroutine in the above script with the following:

sub nearest_gene{
    my ($s, $h) = @_; #scaffold location and hairpin_id received as arguments
    
    my $save_min = 999999;#Big number for comparing distances
    my @gene_rec;
    
    foreach (@genedata){
        chomp;
        my @c = split;
        my ($g_id, $g_scaffold, $g_start, $g_stop, $g_other) = split /\s+/, $_, 5;
        #Added the following statement to test only genes with matching scaffold loc
        next unless $c[1] eq $s;
        
        my $min = min( abs($h_hash{$s}{$h}{start} - $g_start),
                    abs($h_hash{$s}{$h}{stop} - $g_start),
                    abs($h_hash{$s}{$h}{start} - $g_stop),
                    abs($h_hash{$s}{$h}{stop} - $g_stop) );
        if ($min < $save_min){
            $save_min = $min;
            @gene_rec = ($g_id, $g_scaffold, $g_start, $g_stop, $g_other);
        }
    }
    return \@gene_rec;
}
d5e5 109 Master Poster

I may need to see more of the code to see where you are going with this, but so far I'm confused as to why the hairpin id becomes a new hash key rather than a gene id as the idea is to compare the distances between a series of gene start and stops on a scaffold to the hairpin on that scaffold and then determine the gene with the minimum distance to the hairpin start or stop. I would have thought that the gene id that is found to be the min would become a key that could then be used to pull the data from that gene id to be printed in the output. But I'm a total newbie, so I will defer to you and wait to see more code. Thanks again for the help.
Btw, I won't be able to check back in until tomorrow evening EST.

The following reads the gene data into an array, then builds the hash of hairpin data, which also includes a reference to the row of data from the genedata file that is nearest to each hairpin. So far it just dumps the contents of the hash linking hairpins to the nearest gene data record. Please let me know if I'm on the right track.

#!/usr/bin/perl;
#comparepremiRNAtocoding-pulloverlap3.pl;
use strict;
use warnings;
use List::Util qw[min max];
use Data::Dumper;

#You can delete the following line. My testing setup works better without command-line arguments
@ARGV = qw(dani-sampleHairpindata.txt dani-sampleGenedata.txt) if @ARGV == …
d5e5 109 Master Poster

I may need to see more of the code to see where you are going with this, but so far I'm confused as to why the hairpin id becomes a new hash key rather than a gene id as the idea is to compare the distances between a series of gene start and stops on a scaffold to the hairpin on that scaffold and then determine the gene with the minimum distance to the hairpin start or stop. I would have thought that the gene id that is found to be the min would become a key that could then be used to pull the data from that gene id to be printed in the output. But I'm a total newbie, so I will defer to you and wait to see more code. Thanks again for the help.
Btw, I won't be able to check back in until tomorrow evening EST.

You're right, the script I posted hasn't done anything with the gene data yet. It seemed strange to me that the hairpin hash had only scaffold as key, as scaffold is not unique. I may still be confused. I'll have another look on the weekend.

d5e5 109 Master Poster

As I mentioned before, this is some old code someone helped me with. I believe you can ignore FileHandle for now and thanks for the fyi on the single-letter words when opening files... btw, can you tell me why we have call a file by two things (ie. H, $hairpin)?

I'm not sure I understand the question but the open statement associates a file handle with a file name. The file name consists of the path and name of the file. The file handle represents a reference to some location in memory for internal use by perl. I'm not a computer scientist so that may not be exactly correct.

The following is obviously not complete, but I modified the structure of the hairpin hash so that the hairpin id can serve as a key within the scaffold location hash. The goal is to be able to store and look up the saved data for each hairpin in the scaffold. Let me know if this structure seems better or worse than what you had.

#!/usr/bin/perl;
#comparepremiRNAtocoding-pulloverlap3.pl;
use strict;
use warnings;
use Data::Dumper;

#You can delete the following line. My testing setup works better without command-line arguments
@ARGV = qw(dani-sampleHairpindata.txt dani-sampleGenedata.txt) if @ARGV == 0;

my $hairpin_filename = $ARGV[0];
my $gene_filename = $ARGV[1];

#open a file for reading and associate it with a lexical filehandle
open(my $fh, '<', $hairpin_filename)||die "open $hairpin_filename failed: $!";

my %h_hash=();
while(<$fh>)  # first load h into hash of arrays
{
  chomp;
  my @cols = split/\t/;#split …
d5e5 109 Master Poster

One quick question and then I hope to have some time this afternoon: I see your program loads a module called FileHandle but doesn't seem to do anything with any of that module's properties or methods. I'm not familiar with the FileHandle module. Is it an important part of your solution, or can I ignore it for now?

Avoid using single-letter bare words such as 'H' and 'C' when opening files. It may not have caused you a problem yet but it seems like fingernails on chalkboard to Perl programmers.

# Bareword filehandles such as SARAN are package globals. Use lexical filehandles.
# Prefer the three-argument form of open, especially if the filename is not hardcoded.
# Include the filename in the error message.

~Comment by Sinan Ünür on StackOverflow

The data structure in your hash looks unnecessarily complex but I'll have to experiment with that this afternoon before recommending a change. In principle, with the right data structures you can refer to any hairpin scaffold data by its key. I'll post more this afternoon.

d5e5 109 Master Poster

perhaps, rather than a b-tree, someone could help me put together a method of finding the "gene" with the minimum distance to the "hairpin" by doing something like

my @dist = ( abs($h_hash{$c[1]}[$i][2] - $c[2]), abs($h_hash{$c[1]}[$i][3] - $c[2]), abs($h_hash{$c[1]}[$i][2] - $c[3]), abs($h_hash{$c[1]}[$i][3] - $c[3]) );

@dist = sort {$a <==> $b} @dist;
$min = $dist[0]; # this will find the distance to a single gene and would need to be done again for every array with a matching key (scaffold) and then I would want to find the min of each of these.

I just don't know how to reference back into the hash once the entry with the minimum distance from the hairpin is determined.

Due to my unfamiliarity with the b-tree concept, I would favour your second idea for finding the minimum distance. I don't have any more computer time today but will try to find a way tomorrow.

What I might suggest (but haven't worked out yet) is using the Tie:File module to treat the Genedata file as an array that you can loop through repeatedly. As you find the nearest gene for each hairpin you could save the index of the Genedata array in the hairpin hash, or in another hash that links hairpin with nearest gene. I think that's promising but won't know if it works until I try it tomorrow.

d5e5 109 Master Poster

That works perfectly!! Thanks a lot for your help!!

You're welcome. Please don't forget to mark this topic 'solved'.

d5e5 109 Master Poster

I should have put more error-checking to test if it successfully opened the files. Here is an improved version. Make sure the files 1.txt and 2.txt (attached to this post) exist in your current working directory so the script can open them.

#!/usr/bin/perl
use strict;
use warnings;
use File::Basename;

my %hash;
my $file1 = '1.txt';
my $file2 = '2.txt';

open my $fh, '<', $file2 or die "Can't open $file2: $!";
while (<$fh>){
    chomp;
    my @flds = split;
    my $fullname = $flds[0];
    my $name = fileparse($fullname,qw(.fit .vcor)); #Remove the file extension
    $hash{$name} = join ' ', @flds[2..$#flds];
}
close $fh;

open $fh, '<', $file1 or die "Can't open $file1: $!";
while (<$fh>){
    chomp;
    my @flds = split;
    my $fullname = $flds[0];
    my $name = fileparse($fullname,qw(.fit .vcor)); #Remove the file extension
    if (exists $hash{$name}){
        print "$name @flds[2..$#flds] $name $hash{$name}\n";
    }
    else{
        print "$name @flds[2..$#flds] in list1 matches nothing in list2\n";
    }
}
d5e5 109 Master Poster
#!/usr/bin/perl
use strict;
use warnings;
use File::Basename;

my %hash;

open my $fh, '<', '2.txt';
while (<$fh>){
    chomp;
    my @flds = split;
    my $fullname = $flds[0];
    my $name = fileparse($fullname,qw(.fit .vcor)); #Remove the file extension
    $hash{$name} = join ' ', @flds[2..$#flds];
}
close $fh;

open $fh, '<', '1.txt';
while (<$fh>){
    chomp;
    my @flds = split;
    my $fullname = $flds[0];
    my $name = fileparse($fullname,qw(.fit .vcor)); #Remove the file extension
    if (exists $hash{$name}){
        print "$name @flds[2..$#flds] $name $hash{$name}\n";
    }
    else{
        print "$name @flds[2..$#flds] in list1 matches nothing in list2\n";
    }
}

Gives the following output:

1aw7_AB 1 2 3 1aw7_AB 4 5 6
1bjw_AB 9 4 7 1bjw_AB 2 7 7
1biq_AB 8 0 1 in list1 matches nothing in list2
d5e5 109 Master Poster

one more thing..I am newbie in perl so trying to learn..does return 1 always denotes true and return 0 always returns false..and is it mandatory to use them in subroutines? Also many perl books says that while can be used for multi line file looping. But in my case it did not work. Then can we say, while loop is not useful always?
thanks

My subroutine returns 1 or 0 because I want to evaluate the subroutine in a boolean context. In the above script if ( already_in_file($input) ){... calls the subroutine and evaluates it in a boolean context in which undef, zero, empty string and empty list evaluate to false and anything else evaluates to true.
Truth and Falsehood...The number 0, the strings '0' and '' , the empty list () , and undef are all false in a boolean context. All other values are true. Negation of a true value by ! or not returns a special false value. When evaluated as a string it is treated as '' , but as a number, it is treated as 0.

It is not mandatory for a subroutine to return 1 or 0. You could have a subroutine return any value you want.

Sure you can use a while loop to loop through a multi-line file, but in the above script you wanted to loop through and sometimes add to the file, so using Tie::File lets you treat the file like an array. I find …

d5e5 109 Master Poster

The Tie::File module lets you treat a file like an array so you can repeatedly loop through it and push new records into it.

#!/usr/bin/perl
#tie_file.pl;
use strict;
use warnings;

use Tie::File;

my $filename = '/home/david/Programming/data/new1.txt';

tie my @array, 'Tie::File', $filename or die "Unable to Tie $filename: $!";

my ($input, $input1);

($input, $input1) = ('David', 'at home');#Try adding myself

if ( already_in_file($input) ){
    print "User name already exists.\n";
}
else{
    push @array, "$input\t$input1";
    print "Added $input to $filename.\n"
}

sub already_in_file{
    my $u2find = shift;
    my $user;
    
    foreach (@array){
        ($user) = split /\t/;
        if ($user =~ m/$u2find/i){
            return 1;#True if user already in file
        }
    }
    return 0;#False if user not found in file
}
d5e5 109 Master Poster

Thanks, it worked.

You're welcome. Please don't forget to mark this thread 'solved'.

d5e5 109 Master Poster

I don't have your kind of database and am not familiar with awk so I don't understand your script. To create an output file whose name is the current date you can do the following:

#!/usr/bin/perl
use strict;
use warnings;

my ($day, $month, $year) = (localtime)[3,4,5];
my $filename = sprintf("%04d-%02d-%02d\n", $year+1900, $month+1, $day);

open my $fh, '>', $filename;

print $fh "This file should have today's date as it's name.";
d5e5 109 Master Poster

Thanks a lot!! It works..I think my mistake was in the command as I missed 'Local' and 'Lines terminated' parts.
I have one more question. Searched Google for it but no help. I think I am searching wrong terms.
I have attached another small tab delimited text file attached here. I want the corresponding rows should come. For eg. If I select Country="England", then I want the output with all the corresponding cities and Particiapnats.
So the output table should look like:

Country  Cities Participants
England     Taunton   225
            Bristol   654
            Plymouth  585
England     London    552
            Exeter    235
            Leeds     445

I tried using LIMIT but no luck (or I am not using it correctly!) (Please note that the numbers showing here is not part of my table!)
Please suugest..
Thanks

Your input table has no country associated with most of the cities, such as Bristol, Plymouth, Exeter and Leeds. When you write a SELECT query you can easily retrieve the rows that contain 'England', but of course that will miss the rows where country is null. I don't know of any way around that except to manually fill in all null values in the country column with the appropriate country. You know that Leeds is in England, but MySQL does not.

d5e5 109 Master Poster

If you have generated the text file on a Windows system, you might have to use LINES TERMINATED BY '\r\n' to read the file properly, because Windows programs typically use two characters as a line terminator. Some programs, such as WordPad, might use \r as a line terminator when writing files. To read such files, use LINES TERMINATED BY '\r'.

see note in docs for LOAD DATA
The following works on my linux computer. (I don't have MySQL on Windows.)

DROP TABLE IF EXISTS `all_text`;
CREATE TABLE `all_text`(
`country` VARCHAR(20),
`some_nbr` VARCHAR(20),
`city` VARCHAR(20),
`gender` VARCHAR(20),
`vote` VARCHAR(20)
) ENGINE = MYISAM;

#Change the file path to get this to run on your computer
LOAD DATA LOCAL INFILE "/home/david/Programming/data/test.txt" INTO TABLE all_text
LINES TERMINATED BY '\r\n';

SELECT * FROM all_text;

The result of the above is

+-----------+----------+------+--------+----------+
| country   | some_nbr | city | gender | vote     |
+-----------+----------+------+--------+----------+
| Australia | Sydney   | 55   | M      | 245254   |
|           | Beijing  | 65   | M      | 22254    |
| Greece    | Athens   |      | F      | 2222AVG5 |
| Thailand  | Bangkok  | 42   | M      | 76577    |
| Malayasia |          | 22   | M      | 7676578  |
| Japan     | Tokyo    | 75   | F      | 987765   |
| Chile     | Santiago | 58   | F      | 5453ASD  |
| Russia    | Moscow   | 75   | F      | 343545   |
| Uganada   |          |      | M      | 676867   |
| Canada    | Montreal | 77   | M      | 4544345  |
+-----------+----------+------+--------+----------+
debasisdas commented: agree +13
d5e5 109 Master Poster

Hi Thanks for your reply. I tried making a test table to populate with my data where some data are missing as well like before. But now the problem is different which you could see from this image. The first Country column VARCHAR(20) is not coming properly! Country names are like Australia, Greece etc..
[IMG]http://i56.tinypic.com/2cnfri9.jpg[/IMG]
But missing data in columns City, Age are loading properly now..
Please note that, I am doing all these in my Windows computer. I saved the data first in .xlsx format and then saved them as tab delimited text.
Command used to load data:
LOAD DATA INFILE "J/MySQL/test.txt" INTO TABLE test;
Thanks..

If you attach your test.txt tab delimited file to your post we could test the import and try to replicate the error.

d5e5 109 Master Poster

Try creating a table having the same column names whose columns are all VARCHAR and import your data into that. That should work as long as your VARCHAR columns are big enough. Then you can write a query that reads the data from this all-VARCHAR table and INSERTs into the table you want to populate. That seems like a roundabout way to do it, but I used to do that a lot years ago when I had problems importing a file. If you have to convert a value from VARCHAR to INT, it's easier to handle that in your insert query that selects from your intermediate table.

d5e5 109 Master Poster

The previous script uses the parse_file() method to automatically open and parse the entire file. If you want to parse a file that is already open (such as STDIN) or you want to parse only some records in a large file then you can use the parse() method instead. (I can't find it in the source Parser.pm file either but that doesn't seem to matter -- it works for me anyway.) The following script loops through a file already open in STDIN, and parses it and prints some output.

#!/usr/bin/perl
use strict;
use warnings;

use HTML::Parser ();

# Create parser object
my $p = HTML::Parser->new( api_version => 3,
                        start_h => [\&start, "tagname, text"],
                        text_h  => [\&text,  "text"],
                        end_h   => [\&end,   "tagname, text"],
                        marked_sections => 1,
                      );

# Loop through STDIN and parse each line.
while (<>){
    $p->parse($_);
}
print "\n";

$p->eof;#Tell Parser object we're finished parsing this file

sub start{
    my ($tagname, $text) = @_;
    print "<!-- $tagname starts here................-->\n";
    print $text;
}

sub text{
    my $text = shift;
    print $text;
}

sub end{
    my ($tagname, $text) = @_;
    print "\n<!-- $tagname ends here................-->\n";
    print $text;
}
d5e5 109 Master Poster

Undefined subroutine &main::start called at getwords.pl line 27.

The error message says it can't find your start function. It doesn't mention a function named 'parse', you need to define a function or subroutine named 'start' in your main package.

For example, if you have an html file in your current working directory name 'VerySimpleFile.html' the following script should run OK:

#!/usr/bin/perl
use strict;
use warnings;

use HTML::Parser ();

# Create parser object
my $p = HTML::Parser->new( api_version => 3,
                        start_h => [\&start, "tagname, text"],
                        text_h  => [\&text,  "text"],
                        end_h   => [\&end,   "tagname, text"],
                        marked_sections => 1,
                      );

# Parse directly from file

$p->parse_file('VerySimpleFile.html');

sub start{
    my ($tagname, $text) = @_;
    print "<!-- $tagname starts here................-->\n";
    print $text;
}

sub text{
    my $text = shift;
    print $text;
}

sub end{
    my ($tagname, $text) = @_;
    print "\n<!-- $tagname ends here................-->\n";
    print $text;
}
d5e5 109 Master Poster

Dear All,

Can anybody tell me how to autoincrement the Hexa decimal value.

For Ex:

I am having the value FF0B in Result array and want increment it to FF0C. Now Result array should contain two values.

(I assume you mean 'increment' because I think 'autoincrement' refers to database table columns.) When you increment a hex string the result is an ordinary decimal number which you have to reformat as a hex string if that's what you want.

#!/usr/bin/perl
use strict;
use warnings;

#Create results array containing one value
my @results = ('FF0B');

#Add 1 to first element and format as hex string for second element
push @results, sprintf('%04X', hex($results[0]) + 1);

#Results array now contains FF0B, FF0C
print 'Results array now contains ', join(', ', @results), "\n";
d5e5 109 Master Poster

You have declared $baz with the my function which limits the scope of $baz to the block in which it is declared thus making it impossible to obtain its value from outside that block. A my declares the listed variables to be local (lexically) to the enclosing block...

In the following script I changed the my to our and deleted the statements not necessary for defining $baz and referring to its value from outside the package:

#!/usr/bin/perl
use strict;
use warnings;

{ package example;
    our $baz = "sometext";
}

print $example::baz;
d5e5 109 Master Poster

<form name="login" action="$cgidir?action=verify_login" method="POST"> The above doesn't look right to me. To get it to work I put it in an html document like the following:

<html>
<header>
<title>Post form example</title>
</header>
<body>
<form name="login" action="/cgi-bin/verify_login" method="POST">
    <input type="text" name="user">
    <input type="password" name="pass">
    <input type="submit" name="login" value="Login">
</form>
</body>
</html>

Then your verify_login can read what the form POSTed by using the CGI module and its param method.

#!/usr/bin/perl
#/cgi-bin/verify_login
use strict;
use warnings;

use CGI qw(:standard);
my $cgi=new CGI; #read in parameters

print $cgi->header(); #print a header
print $cgi->start_html("Welcome"); #generate HTML document start
print "<h1>Welcome, ",$cgi->param('user'),". Your password is ",$cgi->param('pass'),"</h1></p>";
print "But I won't tell anyone.</p>";
print $cgi->end_html();#finish HTML document
roswell1329 commented: Consistently clear and useful posts. +4
d5e5 109 Master Poster

Rather than struggle to compose one regex that does all the substitution I would use two regexes. I find it easier to understand and change if necessary.

#!/usr/bin/perl
use strict;
use warnings;

my @words = qw(iodine Thiodi number dix Iodic iodide);

foreach my $word(@words){
    $word =~ s/(?<!io)di/deu/i;#Replace di with deu unless preceded by io
    $word =~ s/(?=thio)di/deu/i;#Replace di with deu if preceded by thio
    print "$word ";
}
print "\n"; #iodine Thiodi number deux Iodic iodide
d5e5 109 Master Poster

I don't know but you could start by finding your httpd.conf file and having a look at this post in case the solution to that person's problem throws light on yours.

d5e5 109 Master Poster

Mike, your example works fine but I think the loop obscures your point that slurping means reading a file into a variable in one go... so no need for a loop.

#while(<DATA>){
#    $fulltext=$_;
#}  #The above loop works but to slurp all you need is 
$fulltext=<DATA>;
d5e5 109 Master Poster

what is this variable:

$/

"The input record separator, newline by default. This influences Perl's idea of what a 'line' is." from http://perldoc.perl.org/perlvar.html

d5e5 109 Master Poster

its not working ....

If you wanted to retrieve the highest score you would use offset of 0 (not 1). And so to retrieve the 6th highest score you need an offset of 5.

SELECT id FROM members ORDER BY score desc LIMIT 5, 1;

In English I think the above says something like, "skip the first five rows and then retrieve one row."

d5e5 109 Master Poster
select count(*) from (select 1 from t1 UNION ALL select 1 from t2) as both_tables;
d5e5 109 Master Poster
#!/usr/bin/perl
use strict;
use warnings;
use 5.006;

my @your_files = qw(google_20110225091600.7z
google_20110225091622.7z
google_20110225100306.7z
google_20110225100410.7z
google_20110225104833.7z
google_ready_20110225100410.txt
google_ready_20110225104833.txt
Yahoo_20110225091639.7z
Yahoo_20110225100320.7z
Yahoo_20110225100424.7z
Yahoo_20110225104849.7z
Yahoo_ready_20110225100424.txt
Yahoo_ready_20110225104849.txt
Community_20110225091637.7z
Community_20110225100318.7z
Community_20110225100422.7z
Community_20110225104848.7z
Community_ready_20110225100422.txt
Community_ready_20110225104848.txt);

my @keys = ("20110225104849","20110225104833","20110225104848");
my $pattern = join '|', @keys;# Result: 20110225104849|20110225104833|20110225104848
print "You do not want to delete files matching this pattern:\n$pattern\n\n";
print "You DO want to delete the following files:\n";

#For testing I substituted 'print' for 'unlink'
#When you're sure it works correctly replace with
#do { unlink if ( !/$pattern/ ) } for @your_files;
do {print "$_\n" if ( !/$pattern/ )} for @your_files;
d5e5 109 Master Poster

Hi,

Thanks for your help...i got the output. thanks for your coding

You're welcome. Please don't forget to mark this thread 'solved'.

d5e5 109 Master Poster

The following works for me. my $this_url = $cgi->url(); Don't put anything between the parentheses. You put the full path to your script file and Apache doesn't understand that. The $cgi->url() method should retrieve the correct url for the script that you are running, so that when you submit your form it should go to the same script containing that form. You can read about this at OBTAINING THE SCRIPT'S URL

d5e5 109 Master Poster
#!/usr/bin/perl
use strict;
use warnings;

use CGI qw(:all);

#Create a new CGI object
my $cgi = new CGI;
my $this_url = $cgi->url(); #Save this script's url
my $first = $cgi->param('first_name');
my $last = $cgi->param('last_name');

print "Content-type:text/html\n\n";
print <<EndOfHTML;
<html><head><title>Generating Self-Referential URLs</title></head>
<body>
<FORM action="$this_url" method="POST">
First Name: <input type="text" name="first_name">  <br>

Last Name: <input type="text" name="last_name">

<input type="submit" value="Submit">
</FORM>
<p>The name you entered was '$first $last'</p> 
</body>
EndOfHTML
#Put a comment or linefeed after EndOfHTML
#Otherwise perl might not find it.
d5e5 109 Master Poster
do { unlink if ( !/20110225104849|20110225104833|20110225104848/ ) } for @your_files;

I like Muthu's answer and it works, but BE CAREFUL when testing it that you glob the desired directory into your array. You don't want to delete all the files in whatever your current working directory happens to be!

#!/usr/bin/perl
use strict;
use warnings;
use 5.006;

my @your_files = <temp/*>;

#Comment out the following line the first time you test this.
#You want to make sure that @your_files contains files from the [I]desired[/I] directory.
#Otherwise you may delete all files in some other directory!!!
#do { unlink if ( !/20110225104849|20110225104833|20110225104848/ ) } for @your_files;

print "$_\n" foreach @your_files;
d5e5 109 Master Poster

Hi to all,
i created following table in SQL.
===================================

create table employee(name varchar(20), dob date)

===================================

But, when i ran the query, i got the massage that 'SQL can not find data type 'date''.
How should i include 'date' in to table definition.

create table employee(name varchar(20), dob date) should have worked OK in MySQL. What database software were you using when you got that error?

d5e5 109 Master Poster

Thank you so much for your help, it worked. (Although, the files were being saved as "MyFile..png", but that was easily fixed by removing the extra period from the file saving section.)

Once again, you've been amazing in both your speedy replies, and great help.

Thanks,
Liam.

You're welcome. I didn't want to test the full script as it involved saving a lot of files so the suggestions were kind of hit and miss, but I'm glad it works now. Please don't forget to mark this thread solved.

d5e5 109 Master Poster

Please accept my condolences for your loss.

I think the remaining problem consists in the lexical (i.e. declared with my and limited in scope to the current block or subroutine) $ext variable going out of scope at the point where you name and save the file. Try the following: move the my $ext; from the subroutine and put it near the beginning of your program, where it will have the widest scope. You could put it immediately before the #Set specifics comment, for example, like this:

## Bugs ##

# Files that aren't a png, are being saved as one.
my $ext; #Declare variable to store file extension until file is saved.

# Set Specifics
# etc.
#.

Make sure the my $ext; occurs only once in your program. The script should really start with use strict; use warnings; as nearly all Perl scripts should.

d5e5 109 Master Poster
#!/usr/bin/perl
use strict;
use warnings;

use constant {SECONDS_IN_DAY => 86400};
use HTTP::Date qw(time2str str2time parse_date);

my $today_YMD = epoch2YMD(time);
my $today_epoch = str2time($today_YMD);
my $today_minus_3_epoch = $today_epoch - 3 * SECONDS_IN_DAY;
my $today_minus_3_YMD = epoch2YMD($today_minus_3_epoch);
my (@completed, @not_completed);

my $filename = 'omar.txt';
open my $fh, '<', $filename or die "Failed to open $filename: $!";

while (<$fh>){
    chomp;
    my @rec = split(/\|/);
    my $jobdt = epoch2YMD(str2time($rec[0]));
    if ($jobdt lt $today_minus_3_YMD){
        push @not_completed, \@rec;
    }
    else{
        push @completed, \@rec;
    }
}
#use Data::Dumper;
#print Dumper(\@not_completed);
print "Total successful jobs = " . scalar(@completed) . "\n";
print "Total failed jobs = " . scalar(@not_completed) . "\n";
print "Name of Failed jobs are: ";

my @arr;
foreach (@not_completed){
    push @arr, @{$_}[2];
}
print join(', ', @arr);

sub epoch2YMD{
    #Converts epoch datetime number (seconds since beginning of epoch)
    #Drops time portion and returns date-only string as YYYY-MM-DD
    my $epoch = shift;
    die "Usage is epoch2YMD(epoch)" unless ($epoch);
    my $today_YMD = substr(parse_date( localtime($epoch)),0,10);
}

Output today (Tue, Feb 1) is:

Total successful jobs = 11
Total failed jobs = 18
Name of Failed jobs are:  ABC,  DEF,  GHI,  JKL,  LMN,  OPQ,  RST,  UVW,  XYZ,  MVU,  QPO,  CBD,  ASR,  QWY,  CXZ,  FNB,  VXR,  CDS
d5e5 109 Master Poster

...d5e5, so by using the filename from the end of the src URL, that should solve the issue with the incorrect filetypes, yes?

Yes. Assuming that the src URL ends in an extension that correctly identifies the file type, you can capture the file name (or just the extension, .png or .jpg, if you prefer.)

You need to change the following statement from

open my $IMAGE, '>>', "$title.png"
        or die "Cannot create file!";

to

open my $IMAGE, '>>', "$title.$ext"
        or die "Cannot create file!";

after declaring a variable called $ext and capturing the correct file extension into it.

How do you capture the file extension into your $ext variable? See the following snippet from your script, which I have modified slightly. Notice that the regular expression (or regex) now contains two sets of parentheses instead of one. The second set of parentheses captures the end of the src url into $2, which we save to $ext.

# Obtains all individual comic data
sub getComicData {
    my $siteData = get("$sitePrefix$current/");
    my @data = split /\n/, $siteData;
    my $ext;
    foreach (@data) {
        if (/http:\/\/xkcd.com\/(\d+)\//) {
            $current = $1;
        }
        if (/src="(http:\/\/imgs.xkcd.com\/comics\/.+(\.\w{3}))"/) {
            $currentUrl = $1;
            $ext = $2;
            print "$currentUrl has the following extension: $ext", "\n";
            sleep(3);
            if (/alt="(.+?)"/) {
                $title = $1;
                $title =~ s/&#(\d+);/chr($1)/ge;
            $title = "House of Pancakes" if $current == 472;  # Color title on comic 472 with weird syntax
            }
            if (/title="(.+?)"/) {    #title commonly know as 'alt' text
                $alt = $1; …