I need some help setting up a Perl script that will compute statistics on some data I have in an Excel file. My plan was to copy the Excel file into a text file (with tab-delimited columns) and run the script on that. I downloaded a simple statistics package from http://search.cpan.org/~brianl/Statistics-Lite-3.2/Lite.pm and copied the file to my bin folder. However, the system does not recognize the Statistics package when I include use Statistics::Lite qw(:all); in the script.

Meanwhile, I've included some sample data and the script as it stands now. If anyone can help me with this script, or knows of a simple way to produce the basic statistical analysis of min, max, mean, mode and standard deviation without an extra package installed, that would be great.

The script needs to:
1) first parse the file by scaffold, which is the value in column [1];
2) determine whether more than one line item has the same value in [1]; if yes, then...
3) put the start sites (column [2]) of all line items with the same value in [1] (i.e. they are all part of the same group) through the statistical analysis of minimum, maximum, mean, mode, and standard deviation. (To do this, all line items belonging to a single group would need to be sorted by column [2].)


I would like the script to generate an output file with the statistical analysis for each group on a separate line.

Here is the sample file, followed by the script:
12 scaffold656_7__ 793 D 12 mdv1 mi M3*_MI 0 2 2.82e 3 18 83.33
17 scaffold657_1__ 10860 D 17 ptc mi 482.1_MI 0 1 2.36e 1 31 94.12
12 scaffold657_3__ 226 D 12 mdv1 mi M3*_MI 0 2 2.82e 3 18 83.33
12 scaffold657_3__ 1348 D 12 mdv1 mi M3*_MI 0 2 2.82e 3 18 83.33
12 scaffold657_4__ 259 D 12 mdv1 mi M3*_MI 0 2 2.82e 3 18 83.33
12 scaffold657_5__ 8776 D 12 mdv1 mi M3*_MI 0 2 2.82e 3 18 83.33
12 scaffold657_5__ 14581 D 12 mdv1 mi M3*_MI 0 2 2.82e 3 18 83.33
12 scaffold657_6__ 11361 D 12 mdv1 mi M3*_MI 0 2 2.82e 3 18 83.33
12 scaffold657_6__ 13353 D 12 mdv1 mi M3*_MI 0 2 2.82e 3 18 83.33
12 scaffold657_6__ 20463 D 12 mdv1 mi M3*_MI 0 2 2.82e 3 18 83.33
21 scaffold657_9__ 4998 D 21 ath mi 414_MIMA 0 2 3.42e 2 36 90.48
12 scaffold657_9__ 6733 D 12 mdv1 mi M3*_MI 0 2 2.82e 3 18 83.33
12 scaffold657_9__ 6855 D 12 mdv1 mi M3*_MI 0 2 2.82e 3 18 83.33

#!/usr/bin/perl
use strict;
use warnings;
use Statistics::Lite qw(:all);

my $data = @ARGV;
open $data or die("Cannot open data file\n");
my (@in,@out);
my @data = <$data>
while(<$data>){  
# first load data into hash of arrays
    chomp;
    push @in, [split(split(/\s|\t/))];#Build an array of arrays
}
close $data;
my $prev_aref;
foreach my $aref (@in){#foreach array reference
    if (!defined($prev_aref)
        or $$aref[1] ne $$prev_aref[1]

#        or abs($$aref[4] - $$prev_aref[4]) >= 250){#At least 250 away from prev loc
        push @out, $aref;
        $prev_aref = $aref;
        my $scaffold = $$aref[1];
    }
}

@out = sort my_sort @out;

foreach my $aref (@out){
my @start = $$aref[2]
my $min = min @start;
my $max = max @start;
my $mean = mean @start;
%calc= statshash @start;
my $stddev = $calc{stddev};

print join "\t" $scaffold "\t" $min "\t" $max "\t" $mean "\t" $stddev "\n";
}

#$min= min @data;
#        $mean= mean @data;
#
 #       %data= statshash @data;
  #      print "sum= $data{sum} stddev= $data{stddev}\n";

   #     print statsinfo(@data);
 #   print join "\t", @$aref, "\n";
#}

sub my_sort{
    my $r = $$a[1] cmp $$b[1]; #Compare group
    if ($r == 0) { #If group in same group
        $r = $$a[2] <=> $$b[2]; #Compare start
    }
    return $r;
}
exit 0;

All 10 Replies

My platform is Linux, so I'm not the one to say how to install from CPAN on Windows, but have a look at http://www.daniweb.com/forums/post1369660.html#post1369660 which is a post by mitchems about installing Statistics::Descriptive on Windows. His advice would probably apply to installing Statistics::Lite as well. Just copying the module into your bin folder won't work if Perl doesn't search that folder for modules, or if the module is not pure Perl and uses other components which need to be compiled.
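
To illustrate the point about where Perl searches for modules, here is a minimal sketch. The directory /home/me/perllib is a made-up example path, not something from this thread; for use Statistics::Lite to work this way, the downloaded file would have to sit at /home/me/perllib/Statistics/Lite.pm, and this only helps for pure-Perl modules.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical directory holding a manually downloaded, pure-Perl module,
# e.g. /home/me/perllib/Statistics/Lite.pm (path is an assumption).
use lib '/home/me/perllib';

# Print every directory Perl searches when it sees "use Some::Module".
# The directory added by "use lib" appears at the front of the list.
print "$_\n" for @INC;
```

If your bin folder is not in the printed list, Perl will not find a module file that was simply copied there.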

I should have mentioned that I am running Ubuntu, which is also Linux-based, not Windows. Did I install the wrong package?

Good. Do you have the Synaptic Package Manager? It's really easy to use, but it only finds some of the modules available on CPAN, so it won't help you install Statistics::Lite. I haven't worked much with statistics, but if Perl won't let you use Statistics::Lite I would say you haven't installed it. Downloading a file from CPAN and copying it to your bin folder doesn't install it, apparently.

If you have the Synaptic Package Manager and don't mind installing something other than Statistics::Lite you could try the following: Start up Synaptic Package Manager and type libstatistics-basic-perl into the Quicksearch box. Select the package(s) you want from the resulting list, mark them for installation and then Apply. That should install a statistics module you can use in your Perl scripts.

Thanks, I downloaded Basic and OrLite this morning with Synaptic. From the description, it sounds like it is exactly what I was looking for. I'll have to look them up online to see how to use them in my scripts. Thanks for the suggestion! How do you think the script looks otherwise? Once I have the statistics functions working properly, will the rest of the script function the way I want it to (above)?

The first problem I see in your script: my $data = @ARGV; does not give you a file name. When you assign an array to a scalar variable, the result is the size of the array, so if you run perl scriptname.pl sample.txt, $data is set to the number of command-line arguments (here 1). Instead you want to open the file named by the first command-line argument (assuming it is a valid file name or path).

If you want to specify the name of the input file at runtime from the command line you can do it like this:

#Try to open first command-line argument and assign to a filehandle
#If the open fails, terminate and print $! which contains open error status.
open my $fh, '<', $ARGV[0] or die("Cannot open $ARGV[0]: $!\n");
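
A small sketch of the scalar-versus-list context difference described above. The array @args and its contents are stand-ins for @ARGV, made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-in for @ARGV (hypothetical values for illustration).
my @args = ('sample.txt', 'extra_argument');

my $count   = @args;     # scalar context: number of elements
my ($first) = @args;     # list context: first element
my $direct  = $args[0];  # explicit index, as used in the open call

print "count:  $count\n";
print "first:  $first\n";
print "direct: $direct\n";
```

The parentheses around $first are what switch the assignment to list context; without them you get the element count again.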

All the tabs appear to have been removed from your sample data in the process of posting it here. Could you attach your sample data as an attachment to your post, please? For example attach sample.txt ("go advanced" while editing your post here, and look for the "manage attachments" button). Thanks.

thanks for clearing that up

I found Statistics::Basic but couldn't find OrLite.

I had to make some changes to your script to get it to run. Does this run for you? It doesn't give all the statistics you want but maybe it can serve as a first step.

#!/usr/bin/perl
use strict;
use warnings;

use Statistics::Basic qw(:all);

#Try to open first command-line argument and assign to a filehandle
#If the open fails, terminate and print $! which contains open error status.
open my $fh, '<', $ARGV[0] or die("Cannot open $ARGV[0]: $!\n");

my @in;

my %scaffolds;
while(<$fh>){  
# first load data into hash of arrays
    chomp;
    my @columns = split(/\s|\t/);
    if ( !defined $scaffolds{$columns[1]}){
        $scaffolds{$columns[1]} = [$columns[2]];
    }
    else{
        my @arr = @{$scaffolds{$columns[1]}};
        push @arr, $columns[2];
        $scaffolds{$columns[1]} = [@arr];
    }
}
close $fh;

printf("%-20s%20s%20s\n", 'Scaffold', 'Mean', 'StdDev');
foreach my $scaf (sort keys %scaffolds){
    my @start_sites = @{$scaffolds{$scaf}};
    my $count = @start_sites;
    my $mean = mean(@start_sites);
    my $stddev = stddev(@start_sites);
    if ($count > 1){
        printf("%-20s%20s%20s\n", $scaf,$mean,$stddev);
    }
}

This gives the following output:

Scaffold                            Mean              StdDev
scaffold657_3__                      787                 561
scaffold657_5__                 11,678.5             2,902.5
scaffold657_6__                   15,059            3,906.78
scaffold657_9__                 6,195.33              848.11

Statistics::Basic doesn't seem to have min and max functions. You can get the min and max values by sorting the array of start sites before printing. I modified the script to include min and max.

#!/usr/bin/perl
use strict;
use warnings;

use Statistics::Basic qw(:all);

#Try to open first command-line argument and assign to a filehandle
#If the open fails, terminate and print $! which contains open error status.
open my $fh, '<', $ARGV[0] or die("Cannot open $ARGV[0]: $!\n");

my @in;

my %scaffolds;
while(<$fh>){  
# first load data into hash of arrays
    chomp;
    my @columns = split(/\s|\t/);
    my ($scaf, $fsite) = ($columns[1], $columns[2]);
    $scaffolds{$scaf} = [] unless exists $scaffolds{$scaf};
    push @{$scaffolds{$scaf}}, $fsite;
}
close $fh;

print join "\t", qw(Scaffold Min Max Mean StdDev), "\n";
foreach my $scaf (sort keys %scaffolds){
    my @start_sites = sort {$a <=> $b} @{$scaffolds{$scaf}};
    my $count = @start_sites;
    my $min = $start_sites[0]; #First element is smallest because of sort
    my $max = $start_sites[$#start_sites];
    my $mean = mean(@start_sites);
    my $stddev = stddev(@start_sites);
    if ($count > 1){
        print join "\t", ($scaf, $min, $max, $mean, $stddev), "\n";
    }
}

This gives the following output:

Scaffold	Min	Max	Mean	StdDev	
scaffold657_3__	226	1348	787	561	
scaffold657_5__	8776	14581	11,678.5	2,902.5	
scaffold657_6__	11361	20463	15,059	3,906.78	
scaffold657_9__	4998	6855	6,195.33	848.11
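
As an alternative to sorting, min and max can also be taken from List::Util, which ships with Perl, so nothing extra has to be installed. This is only a sketch: the start sites are the three scaffold657_6__ values from the sample data, hard-coded here instead of being read from the file.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(min max sum);

# Start sites for scaffold657_6__, copied from the sample data.
my @start_sites = (11361, 13353, 20463);

my $min  = min @start_sites;
my $max  = max @start_sites;
my $mean = sum(@start_sites) / @start_sites;  # @start_sites in numeric context is the count

print join("\t", 'scaffold657_6__', $min, $max, $mean), "\n";
```

Putting "\n" outside the join, as done here, also avoids the trailing tab that appears at the end of each line in the output above.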

Statistics::Descriptive (search for libstatistics-descriptive-perl on Synaptic PM) calculates all the functions you want but gives different results than Statistics::Basic for Standard Deviation. I don't know why.

#!/usr/bin/perl
use strict;
use warnings;

use Statistics::Descriptive;

#Try to open first command-line argument and assign to a filehandle
#If the open fails, terminate and print $! which contains open error status.
open my $fh, '<', $ARGV[0] or die("Cannot open $ARGV[0]: $!\n");

my @in;

my %scaffolds;
while(<$fh>){  
# first load data into hash of arrays
    chomp;
    my @columns = split(/\s|\t/);
    my ($scaf, $fsite) = ($columns[1], $columns[2]);
    $scaffolds{$scaf} = [] unless exists $scaffolds{$scaf};
    push @{$scaffolds{$scaf}}, $fsite;
}
close $fh;

print join "\t", qw(Scaffold Min Max Mean Mode StdDev), "\n";
foreach my $scaf (sort keys %scaffolds){
    my $stat = Statistics::Descriptive::Full->new();
    my @start_sites = sort {$a <=> $b} @{$scaffolds{$scaf}};

    $stat->add_data(@start_sites);
    my $count = $stat->count();
    my $min = $stat->min();
    my $max = $stat->max();
    my $mean = $stat->mean();
    my $mode = $stat->mode();
    $mode = 'None' if !defined $mode; #No element occurs more than any other
    my $stddev = $stat->standard_deviation();
    if ($count > 1){
        print join "\t", ($scaf, $min, $max, $mean, $mode, $stddev), "\n";
    }
}

Outputs:

Scaffold	Min	Max	Mean	Mode	StdDev	
scaffold657_3__	226	1348	787	None	793.373808491306	
scaffold657_5__	8776	14581	11678.5	None	4104.75486478791	
scaffold657_6__	11361	20463	15059	None	4784.81222202084	
scaffold657_9__	4998	6855	6195.33333333333	None	1038.71378797691
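
On the differing standard deviations: the two sets of numbers look consistent with Statistics::Basic dividing by n (population standard deviation) and Statistics::Descriptive dividing by n - 1 (sample standard deviation). I haven't checked the module sources, so treat that as an inference from the outputs. The sketch below needs no extra modules; the sub name stddev_with_offset is my own, not from either package.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Standard deviation with a selectable divisor.
# $offset = 0 divides by n (population); $offset = 1 divides by n - 1 (sample).
sub stddev_with_offset {
    my ($offset, @values) = @_;
    my $n = @values;
    return if $n - $offset < 1;    # not enough data points

    my $mean = 0;
    $mean += $_ / $n for @values;

    my $sum_sq = 0;
    $sum_sq += ($_ - $mean) ** 2 for @values;

    return sqrt($sum_sq / ($n - $offset));
}

# Start sites for scaffold657_3__ from the sample data.
my @start_sites = (226, 1348);

printf "population (n):   %.2f\n", stddev_with_offset(0, @start_sites);  # 561.00
printf "sample (n - 1):   %.2f\n", stddev_with_offset(1, @start_sites);  # 793.37
```

The first figure matches the Statistics::Basic output for scaffold657_3__ and the second matches the Statistics::Descriptive output, so which one you want depends on whether you treat the start sites as the whole population or as a sample.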