Hi to all PERL programmers,
I have data like this with 6 columns

LINES   XY1 XY2 XY3 XY4 XY5

P1  Z/Z T/T -/- T/T T/T
P2  A/A A/A G/G Z/Z T/T
1   G/G T/T G/G T/T G/G
2   T/T A/A C/C C/C T/T
3   T/T G/G T/T G/G T/T
4   A/A C/C A/A A/A A/A
5   A/A A/A T/T T/T A/A
  1. First, I want to find how many columns (XY1 to XY5) differ between P1 and P2.
     A column counts as equal (eq) when P1 and P2 contain the same letters (alleles), or when either P1 or P2 contains Z/Z or -/-. Otherwise it counts as different (nq).
  2. Then I compare each line's value with P2, column by column (XY1 to XY5), starting with line 1 and continuing through lines 2 to 5. If they match I give 1, otherwise 0.
  3. For each of lines 1 to 5 I sum these values across columns XY1 to XY5, but I include only the columns that differ between P1 and P2.
  4. Finally, for each of lines 1 to 5 I calculate the percentage of matching with P2 by dividing the sum by the number of markers that differ between P1 and P2.

I am expecting output like this:

LINES   XY1 XY2 XY3 XY4 XY5 SUM %

P1      eq  nq  eq  eq  eq
P2 1
1       0   0   1   0   0   0   0
2       0   1   0   0   1   1   100
3       0   0   0   0   1   0   0
4       1   0   0   0   0   0   0
5       1   1   0   0   0   1   100

I have data like this in more than 5000 rows. At present I am doing this in Excel 2010 with different formulas, but it takes a lot of my energy.
I would like to do this in PERL. I am a newbie in PERL, and so far I have only succeeded in reading the file and printing it to the screen.
I really need help solving this in PERL with code. Any help would be appreciated.

Hi genetist,
There are several things in your explanation of what you want that are not clear. However, going by the title of your post and what I could understand of your requirements, I came up with the following, which I believe will help you a great deal.

use warnings;
use strict;

my %data;

=pod
 Since I don't understand what you
 want from the P1 and P2 comparison,
 I have left it out.
 Once what you want is clear enough,
 we can factor it in later.
=cut

# get the header first
my $header = <DATA>;

# then take the P1 off
# since I don't understand how
# you want to use it
<DATA>;

while (<DATA>) {
    next unless /\S/;    # skip blank lines
    my @val = split ' ', $_;    # split on whitespace, ignoring any leading blanks
    push @{ $data{ $val[0] } }, @val[ 1 .. $#val ];
}

# get the values for P2;

my @p2 = @{ delete $data{P2} };

=pod
The following displays the heading and the
row-by-row comparison of P2 with the other rows, as
specified by the original poster, except for SUM and %.
I don't know how the OP intends to
generate the SUM and percentage ( % ),
so until that is known they are omitted from the following.
=cut

print $header;

for my $key ( sort keys %data ) {
    print $key, ' ';
    my @values = @{ $data{$key} };
    for ( 0 .. $#values ) {
        print $values[$_] eq $p2[$_] ? '1 ' : '0 ';
    }
    print $/;
}

__DATA__
LINES   XY1 XY2 XY3 XY4 XY5
P1  Z/Z T/T -/- T/T T/T
P2  A/A A/A G/G Z/Z T/T
1   G/G T/T G/G T/T G/G
2   T/T A/A C/C C/C T/T
3   T/T G/G T/T G/G T/T
4   A/A C/C A/A A/A A/A
5   A/A A/A T/T T/T A/A

produces:

LINES   XY1 XY2 XY3 XY4 XY5
1 0 0 1 0 0 
2 0 1 0 0 1 
3 0 0 0 0 1 
4 1 0 0 0 0 
5 1 1 0 0 0 
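
For what it is worth, here is a rough sketch of one possible reading of the four steps in the original post, including the P1/P2 comparison, SUM and %. It is only an interpretation, not a definitive answer. The names score_lines and %missing are made up for illustration, the sample data is embedded in a string instead of being read from a file, and the percentage is assumed to be 0 when no column differs between P1 and P2.

```perl
use strict;
use warnings;

# Genotypes treated as "missing" when comparing P1 with P2 (from the original post)
my %missing = map { $_ => 1 } ( 'Z/Z', '-/-' );

# Hypothetical helper: takes the P1 and P2 genotypes (array refs) and the
# lines as [ [ name, genotypes... ], ... ]; returns the eq/nq status per
# column and one [ name, flags, sum, percent ] row per line.
sub score_lines {
    my ( $p1, $p2, $lines ) = @_;

    # Step 1: a column is 'nq' only when both parents are known and differ
    my @status;
    for my $i ( 0 .. $#{$p2} ) {
        my ( $a1, $a2 ) = ( $p1->[$i], $p2->[$i] );
        my $same = $missing{$a1} || $missing{$a2} || $a1 eq $a2;
        push @status, $same ? 'eq' : 'nq';
    }
    my @differing = grep { $status[$_] eq 'nq' } 0 .. $#status;
    my $n_diff    = @differing;

    my @rows;
    for my $line (@$lines) {
        my ( $name, @geno ) = @$line;

        # Step 2: 1 where the line matches P2, otherwise 0
        my @flags = map { $geno[$_] eq $p2->[$_] ? 1 : 0 } 0 .. $#{$p2};

        # Step 3: sum the flags over the differing columns only
        my $sum = 0;
        $sum += $flags[$_] for @differing;

        # Step 4: percentage of differing columns that match P2
        # (assumption: 0 when P1 and P2 do not differ anywhere)
        my $pct = $n_diff ? 100 * $sum / $n_diff : 0;

        push @rows, [ $name, \@flags, $sum, $pct ];
    }
    return ( \@status, \@rows );
}

my $text = <<'END';
LINES   XY1 XY2 XY3 XY4 XY5
P1  Z/Z T/T -/- T/T T/T
P2  A/A A/A G/G Z/Z T/T
1   G/G T/T G/G T/T G/G
2   T/T A/A C/C C/C T/T
3   T/T G/G T/T G/G T/T
4   A/A C/C A/A A/A A/A
5   A/A A/A T/T T/T A/A
END

my ( @header, %parent, @lines );
for my $row ( split /\n/, $text ) {
    next unless $row =~ /\S/;    # skip blank lines
    my ( $name, @geno ) = split ' ', $row;
    if    ( $name eq 'LINES' )   { @header = ( $name, @geno ) }
    elsif ( $name =~ /^P[12]$/ ) { $parent{$name} = \@geno }
    else                         { push @lines, [ $name, @geno ] }
}

my ( $status, $rows ) = score_lines( $parent{P1}, $parent{P2}, \@lines );

print join( ' ', @header, 'SUM', '%' ), "\n";
print join( ' ', 'P1', @$status ), "\n";
print join( ' ', $_->[0], @{ $_->[1] }, $_->[2], $_->[3] ), "\n" for @$rows;
```

With the sample data only XY2 differs between P1 and P2, so lines 2 and 5 come out with SUM 1 and 100 %, and the others with 0. The lines are kept in an array rather than a hash, so they stay in file order. With more than 5000 rows, a plain sort keys would put 10 before 2. For real data the percentage will often be fractional, so you may want to format it with sprintf.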

Lastly, the language is Perl, not PERL. Perl is not an acronym, though some have been coined for it.

Edited 3 Years Ago by 2teez
