Reading complex data.......ufff

Question

Vindhyaauri 0 Newbie Poster

15 Years Ago

Hello!!

I have to design a program which can read the following file:

k1082
  SciTegic08250908273D

 30 32  0  0  0  0            999 V2000
    0.3230   -0.6380   -0.7700 C   0  0
    1.1810   -1.2460   -1.7340 C   0  0
    2.5950   -1.1640   -1.6010 C   0  0
    3.1670   -0.4730   -0.5020 C   0  0
    2.2620    0.0980    0.4020 C   0  0
    0.8930    0.0350    0.2990 N   0  0
    4.8610   -0.3640   -0.3230 S   0  0
  1  2  2  0
  2  3  1  0
  3  4  2  0
  4  5  1  0
  5  6  2  0
  6  1  1  0
  4  7  1  0
  7  8  1  0
  8  9  2  0
  9  5  1  0
  8 10  1  0
M  END
> <Name>
k1082
> <AbsoluteEnergy>
127.9

> <ConfNumber>
1

$$$$
k1083
  SciTegic08250908273D

 39 42  0  0  0  0            999 V2000
   -1.3610   -1.4670   -0.0270 C   0  0
   -1.4000   -2.8910   -0.0080 C   0  0
   -0.1970   -3.6480    0.0080 C   0  0
    1.0590   -2.9880    0.0060 C   0  0
    1.0270   -1.5900   -0.0130 C   0  0
   -0.1270   -0.8360   -0.0290 N   0  0
    2.5020   -3.8850    0.0280 S   0  0

I have to choose ONLY the following data:

0.3230   -0.6380   -0.7700 C   0  0
    1.1810   -1.2460   -1.7340 C   0  0
    2.5950   -1.1640   -1.6010 C   0  0
    3.1670   -0.4730   -0.5020 C   0  0
    2.2620    0.0980    0.4020 C   0  0
    0.8930    0.0350    0.2990 N   0  0
    4.8610   -0.3640   -0.3230 S   0  0

and ignore all the remaining data ...Although, i figured out grep/^ / can be used to take up the data, but im not able to exclude the foll data:

-1.3610   -1.4670   -0.0270 C   0  0
   -1.4000   -2.8910   -0.0080 C   0  0
   -0.1970   -3.6480    0.0080 C   0  0
    1.0590   -2.9880    0.0060 C   0  0
    1.0270   -1.5900   -0.0130 C   0  0
   -0.1270   -0.8360   -0.0290 N   0  0
    2.5020   -3.8850    0.0280 S   0  0

Could anyone gimme an idea how to go about this?? Thank you in advance... Best Wishes for the New Year '10

perl

4 Contributors
17 Replies
190 Views
3 Weeks Discussion Span
Latest Post 15 Years Ago Latest Post by Vindhyaauri

All 17 Replies

d5e5 109 Master Poster

15 Years Ago

#!/usr/bin/perl -w
use strict;
my $printed = 0; #Boolean switch to indicate good data have/have not been printed
while( <DATA> ) {
    chomp;
    if (m/^\s{3}    #The line should begin with at least three spaces
            (:?     #Start of a non-capturing group (space or hyphen)
             \s     #either another space
             |      #or
             \-     #a hyphen
             )/x)   #end of group, end of pattern. x means allow comments
    {
        print "$_\n";
        $printed = 1;
    }
    elsif ($printed) {#good data already printed
        last; # so don't print any more
    }
}

__DATA__
k1082
  SciTegic08250908273D

 30 32  0  0  0  0            999 V2000
    0.3230   -0.6380   -0.7700 C   0  0
    1.1810   -1.2460   -1.7340 C   0  0
    2.5950   -1.1640   -1.6010 C   0  0
    3.1670   -0.4730   -0.5020 C   0  0
    2.2620    0.0980    0.4020 C   0  0
    0.8930    0.0350    0.2990 N   0  0
    4.8610   -0.3640   -0.3230 S   0  0
  1  2  2  0
  2  3  1  0
  3  4  2  0
  4  5  1  0
  5  6  2  0
  6  1  1  0
  4  7  1  0
  7  8  1  0
  8  9  2  0
  9  5  1  0
  8 10  1  0
M  END
> <Name>
k1082
> <AbsoluteEnergy>
127.9

> <ConfNumber>
1

$$$$
k1083
  SciTegic08250908273D

 39 42  0  0  0  0            999 V2000
   -1.3610   -1.4670   -0.0270 C   0  0
   -1.4000   -2.8910   -0.0080 C   0  0
   -0.1970   -3.6480    0.0080 C   0  0
    1.0590   -2.9880    0.0060 C   0  0
    1.0270   -1.5900   -0.0130 C   0  0
   -0.1270   -0.8360   -0.0290 N   0  0
    2.5020   -3.8850    0.0280 S   0  0

d5e5 109 Master Poster

15 Years Ago

Hi d5e5,
Thanks a ton for all the help!! But, i didnt quite understand the script from if statement. Could u kindly explain this? Thanks again !!!

The if statement tests to see if the first few characters of the current line match a regular expression representing a pattern that should match the data lines you want to print. The /x parameter after the regular expression lets me break the regex into several lines in order to add comments explaining what each part of the regex represents. Since the regex matches not only the good data that you want to print, but also some of the bad data found further on in the file, we assume that all the good data occurs in one unbroken series of lines, and so as soon as we encounter a line that is not good data we want to stop reading any more, so we exit the loop with the last statement.

Notice that the regex represents only the first few characters of the good data line instead of trying do specify a pattern for every character from start to end of line. It does this because it works. You often have to decide how specific you need to be to match the good lines and not match the bad lines. If you use the same regex on a different data file and find that it matches too much you may have to modify the regex to make it more specific. Also, if the good data and the bad data were not grouped together and separated by obviously unmatching lines, as fortunately they are in your example, then the problem of how to make the regex specific enough to exclude bad data would recur.

d5e5 109 Master Poster

15 Years Ago

I don't understand what your program is doing. It seems to be reading and trying to split and do calculations on every line in the file except the one with $$$$.

Actually, i need to get the first part of the info:
Help with Code Tags
Perl Syntax (Toggle Plain Text)
0.3230 -0.6380 -0.7700 C 0 0
1.1810 -1.2460 -1.7340 C 0 0
2.5950 -1.1640 -1.6010 C 0 0
3.1670 -0.4730 -0.5020 C 0 0
This has to be captured in an array to do some calcuations. This first part of the file needs to be read only till '$$$$', which indicates the end of the array. Then, after 1st set of calculations, go to the second set of data :
Help with Code Tags
Perl Syntax (Toggle Plain Text)
-1.3610 -1.4670 -0.0270 C 0 0
-1.4000 -2.8910 -0.0080 C 0 0
-0.1970 -3.6480 0.0080 C 0 0

I would approach this by doing the following:

Read and skip the lines that do not contain useful data
Save the first series of data lines into an array
Do something with the array
Empty the array
Skip some more useless lines
Again, do something with the array

To do this you need to tell your program how to know the difference between a useful data line and one that you want to skip. I would do this by testing each line that you read to see if it matches a regular expression that represents a pattern to which you expect useful data to conform. You create this pattern by looking at a line of good data and describing some of its distinguishing features, such as: "It starts with 3 spaces followed by either another space or a minus sign." That description should be good enough for the data file you have shown us, in which all the lines you want to skip, such as SciTegic08250908273D do not start with at least 3 spaces.
Try doing something like the following:

#!/usr/bin/perl -w
use strict;
my @BunchOfDataLines   = ();
my $matched = 0;

my $f     = "ComplexData.txt";
open( IN, $f ) or die "Can't open $f : $!";
while ( my $row = <IN> ) {
    chomp($row);
    if ($row =~ m/^\s{3}(\s|\-)/)
    {
        $matched = 1;
        push(@BunchOfDataLines, $row);
    }
    elsif ($matched) {
        @BunchOfDataLines = DoSomethingWithBunchOfData(@BunchOfDataLines);
        $matched = 0;
    }
}
close(IN);

sub DoSomethingWithBunchOfData {
    my @array = @_; #Expect one array passed as parameter
    print "\n" . '@array contains ', scalar(@array), " lines\n";
    foreach (@array) {
        print "$_\n";
    }
    @array = ();#Empty the array after using it
    return @array 
}

d5e5 109 Master Poster

15 Years Ago

Without the sdf and pdb files, I can't test your program. I have looked it over briefly for now. Here it is with no major changes. I added some indentation for readability plus a few comments attempting to answer some of your questions.

#!/usr/bin/perl -w
use strict;
my ( $x1, $x2, $y1, $y2, $z1, $z2 );
my @pdbpr    = ();
my @pdblig   = ();
my @pratoms  = ();
my @pdbpro   = ();
my @ligatoms = ();
#my @pratoms  = (); #No need to declare same variable twice
my @A        = ();
my @B        = ();

#sub ligand ();
#sub protein ();
#sub calc ();

open( IN, "A.sdf" ) or die "Could not open file!";
#my @A = grep /^   /, <IN>; #Remove the 'my', already declared above
@A = grep /^   /, <IN>;
close(IN);

sub ligand {
    foreach my $row (@A)
    {    # Need to get array @A which was read from a file. How??
        #Since you declared @A outside the subroutine it is globally available
        #everywhere, including this subroutine. This could work but it could
        #cause confusion. It would be better to pass as a parameter when
        #calling the subroutine, and in the subroutine assign values in @_
        #to a new array declared within your subroutine.
        my @lig = ( split( /\s+/, $row ) );
        print("@lig \n");
        my @arraylig = ( split( /\s+/, $row ) )[ 1, 2, 3 ];
        print("@arraylig\n");
        push @pdblig,   [@arraylig];
        push @ligatoms, [@lig];
    }
    return ( @pdblig, @ligatoms )    # I need this for calculations in sub calc
    #Perl can return only a flat list of scalar values.
    #It won't give you an error, but it will return one list
    #consisting of all the elements in @pdblig and @ligatoms.
}

open( IN, "B.pdb" ) or die "Could not open file!";
#my @B = grep /^ATOM/, <IN>; #Remove the 'my', already declared above
@B = grep /^ATOM/, <IN>;
close(IN);

sub protein {
    foreach my $line (@B)
    {    # Need to get the value of @B which is read from a file. How??
        my @pr = ( split( /\s+/, $line ) );

        # print ("@pr\n");
        my @arraypr = ( split( /\s+/, $line ) )[ 6, 7, 8 ];

        #  print ("@arraypr\n");
        push @pdbpr,   [@arraypr];
        push @pratoms, [@pr];
    }
    return ( @pdbpr, @pratoms )    # I need this for calculations in sub calc
}

sub calc {
    ( @pdblig, @ligatoms ) = &ligand
      ; # Need to call subroutines ligand and protein for calculations... Is this correct??
      #Perl subroutines can return only a flat list of scalar values.
        # &ligand will give you only one list containing all the elements of
        # the two arrays, which is not what you expect when you assign one list
        # to two array in the previous and following statements.
    ( @pdbpr, @pratoms ) = &protein;
    for my $i1 ( 0 .. $#pdblig - 1 ) {
        my ( $x1, $y1, $z1 ) = @{ $pdblig[$i1] };

        #print ("$x1,$y1,$z1\n");
        my $m  = $ligatoms[$i1][4];
        my $a1 = $ligatoms[$i1][14];
        for my $i2 ( 0 .. $#pdbpr - 1 ) {
            my ( $x2, $y2, $z2 ) = @{ $pdbpr[$i2] };

            #print ("$x2,$y2,$z2\n");
            my $n  = $pratoms[$i2][2];
            my $p  = $pratoms[$i2][3];
            my $a2 = $pratoms[$i2][5];
            my $dist =
              sqrt( ( $x2 - $x1 )**2 + ( $y2 - $y1 )**2 + ( $z2 - $z1 )**2 );
            print
"Ligand Atom $m($a1) ($x1,$y1,$z1) to Protein Atom $n(aa - $a2)($x2,$y2,$z2) is $dist\n"
              ;    # This is the result
            if ( $dist < 4.0 && $dist > 2.2 ) {

# print "Ligand Atom $m($a1) ($x1,$y1,$z1) to Protein Atom $n (aa - $a2)($x2,$y2,$z2) is $dist\n";}# This also is the result
                print
                  "Ligand Atom $m($a1) to Protein Atom $n ($p, $a2) is $dist\n";
            }
        }
    }
}

I recommend that you read a little more about variable scope, and avoid declaring variables more than once. Although it is possible to use, in a subroutine, global variables declared in your mainline code this can get confusing. It is better to pass variables or scalar values to your subroutines where they will be available in the @_ array variable, from which you can assign them to new (differently named) variables declared in your subroutine. Giving different names to variables in subroutines avoids confusing them with global variables having the same name. I'm not expert on these issues but I recommend you read this page about Perl subroutines and passing parameters.

d5e5 109 Master Poster

15 Years Ago

sub calc {
(@pdblig, @ligatoms)= &ligand; # Need to call subroutines ligand and protein for calculations... Is this correct??
(@pdbpr,@pratoms) = &protein;

The above way you are trying to get two arrays from a subroutine will not work. According to what I've read, it is recommended that your subroutine return references to the arrays it has created, which the calling statement can dereference in order to use the arrays. I haven't learned how to explain or do this very well, but you can read about it here in this article called "Pass by reference".

Reply to this topic

Be a part of the DaniWeb community

We're a friendly, industry-focused community of developers, IT pros, digital marketers, and technology enthusiasts meeting, networking, learning, and sharing knowledge.

anunitha · Answer 1 · 2010-01-02T12:12:08+00:00

sir,i did but am not getting,k now i wish to get some script to compare 2 micrarrya data's
1.my computer->program(G-> DTATA(folder)->inside this i have 5 folder more each folder consists of 1000 of data
a.microD(folder 1->consits of many data)
b.microA("2 ")
c.microE(3 )
d.microF(4 ")
e.microG(5 ")
like this i have different folder inside main folder DATA
in this i wish to take 2datas d,e folder inside this i have each 10 datas i need to compare these 2 datas i need to find same starting posistion and ending position in these 2 data ,how i need take in put and print the result in new folder..
(sample-id|chromosome|start-pos|end-pos|num-snp|cnv-length|hmm-state|copy-number|start-snp|end-snp)
in the 2 data also i have same i need to compare this start-pos,and end-postion,
how to take input and print in to new file

Vindhyaauri 0 Newbie Poster · Answer 2 · 2010-01-04T11:49:54+00:00

Hi d5e5,

Thanks a ton for all the help!! But, i didnt quite understand the script from if statement. Could u kindly explain this? Thanks again !!!

Prakash_8111 0 Newbie Poster · Answer 3 · 2010-01-04T21:12:36+00:00

Prakash_8111 0 Newbie Poster

15 Years Ago

Great solution

Vindhyaauri 0 Newbie Poster · Answer 4 · 2010-01-05T15:40:46+00:00

Hello d5e5,

Thanks for the quick response... i have designed a program partially, basically an idea of getting the info from the following data :

k1082
  SciTegic08250908273D

 30 32  0  0  0  0            999 V2000
    0.3230   -0.6380   -0.7700 C   0  0
    1.1810   -1.2460   -1.7340 C   0  0
    2.5950   -1.1640   -1.6010 C   0  0
    3.1670   -0.4730   -0.5020 C   0  0
  1  2  2  0
  2  3  1  0
  3  4  2  0
  4  5  1  0
  
M  END
> <Name>
k1082

> <AbsoluteEnergy>
127.9

> <ConfNumber>
1

$$$$
k1083
  SciTegic08250908273D

 39 42  0  0  0  0            999 V2000
   -1.3610   -1.4670   -0.0270 C   0  0
   -1.4000   -2.8910   -0.0080 C   0  0
   -0.1970   -3.6480    0.0080 C   0  0
  1  2  2  0
  2  3  1  0
  3  4  2  0
  
M  END
> <Name>
k1083

> <AbsoluteEnergy>
148.864

> <ConfNumber>
1

$$$$

Actually, i need to get the first part of the info:

0.3230   -0.6380   -0.7700 C   0  0
    1.1810   -1.2460   -1.7340 C   0  0
    2.5950   -1.1640   -1.6010 C   0  0
    3.1670   -0.4730   -0.5020 C   0  0

This has to be captured in an array to do some calcuations. This first part of the file needs to be read only till '$$$$', which indicates the end of the array. Then, after 1st set of calculations, go to the second set of data :

-1.3610   -1.4670   -0.0270 C   0  0
   -1.4000   -2.8910   -0.0080 C   0  0
   -0.1970   -3.6480    0.0080 C   0  0

Again get into the same loop to do calc.

I have written this basic script .

use strict;
my @lig= ();
my $count =0;
my $z= 0;
open (IN, “HANDLE”) or die “Could not open!!”
while(my $row =<IN>) {
do{
my @lig = (split (/\s+/, $row));
#  next set of lines to do some calculations
my $z=$count++;
print "@lig\n";
}    unless ( my $row eq '\$\$\$\$');
}
close (IN);

But, the problem is that the program picks both the sets of data till the end... Using counting lines , is it possible to make an array so that the data are read in array 1set after the other? Im totally stuck.. cOuld u plz giv an idea how to go about?

Vindhyaauri 0 Newbie Poster · Answer 5 · 2010-01-07T16:57:33+00:00

Hi d5e5, thank you.... its superb... I have the following script with subroutines... I am not too sure how to call subroutines... COuld you please take a look at it once... I have commented where i have problems

use strict;
my ( $x1,$x2,$y1,$y2,$z1,$z2);
my @pdbpr=();
my @pdblig=();
my @pratoms=();
my @pdbpro=();
my @ligatoms=();
my @pratoms=();
my @A=();
my @B=();
#sub ligand ();
#sub protein ();
#sub calc ();
 
open(IN, "A.sdf") or die "Could not open file!";
my @A = grep /^   /,<IN>;
close (IN);
 
sub ligand {
foreach my $row(@A){ # Need to get array @A which was read from a file. How??
my @lig = (split (/\s+/, $row));
print ("@lig \n");
my @arraylig = (split (/\s+/, $row))[1, 2, 3];
print ("@arraylig\n");
push @pdblig,[@arraylig];
push @ligatoms,[@lig];
}
return (@pdblig, @ligatoms) # I need this for calculations in sub calc
}

open(IN, "B.pdb") or die "Could not open file!";
my @B = grep /^ATOM/, <IN>;
close (IN);
 
sub protein {
foreach my $line (@B) { # Need to get the value of @B which is read from a file. How??
my @pr = (split (/\s+/, $line));
# print ("@pr\n"); 
my @arraypr = (split (/\s+/, $line))[6, 7, 8];
#  print ("@arraypr\n");
push @pdbpr,[@arraypr] ;
push @pratoms, [@pr];
 }
return (@pdbpr,@pratoms) # I need this for calculations in sub calc
}


sub calc {
(@pdblig, @ligatoms)= &ligand; # Need to call subroutines ligand and protein for calculations... Is this correct??
(@pdbpr,@pratoms) = &protein;
for my $i1 ( 0 .. $#pdblig-1 ){
my ( $x1, $y1, $z1 ) = @{ $pdblig[$i1] };
#print ("$x1,$y1,$z1\n");
my $m = $ligatoms[$i1][4];
my $a1= $ligatoms[$i1][14];
    for my $i2 ( 0 .. $#pdbpr-1 ){
    my ( $x2, $y2, $z2 ) = @{ $pdbpr[$i2] };
#print ("$x2,$y2,$z2\n");    
   my  $n = $pratoms[$i2][2];
  my $p =$pratoms[$i2][3];
 my $a2= $pratoms[$i2][5];
    my $dist = sqrt( ($x2-$x1)**2 + ($y2-$y1)**2 + ($z2-$z1)**2 );
 print "Ligand Atom $m($a1) ($x1,$y1,$z1) to Protein Atom $n(aa - $a2)($x2,$y2,$z2) is $dist\n";# This is the result 
  if (  $dist < 4.0 && $dist > 2.2) {
 # print "Ligand Atom $m($a1) ($x1,$y1,$z1) to Protein Atom $n (aa - $a2)($x2,$y2,$z2) is $dist\n";}# This also is the result
  print "Ligand Atom $m($a1) to Protein Atom $n ($p, $a2) is $dist\n";
}
}
}
}

Could you plz help??? Thank you!!!

Vindhyaauri 0 Newbie Poster · Answer 6 · 2010-01-11T16:47:58+00:00

Thank u... I tried to do the following pgm after reading a little about subroutines. Please take a look at this:

use strict;
 
my ($x1,$x2,$y1,$y2,$z1,$z2);
my @pdbpr=();
my @pdblig=();
my @pratoms=();
my @pdbpro=();
my @ligatoms=();
my @pratoms=();
my @array =();
my @A=();
my @B=();
sub ligand;
sub protein;
sub calc;
my @BunchOfDataLines =();
my $matched = 0;
 
my $f     = "lig_file";
open( IN, $f ) or die "Can't open $f : $!";
while ( my $row = <IN> ) {
    chomp($row);
    if ($row =~ m/^\s{2}(\s|\-)/)
    {
        $matched = 1;
        push(@BunchOfDataLines, $row);
    }    
elsif ($matched) {
       my  @BunchofDataLines = DoSomethingWithBunchOfData(@BunchOfDataLines);
        $matched = 0;
    }
}
close(IN);

sub protein {
open(IN, "pro_file") or die "Could not open file!";
my @B = grep /^ATOM/, <IN>;
close (IN);
 
foreach my $ln(@B) {
my @pr = (split (/\s+/, $ln));
# print ("@pr\n"); 
my @arraypr = (split (/\s+/, $ln))[6, 7, 8];
 # print ("@arraypr\n");
push @pdbpr,[@arraypr] ;
push @pratoms, [@pr];
 }
return (\@pdbpr,\@pratoms);
}
  
sub DoSomethingWithBunchOfData {
    my @array = @_; #Expect one array passed as parameter
#    print "\n" . '@array contains ', scalar(@array), " lines\n";
#print "@array\n";
   foreach (@array){
ligand (@array);
#print "pdblig : @pdblig\n";
 
protein();
#print "pdbpr : @pdbpr\n";
calc (\@pdblig,\@ligatoms,\@pdbpr,\@pratoms);
}
@array =();
return @array;
}

sub ligand {
my @A = @_;
 foreach  my $rw(@A) {
my @lig = (split (/\s+/, $rw));
#print ("@lig \n");
my @arraylig = (split (/\s+/, $rw))[1, 2, 3];
#print ("@arraylig\n");
push @pdblig,[@arraylig];
push @ligatoms,[@lig];
}
return (\@pdblig, \@ligatoms);
}
 
 
sub calc {
for my $i1 ( 0 .. $#pdblig-1 ){
my ( $x1, $y1, $z1 ) = @{ $pdblig[$i1] };
print ("$x1,$y1,$z1\n");
my $m = $ligatoms[$i1][4];
my $a1= $ligatoms[$i1][14];
    for my $i2 ( 0 .. $#pdbpr-1 ){
    my ( $x2, $y2, $z2 ) = @{ $pdbpr[$i2] };
#print ("$x2,$y2,$z2\n");    
   my  $n = $pratoms[$i2][2];
  my $p =$pratoms[$i2][3];
 my $a2= $pratoms[$i2][5];
    my $dist = sqrt( ($x2-$x1)**2 + ($y2-$y1)**2 + ($z2-$z1)**2 );
 $dist= sprintf ("%.2f", $dist);
#  print "Ligand Atom $m($a1) ($x1,$y1,$z1) to Protein Atom $n(aa - $a2)($x2,$y2,$z2) is $dist\n";
  if (  $dist < 5.0 && $dist > 2.2) {
#  print "Ligand Atom $m($a1) ($x1,$y1,$z1) to Protein Atom $n (aa - $a2)($x2,$y2,$z2) is $dist\n";
  print "Ligand Atom $m($a1) to Protein Atom $n ($p, $a2) is $dist\n";
}
next;
} }
}

I have attached the files for lig_file.

Pro_file data is as follows:

ATOM      1  N   GLN A  42      24.630 -50.759  11.566  1.00102.15           N1+
ATOM      2  CA  GLN A  42      25.682 -50.321  12.534  1.00102.15           C
ATOM      3  C   GLN A  42      27.054 -50.296  11.844  1.00102.15           C
ATOM      4  O   GLN A  42      28.082 -50.027  12.477  1.00102.15           O
ATOM      5  CB  GLN A  42      25.713 -51.262  13.757  1.00 90.91           C
ATOM      6  CG  GLN A  42      26.623 -50.797  14.913  1.00 90.91           C
ATOM      7  CD  GLN A  42      26.514 -51.669  16.170  1.00 90.91           C
ATOM      8  OE1 GLN A  42      27.471 -51.795  16.941  1.00 90.91           O
ATOM      9  NE2 GLN A  42      25.346 -52.269  16.380  1.00 90.91           N
ATOM     10  N   PHE A  43      27.068 -50.574  10.543  1.00156.93           N
ATOM     11  CA  PHE A  43      28.307 -50.495   9.777  1.00156.93           C
ATOM     12  C   PHE A  43      28.193 -49.563   8.572  1.00156.93           C
ATOM     13  O   PHE A  43      27.179 -49.560   7.871  1.00156.93           O
ATOM     14  CB  PHE A  43      28.736 -51.887   9.306  1.00122.77           C
ATOM     15  CG  PHE A  43      30.141 -51.930   8.788  1.00122.77           C
ATOM     16  CD1 PHE A  43      31.208 -52.106   9.659  1.00122.77           C
ATOM     17  CD2 PHE A  43      30.402 -51.747   7.438  1.00122.77           C
ATOM     18  CE1 PHE A  43      32.510 -52.095   9.195  1.00122.77           C
ATOM     19  CE2 PHE A  43      31.702 -51.735   6.965  1.00122.77           C
ATOM     20  CZ  PHE A  43      32.758 -51.908   7.846  1.00122.77           C
ATOM     21  N   PRO A  44      29.243 -48.762   8.319  1.00111.66           N
ATOM     22  CA  PRO A  44      29.285 -47.843   7.176  1.00111.66           C
ATOM     23  C   PRO A  44      29.251 -48.562   5.823  1.00111.66           C
ATOM     24  O   PRO A  44      30.268 -48.682   5.138  1.00111.66           O
ATOM     25  CB  PRO A  44      30.582 -47.056   7.390  1.00152.22           C
ATOM     26  CG  PRO A  44      31.422 -47.926   8.265  1.00152.22           C
ATOM     27  CD  PRO A  44      30.463 -48.678   9.141  1.00152.22           C
ATOM     28  N   GLN A  45      28.066 -49.030   5.445  1.00147.89           N
ATOM     29  CA  GLN A  45      27.864 -49.677   4.153  1.00147.89           C
ATOM     30  C   GLN A  45      28.136 -48.728   2.988  1.00147.89           C

The calculation is as follows:
first read one set of ligand array
go to Sub DoSomethingWithBunchofDataLines and then to Sub calc. The values of ligand x,y,z and Protein x,y,z are calculated seperately in different subs. Now, once in Sub calc, it calculates the distances and also prints it... But the problem is it is not doing it for all the array ligands, only for the first one and that too enters an infinite loop... I dont understand what can be done...Pls help!!

d5e5 109 Master Poster · Answer 7 · 2010-01-11T22:37:14+00:00

OK, I have your data files and will test later today or tomorrow when I have time. So far I see one problem in your program. One of your subroutines is returning array references (that's good!) BUT you call this subroutine in a void context, meaning you call it without assigning the returned list to any variables.

sub ligand {
    my @A = @_;
    foreach my $rw (@A) {
        my @lig = ( split( /\s+/, $rw ) );

        #print ("@lig \n");
        my @arraylig = ( split( /\s+/, $rw ) )[ 1, 2, 3 ];

        #print ("@arraylig\n");
        push @pdblig,   [@arraylig];
        push @ligatoms, [@lig];
    }
    return ( \@pdblig, \@ligatoms ); #ligand(@array) is returning array references, good.
}

BUT the statement that calls the ligand sub doesn't save or use the list of array references that it returns.

foreach (@array) {
        ##ligand(@array) is returning array references
        ##The following should save or use these references
        ligand(@array);#This statement calls the sub, but doesn't save the refs
        ##Better to have my ($aref1, $aref2) = ligand(@array);
        ##Then dereference the array refs to use arrays.

Here is an example of a program that calls a subroutine that returns a list of array refs. The calling program then saves the array refs, and dereferences them to use the arrays to which they refer.

#!/usr/bin/perl -w
use strict;
### The following line calls the subroutine AND saves the references.
my @arrayofrefs = colorsandtastes(); #Call sub which returns list of array refs.
my $colorsref = $arrayofrefs[0]; #Save first array ref from returned list.
my $tastesref = $arrayofrefs[1]; #Save second array ref from returned list.

#Next, dereference the array refs to get arrays
my @colors = @{$colorsref};
my @tastes = @{$tastesref};

print "The colors array contains:\n";
print @colors, "\n";
print "The tastes array contains:\n";
print @tastes, "\n";

sub colorsandtastes {
    #This subroutine creates two arrays and returns a list of array references
    my @arr1 = q/red yellow blue green violet black white/;
    my @arr2 = q/salty sweet sour bitter/;
    return (\@arr1, \@arr2); #Prefixing array variable with '\' creates array reference
}

I'll test your program and get back to you soon.

d5e5 109 Master Poster · Answer 8 · 2010-01-12T03:33:34+00:00

I started testing your program but so far there are more problems than I can fix today. One problem is that the protein subroutine is reading all 30 lines of your Pro_file data over and over again because it is called from a loop in the DoSomethingWithBunchOfData subroutine. This is probably not the problem that causes your program to go in an infinite loop. I haven't found the reason for that because I'm finding other problems like the protein sub.

sub DoSomethingWithBunchOfData {
    my @array = @_;    #Expect one array passed as parameter
    foreach (@array) { #This loop repeats for every member of @array
.....
.....
        protein(); #Calling this subroutine repeatedly for every member of @array

        #print "pdbpr : @pdbpr\n";
        calc( \@pdblig, \@ligatoms, \@pdbpr, \@pratoms );
    }
    @array = ();
    return @array;
}

sub protein {
    open( IN, "pro_file" ) or die "Could not open file!";
    my @B = grep /^ATOM/, <IN>;#Reads all 30 lines of data into @B
    close(IN);

    foreach my $ln (@B) {
        my @pr = ( split( /\s+/, $ln ) );

        # print ("@pr\n");
        my @arraypr = ( split( /\s+/, $ln ) )[ 6, 7, 8 ];

        # print ("@arraypr\n");
        push @pdbpr,   [@arraypr];
        push @pratoms, [@pr];
    }
    return ( \@pdbpr, \@pratoms );
}

There are more problems that I don't have time to list now, such as that the protein sub is returning array references that are not being saved or used by the statement that calls the sub. The program still needs more work.

d5e5 109 Master Poster · Answer 9 · 2010-01-13T07:55:33+00:00

The following program attempts to calculate for each line of ligand data the distance to each of the protein atoms. It prints 2735 lines to the screen, so you may have to redirect the output to a file if you want to examine it all. Please try running it and let me know if it does at least some of what you want to do.

#!/usr/bin/perl -w
use strict;
my @BunchOfDataLines = ();
my $matched          = 0;
my $count;
#Read entire protein file into array
my $f = "pro_file";
open( IN, $f ) or die "Can't open $f : $!";
my @proteins = grep /^ATOM/, <IN>;#Reads all 30 lines of data into @B
close(IN);
    
#Read and process ligand file
$f = "lig_file";
open( IN, $f ) or die "Can't open $f : $!";
while ( my $row = <IN> ) {
    chomp($row);
    if ( $row =~ m/^\s{2}(\s|\-)\d\d\.\d\d\d\d/ ) {
        $matched = 1;
        push( @BunchOfDataLines, $row );
    }
    elsif ($matched) {
        my @BunchofDataLines = DoSomethingWithBunchOfData(@BunchOfDataLines);
        $matched = 0;
    }
}
close(IN);

sub DoSomethingWithBunchOfData {
    my @ligands = @_;    #Expect one array passed as parameter
    $count++;
    print "\nProcessing bunch number $count of data lines...\n";
    foreach my $ligand (@ligands) { #Each member of @ligands is a line of ligand data.
        my @ligfields = split (/\s+/, $ligand);
        my $ligid = "$ligfields[4]($ligfields[14])";
        my ($ligx, $ligy, $ligz) = @ligfields[1, 2, 3];
        foreach my $protein (@proteins) {
            my @profields = split (/\s+/, $protein);
            my $proid = "$profields[2]($profields[3], $profields[5])";
            my ($prox, $proy, $proz) = @profields[6, 7, 8];
            my $distance = CalculateDistance($ligx, $ligy, $ligz, $prox, $proy, $proz);
            print "Ligand atom $ligid to protein atom $proid is $distance\n";           
        }
    }
    return (); #Return empty list to make @BunchOfDataLines empty
}

sub CalculateDistance {
    my ($ligandx, $ligandy, $ligandz, $proteinx, $proteiny, $proteinz) = @_;
    my $dist = sqrt(
                    ( $proteinx - $ligandx )**2
                    + ( $proteiny - $ligandy )**2
                    + ( $proteinz - $ligandz )**2
                    );
            $dist = sprintf( "%.2f", $dist );
            return $dist;
}

Vindhyaauri 0 Newbie Poster · Answer 10 · 2010-01-13T16:54:38+00:00

Yes ... It works superbly... Thanks d5e5... I really have learnt a lot of perl from u ..... Thanks again for all the help!!

There is one small glitch... While outputting the result into the file, it is getting appended and lig 1, 2, 3, 4 data are not seperated properly...

use strict;
my @DataBunch = ();
my $matched   = 0;
my $count;
#Read entire protein file into array
my $f = "/home/pro_file";
open( IN, $f ) or die "Can't open $f";
my @proteins = grep /^ATOM/, <IN>;
#print "@proteins\n";
close(IN);
 
#Read and process ligand file
$f = "/home/lig_file";
open( IN, $f ) or die "Can't open $f";
while ( my $row = <IN> ) {
    chomp($row);
     if ($row =~ m/^\s{2}(\s|\-)/){
        $matched = 1;
        push( @DataBunch, $row );
}
    elsif ($matched) {
     my @DataBunch = LigArray(@DataBunch);
     $matched = 0;
    }
}
close(IN);
 
 
sub LigArray {
   my @ligands = @_;    #Expect one array passed as parameter
  $count++;
open(OUT,">>/home/result/file") || die "Cannot open file\n";
    print OUT "\nProcessing ligand no $count of ligand array...\n";
    foreach my $ligand (@ligands) {
        my @ligfields = split (/\s+/, $ligand);
        my $ligid = "$ligfields[4]($ligfields[14])";
        my ($ligx, $ligy, $ligz) = @ligfields[1, 2, 3];
        foreach my $protein (@proteins) {
            my @profields = split (/\s+/, $protein);
            my $proid = "$profields[2]($profields[3], $profields[5])";
            my ($prox, $proy, $proz) = @profields[6, 7, 8];
            my $distance = CalculateDistance($ligx, $ligy, $ligz, $prox, $proy, $proz);
print OUT "Ligand atom $ligid ($ligx, $ligy, $ligz) to protein atom $proid ($prox, $proy, $proz)is $distance\n";
          }
    }
close (OUT);
    return ();
#close (OUT);
}
 
sub CalculateDistance {
    my ($ligandx, $ligandy, $ligandz, $proteinx, $proteiny, $proteinz) = @_;
    my $dist = sqrt(
                    ( $proteinx - $ligandx )**2
                    + ( $proteiny - $ligandy )**2
                    + ( $proteinz - $ligandz )**2
                    );
            $dist = sprintf( "%.2f", $dist );
            return $dist;
}

Ligand 2 output is appending even Ligand 1 output with it which is not required...how would i go about it? the same lig_file and pro_file data can be used..

d5e5 109 Master Poster · Answer 11 · 2010-01-13T22:43:27+00:00

I think the problem is the match that seemed to work OK with our previous tests is not specific enough for the current lig_file, which has a couple of blank lines after the second bunch of ligand data. The following code: if ($row =~ m/^\s{2}(\s|\-)/){ says that a line has good ligand data if it begins with two spaces followed by either another space or a minus sign. This matches good data BUT it also matches lines that have nothing but two or more spaces and no data. Since we don't process the contents of @DataBunch until we read a line that DOESN'T match, we come to the end of the file and quit the program without having processed the second bunch of ligand data that is saved in @DataBunch . To try and fix this we can do two things:

Improve the pattern in our match condition so it matches good data but does not match lines containing nothing but spaces.
After reaching the end of our file, check if there remains any unprocessed data in @DataBunch, and if so, process it.

I would modify the match condition as follows to make it more specific for matching a data line that begins with spaces followed by an optional minus sign, followed by a number consisting of digits and a decimal point. That describes the first number in the good data lines and should be specific enough to distinguish from other lines. So in your program try replacing if ($row =~ m/^\s{2}(\s|\-)/){ with if ( $row =~ m/^\s{2}(\s|\-)[B][\.\d]+[/B]/) { Then, after the ligand file has been read and closed I would test the size of @DataBunch and if there are any unprocessed data, process them. After the changes, the main logic of the program, would like this:

#!/usr/bin/perl -w
use strict;
my @DataBunch = ();
my $matched   = 0;
my $count;
#Read entire protein file into array
#my $f = "/home/pro_file";
my $f = "pro_file";
open( IN, $f ) or die "Can't open $f";
my @proteins = grep /^ATOM/, <IN>;
#print "@proteins\n";
close(IN);
 
#Read and process ligand file
#$f = "/home/lig_file";
$f = "lig_file";
open( IN, $f ) or die "Can't open $f";
while ( my $row = <IN> ) {
    chomp($row);
#     if ($row =~ m/^\s{2}(\s|\-)/){
    if ( $row =~ m/^\s{2}(\s|\-)[\.\d]+/) {
        $matched = 1;
        push( @DataBunch, $row );
}
    elsif ($matched) {
#     my @DataBunch = LigArray(@DataBunch); @DataBunch has already been declared
        @DataBunch = LigArray(@DataBunch);
        $matched = 0;
    }
}
close(IN);

#Test if there remains any unprocessed data in @DataBunch
if ($#DataBunch > 0) {
    @DataBunch = LigArray(@DataBunch);
}
#....

(The above doesn't include the code for the subroutines, which I didn't change.)

Vindhyaauri 0 Newbie Poster · Answer 12 · 2010-01-25T12:35:28+00:00

Hi d5e5,

Thanka a ton... it worked finally!! Thanks again...

Reading complex data.......ufff

Recommended Answers Collapse Answers

All 17 Replies

Recommended Answers