First of all, hello!
I've been reading the forums for quite a while now, but this time I need real help.

To summarize: I'm building a parser that has to consolidate data based on values held in an array.
The source file contains tab-separated values, which are parsed into an array of rows containing
pdbID | resNum | resID | secstructID. These rows are then consolidated into a file that should contain:
pdbID | startRes | endRes | secstructID

After parsing a file, the source array holds the data to consolidate:
1b6g 1 M \N
1b6g 2 V \N
1b6g 3 N \N
1b6g 4 N H
1b6g 5 N H
1b6g 6 N \N
3hba 7 W H
2cdg 8 N H
2cdg 9 V \N
2cdg 10 M \N
2cdg 11 A B
2cdg 12 M \N

The expected result after consolidation is:
1b6g 1 3 \N
1b6g 4 5 H
1b6g 6 6 \N
3hba 7 7 H
2cdg 8 8 H
2cdg 9 10 \N
2cdg 11 11 B
2cdg 12 12 \N

As you can see, each pdbID is assigned a secStructID sequentially, and any interruption in the secStructID should be treated as a point where the assignment restarts.
Each pdbID can thus have multiple occurrences of, for example, \N in different places in the sequence; they are distinguished by the startRes and endRes values, which are derived from the resNum.
I have working code that consolidates the data. Unfortunately, it doesn't recognize the occurrence of a new secstructID as the end of the previous one; instead it finds the last occurrence in the whole sequence for a pdbID and treats that as the end.

So my result incorrectly comes out as:
1b6g 4 5 H
1b6g 1 6 \N ---- error here - this should be in fact two separate "entities" because 4 and 5 do not belong to \N
3hba 7 7 H
2cdg 8 8 H
2cdg 9 12 \N ---- same here (11 belongs to B, so it should break this into two)
2cdg 11 11 B
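
The merging effect can be reproduced in isolation. The snippet below is a minimal sketch, not part of the original script: it uses only the 1b6g rows from the sample and a hypothetical %by_sec hash that mimics what the per-pdbID level of %dane does. Once residue numbers are grouped by secstructID alone, the information about where a run was interrupted is gone, so taking the first and last element spans the gap.

```perl
use strict;
use warnings;

# Rows for one pdbID from the sample above: [resNum, secstructID]
my @rows = ( [ 1, '\N' ], [ 2, '\N' ], [ 3, '\N' ], [ 4, 'H' ], [ 5, 'H' ], [ 6, '\N' ] );

# Group residue numbers by secstructID only (hypothetical stand-in for %dane)
my %by_sec;
push @{ $by_sec{ $_->[1] } }, $_->[0] for @rows;

# The \N list is 1,2,3,6: first and last yield "1 6", hiding the H run at 4-5
my @n = @{ $by_sec{'\N'} };
print "first=$n[0] last=$n[-1] all=@n\n";
```

Any fix therefore has to compare each row with the one before it, rather than rely on the hash key alone.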

And here's my code:

#!/usr/bin/perl -w
use strict;
use warnings;

#   --------------------------------------------------------------
# This script uses the residue.txt file generated by
# resTabmakerBatch.pl and creates a new file called
# SecStructList.txt
# each protein is described by secondary structures with a
# pdbID, secondary structure ID (a character or '\N'), startResidue, endResidue
# Input: residue.txt (this file is the output of resTabmakerBatch.pl)
# Output: secStructList.txt
# usage: secStructList.txt to populate the SecStructure entity
#   --------------------------------------------------------------

#Read arguments, print error message if insufficient
if ($#ARGV<0)
{
 die("\n\nUsage:  sstruct.pl [residue_table_file.txt]\n\n");
}

my $filename = $ARGV[0];

#if the file is not found, return an error message
if (! -e "$filename")
{
 die("\n\nresidue file $filename does not exist!\n\n");
}

 # Read residue.txt file, extracting the data of interest - only
 # pdb id, resNum, resID, secondaryStructID

#First read file, storing each line in an array 'dssplines' splitting the
#data
open (MYFILE,"$filename") or die ("\nERROR: Can't open $filename\n");
 my @dssplines= split(/\r/, <MYFILE>);
 my $arraySize=@dssplines;
close(MYFILE);
#read one line from the originally loaded array dssplines at a time and loop
#over it splitting the values using the tabs
 my @dsspdata;
 my $dsspdataSize=@dssplines;

 my $n=0;
 for (my $i=0; $i < $arraySize; $i++)
 {
  #each line from the array goes into a new dsspline variable
  my $dsspline = $dssplines[$i];
  for (my $j = 0; $j <=4; $j++)
  {
   #each time values inside are separated using the tabs
   my ($pdbID, $resNo, $resID, $phi, $psi, $chi1, $chi2, $secStruct,
$activesite) = split(/\t/, $dsspline);
   # now each value of interest is stored into a new array @dsspdata
   $dsspdata[$n][0] = $pdbID;
   $dsspdata[$n][1] = $resNo;
   $dsspdata[$n][2] = $resID;
   $dsspdata[$n][3] = $secStruct;
  }
  $n++;
 }

#my @dsspdata array is now perfect to reformat into a hash analyzing the
#value correlation

 #initialize the hash and counter
 my %dane;
 my $k=0;
 #loop around the dsspdata array
 for (my $i=0; $i < $dsspdataSize; $i++)
 {
  #split each cell in a row into variables for the hash
  for (my $k = 0; $k <=4; $k++)
  {
   my $pdb = $dsspdata[$i][0];
   my $residueNum = $dsspdata[$i][1];
   my $secStructure = $dsspdata[$i][3];
   push @{ $dane{$pdb}->{$secStructure} }, $residueNum;
  }
  $k++;
 }

 #now for each pdbID using the hash keys
 foreach my $pdbID ( keys %dane )
 {
  #check the secondary structure id with pdbID as a key (only if the pdbID
  #is the same will the values be stored)
  foreach my $secID ( keys %{ $dane{$pdbID} } )
  {
   #finally create an array of residue numbers
   my @resnums = ( $dane{$pdbID}->{$secID}->[0],
$dane{$pdbID}->{$secID}->[-1] );
   #create a new file with the secondary structures list
   open (SStruc, ">>secStructList.txt") || die "Can't open file: $!";
   #append each line to the new file with tab separated data
   print SStruc ("$pdbID \t @resnums \t $secID\n");
  }
 }
close(SStruc);

I have attached the source file (it is already the output of another script, which wasn't nearly as complicated as this issue ;))
Run the program with residue.txt as its argument.

If anyone has an idea how to deal with this, I would be very grateful for suggestions. As you can see from my code, I am slightly java-twisted.

Cheers,
Matt

All 5 Replies

Is this school work? Do you have this question posted on other Perl forums? What's the purpose of including the resID if it's not being used in the output?

Terribly helpful of you ;) but to ease your curiosity: it's not homework. I'm trying to rewrite a parser that I wrote in Java some time ago (originally for dssp files, if that tells you anything). This particular script is part of a larger set of programs used to populate a database, and residue.txt is the output of a different script (in the final version it will be reintegrated into a batch without processing the additional file). It's not posted on other web forums; it is posted on usenet, though. It's not finished, otherwise I wouldn't be posting questions about it. resID might turn out to be useful to me, as I am considering storing a whole sequence in a separate attribute of a DB entity; I haven't decided yet, as I am still modifying the DB schema.
Now, instead of answering a question with a question, do you think you could give me a hint on dealing with the hash for this rather unusual case :) ? I would be really grateful. I'm trying to learn Perl by writing it, but with hashes I seem to be stumbling in the dark; I'm more of a Java person, I think :).

I've slightly simplified the code after a couple of suggestions that I got elsewhere and now it should be easier to follow:

#!/usr/bin/perl -w
use strict;
use warnings;
if ($#ARGV<0)
{ die("\n\nUsage:  sstruct.pl [residue_table_file.txt]\n\n"); }
my $filename = $ARGV[0];
if (! -e "$filename")    
{ die("\n\nresidue file $filename does not exist!\n\n"); }
open (MYFILE,"$filename") or die ("\nERROR: Can't open $filename\n");
    my @dssplines= split(/\r/, <MYFILE>);
    my $arraySize=@dssplines;
close(MYFILE);

my @dsspdata;
my $dsspdataSize=@dssplines;
    for (my $i=0; $i < $arraySize; $i++)
    {
      my $dsspline = $dssplines[$i];
      my ($pdbID, $resNo, $resID, $phi, $psi, $chi1, $chi2, $secStruct, $activesite) = split(/\t/, $dsspline);
      push( @dsspdata, [$pdbID,$resNo,$resID,$secStruct] );
    }

my %dane;
my $k=0;
for (my $i=0; $i < $dsspdataSize; $i++)
{
    for (my $k = 0; $k <=4; $k++)
    {
        my $pdb = $dsspdata[$i][0];
        my $residueNum = $dsspdata[$i][1];
        my $secStructure = $dsspdata[$i][3];
        push @{ $dane{$pdb}->{$secStructure} }, $residueNum;
    }
    $k++;
}
    
foreach my $pdbID ( keys %dane )
{
    foreach my $secID ( keys %{ $dane{$pdbID} } )
    {
        my @resnums = ( $dane{$pdbID}->{$secID}->[0], $dane{$pdbID}->{$secID}->[-1] );
        open (SStruc, ">>secStructList.txt") || die "Can't open file: $!";
        print SStruc ("$pdbID \t @resnums \t $secID\n");
    }
}
close(SStruc);
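
A side note on the reading step in the listing above, separate from the grouping problem: split evaluates its second argument in scalar context, so split(/\r/, <MYFILE>) reads only one record from the file. That works when the whole file uses bare \r line endings, but it would stop after the first line of a file with \n or \r\n endings. A minimal sketch of a more tolerant reader follows; parse_rows is a hypothetical helper name, and it assumes the nine-column layout used in the script (pdbID, resNo, resID, phi, psi, chi1, chi2, secStruct, activesite).

```perl
use strict;
use warnings;

# Hypothetical helper: split raw text into rows regardless of line-ending
# style (\r, \n, or \r\n) and keep only the four columns of interest.
sub parse_rows {
    my ($text) = @_;
    my @rows;
    for my $line ( split /\r\n?|\n/, $text ) {
        next unless length $line;          # skip blank lines
        my @f = split /\t/, $line;
        push @rows, [ @f[ 0, 1, 2, 7 ] ];  # pdbID, resNo, resID, secStruct
    }
    return @rows;
}
```

To use it, the file would be slurped into one string first, for example with a lexical filehandle and "local $/;" before reading, and the result passed to parse_rows.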

It seems I will be solving this problem using a response from a usenet user:

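# Expects the rows after a __DATA__ marker at the end of the script.
# Seed the current run from the first row: x = pdbID, y = resNum, z = secstructID
my ( $x, $y, undef, $z ) = split ' ', <DATA>;
my ( $last_x, $last_z, $min, $max ) = ( $x, $z, $y, $y );

while (<DATA>) {
    my ( $x, $y, undef, $z ) = split;
    # A change in pdbID or secstructID closes the current run
    if ( $x ne $last_x or $z ne $last_z ) {
        print "$last_x $min $max $last_z\n";
        ( $last_x, $last_z, $min, $max ) = ( $x, $z, $y, $y );
    }
    $max = $y;    # extend the current run to this residue
}
# Flush the final run
print "$last_x $min $max $last_z\n";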

It works flawlessly; however, it's not the way I want it solved, as I will still try to get it working with hashes.
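
For anyone who wants to stay with hashes, one possible direction is a hash of arrays of runs, keyed by pdbID, where each run is a [start, end, secstructID] triple. The sketch below is only an illustration under stated assumptions: consolidate, %runs and @order are hypothetical names, rows are assumed to arrive as [pdbID, resNum, resID, secstructID] in file order, and output is grouped by pdbID in order of first appearance.

```perl
use strict;
use warnings;

# Hypothetical helper: collapse rows of [pdbID, resNum, resID, secstructID]
# into runs of consecutive rows sharing the same pdbID and secstructID.
sub consolidate {
    my (@rows) = @_;
    my ( %runs, @order );    # %runs: pdbID => [ [start, end, secID], ... ]
    my $prev_pdb;
    for my $row (@rows) {
        my ( $pdb, $res, undef, $sec ) = @$row;
        push @order, $pdb unless exists $runs{$pdb};
        my $last = $runs{$pdb} ? $runs{$pdb}[-1] : undef;
        if ( $last && defined $prev_pdb && $prev_pdb eq $pdb && $last->[2] eq $sec ) {
            $last->[1] = $res;                            # extend the open run
        }
        else {
            push @{ $runs{$pdb} }, [ $res, $res, $sec ];  # start a new run
        }
        $prev_pdb = $pdb;
    }
    # Flatten to "pdbID start end secID" lines, pdbIDs in first-seen order
    return map {
        my $p = $_;
        map { "$p $_->[0] $_->[1] $_->[2]" } @{ $runs{$p} }
    } @order;
}

# Usage with a few rows from the sample data
my @sample = (
    [qw(1b6g 3 N \N)], [qw(1b6g 4 N H)], [qw(1b6g 5 N H)], [qw(1b6g 6 N \N)],
    [qw(3hba 7 W H)],
);
print "$_\n" for consolidate(@sample);
```

The key difference from %dane is that the secstructID is stored inside each run instead of being used as a hash key, so two separate \N stretches for the same pdbID stay separate.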

I'm going to pass on helping you.
