Data consolidation across an array a.k.a. Hashed to death... heeelp...

Thread Solved

BioTeq
#1
May 23rd, 2007
First of all, hello!
I've been reading the forums for quite a while now, but this time I need real help.

To summarize: I'm building a parser that has to consolidate data based on values held in an array.
The source file contains tab-separated values, which are parsed into an array whose rows have the form
pdbID | resNum | resID | secstructID. These rows are then consolidated into a file that should contain:
pdbID | startRes | endRes | secstructID
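
A minimal sketch of pulling those four fields out of one tab-separated line. The column order (pdbID, resNo, resID, phi, psi, chi1, chi2, secStruct, activesite) is assumed from the split() in the script further down; parse_residue_line() is a hypothetical helper name, and the sample line uses dummy placeholder values for the columns that are not shown in this post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: keep only the four columns of interest.
# Assumed column order: pdbID, resNo, resID, phi, psi, chi1, chi2,
# secStruct, activesite.
sub parse_residue_line {
    my ($line) = @_;
    $line =~ s/[\r\n]+\z//;             # drop any trailing line ending
    my @f = split /\t/, $line, -1;      # -1 keeps trailing empty fields
    return [ @f[ 0, 1, 2, 7 ] ];        # pdbID, resNum, resID, secstructID
}

# Placeholder line: the angle and active-site columns are dummy zeros.
my $line = join "\t", '1b6g', 4, 'N', 0, 0, 0, 0, 'H', 0;
my $rec  = parse_residue_line($line);
print join( ' ', @$rec ), "\n";
```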

After parsing a file, the source array holds the data to consolidate:
1b6g 1 M \N
1b6g 2 V \N
1b6g 3 N \N
1b6g 4 N H
1b6g 5 N H
1b6g 6 N \N
3hba 7 W H
2cdg 8 N H
2cdg 9 V \N
2cdg 10 M \N
2cdg 11 A B
2cdg 12 M \N

The expected result after consolidation is:
1b6g 1 3 \N
1b6g 4 5 H
1b6g 6 6 \N
3hba 7 7 H
2cdg 8 8 H
2cdg 9 10 \N
2cdg 11 11 B
2cdg 12 12 \N

As you can see, each pdbID is assigned a secStructID sequentially, and any interruption in the secStructID should be treated as a point where the assignment restarts.
Each pdbID can therefore have multiple occurrences of, for example, \N in different places in the sequence; they are told apart by their startRes and endRes values, which are derived from resNum.
All is wonderful and I have working code that consolidates the data. Unfortunately, it doesn't treat the appearance of a new secstructID as the end of the previous run; instead it finds the last occurrence in the whole sequence for one pdbID and uses that as the end.

As a result, my output is incorrect:
1b6g 4 5 H
1b6g 1 6 \N ---- error here: this should in fact be two separate "entities", because 4 and 5 do not belong to \N
3hba 7 7 H
2cdg 8 8 H
2cdg 9 12 \N ---- same here (11 belongs to B, so this should be split in two)
2cdg 11 11 B
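
The collapse can be reproduced in isolation. The following is a minimal sketch, assuming only the 1b6g rows from the sample data above: it groups residue numbers by (pdbID, secStructID), as the %dane hash in the script does, and then takes the first and last element of each list.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Rows for one protein from the sample data: [pdbID, resNum, secStructID].
my @rows = (
    [ '1b6g', 1, '\N' ], [ '1b6g', 2, '\N' ], [ '1b6g', 3, '\N' ],
    [ '1b6g', 4, 'H'  ], [ '1b6g', 5, 'H'  ], [ '1b6g', 6, '\N' ],
);

# Grouping by (pdbID, secStructID) puts every \N residue into one list,
# no matter where it sits in the sequence: (1, 2, 3, 6).
my %dane;
push @{ $dane{ $_->[0] }{ $_->[2] } }, $_->[1] for @rows;

# Taking the first and last element then spans the H run in between.
my @summary;
for my $sec ( sort keys %{ $dane{'1b6g'} } ) {
    my $nums = $dane{'1b6g'}{$sec};
    push @summary, "1b6g $nums->[0] $nums->[-1] $sec";
}
print "$_\n" for @summary;
```

The hash key carries no information about position, so two separate \N runs cannot be told apart once they share a key.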

And here's my code:

#!/usr/bin/perl -w
use strict;
use warnings;

# --------------------------------------------------------------
# This script uses the residue.txt file generated by
# resTabmakerBatch.pl and creates a new file called
# secStructList.txt
# each protein is described by secondary structures with a
# pdbID, 2ry structureID (char or '\N'), startResidue, endResidue
# Input: residue.txt (this file is the output of resTabmakerBatch.pl)
# Output: secStructList.txt
# usage: secStructList.txt to populate the SecStructure entity
# --------------------------------------------------------------

#Read arguments, print error message if insufficient
if ($#ARGV<0)
{
    die("\n\nUsage: sstruct.pl [residue_table_file.txt]\n\n");
}

my $filename = $ARGV[0];

#if either file not found return error message
if (! -e "$filename")
{
    die("\n\nresidue file $filename does not exist!\n\n");
}

# Read residue.txt file, extracting the data of interest - only
# pdb id, resNum, resID, secondaryStructID

#First read file, storing each line in an array 'dssplines' splitting the data
open (MYFILE,"$filename") or die ("\nERROR: Can't open $filename\n");
my @dssplines= split(/\r/, <MYFILE>);
my $arraySize=@dssplines;
close(MYFILE);
#read one line from the originally loaded array dssplines at a time and loop
#over it splitting the values using the tabs
my @dsspdata;
my $dsspdataSize=@dssplines;

my $n=0;
for (my $i=0; $i < $arraySize; $i++)
{
    #each line from the array goes into a new dsspline variable
    my $dsspline = $dssplines[$i];
    for (my $j = 0; $j <=4; $j++)
    {
        #each time values inside are separated using the tabs
        my ($pdbID, $resNo, $resID, $phi, $psi, $chi1, $chi2, $secStruct,
            $activesite) = split(/\t/, $dsspline);
        # now each value of interest is stored into a new array @dsspdata
        $dsspdata[$n][0] = $pdbID;
        $dsspdata[$n][1] = $resNo;
        $dsspdata[$n][2] = $resID;
        $dsspdata[$n][3] = $secStruct;
    }
    $n++;
}

#my @dsspdata array is now perfect to reformat into a hash analyzing the value correlation

#initialize the hash and counter
my %dane;
my $k=0;
#loop around the dsspdata array
for (my $i=0; $i < $dsspdataSize; $i++)
{
    #split each cell in a row into variables for the hash
    for (my $k = 0; $k <=4; $k++)
    {
        my $pdb = $dsspdata[$i][0];
        my $residueNum = $dsspdata[$i][1];
        my $secStructure = $dsspdata[$i][3];
        push @{ $dane{$pdb}->{$secStructure} }, $residueNum;
    }
    $k++;
}

#now for each pdbID using the hash keys
foreach my $pdbID ( keys %dane )
{
    #check the secondary structure id with pdbID as a key (only if the pdbID is the same will the values be stored)
    foreach my $secID ( keys %{ $dane{$pdbID} } )
    {
        #finally create an array of residue numbers
        my @resnums = ( $dane{$pdbID}->{$secID}->[0],
            $dane{$pdbID}->{$secID}->[-1] );
        #create a new file with the secondary structures list
        open (SStruc, ">>secStructList.txt") || die "Can't open file: $!";
        #append each line to the new file with tab separated data
        print SStruc ("$pdbID \t @resnums \t $secID\n");
    }
}
close(SStruc);

I have attached the source file (it has already been processed by another script, which wasn't nearly as complicated as this issue).
Run the program with residue.txt as its argument.

If anyone has an idea how to deal with this, I would be very grateful for suggestions. As you can see from my code, I am slightly Java-twisted.

Cheers,
Matt
Attached Files
File Type: txt residue.txt (70.9 KB, 5 views)

KevinADC
#2
May 23rd, 2007
Is this school work? Do you have this question posted on other Perl forums? What's the purpose of including the resID if it's not being used in the output?
Last edited by KevinADC; May 23rd, 2007 at 9:39 pm.

BioTeq
#3
May 23rd, 2007
Terribly helpful of you, but to ease your curiosity: it's not homework. I'm trying to rewrite a parser that I wrote in Java some time ago (originally for dssp files, if that tells you anything). This particular script is part of a larger set of programs used to populate a database, and residue.txt is the output of a different script (in the final version it will be reintegrated into a batch without processing the additional file). It's not posted on other web forums; it is posted on usenet, though. It's not finished, otherwise I wouldn't be posting questions about it. resID might turn out to be useful to me, as I am considering storing a whole sequence in a separate attribute of a DB entity; I haven't decided yet, as I am still modifying the DB schema.
Now, instead of answering a question with a question, do you think you could give me a hint on dealing with the hash for this rather unusual case? I would be really grateful. I'm trying to learn Perl by writing it, but with hashes I seem to be stumbling in the dark; I'm more of a Java person, I think.

BioTeq
#4
May 23rd, 2007
I've slightly simplified the code after a couple of suggestions that I got elsewhere and now it should be easier to follow:
#!/usr/bin/perl -w
use strict;
use warnings;
if ($#ARGV<0)
{ die("\n\nUsage: sstruct.pl [residue_table_file.txt]\n\n"); }
my $filename = $ARGV[0];
if (! -e "$filename")
{ die("\n\nresidue file $filename does not exist!\n\n"); }
open (MYFILE,"$filename") or die ("\nERROR: Can't open $filename\n");
my @dssplines= split(/\r/, <MYFILE>);
my $arraySize=@dssplines;
close(MYFILE);

my @dsspdata;
my $dsspdataSize=@dssplines;
for (my $i=0; $i < $arraySize; $i++)
{
    my $dsspline = $dssplines[$i];
    my ($pdbID, $resNo, $resID, $phi, $psi, $chi1, $chi2, $secStruct, $activesite) = split(/\t/, $dsspline);
    push( @dsspdata, [$pdbID,$resNo,$resID,$secStruct] );
}

my %dane;
my $k=0;
for (my $i=0; $i < $dsspdataSize; $i++)
{
    for (my $k = 0; $k <=4; $k++)
    {
        my $pdb = $dsspdata[$i][0];
        my $residueNum = $dsspdata[$i][1];
        my $secStructure = $dsspdata[$i][3];
        push @{ $dane{$pdb}->{$secStructure} }, $residueNum;
    }
    $k++;
}

foreach my $pdbID ( keys %dane )
{
    foreach my $secID ( keys %{ $dane{$pdbID} } )
    {
        my @resnums = ( $dane{$pdbID}->{$secID}->[0], $dane{$pdbID}->{$secID}->[-1] );
        open (SStruc, ">>secStructList.txt") || die "Can't open file: $!";
        print SStruc ("$pdbID \t @resnums \t $secID\n");
    }
}
close(SStruc);
Last edited by BioTeq; May 23rd, 2007 at 10:05 pm.

BioTeq
#5
May 23rd, 2007
Seems that I will be solving this problem using a response from a usenet user.

my ( $x, $y, undef, $z ) = split ' ', <DATA>;
my ($last_x,$last_z,$min,$max)=($x,$z,$y,$y);

while (<DATA>) {
    my ( $x, $y, undef, $z ) = split;
    if ($x ne $last_x or $z ne $last_z) {
        print "$last_x $min $max $last_z\n";
        ($last_x,$last_z,$min,$max)=($x,$z,$y,$y);
    };
    $max=$y;
}
print "$last_x $min $max $last_z\n";

It works flawlessly; however, it's not the way I want it solved, so I will still try to get it working with the hashes.
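
One way a hash could still be used is to keep, for each pdbID, an ordered list of run records instead of one list per secStructID. This is only a sketch under assumptions: consolidate() is a hypothetical helper that is not part of the script above, rows are taken as [pdbID, resNum, secStructID] in file order, and the rows for each pdbID are assumed to be contiguous.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns one "pdbID start end secStructID" string
# per run of consecutive rows sharing the same pdbID and secStructID.
sub consolidate {
    my @rows = @_;
    my ( %runs, @order );    # %runs: pdbID => array of run records

    for my $row (@rows) {
        my ( $pdb, $res, $sec ) = @$row;
        push @order, $pdb unless exists $runs{$pdb};   # remember first-seen order
        my $list = $runs{$pdb} ||= [];
        my $last = $list->[-1];
        if ( $last && $last->{sec} eq $sec ) {
            $last->{end} = $res;                       # same structure: extend the run
        }
        else {
            push @$list, { sec => $sec, start => $res, end => $res };
        }
    }

    my @out;
    for my $pdb (@order) {
        for my $run ( @{ $runs{$pdb} } ) {
            push @out, "$pdb $run->{start} $run->{end} $run->{sec}";
        }
    }
    return @out;
}

# Sample data from the first post, without the resID column.
my @rows = map { [ split ' ' ] } (
    '1b6g 1 \N', '1b6g 2 \N', '1b6g 3 \N', '1b6g 4 H',
    '1b6g 5 H',  '1b6g 6 \N', '3hba 7 H',  '2cdg 8 H',
    '2cdg 9 \N', '2cdg 10 \N', '2cdg 11 B', '2cdg 12 \N',
);
my @lines = consolidate(@rows);
print "$_\n" for @lines;
```

Because each run is its own record, a second \N run for the same pdbID gets a new entry rather than being merged into the first. The separate @order array is there because hash keys come back in no particular order.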
Last edited by BioTeq; May 23rd, 2007 at 11:10 pm.

KevinADC
#6
May 24th, 2007
I'm going to pass on helping you.