Data consolidation across an array a.k.a. Hashed to death... heeelp...
Thread Solved |
Join Date: May 2007
Posts: 7
Reputation:
Solved Threads: 0
First of all, hello!
I've been reading the forums for quite a while now, but this time I need real help.
To summarize: I'm building a parser that has to consolidate data based on values held in an array.
The source file contains tab-separated values, which are parsed into an array with the columns
pdbID | resNum | resID | secstructID
These are then consolidated into a file that should contain:
pdbID | startRes | endRes | secstructID
After parsing a file, the source array holds the data to consolidate:
1b6g 1 M \N
1b6g 2 V \N
1b6g 3 N \N
1b6g 4 N H
1b6g 5 N H
1b6g 6 N \N
3hba 7 W H
2cdg 8 N H
2cdg 9 V \N
2cdg 10 M \N
2cdg 11 A B
2cdg 12 M \N
The expected result after consolidation should be:
1b6g 1 3 \N
1b6g 4 5 H
1b6g 6 6 \N
3hba 7 7 H
2cdg 8 8 H
2cdg 9 10 \N
2cdg 11 11 B
2cdg 12 12 \N
As you can see, each pdbID is assigned a secStructID sequentially, and any interruption in the secStructID should be treated as a point from which the assignment restarts.
Each pdbID can thus have multiple occurrences of, for example, \N in different places of the sequence; they are differentiated by the startRes and endRes values, which are both derived from resNum.
All is wonderful and I have working code that consolidates the data. Unfortunately, it doesn't recognize the occurrence of a new secstructID as the end of the previous one; instead it finds the last occurrence in the whole sequence for one pdbID and treats that as the end.
As a result, my output is incorrectly displayed as:
1b6g 4 5 H
1b6g 1 6 \N ---- error here: this should in fact be two separate "entities" (1-3 and 6), because 4 and 5 do not belong to \N
3hba 7 7 H
2cdg 8 8 H
2cdg 9 12 \N ---- same here (11 is B, so this should be split into 9-10 and 12)
2cdg 11 11 B
And here's my code:
#!/usr/bin/perl -w
use strict;
use warnings;

# --------------------------------------------------------------
# This script uses the residue.txt file generated by
# resTabmakerBatch.pl and creates a new file called
# SecStructList.txt
# each protein is described by secondary structures with a
# pdbID, 2ry structureID (char or \N'), startResidue, endResidue
# Input: residue.txt (this file is the output of resTabmakerBatch.pl)
# Output: secStructList.txt
# usage: secStructList.txt to populate the SecStructure entity
# --------------------------------------------------------------

#Read arguments, print error message if insufficient
if ($#ARGV<0) {
    die("\n\nUsage: sstruct.pl [residue_table_file.txt]\n\n");
}
my $filename = $ARGV[0];

#if either file not found return error message
if (! -e "$filename") {
    die("\n\nresidue file $filename does not exist!\n\n");
}

# Read residue.txt file, extracting the data of interest - only
# pdb id, resNum, resID, secondaryStructID
#First read file, storing each line in an array 'dssplines' splitting the data
open (MYFILE,"$filename") or die ("\nERROR: Can't open $filename\n");
my @dssplines= split(/\r/, <MYFILE>);
my $arraySize=@dssplines;
close(MYFILE);

#read one line from the originally loaded array dssplines at a time and loop
#over it splitting the values using the tabs
my @dsspdata;
my $dsspdataSize=@dssplines;
my $n=0;
for (my $i=0; $i < $arraySize; $i++) {
    #each line from the array goes into a new dsspline variable
    my $dsspline = $dssplines[$i];
    for (my $j = 0; $j <=4; $j++) {
        #each time values inside are separated using the tabs
        my ($pdbID, $resNo, $resID, $phi, $psi, $chi1, $chi2, $secStruct, $activesite) = split(/\t/, $dsspline);
        # now each value of interest is stored into a new array @dsspdata
        $dsspdata[$n][0] = $pdbID;
        $dsspdata[$n][1] = $resNo;
        $dsspdata[$n][2] = $resID;
        $dsspdata[$n][3] = $secStruct;
    }
    $n++;
}

#my @dsspdata array is now perfect to reformat into a hash analyzing the value correlation
#initialize the hash and counter
my %dane;
my $k=0;
#loop around the dsspdata array
for (my $i=0; $i < $dsspdataSize; $i++) {
    #split each cell in a row into variables for the hash
    for (my $k = 0; $k <=4; $k++) {
        my $pdb = $dsspdata[$i][0];
        my $residueNum = $dsspdata[$i][1];
        my $secStructure = $dsspdata[$i][3];
        push @{ $dane{$pdb}->{$secStructure} }, $residueNum;
    }
    $k++;
}

#now for each pdbID using the hash keys
foreach my $pdbID ( keys %dane ) {
    #check the secondary structure id with pdbID as a key (only if the pdbID is the same will the values be stored)
    foreach my $secID ( keys %{ $dane{$pdbID} } ) {
        #finally create an array of residue numbers
        my @resnums = ( $dane{$pdbID}->{$secID}->[0], $dane{$pdbID}->{$secID}->[-1] );
        #create a new file with the secondary structures list
        open (SStruc, ">>secStructList.txt") || die "Can't open file: $!";
        #append each line to the new file with tab separated data
        print SStruc ("$pdbID \t @resnums \t $secID\n");
    }
}
close(SStruc);
I have attached the source file (it has already been processed by another script, which wasn't nearly as complicated as this issue).
Run the program with residue.txt as its argument.
If anyone has an idea how to deal with this, I would be very grateful for suggestions. As you can see from my code, I am slightly Java-twisted.
Cheers,
Matt
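The collapse described in the post seems to follow from how the hash is keyed: every residue of a pdbID with the same secstructID ends up in one list, so taking the first and last element spans any interruption in between. Here is a minimal sketch that reproduces this for the 1b6g rows from the post. The function name collapse_by_struct is made up for illustration and is not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: group residue numbers by secstructID only,
# the way the posted script keys $dane{$pdb}->{$secStructure}.
sub collapse_by_struct {
    my ($pdb, @rows) = @_;    # each row is [resNum, secstructID]
    my %by_struct;
    push @{ $by_struct{ $_->[1] } }, $_->[0] for @rows;

    # First and last residue per secstructID; run boundaries are lost here.
    return map {
        my @nums = @{ $by_struct{$_} };
        "$pdb $nums[0] $nums[-1] $_";
    } sort keys %by_struct;
}

# The 1b6g rows from the post.
my @rows = ( [1, '\N'], [2, '\N'], [3, '\N'], [4, 'H'], [5, 'H'], [6, '\N'] );
print "$_\n" for collapse_by_struct('1b6g', @rows);
```

Residues 1-3 and 6 land in the same \N list, so they are reported as a single range 1 to 6 even though 4 and 5 belong to H.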
Terribly helpful of you.
But to ease your curiosity: it's not homework. I'm trying to rewrite a parser that I wrote in Java some time ago (originally for DSSP files, if that tells you anything). This particular script is part of a larger set of programs used to populate a database, and residue.txt is the output of a different script (in the final version it will be reintegrated into a batch without processing the additional file). It's not posted on other web forums; it is posted on Usenet, though. It's not finished, otherwise I wouldn't be posting questions about it. resID might turn out to be useful to me, as I am considering storing a whole sequence in a separate attribute of a DB entity; I haven't decided yet, as I am still modifying the DB schema.
Now, instead of answering a question with a question, do you think you could give me a hint on dealing with the hash for this rather unusual case? I would be really grateful. I'm trying to learn Perl by writing it, but with hashes I seem to be stumbling in the dark. I'm more of a Java person, I think.
I've slightly simplified the code after a couple of suggestions that I got elsewhere and now it should be easier to follow:
#!/usr/bin/perl -w
use strict;
use warnings;

if ($#ARGV<0) {
    die("\n\nUsage: sstruct.pl [residue_table_file.txt]\n\n");
}
my $filename = $ARGV[0];
if (! -e "$filename") {
    die("\n\nresidue file $filename does not exist!\n\n");
}

open (MYFILE,"$filename") or die ("\nERROR: Can't open $filename\n");
my @dssplines= split(/\r/, <MYFILE>);
my $arraySize=@dssplines;
close(MYFILE);

my @dsspdata;
my $dsspdataSize=@dssplines;
for (my $i=0; $i < $arraySize; $i++) {
    my $dsspline = $dssplines[$i];
    my ($pdbID, $resNo, $resID, $phi, $psi, $chi1, $chi2, $secStruct, $activesite) = split(/\t/, $dsspline);
    push( @dsspdata, [$pdbID,$resNo,$resID,$secStruct] );
}

my %dane;
my $k=0;
for (my $i=0; $i < $dsspdataSize; $i++) {
    for (my $k = 0; $k <=4; $k++) {
        my $pdb = $dsspdata[$i][0];
        my $residueNum = $dsspdata[$i][1];
        my $secStructure = $dsspdata[$i][3];
        push @{ $dane{$pdb}->{$secStructure} }, $residueNum;
    }
    $k++;
}

foreach my $pdbID ( keys %dane ) {
    foreach my $secID ( keys %{ $dane{$pdbID} } ) {
        my @resnums = ( $dane{$pdbID}->{$secID}->[0], $dane{$pdbID}->{$secID}->[-1] );
        open (SStruc, ">>secStructList.txt") || die "Can't open file: $!";
        print SStruc ("$pdbID \t @resnums \t $secID\n");
    }
}
close(SStruc);
Last edited by BioTeq; May 23rd, 2007 at 10:05 pm.
It seems that I will be solving this problem using a response from a Usenet user.
It works flawlessly; however, it's not the way I want it solved, so I will still try to get it working with the hashes.
my ( $x, $y, undef, $z ) = split ' ', <DATA>;
my ($last_x,$last_z,$min,$max)=($x,$z,$y,$y);
while (<DATA>) {
    my ( $x, $y, undef, $z ) = split;
    if ($x ne $last_x or $z ne $last_z) {
        print "$last_x $min $max $last_z\n";
        ($last_x,$last_z,$min,$max)=($x,$z,$y,$y);
    };
    $max=$y;
}
print "$last_x $min $max $last_z\n";
Last edited by BioTeq; May 23rd, 2007 at 11:10 pm.
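For the "get it with the hashes" route, one possible shape is a hash that maps each pdbID to an ordered list of runs rather than to a sub-hash keyed by secstructID. This is only a sketch under a few assumptions: rows arrive in file order, a run ends whenever the pdbID or secstructID changes (gaps in resNum are not checked), and the names consolidate_runs, %runs and @order are invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: consolidate [pdbID, resNum, secstructID] rows into runs.
# Returns (\%runs, \@order), where %runs maps
#   pdbID => [ [startRes, endRes, secstructID], ... ]
# and @order lists pdbIDs in first-seen order (hash keys are unordered).
sub consolidate_runs {
    my @rows = @_;
    my (%runs, @order);
    my ($prev_pdb, $prev_sec);

    for my $row (@rows) {
        my ($pdb, $num, $sec) = @$row;
        push @order, $pdb unless exists $runs{$pdb};

        if (defined $prev_pdb && $pdb eq $prev_pdb && $sec eq $prev_sec) {
            # Same pdbID and secstructID as the previous row: extend the current run.
            $runs{$pdb}[-1][1] = $num;
        }
        else {
            # pdbID or secstructID changed: start a new run.
            push @{ $runs{$pdb} }, [ $num, $num, $sec ];
        }
        ($prev_pdb, $prev_sec) = ($pdb, $sec);
    }
    return (\%runs, \@order);
}

# Sample rows from the first post (resID column dropped).
my @rows = (
    ['1b6g', 1, '\N'], ['1b6g', 2, '\N'], ['1b6g', 3, '\N'],
    ['1b6g', 4, 'H'],  ['1b6g', 5, 'H'],  ['1b6g', 6, '\N'],
    ['3hba', 7, 'H'],
    ['2cdg', 8, 'H'],  ['2cdg', 9, '\N'], ['2cdg', 10, '\N'],
    ['2cdg', 11, 'B'], ['2cdg', 12, '\N'],
);

my ($runs, $order) = consolidate_runs(@rows);
for my $pdb (@$order) {
    print join("\t", $pdb, @$_), "\n" for @{ $runs->{$pdb} };
}
```

Because each run is its own array element, two separate \N stretches under the same pdbID stay separate, which a sub-hash keyed by secstructID cannot express.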