Parsing of information in perl

Vandithar (Light Poster) | #1 | Oct 15th, 2009
Hi,

I have strings like this:

    $string = "OWN - NLM STAT- Publisher DA - 20091005 AU - Gannon AM AU - Turner EC AU - Reid HM AU - Kinsella BT AU- XYZ AD - UCD School of Biomolecular and Biomedical Sciences";

I want to parse these tags and create a hash from this string.

The output should look like this:

    OWN- NLM
    STAT- Publisher
    DA - 20091005
    AU- Gannon AM Turner EC Kinsella BT XYZ
    AD - UCD School of Biomolecular and Biomedical Sciences

I tried parsing with a regular expression, but sometimes the format is different:

    $string =~ /OWN-(.*)AU(.*)/g;

How can I parse all of the information and create a hash?

Regards
Vandita

d5e5 (Junior Poster in Training) | #2 | Oct 16th, 2009
The following program parses the string and stores the substrings in an array or list. So far I haven't figured out how to put the contents of this array into a hash that would be useful.
    #!/usr/bin/perl
    use strict;
    use warnings;

    $_ = "OWN - NLM "
       . "STAT- Publisher "
       . "DA - 20091005 "
       . "AU - Gannon AM "
       . "AU - Turner EC "
       . "AU - Reid HM "
       . "AU - Kinsella BT "
       . "AU- XYZ "
       . "AD - UCD School of Biomolecular and Biomedical Sciences";
    print;
    print "\n";

    # Parse the above string into key-value pairs by wrapping each tag
    # (capital letters followed by a hyphen) in ### delimiters.
    s/([A-Z]+\s*)-/###$1###/g;

    # Put the resulting substrings into a list.
    my @list = split /\s*###/;

    # Remove leading and trailing spaces from each element.
    foreach my $i (@list) {
        $i =~ s/^\s+//;
        $i =~ s/\s+$//;
    }

    # The first element is empty. Remove it with shift.
    shift @list;

    # Here's what the list contains now.
    for (0..$#list) {
        print "Element $_: $list[$_]\n";
    }

d5e5 (Junior Poster in Training) | #3 | Oct 17th, 2009
This should do it.
    #!/usr/bin/perl
    # ParseStringCreateHash.pl
    use strict;
    use warnings;

    $_ = "OWN - NLM STAT- Publisher DA - 20091005 AU - Gannon AM AU - Turner EC AU - Reid HM AU - Kinsella BT AU- XYZ AD - UCD School of Biomolecular and Biomedical Sciences";
    print "\n";

    # Parse the above string into key-value pairs by wrapping each tag
    # (capital letters followed by a hyphen) in ### delimiters.
    s/([A-Z]+\s*)-/###$1###/g;

    # Put the resulting substrings into an array.
    my @array = split /\s*###/;

    # Remove leading and trailing spaces from each element.
    foreach my $i (@array) {
        $i =~ s/^\s+//;
        $i =~ s/\s+$//;
    }

    # The first element is empty. Remove it with shift.
    shift @array;

    print "Here's what the array contains now:\n";
    for (0..$#array) {
        print "Element $_: $array[$_]\n";
    }

    # Create a hash from the above array.
    my %hash = ();
    for (0..$#array) {
        if ($_ % 2 == 0) {    # $_ is even: the element is a hash key
            $hash{$array[$_]} = "" unless exists $hash{$array[$_]};
        }
        else {                # $_ is odd: the element is part of the value
                              # associated with the previous element (the key)
            $hash{$array[$_ - 1]} .= "$array[$_] ";
        }
    }

    print "\nHash keys and values separated by '-' :\n";
    for (keys %hash) {
        print "$_ - $hash{$_}\n";
    }
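
The same result can probably be reached in one pass, without the ### delimiters. The following is only a minimal sketch under the same rule as above (a tag is a run of capital letters followed by a hyphen); the name parse_tags is hypothetical and does not come from the thread. It keeps repeated tags such as AU as a list of separate values instead of one concatenated string.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns a hash reference that maps
# each tag to an array reference holding all of that tag's values, in order.
sub parse_tags {
    my ($text) = @_;
    my %values_for;
    # A tag is a run of capital letters, optional spaces, then a hyphen.
    # The value runs (non-greedily) up to the next tag or the end of the string.
    while ($text =~ /([A-Z]+)\s*-\s*(.*?)\s*(?=[A-Z]+\s*-|\z)/gs) {
        push @{ $values_for{$1} }, $2;
    }
    return \%values_for;
}

my $string = "OWN - NLM STAT- Publisher DA - 20091005 AU - Gannon AM "
           . "AU - Turner EC AU - Reid HM AU - Kinsella BT AU- XYZ "
           . "AD - UCD School of Biomolecular and Biomedical Sciences";

my $record = parse_tags($string);
for my $tag (sort keys %$record) {
    print "$tag - ", join(' ', @{ $record->{$tag} }), "\n";
}
```

Keeping the values in an array leaves the choice of separator to the caller: join(' ', ...) gives the single line asked for in the first post, while the individual author names stay available.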

Vandithar (Light Poster) | #4 | Oct 29th, 2009
Hi,

Thanks for the reply.

I have one question. Suppose, for example, the string is like this:

    $str = " AB - Thromboxane plays an essential role in hemostasis, regulating platelet aggregation and vessel tone. TP - beta- isoforms that are transcriptionally regulated by distinct Prm1 and Prm3, respectively";

The output should be like this:

    AB - Thromboxane plays an essential role in hemostasis, regulating platelet aggregation and vessel tone. TP - beta- isoforms that are transcriptionally regulated by distinct Prm1 and Prm3, respectively.

But the actual output is:

    TP - beta- isoforms that are transcriptionally regulated by distinct Prm1 and Prm3, respectively
    AB - Thromboxane plays an essential role in hemostasis, regulating platelet aggregation and vessel tone

The whole paragraph belongs to the AB tag alone. TP and AB are not two separate tags; everything is under the single tag AB.

AB and AD are the main tags, so only a tag at the beginning of a field (AB -, AD -) should be parsed. A sentence that happens to contain "TP - beta- isoforms" should not be split.

How can I get the desired output shown above?

Regards
Archana
Last edited by Vandithar; Oct 29th, 2009 at 4:06 am.

eggmatters (Junior Poster in Training) | #5 | Oct 29th, 2009
Hmm, the first thing you will be asked is to provide the code you used to parse the string mentioned above. But I also think you may need to step back for a second and determine exactly how you want your data organized. It appears that you are using some type of custom mark-up to separate your data. This is good, but you should make sure that you have a well-formed set of rules for your mark-up before you ask Perl to start doing esoteric things to your string.

Can you provide a list of all your tags and the action you want to occur for each? I would suggest using pen and paper to delineate all of your tags and actions. Maybe something like:

    <ID tags>
      elements: AB AD TP etc.
      elements: <id string>
    <id string>
      elements: any character that is not an id tag.
    <School Tags>
      elements: AU <school info> etc.
    <school info>
      elements: any character that is not a School Tag

You can structure the language any way you choose, but you want the logic that describes how the string is handled to be inherent in the language (mark-up) you construct. Sorry for the vague overview, but it sounds like you need to work on constructing your tags rather than have Perl format static data on a case-by-case basis.
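
One way to act on this advice, shown only as a rough sketch: write down the list of real tags and let Perl split the string at those tags and nowhere else. The tag list below (OWN, STAT, DA, AU, AD, AB) is an assumption taken from the examples in this thread, and parse_known_tags is a hypothetical name. Because TP is not in the list, "TP - beta- isoforms" stays inside the AB value.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumption: these are the only real tags. The list is taken from the
# examples in this thread and would need extending for real data.
my @known_tags = qw(OWN STAT DA AU AD AB);

# Hypothetical helper: splits $text at known tags only, so that other
# capitalised words followed by a hyphen (such as "TP -") stay in the value.
sub parse_known_tags {
    my ($text) = @_;
    my $tag_re = join '|', map { quotemeta($_) } @known_tags;
    my %value_for;
    while ($text =~ /\b($tag_re)\s*-\s+(.*?)\s*(?=\b(?:$tag_re)\s*-\s|\z)/gs) {
        my ($tag, $value) = ($1, $2);
        # Join the values of a repeated tag (such as AU) with a space.
        $value_for{$tag} = exists $value_for{$tag}
                         ? "$value_for{$tag} $value"
                         : $value;
    }
    return \%value_for;
}

my $str = " AB - Thromboxane plays an essential role in hemostasis, "
        . "regulating platelet aggregation and vessel tone. TP - beta- "
        . "isoforms that are transcriptionally regulated by distinct Prm1 "
        . "and Prm3, respectively";

my $record = parse_known_tags($str);
print "$_ - $record->{$_}\n" for sort keys %$record;
```

The limitation is the one the post warns about: the result is only as good as the tag list. A value that contains a known tag followed by a hyphen and a space would still be split.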

d5e5 (Junior Poster in Training) | #6 | Oct 29th, 2009 | Parsing strings into Hash
Now that you have given us a second example of a string to be parsed we need to infer a rule that will handle both examples. Say we assign your first example to a variable called $str1 and the second to $str2. Your program needs to parse both of the following strings into keys and values to put in a hash:
    # RULE 1:
    # Every substring consisting of capital letters followed by
    # a hyphen (-) represents a key and whatever follows before the
    # next substring consisting of capital letters followed by
    # a hyphen is the value associated with that key.
    $str1 = "OWN - NLM STAT- Publisher DA - 20091005 AU - Gannon AM AU - Turner EC AU - Reid HM AU - Kinsella BT AU- XYZ AD - UCD School of Biomolecular and Biomedical Sciences";
    #
    # RULE 2:
    # A substring consisting of capital letters followed by a hyphen -
    # represents a key only if it occurs at the beginning of a line.
    # It may (or must?) be preceded by a space which we will
    # remove when we put the key in the hash.
    # Whatever follows, including substrings of capital letters
    # followed by a hyphen, up until the end of the line
    # (indicated by a newline character?) is the value associated
    # with that key.
    $str2 = " AB - Thromboxane plays an essential role in hemostasis, regulating platelet aggregation and vessel tone. TP - beta- isoforms that are transcriptionally regulated by distinct Prm1 and Prm3, respectively";
I assume you want a program that can parse both $str1 and $str2 into hash keys and values. Is that correct? Right now I can't think of one set of rules that would handle both. Maybe someone else can suggest one.
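
If the original data does have one field per line, Rule 2 alone might be enough for both examples. The sketch below rests on that assumption, which the thread does not confirm (the strings as posted contain no line breaks); parse_by_line is a hypothetical name. A tag counts only at the start of a line, and a line that does not start with a tag is treated as a continuation of the previous field, which is a further guess.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper implementing Rule 2: a tag counts only at the start of
# a line. Assumes the input still contains its original line breaks.
sub parse_by_line {
    my ($text) = @_;
    my (%value_for, $current_tag);
    for my $line (split /\n/, $text) {
        if ($line =~ /^\s*([A-Z]+)\s*-\s*(.*?)\s*$/) {
            # New field: remember its tag and start (or extend) its value.
            $current_tag = $1;
            $value_for{$current_tag} = exists $value_for{$current_tag}
                                     ? "$value_for{$current_tag} $2"
                                     : $2;
        }
        elsif (defined $current_tag && $line =~ /\S/) {
            # Continuation line: append it to the current field.
            (my $more = $line) =~ s/^\s+|\s+$//g;
            $value_for{$current_tag} .= " $more";
        }
    }
    return \%value_for;
}

# The fields are taken from the thread; placing them on separate lines is
# the assumption being illustrated.
my $text = join "\n",
    "OWN - NLM",
    "AU - Gannon AM",
    "AU - Turner EC",
    " AB - Thromboxane plays an essential role in hemostasis, regulating "
      . "platelet aggregation and vessel tone. TP - beta- isoforms that are "
      . "transcriptionally regulated by distinct Prm1 and Prm3, respectively";

my $record = parse_by_line($text);
print "$_ - $record->{$_}\n" for sort keys %$record;
```

With line breaks available, "TP -" in the middle of the AB line is never examined as a tag, so no list of valid tags is needed.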

eggmatters (Junior Poster in Training) | #7 | Oct 29th, 2009
Model an external formal grammar based on how you want your data structured, then tell Perl to perform operations on that grammar based on the rules you've created.
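
A very small version of this idea, as a sketch only: keep the rules in a table outside the parsing code, with one action per tag, and let a generic loop apply them. Everything named here (%rule_for, store_single, store_list, apply_rules) is hypothetical, and the choice of which tags may repeat is a guess made for illustration. A hand-written table like this is far simpler than a real formal grammar.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumption: a hand-written rule table standing in for the "grammar". Each
# known tag maps to the action that stores its value. Tag names come from the
# thread; which tags may repeat is a guess made for illustration.
my %rule_for = (
    OWN  => \&store_single,
    STAT => \&store_single,
    DA   => \&store_single,
    AD   => \&store_single,
    AB   => \&store_single,
    AU   => \&store_list,
);

# A single-valued tag keeps one string; a list tag collects every occurrence.
sub store_single { my ($rec, $tag, $value) = @_; $rec->{$tag} = $value; return }
sub store_list   { my ($rec, $tag, $value) = @_; push @{ $rec->{$tag} }, $value; return }

# Hypothetical driver: finds only the tags named in %rule_for and hands each
# tag/value pair to that tag's action.
sub apply_rules {
    my ($text) = @_;
    my $tag_re = join '|', map { quotemeta($_) } sort keys %rule_for;
    my %record;
    while ($text =~ /\b($tag_re)\s*-\s+(.*?)\s*(?=\b(?:$tag_re)\s*-\s|\z)/gs) {
        my ($tag, $value) = ($1, $2);
        $rule_for{$tag}->(\%record, $tag, $value);
    }
    return \%record;
}

my $record = apply_rules(
    "DA - 20091005 AU - Gannon AM AU - Turner EC AU- XYZ "
  . "AD - UCD School of Biomolecular and Biomedical Sciences"
);
print "DA - $record->{DA}\n";
print "AU - ", join(' ', @{ $record->{AU} }), "\n";
print "AD - $record->{AD}\n";
```

Adding a tag or changing how it is handled then means editing the table, not the regular expression.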