Hi,

I have strings like this:

$string="OWN - NLM STAT- Publisher DA - 20091005  AU - Gannon AM AU - Turner EC AU - Reid HM AU - Kinsella BT  AU- XYZ AD - UCD School of Biomolecular and Biomedical Sciences";

I want to parse these tags and create a hash from this string.

The output should be like this:

OWN- NLM
STAT- Publisher 
DA - 20091005
AU- Gannon AM Turner EC Kinsella BT XYZ
AD - UCD School of Biomolecular and Biomedical Sciences

I tried parsing it with a regular expression, but sometimes the format is different.

$string=~/OWN-(.*)AU(.*)/g;

How can I parse all the information and create a hash?

Regards
Vandita

The following program parses the string and stores the substrings in an array or list. So far I haven't figured out how to put the contents of this array into a hash that would be useful.

#!/usr/bin/perl -w
use strict;

$_ = "OWN - NLM "
   . "STAT- Publisher "
   . "DA - 20091005 "
   . "AU - Gannon AM "
   . "AU - Turner EC "
   . "AU - Reid HM "
   . "AU - Kinsella BT "
   . "AU- XYZ "
   . "AD - UCD School of Biomolecular and Biomedical Sciences";
print;
print "\n";
#Parse the above string into key-value pairs.
s/([A-Z]+\s*)-/###$1###/g;
#Put the resulting substrings into a list
my @list = split /\s*###/;

#Remove leading and trailing spaces from each element
foreach my $i (@list) {
    $i =~ s/^\s+//;
    $i =~ s/\s+$//;
}

#The first element is empty. Remove it with shift.
shift @list;
#Here's what the list contains now
for (0..$#list) {
    print "Element $_: $list[$_]\n";
}

This should do it.

#!/usr/bin/perl -w
#ParseStringCreateHash.pl
use strict;
$_ = "OWN - NLM STAT- Publisher DA - 20091005  AU - Gannon AM AU - Turner EC AU - Reid HM AU - Kinsella BT  AU- XYZ AD - UCD School of Biomolecular and Biomedical Sciences";
print "\n";
#Parse the above string into key-value pairs.
s/([A-Z]+\s*)-/###$1###/g;
#Put the resulting substrings into an array
my @array = split /\s*###/;

#Remove leading and trailing spaces from each element
foreach my $i (@array) {
    $i =~ s/^\s+//;
    $i =~ s/\s+$//;
}

#The first element is empty. Remove it with shift.
shift @array;
#Here's what the array contains now
print "Here's what the array contains now:\n";
for (0..$#array) {
    print "Element $_: $array[$_]\n";
}
#Create a hash from the above array
my %hash = ();
for (0..$#array) {
    if ($_ % 2 == 0) { #$_ is even
        #Array element represents a hash key
        $hash{$array[$_]} = "" unless exists $hash{$array[$_]};
    }
    else { #$_ is odd
        #Array element [$_] is part of the value associated with the previous element (the key)
        #Join repeated values with a single space so the value has no trailing space
        $hash{$array[$_ - 1]} .= " " if length $hash{$array[$_ - 1]};
        $hash{$array[$_ - 1]} .= $array[$_];
    }
}

print "\nHash keys and values separated by '-' :\n";
for (keys %hash) {
    print "$_ - $hash{$_}\n";
}
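As an alternative, here is a minimal sketch that builds the hash directly with one global match instead of going through the ### markers and an intermediate array. It assumes the same rule as the program above (a tag is a run of capital letters, optional spaces, then a hyphen), and the name parse_tags is made up for this example. It will split on any capital-letter word followed by a hyphen inside a value, so treat it as a starting point only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parse "TAG - value TAG - value ..." into a hash of tag => joined values.
# Assumption: a tag is a run of capital letters, optional spaces, then a hyphen.
sub parse_tags {
    my ($string) = @_;
    my %hash;
    # Capture a tag, then lazily capture everything up to the next tag or the end.
    while ($string =~ /\b([A-Z]+)\s*-\s*(.*?)\s*(?=\b[A-Z]+\s*-|\z)/gs) {
        my ($tag, $value) = ($1, $2);
        # Repeated tags (such as AU) are joined with a single space.
        $hash{$tag} = exists $hash{$tag} ? "$hash{$tag} $value" : $value;
    }
    return %hash;
}

my %hash = parse_tags("OWN - NLM STAT- Publisher DA - 20091005  AU - Gannon AM AU - Turner EC AU - Reid HM AU - Kinsella BT  AU- XYZ AD - UCD School of Biomolecular and Biomedical Sciences");
print "$_ - $hash{$_}\n" for sort keys %hash;
```

The keys are sorted only to make the printed order predictable, since a Perl hash does not keep insertion order.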

Hi,

Thanks for the reply.

I have one doubt. Suppose, for example, the string is like this:

$str=" AB  - Thromboxane plays an essential role in hemostasis, regulating platelet aggregation and vessel tone. TP - beta- isoforms that are transcriptionally regulated by distinct  Prm1 and Prm3, respectively";

The output should be like this:

AB -  Thromboxane plays an essential role in hemostasis, regulating platelet aggregation and vessel tone. TP - beta- isoforms that are transcriptionally regulated by distinct  Prm1 and Prm3, respectively.

But the output I actually get is this:

TP - beta- isoforms that are transcriptionally regulated by distinct  Prm1 and Prm3, respectively
AB - Thromboxane plays an essential role in hemostasis, regulating platelet aggregation and vessel tone

Actually, the whole paragraph (all the sentences) belongs to the AB tag only. It should not be split into two separate tags, TP and AB.

So how do I get the output below?

AB - Thromboxane plays an essential role in hemostasis, regulating platelet aggregation and vessel tone. TP - beta- isoforms that are transcriptionally regulated by distinct  Prm1 and Prm3, respectively.

How can I get the desired output?

AB and AD are the main tags, so only a tag at the beginning (AB -, AD -) should be parsed. Text inside the sentences, such as "TP - beta- isoforms", should not be parsed as a tag.

Regards
Archana

Hmm, the first question you will be asked is to provide the code you used to parse the string mentioned above. But I also think you may need to step back for a second and determine exactly how you want your data organized. It appears that you are using some type of custom markup to separate your data. This is good, but you should make sure that you have a well-formed set of rules for your markup before you ask Perl to do esoteric things to your string. Can you provide a list of all your tags and the action you want for each? I would suggest using pen and paper to delineate all of your tags and actions. Maybe something like:

<ID tags>
elements: AB AD TP etc. 
elements: <id string>
<id string>
elements: any character that is not an id tag.
<School Tags>
elements: AU <school info> etc.
<school info>
elements: any character that is not a School Tag

You can structure the language any way you choose, but you want the logic that describes how the string is handled to be inherent in the language (markup) you construct. Sorry for the vague overview, but it sounds like you need to work on constructing your tags rather than have Perl format static data on a case-by-case basis.
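To make the "list of tags and actions" idea concrete, here is a rough sketch of a rule table in Perl. Everything in it is an assumption for illustration: the table %TAG_RULE, the rule names 'single' and 'list', the sub apply_rules, and the choice of which tags may repeat are not taken from your real data. It expects the string to have been split into tag/value pairs already.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical rule table: every tag the format allows and how to store its value.
my %TAG_RULE = (
    OWN  => 'single',
    STAT => 'single',
    DA   => 'single',
    AU   => 'list',      # may repeat, so collect every value
    AD   => 'single',
    AB   => 'single',
);

# Apply the rule table to a flat list of (tag, value) pairs.
sub apply_rules {
    my @pairs = @_;
    my %record;
    while (my ($tag, $value) = splice @pairs, 0, 2) {
        my $rule = $TAG_RULE{$tag};
        die "Unknown tag '$tag'\n" unless defined $rule;
        if ($rule eq 'list') {
            push @{ $record{$tag} }, $value;    # array of values
        }
        else {
            $record{$tag} = $value;             # plain scalar
        }
    }
    return \%record;
}

my $record = apply_rules(
    OWN => 'NLM',
    AU  => 'Gannon AM',
    AU  => 'Turner EC',
    AD  => 'UCD School of Biomolecular and Biomedical Sciences',
);
print "AU: @{ $record->{AU} }\n";
```

With a table like this, an unexpected tag fails loudly instead of silently becoming a new hash key, which is one way of keeping the rules in the markup definition rather than in ad hoc regexes.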

Now that you have given us a second example of a string to be parsed, we need to infer a rule that will handle both examples. Say we assign your first example to a variable called $str1 and the second to $str2. Your program needs to parse both of the following strings into keys and values to put in a hash:

#RULE 1:
# Every substring consisting of capital letters followed by 
# a hyphen (-) represents a key and whatever follows before the 
# next substring consisting of capital letters followed by 
# a hyphen is the value associated with that key.
$str1="OWN - NLM STAT- Publisher DA - 20091005  AU - Gannon AM AU - Turner EC AU - Reid HM AU - Kinsella BT  AU- XYZ AD - UCD School of Biomolecular and Biomedical Sciences"; 
#
#RULE 2:
# A substring consisting of capital letters followed by a hyphen - 
# represents a key only if it occurs at the beginning of a line. 
# It may (or must?) be preceded by a space which we will 
# remove when we put the key in the hash. 
# Whatever follows, including substrings of capital letters 
# followed by a hyphen, up until the end of the line 
# (indicated by a newline character?) is the value associated 
# with that key.
$str2=" AB  - Thromboxane plays an essential role in hemostasis, regulating platelet aggregation and vessel tone. TP - beta- isoforms that are transcriptionally regulated by distinct  Prm1 and Prm3, respectively";

I assume you want a program that can parse both $str1 and $str2 into hash keys and values. Is that correct? Right now I can't think of one set of rules that would handle both. Maybe someone else can suggest one.
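One possibility, offered only as a sketch: if the full set of real tags is known in advance, the program can split on those tags alone and leave everything else (such as "TP - beta-") inside the value. The list in @KNOWN_TAGS is an assumption built from the two example strings, and the sub name parse_known_tags is made up here. It would still misfire if a known tag followed by a hyphen appeared inside a sentence.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumption: the complete set of real tags is known in advance.
# This list is illustrative; replace it with the tags your records actually use.
my @KNOWN_TAGS = qw(OWN STAT DA AU AD AB);

sub parse_known_tags {
    my ($string) = @_;
    my $tag_re   = join '|', map { quotemeta } @KNOWN_TAGS;
    # A tag counts only at the start of the string or after whitespace.
    my $splitter = qr/(?:^|\s+)($tag_re)\s*-\s*/;
    # The capture group makes split return the tags along with the text between them.
    my @parts = split $splitter, $string;
    shift @parts;    # text before the first tag (normally empty)
    my %hash;
    while (my ($tag, $value) = splice @parts, 0, 2) {
        $value = '' unless defined $value;
        $value =~ s/\s+$//;
        # Repeated tags (such as AU) are joined with a single space.
        $hash{$tag} = exists $hash{$tag} ? "$hash{$tag} $value" : $value;
    }
    return %hash;
}

my $str2 = " AB  - Thromboxane plays an essential role in hemostasis, regulating platelet aggregation and vessel tone. TP - beta- isoforms that are transcriptionally regulated by distinct  Prm1 and Prm3, respectively";
my %h2 = parse_known_tags($str2);
print "$_ - $h2{$_}\n" for sort keys %h2;
```

Because TP is not in the list, the whole paragraph stays under AB, while the first example string is still split into OWN, STAT, DA, AU and AD.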

Model an external formal grammar based on how you want your data structured, and tell Perl to perform operations on that grammar based on the rules you've created.
