Hello All,

I have been working on this simple little thing for the past few days and I am stuck. Can anyone suggest or explain a regex that will extract words from XML?

Example:

<person> Sue Smith
<age> 32 </age>
<sex> female </sex>
</person>
<person> John
<child>
<name>Jim</name>
<age> 2</age>
</child>
<age> 45 </age>
<sex> female </sex>
</person>


output:
name: Sue Smith
age: 32
sex: female


name: John
child: Jim
age: 2
age: 45
sex: male

Thanks for any suggestions and input

As much as I want to use an XML parser, I have to write it myself.

Any suggestions would be great.

Thanks

Better yet, I think I will learn more if I just write it without using regex. Any ideas?

Well, if you absolutely cannot use an XML parser (I am assuming because of class assignment restrictions?), then regex is your next best choice. Straight parsing might be possible if you are certain that each data element will always be on a separate line, such as your example above, but if it is not then you should stick with regex.
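For the one-element-per-line case, straight parsing without regex might look roughly like the sketch below. It assumes every tag starts its own line, exactly as in the example input, and it will not cope with anything else. The class name LineByLineParser and the helper parse are made up for illustration, and the labels (name, child) are hard-coded to match the requested output.

```java
import java.util.ArrayList;
import java.util.List;

public class LineByLineParser {

    // Hypothetical helper: turns one-element-per-line XML into "label: text" lines.
    static List<String> parse(String xml) {
        List<String> out = new ArrayList<>();
        boolean inChild = false; // true while between <child> and </child>
        for (String raw : xml.split("\n")) {
            String line = raw.trim();
            if (!line.startsWith("<")) {
                continue; // blank or unexpected line, ignored in this sketch
            }
            int close = line.indexOf('>');
            if (close < 0) {
                continue; // malformed tag, ignored in this sketch
            }
            String tag = line.substring(1, close).trim();
            if (tag.startsWith("/")) {
                if (tag.equals("/child")) {
                    inChild = false;
                }
                continue; // closing tags carry no text
            }
            if (tag.equals("child")) {
                inChild = true;
            }
            // Text runs from the end of the opening tag to the next '<' (or end of line).
            String rest = line.substring(close + 1);
            int next = rest.indexOf('<');
            String text = (next >= 0 ? rest.substring(0, next) : rest).trim();
            if (text.isEmpty()) {
                continue;
            }
            String label = tag;
            if (tag.equals("person")) {
                label = "name"; // the person's name sits directly inside <person>
            } else if (inChild && tag.equals("name")) {
                label = "child"; // the requested output prints the child's name as "child"
            }
            out.add(label + ": " + text);
        }
        return out;
    }

    public static void main(String[] args) {
        String xml = "<person> John\n<child>\n<name>Jim</name>\n<age> 2</age>\n"
                + "</child>\n<age> 45 </age>\n<sex> female </sex>\n</person>";
        parse(xml).forEach(System.out::println);
    }
}
```

The inChild flag is the only state this needs for the sample data. Deeper nesting would call for a stack of open tags instead.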

Even with regex there can be tricky spots with XML because of its nested nature. The code below will show you what I mean and perhaps give you a starting point to work from.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class XmlRegexTest {
    public static void main(String[] args) {
        String in = "<person> Sue Smith "+
                    "<age> 32 </age> "+
                    "<sex> female </sex> "+
                    "</person> "+
                    "<person> John "+
                    "<child> "+
                    "<name>Jim</name> "+
                    "<age> 2</age> "+
                    "</child> "+
                    "<age> 45 </age> "+
                    "<sex> female </sex> "+
                    "</person>";
        // Reluctant (.+?) stops at the first closing tag instead of the last one
        Pattern personPattern = Pattern.compile("<person>(.+?)</person>", Pattern.CASE_INSENSITIVE);
        // The name is the text between <person> and the next tag
        Pattern namePattern = Pattern.compile("<person>(.+?)<", Pattern.CASE_INSENSITIVE);
        Pattern agePattern = Pattern.compile("<age>(.+?)<", Pattern.CASE_INSENSITIVE);
        Matcher personMatcher = personPattern.matcher(in);
        while (personMatcher.find()) {
            System.out.println("person match:");
            // group(0) is the whole <person>...</person> element, tags included
            Matcher nameMatcher = namePattern.matcher(personMatcher.group(0));
            while (nameMatcher.find()) {
                System.out.println("Name: " + nameMatcher.group(1).trim());
            }
            // Finds every <age> inside the person, including ages nested in <child>
            Matcher ageMatcher = agePattern.matcher(personMatcher.group(0));
            while (ageMatcher.find()) {
                System.out.println("Age: " + ageMatcher.group(1).trim());
            }
        }
    }
}

If you run that (in a test class main() is fine), you will see that it catches 2 ages for John because there is a <child> element with an <age> tag within his <person> element. Your regex parsing needs to take that into account so that John's own age is kept separate from his child's age. Good luck!
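One possible way to take the nested <child> into account, offered only as a sketch: cut every <child>...</child> block out of the person's text before searching for the person's own <age>. The class PersonAge and the helper ownAge are hypothetical names, and the approach assumes <child> elements are never nested inside one another, since the reluctant match would then stop at the first </child>.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PersonAge {

    // DOTALL lets '.' cross line breaks in case the element spans several lines.
    private static final Pattern CHILD = Pattern.compile(
            "<child>.*?</child>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern AGE = Pattern.compile(
            "<age>\\s*(.+?)\\s*</age>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Hypothetical helper: returns the person's own age, or null if none is found.
    static String ownAge(String personXml) {
        // Remove the child blocks first so their <age> tags cannot match.
        String withoutChildren = CHILD.matcher(personXml).replaceAll("");
        Matcher m = AGE.matcher(withoutChildren);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String john = "<person> John <child> <name>Jim</name> <age> 2</age> </child> "
                + "<age> 45 </age> <sex> female </sex> </person>";
        System.out.println("Age: " + ownAge(john));
    }
}
```

The child's details can be read the same way by running the name and age patterns against the text matched by the child pattern instead of discarding it.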
