Hello All,

I have been working on this simple little thing for the past few days and I am stuck. Can anyone suggest or explain a regex that will extract the words from XML? Given the input below, I want the output that follows it.


<person> Sue Smith
<age> 32 </age>
<sex> female </sex>
</person>
<person> John
<child>
<name>Jim</name>
<age> 2</age>
</child>
<age> 45 </age>
<sex> female </sex>
</person>

name: Sue Smith
age: 32
sex: female

name: John
child: Jim
age: 2
sex: male

Thanks for any suggestions and input

Last Post by Ezzaral

As much as I want to use an XML parser, I have to write it myself.

Any suggestions would be great.



Better yet, I think I will learn more if I just write it without using regex. Any ideas?


Well, if you absolutely cannot use an XML parser (I am assuming because of class assignment restrictions?), then regex is your next best choice. Straight parsing might be possible if you are certain that each data element will always be on a separate line, such as your example above, but if it is not then you should stick with regex.
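For the no-regex route, here is a minimal sketch of the line-by-line idea, assuming every element really does sit on its own line as in the sample. The class and method names (LineTagParser, parseLine, parse) are made up for illustration, and mapping the <person> text to the label "name" is only a guess at the output format shown in the question.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not from the thread: parses one element per line
// using only indexOf/substring, no regex and no XML library.
class LineTagParser {

    // Turns a line such as "<age> 32 </age>" into "age: 32".
    // Returns null for lines that carry no text (e.g. "</person>" or "<child>").
    static String parseLine(String line) {
        String s = line.trim();
        int close = s.indexOf('>');
        if (!s.startsWith("<") || s.startsWith("</") || close < 0) {
            return null; // blank line, closing tag, or not a tag at all
        }
        String tag = s.substring(1, close);
        // The text runs from just after '>' to the next '<', or to end of line.
        int next = s.indexOf('<', close + 1);
        String text = (next < 0 ? s.substring(close + 1)
                                : s.substring(close + 1, next)).trim();
        if (text.isEmpty()) {
            return null; // container tag with no text of its own
        }
        // Assumption: the person's own text is printed under the label "name".
        String label = tag.equals("person") ? "name" : tag;
        return label + ": " + text;
    }

    // Applies parseLine to each line and keeps the non-null results.
    static List<String> parse(String xml) {
        List<String> out = new ArrayList<>();
        for (String line : xml.split("\n")) {
            String field = parseLine(line);
            if (field != null) {
                out.add(field);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String in = "<person> Sue Smith\n<age> 32 </age>\n<sex> female </sex>\n</person>";
        for (String field : parse(in)) {
            System.out.println(field);
        }
    }
}
```

This falls apart as soon as two elements share a line or one element spans several lines, which is why it only makes sense if the input layout is guaranteed.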

Even with regex there can be tricky spots with XML because of its nested nature. The code below will show you what I mean and perhaps give you a starting point to work from:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTest {
    public static void main(String[] args) {
        String in = "<person> Sue Smith "+
                    "<age> 32 </age> "+
                    "<sex> female </sex> "+
                    "</person> "+
                    "<person> John "+
                    "<child> "+
                    "<name>Jim</name> "+
                    "<age> 2</age> "+
                    "</child> "+
                    "<age> 45 </age> "+
                    "<sex> female </sex> "+
                    "</person>";
        // Reluctant (.+?) stops at the first closing tag instead of the last one
        Pattern personPattern = Pattern.compile("<person>(.+?)</person>",Pattern.CASE_INSENSITIVE);
        Pattern namePattern = Pattern.compile("<person>(.+?)<",Pattern.CASE_INSENSITIVE);
        Pattern agePattern = Pattern.compile("<age>(.+?)<",Pattern.CASE_INSENSITIVE);
        Matcher personMatcher = personPattern.matcher(in);
        while (personMatcher.find()){
            System.out.println("person match:");
            // Search only inside the current <person>...</person> block
            Matcher nameMatcher = namePattern.matcher(personMatcher.group(0));
            while (nameMatcher.find()){
                System.out.println("Name: "+nameMatcher.group(1).trim());
            }
            Matcher ageMatcher = agePattern.matcher(personMatcher.group(0));
            while (ageMatcher.find()){
                System.out.println("Age: "+ageMatcher.group(1).trim());
            }
        }
    }
}

If you run that (in a test class main() is fine), you will see that it catches two ages for John because there is a <child> element with an <age> tag within his <person> element. Your regex parsing needs to take that into account so that John's own age is kept separate from his child's age. Good luck!
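One possible way to handle that, as a rough sketch rather than the only approach: pull out the <child> block first, then look for the person's own <age> in what is left. The class and method names here (NestedAgeDemo, ownAge, childAge, firstAge) are invented for the example, and it assumes at most one level of <child> nesting.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: separates a person's own <age> from a nested child's <age>.
class NestedAgeDemo {
    private static final Pattern CHILD =
            Pattern.compile("<child>(.+?)</child>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern AGE =
            Pattern.compile("<age>(.+?)</age>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the first <age> value found in the fragment, or null if there is none.
    static String firstAge(String fragment) {
        Matcher m = AGE.matcher(fragment);
        return m.find() ? m.group(1).trim() : null;
    }

    // The person's own age: delete every <child>...</child> block, then search.
    static String ownAge(String personXml) {
        return firstAge(CHILD.matcher(personXml).replaceAll(""));
    }

    // The age of the first child: search only inside the first <child> block.
    static String childAge(String personXml) {
        Matcher m = CHILD.matcher(personXml);
        return m.find() ? firstAge(m.group(1)) : null;
    }

    public static void main(String[] args) {
        String john = "<person> John <child> <name>Jim</name> <age> 2</age> </child> "
                + "<age> 45 </age> <sex> female </sex> </person>";
        System.out.println("Age: " + ownAge(john));         // prints "Age: 45"
        System.out.println("Child age: " + childAge(john)); // prints "Child age: 2"
    }
}
```

Note that the reluctant (.+?) match would stop at the first </child> it sees, so a <child> nested inside another <child> would need a real parser or a tag-depth counter.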
