I need to parse XML documents without parsing libraries,
that is, just create one of myself.

E.g.
input:

<aa><bb>H</bb></aa>

I have to get rid of tags and show, if any, errors(mismatching tag etc).
Any suggestion please.
Thanks in advance!!

Recommended Answers

All 18 Replies

What do you have against "parsing librarires"? Especially since there is both a DOM and SAX parser included in the JDK and JRE.

In any case it's called regex. or use toCharArray() and really do it yourself.

If you're writing your own parser, first you have to recognize your tokens. Parsing XML, you've pretty much got two sorts of tokens: things in angle brackets, and things not in angle brackets.
If you're just validating that tags are matched, then all you need to do is check that things are properly nested. Take a simpler case: you want to verify that a string consists of some sequence of lower-case letters followed by the reverse sequence in capitals. That is this string obeys the rule:
abcdeEDCBA
and these strings do not:
abcdeAEDCB
abcdeEDCB
abAcBCdeED

The strings can be of arbitrary length. How do you verify this?

You'll probably need a LIFO stack to keep track of opening tags for matching with closing tags

You'll probably need a LIFO stack to keep track of opening tags for matching with closing tags

Yeah, that's what I'm thinking.
@tedtdu - do you see how a stack helps you with the problem?

Thanks for your attention. Really appreciated of that.
Actually no need to make perfect parser and it is also requirement not to use DOM &SAX etc...
Soloutions:

!stack.empty()&&stack.peek()!=starting tag

,
then push it into stack, otherwise pop off, or indicate error if can`t pop off.

Questions:How can I sort them with "starting tag", "ending tag" and "without tag". Extremly thankful for any help. Thanks again!

I'm not sure I understand what you mean there. To be blunt, if I do understand you, it's wrong.

Work it out in English, don't try to do it in code until you have it worked out logically.

Try the simpler problem:
A well-formed string consists of a sequence of lower-case letters followed by the reverse sequence, upper case.
How do you check whether a string (say, "abcCBD")is well-formed? You're looking at one char at a time and you don't know how many more there are. You have a stack to work with, and you're returning a boolean to indicate "well-formed" or not.

You can state the solution in two sentences, or one compound sentence, so there's no need to do it in code. Solve that, and you've got the backbone of your validation problem.

Questions:How can I sort them with "starting tag", "ending tag" and "without tag".

<snide_mode>
Hm... is there any formal marker of a closing tag versus its corresponding opening tag? Let me think, I'm sure it'll come to me...

</snide_mode>

I thought, if I could sort xml document by "open tag+element"and"close tag+element" "without tag+element"and"/+element", then push the "open tag+element" into stack. pop off, if "/+element" encounters, otherwise indicate error. Thus
from xml: E.g.
<a>
" "<b>
" "" "<c>H</c>
" "</b>
</a>
wanted output would be:
a
" "b
" "" "c-H

I have deadline for this, please help. Teach me with something to move on. Treamendouly grateful for helping me to solve this. please!!!!!!!!!!!!

You've got two sorts of things here, really. You've got things that have angle brackets around them, which you want to check against certain rules, and you have things that don't have angle brackets around them, which you're going to echo to the output. (or, if you'd rather, you're going to build into an array of Strings for potential output, if the XML validates)

Having to echo the contents of the tags complicates things in a very minor way, but not seriously. Set that aside for the moment, and just discard the tag contents.

So you're going to go through the document, and you're going to encounter tokens of these two types. In your example, the tokens would be
<a>, " ", <b>, " "" ", <c>, H, </c>, " ", </b>, </a>

(where comma serves as our delimiter, of course)

So you encounter the tokens in this order. When you get a token, you can push it on the stack, or you can shove it on to the output, or you can pop the stack and do a comparison.

Now walk down that series of tokens and tell me what you're going to do with each one. Do that, and you should be ready to write your code.

By the way, I'm sorry about your deadline, but I have to trust your professor's judgement here - I'm going to assume that you had this assignment in time to finish it, and that you've been given the tools to solve it, so all I'm going to do is make vague suggestions, unless there's something so simple and abstruse that you couldn't be expected to know it. Trust me, you can solve this. I don't know if you've given yourself enough time to solve it before your deadline, but you can in fact solve it in finite time.

Dear jon.kiparsky
Thanks Sir, I`ve read it three times, got following questions.

1)How can I remove tags and output content of tags as string.
2)How can I compare<kind> to </kind>, if I need to pop off when they are same.

Could it possible for you to shed light by a few lines code please.
Sorry for taking so much of your time. I would do anything for solving it,
it is kind of important for me.

The String class has a lot of useful methods in it. For your current purposes, I can suggest a few to look at particularly.

String.indexOf() returns an integer value, which is the index of the first occurence of the argument, or -1 if the argument does not appear in the string. This can tell you whether a given character or String appears in a String.

String.replace() replaces every occurence of one char with another. Remember, '' is a valid char.

String.charAt() tells you the char value at a given location in the String. This might be handy for checking whether a given char is at a given location - sort of the reverse of "indexOf".

String.subString() will extract a portion of the String - so if you know where you can to start and where you want to end, this will give you the String's contents.

I don't think you'll need to use all of these, but any of them could be useful in solving the two problems you've mentioned.

Remember, if you do this right, you have a known closing tag and a known non-closing tag, and you just need to figure out whether N is the same in <N> and </N>. I think there are three easy ways to do this, coming from the methods I've just pointed you to.

Your item 2 suggests that you might not be thinking about this as I would. I can't tell, because you haven't spelled out your approach, but you say "I need to pop off when they are the same" - but you can't know if they're the same until you pop, right?
Examining each token should tell you whether you're going to add it to the output, or push it to the stack (and, I guess, put part of it in the output as well - that's different, but easy), or whether you're going to take something off the stack and make a decision.

Best of luck, I'm going off station. I expect to see this marked "solved" when I sign in in the morning! :)

Sorry I could not do it.

How can I compare<kind> to </kind>

If one starts with < (and not </) and the other starts with </ then compare the remaining for equality.

Easier than that, even. If you've popped one off the stack, you must have found a closing tag (a </ tag). That's the only time you're going to pop the stack, right? So you can remove the / ( ie, replace the '/' with a '') and compare the two strings directly.

@tedtdu - Sorry you missed your deadline. If you want to go back to the top and work through how this is done, we can still do that.

If one starts with < (and not </) and the other starts with </ then compare the remaining for equality.

Dear NormR1

Thanks for your hint. Let me ask further questions.
1)Shall I need to tokenize the string of "<Kind>Yes</Kind>",
Thus, "<kind","</kind" become independent tokens?
2)Could you please to tell me how to compare REMAINING for equality?

Sincerely
Thanks in advance

tedtdu

Dear jon.kiparsky
Thanks for kind supporting. I think I have to finish it and am still working on it.
Below is work that I have done so far, but UNsuccessfull yet. Help will highly be appreaciated and remembered. "stream.txt" is XML document.

import java.util.*;
import java.io.*;

public class LastHope {

	public static void main(String[] args){

		Stack<String> stack = new Stack<String>();
		try{
		BufferedReader bf = new BufferedReader(new FileReader("stream.txt"));
		
		String line;
		String de;
		String str;
		boolean flag=true;
		
		while((line=bf.readLine())!=null){
			de="<>/";
		 	StringTokenizer st = new StringTokenizer(line,de,true);
		 	while(st.hasMoreTokens()){
		 		str=st.nextToken();
		 		if(str.equals("<")){
		 			str=st.nextToken();
		 			if(!str.equals("/")){
		 				System.out.print("  " +stack.push(str));
		 			}else{
		 				str=st.nextToken();
		 				if(!str.equals(stack.peek())){
		 					System.out.println("err");
		 					break;
		 				}else{
		 					//System.out.println("pop "+stack.pop());
		 					stack.pop();
		 				}
		 			}
		 		}else if((str.equals(">"))&&str.equals(line.length())){
		 			System.out.println();
		 			//flag=false;
		 		}else if((str.equals(">"))&&!str.equals(line.length())){
		 			System.out.print("");
		 		}else{System.out.print(" "+str);
		 		}
		 	}
		}
		}catch(Exception e){System.out.println("exception");}
	}
}

Easier than that, even. If you've popped one off the stack, you must have found a closing tag (a </ tag). That's the only time you're going to pop the stack, right? So you can remove the / ( ie, replace the '/' with a '') and compare the two strings directly.

@jon.kiparsky. Yes, when "/" is encountered, stack should be popped off. Please excuse if it sounds strange. How do I know "/" is encountered if it is replaced with ''.
Could you please illustrate your commend with a few line of codes. Thanks for your attention.

The remaining part of the string is the string following the / would be the substring starting at index=2. Sample code to remove leading bits for comparing remaining:

String sub1 = begTok.subString(1); // drop leading <
String sub2 = endTok.substring(2); // drop leading </
sub1.equals(sub2) // test equality of remaining

Suggestion:
For testing, create a String array in your program and use a StringReader to wrap it.
That makes the testing self contained ie not requiring a separate file.

Be a part of the DaniWeb community

We're a friendly, industry-focused community of developers, IT pros, digital marketers, and technology enthusiasts meeting, networking, learning, and sharing knowledge.