I've been experimenting with HTMLEditorKit and HTMLDocument. I've managed to do the parsing I needed, but I also need the complete source of the document so I can pass it to a WebKit renderer. Java's HTMLDocument discards some tags when I read the file in.

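import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

HTMLEditorKit kit = new HTMLEditorKit();
HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
// read() accepts a Reader or an InputStream; myFile wraps the source file
kit.read(myFile, doc, 0);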

// find BODY tag and insert necessary html code here

/* now grab source text from document */
String sourceText = doc.getText(0, doc.getLength());

The source file contains a doctype and link tags pointing to style sheets. Those lines are being thrown out, so when I read the source back, I don't get the full source I expect.
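One thing that may be worth checking here: getText() returns only the document's text content, not its markup, while HTMLEditorKit.write() serializes the document model back to HTML. The sketch below puts the two side by side. The class and method names (HtmlDump, textOf, markupOf) are made up for illustration. Note that write() regenerates markup from the model, so anything the parser did not keep will not reappear in its output.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import javax.swing.text.BadLocationException;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

// Hypothetical helper class, for illustration only.
final class HtmlDump {

    // Parses an HTML string into a fresh (empty) HTMLDocument.
    static HTMLDocument parse(HTMLEditorKit kit, String html)
            throws IOException, BadLocationException {
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        kit.read(new StringReader(html), doc, 0);
        return doc;
    }

    // Text content only: no tags at all.
    static String textOf(String html) throws IOException, BadLocationException {
        HTMLDocument doc = parse(new HTMLEditorKit(), html);
        return doc.getText(0, doc.getLength());
    }

    // Markup regenerated from the document model (not the original file).
    static String markupOf(String html) throws IOException, BadLocationException {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = parse(kit, html);
        StringWriter out = new StringWriter();
        kit.write(out, doc, 0, doc.getLength());
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><head><title>t</title></head>"
                + "<body><p>hello</p></body></html>";
        System.out.println("getText():\n" + textOf(html));
        System.out.println("write():\n" + markupOf(html));
    }
}
```

Comparing the write() output against the original file should show which tags the parser kept and which it dropped.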

Now, the API docs for HTMLEditorKit state:
"When inserting into a non-empty document all tags outside of the body (head, title) will be dropped."

Am I not reading my original file into an empty document like I thought?

As new messages come into my program, I need to append several tags at the end of the BODY, and sometimes inside a DIV just before its closing tag. The full source is then passed on to the WebKit renderer.
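For the simpler of those two cases (appending at the end of BODY), one possible workaround is to skip the document model entirely and keep the page as a plain String, so the doctype and link tags are never touched. This is only a minimal sketch under that assumption, not something HTMLEditorKit provides. BodyAppender and appendToBody are hypothetical names. Inserting inside a particular DIV needs real tag matching, which this does not attempt.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: edits the raw source text instead of a parsed model.
final class BodyAppender {

    // Matches a closing BODY tag in any letter case, e.g. </body> or </BODY >.
    private static final Pattern BODY_END =
            Pattern.compile("</body\\s*>", Pattern.CASE_INSENSITIVE);

    // Returns source with fragment inserted just before the last closing
    // BODY tag. If there is no closing BODY tag, fragment is appended.
    static String appendToBody(String source, String fragment) {
        Matcher m = BODY_END.matcher(source);
        int insertAt = -1;
        while (m.find()) {
            insertAt = m.start(); // remember the position of the last match
        }
        if (insertAt < 0) {
            return source + fragment;
        }
        return source.substring(0, insertAt) + fragment + source.substring(insertAt);
    }

    public static void main(String[] args) {
        String page = "<!DOCTYPE html>\n<html><head>"
                + "<link rel=\"stylesheet\" href=\"style.css\">"
                + "</head><body><p>first</p></body></html>";
        // Everything outside BODY, including the doctype, is left as-is.
        System.out.println(appendToBody(page, "<p>second</p>"));
    }
}
```

Since the original text is never re-serialized, whatever goes to the renderer is the file's own markup plus the appended fragment.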

Hopefully this makes sense to someone who can suggest an approach.

I ended up using Jericho to do what I needed.
