parsing html

Question

Phaelax 52 Practically a Posting Shark

18 Years Ago

The problem isn't actually the parsing; I can't even get to that part yet. The webpage uses a different character set, "windows-1252". But even after setting the reader to use that charset (which exists on the system), I still get a ChangedCharSetException.

String link = "myurl.com";

URL url = new URL(link);
URLConnection conn = url.openConnection();
Reader reader = new InputStreamReader(conn.getInputStream(), Charset.forName("windows-1252"));

EditorKit kit = new HTMLEditorKit();
HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
// throws the exception here while reading
kit.read(reader, doc, 0);

Here are the first few lines of the HTML file:

<html>
<head>
<meta http-equiv="Content-Language" content="en-us">
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
<meta http-equiv="Pragma" content="no-cache">

Is there perhaps some way of reading the file while ignoring the metadata?

java

server_crash 64 Postaholic

18 Years Ago

I read somewhere that the Microsoft encoding name isn't really a valid encoding name! I haven't read anything about what to change it to, though. You could also look into adding support for it via a CharsetProvider. My guess is that the charset is not supported, so try this:

boolean isSupported(String charsetName)

and see if it is or not.
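A minimal sketch of that check, assuming the method meant here is the static `Charset.isSupported` from `java.nio.charset`. The class name `CharsetCheck` and the helper `supported` are made up for illustration:

```java
import java.nio.charset.Charset;

class CharsetCheck {
    // Hypothetical helper: true if this JVM can supply the named charset.
    // A syntactically illegal name makes Charset.isSupported throw
    // IllegalCharsetNameException (an IllegalArgumentException); treat that as "no".
    static boolean supported(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("windows-1252 supported: " + supported("windows-1252"));
    }
}
```

If this prints `true`, the charset name is not the cause and the exception must come from somewhere else in the read.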

server_crash 64 Postaholic

18 Years Ago

I took a look at the exception, and it seems a bit weird. As the name implies, it happens whenever the charset is changed...

But when and why is it changed?? (I guess that would solve everything)

I'm only taking a stab here, but I think you need a decoding step. The read method or the editor kit is converting to some format that it likes, regardless of whether you specify otherwise. I don't know what that format is or how to find out. I just think you need to convert before you try to read...

Maybe I'm wrong, but it could be worth a try.

server_crash 64 Postaholic

18 Years Ago

Reader reader = new InputStreamReader(conn.getInputStream(),cs.newDecoder());

cs.newDecoder() just creates a new decoder. The question is, will the reader automatically decode using your decoder? You'll have to answer that one because I don't know.

I haven't tried parsing anything yet, but it's not throwing the error anymore.

So it's working now? Not sure why it wouldn't work before and work now. The only time I've seen something like this is when a crappy IDE like BlueJ is used... doubt that's the problem.

allexx 0 Newbie Poster · Answer 1 · 2006-02-21T13:32:42+00:00

Hello!

Look at Html Parse Demo example here:
http://javafaq.nu/java-example-code-656.html

If it does not work for you, just search for "HTMLEditorKit" there.
I found ~10 examples on how to handle different tags.
By the way, all the examples there are organized by API, package, and class, so you can quickly find everything you need yourself.

allexx

Phaelax 52 Practically a Posting Shark · Answer 2 · 2006-02-23T06:11:08+00:00

I checked to see if it was supported, and it said it was.

Phaelax 52 Practically a Posting Shark · Answer 3 · 2006-02-23T21:04:20+00:00

This is driving me nuts.
Should this not be decoding?

Charset cs = Charset.forName("windows-1252");
Reader reader = new InputStreamReader(conn.getInputStream(),cs.newDecoder());

Did some searching around the bug forums on Sun's website. Though it's not a bug, I found related problems, and this seems to work OK. I haven't tried parsing anything yet, but it's not throwing the error anymore.

HTMLDocument doc = (HTMLDocument)kit.createDefaultDocument();
// added this call so the parser ignores the charset meta directive
doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);

kit.read(reader, doc, 0);
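To see the effect of the property in isolation, here is a small sketch that parses an in-memory page instead of a URL. The `PAGE` string, the class name `IgnoreCharsetDemo`, and the helper `parses` are assumptions made for the example. The meta tag is the one from the original page.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.BadLocationException;
import javax.swing.text.ChangedCharSetException;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

class IgnoreCharsetDemo {
    // Hypothetical sample page standing in for the real site's HTML.
    static final String PAGE =
        "<html><head>"
        + "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=windows-1252\">"
        + "</head><body><p>hello</p></body></html>";

    // Returns true if the page was read, false if the kit threw ChangedCharSetException.
    static boolean parses(boolean ignoreDirective) throws IOException, BadLocationException {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        if (ignoreDirective) {
            // Same property as in the post above, set before reading.
            doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        }
        try (Reader reader = new StringReader(PAGE)) {
            kit.read(reader, doc, 0);
            return true;
        } catch (ChangedCharSetException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("without property: parsed = " + parses(false));
        System.out.println("with property:    parsed = " + parses(true));
    }
}
```

Note that the exception is raised by the meta tag itself, not by a mismatch with the reader's charset, which would explain why choosing "windows-1252" for the reader made no difference.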

Phaelax 52 Practically a Posting Shark · Answer 4 · 2006-02-24T10:19:39+00:00

Heh, I was using BlueJ; I need to update my NetBeans. But it started working after I set the property to ignore the charset directive; before that, I didn't have it set. Anyway, I got my document parsed and everything sorted the way I want it.

Phaelax 52 Practically a Posting Shark · Answer 5 · 2006-02-27T07:21:33+00:00

Got a new problem. I just can't seem to figure out how to get the value of the anchor tag: not the href attribute, but the text between the opening and closing tags.

<a href="http://something.com">i want this text</a>

Here's the full code for how I'm currently getting the attributes.

import java.net.*;
import java.io.*;
import javax.swing.text.html.*;
import javax.swing.text.EditorKit;
import javax.swing.text.*;
import java.nio.charset.Charset;
 
public class Nullsoft
{
	
	public Nullsoft()
	{
		/*
		 * iTunes radio station lists?
		 * http://pri.kts-af.net/
		 */
		
		
		String genre = "ambient";
		String link = "http://yp.shoutcast.com/directory/index.phtml?s=" + genre;
		
		URL url = null;
		try
		{
			url = new URL(link);
			URLConnection conn = url.openConnection();
			Reader reader = new InputStreamReader(conn.getInputStream());
			
			EditorKit kit = new HTMLEditorKit();
			HTMLDocument doc = (HTMLDocument)kit.createDefaultDocument();
			doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
			kit.read(reader, doc, 0);
			
			HTMLDocument.Iterator it = doc.getIterator(HTML.Tag.A);
			
			while(it.isValid())
			{
				SimpleAttributeSet s = (SimpleAttributeSet)it.getAttributes();
				String href = (String)s.getAttribute(HTML.Attribute.HREF);
				System.out.println(href);
				it.next();
			}
			
		}
		catch(ChangedCharSetException e){
			System.out.println(e.getCharSetSpec());
		}
		catch(Exception e){
			System.out.println(e);
		}
		
	}
	
	/**
	 *
	 */
	public static void main(String[] args)
	{
		Nullsoft ns = new Nullsoft();
	}
}

Phaelax 52 Practically a Posting Shark · Answer 6 · 2006-02-27T07:26:11+00:00

Figures, as soon as I post this I get an idea. Since I can get the offsets of the tag within the document, why not just extract the text straight from the document myself?

int start = it.getStartOffset();
int end = it.getEndOffset();
String name = doc.getText(start, end-start);

I thought that might return the tags themselves, but it doesn't. It gives me exactly what I wanted.
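Put together with the iterator loop from the earlier post, the offset approach might look like the sketch below. It parses a made-up in-memory page (`SAMPLE`) rather than the shoutcast URL; the class name `AnchorText` and the helper `links` are illustrative only.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.AttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

class AnchorText {
    // Hypothetical sample page with two links.
    static final String SAMPLE =
        "<html><body>"
        + "<p><a href=\"http://something.com\">i want this text</a></p>"
        + "<p><a href=\"http://example.org\">second link</a></p>"
        + "</body></html>";

    // Returns one {href, text} pair per anchor found in the HTML.
    static List<String[]> links(String html) throws Exception {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        kit.read(new StringReader(html), doc, 0);

        List<String[]> result = new ArrayList<>();
        for (HTMLDocument.Iterator it = doc.getIterator(HTML.Tag.A); it.isValid(); it.next()) {
            AttributeSet attrs = it.getAttributes();
            String href = (String) attrs.getAttribute(HTML.Attribute.HREF);
            // The iterator's offsets cover the anchor's content in the document,
            // so getText returns the link text without the tags.
            int start = it.getStartOffset();
            int end = it.getEndOffset();
            String text = doc.getText(start, end - start);
            result.add(new String[] { href, text });
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        for (String[] link : links(SAMPLE)) {
            System.out.println(link[0] + " -> " + link[1]);
        }
    }
}
```

Reading the attributes through the `AttributeSet` interface avoids the cast to `SimpleAttributeSet`, which depends on the concrete class the iterator happens to return.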

server_crash 64 Postaholic · Answer 7 · 2006-02-27T07:41:53+00:00

I think I did the exact same thing once when writing a console web crawler. It took some tricky work with the indexOf() method.

Well, actually it was the link I was after.

sjoshi 0 Newbie Poster · Answer 8 · 2006-03-23T06:06:51+00:00

Hi,

I am getting a javax.swing.text.ChangedCharSetException when I use the following code. Where do I set the property that you are talking about? (I have a meta tag that is causing the exception.)

try {
	Reader r = new FileReader("PJMData.htm");
	ParserDelegator parser = new ParserDelegator();
	HTMLEditorKit.ParserCallback callback = new PJMParser();
	parser.parse(r, callback, false);
} catch (IOException e) {
	e.printStackTrace();
}

Let me know. Thanks.

Phaelax 52 Practically a Posting Shark · Answer 9 · 2006-03-23T07:06:57+00:00

I'm not sure in your case, since you're using the parser callback, whereas I read the HTML into a document. The property I mentioned is set on the document itself.

doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);

Could you read your htm file into a document first then use the parser on it?
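Another option that may be simpler for the callback case: the third argument of `ParserDelegator.parse(Reader, HTMLEditorKit.ParserCallback, boolean)` is the `ignoreCharSet` flag, so passing `true` instead of `false` should have the same effect as the document property. A minimal sketch, with a made-up in-memory page and a trivial callback standing in for `PJMParser` (the class name `DelegatorIgnoreCharset` and the helper `parseText` are illustrative):

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class DelegatorIgnoreCharset {
    // Hypothetical sample page with a charset meta tag like the one in the thread.
    static final String PAGE =
        "<html><head>"
        + "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=windows-1252\">"
        + "</head><body><p>hello</p></body></html>";

    // Parses PAGE and returns all text handed to the callback.
    // With ignoreCharSet == false the meta tag makes parse() throw
    // ChangedCharSetException, which is an IOException.
    static String parseText(boolean ignoreCharSet) throws IOException {
        StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                text.append(data);
            }
        };
        try (Reader r = new StringReader(PAGE)) {
            new ParserDelegator().parse(r, callback, ignoreCharSet);
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println("text with ignoreCharSet=true: " + parseText(true));
    }
}
```

In the code from the question, that would mean changing only the last argument of `parser.parse(r, callback, false)`.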
