Feb 19th, 2006

parsing html

The problem isn't actually the parsing; I can't even get to that part yet. The webpage uses a different character set, "windows-1252". But even after setting the reader to use that charset (which exists on my system), I still get a ChangedCharSetException.


    String link = "http://myurl.com"; // placeholder; new URL() needs a protocol

    URL url = new URL(link);
    URLConnection conn = url.openConnection();
    Reader reader = new InputStreamReader(conn.getInputStream(), Charset.forName("windows-1252"));

    EditorKit kit = new HTMLEditorKit();
    HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
    // throws the exception here while reading
    kit.read(reader, doc, 0);

Here are the first few lines of the HTML file:
<html>
<head>
<meta http-equiv="Content-Language" content="en-us">
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
<meta http-equiv="Pragma" content="no-cache">

Is there perhaps some way of reading the file but ignoring the meta data?
Phaelax

Feb 21st, 2006

Re: parsing html

Hello!

Look at the HTML Parse Demo example here:
http://javafaq.nu/java-example-code-656.html

If it does not work for you, just search for "HTMLEditorKit" there.
I found about 10 examples on how to handle different tags.
By the way, all the examples there are organized by API, package, and class, so you can quickly find everything you need yourself.

allexx

Feb 21st, 2006

Re: parsing html

I read somewhere that the Microsoft encoding name isn't really a valid encoding name! I haven't read anything about what to change it to, though. You could also look into adding support for it via a CharsetProvider. My guess is that the charset is not supported, so try this:

    boolean supported = Charset.isSupported("windows-1252");

and see whether it is or not.
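For anyone who wants to run that check on its own, here is a minimal sketch. The class name `CharsetCheck` and its two helper methods are made up for illustration; only `Charset.isSupported` and `Charset.forName` come from the JDK.

```java
import java.nio.charset.Charset;

public class CharsetCheck {

    /** Returns true if this JVM can look up a charset by the given name or alias. */
    public static boolean supported(String name) {
        return Charset.isSupported(name);
    }

    /** Resolves a name or alias to the canonical name of the charset. */
    public static String canonicalName(String name) {
        return Charset.forName(name).name();
    }

    public static void main(String[] args) {
        String name = "windows-1252";
        if (supported(name)) {
            System.out.println(name + " is supported as " + canonicalName(name));
        } else {
            System.out.println(name + " is not supported by this JVM");
        }
    }
}
```

Note that `Charset.isSupported` only says whether the JVM knows the charset; it says nothing about what the HTML parser does when it meets a charset declaration in the page.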
server_crash

Feb 22nd, 2006

Re: parsing html

I checked to see if it was supported, and it said it was.
Phaelax

Feb 22nd, 2006

Re: parsing html

I took a look at the exception, and it seems a bit weird. As the name implies, it happens whenever the charset is changed.

But when and why is it changed? (I guess answering that would solve everything.)

I'm only taking a stab here, but I think you need some decoding or something. The read method or the editor kit is converting to some format that it likes, regardless of whether you specify otherwise. I don't know what that format is or how to find out. I just think you need to convert before you try to read.

Maybe I'm wrong, but it could be worth a try.
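One way to see when the exception is raised is to feed the parser a small in-memory page and inspect what it reports. The sketch below assumes the exception comes from the parser meeting the Content-Type meta tag while reading from a Reader, which matches the symptoms in this thread. The class name `CharsetDirectiveProbe`, the `SAMPLE` page, and the helper method are invented for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.BadLocationException;
import javax.swing.text.ChangedCharSetException;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

public class CharsetDirectiveProbe {

    /** A tiny page with the same kind of meta tag as the one in the first post. */
    public static final String SAMPLE =
        "<html><head>"
        + "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=windows-1252\">"
        + "</head><body><p>hello</p></body></html>";

    /**
     * Parses the HTML and returns the charset spec carried by the exception,
     * or null if no ChangedCharSetException was raised.
     */
    public static String reportedCharsetSpec(String html)
            throws IOException, BadLocationException {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        try {
            kit.read(new StringReader(html), doc, 0);
            return null;
        } catch (ChangedCharSetException e) {
            // The spec is the content the page declared, not a charset the reader used.
            return e.getCharSetSpec();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("Reported spec: " + reportedCharsetSpec(SAMPLE));
    }
}
```

If that assumption holds, the charset passed to the InputStreamReader makes no difference: the parser only sees characters, so it cannot tell which encoding produced them and reports the declared one to the caller.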
server_crash

Feb 23rd, 2006

Re: parsing html

This is driving me nuts.
Shouldn't this be decoding?

    Charset cs = Charset.forName("windows-1252");
    Reader reader = new InputStreamReader(conn.getInputStream(), cs.newDecoder());

I did some searching around the bug forums on Sun's website. Though it's not a bug, I found related problems, and this seems to work OK. I haven't tried parsing anything yet, but it's not throwing the error anymore.

    HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
    // added this call so the parser ignores the charset declared in the meta tag
    doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);

    kit.read(reader, doc, 0);
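To compare the two behaviours side by side without a network connection, here is a small self-contained sketch that parses an in-memory page with and without the property. The class name `IgnoreCharsetDirectiveDemo`, the `SAMPLE` page, and the `bodyText` helper are assumptions made for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.BadLocationException;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

public class IgnoreCharsetDirectiveDemo {

    /** A tiny page that declares its charset in a meta tag. */
    public static final String SAMPLE =
        "<html><head>"
        + "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=windows-1252\">"
        + "</head><body><p>hello</p></body></html>";

    /**
     * Parses the HTML and returns the document text.
     * When ignoreDirective is false, the meta tag is expected to trigger
     * a ChangedCharSetException (a subclass of IOException).
     */
    public static String bodyText(String html, boolean ignoreDirective)
            throws IOException, BadLocationException {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        if (ignoreDirective) {
            doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        }
        kit.read(new StringReader(html), doc, 0);
        return doc.getText(0, doc.getLength()).trim();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("With the property set: " + bodyText(SAMPLE, true));
        try {
            bodyText(SAMPLE, false);
        } catch (IOException e) {
            System.out.println("Without the property: " + e.getClass().getName());
        }
    }
}
```

With the property set, the reader's own charset is the only thing that decides how the bytes are decoded, so it still has to match the page.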
Phaelax

Feb 23rd, 2006

Re: parsing html

Quote originally posted by Phaelax ...
    Reader reader = new InputStreamReader(conn.getInputStream(), cs.newDecoder());
cs.newDecoder() just creates a new decoder. The question is, will the reader automatically decode using your decoder? You'll have to answer that one because I don't know.

Quote ...
    I haven't tried parsing anything yet, but it's not throwing the error anymore.
So it's working now? Not sure why it wouldn't work before and does now. The only time I've seen something like this is when a crappy IDE like BlueJ is used, but I doubt that's the problem.
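On the decoder question: an InputStreamReader built with a CharsetDecoder does decode with that decoder. The sketch below reads the same bytes once through a decoder and once through a plain Charset and shows that both give the same text for valid input. The class name `DecoderReaderDemo` and its helpers are invented for illustration, and ISO-8859-1 is used only because every JVM is guaranteed to have it.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecoderReaderDemo {

    /** Drains a reader into a String. */
    public static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[256];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    /** Decodes the bytes through an explicitly created CharsetDecoder. */
    public static String decodeWithDecoder(byte[] bytes, Charset cs) throws IOException {
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(bytes), cs.newDecoder())) {
            return readAll(r);
        }
    }

    /** Decodes the bytes by handing the Charset itself to the reader. */
    public static String decodeWithCharset(byte[] bytes, Charset cs) throws IOException {
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(bytes), cs)) {
            return readAll(r);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = {0x63, 0x61, 0x66, (byte) 0xE9}; // "café" in ISO-8859-1
        System.out.println(decodeWithDecoder(bytes, StandardCharsets.ISO_8859_1));
        System.out.println(decodeWithCharset(bytes, StandardCharsets.ISO_8859_1));
    }
}
```

As far as I know, the two forms differ only in how malformed input is handled: a fresh decoder reports errors by default, while the Charset form substitutes a replacement character. Either way, the choice of reader has no bearing on the ChangedCharSetException, which is raised later by the HTML parser.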
server_crash

Feb 24th, 2006

Re: parsing html

Heh, I was using BlueJ. I need to update my NetBeans. But it started working after I set the property to ignore the charset; before, I didn't have that set. Anyway, I got my document parsed and everything sorted the way I want it.
Phaelax

Feb 26th, 2006

Re: parsing html

Got a new problem. I just can't seem to figure out how to get the value of the Anchor tag. Not the href attribute, but the value between the opening and closing tags.

    <a href="http://something.com">i want this text</a>


Here's the full code for how I'm currently getting the attributes.

    import java.net.*;
    import java.io.*;
    import javax.swing.text.*;
    import javax.swing.text.html.*;

    public class Nullsoft
    {

        public Nullsoft()
        {
            /*
             * iTunes radio station lists?
             * http://pri.kts-af.net/
             */

            String genre = "ambient";
            String link = "http://yp.shoutcast.com/directory/index.phtml?s=" + genre;

            try
            {
                URL url = new URL(link);
                URLConnection conn = url.openConnection();
                // no charset given, so the platform default is used
                Reader reader = new InputStreamReader(conn.getInputStream());

                EditorKit kit = new HTMLEditorKit();
                HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
                // ignore the charset declared in the page's meta tag
                doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
                kit.read(reader, doc, 0);

                // walk every <a> tag in the document
                HTMLDocument.Iterator it = doc.getIterator(HTML.Tag.A);

                while (it.isValid())
                {
                    AttributeSet s = it.getAttributes();
                    String href = (String) s.getAttribute(HTML.Attribute.HREF);
                    System.out.println(href);
                    it.next();
                }
            }
            catch (ChangedCharSetException e) {
                System.out.println(e.getCharSetSpec());
            }
            catch (Exception e) {
                System.out.println(e);
            }
        }

        /**
         * Runs the scraper once.
         */
        public static void main(String[] args)
        {
            Nullsoft ns = new Nullsoft();
        }
    }
Phaelax

Feb 26th, 2006

Re: parsing html

Figures: as soon as I posted this, I got an idea. Since I can get the offsets of the tag within the document, why not just extract the text straight from the document myself?

    int start = it.getStartOffset();
    int end = it.getEndOffset();
    String name = doc.getText(start, end - start);

I thought that might return the tags themselves, but it doesn't. It gives me exactly what I wanted.
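Putting the href lookup and the offset trick together, a self-contained version could look like the sketch below. It parses an in-memory string instead of a live URL so it can be run anywhere. The class name `AnchorTextDemo`, the sample markup, and the "href -> text" output format are assumptions for illustration only.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.AttributeSet;
import javax.swing.text.BadLocationException;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

public class AnchorTextDemo {

    /** Returns one "href -> link text" entry per anchor found in the HTML. */
    public static List<String> anchors(String html)
            throws IOException, BadLocationException {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        kit.read(new StringReader(html), doc, 0);

        List<String> result = new ArrayList<>();
        for (HTMLDocument.Iterator it = doc.getIterator(HTML.Tag.A); it.isValid(); it.next()) {
            AttributeSet attrs = it.getAttributes();
            String href = (String) attrs.getAttribute(HTML.Attribute.HREF);

            // The offsets refer to the document's text content, which holds no markup,
            // so this yields the text between the opening and closing tags.
            int start = it.getStartOffset();
            int end = it.getEndOffset();
            String text = doc.getText(start, end - start).trim();

            result.add(href + " -> " + text);
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><body><p>"
            + "<a href=\"http://a.example/\">first link</a> and "
            + "<a href=\"http://b.example/\">second</a>"
            + "</p></body></html>";
        for (String entry : anchors(html)) {
            System.out.println(entry);
        }
    }
}
```

The tags never show up because the document model stores only the rendered text and keeps the tag information as attributes on the elements.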
Phaelax