parsing html
Join Date: Mar 2004
Posts: 763
Reputation:
Solved Threads: 38
The problem isn't actually the parsing; I can't even get to that part yet. The webpage uses a different character set, "windows-1252". But even after setting the reader to use that charset (which exists on my system), I still get a ChangedCharSetException.
    String link = "myurl.com";
    URL url = new URL(link);
    URLConnection conn = url.openConnection();
    Reader reader = new InputStreamReader(conn.getInputStream(), Charset.forName("windows-1252"));
    EditorKit kit = new HTMLEditorKit();
    HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
    // throws error here while reading
    kit.read(reader, doc, 0);
Here's the first couple lines from the html file:
    <html>
    <head>
    <meta http-equiv="Content-Language" content="en-us">
    <meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
    <meta http-equiv="Pragma" content="no-cache">
Is there perhaps some way of reading the file but ignoring the meta data?
Join Date: Feb 2006
Posts: 1
Reputation:
Solved Threads: 0
Hello!
Look at Html Parse Demo example here:
http://javafaq.nu/java-example-code-656.html
If it does not work for you, just search for "HTMLEditorKit" there.
I found ~10 examples on how to handle different tags.
By the way, all the examples there are organized by API, package, class... so you can quickly find everything you need yourself.
allexx
Join Date: Jun 2004
Posts: 2,108
Reputation:
Solved Threads: 18
I read somewhere that the Microsoft encoding name isn't really a valid encoding name! I haven't read anything about what to change it to, though. You could also look into adding support for it via a CharsetProvider. My guess is that the charset is not supported, so try this:
boolean isSupported(String charsetName)
and see if it is or not.
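For reference, `isSupported` is a static method on `java.nio.charset.Charset`. A minimal sketch of the check might look like this; the class name `CharsetCheck` and the helper `isAvailable` are made up for illustration, and whether "windows-1252" resolves depends on the JVM you run it on.

```java
import java.nio.charset.Charset;

class CharsetCheck {

    // Returns true if this JVM can resolve the given charset name or alias.
    // Hypothetical helper, not part of the JDK.
    static boolean isAvailable(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalArgumentException e) {
            // Charset.isSupported rejects syntactically illegal names
            // (and null) with an IllegalArgumentException subtype.
            return false;
        }
    }

    public static void main(String[] args) {
        String name = "windows-1252";
        System.out.println(name + " supported: " + isAvailable(name));
        if (isAvailable(name)) {
            // Show the canonical name the JVM maps this alias to.
            System.out.println("canonical name: " + Charset.forName(name).name());
        }
    }
}
```

If this reports the charset as supported, the exception is probably coming from somewhere other than the reader setup.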
Join Date: Jun 2004
Posts: 2,108
Reputation:
Solved Threads: 18
I took a look at the exception a little, and it seems a bit weird. As the name implies, it happens whenever the charset is changed...
But when and why is it changed? (I guess answering that would solve everything.)
I'm only taking a stab here, but I think you need some decoding or something. The read method or the editor kit is converting to some format that it likes, regardless of whether you specify otherwise. I don't know that format and I don't know how to find out. I just think you need to convert before you try to read...
Maybe I'm wrong, but it could be worth a try.
Join Date: Mar 2004
Posts: 763
Reputation:
Solved Threads: 38
This is driving me nuts.
Should this not be decoding?
    Charset cs = Charset.forName("windows-1252");
    Reader reader = new InputStreamReader(conn.getInputStream(), cs.newDecoder());
I did some searching around the bug forums on the Sun website. Though it's not a bug, I found related problems, and this seems to work OK. I haven't tried parsing anything yet, but it's not throwing the error anymore.
    HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
    // added this method call
    doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
    kit.read(reader, doc, 0);
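To see the effect of that property without hitting a live site, here is a small self-contained sketch that parses an in-memory page carrying the same kind of Content-Type meta tag. The class name `IgnoreCharsetDemo`, the sample markup, and both helper methods are made up for illustration; the behaviour described is what I'd expect from the standard Swing HTML parser, so treat it as a sketch rather than a guarantee.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.BadLocationException;
import javax.swing.text.ChangedCharSetException;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

class IgnoreCharsetDemo {

    // Hypothetical sample page with a charset directive in a meta tag.
    static final String SAMPLE = "<html><head>"
            + "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=windows-1252\">"
            + "</head><body><p>hello</p></body></html>";

    // Parses the HTML and returns the document's plain text.
    static String parseText(String html, boolean ignoreDirective)
            throws IOException, BadLocationException {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        if (ignoreDirective) {
            // Ask the kit to skip charset directives found in meta tags.
            doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        }
        try (Reader reader = new StringReader(html)) {
            kit.read(reader, doc, 0);
        }
        return doc.getText(0, doc.getLength());
    }

    // Returns true if parsing was aborted by a ChangedCharSetException.
    static boolean charsetChangeThrown(String html, boolean ignoreDirective) {
        try {
            parseText(html, ignoreDirective);
            return false;
        } catch (ChangedCharSetException e) {
            return true;
        } catch (IOException | BadLocationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("without property: " + charsetChangeThrown(SAMPLE, false));
        System.out.println("with property:    " + charsetChangeThrown(SAMPLE, true));
    }
}
```

Since the input here is already a `Reader`, the bytes have been decoded before the parser ever sees the meta tag, which is why ignoring the directive should be harmless as long as the reader was built with the right charset.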
Join Date: Jun 2004
Posts: 2,108
Reputation:
Solved Threads: 18
Originally Posted by Phaelax
Reader reader = new InputStreamReader(conn.getInputStream(),cs.newDecoder());
I haven't tried parsing anything yet, but it's not throwing the error anymore.
Join Date: Mar 2004
Posts: 763
Reputation:
Solved Threads: 38
Got a new problem. I just can't seem to figure out how to get the value of the Anchor tag. Not the href attribute, but the value between the opening and closing tags.
<a href="http://something.com">i want this text</a>
Here's the full code for how I'm currently getting the attributes.
    import java.net.*;
    import java.io.*;
    import javax.swing.text.html.*;
    import javax.swing.text.EditorKit;
    import javax.swing.text.*;
    import java.nio.charset.Charset;

    public class Nullsoft {

        public Nullsoft() {
            /*
             * iTunes radio station lists?
             * http://pri.kts-af.net/
             */
            String genre = "ambient";
            String link = "http://yp.shoutcast.com/directory/index.phtml?s=" + genre;
            URL url = null;
            try {
                url = new URL(link);
                URLConnection conn = url.openConnection();
                Reader reader = new InputStreamReader(conn.getInputStream());
                EditorKit kit = new HTMLEditorKit();
                HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
                // skip the charset meta tag so the parser doesn't throw
                doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
                kit.read(reader, doc, 0);

                // walk every anchor tag and print its href attribute
                HTMLDocument.Iterator it = doc.getIterator(HTML.Tag.A);
                while (it.isValid()) {
                    SimpleAttributeSet s = (SimpleAttributeSet) it.getAttributes();
                    String href = (String) s.getAttribute(HTML.Attribute.HREF);
                    System.out.println(href);
                    it.next();
                }
            } catch (ChangedCharSetException e) {
                System.out.println(e.getCharSetSpec());
            } catch (Exception e) {
                System.out.println(e);
            }
        }

        public static void main(String[] args) {
            Nullsoft ns = new Nullsoft();
        }
    }
Join Date: Mar 2004
Posts: 763
Reputation:
Solved Threads: 38
Figures, as soon as I posted this I got an idea. Since I can get the offsets of the tag within the document, why not just extract the text straight from the document myself?
    int start = it.getStartOffset();
    int end = it.getEndOffset();
    String name = doc.getText(start, end - start);
I thought that might return the tags themselves, but it doesn't. It gives me exactly what I wanted.
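Putting the two pieces together, a rough self-contained sketch of collecting both the href and the link text could look like the following. It parses an in-memory string instead of a URL so it can be run offline; the class name `AnchorTextDemo`, the `links` helper, and the sample markup are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.AttributeSet;
import javax.swing.text.BadLocationException;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

class AnchorTextDemo {

    // Returns one {href, text} pair per anchor found in the given HTML.
    // Hypothetical helper; checked exceptions are wrapped for brevity.
    static List<String[]> links(String html) {
        try {
            HTMLEditorKit kit = new HTMLEditorKit();
            HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
            doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
            kit.read(new StringReader(html), doc, 0);

            List<String[]> result = new ArrayList<>();
            HTMLDocument.Iterator it = doc.getIterator(HTML.Tag.A);
            while (it.isValid()) {
                AttributeSet attrs = it.getAttributes();
                String href = (String) attrs.getAttribute(HTML.Attribute.HREF);
                // The offsets delimit the anchor's content inside the document,
                // so getText returns the visible link text rather than markup.
                int start = it.getStartOffset();
                int end = it.getEndOffset();
                String text = doc.getText(start, end - start);
                result.add(new String[] { href, text });
                it.next();
            }
            return result;
        } catch (IOException | BadLocationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String html = "<html><body><p>See "
                + "<a href=\"http://something.com\">i want this text</a>"
                + " now.</p></body></html>";
        for (String[] link : links(html)) {
            System.out.println(link[0] + " -> " + link[1]);
        }
    }
}
```

The document model stores only the rendered text, with tags represented as element attributes, which would explain why `getText` does not return the markup.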