Views: 5234 | Replies: 12
Join Date: Mar 2004
Posts: 715
Reputation:
Rep Power: 6
Solved Threads: 28
The problem isn't actually the parsing; I can't even get to that part yet. The webpage uses a different character set, "windows-1252". But even after setting the reader to use that charset (which exists on the system), I still get a ChangedCharSetException.
String link = "myurl.com";
URL url = new URL(link);
URLConnection conn = url.openConnection();
Reader reader = new InputStreamReader(conn.getInputStream(),Charset.forName("windows-1252"));
EditorKit kit = new HTMLEditorKit();
HTMLDocument doc = (HTMLDocument)kit.createDefaultDocument();
//throws error here while reading
kit.read(reader, doc, 0);
Here's the first couple of lines from the HTML file:
<html>
<head>
<meta http-equiv="Content-Language" content="en-us">
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
<meta http-equiv="Pragma" content="no-cache">
Is there perhaps some way of reading the file while ignoring the metadata?
Join Date: Feb 2006
Posts: 1
Reputation:
Rep Power: 0
Solved Threads: 0
Hello!
Look at the HTML Parse Demo example here:
http://javafaq.nu/java-example-code-656.html
If it doesn't work for you, just search for "HTMLEditorKit" there.
I found about 10 examples of how to handle different tags.
By the way, all the examples there are organized by API, package, and class, so you can quickly find everything you need yourself.
allexx
Join Date: Jun 2004
Location: H4x0rville
Posts: 2,105
Reputation:
Rep Power: 9
Solved Threads: 18
I read somewhere that the Microsoft encoding name isn't really a valid encoding name! I haven't read anything about what to change it to, though. You could also look into adding support for it via a CharsetProvider. My guess is that the charset is not supported, so try this static method of java.nio.charset.Charset:
boolean isSupported(String charsetName)
and see if it is or not.
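For what it's worth, here is a minimal sketch of that check. The class name CharsetCheck and the helper isUsable are made up for illustration; only Charset.isSupported, Charset.forName, name() and aliases() come from the JDK. Whether "windows-1252" is actually available depends on the JVM you run it on.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

public class CharsetCheck {
    // Hypothetical helper: true if the running JVM supports the named charset.
    public static boolean isUsable(String charsetName) {
        try {
            return Charset.isSupported(charsetName);
        } catch (IllegalCharsetNameException e) {
            // Thrown when the name itself is syntactically illegal (e.g. contains spaces).
            return false;
        }
    }

    public static void main(String[] args) {
        String name = "windows-1252";
        if (isUsable(name)) {
            // Show the canonical name and aliases the JVM knows this charset by.
            Charset cs = Charset.forName(name);
            System.out.println(name + " is supported as " + cs.name()
                    + ", aliases: " + cs.aliases());
        } else {
            System.out.println(name + " is not supported by this JVM");
        }
    }
}
```

If the check comes back true, the charset itself is probably not what triggers the exception.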
Join Date: Jun 2004
Location: H4x0rville
Posts: 2,105
Reputation:
Rep Power: 9
Solved Threads: 18
I took a quick look at the exception, and it seems a bit weird. As the name implies, it is thrown whenever the charset is changed...
But when and why is it changed? (I guess answering that would solve everything.)
I'm only taking a stab here, but I think you need some decoding or something. The read method or the editor kit is converting to some format that it likes, regardless of whether you specify otherwise. I don't know what that format is, and I don't know how to find out. I just think you need to convert before you try to read...
Maybe I'm wrong, but it could be worth a try.
Join Date: Mar 2004
Posts: 715
Reputation:
Rep Power: 6
Solved Threads: 28
This is driving me nuts.
Should this not be decoding?
Charset cs = Charset.forName("windows-1252");
Reader reader = new InputStreamReader(conn.getInputStream(), cs.newDecoder());
I did some searching around the bug forums on Sun's website. Although it's not a bug, I found related problems, and this seems to work OK. I haven't tried parsing anything yet, but it's not throwing the error anymore.
HTMLDocument doc = (HTMLDocument)kit.createDefaultDocument();
//added this method call
doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
kit.read(reader, doc, 0);
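A self-contained sketch of the same fix, for anyone who wants to try it without a network connection. The class name CharsetDirectiveDemo, its helper methods, and the SAMPLE page are invented for illustration; the "IgnoreCharsetDirective" property is the one used in the snippet above. As far as I can tell, the parser throws as soon as it meets the meta directive unless that property is set, whatever charset the reader was opened with.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.BadLocationException;
import javax.swing.text.ChangedCharSetException;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

public class CharsetDirectiveDemo {
    // Made-up sample page carrying the same kind of <meta> charset directive.
    public static final String SAMPLE =
        "<html><head>"
        + "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=windows-1252\">"
        + "</head><body><p>Hello</p></body></html>";

    // Parses the HTML; sets "IgnoreCharsetDirective" first when asked to.
    public static HTMLDocument parse(String html, boolean ignoreDirective)
            throws IOException, BadLocationException {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        if (ignoreDirective) {
            doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        }
        kit.read(new StringReader(html), doc, 0);
        return doc;
    }

    // True if parsing without the property throws ChangedCharSetException.
    public static boolean throwsWithoutProperty(String html) {
        try {
            parse(html, false);
            return false;
        } catch (ChangedCharSetException e) {
            return true;
        } catch (IOException | BadLocationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns the document's plain text after parsing with the property set.
    public static String textIgnoringDirective(String html) {
        try {
            HTMLDocument doc = parse(html, true);
            return doc.getText(0, doc.getLength());
        } catch (IOException | BadLocationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Throws without property: " + throwsWithoutProperty(SAMPLE));
        System.out.println("Text with property set: " + textIgnoringDirective(SAMPLE).trim());
    }
}
```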
Join Date: Jun 2004
Location: H4x0rville
Posts: 2,105
Reputation:
Rep Power: 9
Solved Threads: 18
Originally Posted by Phaelax
Reader reader = new InputStreamReader(conn.getInputStream(),cs.newDecoder());
I haven't tried parsing anything yet, but it's not throwing the error anymore.
So it's working now? I'm not sure why it wouldn't work before but works now. The only time I've seen something like this is when a crappy IDE like BlueJ is used, though I doubt that's the problem.
Join Date: Mar 2004
Posts: 715
Reputation:
Rep Power: 6
Solved Threads: 28
Got a new problem. I just can't seem to figure out how to get the value of the anchor tag. Not the href attribute, but the text between the opening and closing tags.
<a href="http://something.com">i want this text</a>
Here's the full code for how I'm currently getting the attributes.
import java.io.*;
import java.net.*;
import javax.swing.text.*;
import javax.swing.text.html.*;

public class Nullsoft
{
    public Nullsoft()
    {
        /*
         * iTunes radio station lists?
         * http://pri.kts-af.net/
         */
        String genre = "ambient";
        String link = "http://yp.shoutcast.com/directory/index.phtml?s=" + genre;
        try
        {
            URL url = new URL(link);
            URLConnection conn = url.openConnection();
            Reader reader = new InputStreamReader(conn.getInputStream());
            EditorKit kit = new HTMLEditorKit();
            HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
            // ignore the <meta> charset directive so read() doesn't throw ChangedCharSetException
            doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
            kit.read(reader, doc, 0);

            // walk every <a> tag in the document and print its href attribute
            HTMLDocument.Iterator it = doc.getIterator(HTML.Tag.A);
            while (it.isValid())
            {
                SimpleAttributeSet s = (SimpleAttributeSet) it.getAttributes();
                String href = (String) s.getAttribute(HTML.Attribute.HREF);
                System.out.println(href);
                it.next();
            }
        }
        catch (ChangedCharSetException e) {
            System.out.println(e.getCharSetSpec());
        }
        catch (Exception e) {
            System.out.println(e);
        }
    }

    public static void main(String[] args)
    {
        new Nullsoft();
    }
}
Join Date: Mar 2004
Posts: 715
Reputation:
Rep Power: 6
Solved Threads: 28
Figures: as soon as I post this, I get an idea. Since I can get the offsets of the tag within the document, why not just extract the text straight from the document myself?
int start = it.getStartOffset();
int end = it.getEndOffset();
String name = doc.getText(start, end-start);
I thought that might return the tags themselves, but it doesn't. It gives me exactly what I wanted.
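Putting the offset trick together with the iterator loop, here is a rough, self-contained sketch that runs on an in-memory string instead of a live URL. The class name AnchorTextDemo, the method linkTexts, and the sample markup in main are made up for illustration; the calls on HTMLDocument.Iterator and doc.getText are the same ones used above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.swing.text.AttributeSet;
import javax.swing.text.BadLocationException;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

public class AnchorTextDemo {
    // Hypothetical helper: maps each anchor's href to the text between its tags.
    public static Map<String, String> linkTexts(String html) {
        Map<String, String> result = new LinkedHashMap<>();
        try {
            HTMLEditorKit kit = new HTMLEditorKit();
            HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
            doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
            kit.read(new StringReader(html), doc, 0);

            HTMLDocument.Iterator it = doc.getIterator(HTML.Tag.A);
            while (it.isValid()) {
                AttributeSet attrs = it.getAttributes();
                String href = (String) attrs.getAttribute(HTML.Attribute.HREF);
                // The offsets are positions in the document's text, not in the raw markup.
                int start = it.getStartOffset();
                int end = it.getEndOffset();
                result.put(href, doc.getText(start, end - start));
                it.next();
            }
        } catch (IOException | BadLocationException e) {
            throw new IllegalStateException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<html><body><p>See "
                + "<a href=\"http://something.com\">i want this text</a>"
                + " now.</p></body></html>";
        linkTexts(html).forEach((href, text) ->
                System.out.println(href + " -> " + text));
    }
}
```

The map is keyed by href only to keep the sketch short, so two links with the same target would overwrite each other; a list of pairs would avoid that.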