Join Date: Mar 2004
Posts: 715
Reputation: Phaelax is on a distinguished road 
Rep Power: 6
Solved Threads: 28
Phaelax
Master Poster

parsing html

  #1  
Feb 19th, 2006
The problem isn't actually the parsing; I can't even get to that part yet. The web page declares a different character set, "windows-1252". But even after setting the reader to use that charset (which is available on my system), I still get a ChangedCharSetException.


// placeholder address; new URL() needs the protocol or it throws MalformedURLException
String link = "http://myurl.com";

URL url = new URL(link);
URLConnection conn = url.openConnection();
// decode the response bytes as windows-1252
Reader reader = new InputStreamReader(conn.getInputStream(), Charset.forName("windows-1252"));

EditorKit kit = new HTMLEditorKit();
HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
// throws ChangedCharSetException here while reading
kit.read(reader, doc, 0);

Here are the first few lines of the HTML file:
<html>
<head>
<meta http-equiv="Content-Language" content="en-us">
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
<meta http-equiv="Pragma" content="no-cache">

Is there perhaps some way of reading the file but ignoring the meta data?
Join Date: Feb 2006
Posts: 1
Reputation: allexx is an unknown quantity at this point 
Rep Power: 0
Solved Threads: 0
allexx
Newbie Poster

Re: parsing html

  #2  
Feb 21st, 2006
Hello!

Look at Html Parse Demo example here:
http://javafaq.nu/java-example-code-656.html

If it does not work for you, just search for "HTMLEditorKit" there.
I found about 10 examples of how to handle different tags.
By the way, all the examples there are organized by API, package, and class, so you can quickly find everything you need yourself.

allexx
Join Date: Jun 2004
Location: H4x0rville
Posts: 2,105
Reputation: server_crash is on a distinguished road 
Rep Power: 9
Solved Threads: 18
server_crash
Postaholic

Re: parsing html

  #3  
Feb 21st, 2006
I read somewhere that the Microsoft encoding name isn't really a valid encoding name! I haven't read anything about what to change it to, though. You could also look into adding support for it via a CharsetProvider. My guess is that the charset is not supported, so try this method of java.nio.charset.Charset:

boolean isSupported(String charsetName)

and see if it is or not.
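
For reference, a minimal sketch of that check. The class name CharsetCheck and the helper supported() are made up for illustration; Charset.isSupported and Charset.forName are the standard java.nio.charset calls. Whether "windows-1252" is available depends on the JVM you run it on, so the program just prints the answer.

```java
import java.nio.charset.Charset;

public class CharsetCheck {
    // Hypothetical helper: true if this JVM can create a Charset for the given name.
    static boolean supported(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalArgumentException e) {
            // Charset.isSupported throws for syntactically illegal (or null) names
            return false;
        }
    }

    public static void main(String[] args) {
        String name = "windows-1252";
        System.out.println(name + " supported: " + supported(name));
        if (supported(name)) {
            // Show the canonical name and the aliases this JVM knows for it
            Charset cs = Charset.forName(name);
            System.out.println(cs.name() + " " + cs.aliases());
        }
    }
}
```

If this prints true, the charset itself is not the cause of the exception.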
Phaelax

Re: parsing html

  #4  
Feb 22nd, 2006
I checked to see if it was supported, and it said it was.
server_crash

Re: parsing html

  #5  
Feb 22nd, 2006
I took a quick look at the exception, and it seems a bit weird. It happens, as the name implies, whenever the charset is changed...

But when and why is it changed?? (I guess that would solve everything)

I'm only taking a stab here, but I think you need some decoding or something. The read method or the editor kit is converting to some kind of format that it likes, regardless of whether you specify otherwise. I don't know that format and I don't know how to find out. I just think you need to convert before you try to read...

Maybe I'm wrong, but it could be worth a try.
Phaelax

Re: parsing html

  #6  
Feb 23rd, 2006
This is driving me nuts.
Should this not be decoding?
Charset cs = Charset.forName("windows-1252");
Reader reader = new InputStreamReader(conn.getInputStream(),cs.newDecoder());

I did some searching around the bug forums on Sun's website. Although this isn't a bug, I found related problems, and this seems to work OK. I haven't tried parsing anything yet, but it's not throwing the error anymore.
HTMLDocument doc = (HTMLDocument)kit.createDefaultDocument();
//added this method call
doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);

kit.read(reader, doc, 0);
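
A small self-contained sketch of the difference the property makes, assuming the exception is triggered by the Content-Type meta tag in the page itself. The class name IgnoreCharsetDemo, the PAGE constant and the parse() helper are made up for illustration, and a StringReader stands in for the real URL connection.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.BadLocationException;
import javax.swing.text.ChangedCharSetException;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

public class IgnoreCharsetDemo {
    // Hypothetical stand-in for the downloaded page (same meta tag as in the first post)
    static final String PAGE =
        "<html><head>"
        + "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=windows-1252\">"
        + "</head><body><p>hello</p></body></html>";

    // Parses PAGE; returns "ok" on success, or a message if the parser reports a charset change
    static String parse(boolean ignoreDirective) {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        if (ignoreDirective) {
            // Ask the kit to skip charset directives found in meta tags
            doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        }
        try {
            kit.read(new StringReader(PAGE), doc, 0);
            return "ok";
        } catch (ChangedCharSetException e) {
            return "ChangedCharSetException: " + e.getCharSetSpec();
        } catch (IOException | BadLocationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("without property: " + parse(false));
        System.out.println("with property:    " + parse(true));
    }
}
```

Note that the property only stops the parser from reacting to the meta tag. The reader still has to be created with the right charset for the bytes to be decoded correctly.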
server_crash

Re: parsing html

  #7  
Feb 23rd, 2006
Originally Posted by Phaelax
Reader reader = new InputStreamReader(conn.getInputStream(),cs.newDecoder());
cs.newDecoder() just creates a new decoder. The question is, will the reader automatically decode using your decoder? You'll have to answer that one because I don't know.

Originally Posted by Phaelax
I haven't tried parsing anything yet, but it's not throwing the error anymore.

So it's working now? Not sure why it wouldn't work before and works now. The only time I've seen something like this is when a crappy IDE like BlueJ is used... I doubt that's the problem.
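
On the decoder question above: a reader built with cs.newDecoder() does decode with that decoder. A minimal sketch to check it, using an in-memory byte array instead of the URL connection (the class name DecoderDemo and the helper decode() are made up for illustration):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

public class DecoderDemo {
    // Reads every character from an InputStreamReader built on cs.newDecoder()
    static String decode(byte[] bytes, Charset cs) {
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), cs.newDecoder())) {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Byte 0x80 is the euro sign (U+20AC) in windows-1252
        byte[] bytes = { (byte) 0x80, '1', '0' };
        String text = decode(bytes, Charset.forName("windows-1252"));
        System.out.println(text.equals("\u20AC10"));
    }
}
```

Passing the Charset itself to the InputStreamReader constructor, as in the first post, should behave the same for well-formed input.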
Phaelax

Re: parsing html

  #8  
Feb 23rd, 2006
Heh, I was using BlueJ. I need to update my NetBeans. But it started working after I set the property to ignore the charset directive; before that, I didn't have it set. Anyway, I got my document parsed and everything sorted the way I want it.
Phaelax

Re: parsing html

  #9  
Feb 26th, 2006
Got a new problem. I just can't seem to figure out how to get the value of the Anchor tag. Not the href attribute, but the value between the opening and closing tags.

<a href="http://something.com">i want this text</a>


Here's the full code for how I'm currently getting the attributes.
import java.net.*;
import java.io.*;
import javax.swing.text.html.*;
import javax.swing.text.EditorKit;
import javax.swing.text.*;
import java.nio.charset.Charset;
 
public class Nullsoft
{
	
	public Nullsoft()
	{
		/*
		 * iTunes radio station lists?
		 * http://pri.kts-af.net/
		 */
		
		
		String genre = "ambient";
		String link = "http://yp.shoutcast.com/directory/index.phtml?s=" + genre;
		
		URL url = null;
		try
		{
			url = new URL(link);
			URLConnection conn = url.openConnection();
			Reader reader = new InputStreamReader(conn.getInputStream());
			
			EditorKit kit = new HTMLEditorKit();
			HTMLDocument doc = (HTMLDocument)kit.createDefaultDocument();
			doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
			kit.read(reader, doc, 0);
			
			HTMLDocument.Iterator it = doc.getIterator(HTML.Tag.A);
			
			while(it.isValid())
			{
				// getAttributes() is declared to return AttributeSet, so no cast is needed
				AttributeSet s = it.getAttributes();
				String href = (String)s.getAttribute(HTML.Attribute.HREF);
				System.out.println(href);
				it.next();
			}
			
		}
		catch(ChangedCharSetException e){
			System.out.println(e.getCharSetSpec());
		}
		catch(Exception e){
			System.out.println(e);
		}
		
	}
	
	/**
	 *
	 */
	public static void main(String[] args)
	{
		Nullsoft ns = new Nullsoft();
	}
}
Phaelax

Re: parsing html

  #10  
Feb 26th, 2006
Figures: as soon as I post this, I get an idea. Since I can get the offsets of the tag within the document, why not just extract the text straight from the document myself?

int start = it.getStartOffset();
int end = it.getEndOffset();
String name = doc.getText(start, end-start);

I thought that might return the tags themselves, but it doesn't. It gives me exactly what I wanted.
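
Putting the href lookup and the offset trick together, a self-contained sketch might look like this. The class name AnchorText, the helper links() and the sample HTML are made up for illustration, and a StringReader replaces the real URL connection so it can be run offline.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.AttributeSet;
import javax.swing.text.BadLocationException;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

public class AnchorText {
    // Returns one "href -> link text" entry for every <a> run found in the given HTML
    static List<String> links(String html) {
        List<String> out = new ArrayList<>();
        try {
            HTMLEditorKit kit = new HTMLEditorKit();
            HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
            doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
            kit.read(new StringReader(html), doc, 0);

            for (HTMLDocument.Iterator it = doc.getIterator(HTML.Tag.A); it.isValid(); it.next()) {
                AttributeSet attrs = it.getAttributes();
                String href = (String) attrs.getAttribute(HTML.Attribute.HREF);
                // The offsets delimit the text covered by the tag, not the markup
                int start = it.getStartOffset();
                int end = it.getEndOffset();
                out.add(href + " -> " + doc.getText(start, end - start));
            }
        } catch (IOException | BadLocationException e) {
            throw new IllegalStateException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<html><body><p>before "
            + "<a href=\"http://something.com\">i want this text</a>"
            + " after</p></body></html>";
        links(html).forEach(System.out::println);
    }
}
```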