Suggestion Java ways to parse HTML

Question

gretty 0 Junior Poster

14 Years Ago

Hello

I am new to Java and am looking for built-in ways to parse HTML. Can you suggest some classes in the Java Standard Library?

I have extended the class HTMLEditorKit.ParserCallback, but parsing a web page's source code literally takes 2 minutes or more! Some of that is because my internet connection is only getting 9 KB per second of download right now :P but never mind that.

Here is what I have done with HTMLEditorKit.ParserCallback (I have overridden the handleStartTag() and handleText() methods):

public class SearchResultGatherer extends HTMLEditorKit.ParserCallback implements Runnable
{

	/// Class Variables:
	
	int index = 0;
	
	private Vector <SearchResult> searchResults;
	private SearchEngine          searchEngine;
	private View                  appView;
	private double                searchTime;
	private int                   searchQuantity;	
	
	/// Class Methods:
	
	public SearchResultGatherer( Vector <SearchResult> _searchResults, SearchEngine _searchEngine, View _appView )
	{
		searchResults     = _searchResults;
		searchEngine      = _searchEngine;
		appView           = _appView;
		searchTime        = -1;
		searchQuantity    = -1;

		Thread thisThread = new Thread( this );
		thisThread.start();
		// Maybe do
		// thisThread.invokeAndWait();
	}
	
	public void handleStartTag( HTML.Tag t, MutableAttributeSet a, int pos )
	{
		for ( int i=0; i<searchEngine.targetElementInfo.length; i++ )
		{
			
			if ( t.toString().equals( searchEngine.targetElementInfo[i][0] ) )
			{
				if ( a.toString().equals( searchEngine.targetElementInfo[i][1] ) )
				{
					searchEngine.targetElementIdentified = true;
					return;
				}
				else if (i == 1)
				{
					System.out.println( "Element = " + t.toString() );
					System.out.println( "id      = " + a.toString() );
					searchEngine.targetElementIdentified = true;
					return;
				}
			}
			
		}
	}
	
	
	public void handleText( char[] arg0, int arg1 )
	{
		if ( searchEngine.targetElementIdentified )
		{
			System.out.println( arg0 );
			// System.out.println( "String arg = " + arg0.toString() );
			// System.out.println( "Int arg    = " + arg1 );
			Object searchData[]                  = searchEngine.retrieveSearchData( arg0 );
			searchTime                           = Double.parseDouble( searchData[0].toString() );
			searchQuantity                       = Integer.parseInt  ( searchData[1].toString() ); 
			//searchEngine.targetElementIdentified = false;
		}
	}

.....

html-css java

peter_budo 2,532 Code tags enforcer

14 Years Ago

You had better check out some other projects, because the above code is just a very small portion of a big idea. Google results here

gretty 0 Junior Poster · Answer 1 · 2010-10-29T11:37:03+00:00

No suggestions for a Java HTML parser? :(

Taywin 312 Posting Virtuoso · Answer 2 · 2010-10-29T20:41:40+00:00

Hmm... Parsing HTML text is not easy. I am not sure what the purpose of your Java class is. I assume you are printing out a tag and its value? How would you then print out a value nested inside a tag? What would you do with ill-formed HTML text? Would you be able to handle custom tags as well?

You could, however, implement an HTML text parser yourself in Java, but you need to be clear about how to handle each of these situations. Using someone else's library is OK as long as you completely understand what it does.
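
To make the nesting question concrete, here is a minimal sketch that stays inside the Java Standard Library. It is not the original poster's class: TagTextCollector, collect() and the sample HTML are made-up names for illustration. It assumes the target element is identified by its tag plus an id attribute, and it uses a depth counter so that text inside nested elements of the same tag is still collected. It parses an in-memory string, so no network time is involved, and it makes no attempt to cope with badly broken markup.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper (not from the original post): collects all text found
// inside an element with a given tag and id, including text in nested elements.
class TagTextCollector extends HTMLEditorKit.ParserCallback {

    private final HTML.Tag target;   // tag to look for, e.g. HTML.Tag.DIV
    private final String targetId;   // value of the id attribute to match
    private final List<String> texts = new ArrayList<>();
    private int depth = 0;           // > 0 while we are inside the target element

    TagTextCollector(HTML.Tag target, String targetId) {
        this.target = target;
        this.targetId = targetId;
    }

    @Override
    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        if (depth > 0) {
            // Already inside the target: track nested tags of the same kind
            // so the matching end tag can be recognised later.
            if (t.equals(target)) {
                depth++;
            }
        } else if (t.equals(target)
                && targetId.equals(a.getAttribute(HTML.Attribute.ID))) {
            // Compare the id attribute directly instead of a.toString().
            depth = 1;
        }
    }

    @Override
    public void handleEndTag(HTML.Tag t, int pos) {
        if (depth > 0 && t.equals(target)) {
            depth--;
        }
    }

    @Override
    public void handleText(char[] data, int pos) {
        if (depth > 0) {
            texts.add(new String(data));
        }
    }

    // Parses an HTML string and returns the text pieces found inside the target.
    static List<String> collect(String html, HTML.Tag tag, String id) {
        TagTextCollector collector = new TagTextCollector(tag, id);
        try {
            // 'true' tells the parser to ignore any charset declared in the page.
            new ParserDelegator().parse(new StringReader(html), collector, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return collector.texts;
    }

    public static void main(String[] args) {
        String html = "<html><body>"
                + "<div id=\"stats\"><div>inner</div>tail</div>"
                + "<div id=\"other\">ignored</div>"
                + "</body></html>";
        System.out.println(collect(html, HTML.Tag.DIV, "stats"));
    }
}
```

Matching on a.getAttribute(HTML.Attribute.ID) rather than on a.toString() avoids depending on how the whole attribute set happens to be printed. Whether this approach is robust enough for real search-result pages is an open question, which is why a dedicated library may still be the better choice.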

olupotd 0 Newbie Poster · Answer 3 · 2015-01-28T15:42:48+00:00

Try jsoup, it's pretty awesome and cool. Get it at jsoup.org.