Hello

I am new to Java and I am trying to find built-in ways to parse HTML. Can you suggest some classes in the Java Standard Library?

I have extended the class HTMLEditorKit.ParserCallback, but parsing a web page's source code literally takes 2 minutes or more! Some of that is because my internet connection is only downloading at about 9 KB per second right now :P but never mind that.

Here is what I have done with HTMLEditorKit.ParserCallback (I have overridden the handleStartTag() and handleText() methods):

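import java.util.Vector;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;

public class SearchResultGatherer extends HTMLEditorKit.ParserCallback implements Runnable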
{

	/// Class Variables:
	
	int index = 0;
	
	private Vector <SearchResult> searchResults;
	private SearchEngine          searchEngine;
	private View                  appView;
	private double                searchTime;
	private int                   searchQuantity;	
	
	/// Class Methods:
	
	public SearchResultGatherer( Vector <SearchResult> _searchResults, SearchEngine _searchEngine, View _appView )
	{
		searchResults     = _searchResults;
		searchEngine      = _searchEngine;
		appView           = _appView;
		searchTime        = -1;
		searchQuantity    = -1;

		Thread thisThread = new Thread( this );
		thisThread.start();
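		// Maybe wait for the worker here. Note that Thread has no invokeAndWait();
		// thisThread.join() waits for the thread, SwingUtilities.invokeAndWait() runs code on the event thread.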
	}
	
	@Override
	public void handleStartTag( HTML.Tag t, MutableAttributeSet a, int pos )
	{
		for ( int i=0; i<searchEngine.targetElementInfo.length; i++ )
		{
			
			if ( t.toString().equals( searchEngine.targetElementInfo[i][0] ) )
			{
				if ( a.toString().equals( searchEngine.targetElementInfo[i][1] ) )
				{
					searchEngine.targetElementIdentified = true;
					return;
				}
				else if (i == 1)
				{
					System.out.println( "Element = " + t.toString() );
					System.out.println( "id      = " + a.toString() );
					searchEngine.targetElementIdentified = true;
					return;
				}
			}
			
		}
	}
	
	
	@Override
	public void handleText( char[] arg0, int arg1 )
	{
		if ( searchEngine.targetElementIdentified )
		{
			System.out.println( arg0 );
			// System.out.println( "String arg = " + arg0.toString() );
			// System.out.println( "Int arg    = " + arg1 );
			Object searchData[]                  = searchEngine.retrieveSearchData( arg0 );
			searchTime                           = Double.parseDouble( searchData[0].toString() );
			searchQuantity                       = Integer.parseInt  ( searchData[1].toString() ); 
			//searchEngine.targetElementIdentified = false;
		}
	}

.....

Hmm... Parsing HTML text is not easy. I am not sure what the purpose of your Java class is. I assume you are printing out a tag and its value? How would you print out a value nested inside a tag, then? What would you do with ill-formed HTML text? Would you be able to handle custom tags as well?

You could, however, implement an HTML text parser yourself in Java, but you need to be clear about how each of these situations is handled. Using someone else's library is OK as long as you completely understand what it does.
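
For the nested-value question, here is a minimal sketch that stays within the standard library (javax.swing.text.html.parser.ParserDelegator driving an HTMLEditorKit.ParserCallback). It is not the original poster's code: the class name TargetTextCollector, the collect() helper and the "stats" id are made up for illustration, and it assumes the target element is identified by a tag plus an id attribute. It keeps a depth counter so that text inside nested child tags is still collected, and it relies on the parser reporting a matching end tag for every start tag it reports.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects all text inside one target element, including nested tags.
class TargetTextCollector extends HTMLEditorKit.ParserCallback {
    private final HTML.Tag targetTag;
    private final String targetId;
    private final List<String> collected = new ArrayList<>();
    private int depth = 0; // greater than 0 while the parser is inside the target element

    TargetTextCollector(HTML.Tag targetTag, String targetId) {
        this.targetTag = targetTag;
        this.targetId = targetId;
    }

    @Override
    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        if (depth > 0) {
            depth++; // a nested element inside the target
        } else if (t == targetTag && targetId.equals(a.getAttribute(HTML.Attribute.ID))) {
            depth = 1; // entering the target element
        }
    }

    @Override
    public void handleEndTag(HTML.Tag t, int pos) {
        if (depth > 0) {
            depth--; // leaving a nested element, or the target itself
        }
    }

    @Override
    public void handleText(char[] data, int pos) {
        if (depth > 0) {
            collected.add(new String(data));
        }
    }

    // Parses HTML that is already in memory and returns the text chunks found in the target.
    static List<String> collect(String html, HTML.Tag tag, String id) {
        TargetTextCollector callback = new TargetTextCollector(tag, id);
        try (Reader reader = new StringReader(html)) {
            new ParserDelegator().parse(reader, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return callback.collected;
    }

    public static void main(String[] args) {
        String html = "<div id=\"stats\">About <b>1,000</b> results</div><p>other</p>";
        System.out.println(collect(html, HTML.Tag.DIV, "stats"));
    }
}
```

Because collect() parses a String rather than a live connection, you could download the page first and time the two steps separately, which should show how much of the 2 minutes is the download and how much is the parsing.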
