Hello

I am new to Java and I am trying to find built-in ways (classes that ship with Java) to parse HTML. Can you suggest some classes in the Java Standard Library?

I have extended the class HTMLEditorKit.ParserCallback, but parsing a web page's source code literally takes 2 minutes or more! Part of that is my internet connection, which is only getting about 9 KB per second of download right now :P but never mind that.

Here is what I have done with HTMLEditorKit.ParserCallback (I have overridden the handleStartTag() and handleText() methods):

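import java.util.Vector;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;

public class SearchResultGatherer extends HTMLEditorKit.ParserCallback implements Runnable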
{

	/// Class Variables:
	
	int index = 0;
	
	private Vector <SearchResult> searchResults;
	private SearchEngine          searchEngine;
	private View                  appView;
	private double                searchTime;
	private int                   searchQuantity;	
	
	/// Class Methods:
	
	public SearchResultGatherer( Vector <SearchResult> _searchResults, SearchEngine _searchEngine, View _appView )
	{
		searchResults     = _searchResults;
		searchEngine      = _searchEngine;
		appView           = _appView;
		searchTime        = -1;
		searchQuantity    = -1;

		Thread thisThread = new Thread( this );
		thisThread.start();
		// Maybe do
		// thisThread.invokeAndWait();
	}
	
	public void handleStartTag( HTML.Tag t, MutableAttributeSet a, int pos )
	{
		for ( int i=0; i<searchEngine.targetElementInfo.length; i++ )
		{
			
			if ( t.toString().equals( searchEngine.targetElementInfo[i][0] ) )
			{
				if ( a.toString().equals( searchEngine.targetElementInfo[i][1] ) )
				{
					searchEngine.targetElementIdentified = true;
					return;
				}
				else if (i == 1)
				{
					System.out.println( "Element = " + t.toString() );
					System.out.println( "id      = " + a.toString() );
					searchEngine.targetElementIdentified = true;
					return;
				}
			}
			
		}
	}
	
	
	public void handleText( char[] arg0, int arg1 )
	{
		if ( searchEngine.targetElementIdentified )
		{
			System.out.println( arg0 );
			// System.out.println( "String arg = " + arg0.toString() );
			// System.out.println( "Int arg    = " + arg1 );
			Object searchData[]                  = searchEngine.retrieveSearchData( arg0 );
			searchTime                           = Double.parseDouble( searchData[0].toString() );
			searchQuantity                       = Integer.parseInt  ( searchData[1].toString() ); 
			//searchEngine.targetElementIdentified = false;
		}
	}

.....

All 4 Replies

No suggestions for a Java HTML parser? :(

Hmm... Parsing HTML text is not easy. I am not sure what the purpose of your Java class is. I assume you are printing out each tag and its value? How would you print a value that is nested inside a tag, then? What would you do with malformed HTML? Would you be able to handle custom tags as well?

You could, however, implement an HTML text parser yourself in Java, but you need to be clear about how you will handle each of these situations. Using someone else's library is OK as long as you completely understand what it does.
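To make the nested-tag question concrete, here is a minimal sketch that stays inside the standard library (javax.swing.text.html.parser.ParserDelegator plus the same HTMLEditorKit.ParserCallback the original post extends). The class name NestedTextCollector, its parse() helper, and the idea of matching on an "id" attribute are my own assumptions for illustration, not something from the original code. It keeps a depth counter so that text inside nested copies of the target tag is still collected, and it parses from a String, so parse time can be measured separately from download time.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper (not from the original post): collects every text
// chunk found inside the first element with the given tag and id.
class NestedTextCollector extends HTMLEditorKit.ParserCallback {
    private final HTML.Tag targetTag;
    private final String targetId;
    private final List<String> collected = new ArrayList<>();
    private int depth = 0; // > 0 while the parser is inside the target element

    NestedTextCollector(HTML.Tag targetTag, String targetId) {
        this.targetTag = targetTag;
        this.targetId = targetId;
    }

    @Override
    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        if (!t.equals(targetTag)) {
            return; // other tags do not affect the depth counter
        }
        if (depth > 0) {
            depth++; // same tag nested inside the target
        } else if (targetId.equals(a.getAttribute(HTML.Attribute.ID))) {
            depth = 1; // entered the target element
        }
    }

    @Override
    public void handleEndTag(HTML.Tag t, int pos) {
        if (depth > 0 && t.equals(targetTag)) {
            depth--; // reaches 0 when the target element itself closes
        }
    }

    @Override
    public void handleText(char[] data, int pos) {
        if (depth > 0) {
            collected.add(new String(data));
        }
    }

    public List<String> getCollected() {
        return collected;
    }

    // Parses an in-memory HTML string; no network access is involved.
    public static NestedTextCollector parse(String html, HTML.Tag tag, String id) {
        NestedTextCollector collector = new NestedTextCollector(tag, id);
        try {
            new ParserDelegator().parse(new StringReader(html), collector, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return collector;
    }
}
```

Calling NestedTextCollector.parse(html, HTML.Tag.DIV, "stats").getCollected() would return the text pieces found inside the div whose id is "stats", including text in child elements. As far as I know the Swing parser is built around an old HTML DTD and is fairly forgiving, but it makes no promises on modern or badly broken pages, so treat this as a starting point only.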

You had better check out some existing projects, because the code above covers just a very small part of a big idea; a Google search turns up plenty of them.

Try jsoup, it's pretty awesome and cool. Get it at jsoup.org.
