I've been working on this webcrawler and I've run into a problem. I can read the first URL and get all the URLs out of the HTML, but I can't seem to set up a looping structure that will work.

This is basically what it does:

Searches through the HTML of the first URL.
It may find, say, 20 other URLs contained in that one.
Stops.


How can I make it so that it continuously searches through the ones that were found?

Here is the code I have so far, but it's not complete:

import java.io.*;
import java.net.*;
import java.util.*;

public class CustomWebCrawler implements Runnable
{
    ArrayList alCurrentSearches = new ArrayList();
    ArrayList alAlreadySearched = new ArrayList();
    ArrayList alMatchingSearches = new ArrayList();
    Thread running;
    URL enteredURL;
    int count = 0;
    
    public CustomWebCrawler()
    {
	    
    }
    
    public void start()
    {

        if (running == null)
        {
            running = new Thread(this);
        }
    }
    
    public void stop()
    {

        if (running != null)
        {
            running = null;

        }
    }
    
    public void run()
    {
        // Stop if there is no URL, or if it isn't an http URL.
        if (enteredURL == null || !enteredURL.getProtocol().equals("http"))
        {
            running = null;
            return;
        }
        alCurrentSearches.add(enteredURL);

        BufferedReader br = null;
        try
        {
            br = new BufferedReader(new InputStreamReader(enteredURL.openStream()));
            String inputText;
            while ((inputText = br.readLine()) != null)
            {
                int first = inputText.lastIndexOf("<a href=");
                int end = inputText.indexOf("\">", first);
                if (first != -1 && end != -1)
                {
                    findURL(inputText, first, end);
                }
            }
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }
    }
    
    public void findURL(String text, int numFirst, int numEnd)
    {
        // Pull the link text out from between  <a href="  and  ">
        String link = text.substring(numFirst + 9, numEnd);
        try
        {
            URL newURL = new URL(link);
            if (newURL.getProtocol().equals("http"))
            {
                if (!alMatchingSearches.contains(newURL))
                {
                    alAlreadySearched.add(newURL);
                    alMatchingSearches.add(newURL);
                    System.out.println(newURL);
                }
            }
        }
        catch (MalformedURLException mue)
        {
            // Ignore anything that isn't a valid URL.
        }
    }

	
}


Hi,
Since your code is not commented, I'll comment on the general strategy for accomplishing your goal.
You don't actually need three ArrayLists; one is enough.
You need to decide which approach to take: depth-first search or breadth-first search. The advantage of DFS is that it's very simple to program with recursion. However, since this is a web crawler, it can run you into memory and performance problems unless you restrict the depth, e.g. 1 URL ---> 20 URLs ---> each one of those 20 ---> 10 more ..., and none of those calls finish until you reach a page whose HTML contains no further URLs.
BFS handles that situation better, as the sketch below shows.
So it's your call.
If you need further help, let me know.
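
For example, a breadth-first crawl is usually just a queue of URLs still to visit plus a set of URLs already seen. This is only a rough sketch, and the extractLinks helper here is a placeholder standing in for whatever anchor-tag parsing you already have:

import java.net.URL;
import java.util.*;

public class BreadthFirstSketch
{
    // Placeholder: plug your own "<a href=" parsing in here.
    static List<URL> extractLinks(URL page)
    {
        return new ArrayList<URL>();
    }

    public static void crawl(URL start, int maxPages)
    {
        Queue<URL> toVisit = new LinkedList<URL>();   // pages waiting to be read
        Set<String> seen = new HashSet<String>();     // pages already queued

        toVisit.add(start);
        seen.add(start.toString());

        while (!toVisit.isEmpty() && seen.size() < maxPages)
        {
            URL current = toVisit.poll();             // take the oldest URL first (breadth-first)
            for (URL link : extractLinks(current))
            {
                // add() returns false if the URL was already in the set
                if (seen.add(link.toString()))
                {
                    toVisit.add(link);
                    System.out.println(seen.size() + "   " + link);
                }
            }
        }
    }
}

Keying the seen set on url.toString() rather than on the URL objects themselves also avoids the DNS lookups that URL.equals() and URL.hashCode() can trigger, which is part of why calling contains() on a list of URL objects gets slow.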

Thanks for the reply. I've updated the code a little bit. I've got it to where it will loop, but it hangs after the 60th URL. I know there are a few things I still have to do, but I really want to get this looping going. I do have a full-blown GUI, but I'll include a console app that tests this thing so you can see what it's doing.


CustomWebCrawler.java

/*
 Import all the needed packages.  IO is needed to open streams.
 net is needed for the URL object.  The util package is needed for
 ArrayList.
 */
import java.io.*;
import java.net.*;
import java.util.*;

/**
 *  Create a class called CustomWebCrawler that implements Runnable.
 *  Runnable is needed for threading.
 **/
public class CustomWebCrawler implements Runnable
{
    /*
     Create an ArrayList to hold the URLs that have been found so far.
     */
    ArrayList alMatchingSearches = new ArrayList();
    Thread running;
    URL enteredURL;
    int count = 0;
    int count2 = 0;
    /*
     Constructor
     */
    public CustomWebCrawler()
    {
        
    }
    
    public void start()
    {
        if (running == null)
        {
            running = new Thread(this);
        }
    }
    
    public void stop()
    {
        if (running != null)
        {
            running = null;
        }
    }
    
    public void run()
    {
        if (enteredURL == null)
        {
            running = null;
            return;
        }
        /*
         If we make it this far, the crawl can go ahead,
         so call the load method.
         */
        load();
    }
    
    /**
     *  This method takes three parameters.  The text, which is what's currently
     *  being read from the URL.  The second parameter is the index of
     *  the first occurrence of an anchor tag.  The third parameter is the ending
     *  index of the anchor tag.
     **/
    public void findURL(String text, int numFirst, int numEnd)
    {
        /*
         Create a String link out of the given parameters.
         */
        String link = text.substring(numFirst + 9, numEnd);

        /*
         Try to create an actual URL out of the String link.
         If we can, check that it has the right protocol.
         */
        try
        {
            URL newURL = new URL(link);
            if (newURL.getProtocol().equals("http"))
            {
                /*
                 Make sure we haven't already searched it.
                 */
                if (!alMatchingSearches.contains(newURL))
                {
                    alMatchingSearches.add(newURL);
                    System.out.println(count2 + "   " + newURL);
                    count2++;
                }
            }
        }
        catch (MalformedURLException mue)
        {
            // Ignore anything that isn't a valid URL.
        }
    }
    
    /**
     This method, load, keeps looping through the URLs that have been
     found until 100 of them have been collected.
     **/
    public void load()
    {
        while (alMatchingSearches.size() < 100)
        {
            BufferedReader br = null;
            int count = 0;
            /*
             Try reading from the URL.
             */
            try
            {
                br = new BufferedReader(new InputStreamReader(enteredURL.openStream()));
                String inputText;
                /*
                 While the inputText is not null, keep looping.
                 */
                while ((inputText = br.readLine()) != null)
                {
                    int first = inputText.lastIndexOf("<a href=");
                    int end = inputText.indexOf("\">", first);
                    /*
                     If the line contains an anchor tag, then
                     pass it to the findURL method.
                     */
                    if (first != -1 && end != -1)
                    {
                        findURL(inputText, first, end);
                    }
                }
            }
            catch (MalformedURLException mue)
            {
                System.out.println("Malformed Exception");
                break;
            }
            catch (IOException ioe)
            {
                System.out.println("IO Exception");
                break;
            }
            /*
             Change enteredURL to the next URL in the ArrayList.
             */
            enteredURL = (URL) alMatchingSearches.get(count);
            /*
             Make it run again.
             */
            run();
            count++;
        }

        running = null;
    }
}

TestWebCrawler.java

import java.util.*;
import java.net.*;

public class TestWebCrawler
{
    public static void main(String[] args)
    {
        CustomWebCrawler wc = new CustomWebCrawler();
        try
        {
            wc.enteredURL = new URL("http://java.sun.com");
            wc.start();
            wc.run();
        }
        catch (MalformedURLException ue)
        {
        }
    }
}

Any help/comments are greatly appreciated.
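
Just as a guess from reading load(): count gets reset to 0 on every pass through the while loop, and run() is called recursively from inside it, so the crawler keeps re-reading alMatchingSearches.get(0) instead of moving on to new URLs. Below is a minimal sketch of a non-recursive load(), written as a drop-in replacement inside the CustomWebCrawler class above; reader cleanup beyond close() and the count fields are left out for brevity:

    public void load()
    {
        // Seed the list with the starting URL so the loop below can treat
        // every page, including the first one, the same way.
        alMatchingSearches.add(enteredURL);
        int next = 0;   // index of the next URL in alMatchingSearches to read

        while (next < alMatchingSearches.size() && alMatchingSearches.size() < 100)
        {
            enteredURL = (URL) alMatchingSearches.get(next);
            next++;
            try
            {
                BufferedReader br = new BufferedReader(
                        new InputStreamReader(enteredURL.openStream()));
                String inputText;
                while ((inputText = br.readLine()) != null)
                {
                    int first = inputText.lastIndexOf("<a href=");
                    int end = inputText.indexOf("\">", first);
                    if (first != -1 && end != -1)
                    {
                        // findURL appends any new links to alMatchingSearches,
                        // so the outer loop will eventually reach them too.
                        findURL(inputText, first, end);
                    }
                }
                br.close();
            }
            catch (IOException ioe)
            {
                // Skip pages that can't be read and move on to the next one.
                System.out.println("IO Exception on " + enteredURL);
            }
        }
        running = null;
    }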

OK, I got past that problem, and now it seems to be looping just fine.

I do have one more question, though: how can I parse through the HTML faster, or make the whole crawler faster? It took a while to find 700 URLs, when Google can find millions in seconds.
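
Most of the time in a single-threaded crawler is spent waiting on the network rather than parsing strings, so downloading several pages at once usually buys far more than faster HTML parsing. Here is a rough sketch using an ExecutorService; fetchAndExtract is a stand-in for the existing download-and-parse code, and the pool size of 8 is arbitrary:

import java.net.*;
import java.util.*;
import java.util.concurrent.*;

public class ConcurrentCrawlSketch
{
    // Stand-in for the existing "open the stream, read lines, find <a href=" logic.
    static List<URL> fetchAndExtract(URL page)
    {
        return new ArrayList<URL>();
    }

    public static void main(String[] args) throws Exception
    {
        ExecutorService pool = Executors.newFixedThreadPool(8);   // 8 downloads in flight at once
        Set<String> seen = new HashSet<String>();
        List<URL> currentLevel = new ArrayList<URL>();

        URL start = new URL("http://java.sun.com");
        seen.add(start.toString());
        currentLevel.add(start);

        while (!currentLevel.isEmpty() && seen.size() < 700)
        {
            // Turn every page in the current level into its own download task.
            List<Callable<List<URL>>> tasks = new ArrayList<Callable<List<URL>>>();
            for (final URL page : currentLevel)
            {
                tasks.add(new Callable<List<URL>>()
                {
                    public List<URL> call()
                    {
                        return fetchAndExtract(page);
                    }
                });
            }

            // Wait for the whole level to finish and gather the new links.
            List<URL> nextLevel = new ArrayList<URL>();
            for (Future<List<URL>> result : pool.invokeAll(tasks))
            {
                for (URL link : result.get())
                {
                    if (seen.add(link.toString()))   // false means we already have it
                    {
                        nextLevel.add(link);
                    }
                }
            }
            currentLevel = nextLevel;
        }

        pool.shutdown();
        System.out.println("Found " + seen.size() + " URLs");
    }
}

As for Google, its numbers come from running this kind of crawl across a huge number of machines in parallel, not from parsing each page faster.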
