Member Avatar for kris0r

Hi all,

I am currently trying to work out (and now wondering) whether you can add multiple threads to a program to do the same job. Basically I am writing a web crawler for a search engine, and it adds many URLs to an ArrayList which is used as a queue. I add a few links to start and open each with jsoup, checking whether each link it finds is already in the queue and, if not, adding it (so that its links can be extracted in turn). This process works, but the queue grows to about 3000 entries before it starts working down. Can I add multithreading to my program to make it finish quicker? I have used multithreading before, but not in this context, and am unsure where to start. I am happy to post my program if needed (the whole thing is only about 70 lines long). Thanks

Threads are mainly used for independent tasks; sharing a single task between several threads is harder.
However, if you are interested in doing so, proper synchronization between all those threads must be maintained. It depends on your logic and on the dependencies between your program's modules.

Member Avatar for kris0r

Hi James,

Thank you very much for the info, will give them a good read. I think my main work now will be working out which parts of my program I want the threads to run on. I think this will be the part which removes a link, then loops through its links and checks whether the new ones are in the queue yet or not. I think I will also need to work out how to change my vanilla ArrayList to something which is thread-safe. Either way I have a bit more direction than I did before!
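For readers wondering what a thread-safe replacement for the ArrayList queue might look like, here is a minimal sketch. It is only one possible approach, not something the posters settled on. The class name `CrawlFrontier` and its methods are hypothetical; `ConcurrentHashMap.newKeySet()` and `ConcurrentLinkedQueue` are standard `java.util.concurrent` classes. The point is that `Set.add` does the "is it already known?" check and the insert in one atomic step, which a separate `contains` followed by `add` on an ArrayList cannot do safely across threads.

```java
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical helper, not part of the original program.
class CrawlFrontier {
    // Every URL ever offered, so a URL is not queued again after it has been removed.
    private final Set<String> seen = ConcurrentHashMap.newKeySet();
    // URLs waiting to be fetched; many threads may add to and poll from it.
    private final Queue<String> pending = new ConcurrentLinkedQueue<>();

    // Returns true only for the first caller that offers this URL.
    boolean offer(String url) {
        // add() is atomic on this set, so two threads cannot both "win" for one URL.
        if (seen.add(url)) {
            pending.add(url);
            return true;
        }
        return false;
    }

    // Returns the next URL to fetch, or null when nothing is waiting.
    String poll() {
        return pending.poll();
    }

    int pendingCount() {
        return pending.size();
    }

    public static void main(String[] args) {
        CrawlFrontier frontier = new CrawlFrontier();
        frontier.offer("http://www.bbc.co.uk/");
        frontier.offer("http://www.bbc.co.uk/"); // duplicate, ignored
        System.out.println("Pending: " + frontier.pendingCount());
    }
}
```

Keeping the `seen` set separate from the queue also means a URL stays known after it has been polled, which a `contains` check against the queue alone would not guarantee.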

I guess the place where threads will help is not in the internal execution of your code, but in the request/wait for external HTTP responses.

A ThreadPoolExecutor with an unbounded queue will handle the queue for you.
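As a rough illustration of that suggestion: `Executors.newFixedThreadPool(n)` returns a ThreadPoolExecutor backed by an unbounded queue, so each discovered URL can simply be submitted as a task and the pool does the queueing. The sketch below is an assumption-laden example, not the original crawler: `PoolCrawlSketch`, `FAKE_WEB` and `fetchLinks` are hypothetical names, and `fetchLinks` reads from an in-memory map standing in for the jsoup fetch so the example runs without a network. The counter and latch are one way (of several) to tell when no tasks are left.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: the executor's internal queue replaces the hand-written one.
class PoolCrawlSketch {
    // Made-up stand-in for the web: page -> links found on that page.
    static final Map<String, List<String>> FAKE_WEB = Map.of(
            "a", List.of("b", "c"),
            "b", List.of("c", "d"),
            "c", List.of("a"),
            "d", List.of());

    // Stand-in for Jsoup.connect(url).get() followed by select("a[href]").
    static List<String> fetchLinks(String url) {
        return FAKE_WEB.getOrDefault(url, List.of());
    }

    private final ExecutorService pool;
    private final Set<String> seen = ConcurrentHashMap.newKeySet();
    // Starts at 1 to account for the seeding step, so the latch cannot open early.
    private final AtomicInteger pending = new AtomicInteger(1);
    private final CountDownLatch done = new CountDownLatch(1);

    PoolCrawlSketch(int threads) {
        pool = Executors.newFixedThreadPool(threads);
    }

    // Crawls from the seeds and returns every URL that was visited.
    Set<String> crawl(List<String> seeds) {
        for (String seed : seeds) submit(seed);
        finishOne(); // seeding is complete
        try {
            done.await(); // wait until every submitted task has finished
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        pool.shutdown();
        return seen;
    }

    private void submit(String url) {
        if (!seen.add(url)) return; // already queued or visited
        pending.incrementAndGet();
        pool.execute(() -> {
            try {
                for (String link : fetchLinks(url)) submit(link);
            } finally {
                finishOne();
            }
        });
    }

    private void finishOne() {
        if (pending.decrementAndGet() == 0) done.countDown();
    }

    public static void main(String[] args) {
        System.out.println(new PoolCrawlSketch(4).crawl(List.of("a")));
    }
}
```

In a real crawler `fetchLinks` would do the HTTP request, and some limit on depth or page count would probably be needed, since the task queue is unbounded.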

Member Avatar for kris0r

After blindly trying to re-organise my program into threads, I think you may be correct there. Basically I have a program which opens an HTML page and retrieves all of its <a href="this_part_here"> values, which is fine. It then steps through every this_part_here, checks whether it is in the queue, and then makes an object from a custom class. I would like to speed up this process, and my lecturer loosely suggested threading would help.

Would it be possible for you to elaborate slightly on how threading would help my HTTP requests? Do you mean open more than one page at once or make the process of opening each one quicker?

This is my program:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.net.SocketException;

import java.util.ArrayList;

public class URLReader
{    							
    public static void main(String[] args) throws Exception
    {
		ArrayList<CrawledDoc> FoundDocs = new ArrayList<CrawledDoc>();
    	ArrayList<String> Q = new ArrayList<String>();
    	boolean correctProtocol = true;
    	Document page = null;
    	int URLcount = 0, docID = 0;
   		String[] initialURLs = {"http://www.bbc.co.uk/", "http://www.engadget.com/", "http://news.google.com/",
    							"http://www.skynews.com/","http://www.youtube.com/","http://www.theregister.co.uk/"};
		String title = null, content = null, abslink, previousContent = null;
		
    	for(int counter = 0; counter < initialURLs.length; counter++)
    		Q.add(initialURLs[counter]);
		
		// Run until the queue is empty; index 0 is the head of the queue
		while(!Q.isEmpty())
		{
			String L = Q.remove(0);
			System.out.println("Opening: "+ L);
			
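			try
			{
				page = Jsoup.connect(L).get();
			}
			catch(IOException | IllegalArgumentException e)
			{
				// SocketException is an IOException, so this covers it too.
				// Skip the page instead of reusing the previous one (or a null).
				System.out.println("Skipping: " + L + " (" + e.getMessage() + ")");
				continue;
			}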
			
			Elements links = page.select("a[href]");
			Element paragraph = page.select("p").first();
			
			for(Element link : links)
			{
				abslink = link.attr("abs:href");
				// Recompute for every link; otherwise one ftp:// or mailto: link blocks all later links
				correctProtocol = !(abslink.startsWith("ftp://") || abslink.startsWith("mailto:"));
				
				if((!Q.contains(abslink)) && correctProtocol)
				{
					Q.add(abslink);
				
					docID++;
					title = page.title();
				
					if(paragraph == null) content = title;
					else content = paragraph.text();
					
					System.out.println("/// START OF OBJECT ///");
					System.out.println("[Link]: "+abslink);
					System.out.println("[ID]: "+docID);
					System.out.println("[Title]: "+title);
					System.out.println("[content]: "+content);
					System.out.println("/// END OF OBJECT ///");
					
					CrawledDoc tempDoc = new CrawledDoc(abslink, docID, title, content);
					FoundDocs.add(tempDoc);
				}
        	}
        	System.out.println("Q size: "+ Q.size());
		}
    }
}

Apologies for some of the formatting being off; it is tabbed nice and tidy in my text editor :P

Sorry, no time to study your code now, but yes, I mean open more than one page at once. What you do with the queues etc will be processor-bound, so threads won't do much for you, but when you open a web page and wait for the response your processor will be idle for seconds waiting. So it makes sense to send off many requests and wait for them all on different threads.
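To make the "processor will be idle while waiting" point concrete, here is a small sketch that compares fetching pages one after another with fetching them on a pool of threads. Everything in it is an assumption for illustration: `ParallelFetchDemo` and `slowFetch` are hypothetical names, and `slowFetch` just sleeps for a made-up 100 ms instead of calling jsoup, so no real sites are contacted. Actual timings will depend on the machine and, in a real crawler, on the network.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical demo: waiting on many "requests" at once instead of one at a time.
class ParallelFetchDemo {
    // Stand-in for Jsoup.connect(url).get(): the thread only waits, as it would for a server.
    static String slowFetch(String url) {
        try {
            Thread.sleep(100); // simulated network latency (made-up figure)
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "page for " + url;
    }

    // Fetches the URLs one after another; returns elapsed milliseconds.
    static long timeSequential(List<String> urls) {
        long start = System.nanoTime();
        for (String url : urls) slowFetch(url);
        return (System.nanoTime() - start) / 1_000_000;
    }

    // Fetches the URLs on a fixed pool of threads; returns elapsed milliseconds.
    static long timeParallel(List<String> urls, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<String>> jobs = new ArrayList<>();
            for (String url : urls) jobs.add(() -> slowFetch(url));
            long start = System.nanoTime();
            // invokeAll blocks until every job has completed.
            for (Future<String> result : pool.invokeAll(jobs)) result.get();
            return (System.nanoTime() - start) / 1_000_000;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> urls = List.of("u1", "u2", "u3", "u4", "u5", "u6", "u7", "u8");
        System.out.println("Sequential ms: " + timeSequential(urls));
        System.out.println("Parallel ms:   " + timeParallel(urls, 8));
    }
}
```

Because the threads spend nearly all their time waiting rather than computing, the pool can reasonably be larger than the number of processor cores; the right size for a real crawler is something to find by experiment.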

Member Avatar for kris0r

OK, thanks James. I have much to bang my head against a wall about, lol. Watch this space...
