WebCrawler problem

Post #1 by server_crash, May 11th, 2005
I've been working on this web crawler and I've run into a problem. I can read the first URL and extract all the URLs from its HTML, but I can't seem to set up a looping structure that works.

This is basically what it does:

Searches through the HTML of the first URL.
Finds, say, 20 other URLs contained in that page.
Stops.

How can I make it continuously search through the URLs it found?

Here is the code I have so far, but it's not complete:

import java.io.*;
import java.net.*;
import java.util.*;

public class CustomWebCrawler implements Runnable
{
    ArrayList alCurrentSearches = new ArrayList();
    ArrayList alAlreadySearched = new ArrayList();
    ArrayList alMatchingSearches = new ArrayList();
    Thread running;
    URL enteredURL;
    int count = 0;

    public CustomWebCrawler()
    {
    }

    public void start()
    {
        if (running == null)
        {
            running = new Thread();
        }
    }

    public void stop()
    {
        if (running != null)
        {
            running = null;
        }
    }

    public void run()
    {
        if (enteredURL == null || enteredURL.getProtocol().compareTo("http") != 0)
        {
            running = null;
        }
        alCurrentSearches.add(enteredURL);

        BufferedReader br = null;
        try
        {
            br = new BufferedReader(new InputStreamReader(enteredURL.openStream()));
            String inputText = "";
            while ((inputText = br.readLine()) != null)
            {
                int first = inputText.lastIndexOf("<a href=");
                int end = inputText.indexOf("\">",first);
                if (first != -1 && end != -1)
                {
                    findURL(inputText,first,end);
                }
                else
                {
                }
            }
        }
        catch(Exception e)
        {
            e.printStackTrace();
        }
    }

    public void findURL(String text, int numFirst, int numEnd)
    {
        String link = text.substring(numFirst+9, numEnd);
        try
        {
            URL newURL = new URL(link);
            if (newURL.getProtocol().compareTo("http") == 0)
            {
                if (!(alMatchingSearches.contains(newURL)))
                {
                    alAlreadySearched.add(newURL);
                    alMatchingSearches.add(newURL);
                    System.out.println(newURL + "");
                }
            }
        }
        catch(MalformedURLException mue)
        {
        }
    }
}
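
The `lastIndexOf("<a href=")` / `indexOf("\">")` pair in the code above looks at no more than one anchor per line and assumes the tag is written exactly as `<a href="...">`. As a hedged illustration only, here is a minimal sketch that pulls every double-quoted `href` out of a line with a regular expression. The class name `LinkExtractor`, the method `extractLinks`, and the example URLs are assumptions made for this sketch, not part of the original code, and a regex is still only an approximation of real HTML parsing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original crawler.
final class LinkExtractor {
    // Matches an <a ...> tag and captures the double-quoted href value (case-insensitive).
    private static final Pattern HREF =
            Pattern.compile("<a\\s[^>]*?href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    /** Returns every double-quoted href value found in the text, in order of appearance. */
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        String line = "<a href=\"http://a.example/\">A</a> "
                + "<a class=\"x\" href=\"http://b.example/\">B</a>";
        // Prints [http://a.example/, http://b.example/]
        System.out.println(extractLinks(line));
    }
}
```

Each returned string could then be handed to the existing `new URL(link)` check in `findURL`.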

Post #2 by aj.wh.ca, May 12th, 2005
Hi,
Since your code is not commented, I will comment on the general strategy to accomplish your goal.
You don't actually need three ArrayLists; one is sufficient.
You need to decide which approach to take: depth-first search (DFS) or breadth-first search (BFS). The advantage of DFS is that it is very simple to program using recursion. However, since this is a web crawler, unrestricted recursion can run you into memory problems, and hence performance problems, unless you limit the depth. For example, 1 URL leads to 20 URLs, each of those 20 leads to 10 others, and so on, and none of these calls finishes until you reach HTML that contains no further URLs.
BFS handles this situation better.
So it's your call.
If you need further help, let me know.
Cheers,
aj.wh.ca
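
To make the breadth-first idea above concrete, here is a minimal sketch under stated assumptions. The class name `BreadthFirstCrawl`, the `maxPages` cap, and the `linksOf` function (a stand-in for "download this page and extract its links") are all hypothetical and not taken from the thread. URLs are kept as plain strings here because `java.net.URL.equals` may perform host name resolution, which makes `contains` checks on a list of `URL` objects potentially slow.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch of a breadth-first crawl loop; no networking is done here.
final class BreadthFirstCrawl {
    /**
     * Visits pages breadth-first starting at seed and stops after maxPages pages.
     * linksOf is an assumed stand-in for fetching a page and extracting its links.
     */
    static List<String> crawl(String seed, int maxPages,
                              Function<String, List<String>> linksOf) {
        Queue<String> frontier = new ArrayDeque<>(); // discovered but not yet visited
        Set<String> seen = new HashSet<>();          // everything ever queued
        List<String> visited = new ArrayList<>();    // visit order

        frontier.add(seed);
        seen.add(seed);
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String current = frontier.remove();
            visited.add(current);
            for (String link : linksOf.apply(current)) {
                if (seen.add(link)) {                // add() returns false for duplicates
                    frontier.add(link);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // A tiny in-memory "web" so the sketch runs without network access.
        Map<String, List<String>> web = new HashMap<>();
        web.put("a", Arrays.asList("b", "c"));
        web.put("b", Arrays.asList("a", "d"));
        web.put("c", Arrays.asList("d"));
        // Prints [a, b, c, d]
        System.out.println(crawl("a", 10,
                url -> web.getOrDefault(url, Collections.<String>emptyList())));
    }
}
```

The loop needs no recursion: the queue holds the work still to do, and the `seen` set prevents a page from being queued twice, so the crawl ends when the queue is empty or the page cap is reached.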


Post #3 by server_crash, May 12th, 2005
Thanks for the reply. I've updated the code a little bit. I've got it to where it will loop, but it hangs after the 60th URL. I know there are a few things I still have to do, but I really want to get this looping going. I do have a full-blown GUI, but I'll include a console app that tests this thing so you can see what it's doing.


CustomWebCrawler.java
/*
 Import all the needed packages. java.io is needed to open streams,
 java.net is needed for the URL object, and java.util is needed for
 ArrayList.
 */
import java.io.*;
import java.net.*;
import java.util.*;

/**
 * A class called CustomWebCrawler that implements Runnable.
 * Runnable is needed for threading.
 **/
public class CustomWebCrawler implements Runnable
{
    /*
      An ArrayList to hold the actual searches.
     */
    ArrayList alMatchingSearches = new ArrayList();
    Thread running;
    URL enteredURL;
    int count = 0;
    int count2 = 0;

    /*
      Constructor
     */
    public CustomWebCrawler()
    {
    }

    public void start()
    {
        if (running == null)
        {
            running = new Thread();
        }
    }

    public void stop()
    {
        if (running != null)
        {
            running = null;
        }
    }

    public void run()
    {
        if (enteredURL == null)
        {
            running = null;
        }
        /*
          If we make it this far, then the thread is running.
          So let's call the load method.
         */
        load();
    }

    /**
     * This method takes three parameters: the text currently being read
     * from the URL, the starting index of the anchor tag, and the ending
     * index of the anchor tag.
     **/
    public void findURL(String text, int numFirst, int numEnd)
    {
        /*
          Create a String link out of the given parameters.
         */
        String link = text.substring(numFirst+9, numEnd);

        /*
          Try to create an actual URL out of the String link.
          If we are able to, then check that it has the right protocol.
         */
        try
        {
            URL newURL = new URL(link);
            if (newURL.getProtocol().compareTo("http") == 0)
            {
                /*
                  Make sure we haven't already searched it.
                 */
                if (!(alMatchingSearches.contains(newURL)))
                {
                    alMatchingSearches.add(newURL);
                    System.out.println(count2 + " " + newURL + "");
                    count2++;
                }
            }
        }
        catch(MalformedURLException mue)
        {
        }
    }

    /**
     * This method, called load, makes sure we have fewer than 100 searches,
     * and then tries looping through the searched URLs.
     **/
    public void load()
    {
        while(alMatchingSearches.size() < 100)
        {
            BufferedReader br = null;
            int count = 0;
            /*
              Try reading from the URL.
             */
            try
            {
                br = new BufferedReader(new InputStreamReader(enteredURL.openStream()));
                String inputText = "";
                /*
                  Loop while the inputText is not null.
                 */
                while ((inputText = br.readLine()) != null)
                {
                    int first = inputText.lastIndexOf("<a href=");
                    int end = inputText.indexOf("\">",first);
                    /*
                      If the text read contains an anchor tag, then
                      send the findURL method the parameters.
                     */
                    if (first != -1 && end != -1)
                    {
                        findURL(inputText,first,end);
                    }
                    else
                    {
                    }
                }
            }
            catch(MalformedURLException mue)
            {
                System.out.println("Malformed Exception");
                break;
            }
            catch(IOException ioe)
            {
                System.out.println("IO Exception");
                break;
            }
            /*
              Change the enteredURL to the next URL in the ArrayList.
             */
            enteredURL = (URL)alMatchingSearches.get(count);
            /*
              Make it run again.
             */
            run();
            count++;
        }

        running = null;
    }
}

TestWebCrawler.java
import java.util.*;
import java.net.*;

public class TestWebCrawler
{
    public static void main(String[] args)
    {
        CustomWebCrawler wc = new CustomWebCrawler();
        try
        {
            wc.enteredURL = new URL("http://java.sun.com");
            wc.start();
            wc.run();
        }
        catch(MalformedURLException ue)
        {
        }
    }
}
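
Note that `start()` in the crawler above creates a `Thread` with no `Runnable` and never starts it, and the test then calls `run()` directly, so all the work happens on the calling thread. As a hedged sketch only, here is one way a `Runnable` can be run on its own thread. The class `CrawlerThreadDemo`, the thread name `"crawler"`, and the methods `awaitFinished` and `getRanOn` are hypothetical names for this illustration; the body of `run()` is a placeholder for the real crawl loop.

```java
// Hypothetical demo, not the original crawler: shows a Runnable run on its own thread.
final class CrawlerThreadDemo implements Runnable {
    private Thread worker;
    private volatile String ranOn;   // name of the thread that executed run()

    /** Wraps this Runnable in a thread and actually starts it (only once). */
    public synchronized void start() {
        if (worker == null) {
            worker = new Thread(this, "crawler");
            worker.start();
        }
    }

    /** Blocks until the worker thread, if any, has finished. */
    public void awaitFinished() {
        Thread t;
        synchronized (this) {
            t = worker;
        }
        if (t == null) {
            return;
        }
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
        }
    }

    @Override
    public void run() {
        // Placeholder for the crawl loop; here we only record where we ran.
        ranOn = Thread.currentThread().getName();
    }

    public String getRanOn() {
        return ranOn;
    }

    public static void main(String[] args) {
        CrawlerThreadDemo demo = new CrawlerThreadDemo();
        demo.start();          // do not call run() directly
        demo.awaitFinished();
        // Prints: run() executed on thread: crawler
        System.out.println("run() executed on thread: " + demo.getRanOn());
    }
}
```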


Any help/comments are greatly appreciated.

Post #4 by server_crash, May 12th, 2005
OK, I got past that problem, and now it seems to be looping just fine.

I do have one more question, though: how can I parse the HTML faster, or make the whole crawler faster? It took a while to find 700 URLs, when Google can find millions in seconds.
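
One commonly used way to speed up a crawler like this is to fetch several pages at the same time, since a single-threaded loop spends much of its time waiting on the network. The following is a minimal sketch under stated assumptions, not a tuned implementation. The class `ParallelFetch`, the method `fetchAll`, and the `fetch` function (a stand-in for "download and parse one page") are hypothetical names for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical sketch: run an assumed per-URL fetch function on a fixed thread pool.
final class ParallelFetch {
    /**
     * Applies fetch to every URL using the given number of threads.
     * Results come back in the same order as urls; a failed fetch yields null.
     */
    static <T> List<T> fetchAll(List<String> urls, int threads, Function<String, T> fetch) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<T>> pending = new ArrayList<>();
            for (String url : urls) {
                Callable<T> task = () -> fetch.apply(url);
                pending.add(pool.submit(task));
            }
            List<T> results = new ArrayList<>();
            for (Future<T> future : pending) {
                try {
                    results.add(future.get());   // waits for this task to finish
                } catch (ExecutionException e) {
                    results.add(null);           // the fetch function threw
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    results.add(null);
                }
            }
            return results;
        } finally {
            pool.shutdown();                     // let the worker threads exit
        }
    }

    public static void main(String[] args) {
        // String::length stands in for a real download so the sketch runs offline.
        List<Integer> sizes = fetchAll(Arrays.asList("a", "bb", "ccc"), 2, String::length);
        // Prints [1, 2, 3]
        System.out.println(sizes);
    }
}
```

In a real crawler the `fetch` function would open the URL and return the links it found, and the pool size would need to be chosen with the target servers in mind.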