WebCrawler problem
Join Date: Jun 2004
Posts: 2,108
Reputation:
Solved Threads: 18
I've been working on this web crawler and I've run into a problem. I can read the first URL and extract all the URLs from its HTML, but I can't seem to set up a looping structure that works.
This is basically what it does:
It searches through the HTML of the first URL.
It may find, say, 20 other URLs contained in that page.
Then it stops.
How can I make it continuously search through the URLs that were found?
Here is the code I have so far, but it's not complete:
```java
import java.io.*;
import java.net.*;
import java.util.*;

public class CustomWebCrawler implements Runnable {
    ArrayList alCurrentSearches = new ArrayList();
    ArrayList alAlreadySearched = new ArrayList();
    ArrayList alMatchingSearches = new ArrayList();
    Thread running;
    URL enteredURL;
    int count = 0;

    public CustomWebCrawler() {
    }

    public void start() {
        if (running == null) {
            running = new Thread();
        }
    }

    public void stop() {
        if (running != null) {
            running = null;
        }
    }

    public void run() {
        if (enteredURL == null || enteredURL.getProtocol().compareTo("http") != 0) {
            running = null;
        }
        alCurrentSearches.add(enteredURL);
        BufferedReader br = null;
        try {
            br = new BufferedReader(new InputStreamReader(enteredURL.openStream()));
            String inputText = "";
            while ((inputText = br.readLine()) != null) {
                int first = inputText.lastIndexOf("<a href=");
                int end = inputText.indexOf("\">", first);
                if (first != -1 && end != -1) {
                    findURL(inputText, first, end);
                } else {
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public void findURL(String text, int numFirst, int numEnd) {
        String link = text.substring(numFirst + 9, numEnd);
        try {
            URL newURL = new URL(link);
            if (newURL.getProtocol().compareTo("http") == 0) {
                if (!(alMatchingSearches.contains(newURL))) {
                    alAlreadySearched.add(newURL);
                    alMatchingSearches.add(newURL);
                    System.out.println(newURL + "");
                }
            }
        } catch (MalformedURLException mue) {
        }
    }
}
```
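A side note on the link extraction in the listing above: `lastIndexOf("<a href=")` picks up at most one anchor per line, and `numFirst+9` assumes the attribute value always starts with a double quote. Below is a minimal sketch of a more tolerant extractor, assuming a regular expression is good enough for the pages being crawled (a real HTML parser would be more robust). The names `LinkExtractorSketch` and `extractHttpLinks` are made up for illustration and are not part of the original code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractorSketch {

    // Matches href="..." or href='...' inside an <a ...> tag, ignoring case.
    // This is a simplification and will not cope with every kind of markup.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1", Pattern.CASE_INSENSITIVE);

    /** Returns every absolute http:// link found in one chunk of HTML text. */
    public static List<String> extractHttpLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {                 // find() advances to each anchor in turn
            String link = m.group(2);      // group 2 is the text between the quotes
            if (link.regionMatches(true, 0, "http://", 0, 7)) {
                links.add(link);
            }
        }
        return links;
    }

    public static void main(String[] args) {
        // Made-up sample line with two absolute links and one relative link.
        String line = "<a href=\"http://one.example/\">1</a> "
                + "<a class=\"x\" href='http://two.example/'>2</a> "
                + "<a href=\"/relative\">3</a>";
        System.out.println(extractHttpLinks(line));
    }
}
```

Returning plain strings keeps the extractor independent of the crawl loop, which can then decide what to do with each link.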
Join Date: Mar 2005
Posts: 53
Reputation:
Solved Threads: 1
Hi,
Since your code is not commented, I will comment on the general strategy to accomplish your goal.
Actually, you don't need three ArrayLists. One is sufficient.
You need to decide which approach to take: depth-first search (DFS) or breadth-first search (BFS). The advantage of DFS is that it is very simple to program using recursion. However, since this is a web crawler, it can run into memory problems, and hence performance problems, unless you restrict the depth. For example, one URL leads to 20 URLs, each of those 20 leads to 10 others, and so on, and none of these calls finishes until you reach a page that has no further URLs.
With BFS, the situation can be handled better.
So it's your call.
If you need further help, let me know.
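To make the breadth-first option concrete, here is a minimal sketch, assuming the page download and link extraction are hidden behind a function. `BfsCrawlSketch`, `crawl`, `linksOf` and the `fakeWeb` map are hypothetical names used only for illustration; the in-memory map stands in for real HTTP requests so the example can run offline.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.function.Function;

public class BfsCrawlSketch {

    /**
     * Visits pages breadth-first from startUrl and stops after maxPages visits.
     * linksOf is a stand-in for "download the page and extract its links".
     */
    public static List<String> crawl(String startUrl, int maxPages,
                                     Function<String, List<String>> linksOf) {
        Queue<String> frontier = new ArrayDeque<>(); // URLs found but not yet visited
        Set<String> seen = new HashSet<>();          // URLs that were ever queued
        List<String> visited = new ArrayList<>();    // visit order, for inspection

        frontier.add(startUrl);
        seen.add(startUrl);

        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String current = frontier.remove();
            visited.add(current);
            for (String link : linksOf.apply(current)) {
                // Set.add returns false for duplicates, so each URL is queued once.
                if (seen.add(link)) {
                    frontier.add(link);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // Made-up link graph used instead of real pages.
        Map<String, List<String>> fakeWeb = new HashMap<>();
        fakeWeb.put("http://a.example", Arrays.asList("http://b.example", "http://c.example"));
        fakeWeb.put("http://b.example", Arrays.asList("http://a.example", "http://d.example"));
        fakeWeb.put("http://c.example", Arrays.asList("http://d.example"));

        List<String> order = crawl("http://a.example", 10,
                url -> fakeWeb.getOrDefault(url, Collections.emptyList()));
        System.out.println(order);
    }
}
```

Because the queue is processed in a loop rather than by recursion, the call stack stays flat no matter how many pages link to each other, and `maxPages` bounds the total work.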
cheers,
aj.wh.ca
-------------------------------------------
www.swiftthoughts.com
-------------------------------------------
Join Date: Jun 2004
Posts: 2,108
Reputation:
Solved Threads: 18
Thanks for the reply. I've updated the code a little bit. I've got it to where it will loop, but it hangs after the 60th URL. I know there are a few things I still have to do, but I really want to get this looping going. I do have a full-blown GUI, but I'll include a console app that tests this thing so you can see what it's doing.
CustomWebCrawler.java
```java
/*
 * Import all the needed packages. IO is needed to open streams.
 * net is needed for the URL object. The util package is needed for
 * ArrayList.
 */
import java.io.*;
import java.net.*;
import java.util.*;

/**
 * Create a class called CustomWebCrawler that implements Runnable.
 * Runnable is needed for threading.
 **/
public class CustomWebCrawler implements Runnable {
    /* Create an ArrayList to hold the actual searches. */
    ArrayList alMatchingSearches = new ArrayList();
    Thread running;
    URL enteredURL;
    int count = 0;
    int count2 = 0;

    /* Constructor */
    public CustomWebCrawler() {
    }

    public void start() {
        if (running == null) {
            running = new Thread();
        }
    }

    public void stop() {
        if (running != null) {
            running = null;
        }
    }

    public void run() {
        if (enteredURL == null) {
            running = null;
        }
        /*
         * If we make it this far, then the thread is running.
         * So let's call the load method.
         */
        load();
    }

    /**
     * This method takes three parameters. The text, which is what's being
     * currently read from the URL. The second parameter is the index of
     * the first occurrence of an anchor tag. The third parameter is the
     * ending index of the anchor tag.
     **/
    public void findURL(String text, int numFirst, int numEnd) {
        /* Create a String link out of the given parameters. */
        String link = text.substring(numFirst + 9, numEnd);
        /*
         * Try to create an actual URL out of the String link. If we are
         * able to, then check to make sure it has the right protocol.
         */
        try {
            URL newURL = new URL(link);
            if (newURL.getProtocol().compareTo("http") == 0) {
                /* Make sure we haven't already searched it. */
                if (!(alMatchingSearches.contains(newURL))) {
                    alMatchingSearches.add(newURL);
                    System.out.println(count2 + " " + newURL + "");
                    count2++;
                }
            }
        } catch (MalformedURLException mue) {
        }
    }

    /**
     * This method, called load, will make sure we have fewer than 101
     * searches, and then try looping through the searched URLs.
     **/
    public void load() {
        while (alMatchingSearches.size() < 100) {
            BufferedReader br = null;
            int count = 0;
            /* Try reading from the URL. */
            try {
                br = new BufferedReader(new InputStreamReader(enteredURL.openStream()));
                String inputText = "";
                /* While the inputText is not null, then loop. */
                while ((inputText = br.readLine()) != null) {
                    int first = inputText.lastIndexOf("<a href=");
                    int end = inputText.indexOf("\">", first);
                    /*
                     * If the read text contains an anchor tag, then send
                     * the findURL method the parameters.
                     */
                    if (first != -1 && end != -1) {
                        findURL(inputText, first, end);
                    } else {
                    }
                }
            } catch (MalformedURLException mue) {
                System.out.println("Malformed Exception");
                break;
            } catch (IOException ioe) {
                System.out.println("IO Exception");
                break;
            }
            /* Change the enteredURL to the next URL in the ArrayList. */
            enteredURL = (URL) alMatchingSearches.get(count);
            /* Make it run again. */
            run();
            count++;
        }
        running = null;
    }
}
```
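Two details in `load()` above may explain why the crawl stalls: `int count = 0` is declared inside the `while` loop, so `alMatchingSearches.get(count)` always returns the first entry, and `run()` calls `load()` again before `count++` is ever reached, so the same page can be re-read in ever deeper recursion. Separately, `ArrayList.contains` on `URL` objects goes through `URL.equals`, which can block on host name resolution. The following is only a sketch of a single-list loop with a cursor that keeps advancing, assuming links are compared as plain strings. `CursorLoopSketch`, `collect`, `linksOf` and `fakeWeb` are hypothetical names, and the map stands in for real page downloads.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class CursorLoopSketch {

    /**
     * Single-list crawl loop: the list holds every URL found so far and
     * cursor marks the next one to read. linksOf is a stand-in for
     * fetching a page and extracting its links.
     */
    public static List<String> collect(String startUrl, int limit,
                                       Function<String, List<String>> linksOf) {
        List<String> found = new ArrayList<>();
        found.add(startUrl);
        int cursor = 0; // declared once, outside the loop, so it keeps advancing

        // Stop when the limit is reached or every found URL has been read.
        while (found.size() < limit && cursor < found.size()) {
            String current = found.get(cursor);
            cursor++;
            for (String link : linksOf.apply(current)) {
                if (found.size() < limit && !found.contains(link)) {
                    found.add(link);
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Made-up link graph used instead of real pages.
        Map<String, List<String>> fakeWeb = new HashMap<>();
        fakeWeb.put("http://a.example", Arrays.asList("http://b.example", "http://c.example"));
        fakeWeb.put("http://b.example", Arrays.asList("http://a.example", "http://d.example"));

        System.out.println(collect("http://a.example", 100,
                url -> fakeWeb.getOrDefault(url, Collections.emptyList())));
    }
}
```

The second loop condition, `cursor < found.size()`, is what lets the loop end cleanly when the pages run out of new links before the limit is reached.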
WebCrawlerInterface.java
```java
import java.util.*;
import java.net.*;

public class TestWebCrawler {
    public static void main(String[] args) {
        CustomWebCrawler wc = new CustomWebCrawler();
        try {
            wc.enteredURL = new URL("http://java.sun.com");
            wc.start();
            wc.run();
        } catch (MalformedURLException ue) {
        }
    }
}
```
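A note on the threading in the listings above: `start()` creates `new Thread()` without passing the crawler to it and never calls `Thread.start()`, and the test class calls `wc.run()` directly, so the crawl appears to run on the main thread. Here is a minimal sketch of handing a `Runnable` to a thread, with `Worker` as a made-up stand-in for the crawler and `"crawler-thread"` as an arbitrary thread name:

```java
public class ThreadStartSketch {

    /** Hypothetical stand-in for the crawler: records which thread ran it. */
    public static class Worker implements Runnable {
        volatile String ranOn; // name of the thread that executed run()

        @Override
        public void run() {
            ranOn = Thread.currentThread().getName();
        }
    }

    /** Runs the worker on a new thread and waits for it to finish. */
    public static String runInBackground(Worker worker) throws InterruptedException {
        Thread t = new Thread(worker, "crawler-thread"); // pass the Runnable to the Thread
        t.start(); // start() makes the new thread invoke worker.run()
        t.join();  // wait until run() has returned
        return worker.ranOn;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runInBackground(new Worker()));
    }
}
```

Calling `run()` directly is an ordinary method call on the current thread; only `Thread.start()` moves the work to a separate thread.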
Any help/comments are greatly appreciated.