crzycrusnik:

Hello! First time Poster :)

I'm trying to create a Java web crawler. I want to pass it a starting site and a maximum depth (levels of sites to follow, as opposed to a number of pages), and have it save files such as images and documents along the way.

The problem is, I'm unsure of how to do this >_<

I believe I know how to process a page to get its links, but I'm having trouble adding those sites to a queue, then working through the queue to parse each site, and saving the files/documents that I want along the way.
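For the saving part, this is roughly what I was picturing, though I haven't tested it (saveTo is just a name I made up):

import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;

// copy whatever the URL points at (image, document, ...) into a local file
public static void saveTo(URL fileUrl, String localPath) throws Exception {
    InputStream in = fileUrl.openStream();
    OutputStream out = new FileOutputStream(localPath);
    byte[] buffer = new byte[4096];
    int bytesRead;
    while ((bytesRead = in.read(buffer)) != -1) {
        out.write(buffer, 0, bytesRead);
    }
    out.close();
    in.close();
}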

Are there any concrete examples or guides I could use to help me create something like this that will fulfill my needs?

Thank you all for taking the time to read my post :]

All 4 Replies

crzycrusnik:

I'll add the code I have so far, so you can see where I am and whether it's just completely wrong. I am new to Java, after all.

import java.io.IOException;
import java.io.InputStream;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.util.Vector;

public void crawl(URL urlPassed, int maxDepth, String fileSave){

    Vector<URL> newURLs = new Vector<URL>();

    try{
        // try opening the URL
        URLConnection urlConnection = urlPassed.openConnection();
        urlConnection.setAllowUserInteraction(false);
        InputStream urlStream = urlConnection.getInputStream();

        // read the page into a string, capped at roughly 20 KB
        byte b[] = new byte[1000];
        String content = "";
        int numRead = urlStream.read(b);
        while( (numRead != -1) && (content.length() < 20000) ){
            content += new String(b, 0, numRead);
            numRead = urlStream.read(b);
        }
        urlStream.close();

        String lowerPage = content.toLowerCase(); // lower-case copy used for searching
        int index = 0; // current position in the page
        int iEndAngle, ihref, iURL, icloseQuote, ihatchMark, iend;
        // walk through every <a ...> tag and pull out the href value
        while( (index = lowerPage.indexOf("<a", index)) != -1 ){
            iEndAngle = lowerPage.indexOf(">", index);
            ihref = lowerPage.indexOf("href", index);
            if( (ihref != -1) && (iEndAngle != -1) ){
                iURL = lowerPage.indexOf("\"", ihref) + 1; // 0 means no opening quote was found
                if( (iURL != 0) && (iURL < iEndAngle) ){
                    icloseQuote = lowerPage.indexOf("\"", iURL);
                    ihatchMark = lowerPage.indexOf("#", iURL);
                    if( (icloseQuote != -1) && (icloseQuote < iEndAngle) ){
                        iend = icloseQuote;
                        if( (ihatchMark != -1) && (ihatchMark < icloseQuote) )
                            iend = ihatchMark; // drop the #fragment part of the link
                        String newUrlString = content.substring(iURL, iend);
                        try{
                            // resolve the link relative to the page it was found on
                            URL addUrl = new URL(urlPassed, newUrlString);
                            String filename = addUrl.getFile();
                            int iSuffix = filename.lastIndexOf("htm");

                            // int iSuffix2 = filename.lastIndexOf("jpg");
                            // int iSuffix3 = filename.lastIndexOf("jpeg");
                            // int iSuffix4 = filename.lastIndexOf("gif");

                            // keep only links that end in "htm" or "html"
                            // Recursion has to take place after the elements have been added to the list.
                            if( (iSuffix == filename.length() - 3) || (iSuffix == filename.length() - 4) ){
                                newURLs.addElement(addUrl);
                            }
                        } catch(MalformedURLException e){
                            // ignore links that don't form a valid URL
                        }
                    }
                }
            }
            // move past this tag so the same "<a" isn't found again
            index = (iEndAngle != -1) ? iEndAngle : index + 2;
        }

        // TODO: recurse over newURLs here with maxDepth - 1, and save files to fileSave
    } catch(IOException e){
        e.printStackTrace();
    }
}
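In case it matters, this is roughly how I imagine calling it (assuming the method lives in a class I'm calling MyCrawler here; that name is just a placeholder):

import java.net.URL;

public class Main {
    public static void main(String[] args) throws Exception {
        MyCrawler crawler = new MyCrawler(); // placeholder class that holds crawl(...)
        URL start = new URL("http://www.example.com/index.html");
        crawler.crawl(start, 3, "downloads"); // start page, max depth of 3, folder to save into
    }
}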

For the usage you describe, a stack would be more useful than a queue.
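Roughly, the driver loop could look something like this. It's only a sketch: CrawlEntry and extractLinks are placeholder names standing in for your own code, not anything from a library.

import java.net.URL;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CrawlerDriver {

    // one pending page: its URL plus how deep it sits below the start page
    static class CrawlEntry {
        final URL url;
        final int depth;
        CrawlEntry(URL url, int depth) { this.url = url; this.depth = depth; }
    }

    // placeholder for your own parsing code that returns the links found on a page
    static List<URL> extractLinks(URL page) {
        throw new UnsupportedOperationException("plug in your link parser here");
    }

    public static void crawl(URL start, int maxDepth) {
        Deque<CrawlEntry> stack = new ArrayDeque<CrawlEntry>();
        Set<URL> visited = new HashSet<URL>();
        stack.push(new CrawlEntry(start, 0));

        while (!stack.isEmpty()) {
            CrawlEntry current = stack.pop();
            if (current.depth > maxDepth || !visited.add(current.url)) {
                continue; // too deep, or already crawled
            }
            // ... download/save any images or documents found on current.url here ...
            for (URL link : extractLinks(current.url)) {
                stack.push(new CrawlEntry(link, current.depth + 1));
            }
        }
    }
}

If you ever want breadth-first instead, the same Deque works as a queue: use addLast(...) and pollFirst() in place of push(...) and pop().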

Also take a look at this thread.

crzycrusnik:

When I try to click that link, it gives me a 404 (file not found) error.

Yes, the link is broken; here is the link for it.
