
Building a simple Java web crawler

Hi

Over the next few months I intend to learn Java, with the aim of building my own simple web crawler/spider. I have seen a few open source spiders but would like to build my own if possible.

What I would like to ask is: how should I go about learning Java, and would building a simple spider be very hard?

My requirements of the spider are as follows:

Go to the entered URL and gather all content from the site
Collect link structure

The app I am developing will need to be able to build a structured sitemap of the specified URL.
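
For anyone reading later, the requirements above (visit a start URL, follow its links, record the link structure) come down to a breadth-first traversal. Here is a minimal sketch, not a finished crawler. The class name `SitemapSketch`, the `linksOf` function and the `maxPages` limit are all made up for illustration, and the page fetching is faked with an in-memory map so the example runs without a network connection.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch: builds a page -> outgoing-links map by breadth-first traversal.
class SitemapSketch {

    // linksOf is an assumed hook: given a page, return the links found on it.
    // A real crawler would download and parse the page here.
    static Map<String, List<String>> crawl(String start,
                                           Function<String, List<String>> linksOf,
                                           int maxPages) {
        Map<String, List<String>> sitemap = new LinkedHashMap<>(); // keeps visit order
        Deque<String> queue = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        queue.add(start);
        seen.add(start);

        while (!queue.isEmpty() && sitemap.size() < maxPages) {
            String page = queue.remove();
            List<String> links = linksOf.apply(page);
            sitemap.put(page, links);
            for (String link : links) {
                // Set.add returns false for pages already queued, which avoids loops.
                if (seen.add(link)) {
                    queue.add(link);
                }
            }
        }
        return sitemap;
    }

    public static void main(String[] args) {
        // A fake three-page site standing in for real HTTP requests.
        Map<String, List<String>> fakeSite = new HashMap<>();
        fakeSite.put("/", Arrays.asList("/about", "/blog"));
        fakeSite.put("/about", Arrays.asList("/"));
        fakeSite.put("/blog", Arrays.asList("/blog/post-1", "/about"));

        Map<String, List<String>> sitemap =
                crawl("/", p -> fakeSite.getOrDefault(p, Collections.emptyList()), 100);
        sitemap.forEach((page, links) -> System.out.println(page + " -> " + links));
    }
}
```

The `seen` set is what stops the crawler going round in circles when pages link back to each other, and `maxPages` is only a safety limit so a large site cannot run forever. A real version would also need to stay on the entered site rather than following external links.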

One final question: how would I go about building a browser add-on? What languages can they be built in, and which browser is best/easiest to develop for?

Thanks

kooben
Newbie Poster
14 posts since Apr 2005
Reputation Points: 10
Solved Threads: 0
 

I have built a Java web crawler/spider before, with a front end resembling Google, for a previous uni project. I would say it is a moderately difficult program to write: not overly hard, but a definite challenge for a new Java coder.

The main things you will need to learn for this are I/O streams, to read the URLs in, and JDBC, so that you can store the data (you could do it by reading into an array/Vector, but I wouldn't recommend it as it would eat memory).

There is loads on the web about spider methods and algorithms such as word ranking, but I am sure you have already read up on how they work.

It is probably quite a good project, as you could build it on the command line and then redo it with a GUI later if you wanted to.

As for browser plugins, I would probably go for a Firefox plugin, but then again why stop at a search engine? Why not build your own browser too. :mrgreen:
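
To give a rough idea of the stream part, here is a small sketch that reads a page through an `InputStream` and pulls out the `href` values. It is not the code from my project. The class name `PageReaderSketch` and both method names are made up, the page is assumed to be UTF-8, and the regular expression is a crude stand-in for a proper HTML parser, so it will miss unquoted or unusual links.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: read a page from a stream and collect its href values.
class PageReaderSketch {

    // Matches href="..." or href='...'; deliberately simple, not a full HTML parser.
    static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    // Reads the whole stream as text (assumes UTF-8) and closes it afterwards.
    static String readAll(InputStream in) throws IOException {
        StringBuilder text = new StringBuilder();
        try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                text.append(line).append('\n');
            }
        }
        return text.toString();
    }

    // Returns every quoted href value in the order it appears in the HTML.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher matcher = HREF.matcher(html);
        while (matcher.find()) {
            links.add(matcher.group(1));
        }
        return links;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("usage: java PageReaderSketch <url>");
            return;
        }
        // Opens a network connection to the URL given on the command line.
        try (InputStream in = URI.create(args[0]).toURL().openStream()) {
            for (String link : extractLinks(readAll(in))) {
                System.out.println(link);
            }
        }
    }
}
```

The links come back exactly as written in the page, so relative ones such as `/about` still need resolving against the page URL before they can be fetched. Storing them with JDBC is left out here because it depends on which database and driver you pick.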

Black Knight
Light Poster
25 posts since Mar 2005
Reputation Points: 10
Solved Threads: 0
 

I think the java.sun site had a tutorial on creating one of these. This is actually the next project I want to take up!

server_crash
Postaholic
2,111 posts since Jun 2004
Reputation Points: 113
Solved Threads: 20
 

Hi Black Knight, I am also building a web crawler in Java as project work. Can you guide me? I am new to Java.

Dark Master
Newbie Poster
8 posts since Aug 2005
Reputation Points: 10
Solved Threads: 0
 

Hi everyone,

This is a topic I created at Wizard Solutions that has the entire source code and extensive explanations on creating your own web crawlers using Java.

Click on the links in that post.

Here is the link:

http://www.wizardsolutionsusa.com/forum/showthread.php?t=29

Richard West

freesoft_2000
Practically a Master Poster
623 posts since Jun 2004
Reputation Points: 25
Solved Threads: 10
 

Can anyone help me out? I want to build a web crawler.

shubh_9797
Newbie Poster
1 post since Jun 2008
Reputation Points: 8
Solved Threads: 0
 

Only if you start a new thread for your request and demonstrate that you have made some effort on your own.

Ezzaral
Posting Genius
Moderator
15,986 posts since May 2007
Reputation Points: 3,250
Solved Threads: 847
 

I am having a problem accessing the link: http://www.wizardsolutionsusa.com/forum/showthread.php?t=29
I also need the information on that page.

giaBaloch
Newbie Poster
1 post since Sep 2009
Reputation Points: 10
Solved Threads: 0
 

That link doesn't work, and I can't find the Wizard Solutions website to try to join or whatever. How do you get to Wizard Solutions?

shelley7753
Newbie Poster
2 posts since Nov 2009
Reputation Points: 10
Solved Threads: 0
 

The post is over four years old. Not everything on the web persists forever.

Ezzaral
Posting Genius
Moderator
15,986 posts since May 2007
Reputation Points: 3,250
Solved Threads: 847
 

Actually, I wasn't interested in that topic anymore.

shelley7753
Newbie Poster
2 posts since Nov 2009
Reputation Points: 10
Solved Threads: 0
 

Hi, I am Akash and I am designing my web crawler in Java. Can you please send me the code? My ID is .... Thanks.

akash210
Newbie Poster
1 post since Nov 2009
Reputation Points: 10
Solved Threads: 0
 

Here is the code

peter_budo
Code tags enforcer
Moderator
15,436 posts since Dec 2004
Reputation Points: 2,806
Solved Threads: 902
 
