Over the next few months I intend to learn Java with the purpose of building my own simple web crawler/spider. I have seen a few open source spiders but would like to build my own if possible.
What I would like to ask is: how should I go about learning Java, and would building a simple spider be very hard?
My requirements for the spider are as follows:
Go to the entered URL and gather all content from the site
Collect link structure
The app I am developing will need to be able to build a structured sitemap of the specified URL.
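To make those requirements concrete, here is a rough sketch of how the crawl and the link structure could fit together. It is not a finished spider: the class name SimpleSpider, the method names and the maxPages limit are all made up for illustration, the link matching uses a naive regular expression rather than a proper HTML parser, and the page download is passed in as a function so the crawl logic can be tried without touching the network.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class for illustration; not taken from any existing spider.
public class SimpleSpider {
    // Naive href matcher (assumption: quoted attribute values only).
    // A real crawler should use an HTML parser instead of a regex.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    /** Returns the links found in html as absolute URLs, resolved against base. */
    public static List<String> extractLinks(String base, String html) {
        List<String> links = new ArrayList<>();
        URI baseUri = URI.create(base);
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String href = m.group(1).trim();
            int hash = href.indexOf('#');          // drop any #fragment part
            if (hash >= 0) {
                href = href.substring(0, hash);
            }
            if (href.isEmpty()) {
                continue;
            }
            try {
                links.add(baseUri.resolve(href).toString());
            } catch (IllegalArgumentException e) {
                // Malformed link: skip it rather than abort the crawl.
            }
        }
        return links;
    }

    /**
     * Breadth-first crawl from start, following only links on the same host.
     * Returns each visited page mapped to its outgoing links, which is the
     * raw material for a sitemap. The fetcher returns a page's HTML, or null
     * if the page could not be loaded.
     */
    public static Map<String, List<String>> crawl(
            String start, Function<String, String> fetcher, int maxPages) {
        Map<String, List<String>> siteMap = new LinkedHashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        String host = URI.create(start).getHost();
        queue.add(start);
        seen.add(start);
        while (!queue.isEmpty() && siteMap.size() < maxPages) {
            String url = queue.poll();
            String html = fetcher.apply(url);
            if (html == null) {
                continue;
            }
            List<String> links = extractLinks(url, html);
            siteMap.put(url, links);
            for (String link : links) {
                String linkHost = URI.create(link).getHost();
                // seen.add returns false for URLs already queued, so no page is visited twice.
                if (host != null && host.equals(linkHost) && seen.add(link)) {
                    queue.add(link);
                }
            }
        }
        return siteMap;
    }
}
```

For a real run the fetcher would download the page, for example by wrapping java.net.http.HttpClient (Java 11 and later), and you would also want to respect robots.txt and add a delay between requests. The returned map keeps pages in the order they were visited, so printing it gives a first, flat version of the sitemap.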
One final question: how would I go about building a browser add-on? What languages can they be built in, and which browser is best/easiest to develop for?
I have built a Java web crawler/spider before, with a front end resembling Google, for a previous uni project. I would say it is a moderately hard program to attempt: not overly difficult, but a definite challenge for a new Java coder.
Some of the main things you will need to learn for this are I/O streams, to read the URLs in, and JDBC, so that you can store the data (you could do it by reading into an array/vector, but I wouldn't recommend it as it would eat memory).
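As a rough idea of what those two pieces look like in code, here is a minimal sketch. The class name PageStore and the pages table with its url and content columns are assumptions made up for this example, not anything from my old project, and you would need a JDBC driver for whatever database you pick.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Hypothetical helper class for illustration.
public class PageStore {

    /** Reads an entire stream as UTF-8 text, one line at a time. */
    public static String readAll(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        // try-with-resources closes the reader (and the underlying stream) for us.
        try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    /**
     * Stores one page through JDBC. Assumes a table created along the lines of
     * "CREATE TABLE pages (url VARCHAR(2048), content TEXT)"; the table and
     * column names are made up for this sketch, and the exact column types
     * depend on the database you use.
     */
    public static void save(Connection conn, String url, String content) throws SQLException {
        String sql = "INSERT INTO pages (url, content) VALUES (?, ?)";
        // A PreparedStatement keeps the page text out of the SQL string itself.
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, url);
            ps.setString(2, content);
            ps.executeUpdate();
        }
    }
}
```

To read a live page you could pass in the stream from URI.create(address).toURL().openStream(), and the Connection would come from DriverManager.getConnection with the JDBC URL of your database. Writing each page out as you go is what keeps the memory use down compared with holding everything in an array.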
There is loads on the web about spider methods and algorithms such as word ranking, but I am sure you have already read up on how they work.
It is probably quite a good project, as you could make it on the command line and then redo it with a GUI later if you wanted to.
As for browser plugins, I would probably go for a Firefox plugin, but then again, why stop at a search engine? Why not build your own browser too? :)