Hi, I would like to enquire about web crawlers. I have a school assignment where I need to crawl websites, but the crawl has to be specific: display the first 10 URLs of websites with keywords such as IT products around a shopping street/district.

I managed to find tutorials on crawling a URL, but I have to specify the URL before crawling begins.

I am not supposed to use any third-party/open-source tools or software.

How do i begin?

Does it need to be in Java?

Anyway, what's the problem with specifying a URL for the crawler? Can you obtain the page content from the crawler? If so, there is no need to worry. You can easily parse all the links in a page's content, save them in whatever data structure you want, and pass each of them to the crawler again. Be careful about one thing: you need to keep a record of which URLs you have already visited, or you could end up in an infinite loop.
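Since Java came up above, here is a minimal Java sketch of that loop: a queue of URLs to visit plus a set of already-visited URLs, which is what prevents the infinite loop. The `FAKE_WEB` map and its URLs are made-up stand-ins for real HTTP fetching, just to keep the sketch self-contained.

```java
import java.util.*;
import java.util.regex.*;

// Sketch of the crawl loop described above. fetchPage() is a stand-in for
// real HTTP fetching (e.g. via java.net.HttpURLConnection); the pages and
// URLs here are invented for illustration.
public class CrawlSketch {
    static final Map<String, String> FAKE_WEB = Map.of(
        "http://a.example", "<a href=\"http://b.example\">b</a> <a href=\"http://a.example\">self</a>",
        "http://b.example", "<a href=\"http://a.example\">back</a>"
    );

    static String fetchPage(String url) {
        return FAKE_WEB.getOrDefault(url, "");
    }

    // Very rough link extraction; a real crawler should use a proper HTML parser.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = Pattern.compile("href=\"(http[^\"]+)\"").matcher(html);
        while (m.find()) links.add(m.group(1));
        return links;
    }

    public static List<String> crawl(String seed, int limit) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(seed);
        while (!queue.isEmpty() && visited.size() < limit) {
            String url = queue.poll();
            if (!visited.add(url)) continue; // already seen: skip, avoids infinite loops
            for (String link : extractLinks(fetchPage(url))) {
                if (!visited.contains(link)) queue.add(link);
            }
        }
        return new ArrayList<>(visited);
    }

    public static void main(String[] args) {
        System.out.println(crawl("http://a.example", 10));
    }
}
```

The `limit` parameter doubles as the "first 10 URLs" cut-off from the assignment: once 10 URLs have been visited, the loop stops.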

I would like to seek your advice.
If I want to crawl sites within a country, for example Australia, how do I go about doing this?

Do I provide a list of seed URLs to begin with? E.g. www.aussie-shipping.com, www.australian-shopping.com
How do I determine that I am crawling sites in Australia and not anywhere else?
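One cheap first filter, as a sketch: only follow links whose host ends in the .au country-code TLD. This is only a heuristic of my own, not a complete answer; it misses Australian businesses hosted on .com, so you would combine it with something stronger such as IP-based geolocation.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Heuristic filter: accept a URL only if its host is under the .au ccTLD.
// This under-counts (Australian sites on .com are missed), so treat it as
// a first pass, not a definitive test.
public class AuFilter {
    public static boolean looksAustralian(String url) {
        try {
            String host = new URI(url).getHost();
            return host != null && (host.equals("au") || host.endsWith(".au"));
        } catch (URISyntaxException e) {
            return false; // malformed URL: skip it
        }
    }

    public static void main(String[] args) {
        System.out.println(looksAustralian("http://www.example.com.au/shop")); // true
        System.out.println(looksAustralian("http://www.example.com/shop"));    // false
    }
}
```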

I read somewhere that we can access a domain name server to get a list of starting URLs. Where do I get the URL of the domain name server?

If you could crawl by IP address instead, I have attached a text file with the whole list of IP ranges for Australia. You can obtain the IP ranges from this site: http://www.find-ip-address.org/ip-country/. Hope this helps.
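A sketch of how such an IP-range list could be used: convert dotted IPv4 addresses to 32-bit integers and test whether a resolved address falls inside any (start, end) range. The range below is a made-up example for illustration, not a value from the actual attached list.

```java
// Check whether an IPv4 address falls inside any known range.
// RANGES holds a single hypothetical example entry, not real data from
// the attached Australian list; load the real file into it in practice.
public class IpRangeCheck {
    static long toLong(String ip) {
        String[] p = ip.split("\\.");
        return (Long.parseLong(p[0]) << 24) | (Long.parseLong(p[1]) << 16)
             | (Long.parseLong(p[2]) << 8)  |  Long.parseLong(p[3]);
    }

    // Each row is {rangeStart, rangeEnd}, both inclusive (example values only).
    static final String[][] RANGES = { {"1.120.0.0", "1.127.255.255"} };

    public static boolean inRanges(String ip) {
        long n = toLong(ip);
        for (String[] r : RANGES) {
            if (n >= toLong(r[0]) && n <= toLong(r[1])) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(inRanges("1.123.4.5")); // true
        System.out.println(inRanges("8.8.8.8"));   // false
    }
}
```

You would resolve each candidate URL's host to an IP (e.g. with `java.net.InetAddress.getByName`) and keep only those whose address passes `inRanges`.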