Hi, I would like to enquire about web crawlers. I have a school assignment where I need to crawl websites, but the crawl has to be specific: display the first 10 URLs of websites with keywords such as IT products around a shopping street/district.

I managed to find tutorials on crawling a URL, but I have to specify the URL before crawling begins.

I am not supposed to use any third-party/open-source tools or software.

How do i begin?

Does it need to be in Java?

Anyway, what's the problem with specifying a URL for the crawler? Can you obtain the page content from the crawler? If so, there is no need to worry. You can easily parse all the links in a page's content, save them in whatever data structure you want, and pass each of them to the crawler again. Be careful about one thing: you need to keep a record of which URLs you have already visited, or you could end up in an infinite loop.
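Since Java came up above, here is a minimal Java sketch of that loop: a queue of URLs to visit plus a set of already-visited URLs, which is what prevents the infinite loop. The `FAKE_WEB` map and its URLs are made-up stand-ins for real HTTP fetching, just to keep the sketch self-contained.

```java
import java.util.*;
import java.util.regex.*;

// Sketch of the crawl loop described above. fetchPage() is a stand-in for
// real HTTP fetching (e.g. via java.net.HttpURLConnection); the pages and
// URLs here are invented for illustration.
public class CrawlSketch {
    static final Map<String, String> FAKE_WEB = Map.of(
        "http://a.example", "<a href=\"http://b.example\">b</a> <a href=\"http://a.example\">self</a>",
        "http://b.example", "<a href=\"http://a.example\">back</a>"
    );

    static String fetchPage(String url) {
        return FAKE_WEB.getOrDefault(url, "");
    }

    // Very rough link extraction; a real crawler should use a proper HTML parser.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = Pattern.compile("href=\"(http[^\"]+)\"").matcher(html);
        while (m.find()) links.add(m.group(1));
        return links;
    }

    public static List<String> crawl(String seed, int limit) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(seed);
        while (!queue.isEmpty() && visited.size() < limit) {
            String url = queue.poll();
            if (!visited.add(url)) continue; // already seen: skip, avoids infinite loops
            for (String link : extractLinks(fetchPage(url))) {
                if (!visited.contains(link)) queue.add(link);
            }
        }
        return new ArrayList<>(visited);
    }

    public static void main(String[] args) {
        System.out.println(crawl("http://a.example", 10));
    }
}
```

The `limit` parameter doubles as the "first 10 URLs" cut-off from the assignment: once 10 URLs have been visited, the loop stops.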

I would like to seek your advice.
If I want to crawl sites within a country, for example Australia, how do I go about doing this?

Do I provide a list of seed URLs to begin with? E.g. www.aussie-shipping.com, www.australian-shopping.com
How do I determine that I am crawling sites in Australia and not anywhere else?
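One cheap first filter, as a sketch: only follow links whose host ends in the .au country-code TLD. This is only a heuristic of my own, not a complete answer; it misses Australian businesses hosted on .com, so you would combine it with something stronger such as IP-based geolocation.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Heuristic filter: accept a URL only if its host is under the .au ccTLD.
// This under-counts (Australian sites on .com are missed), so treat it as
// a first pass, not a definitive test.
public class AuFilter {
    public static boolean looksAustralian(String url) {
        try {
            String host = new URI(url).getHost();
            return host != null && (host.equals("au") || host.endsWith(".au"));
        } catch (URISyntaxException e) {
            return false; // malformed URL: skip it
        }
    }

    public static void main(String[] args) {
        System.out.println(looksAustralian("http://www.example.com.au/shop")); // true
        System.out.println(looksAustralian("http://www.example.com/shop"));    // false
    }
}
```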

I read somewhere that we can access a domain name server to get a list of starting URLs. Where do I get the URL of the domain name server?

If you could crawl by IP address instead, I have attached a text file with the whole list of IP ranges for Australia. You can obtain the IP ranges from this site: http://www.find-ip-address.org/ip-country/. Hope this helps.
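A sketch of how such an IP-range list could be used: convert dotted IPv4 addresses to 32-bit integers and test whether a resolved address falls inside any (start, end) range. The range below is a made-up example for illustration, not a value from the actual attached list.

```java
// Check whether an IPv4 address falls inside any known range.
// RANGES holds a single hypothetical example entry, not real data from
// the attached Australian list; load the real file into it in practice.
public class IpRangeCheck {
    static long toLong(String ip) {
        String[] p = ip.split("\\.");
        return (Long.parseLong(p[0]) << 24) | (Long.parseLong(p[1]) << 16)
             | (Long.parseLong(p[2]) << 8)  |  Long.parseLong(p[3]);
    }

    // Each row is {rangeStart, rangeEnd}, both inclusive (example values only).
    static final String[][] RANGES = { {"1.120.0.0", "1.127.255.255"} };

    public static boolean inRanges(String ip) {
        long n = toLong(ip);
        for (String[] r : RANGES) {
            if (n >= toLong(r[0]) && n <= toLong(r[1])) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(inRanges("1.123.4.5")); // true
        System.out.println(inRanges("8.8.8.8"));   // false
    }
}
```

You would resolve each candidate URL's host to an IP (e.g. with `java.net.InetAddress.getByName`) and keep only those whose address passes `inRanges`.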