
Accessing redirected pages through a Java crawler

Hi forum,
I'm developing a simple web crawler in Java. Given a URL, the crawler downloads the corresponding web page and continues the process from there. However, I'm having a problem accessing web pages that redirect to a different URL. One example is www.telegraphindia.com, where a new part gets added to the original URL. Can anybody help? Thanks in advance.

Dark Master
Newbie Poster
8 posts since Aug 2005
Reputation Points: 10
Solved Threads: 0
 

Let's look at the response when requesting this page:

HTTP/1.1 302 Object moved
Date: Tue, 13 Sep 2005 16:06:26 GMT
Server: Microsoft-IIS/6.0
X-Powered-By: ASP.NET
Location: section/frontpage/index.asp
Content-Length: 148
Content-Type: text/html
Set-Cookie: ASPSESSIONIDACTBSRRB=FALIOKOCIJNCLJAPOONLFLCF; path=/
Cache-control: private

<head><title>Object moved</title></head>
<body><h1>Object Moved</h1>This object may be found <a HREF="section/frontpage/index.asp">here</a>.</body>


The status line shows a response code of 302. A 302 response includes a "Location" header that says where the content can actually be found (if the response is properly formed, that is). As you can see, the Location here is "section/frontpage/index.asp". That is a relative URL, so it has to be resolved against the URL you originally requested. Assuming you asked for the site root, that gives http://www.telegraphindia.com/section/frontpage/index.asp. All you need to do is request that page from the same host to get the information you want.
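To make the "request that page from the same domain" step concrete, here is a minimal sketch of turning the Location value into a full URL with java.net.URI. The class name RedirectResolver and the method resolveLocation are made up for illustration, and it assumes the original request went to the site root over plain http.

```java
import java.net.URI;

// Hypothetical helper (not from the thread): resolves a Location header
// value against the URL that was originally requested.
class RedirectResolver {

    static String resolveLocation(String requestUrl, String location) {
        URI base = URI.create(requestUrl);
        // URI.resolve needs a non-empty base path to join relative paths
        // correctly, so "http://host" is treated as "http://host/".
        if (base.getPath() == null || base.getPath().isEmpty()) {
            base = base.resolve("/");
        }
        // An absolute Location is returned as-is; a relative one is
        // joined onto the base URL.
        return base.resolve(location).toString();
    }

    public static void main(String[] args) {
        // Assumed request URL; the Location value is the one from the response above.
        System.out.println(resolveLocation(
                "http://www.telegraphindia.com/",
                "section/frontpage/index.asp"));
        // prints http://www.telegraphindia.com/section/frontpage/index.asp
    }
}
```

The same method also copes with servers that send an absolute URL in Location, so the crawler does not need to care which form it gets.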

chrisbliss18
Posting Shark
917 posts since Aug 2005
Reputation Points: 38
Solved Threads: 25
 

Thanks Chris for your reply, but I don't know how to implement your suggestion. I have created an InputStream object and used the url.openStream() method to read the contents of the page. Can you suggest how I can capture the redirected portion of the URL? Also, how can I find out the HTTP response codes that you showed? Please help.

Dark Master
Newbie Poster
8 posts since Aug 2005
Reputation Points: 10
Solved Threads: 0
 

You should use the HttpURLConnection class for requesting pages over HTTP. It has a static setFollowRedirects method (and a per-connection setInstanceFollowRedirects method) that tells the class whether to follow redirects automatically. The class has many other methods you will find helpful, such as getResponseCode, getResponseMessage, and getHeaderField, which let you read the status and header information from the response.
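If you would rather see each redirect yourself than have it followed silently, something along these lines should work. It is only a sketch: the class name RedirectFetcher, the isRedirect helper, and the limit of 5 hops are my own assumptions, not anything the crawler requires.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical sketch: follows redirects by hand so the crawler can log
// every status code and Location header along the way.
class RedirectFetcher {

    static final int MAX_REDIRECTS = 5; // assumed limit to avoid redirect loops

    static boolean isRedirect(int status) {
        return status == HttpURLConnection.HTTP_MOVED_PERM  // 301
            || status == HttpURLConnection.HTTP_MOVED_TEMP  // 302
            || status == HttpURLConnection.HTTP_SEE_OTHER   // 303
            || status == 307 || status == 308;
    }

    // Returns the connection for the first non-redirect response.
    static HttpURLConnection open(String startUrl) throws IOException {
        URI current = URI.create(startUrl);
        for (int hop = 0; hop <= MAX_REDIRECTS; hop++) {
            // Give the URL a "/" path if it has none, so relative
            // Location values resolve correctly later on.
            if (current.getPath() == null || current.getPath().isEmpty()) {
                current = current.resolve("/");
            }
            HttpURLConnection conn =
                    (HttpURLConnection) current.toURL().openConnection();
            conn.setInstanceFollowRedirects(false); // handle redirects ourselves
            int status = conn.getResponseCode();
            System.out.println(status + " " + conn.getResponseMessage()
                    + " <- " + current);
            if (!isRedirect(status)) {
                return conn; // read the page with conn.getInputStream()
            }
            String location = conn.getHeaderField("Location");
            conn.disconnect();
            if (location == null) {
                throw new IOException("Redirect without Location: " + current);
            }
            current = current.resolve(location); // handles relative values
        }
        throw new IOException("Too many redirects starting at " + startUrl);
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 1) {
            HttpURLConnection conn = open(args[0]);
            System.out.println("Final URL: " + conn.getURL());
            conn.disconnect();
        }
    }
}
```

In the crawler, conn.getInputStream() on the returned connection takes the place of url.openStream(), and conn.getURL() gives the address the page finally came from.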

chrisbliss18
Posting Shark
917 posts since Aug 2005
Reputation Points: 38
Solved Threads: 25
 
