I'm writing a program that will download PDF files and collate them. However, the URLs I get are a bit weird. They look like this:

http://links.ealert.nature.com/ctt?kn=105&m=34136651&r=MjA1NzczMzM4NgS2&b=0&j=NTkwOTY5NjQS1&mt=1&rt=0

I'm using URLConnection and BufferedInputStream, but no data is read from the stream (read() returns -1 on the first call). If such a URL is embedded in a web page like this:

<a href="http://links.ealert.nature.com/ctt?kn=105&m=34136651&r=MjA1NzczMzM4NgS2&b=0&j=NTkwOTY5NjQS1&mt=1&rt=0">pdf</a>

then the browser resolves it to a direct URL:

http://www.nature.com/nature/journal/v461/n7265/pdf/461697a.pdf

How can I do the same in a Java program? I suspect there's some sort of link server involved, but I don't know how to communicate with it.


The browser is not "resolving" anything. The site is "redirecting".

So, are you using URLConnection or HttpURLConnection? HttpURLConnection will follow redirects by default; URLConnection won't.
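If you want to see for yourself what the server is doing, here's a minimal sketch (using the URL from your post) that switches automatic redirect handling off and prints the status code and Location header by hand:

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class RedirectProbe {
    public static void main(String[] args) throws IOException {
        URL url = new URL("http://links.ealert.nature.com/ctt?kn=105&m=34136651&r=MjA1NzczMzM4NgS2&b=0&j=NTkwOTY5NjQS1&mt=1&rt=0");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setInstanceFollowRedirects(false); // handle the redirect ourselves
        int status = conn.getResponseCode();
        System.out.println("Status: " + status);
        if (status >= 300 && status < 400) {
            // For a 3xx response the Location header holds the direct URL.
            System.out.println("Location: " + conn.getHeaderField("Location"));
        }
        conn.disconnect();
    }
}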

If you are trying to read it line by line, it won't work. Try reading it byte by byte, as follows:

import java.io.ByteArrayOutputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

URL url = new URL("..........");
HttpURLConnection conn = (HttpURLConnection) url.openConnection();

InputStream in = conn.getInputStream();

// Collect the response one byte at a time.
ByteArrayOutputStream bos = new ByteArrayOutputStream();

int i;
while ((i = in.read()) != -1) {
    bos.write(i);
}
in.close();

// Dump the collected bytes to disk.
byte[] b = bos.toByteArray();
FileOutputStream fos = new FileOutputStream("c:\\temp\\test.pdf");
fos.write(b);
fos.close();
conn.disconnect();
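One thing I'd add (just a sketch, assuming the server labels PDFs as application/pdf): check the response code and content type before saving, so an error page doesn't end up on disk pretending to be a PDF:

if (conn.getResponseCode() == HttpURLConnection.HTTP_OK
        && String.valueOf(conn.getContentType()).startsWith("application/pdf")) {
    // ... safe to read the stream and save it as a .pdf
}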

Thank you for your answers. It doesn't seem to solve the problem, however. The code I'm using is this:

public static void download(String urlS, File destination) throws IOException {
    BufferedInputStream bis = null;
    BufferedOutputStream bos = null;
    try {
        URL url = new URL(urlS);
        HttpURLConnection urlc = (HttpURLConnection) url.openConnection();
        System.out.println("Response message: " + urlc.getResponseMessage());
        System.out.println("Follow redirects: " + HttpURLConnection.getFollowRedirects());
        System.out.println("Content length: " + urlc.getContentLength());
        System.out.println("Content type: " + urlc.getContentType());

        bis = new BufferedInputStream(urlc.getInputStream());
        // Write to the destination File itself, not just its name,
        // so the file isn't silently created in the working directory.
        bos = new BufferedOutputStream(new FileOutputStream(destination));

        int i;
        while ((i = bis.read()) != -1) {
            bos.write(i);
        }
    } finally {
        if (bis != null) {
            try {
                bis.close();
            } catch (IOException ioe) {
                ioe.printStackTrace();
            }
        }
        if (bos != null) {
            try {
                bos.close();
            } catch (IOException ioe) {
                ioe.printStackTrace();
            }
        }
    }
}

I don't think the way I read the stream is the problem, because the same code works perfectly well when I supply the direct URL. The output I get from the first URL is:

Response message: OK
Follow redirects: true
Content length: 0
Content type: text/plain; charset=UTF-8

and I get an empty file.

If I supply the second, direct URL I get this:

Response message: OK
Follow redirects: true
Content length: -1
Content type: application/pdf

and a nice pdf file.

HttpURLConnection also doesn't seem to follow the redirection, or whatever it is.

I should just add that if you paste the first URL into the browser's address bar, it also returns a completely empty page. It only works if this URL is followed from a link!
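(One guess on my part: maybe the link server checks the Referer header to tell a clicked link from a pasted URL. If so, setting it by hand before connecting might help; the value below is purely hypothetical:

HttpURLConnection conn = (HttpURLConnection) url.openConnection();
// Hypothetical referrer; the real e-alert page URL would go here.
conn.setRequestProperty("Referer", "http://www.nature.com/");

I haven't verified this, though.)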

That information seems to encapsulate a session, so Google around a bit and find out how to maintain a session using HttpURLConnection.
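If the session is cookie-based, a minimal sketch is to install a default CookieManager before opening any connections; every HttpURLConnection opened afterwards stores and resends cookies automatically:

import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;

// JVM-wide cookie store: every HttpURLConnection opened after this call
// keeps and resends cookies, which is usually enough to hold a session.
CookieHandler.setDefault(new CookieManager(null, CookiePolicy.ACCEPT_ALL));

Whether this particular link server uses cookies at all is an assumption, of course.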

It seems that the PDF file is created dynamically using formatting objects.

Have you ever tried this:

con = (HttpURLConnection)url.openConnection();
con.setRequestMethod("GET");
con.connect();

if (con.getResponseCode() == HttpURLConnection.HTTP_OK)
{
.......
.......
......

}

Hi,
Your code above works perfectly.
I have tested it with NetBeans and it's OK.
I think this thread is solved.

I got it to work!

@moutanna: The code doesn't work. HttpURLConnection uses GET by default.

Thanks to masijade for the session suggestion! I looked for ways of maintaining the session and found a web testing suite, HttpUnit (http://httpunit.sourceforge.net/index.html). It has convenient classes for traversing web pages as if your program were a user in front of a browser, and it maintains the session automatically. Easy! Here's the code that fetches the page containing those weird links and downloads all the PDFs:

WebConversation wc = new WebConversation();
WebResponse indexResp = wc.getResource(new GetMethodWebRequest(url));

// Collect all links on the index page; on a parse failure we just
// end up with an empty array and download nothing.
WebLink[] links = new WebLink[0];
try {
    links = indexResp.getLinks();
} catch (SAXException ex) {
    Logger.getLogger(Main.class.getName()).log(Level.SEVERE, null, ex);
}

System.out.println("Downloading the PDFs...");

for (WebLink link : links) {
    // Only follow the links labelled "PDF".
    if (!link.getText().contentEquals("PDF"))
        continue;
    try {
        // Clicking the link lets HttpUnit carry the session through
        // the link server's redirect to the real PDF URL.
        link.click();
    } catch (SAXException ex) {
        Logger.getLogger(Main.class.getName()).log(Level.SEVERE, null, ex);
    }

    WebResponse resp = wc.getCurrentPage();
    // Name the local file after the last path segment of the final URL.
    String fileName = resp.getURL().getFile();
    fileName = fileName.substring(fileName.lastIndexOf("/") + 1);
    System.out.println(fileName);

    File file = new File(fileName);

    BufferedInputStream bis = new BufferedInputStream(resp.getInputStream());
    BufferedOutputStream bos = new BufferedOutputStream(new FileOutputStream(file));

    int i;
    while ((i = bis.read()) != -1) {
        bos.write(i);
    }
    bis.close();
    bos.close();
}
System.out.println("Done downloading.");