So I'm using code from http://schmidt.devlib.org/java/file-download.html in the hope of supplying a URL and getting back the file that URL points to. It works sometimes; about 90% of the time it doesn't. Also, a lot of the pages I want to download have URLs that don't end in .html, although I'm not sure whether that matters. Anyway, what I want to know is whether it's possible to supply a URL as a string and then download the file that URL points to. I have a feeling there's an easier way to do this, but I'm not sure.

Here's an example of a page that I want to download:

https://www.sportsbet.com.au/results/racing/Date/today

I'm writing a little program that downloads HTML files, scrapes the important information out of them, and writes it to a database. So far, as you might have guessed, things aren't going too well.

import java.io.*;
import java.net.*;

/*
 * Command line program to download data from URLs and save
 * it to local files. Run like this:
 * java FileDownload http://schmidt.devlib.org/java/file-download.html
 * @author Marco Schmidt
 */
public class FileDownload {
	public static void download(String address, String localFileName) {
		OutputStream out = null;
		URLConnection conn = null;
		InputStream  in = null;
		
		SocketAddress sa = new InetSocketAddress("proxy.csu.edu.au", 8080);
		Proxy proxy = new Proxy(Proxy.Type.HTTP, sa);
		
		try {
			URL url = new URL(address);
			out = new BufferedOutputStream(
				new FileOutputStream(localFileName));
			conn = url.openConnection(proxy);
			in = conn.getInputStream();
			byte[] buffer = new byte[1024];
			int numRead;
			long numWritten = 0;
			while ((numRead = in.read(buffer)) != -1) {
				out.write(buffer, 0, numRead);
				numWritten += numRead;
			}
			System.out.println(localFileName + "\t" + numWritten);
		} catch (Exception exception) {
			exception.printStackTrace();
		} finally {
			try {
				if (in != null) {
					in.close();
				}
				if (out != null) {
					out.close();
				}
			} catch (IOException ioe) {
				// Ignore errors thrown while closing the streams.
			}
		}
	}

	public static void download(String address) {
		int lastSlashIndex = address.lastIndexOf('/');
		if (lastSlashIndex >= 0 &&
		    lastSlashIndex < address.length() - 1) {
			download(address, address.substring(lastSlashIndex + 1));
		} else {
			System.err.println("Could not figure out local file name for " +
				address);
		}
	}

	public static void main(String[] args) {
		download("http://schmidt.devlib.org/java/file-download.html");
	}
}

Angus Cheng

Looks like I got it to work. It downloads files with no problem: JPGs, HTML files. The only problem is that a lot of pages, especially horse racing results pages, don't have URLs that end in a filename.

The code fails if the URL does not end in a filename. So if I were to input:

https://www.sportsbet.com.au/results/racing/Date/today

the code will fail.

Any ideas? In the meantime I'm scouring the internet looking for someone who publishes horse racing results at links ending in .html.

Thanks in advance,
Angus
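
One way around the missing-filename problem is to derive the local name from the whole URL instead of from its last path segment. A minimal sketch (the safeFileName helper is my own invention, not part of the code above):

import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class SafeName {
	// Percent-encode the whole URL so it becomes a single
	// filesystem-safe token, then tack on an extension.
	static String safeFileName(String address) throws UnsupportedEncodingException {
		return URLEncoder.encode(address, "UTF-8") + ".html";
	}

	public static void main(String[] args) throws Exception {
		// Prints https%3A%2F%2Fwww.sportsbet.com.au%2Fresults%2Fracing%2FDate%2Ftoday.html
		System.out.println(safeFileName("https://www.sportsbet.com.au/results/racing/Date/today"));
	}
}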

I'll give you a place to start.

The problem you are running into is with URLs that use "URL rewriting".

This isn't specific to horse racing; it's widely used. But there's a place to start.

> The code fails if the url does not end in a filename. So if I were to input:

You have a URL and it points to a resource; filename or not, it doesn't matter. There are multiple representations exposing the same resource, and the different representations don't matter as long as they expose the required resource: http://www.google.com/ or http://www.google.com/pages/index.html or http://www.google.com/index.html makes no difference as long as I get the resource index.html.

> https://www.sportsbet.com.au/results/racing/Date/today

Hint: secure resources require that authentication information be supplied before you can access them.
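
If the server really does want credentials, here's a minimal sketch of attaching an HTTP Basic Authorization header to the connection before reading it (the user/pass values are placeholders, and java.util.Base64 needs Java 8 or later):

import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class BasicAuthFetch {
	public static InputStream open(String address, String user, String pass) throws Exception {
		URLConnection conn = new URL(address).openConnection();
		// HTTP Basic auth: base64("user:password") in the Authorization header.
		String token = Base64.getEncoder()
			.encodeToString((user + ":" + pass).getBytes(StandardCharsets.UTF_8));
		conn.setRequestProperty("Authorization", "Basic " + token);
		return conn.getInputStream();
	}
}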

> The code fails if the url does not end in a filename.

Fails in what sense? Don't you get a response or are you unable to come up with a file name? If it is the latter, you need to come up with a better design than mapping remote resources to file names. If your final motive is scraping data from a web page, dump the file name convention and either directly process the HTTP response or come up with a better naming scheme.
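
For scraping, something like this skips local files entirely and pulls the response straight into a String. A sketch only; it assumes the page is UTF-8, whereas a real scraper should read the charset out of the Content-Type header:

import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.URL;

public class FetchPage {
	public static String fetch(String address) throws Exception {
		InputStream in = new URL(address).openConnection().getInputStream();
		ByteArrayOutputStream buf = new ByteArrayOutputStream();
		byte[] chunk = new byte[1024];
		int numRead;
		// Accumulate the whole response in memory instead of on disk.
		while ((numRead = in.read(chunk)) != -1) {
			buf.write(chunk, 0, numRead);
		}
		in.close();
		return buf.toString("UTF-8"); // assumes UTF-8
	}

	public static void main(String[] args) throws Exception {
		String html = fetch("https://www.sportsbet.com.au/results/racing/Date/today");
		System.out.println(html.length() + " characters fetched");
	}
}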

> The problem you are running into is with indexing urls that are "urlrewrite"

That isn't a problem; it's just a way of exposing meaningful representations to the client. It's a technique widely used to create user- and search-engine-friendly URLs. Plus, it is the server's responsibility to process the URL in the way it was configured to and serve the client the relevant payload.

Thanks for your replies, I've now solved the problem and everything is working great. Basically I'm a stupid idiot and spent a lot of time on something really simple.

What I did was very simple.

1. I didn't bother with proxy settings and used a direct internet connection (might bite me in the *** later).
2. There are two download methods in the above code.

download(String address);
download(String address, String outputFileName);

At first I was calling the first download method, which extracts a filename from the address and then calls the second one.

So now I supply the address of the page I want to download and hardcode an outputFileName. Everything works, and just in case anyone out there is as stupid as me (not likely), here it is:

import java.io.*;
import java.net.*;
 
/*
 * Command line program to download data from URLs and save
 * it to local files. Run like this:
 * java FileDownload [SOME SORT OF ADDRESS]
 * @author Marco Schmidt
 */
public class FileDownload {
	public static void download(String address, String localFileName) {
		OutputStream out = null;
		URLConnection conn = null;
		InputStream  in = null;
		
		//SocketAddress sa = new InetSocketAddress("proxy.csu.edu.au", 8080);
		//Proxy proxy = new Proxy(Proxy.Type.HTTP, sa);
		
		try {
			URL url = new URL(address);
			out = new BufferedOutputStream(
				new FileOutputStream(localFileName));
			//conn = url.openConnection(proxy);
			conn = url.openConnection();
			in = conn.getInputStream();
			byte[] buffer = new byte[1024];
			int numRead;
			long numWritten = 0;
			while ((numRead = in.read(buffer)) != -1) {
				out.write(buffer, 0, numRead);
				numWritten += numRead;
			}
			System.out.println(localFileName + "\t" + numWritten);
		} catch (Exception exception) {
			exception.printStackTrace();
		} finally {
			try {
				if (in != null) {
					in.close();
				}
				if (out != null) {
					out.close();
				}
			} catch (IOException ioe) {
				// Ignore errors thrown while closing the streams.
			}
		}
	}
 
	public static void download(String address) {
		int lastSlashIndex = address.lastIndexOf('/');
		if (lastSlashIndex >= 0 &&
		    lastSlashIndex < address.length() - 1) {
			download(address, address.substring(lastSlashIndex + 1));
		} else {
			System.err.println("Could not figure out local file name for " +
				address);
		}
	}
 
	public static void main(String[] args) {
		download("[ADDRESS]", "jur.txt");
	}
}
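
For what it's worth, on Java 7 or later the same download can be written more compactly with try-with-resources and Files.copy, which also closes the stream automatically. A sketch, not a drop-in replacement for the class above:

import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class SimpleDownload {
	public static void download(String address, String localFileName) throws Exception {
		// try-with-resources closes the stream even if the copy fails.
		try (InputStream in = new URL(address).openStream()) {
			Files.copy(in, Paths.get(localFileName), StandardCopyOption.REPLACE_EXISTING);
		}
	}

	public static void main(String[] args) throws Exception {
		download("http://schmidt.devlib.org/java/file-download.html", "jur.txt");
	}
}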

Can you please guide me so that I can save this downloaded file to my local directory, specifying the path with a dialog box?
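
A minimal sketch of doing that with a Swing save dialog, reusing the download(address, localFileName) method from the FileDownload class in this thread:

import java.io.File;
import javax.swing.JFileChooser;

public class ChooseAndDownload {
	public static void main(String[] args) {
		JFileChooser chooser = new JFileChooser();
		chooser.setDialogTitle("Save downloaded file as...");
		// showSaveDialog returns APPROVE_OPTION when the user picks a file.
		if (chooser.showSaveDialog(null) == JFileChooser.APPROVE_OPTION) {
			File target = chooser.getSelectedFile();
			FileDownload.download("http://schmidt.devlib.org/java/file-download.html",
				target.getAbsolutePath());
		}
	}
}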

