I'm trying to develop a simple program that automatically downloads PDF files from a web server and organizes them into files. When I download any .pdf file, it is roughly 30% bigger than when I download it with a browser, and the PDF opens but does not display.

So far I've narrowed the source of the additional bytes down to:

java.io.BufferedInputStream in = new java.io.BufferedInputStream(new java.net.URL("http://www.anywhere.com/test.pdf").openStream() );

The issue is not in the writing of the file because I can open a FileInputStream from a local copy of the pdf then output it and have it open fine. So somewhere between the web server and the BufferedInputStream I'm getting additional bytes added in.

I've tried using the URI decoding function but it appears it's only for the URL, not for content.

I've tried writing the file using the char data type instead of byte, but this made no difference.

I've verified that HTML pages downloaded and saved to a .txt file exactly match what's seen when the page source is viewed (i.e. nothing is added to files containing HTML).

I know it's possible because web crawlers such as Nutch, written in Java, are able to crawl and index PDFs without changing them.

Any help would be greatly appreciated.

Thanks!

//This code works but adds additional bytes to the outputted PDF causing it not to display

import java.io.BufferedOutputStream;
import java.io.IOException;

public class Main
{
	public static void main(String[] args) throws IOException
	{
		java.io.BufferedInputStream in = new java.io.BufferedInputStream(new java.net.URL("http://www.anywhere.com/test.pdf").openStream() );
		java.io.FileOutputStream fos = new java.io.FileOutputStream("test.pdf");
		java.io.BufferedOutputStream bout = new BufferedOutputStream(fos);
		byte data[] = new byte[1024];
		
		while(in.read(data,0,1024)>=0){
			bout.write(data);
		}
		
		bout.close();
		in.close();

	}
}

All 2 Replies

If the last block you read is <1024 bytes, you still write 1024 bytes to the output stream.
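
To make that concrete, here is a minimal sketch of a copy loop that keeps the count returned by read() and writes only that many bytes. The class name CopyStream and the helper copy() are made up for illustration, and main() uses in-memory streams as a stand-in so it runs without a network connection.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class CopyStream {
    // Copies everything from in to out, writing only the bytes
    // that each read() call actually returned.
    public static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] data = new byte[1024];
        long total = 0;
        int n;
        while ((n = in.read(data, 0, data.length)) >= 0) {
            out.write(data, 0, n); // n may be smaller than data.length
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in data; in the posted program the input would be the URL stream.
        byte[] source = new byte[2500];
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long copied = copy(new ByteArrayInputStream(source), out);
        System.out.println(copied + " bytes copied, output size " + out.size());
    }
}
```

In the posted program the same loop would take the stream from new java.net.URL(...).openStream() as input and a FileOutputStream for test.pdf as output, with both closed afterwards as before.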

The difference between the file lengths far exceeds 1024 bytes, and I've already ruled out the output stream as the culprit: I tested it by running a local copy of a PDF through the program, and it came out unaltered.

In a textual comparison of the local copy and the downloaded copy, I found additional 10-digit numbers in a list towards the beginning of the document, so somewhere between the web server and the Java program the information changes. I just don't know where.
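
One possible explanation for why the local test looks clean while the download does not: a FileInputStream will usually fill the whole 1024-byte buffer on every read except the last, so the posted loop adds at most 1023 stray bytes at the end of the file, which a PDF viewer may well tolerate. A network stream often returns fewer bytes per read, and each short read leaves stale bytes from an earlier read in the rest of the buffer, all of which get written out. The sketch below reproduces the effect with in-memory data. ChunkedStream is a hypothetical stand-in for a network stream that returns at most 700 bytes per read, and the 10,000-byte array stands in for the PDF.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ShortReadDemo {
    // Hypothetical stand-in for a network stream:
    // never returns more than maxChunk bytes from a single read.
    public static class ChunkedStream extends FilterInputStream {
        private final int maxChunk;

        public ChunkedStream(InputStream in, int maxChunk) {
            super(in);
            this.maxChunk = maxChunk;
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            return super.read(b, off, Math.min(len, maxChunk));
        }
    }

    // The loop from the original post: it ignores the count returned by
    // read() and always writes the full 1024-byte buffer.
    public static int buggyCopySize(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] data = new byte[1024];
        while (in.read(data, 0, 1024) >= 0) {
            out.write(data);
        }
        return out.size();
    }

    public static void main(String[] args) throws IOException {
        byte[] pdf = new byte[10_000]; // stand-in for the file contents
        int fullReads = buggyCopySize(new ByteArrayInputStream(pdf));
        int shortReads = buggyCopySize(
                new ChunkedStream(new ByteArrayInputStream(pdf), 700));
        System.out.println("source size:           " + pdf.length);
        System.out.println("output, full reads:    " + fullReads);
        System.out.println("output, 700-byte reads: " + shortReads);
    }
}
```

With these stand-in sizes the loop should report 10240 bytes for full reads and 15360 bytes for 700-byte reads, from a 10000-byte source. Repeated stale chunks might also account for the extra 10-digit numbers, since PDF cross-reference tables store byte offsets as 10-digit numbers, though that is a guess without seeing the files.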