Hi All,

I'm trying to loop through and open PDFs in a folder using Java. I have the following code:

import java.awt.Desktop;
import java.io.File;
import java.io.IOException;


public class OpenPDFs {

    public static void main(String[] args) throws IOException {
        // TODO code application logic here
        String fp;
        fp ="S:\\Economic Forecasts\\Fcst13\\SourceForecasts\\";
        File file = new File(fp);
            if(file.toString().endsWith(".pdf"))
                Runtime.getRuntime().exec("rundll32 url.dll,FileProtocolHandler " + file);
            else{
                Desktop dt = Desktop.getDesktop();
                dt.open(file);
            }
    }
}

which only opens the folder window that contains the files. In the end, what I've been tasked to do is go through a lot of PDFs and remove the metadata within each one. I know I can do this using Acrobat 9, but it can only be done one file at a time. The people asking me about this say they have about 1000 PDFs to process. Has anyone ever done this, or can you suggest a good way to do it?

Use threads to do this; that way it will process multiple files at a time.

ex.

class OpenPDFs implements Runnable {
    @Override
    public void run() {
        // do the task you did in main in here
    }

    // main must be static, otherwise the JVM cannot use it as the entry point
    public static void main(String[] args) {
        OpenPDFs open = new OpenPDFs(); // this is just an example
        OpenPDFs open1 = new OpenPDFs();
        Thread t = new Thread(open);
        Thread t1 = new Thread(open1);
        t.start(); // starts both threads without waiting for either to finish
        t1.start();
    }
}

Here is an article on this, as I'm not sure I explained it well: http://java.sampleexamples.com/how-to-use-runnable-interface-for-creating-thread-in-java/
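Whichever way the work is scheduled, the per-file task first needs the list of PDFs in the folder, since the code in the question hands the folder itself to Desktop.open. Below is a minimal sketch of that step using only the standard library. The class name PdfFolderLister and the method listPdfs are made up for illustration, and the folder is taken from the command line rather than hard-coded.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class PdfFolderLister {

    /**
     * Returns the PDF files directly inside dir, sorted by pathname.
     * Returns an empty list if dir is not a readable folder.
     */
    public static List<File> listPdfs(File dir) {
        List<File> pdfs = new ArrayList<>();
        File[] entries = dir.listFiles(); // null when dir is not a directory or cannot be read
        if (entries == null) {
            return pdfs;
        }
        Arrays.sort(entries); // File implements Comparable and orders by pathname
        for (File entry : entries) {
            // Match the extension case-insensitively so "REPORT.PDF" is included too
            if (entry.isFile() && entry.getName().toLowerCase(Locale.ROOT).endsWith(".pdf")) {
                pdfs.add(entry);
            }
        }
        return pdfs;
    }

    public static void main(String[] args) {
        // Hypothetical usage: pass the folder as the first argument, default to the current folder
        File dir = new File(args.length > 0 ? args[0] : ".");
        for (File pdf : listPdfs(dir)) {
            System.out.println(pdf.getPath()); // replace with the real per-file work
        }
    }
}
```

Each File in the returned list could then be handed to a Runnable, or simply processed in a plain loop if the per-file work is quick.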

Member Avatar for iamthwee

Not quite Java related, but I would opt for:
http://www.rockpdf.com/

Please note the trial version can only handle PDFs of up to 50 pages.

Grab the trial then use a bat file to recurse through the directory removing all the pdf metadata.

Shouldn't be too difficult... If you need further help I can be more specific.

This assumes you are on Windows; if not, you could always set up VirtualBox.

*Make sure to do a backup of your master directory in case things go wrong.

Thanks to the both of you. I am going to test iamthwee's suggestion first because I was initially tasked with finding 3rd party software to do the job and wasn't having any luck. I'll let you both know how it goes tomorrow!

I'll check out both. Money isn't an issue as it's not mine being spent! :) Thanks again iamthwee!

Did you try iText or Apache PDFBox?

So I'm truthfully rather noobish when it comes to adding something like PDFBox. Is there a specific folder path I need to follow in order to use its imports? I've been trying to use any and all of these suggestions, but they either aren't installing, aren't importing, or I'm plain old dumb (which could be the case right now). Either way, this is really frustrating me atm.

Oh yeah, and trying to run the first two suggested programs from the command prompt keeps failing with errors about files not being found.

Not a good way for me to start my day that's for sure.

OK, so I delved into Adobe Acrobat 9 Pro and the following steps did exactly what I needed:

Open Adobe Acrobat Pro
Click Advanced-Document Processing-Batch Processing
Click New Sequence
Name the Sequence (I named it “MetaRemove”)
Click “Select Commands…”
In the Document folder, click “Examine Document”
Click “Add>>”
Expand the Examine Document field by clicking the +
Double click Remove metadata: Yes
Deselect all except Metadata and hidden text (unless there are other fields you want included in this process)
Click OK
Click to highlight MetaRemove
Click “Run Sequence”

Thanks for all the suggestions and if someone would be gracious enough to let me know how to use the suggestions given I'd be grateful for that too.

Oh yeah, and trying to run the first two suggested programs from the command prompt keeps failing with errors about files not being found.

What do you mean by running from the command prompt and getting errors? That is hardly the kind of issue description one expects from a developer...

For pdfleo (for instance) I get a return message: "missing option FILE"

Well, it would be beneficial if you actually posted the code or whatever you are executing. I'm not sitting next to you to see your screen ;)

What code? At command prompt I entered: pdfleo as per directions from the pdfleo pdf, which was downloaded from here.

I've never used pdfleo. I thought you were talking about iText or PDFBox, which is why I asked what your code looks like...

For PDFBox, I tried the following

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentInformation;
import org.apache.pdfbox.pdfwriter.COSWriter;

public class OpenPDFs {

    public static void main(String[] args) throws IOException {
        // TODO code application logic here
        String fp;
        fp ="S:\\Economic Forecasts\\Fcst13\\SourceForecasts\\";
        PDDocumentInformation info = document.getDocumentInformation();

but the import org... lines tell me:

package org.apache.pdfbox.cos does not exist

You did not import/provide pdfbox-1.8.2.jar to your IDE (IntelliJ, Eclipse, NetBeans) correctly, which is why it complains that the imported packages do not exist.
Tell us which IDE you are using and we can give you guidance on how to associate the library JAR with your project in your IDE, or just google "import library jar IDE_NAME".
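The "package ... does not exist" message comes from the compiler, so the real fix is adding the JAR to the project's libraries. Once that is done, a quick way to confirm that the JAR is also visible when the program runs is to ask the class loader for one of its classes. This is only a small diagnostic sketch: the class name ClasspathCheck and the method isOnClasspath are made up here, and org.apache.pdfbox.pdmodel.PDDocument is used as the probe because it is one of the imports in the post above.

```java
public class ClasspathCheck {

    /** Returns true if the named class can be located on the current classpath. */
    public static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Defaults to a PDFBox class; pass another fully qualified class name to probe for it instead
        String name = args.length > 0 ? args[0] : "org.apache.pdfbox.pdmodel.PDDocument";
        System.out.println(name + (isOnClasspath(name) ? " is" : " is NOT") + " on the classpath");
    }
}
```

If this reports that the class is missing even though the imports compile, the JAR is probably attached for compilation only and not for running.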

Member Avatar for iamthwee

Hi Stuugie, I tried both but prefer the second link as it has no page restrictions. The second link worked. I will post more specific instructions.

Not to take anything away from Peter's great advice, but I don't think you need to get down and dirty with Java, importing libraries and understanding how it all works.

The second link should be fine...

Member Avatar for iamthwee

Hi Stuugie.

Here are the instructions.

  1. Extract your downloaded file.
  2. Inside the folder 'linuxpdfinfo', paste in the PDF you want to clean up.
  3. Now open a terminal window and navigate to this folder.
  4. Once in this folder, type the following in the terminal window exactly as is (let us assume your PDF is called 'test.pdf'):

./pdfinfo -itest.pdf -otest2.pdf -removeinfo -removexmp

  5. Now you have created a new PDF called test2.pdf and the XMP info has been removed!

Enjoy.

Member Avatar for iamthwee

If you wish to process more than one file, note that wildcards are not permitted.

So dump a list of all the PDF files to a text file first, then reference that list in the terminal window.

E.g
ls -1 *.pdf > list.txt

then
./pdfinfo -ilist.txt -fstuugie -removeinfo -removexmp
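For anyone who would rather drive the tool from Java, for example one file at a time inside a loop over the folder, the command line can be assembled as a list of arguments. This is only a sketch: the class name PdfInfoCommand and the method buildCommand are made up, and the flags (-i, -o, -removeinfo, -removexmp) are copied from the commands quoted in this thread rather than checked against the tool's documentation.

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;

public class PdfInfoCommand {

    /** Builds the argument list for one run of the tool, mirroring the flags quoted in the thread. */
    public static List<String> buildCommand(String tool, File input, File output) {
        return Arrays.asList(
                tool,
                "-i" + input.getPath(),  // the flag and the file name are joined without a space, as in the thread
                "-o" + output.getPath(),
                "-removeinfo",
                "-removexmp");
    }

    public static void main(String[] args) throws Exception {
        List<String> cmd = buildCommand("./pdfinfo", new File("test.pdf"), new File("test2.pdf"));
        System.out.println(String.join(" ", cmd));
        // To actually launch it (assumes the tool exists in the working directory):
        // int exitCode = new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```

Passing the arguments as a list avoids quoting problems with paths that contain spaces, such as the "Economic Forecasts" folder from the original question.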

@peter, I'm using NetBeans at both work and home. I just wanted to state that I don't work very much (at all, really) with Java, but I have decided to delve into it again as if I were a student taking Java courses again. I was programming last night for about 2 hours, making up simple classes from my old textbook. I really want to strengthen my skills with Java and OO programming; that's my goal!

@iamthwee, I have meetings this morning but I am going to give your suggestions a go when I have time today. I'll get back to you and let you know how I do.

Thanks for your patience with me guys, I really appreciate it!
