I just wanted to share my thoughts regarding PDFs and how bad they are to work with. I have 8 bank sources that provide my department with quarterly forecast data and their outputs are always in PDF format. I know PDFs are a convenient way to output data and are typically smaller when it comes to file sizes than say MS Word but data extraction from them is a nightmare. I have researched PDF to Excel/Word converters and have been using Nuance PDF converter but even that software is limited in its ability to successfully split the data properly. The only way to split out data from a PDF is to treat sections as tables but this is not ideal for PDFs that are saved in (basically) image formats.
This leaves the users having to copy and paste data from PDF to Excel but even at that, one page worth will paste down one column in Excel. Because of this, my automation plans in my department will be affected and quite possibly, the data integrity will be affected. I'm currently trying to create code that uses a word search to accurately extract specific data and leave what I call the fluff that is left over. This is not an easy task and is burning up time as far as requirements’ planning is concerned. However, once it is done I will be quite happy I'm sure.
I really don't like PDFs and with today’s storage abilities I can't see file sizes as an issue for your average company. If I had my way, I would destroy all PDFs and make people/organizations output their data in formats that are usable in the information industry.
Sorry for the rant but I’m knee deep in this sh.. – hoopla right now just trying to figure out a way for the user to transfer PDF data into Excel so that I can attempt to work with it.

Recommended Answers

All 13 Replies

Oh, and I was able to successfully share this with LinkedIn

Blasphemy! I really like PDFs. Perhaps your issue isn't with the actual format but with the fact that you are using them incorrectly. They're a good portable format when documents need to be read (not so much modified and worked with, which seems to be your dilemma). They're small in file size, can be emailed, can be opened across nearly all platforms and mobile devices without the need for any costly software, can be opened within web browsers, display identically across platforms (amazing for more complicated layouts, or when fonts and graphics are necessary), and even print exactly the way they look on screen. Honestly, I can't think of any other format in existance that could fill the place of PDFs when it comes to ease of use, portability and complete standardization across platforms.

Blasphemy!

LMAO, just had a coworker look at me oddly because I burst out laughing. You make very valid points that I won't argue with. However, my dilemma does stand from a data gathering and db loading point of view. I don't think I'm using them incorrectly though as I am not creating them, I'm only trying machine read them.

As I pointed out, your work is simply requiring that you use the wrong tool for the job. Not to say that a screwdriver isn't loads useful, but it's rather frustrating when you need it to do the job of a hammer. Sure, you can sit there banging its handle into the nail for a year and a day, and ultimately get the job accomplished, but not using the right tool for the job shouldn't dismiss the usefulness of either of the tools.

So is there a right tool that you know of?

Export to Excel?? In my particular situation, I use my bank's Export to Quickbooks feature to manage my books.

Myself, I always keep the original source in the original form, and convert that to PDF for publication. If someone needs to do additional processing and PDF isn't "convenient", then if they ask (nicely), I'll send them the original source document. If it's a book, then it may be in Framemaker, which isn't much more useful than the PDF. If it's a simple article or tech doc, then it will probably be in ODF (Open Office) format. It is still their problem to extract the information they need...

I don't mind PDF for "visual" purposes, but for passing actual data like in a table, I prefer a raw simple CSV file.

I like PDF's too, and having Adobe Acrobat Pro helps a lot.

Download Foxit Reader. This will allow you to save your PDF's as text (*.txt) files. The text file separates the cell data with Tab's, it's very readable and in that format importable.

I'm definitely not dogging the outputting capabilities of a PDF format. Our department secretary creates PDF documents for corporate use every month. I also find them good and easy to create with Acrobat Pro and I just downloaded Foxit Reader and am impressed by that too. I just wish our sources would provide us with better formats for extracting data. The copy/paste method that I was hoping to skirt will still be prevalent and that is unfortunate. Oh well.

Sorry, I should have explained this better. Foxit Reader will save the PDF data into a text file.

Run Excel and choose 'Open' as if to open a file. Change the format that you want to open in the drop-down list from "All Excel Files..." to "Text Files", and locate the text file you saved from Foxit.

When Excel opens the text file you will have a window open with options. Keep the setting 'Fixed width' and click next. Here you can manually change the field widths, if need be, then click next. If there aren't any columns you want to change or delete then just click finish.

What you should have before you is your data in individual cells. That should give you an idea of what to expect.

Now then for the next stage, making this procedure practicable. You would be wise to do this on a copy of the sheet you generally work with, not the original at this stage.

Run the copy of the Excel worksheet that you typically copy and paste your data into. Open a blank sheet using the tab at the bottom of the page there. Click on 'Data' on the Ribbon. Click on 'Get External Data' and choose to open a 'Text File' as you did earlier, and follow those steps until you have your data on the sheet in front of you.

Here's the good bit, hopefully. The values that you would like to appear in particular cells on your main worksheet can be linked between the two sheets. This will only be most effective if the layout of the data on the original PDF from your bank doesn't change. If this is so, then you won't need to copy and paste again, only open the PDF in Foxit, save it as a text file, run Excel, click on 'Data', Click on 'Get External Data, and you're done.

Because the links are made on your blank worksheet you can select all the data on the extra sheet and delete it, but don't delete the sheet because the links will refer to cells in it.

Experiment with it. If your Sheet's requirements for data are slightly different to what I've suggested here, then let us know how it uses them and either me or someone else will help.

commented: That's really good! Thank you. +0

That is awesome BigPaw. I understood the saving as a text file. The route I went (before seeing your last post) was to change the extension to a csv and then open it in Excel. While this is a few less steps, your way links to the file much better and everything is actually separated into cells (whereas my csv way still contained multiple values in some cells). Thanks a lot for your guidance here, I really do appreciate it!

You are more than welcome. Knowing it has been helpful has made my day, thank you. :-)

Be a part of the DaniWeb community

We're a friendly, industry-focused community of developers, IT pros, digital marketers, and technology enthusiasts meeting, networking, learning, and sharing knowledge.