Jan 11th, 2009

How can I read a pdf web page?

I'm trying to get the content of a web page that is written in pdf format.
The following code worked very well for me when I tried to read a regular web page, but it prints all kinds of weird letters when I try it on a pdf page like this one:

import urllib2

url = 'http://fetac.ie/MODULES/D20120.pdf'
content = urllib2.urlopen(url).read()
print content

Any suggestions? (A brief code example would be great, thanks!)
Posted by noamjob
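
A note on the "weird letters": a PDF is a binary format, so printing the raw bytes returned by the server shows compressed streams rather than readable text. Below is a minimal sketch of how a script could tell the two cases apart. It is written for Python 3 rather than the Python 2 used in the thread, looks_like_pdf and describe_payload are hypothetical helper names, and the check relies only on the fact that PDF files start with the bytes %PDF-.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if the payload starts with the PDF magic bytes."""
    # A PDF file begins with "%PDF-" followed by a version number
    return data.startswith(b"%PDF-")


def describe_payload(data: bytes) -> str:
    """Say how a downloaded payload should be handled (hypothetical helper)."""
    if looks_like_pdf(data):
        return "binary PDF: pass it to a PDF parser instead of printing it"
    return "not a PDF: probably safe to decode and print as text"


if __name__ == "__main__":
    # Sample payloads stand in for what urlopen(url).read() would return
    print(describe_payload(b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n"))
    print(describe_payload(b"<html><body>hello</body></html>"))
```

The same test could also be applied to the Content-Type header of the response, but the leading bytes are available even when a server reports the wrong type.
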
Jan 11th, 2009

Re: How can I read a pdf web page?

I suggest this if you want to see the pdf file
import webbrowser
webbrowser.open("http://fetac.ie/MODULES/D20120.pdf")
Posted by Gribouillis
Jan 11th, 2009

Re: How can I read a pdf web page?

If you are just interested in extracting the text of a PDF-formatted page, take a close look at:
http://code.activestate.com/recipes/511465/
Posted by sneekula
Jan 11th, 2009

Re: How can I read a pdf web page?

Hi, thanks for the responses...
The first option you suggested is not actually what I want. I want to get the content as text for later parsing.
The second option is a bit complex for me; I don't see how to open a web page that way. So I tried saving the PDF file to my computer first and then opening it with pyPdf.
This is my code:

import urllib2
import pyPdf

url = 'http://fetac.ie/MODULES/D20120.pdf'
content = urllib2.urlopen(url).read()

# Save the downloaded bytes to disk in binary mode
filename = "pdfExample.pdf"
fout = open(filename, "wb")
fout.write(content)
fout.close()


def getPDFContent(path):
    content = ""
    # Load PDF into pyPdf
    pdf = pyPdf.PdfFileReader(open(path, "rb"))
    # Iterate pages
    for i in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(i).extractText() + "\n"
    # Collapse whitespace
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content

print getPDFContent(filename)




When I run it, I get an exception:

  File "c:\appdata\local\temp\easy_install-pbbgen\pyPdf-1.12-py2.5-win32.egg.tmp\pyPdf\pdf.py", line 555, in getObject
    raise Exception, "file has not been decrypted"
Exception: file has not been decrypted

1. Any suggestions for the problem? When I run each part separately, it works, but for some reason saving the PDF file this way isn't enough for pyPdf to open it later.
2. Isn't there a better way than saving it to the computer and then opening it again? Isn't there something for reading a PDF directly from the web page?
Posted by noamjob
Jan 12th, 2009

Re: How can I read a pdf web page?

First, it is better to use code tags.
That said, it seems the PDF is encrypted. Try a PDF that is not encrypted, or check whether the pyPdf module offers a decryption option (I have never used it).
Posted by evstevemd
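
On the decryption option: pyPdf's PdfFileReader exposes an isEncrypted attribute and a decrypt(password) method, and PDFs that open without a password prompt but restrict copying are often encrypted with an empty user password. The sketch below, in Python 3, shows that check against any reader-like object. unlock_if_needed is a hypothetical helper, FakeReader is a stand-in used only so the example runs without pyPdf installed, and whether an empty password works for the file in this thread is an assumption that was not tested.

```python
def unlock_if_needed(reader, password=""):
    """Try to decrypt a pyPdf-style reader; return True if it is readable.

    Assumes the reader exposes `isEncrypted` and `decrypt(password)`,
    as pyPdf's PdfFileReader does. decrypt() returns 0 when the
    password does not match.
    """
    if not reader.isEncrypted:
        return True
    return reader.decrypt(password) != 0


class FakeReader:
    """Hypothetical stand-in for PdfFileReader, for demonstration only."""

    def __init__(self, encrypted, accepted_password=None):
        self.isEncrypted = encrypted
        self._accepted = accepted_password

    def decrypt(self, password):
        # Mirror the convention of returning 0 on failure, nonzero on success
        return 1 if password == self._accepted else 0


if __name__ == "__main__":
    print(unlock_if_needed(FakeReader(False)))           # not encrypted
    print(unlock_if_needed(FakeReader(True, "")))        # empty user password
    print(unlock_if_needed(FakeReader(True, "secret")))  # real password needed
```

With the real library, the call would go right after constructing the PdfFileReader and before getNumPages(). pyPdf may not support every encryption scheme, so decrypt() can still fail on some files.
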
Jan 12th, 2009

Re: How can I read a pdf web page?

Whether or not the text of the PDF is 'available' is an option for the PDF publisher. I remember several tech documents that were designed intentionally to not allow you to copy the text out of them.

It sounds as if your target document might fall into that category. You could check by using a 'normal' browser to save the file to disk. Then open the file with an 'official' reader and check the properties.

(This would also give you a pdf on disk that you could practice the pdf part of your code with.)
Posted by Murtan
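
The properties check can be approximated in code without opening a reader. Copy restrictions in a PDF are stored in an encryption dictionary that the file trailer points to through an /Encrypt entry, so searching the raw bytes for that name gives a rough indication. This is a heuristic sketch in Python 3, not a parser: probably_encrypted is a hypothetical name, the sample trailers are made up for the demonstration, and a file could in principle contain the same string elsewhere.

```python
def probably_encrypted(data: bytes) -> bool:
    """Heuristic: encrypted PDFs carry an /Encrypt entry in their trailer."""
    # The whole payload is searched because a PDF can hold more than one
    # trailer, and not all of them sit at the end of the file.
    return b"/Encrypt" in data


if __name__ == "__main__":
    # Made-up trailer fragments; real input would be the downloaded bytes
    restricted = b"%PDF-1.4\ntrailer\n<< /Root 1 0 R /Encrypt 5 0 R >>\n%%EOF"
    unrestricted = b"%PDF-1.4\ntrailer\n<< /Root 1 0 R >>\n%%EOF"
    print(probably_encrypted(restricted))
    print(probably_encrypted(unrestricted))
```
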
Jan 12th, 2009

Re: How can I read a pdf web page?

Thanks for all the responses.
It really is a specific problem with this PDF. With some other PDF files my code works well, so I can say it's pretty much "solved".
I'm still looking for a different option that doesn't force me to save the PDF file before opening it again for parsing.
Posted by noamjob
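
One way to skip the temporary file is to wrap the downloaded bytes in an in-memory stream, since PDF readers generally need a seekable file-like object and the raw HTTP response is not seekable. A minimal sketch in Python 3 follows. fetch_pdf_stream and its opener parameter are hypothetical names introduced here, and the demonstration uses a stand-in opener so that it runs without network access.

```python
import io
import urllib.request


def fetch_pdf_stream(url, opener=urllib.request.urlopen):
    """Download a URL and return its bytes wrapped in a seekable stream."""
    with opener(url) as response:
        data = response.read()
    # BytesIO supports read(), seek() and tell(), like a file opened in "rb"
    return io.BytesIO(data)


if __name__ == "__main__":
    # Offline demonstration: a stand-in opener replaces the real network call
    def fake_opener(url):
        return io.BytesIO(b"%PDF-1.4 demo bytes")

    stream = fetch_pdf_stream("http://example.invalid/demo.pdf", opener=fake_opener)
    print(stream.read(8))
    stream.seek(0)  # rewind, as a PDF reader would while parsing
```

The returned stream can be handed to a reader in place of an open file, for example pyPdf.PdfFileReader(stream). In the Python 2 setup used in the thread, StringIO.StringIO(content) would play the role of io.BytesIO.
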
Jan 12th, 2009

Re: How can I read a pdf web page?

What if you mark it as solved?
Posted by evstevemd

This thread is solved
