noamjob

Re: How can I read a pdf web page?

 
Jan 11th, 2009
Hi, thanks for the responses.
The first option you suggested is not actually what I want. I want to get the content as a text file for later parsing.
The second option is a little bit complex for me. I don't see how to open a web page this way. So I tried saving the PDF file to my computer first, then opening it with pyPdf.
This is my code:

import urllib2, os
import webbrowser
import pyPdf

url = 'http://fetac.ie/MODULES/D20120.pdf'
content = urllib2.urlopen(url).read()

filename = "pdfExample.pdf"
fout = open(filename, "wb")
fout.write(content)
fout.close()


def getPDFContent(path):
    content = ""
    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # Iterate pages
    for i in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(i).extractText() + "\n"
    # Collapse whitespace
    content = " ".join(content.replace("\xa0", " ").strip().split())
    return content

print getPDFContent(filename)




When I run it, I get an exception:

"File "c:\appdata\local\temp\easy_install-pbbgen\pyPdf-1.12-py2.5-win32.egg.tmp\pyPdf\pdf.py", line 555, in getObject
raise Exception, "file has not been decrypted"
Exception: file has not been decrypted"
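
The message suggests pyPdf considers the document encrypted rather than damaged by the download. One rough way to check that, sketched here with only the standard library, is to look for an /Encrypt entry in the raw bytes. The function name pdf_encryption_hint is made up for illustration, the substring search is only a heuristic that can give false positives, and the b"..." literals assume Python 2.6 or later:

```python
def pdf_encryption_hint(data):
    """Return a rough guess about raw PDF bytes: 'not a pdf', 'encrypted' or 'plain'."""
    # Every PDF file starts with a "%PDF-" header; an HTML error page would not.
    if not data.startswith(b"%PDF-"):
        return "not a pdf"
    # Encrypted PDFs carry an /Encrypt entry in their trailer dictionary.
    # A plain substring search is a heuristic, not a real parse of the file.
    if b"/Encrypt" in data:
        return "encrypted"
    return "plain"
```

Calling it on the saved file, for example pdf_encryption_hint(open("pdfExample.pdf", "rb").read()), would show whether the download itself is the problem or whether the file simply needs decrypting.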

1. Any suggestions for the problem? When I run each part separately, it works, but for some reason saving the PDF file this way isn't enough for pyPdf to open it later.
2. Isn't there a better way than saving it to the computer and then opening it? Isn't there something for reading a PDF directly from the web page?
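
For what it's worth, both questions might be approachable along these lines. This is only a sketch, not tested against the PDF above: it assumes pyPdf's PdfFileReader accepts any seekable file-like object, and that isEncrypted and decrypt() are available in the installed version. The function names are hypothetical, and an empty user password is a guess that will not work for every protected file. On Python 2.5, StringIO.StringIO would play the role of io.BytesIO.

```python
import io


def pdf_stream_from_bytes(data):
    """Wrap downloaded bytes in a seekable in-memory file object."""
    return io.BytesIO(data)


def get_pdf_text_from_bytes(data, password=""):
    """Extract text from PDF bytes without writing them to disk."""
    import pyPdf  # imported here so the helper above works without pyPdf

    pdf = pyPdf.PdfFileReader(pdf_stream_from_bytes(data))
    if pdf.isEncrypted:
        # Assumption: decrypt() returns 0 when the password does not match.
        if pdf.decrypt(password) == 0:
            raise ValueError("PDF is encrypted and the password was rejected")
    pages = [pdf.getPage(i).extractText() for i in range(pdf.getNumPages())]
    # Collapse runs of whitespace, as in the code above
    return " ".join("\n".join(pages).split())
```

With the bytes already fetched by urllib2.urlopen(url).read(), the call would be get_pdf_text_from_bytes(content), which skips the temporary file entirely.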