How can I read a pdf web page?

Thread Solved

#1 noamjob, Jan 11th, 2009
I'm trying to get the content of a web page that is served as a PDF file.
The following code works well when I read a regular web page, but it prints all kinds of weird characters when I try it on a PDF page like this one:

import urllib2

url = 'http://fetac.ie/MODULES/D20120.pdf'
content = urllib2.urlopen(url).read()
print content

Any suggestions? (A brief code example would be great, thanks!)
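
A note on why the output looks garbled: a PDF is a binary format, so printing the raw bytes shows compressed streams rather than readable text. Below is a minimal sketch of how one might detect this before printing. It assumes Python 3 (the thread itself uses Python 2 and urllib2), and the helper names `looks_like_pdf` and `describe_payload` are hypothetical.

```python
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the payload starts with the PDF header marker."""
    # A PDF file normally begins with "%PDF-" followed by a version, e.g. "%PDF-1.4".
    return data.startswith(PDF_MAGIC)


def describe_payload(data: bytes) -> str:
    """Give a short human-readable description of downloaded bytes."""
    if looks_like_pdf(data):
        return "PDF document (binary, needs a PDF parser)"
    return "not a PDF (may be printable text such as HTML)"


if __name__ == "__main__":
    print(describe_payload(b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n"))
    print(describe_payload(b"<html><body>hello</body></html>"))
```

If the check reports a PDF, the bytes need to go through a PDF library instead of being printed directly.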

#2 Gribouillis, Jan 11th, 2009
I suggest this if you just want to view the PDF file:
import webbrowser
webbrowser.open("http://fetac.ie/MODULES/D20120.pdf")

#3 sneekula, Jan 11th, 2009
If you are only interested in extracting the text of a PDF-formatted page, then take a close look at this recipe:
http://code.activestate.com/recipes/511465/

#4 noamjob, Jan 11th, 2009
Hi, thanks for the responses.
The first suggestion is not quite what I want. I want to get the content as text for later parsing.
The second suggestion is a little complex for me. I don't see how to open a web page that way, so I tried saving the PDF file to my computer first and then opening it with pyPdf.
This is my code:

import urllib2
import pyPdf

url = 'http://fetac.ie/MODULES/D20120.pdf'
content = urllib2.urlopen(url).read()

# Save the downloaded bytes to disk in binary mode
filename = "pdfExample.pdf"
fout = open(filename, "wb")
fout.write(content)
fout.close()


def getPDFContent(path):
    content = ""
    # Load PDF into pyPdf
    pdf = pyPdf.PdfFileReader(open(path, "rb"))
    # Iterate pages
    for i in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(i).extractText() + "\n"
    # Collapse whitespace (u"\xa0" is a non-breaking space; the extracted text is unicode)
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content

print getPDFContent(filename)




When I run it, I get an exception:

  File "c:\appdata\local\temp\easy_install-pbbgen\pyPdf-1.12-py2.5-win32.egg.tmp\pyPdf\pdf.py", line 555, in getObject
    raise Exception, "file has not been decrypted"
Exception: file has not been decrypted

1. Any suggestions for this problem? When I run each part separately, it works, but for some reason saving the PDF file this way isn't enough for pyPdf to open it later.
2. Is there a better way than saving the file to the computer and then opening it again? Is there something for reading a PDF directly from the web page?

#5 evstevemd, Jan 12th, 2009
First, it is better to use code tags.
That said, it seems the PDF is encrypted. Try a PDF that is not encrypted, or check whether the pyPdf module offers a decryption option (I have never used it).
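
As a rough illustration of the decryption check suggested here, the sketch below uses Python 3 and assumes the third-party pypdf package (the maintained successor to pyPdf, which is not what the thread uses). Its reader objects expose `is_encrypted` and `decrypt()`. The helper names and the empty default password are assumptions; some restricted PDFs open with an empty user password, but this particular file may not.

```python
def unlock_if_needed(reader, password=""):
    """Return True if the reader's pages can be accessed.

    `reader` is expected to behave like pypdf.PdfReader: it has an
    `is_encrypted` attribute and a `decrypt(password)` method that
    returns a falsy value (0) when the password is rejected.
    """
    if not reader.is_encrypted:
        return True
    return bool(reader.decrypt(password))


def extract_text_from_file(path, password=""):
    """Extract text from a PDF on disk, decrypting it first if required."""
    # Imported lazily so the helper above can be used without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(path)
    if not unlock_if_needed(reader, password):
        raise ValueError("PDF is encrypted and the password was rejected")
    # extract_text() may return an empty result for pages without text.
    return "\n".join(page.extract_text() or "" for page in reader.pages)
```

Calling `extract_text_from_file("pdfExample.pdf")` on the file saved earlier in the thread would either return its text or raise the `ValueError`, which at least makes the cause explicit.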

#6 Murtan, Jan 12th, 2009
Whether the text of a PDF can be extracted is an option set by the PDF publisher. I remember several tech documents that were intentionally designed to not allow you to copy the text out of them.

It sounds as if your target document might fall into that category. You could check by using a 'normal' browser to save the file to disk. Then open the file with an 'official' reader and check the properties.

(This would also give you a pdf on disk that you could practice the pdf part of your code with.)
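
To complement the manual check described above, a saved file can also be inspected programmatically. The following is only a heuristic sketch in Python 3. It assumes that a protected PDF declares an `/Encrypt` entry in its trailer, which is where the PDF format records its security settings, and it scans the raw bytes for that key rather than parsing the file. The function names are hypothetical.

```python
def declares_encryption(data: bytes) -> bool:
    """Heuristic: True if the raw PDF bytes mention an /Encrypt entry."""
    # This is a plain byte search, not a parse, so it can misfire on
    # unusual files; treat the answer as a hint only.
    return b"/Encrypt" in data


def file_declares_encryption(path: str) -> bool:
    """Apply the same heuristic to a PDF saved on disk."""
    with open(path, "rb") as handle:
        return declares_encryption(handle.read())
```

A `True` result for the saved file would be consistent with the "file has not been decrypted" error reported earlier in the thread.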

#7 noamjob, Jan 12th, 2009
Thanks for all the responses.
It really is a problem specific to this PDF. My code works well with some other PDF files, so I can say it's pretty much "solved".
I'm still looking for an option that doesn't force me to save the PDF file to disk before opening it again for parsing.
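
One way to avoid the intermediate file is to wrap the downloaded bytes in an in-memory stream, since PDF readers generally need a seekable file-like object rather than a path. This is a minimal sketch, assuming Python 3 and the third-party pypdf package (the thread's code uses Python 2, urllib2, and pyPdf). The function names are hypothetical.

```python
import io
import urllib.request


def bytes_to_stream(data: bytes) -> io.BytesIO:
    """Wrap raw bytes in a seekable, file-like object."""
    return io.BytesIO(data)


def fetch_pdf_text(url: str) -> str:
    """Download a PDF and extract its text without writing it to disk."""
    # Imported lazily; pypdf is a third-party package (an assumption here).
    from pypdf import PdfReader

    with urllib.request.urlopen(url) as response:
        data = response.read()
    reader = PdfReader(bytes_to_stream(data))
    pages = (page.extract_text() or "" for page in reader.pages)
    # Collapse whitespace, as in the code earlier in the thread.
    return " ".join("\n".join(pages).replace("\xa0", " ").split())
```

With the original Python 2 setup, the equivalent idea would presumably be to pass a `StringIO.StringIO(content)` object to `pyPdf.PdfFileReader`, though that is untested here. An encrypted file such as the one in this thread would still fail until it is decrypted.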

#8 evstevemd, Jan 12th, 2009
What if you mark it solved?

©2003 - 2009 DaniWeb® LLC