How can I read a pdf web page?
Thread Solved
Join Date: Jan 2009
Posts: 4
Reputation:
Solved Threads: 0
I'm trying to get the content of a web page that is in PDF format.
The following code worked very well for me when I tried to read a regular web page, but it prints all kinds of weird letters when I try it on a PDF page like this one:
import urllib2
url='http://fetac.ie/MODULES/D20120.pdf'
content=urllib2.urlopen(url).read()
print content
Any suggestions? (A brief code example would be great, thanks!)
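The "weird letters" are expected: a PDF is a binary format, so read() returns raw bytes (mostly compressed streams) rather than readable text. Below is a minimal sketch of how one might confirm that a download really is a PDF. It is written for Python 3, the helper name looks_like_pdf is made up for illustration, and it relies only on the fact that PDF files begin with the marker %PDF-. The sample bytes stand in for a real download.

```python
def looks_like_pdf(data):
    """Return True if the raw bytes start with the PDF file marker."""
    return data.startswith(b"%PDF-")


# Sample bytes standing in for what urlopen(url).read() would return.
pdf_bytes = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n1 0 obj\n"
html_bytes = b"<html><body>Hello</body></html>"

print(looks_like_pdf(pdf_bytes))   # True
print(looks_like_pdf(html_bytes))  # False
```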
I suggest this if you want to view the PDF file:
import webbrowser
webbrowser.open("http://fetac.ie/MODULES/D20120.pdf")
If you are only interested in extracting the text of a PDF page, take a close look at:
http://code.activestate.com/recipes/511465/
Join Date: Jan 2009
Posts: 4
Reputation:
Solved Threads: 0
Hi, thanks for the responses.
The first option you suggested is not actually what I want. I want to get the content as text for later parsing.
The second option is a little bit complex for me. I don't see how to open a web page this way. So I tried saving the PDF file to my computer first, then opening it with pyPdf.
This is my code:
import urllib2,os
import webbrowser
import pyPdf

url='http://fetac.ie/MODULES/D20120.pdf'
content=urllib2.urlopen(url).read()
filename = "pdfExample.pdf"
fout=open(filename, "wb")
fout.write(content)
fout.close()

def getPDFContent(path):
    content = ""
    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # Iterate pages
    for i in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(i).extractText() + "\n"
    # Collapse whitespace
    content = " ".join(content.replace("\xa0", " ").strip().split())
    return content

print getPDFContent(filename)
When I run it, I get an exception:
"File "c:\appdata\local\temp\easy_install-pbbgen\pyPdf-1.12-py2.5-win32.egg.tmp\pyPdf\pdf.py", line 555, in getObject
raise Exception, "file has not been decrypted"
Exception: file has not been decrypted"
1. Any suggestions for the problem? When I run each part separately, it works, but for some reason, saving the PDF file this way isn't enough for pyPdf to open it later.
2. Isn't there a better way than saving it to the computer and then opening it? Isn't there something for reading a PDF directly from the web page?
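On question 2, the download does not have to touch the disk: the bytes can be wrapped in an in-memory file object and handed to a PDF library that accepts file-like objects. Below is a rough Python 3 sketch (urllib2 became urllib.request in Python 3). The names fetch_to_memory and opener are made up for illustration, and whether a particular PDF library accepts such an object is an assumption to check against its documentation.

```python
import io
from urllib.request import urlopen


def fetch_to_memory(url, opener=urlopen):
    """Download url and return its bytes as a seekable in-memory file.

    `opener` is a hypothetical hook so the function can be tried
    without network access; by default it is urllib's urlopen.
    """
    response = opener(url)
    try:
        data = response.read()
    finally:
        response.close()
    # BytesIO supports read() and seek(), which PDF parsers typically need.
    return io.BytesIO(data)
```

A call such as fetch_to_memory('http://fetac.ie/MODULES/D20120.pdf') would then return an object that can be passed wherever an open binary file is expected.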
First, it is better to use code tags.
That said, it seems the PDF is encrypted. Try a PDF that is not encrypted, or check whether the pyPdf module offers a decryption option (I have never used it).
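Many PDFs that report as encrypted only restrict permissions and use an empty user password, so trying an empty password is a common first step. Below is a hedged sketch for Python 3. It assumes the pypdf package (the maintained successor of pyPdf) and its PdfReader members is_encrypted, decrypt, pages and extract_text. The helper names extract_all_text and read_pdf_file are hypothetical, and decryption will still fail if the file needs a real password.

```python
def extract_all_text(reader, password=""):
    """Return the text of every page, decrypting first if necessary.

    `reader` is expected to behave like pypdf.PdfReader.
    """
    if reader.is_encrypted:
        # decrypt() returns a falsy value (0) when the password is wrong.
        if not reader.decrypt(password):
            raise ValueError("PDF is encrypted and the password did not work")
    # extract_text() may return None or "" for pages without text.
    pages = [page.extract_text() or "" for page in reader.pages]
    # Join the pages and collapse runs of whitespace into single spaces.
    return " ".join("\n".join(pages).split())


def read_pdf_file(path, password=""):
    """Open a PDF on disk with pypdf (imported lazily) and return its text."""
    from pypdf import PdfReader  # third-party: pip install pypdf

    return extract_all_text(PdfReader(path), password)
```

With pypdf installed, read_pdf_file("pdfExample.pdf") would be the counterpart of getPDFContent(filename) in the code above.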
Join Date: May 2008
Posts: 560
Reputation:
Solved Threads: 90
Whether the text of a PDF can be extracted is an option set by the PDF publisher. I remember several technical documents that were intentionally designed so that you could not copy the text out of them.
It sounds as if your target document might fall into that category. You could check by using a 'normal' browser to save the file to disk. Then open the file with an 'official' reader and check the properties.
(This would also give you a pdf on disk that you could practice the pdf part of your code with.)
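The same check can be roughed out in code. An encrypted PDF carries an /Encrypt entry in its trailer, so searching the raw bytes for that name gives a quick hint. This is only a heuristic sketch in Python 3 with a hypothetical helper name, not a parser: it can misfire if the bytes /Encrypt happen to occur elsewhere in the file. The sample byte strings are invented for illustration.

```python
def seems_encrypted(data):
    """Heuristic: True if the raw PDF bytes mention an /Encrypt entry."""
    return b"/Encrypt" in data


# Invented trailer fragments standing in for real file contents.
plain = b"%PDF-1.4\ntrailer\n<< /Size 8 /Root 1 0 R >>\n%%EOF\n"
locked = b"%PDF-1.4\ntrailer\n<< /Size 9 /Root 1 0 R /Encrypt 8 0 R >>\n%%EOF\n"

print(seems_encrypted(plain))   # False
print(seems_encrypted(locked))  # True
```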