I need a program that counts the pages of all the PDF files in a directory.
I found this script, which is supposed to do exactly that, but when I try to run it, it just doesn't work for me: it outputs nothing. I don't really know what to change in the code to make it work. When I replace vPath with the directory of the files, it just gives me an error. Please help; I'm new to this and trying to get the hang of it. Thanks in advance.

"""
This module contains a function to count
the total pages for all PDF files in one directory.
"""
#from time import clock as __c
from glob import glob as __g
from re import search as __s

cdef dict __count( char *vPath ):
	#
	cdef list vPDFfiles = __g( vPath + "\\" + '*.pdf' )
	cdef int vPages = 0
	cdef dict vMsg = {}
	#
	for vPDFfile in vPDFfiles:
		vFileOpen = open( vPDFfile, 'rb', 1 )

		for vLine in vFileOpen.readlines():
			if "/Count " in vLine:
				vPages = int( __s("/Count \d*", vLine).group()[7:] )

		vMsg[vPDFfile] = vPages
		vFileOpen.close()
	#
	return vMsg
	#

def count( vPath ):
	"""
	Takes one argument: the path where you want to search the files.
	Returns a dictionary with the file name and number of pages for each file.
	"""
	#cdef double ti = __c()
	cdef dict v = __count( vPath )
	#cdef double tf = __c()
	#print tf-ti
	return v

All 5 Replies

This is Cython code. It seems to be a version of Prahaai's pure python code here http://www.daniweb.com/software-development/python/threads/152831. See if the pure python version works.
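
As a rough illustration of what removing the Cython parts involves, here is a minimal plain-Python 3 sketch of the module from the question. The name count_pages_in_dir is made up for this example, and it keeps the original heuristic (the last "/Count N" found in each file), which is not a reliable way to read the page count.

```python
import glob
import os
import re

# Bytes pattern, because the PDF files are opened in binary mode.
COUNT_RE = re.compile(rb"/Count\s+(\d+)")


def count_pages_in_dir(path):
    """Map each *.pdf file in `path` to the last /Count value found in it."""
    result = {}
    for pdf_file in sorted(glob.glob(os.path.join(path, "*.pdf"))):
        with open(pdf_file, "rb") as handle:
            matches = COUNT_RE.findall(handle.read())
        # No match means the heuristic failed for this file; record 0.
        result[pdf_file] = int(matches[-1]) if matches else 0
    return result
```

The function has to be called for anything to happen, for example print(count_pages_in_dir(r"C:\some\folder")) with a folder of your own. The original module only defines functions, which may be why nothing appeared on screen.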

Also note that a pyPdf module exists at http://pybrary.net/pyPdf/ . Here is example code that counts the pages in a PDF file with Python 2 and pyPdf (a Python 3 compatible version seems to exist as well).

from pyPdf import PdfFileReader

reader = PdfFileReader(open("ginac_tutorial.pdf", "rb"))  # binary mode, needed on Windows
print reader.getNumPages()

""" my output -->
124
"""

cdef means the code is written to be compiled with Cython. You must remove those declarations to run it as plain Python.

Something like:

import os
import re

d = 'O:\Documents and Settings\Veijalainen\Omat tiedostot'
totpages = 0
for f in (pf for pf in os.listdir(d) if pf.endswith('.pdf')):
    fn = os.path.join(d,f)
    with open(fn, 'rb') as pdf:
          for line in pdf:
              if "/Count" in line:
                  pages = int(re.search("/Count \d*", line).group()[7:])
                  totpages += pages
                  print('Count found: file: %s,line: %s\n %i pages\n' % (fn, line, pages))
##              else:
##                  # debug print plain text lines
##                  if  all(ord(c) <= 128 for c in line):
##                      print('%s: %s' %(f, line.rstrip()))

print('Total pages found %i' % totpages)

Without regular expressions, my code would look like this (but use ready-made, tested modules anyway):

import os
from itertools import takewhile
d = 'O:\Documents and Settings\Veijalainen\Omat tiedostot'
totpages = 0
for f in (pf for pf in os.listdir(d) if pf.endswith('.pdf')):
    fn = os.path.join(d,f)
    with open(fn, 'rb') as pdf:
          for line in pdf:
              if "/Count" in line:
                  pages = int(''.join(takewhile(lambda c: c.isdigit(), line[line.find('/Count ')+7:].lstrip())))
                  totpages += pages
                  print('Count found: file: %s,line: %s\n %i pages\n' % (fn, line, pages))

print('Total pages found %i' % totpages)

That script seemed to overcount pages badly. Only the last /Count value in the file seems to matter, so:

import os
from itertools import takewhile
from operator import methodcaller

d = r'O:\Documents and Settings\Veijalainen\Omat tiedostot'
totpages = 0
for f in (pf for pf in os.listdir(d) if pf.endswith('.pdf')):
    fn = os.path.join(d,f)
    with open(fn, 'rb') as pdf:
        text = pdf.read()
        pages = int(''.join(takewhile(methodcaller('isdigit'), text[text.rfind('/Count ')+7:].lstrip())))
    totpages += pages
    print('File %s: %i pages' % (f,pages))
print('-'*50)
print('Total pages found %i' % totpages)

This seems to more or less work, but I have one 4-page PDF for which it reports 0 pages, both with my script and with Gribouillis' code from the old thread. Note also the raw string for the Windows file name, and that I use operator.methodcaller as an alternative to a lambda for calling a method on an object.

This interesting math PDF: http://www.google.fi/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cts=1331033312884&ved=0CCMQFjAA&url=http%3A%2F%2Festoyanov.net%2Ffiles%2FMATAMATIKA%2FChristopher%2520Bradley%2520-%2520Challenges%2520in%2520Geometry%2520for%2520Mathematical%2520Olympians%2520Past%2520and%2520Present.pdf&ei=3vRVT62RD8XOsgasnczxBg&usg=AFQjCNERYyw4pm6nmB1kVHaQnr5JfxqUBg&sig2=KsoakLZ4GhEm66YHqotUYQ, comes up as a 5-page document, which it is not. The first version of my script reported over 800 pages; in reality it has 218 pages.
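
One likely reason for these mismatches: a PDF's page tree can be nested, and every intermediate /Pages node carries its own /Count for the pages beneath it, so a file can contain several /Count entries and neither their sum nor the last one has to be the total. Outline (bookmark) entries also use /Count, and newer PDFs may keep these objects inside compressed streams where a plain text search finds nothing. The sketch below (Python 3, helper names are my own) lists every /Count value and takes the largest as a guess, on the assumption that the root node's count is the biggest one visible. It is still a heuristic, not a parser.

```python
import re

# Bytes pattern: the data comes from a file opened with mode 'rb'.
COUNT_RE = re.compile(rb"/Count\s+(\d+)")


def count_values(data):
    """Return every non-negative /Count value found in the raw PDF bytes."""
    return [int(m) for m in COUNT_RE.findall(data)]


def guess_page_count(data):
    """Guess the page total as the largest /Count seen, or None if none is found."""
    values = count_values(data)
    return max(values) if values else None
```

Printing count_values(open(fn, 'rb').read()) for a problem file shows quickly whether it has many competing /Count entries or none at all.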

pyPdf works, unlike our simple scripts:

import os
from pyPdf import PdfFileReader

def count(d):
    """
    Takes one argument: the path where you want to search the files.
    Prints the number of pages for each PDF file and returns the total page count.
    """
    totpages = 0
    for f in (pf for pf in os.listdir(d) if pf.endswith('.pdf')):
        fn = os.path.join(d,f)
        with open(fn, 'rb') as pdf:
            reader = PdfFileReader(pdf)
            pages = reader.getNumPages()
            
        print('File %s: %i pages' % (f,pages))
        totpages += pages
    return totpages

count(r'O:\Documents and Settings\Veijalainen\Omat tiedostot\Downloads')
"""Output:
File 0198566913-oxford-university-press-usa-challenges-in-geometry-for-mathematical-olympians-past-andpdf2455.pdf: 218 pages
"""

It does not seem to like every kind of file, however, so be prepared to catch errors:

O:\Documents and Settings\Veijalainen\Omat tiedostot\Q&A_Moon_V3.pdf

Traceback (most recent call last):
  File "I:\test\pdf_count_pypdf.py", line 21, in <module>
    count(r'O:\Documents and Settings\Veijalainen\Omat tiedostot')
  File "I:\test\pdf_count_pypdf.py", line 14, in count
    reader = PdfFileReader(pdf)
  File "I:\python27\lib\site-packages\pyPdf\pdf.py", line 374, in __init__
    self.read(stream)
  File "I:\python27\lib\site-packages\pyPdf\pdf.py", line 788, in read
    assert xrefstream["/Type"] == "/XRef"
TypeError: 'NumberObject' object is not subscriptable
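
A minimal sketch of catching such errors, in Python 3. It assumes nothing about pyPdf itself: page_counter is a placeholder for whatever function returns the page count of one open file (with pyPdf that would be something like lambda f: PdfFileReader(f).getNumPages(), as in the code above), and the name count_safely is made up for this example.

```python
import os


def count_safely(directory, page_counter):
    """Return (pages_by_file, errors_by_file) for the *.pdf files in `directory`."""
    pages, errors = {}, {}
    for name in sorted(os.listdir(directory)):
        if not name.lower().endswith(".pdf"):
            continue
        path = os.path.join(directory, name)
        try:
            with open(path, "rb") as handle:
                pages[name] = page_counter(handle)
        except Exception as exc:  # broad on purpose: parser failures vary
            errors[name] = "%s: %s" % (type(exc).__name__, exc)
    return pages, errors
```

Files that fail end up in the second dictionary with the error message, so one bad PDF no longer stops the whole run, and sum(pages.values()) gives the total for the files that could be read.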