Good day.
I want to share the code for an application I've created. It reads the number of pages of every PDF file in one directory.

My question is this: does it need more optimisation? Can I make it run faster? For now it is pretty fast, but I feel it can do better.

This is the code in Python.

"""
This module contains a function to count
the total pages for all PDF files in one directory.
"""

#from time import clock as __c #Used for benchmark.
from glob import glob as __g
from re import search as __s

def count( vPath ):
	"""
	Takes one argument: the path where you want to search the files.
	Returns a dictionary with the file name and number of pages for each file.
	"""
	#
	#cdef double ti = __c() #Used for benchmark.
	#
	vPDFfiles = __g( vPath + "\\" + '*.pdf' )
	vPages = 0
	vMsg = {}
	#
	for vPDFfile in vPDFfiles:
		vFileOpen = open( vPDFfile, 'rb', 1 )

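		for vLine in vFileOpen.readlines():
			if "/Count " in vLine:
				# Use \d+ rather than \d* so that int() never receives an empty string.
				vMatch = __s( r"/Count (\d+)", vLine )
				if vMatch:
					vPages = int( vMatch.group(1) )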

		vMsg[vPDFfile] = vPages
		vFileOpen.close()
	#
	#cdef double tf = __c() #Used for benchmark.
	#
	#print tf-ti
	return vMsg
	#

I also wrote the code in Cython and this is the code:

"""
This module contains a function to count
the total pages for all PDF files in one directory.
"""
#from time import clock as __c
from glob import glob as __g
from re import search as __s

cdef dict __count( char *vPath ):
	#
	cdef list vPDFfiles = __g( vPath + "\\" + '*.pdf' )
	cdef int vPages = 0
	cdef dict vMsg = {}
	#
	for vPDFfile in vPDFfiles:
		vFileOpen = open( vPDFfile, 'rb', 1 )

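		for vLine in vFileOpen.readlines():
			if "/Count " in vLine:
				# Use \d+ rather than \d* so that int() never receives an empty string.
				vMatch = __s( r"/Count (\d+)", vLine )
				if vMatch:
					vPages = int( vMatch.group(1) )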

		vMsg[vPDFfile] = vPages
		vFileOpen.close()
	#
	return vMsg
	#

def count( vPath ):
	"""
	Takes one argument: the path where you want to search the files.
	Returns a dictionary with the file name and number of pages for each file.
	"""
	#cdef double ti = __c()
	cdef dict v = __count( vPath )
	#cdef double tf = __c()
	#print tf-ti #Only valid when the two benchmark lines above are enabled.
	return v
	#

Both work the same way: you call count( 'C:\\Path_to_your_PDF_files' ). Use double backslashes in the path so that they are not interpreted as escape sequences. The function returns a dictionary with the name of each PDF as key and its number of pages as value.
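
As a side note on the path handling, here is a minimal sketch (written for Python 3, not taken from the module above) that builds the search pattern with os.path.join, so the caller does not have to add the separator by hand. The helper name list_pdfs is hypothetical.

```python
import glob
import os


def list_pdfs(directory):
    """Return the sorted paths of the *.pdf files directly inside directory."""
    # os.path.join inserts the correct separator for the current platform,
    # so there is no need to concatenate "\\" manually.
    return sorted(glob.glob(os.path.join(directory, "*.pdf")))
```

With a raw string the call could look like list_pdfs(r"C:\Path_to_your_PDF_files"), which avoids doubling the backslashes.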

So... does anyone find something that could be optimised?

I suggest this

from glob import glob as __g
import re
pattern = re.compile(r"/Count\s+(\d+)")

def count(vPath):
    """
    Takes one argument: the path where you want to search the files.
    Returns a dictionary with the file name and number of pages for each file.
    """
    vPDFfiles = __g( vPath + "\\" + '*.pdf' )
    vMsg = {}
    for vPDFfile in vPDFfiles:
        vPages = 0
        content = open( vPDFfile, 'rb', 1 ).read()
        for match in pattern.finditer(content):
            vPages = int(match.group(1))
        vMsg[vPDFfile] = vPages
    return vMsg

The main difference is that I compile the pattern once and for all, and I don't read the files line by line. Tell me if it works, and how much faster it is!

Good day.

Thank you very much for your answer.

Indeed, your method is faster!
When I call the function for the first time, it's just a little faster (15-20%), but when I call it several times after that, it only needs 0.02 seconds to re-count all the pages! Excellent result!

The problem is that I never use the program to re-calculate; I just read the PDF pages once, so only the first call matters.
I think this is the maximum performance I can get... :) or not?

Thank you very much.

If you read a PDF document in a viewer and you want to know the number of pages (assuming the viewer doesn't tell you), a good way is to go straight to the last page and read its page number. You could try the same thing here: read only the end of the file, using the seek method of file objects.
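
A minimal sketch of that idea, written for Python 3. It assumes the relevant "/Count" entry happens to sit in the last few kilobytes of the file, which depends on how the PDF was written and is not guaranteed. The names count_in_tail and tail_bytes are hypothetical.

```python
import os
import re

# Same pattern idea as in the thread, applied to bytes.
COUNT_RE = re.compile(rb"/Count\s+(\d+)")


def count_in_tail(path, tail_bytes=4096):
    """Return the last /Count value found in the final tail_bytes of the file, or None."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)              # jump to the end to learn the file size
        size = f.tell()
        f.seek(max(0, size - tail_bytes))   # step back, but never before the start
        tail = f.read()
    matches = COUNT_RE.findall(tail)
    return int(matches[-1]) if matches else None
```

When the function returns None, a caller could fall back to scanning the whole file.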

Good day.
If you open a PDF file with Notepad++ or your favorite editor, you can see that the page count appears in different places in the file. Just search for "/Count". It will never be on the last line... that would be too easy...
Well. Thank you very much. :)
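
Since "/Count" can appear several times, one possible heuristic is to collect every value and keep the largest, on the assumption that the root of the page tree carries the total. This is only a sketch (Python 3, hypothetical name guess_page_count), not a real PDF parser: outline (bookmark) dictionaries also use "/Count", and files that store their objects in compressed streams may not show it in plain text at all.

```python
import re

COUNT_RE = re.compile(rb"/Count\s+(\d+)")


def guess_page_count(data):
    """Return the largest /Count value found in the raw PDF bytes, or 0 if none is found."""
    counts = [int(value) for value in COUNT_RE.findall(data)]
    return max(counts) if counts else 0
```

The code posted earlier in the thread keeps the last match instead; which rule gives the right answer depends on the files.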

I am using this code to read the number of pages of all the PDF files in one directory, because I don't want to open each file, write down its page count and then add everything up... With something like 500 PDF files in one directory, that is real madness! That's why I need a fast program. It can also generate a report with the name of each file and how many pages it has. :)
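
The report code itself is not shown in this thread. A minimal sketch of how such a report could be built from the dictionary returned by count() might look like this (Python 3). The function name format_report is hypothetical, and the line layout is an assumption modelled on the sample output posted later in the thread.

```python
def format_report(pages_by_file):
    """Build a plain-text report from a {file name: page count} dictionary."""
    # One line per file, sorted by name so the report is stable between runs.
    lines = ["%s : [%d] pag." % (name, pages)
             for name, pages in sorted(pages_by_file.items())]
    lines.append("")
    lines.append("There is a total of [%d] PDF files and [%d] pages."
                 % (len(pages_by_file), sum(pages_by_file.values())))
    return "\n".join(lines)
```

The result can be printed or written to a text file as it is.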

I installed the program pdftk (on my Linux distribution, it was available as a package). I think you may be interested in this little script:

import os
from os.path import join as pjoin, expanduser
import subprocess
from pprint import pprint

directory = expanduser("~/MyPdfs")

for name in os.listdir(directory):
  if name.endswith(".pdf"):
    p = pjoin(directory, name)
    # Pass the arguments as a list so that paths containing spaces need no shell quoting.
    child = subprocess.Popen(["pdftk", p, "dump_data", "output"],
                           stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    sout, serr = child.communicate()
    if serr:
      print(p)
      print(serr)
    else:
      # Each output line looks like "Key: value"; only the text up to the
      # second ":" is kept, so values that contain ":" are cut short.
      D = dict((t[0].strip(), t[1].strip()) for t in
                     (t.split(":") for t in sout.split("\n")[:-1]))
      pprint(D)
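
If only the page count is needed, the dump can be reduced to its NumberOfPages line. The following is a sketch for Python 3 under the assumption that pdftk is installed and on the PATH. The helper names parse_number_of_pages and pdftk_page_count are hypothetical.

```python
import subprocess


def parse_number_of_pages(dump_text):
    """Return the NumberOfPages value from pdftk dump_data output, or None."""
    for line in dump_text.splitlines():
        # partition splits at the first ":" only, so values containing ":" stay whole.
        key, sep, value = line.partition(":")
        if sep and key.strip() == "NumberOfPages":
            return int(value.strip())
    return None


def pdftk_page_count(pdf_path):
    """Run `pdftk <file> dump_data` and return the page count, or None on failure."""
    result = subprocess.run(["pdftk", pdf_path, "dump_data"],
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                            universal_newlines=True)
    if result.returncode != 0:
        return None
    return parse_number_of_pages(result.stdout)
```

Starting one external process per file is likely to cost far more time than scanning the file directly, so this is more about robustness than speed.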

Thank you very much for your response.
I played with PDFTK, it's a great tool !
Anyway, for counting pages, the Python code is far faster: about 0.07 seconds versus 18.4 seconds for the same four files.

'''
D:\Kits\PDF\pdftk_v1.12\CEB_fara_PCP.pdf : [5217] pag.
D:\Kits\PDF\pdftk_v1.12\CEB_cu_PCP.pdf : [974] pag.
D:\Kits\PDF\pdftk_v1.12\CEC.pdf : [2] pag.
D:\Kits\PDF\pdftk_v1.12\ABN.pdf : [16] pag.

There is a total of [4] PDF files and [6197] pages.
Processed in 0.0679847208363 seconds.
'''

versus

'''
{'BookmarkLevel': '1',
'BookmarkPageNumber': '16',
'BookmarkTitle': 'calin remus.pdf',
'InfoKey': 'CreationDate',
'InfoValue': 'D',
'NumberOfPages': '16',
'PdfID0': 'e8d48e4ac99b794bb04b20905404050',
'PdfID1': '81547120667f6d459e1369b9e0514461'}

{'InfoKey': 'CreationDate',
'InfoValue': 'D',
'NumberOfPages': '974',
'PdfID0': 'fc5cd447f64dbdb5aa1ac2d250edc',
'PdfID1': 'fc5cd447f64dbdb5aa1ac2d250edc'}

{'InfoKey': 'CreationDate',
'InfoValue': 'D',
'NumberOfPages': '5217',
'PdfID0': 'cefc2fdb4577a7392b8e9a72bd66b80',
'PdfID1': 'cefc2fdb4577a7392b8e9a72bd66b80'}

{'InfoKey': 'CreationDate',
'InfoValue': 'D',
'NumberOfPages': '2',
'PdfID0': '262f7a62bacecd608a18a31bfc17c9',
'PdfID1': '262f7a62bacecd608a18a31bfc17c9'}

Processed in 18.4285751608 seconds.
'''

Excellent. :)
Have a nice day.
