Iterating over a large number of files

Question

Xydric 0 Newbie Poster

15 Years Ago

Hi Guys and Gals! I come to you today requesting assistance! Let me explain what I am trying to do first, and then I will show you my code.

The purpose is to read a list of regular expressions from a text file (one regex per line; only about 6 regexes so far, but there will probably be a lot more later on), compile them, and then find all files whose names match a specific pattern (that pattern is in the code itself, not in the regex patterns file) and iterate over them. As each file is looped over, I want to check whether any of the regexes from the patterns file match anywhere in the file, and report the regex that matched and the file it matched in (as well as the span, etc., although that particular part is trivial).

So what I want to do, in SHORT:

1. Read in an ever-growing list of regular expressions from a text file and compile them.
2. Find all files with specific extensions (website files mainly, html, htm, php, js, css, etc).
3. Loop over the list of files, slurp in each file one at a time (read it all in as a string), and check to see if the regular expressions match anywhere in the string/file.

Without further ado, I present my current script.

#!/usr/bin/python

#Import needed modules.
import re
import os
import itertools

vars = {}

#Regex compile - list of file extensions and regexes
r = re.compile(r"\.(php\d?|inc|p|[a-z]?html?|tm?pl|class|cgi|p[lm]|js|aspx?|cfm|ht.*|pym?)$")
regexList = [re.compile(y) for y in [x.rstrip() for x in open("regexes.txt")]]

#Create a list generator, so we can begin checking files as they are found.

generators = [os.walk(d) for d in os.getcwd()]

finalfiles = (
    os.path.join(dir,file)
    for (dir, subdirs, files) in itertools.chain(*generators)
    for file in files if r.search(file.lower())
)
#We use the generator that was created to make a loop, and work with it.
for x in finalfiles:
    myFile = open(x).read()
    for a,b in enumerate(regexList):
        rem = b.search(myFile)
        if rem is not None:
            print "%s - regex #%s, matched at characters %s" % (x, a, rem.span())

So, um, that code *DID* work until I started screwing with it; now the line generators = [os.walk(d) for d in os.getcwd()] doesn't work. Argh. Ah well, I'm pretty sure you can see what is going on.

When it was working not long ago, it did exactly what I wanted, except that it used a WHOLE lot of CPU (or was it memory? Argh, pretty sure it was CPU): the process would sit anywhere from 80% to over 100% (multiple processors with multiple cores on CentOS 5.2+ servers). This mainly happens when there's a large number of files to iterate over, although I am sure it can also happen with a low file count when the files are rather large.

I am assuming that somewhere in the script (I know it isn't the greatest) garbage collection isn't cleaning up because I am doing something odd or inefficient.

I am not asking for anyone to write the code, or rewrite it, or insult, etc. I am merely asking for tips, insight into what the problem could be, and with help from this community we can chip away at the inefficient code and possibly ALL learn something :)

Thanks for reading over this extremely long post, and for (hopefully) seeing the problem :)

html-css os-x python regex


jlm699 320 Veteran Poster · Answer 1 · 2009-11-12T20:28:11+00:00

The method os.getcwd() can only ever return a single string. There is only ever one current working directory, so when you write for d in os.getcwd(), you're actually iterating over that string character by character. What you really want is just generators = os.walk(os.getcwd()).
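
To make that concrete, here is a minimal sketch of the walk with a single call to os.walk(). It is written for Python 3, unlike the Python 2 script in the question, and the names iter_matching_files, scan, and EXT_RE, as well as the shortened extension pattern, are made up for illustration. With only one walker there is nothing left to chain, so the sketch loops over os.walk() directly instead of going through itertools.chain(*generators).

```python
import os
import re

# Hypothetical, shortened version of the extension filter from the question.
EXT_RE = re.compile(r"\.(php\d?|[a-z]?html?|js|css)$")


def iter_matching_files(root, ext_re=EXT_RE):
    """Yield paths under root whose lower-cased file name matches ext_re."""
    # os.walk takes ONE directory path and yields (dirpath, dirnames, filenames).
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if ext_re.search(name.lower()):
                yield os.path.join(dirpath, name)


def scan(root, patterns):
    """Yield (path, pattern_index, span) for the first hit of each pattern in each file."""
    for path in iter_matching_files(root):
        # The with-block closes each file explicitly once it has been read.
        with open(path, errors="replace") as handle:
            text = handle.read()
        for index, pattern in enumerate(patterns):
            match = pattern.search(text)
            if match is not None:
                yield path, index, match.span()


# Iterating over a string yields its characters, which is what
# "for d in os.getcwd()" was doing with the working directory path.
print(list("/tmp"))  # ['/', 't', 'm', 'p']
```

Something like for path, index, span in scan(os.getcwd(), regexList) would then report each hit as it is found. Note also that on a Unix system the first character of os.getcwd() is "/", so the list comprehension in the question would have started one of its walks at the filesystem root.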