Searching for multiple items

Question

cableguy31 0 Light Poster

14 Years Ago

I'm writing a script that will search file names, looking for certain file extensions. The thing is that I am looking for multiple extensions, and the list may change.

I could write an if statement using "and", but the line just gets a bit long and can become difficult to manage.

I was hoping that I might be able to create a list or use a regex so that there is one place where I can list all of the extensions, and then use that list (or regex) to search the file names.

I'm still a Python newbie and I'm not sure how I could implement something like this.

Any help would be appreciated.

Thanks.

Jason

python


snippsat 661 Master Poster

14 Years Ago

Here is an alternative with "endswith" that works fine for this.

aFile = 'test.jpg'
for ext in ['.txt', '.jpg', '.zip']:
    if aFile.lower().endswith(ext):
        print 'Do something'
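
As a hedged aside, str.endswith also accepts a tuple of suffixes, so the extensions can live in one place and the loop collapses to a single call. The sketch below is only an illustration: the names WANTED_EXTENSIONS and has_wanted_extension are hypothetical and do not come from the thread.

```python
# Extensions to look for, kept in one place (example values only).
WANTED_EXTENSIONS = ('.txt', '.jpg', '.zip')

def has_wanted_extension(filename, extensions=WANTED_EXTENSIONS):
    """Return True if filename ends with any of the extensions, ignoring case."""
    # str.endswith accepts a tuple and returns True if any one suffix matches.
    return filename.lower().endswith(tuple(extensions))

for name in ('test.jpg', 'REPORT.TXT', 'notes.doc'):
    print(name, has_wanted_extension(name))
```

Changing the list then only means editing the tuple; the check itself stays the same.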


snippsat 661 Master Poster

14 Years Ago

Quoting TrustyTony: Just one small thing is that it is quite rare for the filename to end with both .txt and .jpg, so I would fix the logic a little:

See the two previous posts; I just copied that list, and I think most people understand it is just used as an example.

Quoting TrustyTony: .tar.gz file is however gz file, not tar file

This will check for a gz file.

aFile = 'file.tar.gz'
for ext in ['.rar', '.gz', '.zip']:
    if aFile.lower().endswith(ext):
        print 'Do something'

There are many ways to do this; I think this one is quite readable.
"Readability counts", as taken from the Zen of Python.
http://www.python.org/dev/peps/pep-0020/



Beat_Slayer 17 Posting Pro in Training · Answer 1 · 2010-06-29T02:28:33+00:00

If I understood correctly:

import os

extensions = ['.txt', '.jpg', '.zip']

# filename is the name of the file being checked
f_name, f_extension = os.path.splitext(filename)

for extension in extensions:
    if f_extension == extension:
        do_something()

griswolf 304 Veteran Poster · Answer 2 · 2010-06-29T09:34:23+00:00

Another way, with regular expressions:

import re
extensions = ['txt', 'jpg', 'zip']
# raw strings keep the backslash in '\.' as a literal regex escape
pat = r'\.(' + '|'.join(extensions) + r')$'
extRE = re.compile(pat)
for f in getFilenames():
    if extRE.search(f):
        do_something(f)

Note the absence of the '.' in the extensions: I wanted to put it in only once, but you could repeat it in each entry if that makes the UI simpler. I have no idea whether a full search is faster than a splitext followed by a short search. If it would be useful, you could do this instead: if extRE.search(os.path.splitext(f)[1]). You can also parenthesize the individual extensions and check the groups() of the match object if it would be nice to do a sub-dispatch on each extension.
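
A minimal sketch of the sub-dispatch idea, under some assumptions: it uses a single capturing group around the whole alternation (rather than one group per extension), adds re.escape and re.IGNORECASE, and the helper name matched_extension is hypothetical.

```python
import re

# Example extensions without the leading dot, as in the post above.
extensions = ['txt', 'jpg', 'zip']

# re.escape guards against extensions that contain regex metacharacters;
# the single capturing group records which extension matched.
ext_re = re.compile(
    r'\.(' + '|'.join(re.escape(e) for e in extensions) + r')$',
    re.IGNORECASE)

def matched_extension(filename):
    """Return the matched extension in lower case, or None if nothing matched."""
    m = ext_re.search(filename)
    return m.group(1).lower() if m else None

for name in ['notes.TXT', 'photo.jpg', 'archive.tar.gz']:
    print(name, '->', matched_extension(name))
```

The returned string could then be used as a key to choose what to do with each file type.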

TrustyTony 888 ex-Moderator Team Colleague Featured Poster · Answer 3 · 2010-06-29T12:44:20+00:00

Quoting snippsat: Here is an alternative with "endswith" that works fine for this.

Just one small thing is that it is quite rare for the filename to end with both .txt and .jpg, so I would fix the logic a little:

aFile = 'test.jpg'
for ext in ['.txt', '.jpg', '.zip']:
    if aFile.lower().endswith(ext):
        print('Do something')
        break  # done; use return instead if this is deep inside a function

If there are many extensions, I would consider using rpartition (or the os.path function, see Beat_Slayer) to separate the extension, and then doing a dict lookup on it to get the processing function for that file type. This replaces the linear search with a direct lookup. The .lower() call is quite essential, as files can often be .jpg or .JPG; that is a good catch.

There is, however, the case of .tar.gz and similar files in Linux/*nix environments, where a better way may be to split the extension at the first dot rather than the last one, which I suppose is what os.path.splitext(filename) does. A .tar.gz file is, however, a gz file, not a tar file.
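
To make the difference concrete, here is a small sketch. os.path.splitext does split at the last dot; the helper split_at_first_dot is a hypothetical name introduced only for this illustration, not something from the thread or the standard library.

```python
import os

def split_at_first_dot(filename):
    """Split the base name at its first dot, keeping compound extensions whole."""
    stem, dot, rest = os.path.basename(filename).partition('.')
    return stem, dot + rest

filename = 'backup.tar.gz'

# os.path.splitext splits at the last dot, so only '.gz' is reported.
print(os.path.splitext(filename))    # ('backup.tar', '.gz')

# Splitting at the first dot keeps '.tar.gz' together.
print(split_at_first_dot(filename))  # ('backup', '.tar.gz')
```

Splitting at the first dot is only a heuristic: a name such as my.report.txt would yield '.report.txt', and a hidden file such as .bashrc would yield an empty stem, so it suits cases where compound extensions are expected.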

TrustyTony 888 ex-Moderator Team Colleague Featured Poster · Answer 4 · 2010-06-29T15:33:29+00:00

To me it just looks more logically correct for the program to stop checking the other alternatives when only one is possible and it has been found. Your code continues to check the .zip ending even after it finds .gz, for example. Don't take offense.

My comment about .tar.gz was about uniformity between Linux- and Windows-style environments, as that file has its 'own file type', tgz, even though it is just one file type inside another.

I love to use .endswith myself; it makes nice code. Also, putting the file type list in a separate module, importing it, and laying the list out logically over several lines will improve the final code if the number of file types is big.

Here is a start for a function-dict-based solution:

import os

def textfunc(filename):
    print('Text processing %s' % filename)

def rtffunc(filename):
    print('RTF processing %s' % filename)

def pyfunc(filename):
    print('PY processing %s' % filename)

def jpgfunc(filename):
    print('JPG processing %s' % filename)

def gzfunc(filename):
    print('gz processing %s' % filename)

def zipfunc(filename):
    print('zip processing %s' % filename)

filefuncs={'.txt' : textfunc, '.rtf' : rtffunc,'.py' : pyfunc, # text files
            '.jpg' : jpgfunc, # pictures
            '.gz' : gzfunc, '.zip': zipfunc, # compressed
             # comma in the end helps updating
            }

for this_file in os.listdir(os.curdir):
    _,ext = os.path.splitext(this_file)
    if ext in filefuncs:
        filefuncs[ext](this_file)
    else:
        print('Handler not written for %s filetype' % ext)

input('Ready')
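
A possible variation on the dict approach, sketched under assumptions: it lower-cases the extension before the lookup (so .JPG and .jpg hit the same entry) and uses dict.get with a fallback instead of an if/else. All handler names here are hypothetical stand-ins, and they return strings rather than printing so the result is easy to inspect.

```python
import os

def text_handler(filename):
    return 'Text processing %s' % filename

def image_handler(filename):
    return 'Image processing %s' % filename

def default_handler(filename):
    return 'No handler written for %s' % filename

# Keys are stored in lower case so the lookup can normalize once.
handlers = {
    '.txt': text_handler,
    '.jpg': image_handler,
}

def dispatch(filename):
    """Pick a handler by lower-cased extension, falling back to default_handler."""
    _, ext = os.path.splitext(filename)
    return handlers.get(ext.lower(), default_handler)(filename)

for name in ['notes.txt', 'PHOTO.JPG', 'data.bin']:
    print(dispatch(name))
```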

cableguy31 0 Light Poster · Answer 5 · 2010-06-29T18:44:43+00:00

Thank you all for your input.

Here is what I ended up doing, and it seems to be working.

import re

# Match any of these extensions at the end of the name, ignoring case
doNotSearch = re.compile(
    r"\.(exe|gif|jpeg|jpg|png|dll|jar|wpc|sys|ocx|cnv|cpl"
    r"|sdb|ime|hlp|mp3|wav|mpeg|chm|msi|msp|mst|olb)$",
    re.IGNORECASE)

if not doNotSearch.search(j):

TrustyTony 888 ex-Moderator Team Colleague Featured Poster · Answer 6 · 2010-06-30T00:11:17+00:00

A simple example that prints the files in the current directory whose types are not in the disallowed set (pyc added):

from __future__ import print_function
import os
ignore_filetypes=set("exe|gif|jpeg|jpg|png|dll|jar|wpc|sys|ocx|cnv|cpl\
|sdb|ime|hlp|mp3|wav|mpeg|chm|msi|msp|mst|olb|pyc".split('|'))

for i in (f for f in os.listdir(os.curdir) if os.path.isfile(f) ): ## no directories like '.', '..' or normal ones
    _,ext = os.path.splitext(i)
    if ext[1:] in ignore_filetypes: ## take out '.'
        print ('Not', i)
        continue
    print('-'*30, " %18s " % i, '-'*30)
    print(open(i).read()) ## print printable file