Hi! I'm totally new. I did an undergrad in a non-Python-related science and I'm trying to bring Python into my postgrad work, so my questions right now are fairly basic.

I basically want to run muscle (a sequence alignment tool) on a large number of files. How do I start a Python script that basically says "open each of these files in turn, then use this PBS script on each of them"?


What are you aiming to get out of this? What is your end result?

The basic framework would look something like this ...

# find selected files in working folder using module glob
# glob works with Windows and Unix
# note: matching ignores upper/lower case on Windows but not on Unix

import os
import glob

# pick a directory/folder where your .dat data files are
folder = "C:/temp"
# change to that folder
os.chdir(folder)

# process all the .dat data files in that folder
for fname in glob.glob('*.dat'):
    # the with statement closes the file when the block ends
    with open(fname, "r") as fh:
        # creates a list of data line items as strings
        data_list = fh.readlines()
    # now process the data_list
    for count, item in enumerate(data_list):
        # do something with each data line item
        # and save the result
        # optionally show progress
        print("file %s line %d processed" % (fname, count + 1))

This assumes that each data file contains one piece of processable data per line.

In the future, give your thread a more meaningful title and more people will help. This sounds more like a last-minute homework problem.

Thank you so much, I'll definitely give that a try. Also thanks for the tip about it sounding like a homework problem! I wish I HAD done it in school, I wouldn't be having so many problems now!


What are you aiming to get out of this? What is your end result?

I have over 400,000 DNA sequences in one file. I've divided this one file into 2,000 files, based on different parameters, like their homology to each other.

So now I want to run a program called muscle on all 2,000 files (as in open file 1, run muscle, close file 1, open file 2... etc.), which are all in one directory.

On the cluster I'm on (Stokes), jobs go through PBS scripts.

So my aim is to somehow combine Python and PBS scripts and say:
Python: open each of these files in sequence and then...
PBS: run muscle on each.

So ultimately I'll have another 2,000 files, each of which has been processed using muscle.

If that makes sense?!

Some info is missing from your posts:

1) What is the command that you want to run for each file? I read a few Google results about PBS scripts, and maybe you want to run commands like qsub script on the command line. So the question is: how would you run muscle by hand if you had only one file to process? Also, if the job must be completed for one file before you start the next, how do you know that the job is finished?

2) What are your file names? Are they all in the same directory?

OK, so my files are all in the same directory.

They are named fam1.mcl.fas up to fam2221.mcl.fas.
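
A side note on names like these: a plain alphabetical sort puts fam10 before fam2, so if the processing order matters it may help to sort on the number instead. Here is a minimal sketch, assuming every file follows the fam<N>.mcl.fas pattern; the helper names family_number and sort_family_files are made up for illustration.

```python
import re

# Hypothetical pattern for names like "fam12.mcl.fas" (assumed from the thread).
FAM_PATTERN = re.compile(r"^fam(\d+)\.mcl\.fas$")

def family_number(fname):
    """Return the integer in 'fam<N>.mcl.fas', or None if the name does not match."""
    match = FAM_PATTERN.match(fname)
    return int(match.group(1)) if match else None

def sort_family_files(fnames):
    """Keep only the matching names and sort them by family number."""
    matching = [f for f in fnames if family_number(f) is not None]
    return sorted(matching, key=family_number)

names = ["fam10.mcl.fas", "fam2.mcl.fas", "notes.txt", "fam1.mcl.fas"]
print(sort_family_files(names))
# prints ['fam1.mcl.fas', 'fam2.mcl.fas', 'fam10.mcl.fas']
```

In practice the list of names could come from os.listdir or glob.glob on the data directory.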

I'm aware of the qsub command, but my problem is that a PBS script (which would be qsubbed) only works on one file at a time (as in, a typical PBS script would run muscle on file 1), unless the command in the PBS script is a Python or similar script.

So I was hoping to write a Python script to say:

open the directory with the collection of files
open a file
run muscle, the command being similar to "blastall -p blastp -i query -d db"

write an output file
close file
move on to next file
open file...run muscle...

Basically, I would then write a PBS script for the Python script and qsub the PBS script.
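
For what it's worth, the PBS wrapper itself is short enough that Python could write it out too. The following is only a rough sketch: render_pbs_script is a made-up helper, the job name and walltime are placeholders, and any queue or resource lines your cluster requires are not included, so check the local documentation before using anything like it.

```python
def render_pbs_script(job_name, command, walltime="24:00:00"):
    """Return the text of a small PBS job script that runs one command."""
    lines = [
        "#!/bin/bash",
        "#PBS -N %s" % job_name,            # job name shown in the queue
        "#PBS -l walltime=%s" % walltime,   # placeholder time limit
        # go back to the directory where qsub was run
        'cd "$PBS_O_WORKDIR"',
        command,
    ]
    return "\n".join(lines) + "\n"

# The command here is a placeholder for whatever script does the looping.
script = render_pbs_script("muscle_all", "python myscript.py input_dir output_dir db_path")
print(script)
```

The returned text could be saved to a file and submitted with qsub in the usual way.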

I suppose my only way of knowing the thing is finished is if I have an equal number of output and input files.
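
Counting files tells you that something is missing but not which file. A slightly more informative check is to list the inputs that have no output yet, or only an empty one. This is a sketch under assumptions: unfinished_inputs is a made-up helper, and the ".aln" output naming in the demonstration is invented, so swap in whatever naming your run actually uses.

```python
import os
import tempfile

def unfinished_inputs(input_dir, output_dir, output_name):
    """Return input file names whose output file is missing or empty.

    output_name is a function mapping an input file name to the
    expected output file name (the naming scheme is up to you).
    """
    unfinished = []
    for iname in sorted(os.listdir(input_dir)):
        opath = os.path.join(output_dir, output_name(iname))
        if not os.path.isfile(opath) or os.path.getsize(opath) == 0:
            unfinished.append(iname)
    return unfinished

# Small demonstration in throwaway directories.
in_dir = tempfile.mkdtemp()
out_dir = tempfile.mkdtemp()
for name in ("fam1.mcl.fas", "fam2.mcl.fas"):
    with open(os.path.join(in_dir, name), "w") as fh:
        fh.write(">seq\nACGT\n")
# Only the first input has an output so far.
with open(os.path.join(out_dir, "fam1.mcl.fas.aln"), "w") as fh:
    fh.write(">seq\nACGT\n")

todo = unfinished_inputs(in_dir, out_dir, lambda name: name + ".aln")
print(todo)
# prints ['fam2.mcl.fas']
```

An empty list means every input has a non-empty output, which is a better sign than equal counts alone, though it still does not prove each alignment ran to completion.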

Perhaps you could follow this kind of design (if I understand correctly the meanings of the arguments to blastall)

#!/usr/bin/env python
# myscript.py
import sys
import subprocess as sp
from os.path import join as pjoin

def name_pairs():
    "yield a sequence of pairs (input file name, output file name)"
    # files are numbered fam1.mcl.fas up to fam2221.mcl.fas
    for i in range(1, 2222):
        yield ("fam%d.mcl.fas" % i, "output%d" % i)

def path_pairs(input_dir, output_dir):
    "yield a sequence of pairs (input path, output path)"
    for iname, oname in name_pairs():
        yield pjoin(input_dir, iname), pjoin(output_dir, oname)

def commands(input_dir, output_dir, db_path):
    "yield the commands to run"
    for ipath, opath in path_pairs(input_dir, output_dir):
        yield "blastall -p blastp -i %s -d %s -o %s" % (ipath, db_path, opath)

def run_commands(input_dir, output_dir, db_path):
    "run the commands one after another"
    for cmd in commands(input_dir, output_dir, db_path):
        process = sp.Popen(cmd, shell=True)
        # wait for this command to finish before starting the next one
        process.wait()

def main():
    # the last three command line arguments are the two directories and the db
    input_dir, output_dir, db_path = sys.argv[-3:]
    run_commands(input_dir, output_dir, db_path)

if __name__ == "__main__":
    main()

You could run this script in a shell as myscript.py input_dir output_dir db_path. Also, you should create a fresh directory as output_dir.
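
Since the real target is muscle rather than blastall, the same loop could be written with an argument list instead of a shell string, which avoids quoting problems with odd file names, and with a check on each exit status. Treat this as a sketch: muscle_command and run_all are made-up names, and the -in and -out flags are the ones from MUSCLE 3.x (newer releases use different options), so check your installed version's help before relying on them.

```python
import subprocess

def muscle_command(ipath, opath, muscle_exe="muscle"):
    """Build the argument list for one alignment (MUSCLE 3.x style flags assumed)."""
    return [muscle_exe, "-in", ipath, "-out", opath]

def run_all(pairs, build_command=muscle_command):
    """Run one command per (input, output) pair, in order; return the failures."""
    failed = []
    for ipath, opath in pairs:
        # call() blocks until the program exits and returns its exit status
        status = subprocess.call(build_command(ipath, opath))
        if status != 0:
            failed.append((ipath, status))
    return failed
```

The pairs could come from a generator of (input path, output path) tuples. An empty list back from run_all means every command exited with status 0, which is one answer to the "how do you know the job is finished" question.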