I have a list of files in a similar format. I want to extract from each file all lines that contain a specific string. For example, I need to extract only the lines with TPO and save them to one file, and also extract only the lines with HOH and save them to another file. Each file may contain different strings that I need to extract separately (not collectively).

Example


HETATM 1271 N TPO A 160 65.018 41.561 0.708 1.00 18.62 N
HETATM 1272 CA TPO A 160 63.724 42.043 1.187 1.00 20.80 C
HETATM 1273 C ACE A 160 65.018 41.561 0.708 1.00 18.62 C
HETATM 1274 O ACE A 160 65.018 41.561 0.708 1.00 18.62 O
HETATM 1275 CH3 ACE A 160 65.018 41.561 0.708 1.00 18.62 C
HETATM 1276 O HOH A 160 65.018 41.561 0.708 1.00 18.62 O
HETATM 1277 H HOH A 160 63.724 42.043 1.187 1.00 20.80 H

You will want to use readlines() unless you have huge files. Once you have the list of lines in a variable, say lines:

tpos = []
hohs = []
aces = []
others = []
for line in lines:
  if 'TPO' in line:
    tpos.append(line)
  elif 'HOH' in line:
    hohs.append(line)
  # elif as many times as needed
  else:
    others.append(line)

Then you can save each of the lists to its own file.
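For what it's worth, here is a minimal sketch of the group-then-save idea above. The helper names group_lines and save_groups, the "other" bucket name, and the <tag>.txt output naming are my own assumptions, not anything from the original post. It uses the same plain substring test ('TPO' in line) as the loop above.

```python
import os

def group_lines(lines, tags):
    """Return {tag: [lines containing tag]} plus an 'other' bucket (hypothetical helper)."""
    groups = {tag: [] for tag in tags}
    groups["other"] = []
    for line in lines:
        for tag in tags:
            if tag in line:          # same substring test as the loop above
                groups[tag].append(line)
                break
        else:
            # No tag matched: keep the line around for debugging.
            groups["other"].append(line)
    return groups

def save_groups(groups, out_dir):
    """Write each non-empty group to <out_dir>/<tag>.txt (assumed naming scheme)."""
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    for tag, tagged in groups.items():
        if tagged:
            with open(os.path.join(out_dir, tag.lower() + ".txt"), "w") as out:
                out.writelines(tagged)  # lines from readlines() keep their newlines
```

You would call it with something like group_lines(open(name).readlines(), ['TPO', 'HOH']) and then pass the result to save_groups along with an output folder of your choice.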

It would also be possible to keep all the output files open at once and, instead of saving lines to a list, write each line straight to its file: print >> someOpenFile, line (Python 2 syntax). Whenever you open files, you need to ensure that they are closed when you are done. That happens automatically when the script ends, but better practice is to use a with statement, as explained here: http://effbot.org/zone/python-with-statement.htm. If you are stuck in some Python dark ages without with, use try/finally blocks.
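A rough sketch of the "all output files open at once" variant, using a single with statement so everything gets closed even if an error occurs. The function name split_file and its three path parameters are placeholders I made up. It uses write() instead of print >> so the same code runs on Python 2.7 and Python 3.

```python
def split_file(in_path, tpo_path, hoh_path):
    """Stream in_path once, copying TPO and HOH lines to separate files.

    Hypothetical helper; the paths are whatever you choose to pass in.
    """
    # One with statement manages all three files and closes them on exit.
    with open(in_path) as src, \
         open(tpo_path, "w") as tpo_out, \
         open(hoh_path, "w") as hoh_out:
        for line in src:
            if "TPO" in line:
                tpo_out.write(line)
            elif "HOH" in line:
                hoh_out.write(line)
            # Other lines are skipped; add more elif branches as needed.
```

Because the input is read line by line, this version does not need readlines() and should cope with large files.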

The others list is there to help you debug problems, which in this context usually means 'unexpected input'. Instead of saving to that list and eventually doing something with it, you might want to just abort (unless it takes significant time to parse your input file), like this:

...
    print >> sys.stderr, "Quitting with unexpected line [%s]%(line.strip())
    sys.exit(3) # any value greater than 0

Of course you can log the problem instead of aborting if this script needs to run without human oversight... in which case you might also consider sending yourself email for errors.
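If you go the logging route, something along these lines might do. It is only a sketch: the logger name "splitter", the classify function, and the source label are all assumptions of mine. It uses the standard logging module and counts the unexpected lines so the caller can decide what to do afterwards.

```python
import logging

logger = logging.getLogger("splitter")  # hypothetical logger name

def classify(lines, tags, source="<input>"):
    """Group lines by tag and log (rather than abort on) anything unexpected."""
    matched = {tag: [] for tag in tags}
    unexpected = 0
    for number, line in enumerate(lines, 1):
        for tag in tags:
            if tag in line:
                matched[tag].append(line)
                break
        else:
            # No tag matched: record where it happened and carry on.
            unexpected += 1
            logger.warning("%s:%d: unexpected line [%s]",
                           source, number, line.strip())
    return matched, unexpected
```

Calling logging.basicConfig(filename=...) once at startup would send these warnings to a file of your choosing. For the email idea, the standard library has logging.handlers.SMTPHandler, though I have not shown how to configure it here.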

Edited 6 Years Ago by griswolf: n/a

Oops. There is a missing closing double quote on the second line of the snippet above, which should be: print >> sys.stderr, "Quitting with unexpected line [%s]"%(line.strip())


You could use the glob module to read every file with a given extension. I would use something like this:

import glob
if __name__ == '__main__':
    output_tpo = open('tpo.csv', 'wb')
    output_hoh = open('hoh.csv', 'wb')
    for filename in glob.glob('*.txt'): # if the extensions are txt
        input_file = open(filename, 'rb')
        for line in input_file:
            if 'TPO' in line:
                output_tpo.write(line)
            elif 'HOH' in line:
                output_hoh.write(line)
        input_file.close()
    output_tpo.close()
    output_hoh.close()

Edited 6 Years Ago by baki100: n/a

CSV files are text, so the b (binary) mode is probably overkill, and the example input was plain text anyway. It would also be good to make sure the output files cannot become inputs if the program is run again.

I usually create an output folder to hold the results.

If you would like to do it yourself, an alternative to glob.glob is:

import os
if __name__ == '__main__':
    # Keep results in their own folder so they are not read as inputs on a rerun.
    if not os.path.isdir('results'):
        os.mkdir('results')
    output_tpo = open('results/tpo.txt', 'w')
    output_hoh = open('results/hoh.txt', 'w')
    for filename in os.listdir('.'):
        if not filename.endswith('.txt'):
            continue  # skip anything that is not a .txt input
        with open(filename) as input_file:  # input is closed when the block ends
            for line in input_file:
                if 'TPO' in line:
                    output_tpo.write(line)
                elif 'HOH' in line:
                    output_hoh.write(line)
    output_tpo.close()
    output_hoh.close()

I also use that method, but I went for something super simple and quick. Ideally kelokely could also create an input folder, work from there, and have the results go in the results folder.
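Putting the input-folder and results-folder idea together might look roughly like this. Treat it as a sketch: split_folder, its parameters, and the <tag>.txt naming are invented for illustration, and the tag list is whatever strings you need to pull out.

```python
import glob
import os

def split_folder(in_dir, out_dir, tags, pattern="*.txt"):
    """Read every matching file in in_dir; write one <tag>.txt per tag in out_dir."""
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    outputs = {}
    try:
        # One output file per tag, all kept open while the inputs are scanned.
        for tag in tags:
            outputs[tag] = open(os.path.join(out_dir, tag.lower() + ".txt"), "w")
        # glob.glob is not recursive here, so a results folder nested inside
        # in_dir is not picked up as input.
        for path in sorted(glob.glob(os.path.join(in_dir, pattern))):
            with open(path) as src:
                for line in src:
                    for tag in tags:
                        if tag in line:
                            outputs[tag].write(line)
                            break
    finally:
        # Close whatever was opened, even if something failed part-way.
        for out in outputs.values():
            out.close()
```

For example, split_folder('input', 'results', ['TPO', 'HOH']) would read input/*.txt and produce results/tpo.txt and results/hoh.txt, assuming those folder names suit your setup.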
