I am new to this forum (and Python), but since I have found excellent posts here, I figured it would be the best place to post this two-part question:

I have many (100+) .txt files which are space delimited (not CSV or tab, possibly referred to as ASCII?) containing 2 columns of data by ~90,000 rows. The first column is consecutive values from 800-10,000. I have attached a file as an example which only covers 800-3,627 (38,430 rows).

Part (1): I would like to remove rows that have a first column value between 1569 and 1575. I assume this is accomplished by reading the lines in, then writing them out to another file if they are not between 1569 and 1575.

Part (2): Can I set this to run on all text files in a path/directory versus running it for each file individually?

Thanks in advance for the help, and I promise one day I am going to get better at programming.

Regarding your first part:

The readlines() function gives you all the lines of the file in a list. Then all you have to do is traverse the list and call the split() function on each line (whitespace is the default delimiter).
Read more on http://www.java2s.com/Code/Python/String/String-Split.htm

Regarding your second part, one option is to store the names of all the files in a separate filename.txt, then open this file in your program and use it to open all the other files.
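
An untested sketch of that approach, assuming filename.txt lists one data-file name per line (the '.out' output suffix is just an example name):

# filename.txt is assumed to hold one data-file name per line
with open('filename.txt') as listfile:
    names = [name.strip() for name in listfile if name.strip()]

for name in names:
    # keep only rows whose first-column value falls outside 1569-1575
    kept = []
    for line in open(name):
        if not (1569 < float(line.split()[0]) < 1575):
            kept.append(line)
    # write the kept rows to a new file (the '.out' suffix is arbitrary)
    with open(name + '.out', 'w') as outfile:
        outfile.writelines(kept)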

Maybe something like this:

import os
for textfile in (filename for filename in os.listdir(os.curdir) if filename.endswith('.txt')):
    oklines = [line for line in open(textfile) if not (1569 < float(line.split()[0]) < 1575)]
    with open(textfile,'w') as outfile:
        outfile.write(''.join(oklines))

Hey tony, this is a beginner's question, not the obfuscated Python contest!
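
For a beginner, the same logic unrolled into plain loops may read better (a sketch equivalent to tonyjv's version):

import os

for textfile in os.listdir(os.curdir):
    # only process the '.txt' files in the current directory
    if not textfile.endswith('.txt'):
        continue
    # collect the rows whose first-column value is outside 1569-1575
    oklines = []
    for line in open(textfile):
        if not (1569 < float(line.split()[0]) < 1575):
            oklines.append(line)
    # overwrite the original file with the kept rows
    with open(textfile, 'w') as outfile:
        outfile.write(''.join(oklines))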

tonyjv,
That is some intense code. I was going to post what I had started, but then saw yours. Any way you could maybe put some explanations in the code? My biggest question: although I see how it should read all text files in a specified directory, is it writing out under a different file name in that directory, or does it, as it seems, just re-write the existing file? And does it matter that each file is 90,000 rows and 2-2.5 MB in size?

(Here is what I had so far, which didn't address the second part of my post about going through a batch of files)

# read the data file in as a list
f = open('*.txt', 'r')

data_list = f.readlines()
f.close()

# remove list items whose first column is between 1569 and 1575
kept = []
for line in data_list:
    if not (1569 < float(line.split()[0]) < 1575):
        kept.append(line)

# write the changed data (list) to a new file
f = open('*.txt', 'w')
f.writelines(kept)
f.close()

You must either filter by endswith or use the glob module; you cannot put '*' in a filename.
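
For example, the same filtering with glob instead of os.listdir (an untested sketch):

import glob

# glob.glob expands the wildcard itself; open() cannot
for textfile in glob.glob('*.txt'):
    oklines = [line for line in open(textfile)
               if not (1569 < float(line.split()[0]) < 1575)]
    with open(textfile, 'w') as outfile:
        outfile.write(''.join(oklines))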

On request of clarification (though I assumed that the variable names were quite self-documenting):

# we need the os module to access os.listdir for the list of files in the same directory as this program
import os
# let's take only those filenames from the directory which end in '.txt'
for textfile in (filename for filename in os.listdir(os.curdir) if filename.endswith('.txt')):
    # we cannot use a generator of lines, as we are overwriting the original files (a dangerous practice, though)
    # we are filtering out the lines whose first (0th) column, as float, is between the given limits per the OP's request
    oklines = [line for line in open(textfile) if not (1569 < float(line.split()[0]) < 1575)]
    # here we overwrite the original file, assuming it is a copy of the original or we never need the full original data
    # we could make a separate directory for the processed files with os.mkdir for safer code in a practical application
    # we let 'with' close the file for us safely
    with open(textfile,'w') as outfile:
        # it is enough just to join the lines, as they keep their original '\n' at the end
        outfile.write(''.join(oklines))

Here is a version which creates a subdirectory for the results:

# we need the os module to access many useful file functions
import os
# we create the output directory if it does not exist
if not os.path.isdir('output'):
    os.mkdir('output')

# let's take only those filenames from the directory which end in '.txt'
for textfile in (filename for filename in os.listdir(os.curdir) if filename.endswith('.txt')):
    with open(os.path.join('output', textfile), 'w') as outfile:
        # it is enough just to join the lines, as they keep their original '\n' at the end
        outfile.write(''.join(line
                              for line in open(textfile)
                              if not (1569 < float(line.split()[0]) < 1575)
                              )
                      )

A size of a few megabytes is negligible, even if it is a lot of data when you print it on paper. These days computers have gigabytes of memory, and only one file's oklines are in memory at a time anyway. In this corrected version, we do not even need to keep those in memory; we let the generator expression produce them 'on the fly'.

Tony,

That script worked perfectly. I am sorry to have asked you to explain it since I could have looked up everything myself. Thank you so much; wish I could buy you a beer or something.
