Hi, I'm wondering how I would go about writing a program that reads data from about 1,000 files, each containing data such as this:

2.825       1.00697992588
2.875       0.952989176901
2.925       0.91428970229
2.975       0.890110513425
3.025       0.879731596138
3.075       0.959217137445
3.125       1.07391392796
3.175       1.04874407027
3.225       0.857693793906

I want to generate an average value for each element in the file. For example, it would probably do this:

loop over all files(named output10000 - output10000000)
get all the values at all positions
generate an average value for each value
output new data file with all averages

I just can't figure out how to handle the input.

Here's how to open a file and parse the contents as per your example:

def main():
    """ test_data.txt contents (minus leading tabs):
        2.825       1.00697992588
        2.875       0.952989176901
        2.925       0.91428970229
        2.975       0.890110513425
        3.025       0.879731596138
        3.075       0.959217137445
        3.125       1.07391392796
        3.175       1.04874407027
        3.225       0.857693793906
    """
    # We can keep track of our data in this structure;
    #  however you may make use of any data structure that suits you (classes, list, etc)
    running_avgs = {}
    #
    fh = open('test_data.txt')
    lines = fh.readlines()
    fh.close()
    #
    for line in lines:
        line_data = line.strip().split()
        if len(line_data) == 2:
            element = line_data[0]
            value = line_data[1]
            if running_avgs.get(element):
                # IF it's in our dictionary we can calculate the new avg here
                pass # This section depends on how you want to handle your avgs
            else:
                # The first element can be the running sum of values
                # The second element will be the count of elements
                # (float() because the values are read from the file as strings)
                running_avgs[element] = [float(value),1]
    #
    print running_avgs

if __name__ == '__main__':
    main()
    raw_input('Press enter to exit...')

I haven't tested that so it could contain a syntax error or two... but I hope it helps get you on your way.
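
In case it helps, here is a minimal sketch of what could go where the `pass` is, assuming you want one mean per first-column value. The helper names `update_running` and `averages` and the three sample lines are made up for illustration; they are not part of the code above. It is written so that it should run on both Python 2 and 3.

```python
def update_running(running_avgs, element, value):
    """Add one reading to the [sum, count] pair stored for element."""
    value = float(value)  # values read from a file arrive as strings
    if element in running_avgs:
        running_avgs[element][0] += value  # running sum
        running_avgs[element][1] += 1      # number of readings seen
    else:
        running_avgs[element] = [value, 1]


def averages(running_avgs):
    """Turn each [sum, count] pair into a mean."""
    return dict((key, total / count)
                for key, (total, count) in running_avgs.items())


# Made-up sample lines, only to show the bookkeeping
running_avgs = {}
for line in ["2.825  1.0", "2.875  4.0", "2.825  3.0"]:
    element, value = line.split()
    update_running(running_avgs, element, value)

means = averages(running_avgs)
print(means["2.825"])  # (1.0 + 3.0) / 2 = 2.0
```

Keeping a sum and a count per key means you never have to hold all 1,000 files in memory at once.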

EDIT: Also keep in mind that you'll want to loop this whole operation for all your files... So if the files are all under a directory you could simply do this:

import os

my_dir = '/home/usr/data_collection/'
file_list = os.listdir(my_dir)
#
for file_name in file_list:
    # Protect us from sub directories
    # os.path.join will also be needed for open() if we're not
    #   currently within my_dir as a working directory
    # To remove the need for os.path.join we could also do
    #   os.chdir( my_dir )
    if os.path.isfile(os.path.join(my_dir,file_name)):
        # Perform the above code on the file as long as it's valid.
        pass  # placeholder so the if block is not empty
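
If you want to try the directory loop on its own first, here is a small self-contained sketch. The function name `list_data_files`, the `prefix` parameter, and the demo file names are assumptions for illustration; the temporary directory only stands in for your real data directory.

```python
import os
import tempfile


def list_data_files(my_dir, prefix='output'):
    """Return sorted full paths of regular files in my_dir whose names start with prefix."""
    paths = []
    for file_name in os.listdir(my_dir):
        file_path = os.path.join(my_dir, file_name)
        # isfile() skips sub directories; startswith() skips unrelated files
        if os.path.isfile(file_path) and file_name.startswith(prefix):
            paths.append(file_path)
    return sorted(paths)


# Build a throwaway directory with two data files, one unrelated file
# and one sub directory, then list it.
demo_dir = tempfile.mkdtemp()
for name in ('output10000', 'output20000', 'notes.txt'):
    open(os.path.join(demo_dir, name), 'w').close()
os.mkdir(os.path.join(demo_dir, 'output_subdir'))

found = [os.path.basename(p) for p in list_data_files(demo_dir)]
print(found)
```

Building full paths with os.path.join means the script does not depend on the current working directory.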

Thanks for getting back to me. I'm just a bit confused as to what I need to put below your final if statement.

Alright, well, here's the code merged together:

import os

def main():
    """ Example file contents from the /home/usr/data_collection directory:
        2.825       1.00697992588
        2.875       0.952989176901
        2.925       0.91428970229
        2.975       0.890110513425
        3.025       0.879731596138
        3.075       0.959217137445
        3.125       1.07391392796
        3.175       1.04874407027
        3.225       0.857693793906
    """
    # We can keep track of our data in this structure;
    #  however you may make use of any data structure 
    running_avgs = {}
    my_dir = '/home/usr/data_collection/'
    file_list = os.listdir(my_dir)
    #
    for file_name in file_list:
        # Protect us from sub directories
        # os.path.join will also be needed for open() if we're not
        #   currently within my_dir as a working directory
        # To remove the need for os.path.join we could also do
        #   os.chdir( my_dir )
        file_path = os.path.join(my_dir,file_name)
        if os.path.isfile(file_path):
            # Perform the above code on the file as long as it's valid.
            fh = open(file_path)
            lines = fh.readlines()
            fh.close()
            #
            for line in lines:
                line_data = line.strip().split()
                if len(line_data) == 2:
                    element = line_data[0]
                    value = line_data[1]
                    if running_avgs.get(element):
                        # IF it's in our dictionary we can calculate the new avg here
                        pass # This section depends on how you want to handle your avgs
                    else:
                        # The first element can be the running sum of values
                        # The second element will be the count of elements
                        # (float() because the values are read from the file as strings)
                        running_avgs[element] = [float(value),1]
            #
    print running_avgs

if __name__ == '__main__':
    main()
    raw_input('Press enter to exit...')

Again, I haven't tested this so it could contain a syntax error or two... but I hope it helps get you on your way.

Sorry, it's all a bit above me. Which variable holds the sum over all the files?

I wasn't sure what type of averaging you were looking to do, so I assumed something that sounds nothing like what you want. Basically, here's the template version:

    # Initialize your variables here for avg, sum, count, etc..
    # You may also declare what directory you're working with
    # Example my_dir = '/home/usr/data_collection'
    file_list = os.listdir(my_dir)
    #
    for file_name in file_list:
        # os.path.join will be needed for open() if we're not
        #   currently within my_dir as a working directory
        # To remove the need for os.path.join we could also do
        #   os.chdir( my_dir )
        file_path = os.path.join(my_dir,file_name)
        # example file_path = '/home/usr/data_collection/test1.txt'
        # To protect us from sub directories we'll check if file_path points to a regular file or a directory
        if os.path.isfile(file_path):
            # Open the file, read its contents into a list, and close it
            fh = open(file_path)
            lines = fh.readlines()
            fh.close()
            #
            # Iterate line-by-line
            for line in lines:
                # Split each line up by white space and strip off line separators (\n or \r\n)
                line_data = line.strip().split()
                # Check that we have two values per line to avoid error
                if len(line_data) == 2:
                    # Assign each segment of the line to a variable
                    element = line_data[0]
                    value = line_data[1]
                    # This is where you update your average, sum, count, etc...
                    # Again, this all depends on what you're looking to average
            #
    # Print the results here at the end

Please, try to understand each and every step of this. Don't be afraid to ask questions about a specific piece if you don't understand.

loop over all files(named output10000 - output10000000)
get all the values at all positions

For this example of data
2.825 1.00697992588
2.875 0.952989176901
2.925 0.91428970229
2.975 0.890110513425
If it is the first number, "2.825", that you want to use as the key with the second number being the values, I would suggest using a dictionary of lists with "2.825" as the key, and appending the "1.00697992588" to the list for that key. You can then sum each list and average. So, assuming that all of the data will fit into memory:

input_data = [ "2.825       1.00697992588", \
               "3.005       0.91428970", \
               "2.925       0.91428970229", \
               "2.825       0.952989176901", \
               "2.925       0.890110513425" ]

ave_dic = {}
for rec in input_data:
   key, value = rec.strip().split()
   if key not in ave_dic:
      ave_dic[key] = []
   ave_dic[key].append(float(value))  # convert so each list can be summed
for key in ave_dic:
   print key, ave_dic[key]

Finally, you may want to use Python's decimal module (decimal.Decimal) instead of float() if float is not accurate enough.
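
The summing and averaging step could look roughly like this. It is only a sketch: the sample values are made up (round numbers so the means are easy to check by eye), and `decimal_mean` is a hypothetical helper showing the Decimal variant, not something from the code above.

```python
from decimal import Decimal

# Made-up records in the same "key value" layout as the data files
input_data = ["2.825       1.00",
              "3.005       0.50",
              "2.925       0.25",
              "2.825       2.00",
              "2.925       0.75"]

# Collect every value under its key, converted to float
ave_dic = {}
for rec in input_data:
    key, value = rec.strip().split()
    ave_dic.setdefault(key, []).append(float(value))

# One mean per key, printed in numeric key order
averages = {}
for key in sorted(ave_dic, key=float):
    values = ave_dic[key]
    averages[key] = sum(values) / len(values)
    print("%s %s" % (key, averages[key]))


def decimal_mean(strings):
    """Mean of numeric strings using Decimal instead of float."""
    values = [Decimal(s) for s in strings]
    return sum(values) / len(values)
```

With Decimal the values are built straight from the text in the file, so no binary rounding is introduced when they are parsed.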

I think I now understand the parts that iterate over all the files in the directory. I'm still not sure how to calculate my averages and sums. I have about 1,000 data files, all the same length (101 rows of 2 columns). The first column remains the same, so I don't need to calculate the average of that. I want to calculate the mean of all the results in the right-hand column, so my output will be a list of 101 averages.

I understand I would put something in the if len(line_data)==2 block like:
count=count+1
sum = sum+line_data[1]
average=sum/count

But how would I go about doing it for every item in the column?

Can anyone help? I'm really at a loss with this.

Maybe this will work:

try:
  range = xrange
except NameError:
  pass

def filenames():
  for i in range(1, 1001):
    yield "output%d" % (i * 10000)

def average_list():
  result = [0.0] * 101
  cntfiles = 0
  for name in filenames():
    cntfiles += 1
    with open(name) as file_in:  # the file is closed when the block ends
      for index, line in enumerate(file_in):
        key, value = line.strip().split()
        result[index] += float(value)
  for i in range(len(result)):
    result[i] /= cntfiles
  return result

print(average_list())

That's fantastic! Thank you very much. All I needed to do was add the following to write the output as a nice list:

output = open('outputAvg', 'w')
for index in range(len(average_list())):
    output.write( "%s\n" % (average_list()[index]))
output.close()
print 'done!!'

You should call average_list() only once. Each time you call the function, you open and read all 1000 files!
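
One way to do that is to store the result in a variable and loop over the variable. This is only a sketch: `write_averages` is a hypothetical helper, the three numbers stand in for the real return value of average_list(), and the temporary directory is used so the example runs anywhere.

```python
import os
import tempfile


def write_averages(averages, path):
    """Write one average per line to path."""
    with open(path, 'w') as output:
        for value in averages:
            output.write("%s\n" % value)


# Stand-in data. In the real script this line would be
#     averages = average_list()
# so the 1000 files are read exactly once.
averages = [1.5, 0.5, 0.25]

out_path = os.path.join(tempfile.mkdtemp(), 'outputAvg')
write_averages(averages, out_path)

# Read the file back to check what was written
with open(out_path) as check:
    written = check.read().split()
print(written)
```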
