Processing multiple csv files with python

Question

jamison234 0 Newbie Poster

11 Years Ago

Hi All,

I am a newbie programmer in python and I am trying to read multiple csv files from a folder, perform some data sums based on timestamps for each row for all the csv files and then output the sums to one result csv file. So far I am stuck at the beginning. Please can someone point out some tips?

Thanks

csv python

4 Contributors
9 Replies
4K Views
2 Weeks Discussion Span
Latest Post 11 Years Ago Latest Post by hughesadam_87

All 9 Replies

jajarvin 0 Newbie Poster

11 Years Ago

##Maybe someting like this:
## CSV-data is for example file test_1.csv
## a line in file consist of  timestamp and data
## or
## 04.11.2012,40
##
## There are two files: test_1.csv and test_2.csv
## The lines in file rest_1.csv are:
## 12.10.2012,4500
## 04.11.2012,1000
## 29.01.2012,2500
## 04.11.2012,4000
##
## The lines in file rest_2.csv are:
## 12.10.2012,4500
## 04.11.2012,5000
## 09.02.2012,4500
## 12.04.2012,4000
##
##===>
##===> This calculates sum of day's (04.11.2012) data and saves this sum in a file
##===>
## There are two ways to run  this code:
## "filedialog" - method --> user is promted to give files
## "files are given in this code" - method
def daniweb_test():
    method = input("Which method: dialog(d) or file included(f)  ")
    #### Method: filedialog
    if(method == 'd'):
        print("Method: filedialog")
        import tkinter.filedialog
        file_count = 0
        ## Give the number of files
        file_count = input("How many files you have  ")
        file_count = int(file_count)
        print("=====> ", file_count)
        sum = 0
        while file_count > 0:
            al_filename = tkinter.filedialog.askopenfilename()
            file = open(al_filename, "r")
            file_sum = 0
            for line in file:
                comma_pos = line.find(",")
                time_stamp = line[0:comma_pos]
                data  = line[comma_pos + 1:].rstrip("\n")
                if(time_stamp[0:2] == '04'):
                    print("4th Day - ", time_stamp, data)
                    file_sum = file_sum + int(data)
            file_count = file_count - 1
            file.close()
            sum = sum + file_sum
            print("Sum in file ",al_filename, " is ", sum)
        print("\nsum is ", sum)

    #### Method: files included in script
    if(method == 'f'):
        print("Method: files are given in this  code")
        ## files are in a list
        file_lst = ['test_1.csv', 'test_2.csv']
        sum = 0 
        for i in range(0, len(file_lst)):
            file_name = file_lst[i];
            file = open(file_name, "r")
            for line in file:
                comma_pos = int(line.find(","))
                time_stamp = line[0:comma_pos]
                data  = line[comma_pos + 1:].rstrip("\n")
                if(time_stamp[0:2] == '04'):
                    print("4th Day - ", time_stamp, data)
                    sum = sum + int(data)
            file.close()
        print("Sum is  ", sum)

    ## Save today's sum in a file
    file_name = "save_sum.txt"
    fn = open(file_name, "w")
    fn.write(str(sum))
    fn.close()
    print("\n===========> Sum saved in file save_sum.txt <============")

rrashkin 41 Junior Poster in Training

11 Years Ago

A dictionary isn't really a natural implementation of a database. I suppose you could have each row be it's own dictionary such that row1={key1:value1, key2:value2,...} and likewise for all rows. Then you'd get the value in column3, row3 from row3[key3] or something like that. I think a better implementation is just to use a list of lists where the first element is a list of column headings.

hughesadam_87 54 Junior Poster

11 Years Ago

Your best bet is to load all the datafiles into a pandas dataframe. This is the de-facto structure for handling labeled array data, and also has a ton of utilities for data manipulation ESPECIALLY timestamp.

post an example of one of your files, maybe the first 30 lines, including header, and I can help you get it into a pandas dataframe.

http://pandas.pydata.org/

Reply to this topic

Be a part of the DaniWeb community

We're a friendly, industry-focused community of developers, IT pros, digital marketers, and technology enthusiasts meeting, networking, learning, and sharing knowledge.

jamison234 0 Newbie Poster · Answer 1 · 2012-11-07T21:09:00+00:00

Thank you very much jajarvin, I am working on this now and would get back with some questions.

Thanks

jamison234 0 Newbie Poster · Answer 2 · 2012-11-08T15:29:32+00:00

Hi Jajarvin, thanks for the bit of code. It's helped me greatly, however I am progressing quite slow at the moment.

I have an excel spreadsheet with loads of keys (40) across columns and corresponding data going down along the columns. This starts from date to other keys. I want to be able to store the data for key and corresponding values in a dictionary. I will then use the timestamp from the date column as a check for summing up various values in different columns for the respective time. I would do this for several files with a years worth of data.

I would really appreciate some extra guidance here.

Thanks

jamison234 0 Newbie Poster · Answer 3 · 2012-11-10T14:06:51+00:00

The lines in the file are:

Date Time Value1 Value2 Value3 Value4 Value5 Value6 Value7 Value8 ...Value40
10/11/2012 12:11 237.8 239.7 239 12.5 29.9 18.4 4401 0.99 290
10/11/2012 12:12 237.8 239.7 239 12.5 29.9 18.4 4401 0.99 -290
10/11/2012 12:14 237.8 239.7 239 12.5 29.9 18.4 4401 0.99 -290
10/11/2012 12:15 237.8 239.7 239 12.5 29.9 18.4 4401 0.99 -290
10/11/2012 12:16 237.8 239.7 239 12.5 29.9 18.4 4401 0.99 -290
10/11/2012 12:17 237.8 239.7 239 12.5 29.9 18.4 4401 0.99 -290
10/11/2012 12:18 237.8 239.7 239 12.5 29.9 18.4 4401 0.99 -290
10/11/2012 12:19 237.8 239.7 239 12.5 29.9 18.4 4401 0.99 -290
10/11/2012 12:20 237.8 239.7 239 12.5 29.9 18.4 4401 0.99 -290
10/11/2012 12:21 237.8 239.7 239 12.5 29.9 18.4 4401 0.99 -290

etc up to 1 million rows (years worth of data) in the spread sheet. I am going to perform some calculations on some of the columns based on date and time and perform the same for successive spreadsheets. All the spreadsheets will then be summed together based on date and time and the respective volumn, e.g add value 1 for all 40 spreadsheets, etc. Also some rows have values missen. If so then the days value will be used.

jamison234 0 Newbie Poster · Answer 4 · 2012-11-12T23:41:05+00:00

Thanks guys. I would really appreciate some more feedback on this. I have read all the data into the pandas database but I am unsure on how to perform calculations on each individual file in the database (column calculations based on date and timestamp) and sum up all the solutions based on timestamps into one big csv file.

jamison234 0 Newbie Poster · Answer 5 · 2012-11-19T15:09:19+00:00

jamison234 0 Newbie Poster

11 Years Ago

Thanks guys I managed to sort this out

hughesadam_87 54 Junior Poster · Answer 6 · 2012-11-19T15:55:56+00:00

Good to hear. Perhaps mark this thread as solved? If you have further pandas issues, I'd recommend posting to the pandas mailing list, which is very active and extremely helpful.

Processing multiple csv files with python

Recommended Answers Collapse Answers

All 9 Replies

Recommended Answers