Using Pandas to Merge/Concatenate multiple CSV files into one CSV file

Question

Saran_1 0 Junior Poster in Training

9 Years Ago

I have scoured Stack over flow and the Pandas documentation for a solution to this issue.

I am attempting to recursively move through a directory and concatenate all of the headers and their respective row values.

Below is what I have so far after much experimentation with other libraries:

import pandas as pd
import csv
import glob
import os

path = '.'
files_in_dir = [f for f in os.listdir(path) if f.endswith('csv')]

for filenames in files_in_dir:
    df = pd.read_csv(filenames)
    df.to_csv('out.csv', mode='a')

However all of the headers and their corresponding values are stacked upon each other. In addition, the files' headers and their corresponding value repeat twice (something to do with the for loop). My constraints are:

Writing out the headers and their corresponding values (without "stacking") - essentially concatenated one after the other
If the column headers in one file match another files then their should be no repetition. Only the values should be appended as they are written to the one CSV file.
Since each file has different column headers and different number of column headers these should all be added sequentially during processing. Nothing should be deleted.

I am wondering if the best method is to merge, concatenate or perform another method using pandas? Thanks.

file-system python

2 Contributors
3 Replies
18K Views
4 Years Discussion Span
Latest Post 5 Years Ago Latest Post by m_26

Reply to this topic

Be a part of the DaniWeb community

We're a friendly, industry-focused community of developers, IT pros, digital marketers, and technology enthusiasts meeting, networking, learning, and sharing knowledge.

Saran_1 0 Junior Poster in Training · Answer 1 · 2015-07-10T11:23:48+00:00

Just as an update, I have resorted to a non-Pandas approach.

from glob import iglob
import csv
from collections import OrderedDict

files = sorted(iglob('*.csv'))
data = OrderedDict()
for filename in files:
    with open(filename, 'r') as fin:
        csvin = csv.DictReader(fin)
        data.update((colname, data.get(colname, '')) for colname in csvin.fieldnames)
        data.update(next(csvin, {}))

with open('output_filename.csv', 'w') as fout:
    csvout = csv.DictWriter(fout, fieldnames=list(data))
    csvout.writeheader()
    csvout.writerow(data)

The caveat is that all of the keys/column names that repeat in other csv files (and have different corresponding row values) are not appended sequentially. For example, in the above two samples, there are two different values for the column header "Type": UMember and Query. Ideally Query should be appended after UMember. If there are other similar column headers, then their values should follow the same rule. The other requirements are still controlled for.

Any suggestions for how to works around this?Thank you again for your time.

Saran_1 0 Junior Poster in Training · Answer 2 · 2015-07-10T19:58:48+00:00

For those of you who are interested, here is my solution:

from collections import OrderedDict
import os
from glob import iglob
import csv

files = sorted(iglob('*.csv'))
header = OrderedDict()
data = []

for filename in files:

    with open(filename, 'r') as fin:
        csvin = csv.DictReader(fin)
        header.update(OrderedDict.fromkeys(csvin.fieldnames))
        data.append(next(csvin))

with open('output_filename_version2.csv', 'w', newline='') as fout:
    csvout = csv.DictWriter(fout, fieldnames=list(header))
    csvout.writeheader()
    csvout.writerows(data)

m_26 0 Newbie Poster · Answer 3 · 2019-12-06T02:33:21+00:00

Hi sran
I have two CSV files.file1 having header with only one record and file 2 having header with no of records
below is my sample data
file1:
dept_id dept_role dept_name deptdate dept service
10 man develop 10/5/2019 13:24 user

file2:
id name starttime end time user manufacturing 5487 abc 10/5/2019 14:24 10/5/2019 17:24 loc king
5896 egdt 10/5/2019 14:24 10/5/2019 17:24 rgf king
6305 abc 10/5/2019 14:24 10/5/2019 17:24 srf king
6714 egdt 10/5/2019 14:24 10/5/2019 17:24 sd king
7123 abc 10/5/2019 14:24 10/5/2019 17:24 loc king
7532 egdt 10/5/2019 14:24 10/5/2019 17:24 rgf king
7941 abc 10/5/2019 14:24 10/5/2019 17:24 srf king

expected output like below:

dept_id dept_role dept_name deptdate dept service id name starttime end time user manufacturing 10 man develop 10/5/2019 13:24 user 5487 abc 10/5/2019 14:24 10/5/2019 17:24 loc king
10 man develop 10/5/2019 13:24 user 5896 egdt 10/5/2019 14:24 10/5/2019 17:24 rgf king
10 man develop 10/5/2019 13:24 user 6305 abc 10/5/2019 14:24 10/5/2019 17:24 srf king
10 man develop 10/5/2019 13:24 user 6714 egdt 10/5/2019 14:24 10/5/2019 17:24 sd king
10 man develop 10/5/2019 13:24 user 7123 abc 10/5/2019 14:24 10/5/2019 17:24 loc king
10 man develop 10/5/2019 13:24 user 7532 egdt 10/5/2019 14:24 10/5/2019 17:24 rgf king
10 man develop 10/5/2019 13:24 user 7941 abc 10/5/2019 14:24 10/5/2019 17:24 srf king

when i run the below code am getting wrongly populated columns file one last column and file2 last column is merging.
her i have addedd my script:
import os, sys, csv
from itertools import product
def main():
with open('file1.csv', 'r') as f1, open('file2.csv', 'r') as f2:
reader1 = csv.reader(f1, dialect=csv.excel_tab)
reader2 = csv.reader(f2, dialect=csv.excel_tab)
header1, header2 = next(reader1),next(reader2)
with open('filen.csv', 'w',newline='\n') as out:
writer = csv.writer(out, dialect=csv.excel_tab)
writer.writerow(header1+ header2)
writer.writerows(row1 + row2 for row1, row2 in product(reader1,reader2))
main()
print('file merge completed')

am getting wrogly populated columns , could you please suggest the fixing or better code.