I have scoured Stack over flow and the Pandas documentation for a solution to this issue.

I am attempting to recursively move through a directory and concatenate all of the headers and their respective row values.

Below is what I have so far after much experimentation with other libraries:

import pandas as pd
import csv
import glob
import os

path = '.'
files_in_dir = [f for f in os.listdir(path) if f.endswith('csv')]

for filenames in files_in_dir:
    df = pd.read_csv(filenames)
    df.to_csv('out.csv', mode='a')

However all of the headers and their corresponding values are stacked upon each other. In addition, the files' headers and their corresponding value repeat twice (something to do with the for loop). My constraints are:

  1. Writing out the headers and their corresponding values (without "stacking") - essentially concatenated one after the other

  2. If the column headers in one file match another files then their should be no repetition. Only the values should be appended as they are written to the one CSV file.

  3. Since each file has different column headers and different number of column headers these should all be added sequentially during processing. Nothing should be deleted.

I am wondering if the best method is to merge, concatenate or perform another method using pandas? Thanks.

Just as an update, I have resorted to a non-Pandas approach.

from glob import iglob
import csv
from collections import OrderedDict

files = sorted(iglob('*.csv'))
data = OrderedDict()
for filename in files:
    with open(filename, 'r') as fin:
        csvin = csv.DictReader(fin)
        data.update((colname, data.get(colname, '')) for colname in csvin.fieldnames)
        data.update(next(csvin, {}))

with open('output_filename.csv', 'w') as fout:
    csvout = csv.DictWriter(fout, fieldnames=list(data))
    csvout.writeheader()
    csvout.writerow(data)

The caveat is that all of the keys/column names that repeat in other csv files (and have different corresponding row values) are not appended sequentially. For example, in the above two samples, there are two different values for the column header "Type": UMember and Query. Ideally Query should be appended after UMember. If there are other similar column headers, then their values should follow the same rule. The other requirements are still controlled for.

Any suggestions for how to works around this?Thank you again for your time.

For those of you who are interested, here is my solution:

from collections import OrderedDict
import os
from glob import iglob
import csv

files = sorted(iglob('*.csv'))
header = OrderedDict()
data = []

for filename in files:

    with open(filename, 'r') as fin:
        csvin = csv.DictReader(fin)
        header.update(OrderedDict.fromkeys(csvin.fieldnames))
        data.append(next(csvin))

with open('output_filename_version2.csv', 'w', newline='') as fout:
    csvout = csv.DictWriter(fout, fieldnames=list(header))
    csvout.writeheader()
    csvout.writerows(data)

Edited 1 Year Ago by Saran_1

This question has already been answered. Start a new discussion instead.