Good Afternoon:

I would like to have this JSON object written out to a CSV file so that the keys become the header fields (one per column) and the values become the values associated with each header field. Is there a best practice for this? Ideally, I would like to recursively iterate through the key/value pairs. Thank you in advance. I am using Python 3.4 on Windows; my editor is Sublime Text 2.

I originally parsed this from an XML file.

{
    "Fee": {
        "A": "5",
        "FEC": "1/1/0001 12:00:00 AM",
        "TE": null,
        "Locator": null,
        "Message": "Transfer Fee",
        "AT": null,
        "FT": null,
        "FR": "True",
        "FY": null,
        "FR": null,
        "FG": "0",
        "Comment": null,
        "FUD": null,
        "cID": null,
        "GEO": null,
        "ISO": null,
        "TRID": null,
        "XTY": "931083",
        "ANM": null,
        "NM": null
    },
    "CF": "Fee",
    "ID": "2"
}

The value "Fee", associated with the key "CF", should not be included as a column header.

The CSV file, when opened with an application such as MS Excel, should be as follows (for example):

(Column Header) ----> CF, A, FEC

(Field Value) ----> Fee, 5, 1/1/0001 12:00:00 AM

I have tried this:

import os
import json
import csv

def readAndWrite(inputFileName, primaryKey=""):
    input = open(inputFileName+".json")
    data = json.load(input)
    input.close()

    header = set()

    if primaryKey != "":
        outputFileName = inputFileName+"-"+primaryKey
        if inputFileName == "data":
            for i in data:
                for j in i["fields"].keys():
                    if j not in header:
                        header.add(j)
    else:
        outputFileName = inputFileName
        for i in data:
            for j in i.keys():
                if j not in header:
                    header.add(j)

    with open(outputFileName+".csv", 'wb') as output_file:
        fieldnames = list(header)
        writer = csv.DictWriter(output_file, fieldnames, delimiter=',', quotechar='"')
        writer.writeheader()
        for x in data:
            row_value = {}
            if primaryKey == "":
                for y in x.keys():
                    yValue = x.get(y)
                    if type(yValue) == int or type(yValue) == bool or type(yValue) == float or type(yValue) == list:
                        row_value[y] = str(yValue).encode('utf8')
                    elif type(yValue) != dict:
                        row_value[y] = yValue.encode('utf8')
                    else:
                        if inputFileName == "data":
                            row_value[y] = yValue["codename"].encode('utf8')
                            readAndWrite(inputFileName, primaryKey="codename")
                writer.writerow(row_value)
            elif primaryKey == "codename":
                for y in x["fields"].keys():
                    yValue = x["fields"].get(y)
                    if type(yValue) == int or type(yValue) == bool or type(yValue) == float or type(yValue) == list:
                        row_value[y] = str(yValue).encode('utf8')
                    elif type(yValue) != dict:
                        row_value[y] = yValue.encode('utf8')
                writer.writerow(row_value)

readAndWrite("data")

However, at line 24 ("for j in i.keys()") I get: AttributeError: 'str' object has no attribute 'keys'.

Any advice for addressing this? I am also open to other strategies.


I think you should definitely separate read and write. (As for the AttributeError: json.load returned a dict here, and iterating over a dict yields its keys, which are strings.) Start with a function that reads the json file and generates tuples such as

('Fee', 5, '1/1/0001 12:00:00 AM')

Use the yield statement to generate the tuples, and don't worry about the types; they can be post-processed. Your generator may generate the tuple

('CF', 'A', 'FEC')

at the beginning. Once this generator works, it will be very easy to write the csv file.
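For example, a rough sketch of such a generator (hard-coding the two-level shape of your posted file; the name gen_rows is just for illustration):

import json

def gen_rows(filename):
    # yield the header tuple first, then the matching tuple of values;
    # keys() and values() of the same dict align, so the columns line up
    with open(filename) as f:
        data = json.load(f)
    fee = data["Fee"]  # the nested subobject
    yield ("CF",) + tuple(fee.keys()) + ("ID",)                # header row
    yield (data["CF"],) + tuple(fee.values()) + (data["ID"],)  # value row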

@Gribouillis:

Here is another (far simpler) attempt that follows your suggestion:

import csv 
import json 
import sys 

def hook(obj):
    # object_pairs_hook delivers each JSON object as a list of (key, value)
    # tuples, which preserves order and keeps duplicate keys
    return obj

def flatten(obj):
    # walk the (key, value) pairs, recursing into nested objects
    # (nested objects are also lists of pairs, because of the hook)
    for k, v in obj:
        if isinstance(v, list):
            yield from flatten(v)
        else:
            yield k, v

if __name__ == "__main__": 
    with open("data.json") as f: 
        data = json.load(f, object_pairs_hook=hook) 

    pairs = list(flatten(data)) 

    writer = csv.writer(sys.stdout)
    writer.writerow([k for k, v in pairs])  # header row: the keys
    writer.writerow([v for k, v in pairs])  # value row

Is there a more efficient way to do this that is also scalable? Also, how would I use a generator here?

Can you explain what you mean by scalable here? Do you want to apply this to a file having a very large number of keys? Does it make sense to write a CSV with 2 rows and a large number of columns? I would rather write a CSV with 2 columns named KEY and VALUE, or perhaps I misunderstand what you mean.
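For that two-column layout, a minimal sketch (reusing the pairs list and the writer from your script) would be

writer.writerow(["KEY", "VALUE"])  # header
writer.writerows(pairs)            # one (key, value) pair per row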

For the two-row layout, instead of the two writerow calls you could write

writer.writerows(zip(*pairs))

By the way, your process may yield several columns having the same name if the subobjects repeat keys.
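If you want to keep only the first occurrence of each key, a small filtering generator could be inserted before writing (a sketch, untested against your data):

def first_occurrences(pairs):
    # keep only the first (key, value) pair seen for each key
    seen = set()
    for k, v in pairs:
        if k not in seen:
            seen.add(k)
            yield k, v

Then writer.writerows(zip(*first_occurrences(flatten(data)))) would keep the first FR column and drop the second.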

The keys may be added arbitrarily - hence scalable for greater data processing - so yes, there will need to be many keys. I would like the keys to be the column headers, and the associated values to form the row tuples in the hypothetical CSV structure. I want to avoid having repeated columns; there should be only one instance of each. When I ran this in the interpreter, I got the following:

A,FEC,TE,Locator,Message,AT,FT,FR,FY,FR,FG,Comment,FUD,cID,GEO,ISO,TRID,XTY,ANM,NM,CF,ID
5,1/1/0001 12:00:00 AM,,,Transfer Fee,,,True,,,0,,,,,,,931083,,,Fee,2

Where would you place the following?

writer.writerows(zip(*pairs))

I mean

import csv 
import json 
import sys 

def hook(obj): 
    return obj 

def flatten(obj): 
    for k, v in obj: 
        if isinstance(v, list): 
            yield from flatten(v) 
        else: 
            yield k, v 

if __name__ == "__main__": 
    with open("data.json") as f: 
        data = json.load(f, object_pairs_hook=hook) 

    writer = csv.writer(sys.stdout) 
    writer.writerows(zip(*flatten(data)))

Can you write an example json file with repeated keys in a subobject, and your expected output?

Hey Grib,
I was actually looking into json files a couple of days ago and I came across a problem that I really wanted to raise with you. The file that I was generating was bigger than my memory, so loading it all at once is basically a bad idea, especially since, once loaded, it actually occupies more memory than the size of the file on disk. Perhaps an example of the data would be a good idea: it was a general dictionary, each key having multiple values. Is there a way to handle such big files, say 10GB? For a normal text file, readline() reads a single line at a time, which is a good solution in a way, but I couldn't really find anything similar for the json format. Any ideas?

@slavi
Yes, you need an incremental json parser. By googling for a few minutes, I found ijson and its parent project yajl-py. There may be others.

@saran_1
Using an incremental json parser, you could parse the file twice, writing the first row on the first pass and the second row on the second, even if the file is very large. Alternatively, you could write one file per row in a single pass, then concatenate these files.
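In sketch form, assuming a generator such as the gen_pairs I show further down this thread, which yields (key, value) pairs from an open file (the name write_csv_two_pass is only for illustration):

import csv

def write_csv_two_pass(json_name, csv_name):
    # parse the json file twice so that only one parsing event is ever
    # held in memory: pass 1 writes the header row, pass 2 the value row
    with open(csv_name, 'w', newline='') as out:
        writer = csv.writer(out)
        with open(json_name, encoding='utf8') as f:
            writer.writerow(k for k, v in gen_pairs(f))
        with open(json_name, encoding='utf8') as f:
            writer.writerow(v for k, v in gen_pairs(f))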


@Gribouillis Is it possible for you to demonstrate this or provide an example? I am not familiar with this particular methodology. Thank you for your feedback.

Well, here is a basic example of incremental json parsing with the ijson module. I added a subobject foo inside Fee in your json file.

#!/usr/bin/env python3
# -*-coding: utf8-*-
'''demonstrates incremental json parsing
'''

import ijson
import io

ifh = io.open('data.json', encoding='utf8')
parser = ijson.parse(ifh)  # lazily yields (prefix, event, value) triples
for prefix, event, value in parser:
    print(prefix, event, value)

""" my output -->
('', 'start_map', None)
('', 'map_key', 'Fee')
('Fee', 'start_map', None)
('Fee', 'map_key', 'A')
('Fee.A', 'string', '5')
('Fee', 'map_key', 'FEC')
('Fee.FEC', 'string', '1/1/0001 12:00:00 AM')
('Fee', 'map_key', 'TE')
('Fee.TE', 'null', None)
('Fee', 'map_key', 'Locator')
('Fee.Locator', 'null', None)
('Fee', 'map_key', 'Message')
('Fee.Message', 'string', 'Transfer Fee')
('Fee', 'map_key', 'AT')
('Fee.AT', 'null', None)
('Fee', 'map_key', 'FT')
('Fee.FT', 'null', None)
('Fee', 'map_key', 'FR')
('Fee.FR', 'string', 'True')
('Fee', 'map_key', 'FY')
('Fee.FY', 'null', None)
('Fee', 'map_key', 'FR')
('Fee.FR', 'null', None)
('Fee', 'map_key', 'FG')
('Fee.FG', 'string', '0')
('Fee', 'map_key', 'Comment')
('Fee.Comment', 'null', None)
('Fee', 'map_key', 'FUD')
('Fee.FUD', 'null', None)
('Fee', 'map_key', 'cID')
('Fee.cID', 'null', None)
('Fee', 'map_key', 'GEO')
('Fee.GEO', 'null', None)
('Fee', 'map_key', 'foo')
('Fee.foo', 'start_map', None)
('Fee.foo', 'map_key', 'bar')
('Fee.foo.bar', 'string', 'baz')
('Fee.foo', 'map_key', 'qux')
('Fee.foo.qux', 'string', 'spam')
('Fee.foo', 'end_map', None)
('Fee', 'map_key', 'ISO')
('Fee.ISO', 'null', None)
('Fee', 'map_key', 'TRID')
('Fee.TRID', 'null', None)
('Fee', 'map_key', 'XTY')
('Fee.XTY', 'string', '931083')
('Fee', 'map_key', 'ANM')
('Fee.ANM', 'null', None)
('Fee', 'map_key', 'NM')
('Fee.NM', 'null', None)
('Fee', 'end_map', None)
('', 'map_key', 'CF')
('CF', 'string', 'Fee')
('', 'map_key', 'ID')
('ID', 'string', '2')
('', 'end_map', None)
"""

So there is a flow of parsing events generated by ijson.parse(). You can interpose code in the loop to generate the key/value pairs. The difference with the json module is that the input file is not loaded into memory all at once.

What's with map_key/end_map/null?

Of course you must do a bit of parsing. For example, here is how to generate the pairs in the above example (the code must be modified if the json file contains lists):

#!/usr/bin/env python3
# -*-coding: utf8-*-
'''demonstrates incremental json parsing
'''

import ijson
import io

values = {'null', 'boolean', 'number', 'string'}  # event types that carry a scalar value

class UnexpectedSyntax(RuntimeError):
    pass

def gen_pairs(jsonfile):
    # generate (key, value) pairs from the stream of ijson parsing events,
    # tracking nested maps with a simple depth counter
    parser = ijson.parse(jsonfile)
    p, e, v = next(parser)
    if e != 'start_map':
        raise UnexpectedSyntax
    map_depth = 1
    for p, e, v in parser:
        if e == 'end_map':
            map_depth -= 1
            if map_depth == 0:
                # the outermost map has closed, so the event stream must end here
                try:
                    next(parser)
                except StopIteration:
                    return
                raise UnexpectedSyntax('Expected end of json source after map')
        elif e == 'map_key':
            key = v
            p, e, v = next(parser)
            if e in values:
                yield key, v
            elif e == 'start_map':
                map_depth += 1
            else:
                raise UnexpectedSyntax
        else:
            raise UnexpectedSyntax
    if map_depth > 0:
        raise UnexpectedSyntax('Incomplete map in json source')

ifh = io.open('data.json', encoding='utf8')
for key, value in gen_pairs(ifh):
    print(key, value)

""" my output -->
A 5
FEC 1/1/0001 12:00:00 AM
TE None
Locator None
Message Transfer Fee
AT None
FT None
FR True
FY None
FR None
FG 0
Comment None
FUD None
cID None
GEO None
bar baz
qux spam
ISO None
TRID None
XTY 931083
ANM None
NM None
CF Fee
ID 2
"""

Ahhhh, I see!!! This works when you are trying to flatten a JSON object, and will map k to the key and v to the value. However, when the loop hits a list it throws an error. Lists do not have keys and values, just items.

How would I modify the above excerpt to include lists?
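One idea on my side would be to modify the earlier flatten from the json version (just a sketch; it gives each list item the parent's key, which may not be the right convention, and an empty object or array yields nothing):

def flatten(obj, key=None):
    # with object_pairs_hook, JSON objects arrive as lists of (key, value)
    # tuples, while real JSON arrays arrive as plain lists of values
    if isinstance(obj, list) and obj and isinstance(obj[0], tuple):
        for k, v in obj:  # a JSON object: recurse into each pair
            yield from flatten(v, k)
    elif isinstance(obj, list):  # a JSON array: items reuse the parent key
        for item in obj:
            yield from flatten(item, key)
    else:
        yield key, obj  # a scalar: emit the pair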
