Reassigning keys in dictionary of lists and then writing out to CSV file?

Question

Saran_1 0 Junior Poster in Training

10 Years Ago

I am currently attempting to work on converting a fairly sizeable JSON object and convert it into a CSV format. However, when I attempt to do so, using a conventional approach (that seems to work with other files). I am presented with a "ValueError: too many values to unpack"

I have tried to flatten the JSON object using this function:

# def flatten(d, parent_key=''): 
#     items = [] 
#     for k, v in d.items(): 
#         try: 
#             items.extend(flatten(v, '%s%s_' % (parent_key, k)).items()) 
#         except AttributeError: 
#             items.append(('%s%s' % (parent_key, k), v)) 
#     return dict(items)

However this concatenates the keys. I am now attempting to remove the nested key and reassigning to the outer key.

Here are some samples. I would remove Int32, Double and DateTime. I am wondering if there is a function that would then allow me to assign the new keys as column headers in a CSV and concatenate all of the values within the list (as corresponding fields). I hope that I was clear in my description. Thank you all for your help.

"FC": {"Int32": ["0","0","0","0","0","0"]}

and

"PBA": {"Double": ["0","0","0","0","0","0","0","0"]}

and such examples:

"PBDD": { "DateTime": ["1/1/0001 12:00:00 AM", "1/1/0001 12:00:00 AM","1/1/0001 12:00:00 AM","1/1/0001 12:00:00 AM","1/1/0001 12:00:00 AM", "1/1/0001 12:00:00 AM", "1/1/0001 12:00:00 AM", "1/1/0001 12:00:00 AM"] },

json python

2 Contributors
68 Replies
2K Views
6 Days Discussion Span
Latest Post 10 Years Ago Latest Post by Saran_1

All 68 Replies

Gribouillis 1,391 Programming Explorer

10 Years Ago

Here is the kind of code that you can try

import csv 
import json 
import sys 

def shrink(v):
    while True:
        if isinstance(v, list):
            if len(v) == 1:
                v = v[0]
            elif v:
                assert all(isinstance(x, str) for x in v)
                v = ''.join(v)
            else:
                raise ValueError('Empty list')
        elif isinstance(v, dict) and len(v) == 1:
            v = next(iter(v.values()))
        else:
            return v

def flatten(obj):
    assert isinstance(obj, dict)
    for k, v in obj.items(): 
        v = shrink(v)
        if isinstance(v, dict):
            yield from flatten(v)
        else:
            yield k, v

if __name__ == "__main__": 
    with open("data2.json") as f: 
        data = json.load(f) 

    pairs = list(flatten(data))
    print(pairs)

    writer = csv.writer(sys.stdout) 
    header = writer.writerow([k for k, v in pairs]) 
    row = writer.writerow([v for k, v in pairs])

The idea is to shrink the values with the following rules: if a value is a list with a single element, it is replaced by this element (shrunk). If the value is a list with more than 1 element, it is assumed that all the elements are strings, and they are concatenated and the result replaces the value. If the value is a dict with a single key, it is replaced by the value of this single item.

My output with the modified json file above is

[('DLA', '0'), ('FC', '00000'), ('PC', '0'), ('WC', '0'), ('CN', None), ('Description', None), ('Code', '0'), ('CMC', '0')]
DLA,FC,PC,WC,CN,Description,Code,CMC
0,00000,0,0,,,0,0

Edited 10 Years Ago by Gribouillis

Gribouillis 1,391 Programming Explorer

10 Years Ago

Yes put the snippet in a file named postprocess.py then write

from postprocess import post_process

Gribouillis 1,391 Programming Explorer

10 Years Ago

Here is a variant which handles lists differently by enumerating the list items and removing their common tag. It works for MonthDayCount in your example, but I'm not sure it will work the way you want for all the lists in your files.

import xml.etree.cElementTree as ElementTree 
from xml.etree.ElementTree import XMLParser

def flatten_list(aList, prefix=''):
    for i, element in enumerate(aList, 1):
        eprefix = "{}{}".format(prefix, i)
        if element:
            # treat like dict 
            if len(element) == 1 or element[0].tag != element[1].tag: 
                yield from flatten_dict(element, eprefix+'.')
            # treat like list 
            elif element[0].tag == element[1].tag: 
                yield from flatten_list(element, eprefix+'.')
        elif element.text: 
            text = element.text.strip() 
            if text: 
                yield eprefix, text


def flatten_dict(parent_element, prefix=''):
    prefix = prefix + parent_element.tag + '.'
    if parent_element.items():
        for k, v in parent_element.items():
            yield prefix + k, v
    for element in parent_element:
        eprefix = prefix + element.tag + '.'
        if element:
            # treat like dict - we assume that if the first two tags 
            # in a series are different, then they are all different. 
            if len(element) == 1 or element[0].tag != element[1].tag: 
                yield from flatten_dict(element, prefix=prefix)
            # treat like list - we assume that if the first two tags 
            # in a series are the same, then the rest are the same. 
            else: 
                # here, we put the list in dictionary; the key is the 
                # tag name the list elements all share in common, and 
                # the value is the list itself
                yield from flatten_list(element, prefix=eprefix)
            # if the tag has attributes, add those to the dict
            if element.items():
                for k, v in element.items():
                    yield eprefix+k, v 
        # this assumes that if you've got an attribute in a tag, 
        # you won't be having any text. This may or may not be a 
        # good idea -- time will tell. It works for the way we are 
        # currently doing XML configuration files... 
        elif element.items(): 
            for k, v in element.items():
                yield eprefix+k, v
        # finally, if there are no child tags and no attributes, extract 
        # the text 
        else:
            yield eprefix, element.text


def main():
    with open('source.xml', 'r', encoding='utf-8') as f: 
        xml_string = f.read() 
    xml_string= xml_string.replace('&#x0;', '') #optional to remove ampersands. 
    root = ElementTree.XML(xml_string) 
    for item in flatten_dict(root):
        print(item)

if __name__ == "__main__":
    main()

Reply to this topic

Be a part of the DaniWeb community

We're a friendly, industry-focused community of developers, IT pros, digital marketers, and technology enthusiasts meeting, networking, learning, and sharing knowledge.

Gribouillis 1,391 Programming Explorer Team Colleague · Answer 1 · 2015-06-19T05:53:53+00:00

The description is not clear at all. We need a minimal working example: a small json object, and a short piece of code which raises the described exception (ValueError ...). You can add the expected result.

Saran_1 0 Junior Poster in Training · Answer 2 · 2015-06-19T13:41:43+00:00

Here is an example snapshot JSON sample where my code fails:

"PAC": {
            "Account": [{
                "PC": "0",
                "CMC": "0",
                "WC": "0",
                "DLA": "0",
                "CN": null,
                "FC": {
                    "Int32": ["0",
                    "0",
                    "0",
                    "0",
                    "0"]
                },
                "F": {
                    "Description": null,
                    "Code": "0"
                }

Here is the original function that I have attempted to utilize that works fairly well with other JSON objects in files:

import csv 
import json 
import sys 

def hook(obj): 
    return obj 

def flatten(obj): 
    for k, v in obj: 
        if isinstance(v, list): 
            yield from flatten(v) 
        else: 
            yield k, v 

if __name__ == "__main__": 
    with open("somefileneame.json") as f: 
        data = json.load(f, object_pairs_hook=hook) 

    pairs = list(flatten(data)) 

    writer = csv.writer(sys.stdout) 
    header = writer.writerow([k for k, v in pairs]) 
    row = writer.writerow([v for k, v in pairs]) #writer.writerows for any other iterable object

Here is the XML to JSON parser that originally produced the above JSON (I am not certain if this is helpful):

import xml.etree.cElementTree as ElementTree
from xml.etree.ElementTree import XMLParser
import json 
import csv
import tokenize
import token
try:
    from collections import OrderedDict
    import json
except ImportError:
    from ordereddict import OrderedDict
    import simplejson as json
import itertools
import six
import string
from csvkit import CSVKitWriter


class XmlListConfig(list):
    def __init__(self, aList):
        for element in aList:
            if element:
                # treat like dict
                if len(element) == 1 or element[0].tag != element[1].tag:
                    self.append(XmlDictConfig(element))
                # treat like list
                elif element[0].tag == element[1].tag:
                    self.append(XmlListConfig(element))
            elif element.text:
                text = element.text.strip()
                if text:
                    self.append(text)


class XmlDictConfig(dict):
    '''
    Example usage:

    >>> tree = ElementTree.parse('your_file.xml')
    >>> root = tree.getroot()
    >>> xmldict = XmlDictConfig(root)

    Or, if you want to use an XML string:

    >>> root = ElementTree.XML(xml_string)
    >>> xmldict = XmlDictConfig(root)

    And then use xmldict for what it is..a dictionary.
    '''
    def __init__(self, parent_element):
        if parent_element.items():
            self.update(dict(parent_element.items()))
        for element in parent_element:
            if element:
                # treat like dict - we assume that if the first two tags
                # in a series are different, then they are all different.
                if len(element) == 1 or element[0].tag != element[1].tag:
                    aDict = XmlDictConfig(element)
                # treat like list - we assume that if the first two tags
                # in a series are the same, then the rest are the same.
                else:
                    # here, we put the list in dictionary; the key is the
                    # tag name the list elements all share in common, and
                    # the value is the list itself 
                    aDict = {element[0].tag: XmlListConfig(element)}
                # if the tag has attributes, add those to the dict
                if element.items():
                    aDict.update(dict(element.items()))
                self.update({element.tag: aDict})
            # this assumes that if you've got an attribute in a tag,
            # you won't be having any text. This may or may not be a 
            # good idea -- time will tell. It works for the way we are
            # currently doing XML configuration files...
            elif element.items():
                self.update({element.tag: dict(element.items())})
            # finally, if there are no child tags and no attributes, extract
            # the text
            else:
                self.update({element.tag: element.text})



def main():

    #Lines 88-89stantiate the class Elementree
    #and applies the method to recursively traverse from the root node
    #XmlDictConfig is instantiated in line 90

    with open('C:\\Users\\wynsa2\\Desktop\\Python Folder\\PCSU\\Trial2_PCSU\\2-Response.xml', 'r', encoding='utf-8') as f:
        xml_string = f.read()
    xml_string= xml_string.replace('&#x0;', '')
    root = ElementTree.XML(xml_string)
    xmldict = XmlDictConfig(root)
    json_str = json.dumps(xmldict, sort_keys=True, indent=4, separators=(',', ': '))
    newly_formatted_data = json.loads(json_str) #encode into JSON 
    with open('data2.json', 'w') as f:  #writing JSON file
        json.dump(newly_formatted_data, f)

Gribouillis 1,391 Programming Explorer Team Colleague · Answer 3 · 2015-06-19T14:09:17+00:00

Ok, it looks easy. The ValueError is thrown by the json parser (function json.load()) because your json file is an invalid json file. It contains a pair key/value outside of an object. If I paste your json code in http://jsonlint.com for example, the parser says that a { or a [ is expected at the beginning. Also there are missing closing delimiters at the end

The json grammar can be seen at http://json.org/

Here is a modified valid json file

{
"PAC": {
            "Account": [{
                "PC": "0",
                "CMC": "0",
                "WC": "0",
                "DLA": "0",
                "CN": null,
                "FC": {
                    "Int32": ["0",
                    "0",
                    "0",
                    "0",
                    "0"]
                },
                "F": {
                    "Description": null,
                    "Code": "0"
                }
}]
}
}

Unfortunately, this is not sufficient because your code is not ready to handle arrays in json data (I mean [...]). For this, you should probably try to call load() without the object hook and change the flatten() generator.

Gribouillis 1,391 Programming Explorer Team Colleague · Answer 4 · 2015-06-19T15:02:44+00:00

If you want to adapt the flatten function to the case where the json code contains arrays, you must first describe which key/value pairs you expect as output from the above json file and why. This question is independent from python. It can be coded only if it can be described precisely.

Saran_1 0 Junior Poster in Training · Answer 5 · 2015-06-19T15:33:54+00:00

The rationale for the following: the out put will need to be a CSV. All of the keys should be headers in the CSV. The values are all corresponding with the headers. However, Int32 should be deleted and the key reassigned to FC. Account should be deleted and reassigned to PAC.

Essentially, where all of the instances where you have a key with a list of dictionary, that outer key should be deleted and reassigned to the outermost key.

Finally (in this example) the "0"s should be all concatenated in the output CSV so accessing the lists will need to be addressed first. I know that the following:

def hook(obj): 
    return obj 
def flatten(obj): 
    for k, v in obj: 
        if isinstance(v, list): 
            yield from flatten(v) 
        else: 
            yield k, v 
if __name__ == "__main__": 
    with open("somefileneame.json") as f: 
        data = json.load(f, object_pairs_hook=hook) 
    pairs = list(flatten(data)) 
    writer = csv.writer(sys.stdout) 
    header = writer.writerow([k for k, v in pairs]) 
    row = writer.writerow([v for k, v in pairs]) #writer.writerows for any other iterable object

is not the best strategy. The list objects cannot be accessed (hence, the "ValueError: too many values to unpack") is caught at the function json.loads(). I am wondering how to add the exception when 1)when one encounters the list and 2) how to delete and rasssign the keys (as described above).

Saran_1 0 Junior Poster in Training · Answer 6 · 2015-06-19T19:13:09+00:00

Ahhhh I see: I also worked on this and found that if I approached the problem of just merging the keys associated with the values of lists with the outer keys with this:

def flatten(d, parent_key=''):
    items = []
    for k, v in d.items():
        try:
            items.extend(flatten(v, '%s%s_' % (parent_key, k)).items())
        except AttributeError:
            items.append(('%s%s' % (parent_key, k), v))
    return dict(items)

final = (flatten(data2, parent_key =''))

With this sample JSON:

data2 = {

    "OTF": "0",
    "F": "False",
    "F": {
        "Int32": ["0",
        "0",
        "0",
        "0",
        "0",
        "0",
        "0",
        "0",
        "0",
        "0",
        "0",
        "0",
        "0",
        "0",
        "0",
        "0",
        "0",
        "0",
        "0",
        "0",
        "0",
        "0",
        "0",
        "0",
        "0"]
    },
                "D": {
                    "B": ["0",
                    "0",
                    "0",
                    "0",
                    "0",
                    "0",
                    "0",
                    "0",
                    "0",
                    "0",
                    "0"]
                },

                "PBDS": {
                    "DateTime": ["1/1/0001 12:00:00 AM",
                    "1/1/0001 12:00:00 AM",
                    "1/1/0001 12:00:00 AM",
                    "1/1/0001 12:00:00 AM",
                    "1/1/0001 12:00:00 AM",
                    "1/1/0001 12:00:00 AM",
                    "1/1/0001 12:00:00 AM",
                    "1/1/0001 12:00:00 AM"]
                },

                "PBDS": {
                    "Double": ["0",
                    "0",
                    "0",
                    "0",
                    "0",
                    "0",
                    "0",
                    "0"]
                },

                "SCS": {
                    "String": ["1",
                    "2"]
                }

            }

I get the following:

{'D_B': ['0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0'],
 'F_Int32': ['0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0'],
 'OTF': '0',
 'PBDS_Double': ['0', '0', '0', '0', '0', '0', '0', '0'],
 'SCS_String': ['1', '2']}

Interestingly, with your function on other files I get the following error:

AssertionError                            Traceback (most recent call last)
<ipython-input-163-f3412d9ed512> in <module>()
     28     with open("data2.json") as f:
     29         data = json.load(f)
---> 30     pairs = list(flatten(data))
     31     print(pairs)
     32     writer = csv.writer(sys.stdout)

<ipython-input-163-f3412d9ed512> in flatten(obj)
     22         v = shrink(v)
     23         if isinstance(v, dict):
---> 24             yield from flatten(v)
     25         else:
     26             yield k, v

<ipython-input-163-f3412d9ed512> in flatten(obj)
     22         v = shrink(v)
     23         if isinstance(v, dict):
---> 24             yield from flatten(v)
     25         else:
     26             yield k, v

<ipython-input-163-f3412d9ed512> in flatten(obj)
     20     assert isinstance(obj, dict)
     21     for k, v in obj.items():
---> 22         v = shrink(v)
     23         if isinstance(v, dict):
     24             yield from flatten(v)

<ipython-input-163-f3412d9ed512> in shrink(v)
      9                 v = v[0]
     10             elif v:
---> 11                 assert all(isinstance(x, str) for x in v)
     12                 v = ''.join(v)
     13             else:

AssertionError:

Thoughts?

One final comment to make: Is it possible to write a function that applies to each key such that if the key is associated with a value, the key is replicated n number of times for n number of items in its associated list.

Gribouillis 1,391 Programming Explorer Team Colleague · Answer 7 · 2015-06-19T19:24:19+00:00

Interestingly, with your function on other files I get the following error

I made the assumption that if the json file contains arrays with more than one item, all these items are strings. It fails in the files that you tried. Only strings can be concatenated. You must define the list of pairs that must be generated in these cases.

One final comment to make: Is it possible to write a function that applies to each key such that if the key is associated with a value, the key is replicated n number of times for n number of items in its associated list.

Again, this must be clarified with a json example and the complete list of pairs key/value that you want to produce with this json data.

Saran_1 0 Junior Poster in Training · Answer 8 · 2015-06-19T19:31:30+00:00

Ah, I see:

Let's use this case example:

{'D_B': ['0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0'],
 'F_Int32': ['0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0'],
 'OTF': '0',
 'PBDS_Double': ['0', '0', '0', '0', '0', '0', '0', '0'],
 'SCS_String': ['1', '2']}

There would be an output header DB1: 0, DB2: 0, DB3: 0 etc. and F1: 0, F1: 0. Ideally this could be then used as a header and column format (maybe a list)?

Gribouillis 1,391 Programming Explorer Team Colleague · Answer 9 · 2015-06-19T20:07:47+00:00

the key is replicated n number of times for n number of items in its associated list

Every output is possible, only the rules which govern the production of items must be very carefuly defined. If you want DB1, DB2, etc, you must explain by which rule D_B becomes DB1, DB2 etc. In the same way, which rule transforms F_Int32 into F1, F2, etc.

Python can implement any precise rule that you define, but it cannot define the transformation rules for you.

Saran_1 0 Junior Poster in Training · Answer 10 · 2015-06-19T20:17:47+00:00

Sorry, I meant D_B, not DB. The main goal is to replicate the keys along with the number and then associate this with the number of items in the array that is its value.

Gribouillis 1,391 Programming Explorer Team Colleague · Answer 11 · 2015-06-19T20:23:03+00:00

Here is a simple transformation snippet

>>> key = 'spam'
>>> L = ['foo', 'bar', 'baz']
>>> [('{}{}'.format(key, i), value) for i, value in enumerate(L, 1)]
[('spam1', 'foo'), ('spam2', 'bar'), ('spam3', 'baz')]

Saran_1 0 Junior Poster in Training · Answer 12 · 2015-06-19T20:38:42+00:00

Would I be able to insert this into the following function:

def mapping(dict, key, list_in_dict):
    for k, v in dict.items():
        if isinstance(v, list):
        assignment = [('{}{}'.format(key, i), value) for i, value in enumerate(list, 1)]
        return assignment

I am not sure how I would structure this into my oringal function....

I also just want to express how much I appreciate your guidance and support. As a novice to programming, I have learned much from you. Thank you.

Saran_1 0 Junior Poster in Training · Answer 13 · 2015-06-19T20:39:19+00:00

Of course this is a static version. Would I need to iterate through this?

Gribouillis 1,391 Programming Explorer Team Colleague · Answer 14 · 2015-06-19T21:01:36+00:00

Here is an example of what you can do. This function transforms a dictionary by exploding the inner lists if they contain only strings

from postprocess import post_process

@post_process(dict)
def explode_lists(adict):
    for key, value in adict.items():
        if isinstance(value, list):
            if all(isinstance(x, str) for x in value):
                for i, x in enumerate(value, 1):
                    yield ('{}{}'.format(key, i), x)
                continue
        yield key, value

if __name__ == '__main__':
    D = {'D_B': ['0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0'],
        'F_Int32': ['0','0','0','0'],
        'OTF': '0',
        'PBDS_Double': ['0', '0', '0', '0', '0', '0', '0', '0'],
        'SCS_String': ['1', '2']}
    print(explode_lists(D))


""" my output -->
{'SCS_String2': '2', 'SCS_String1': '1', 'PBDS_Double1': '0', 'PBDS_Double3': '0', 'PBDS_Double2': '0', 'PBDS_Double5': '0', 'PBDS_Double4': '0', 'PBDS_Double7': '0', 'PBDS_Double6': '0', 'PBDS_Double8': '0', 'F_Int321': '0', 'F_Int323': '0', 'F_Int322': '0', 'F_Int324': '0', 'D_B5': '0', 'D_B4': '0', 'D_B7': '0', 'D_B6': '0', 'D_B1': '0', 'D_B3': '0', 'D_B2': '0', 'D_B9': '0', 'D_B8': '0', 'OTF': '0', 'D_B11': '0', 'D_B10': '0'}
"""

In this code, I used a very useful snippet that I wrote long ago in the following file postprocess.py

# postprocess.py

def post_process(*filters):
    """Decorator to post process a function's return value through a
    sequence of filters (functions with a single argument).

    Example:

        @post_process(f1, f2, f3)
        def f(*args, **kwd):
            ...
            return value

        then calling f(...) will actually return f3( f2( f1( f(...)))).

        This can also be used to convert a generator to a function
        returning a sequence type:

        @post_process(dict)
        def my_generator():
            ...
            yield key, value

    """

    def decorate(func):
        from functools import wraps
        @wraps(func)
        def wrapper(*args, **kwd):
            rv = func(*args, **kwd)
            for f in filters:
                rv = f(rv)
            return rv
        return wrapper

In this example, I use it to transform a function generating pairs key/value into a function returning a dict.

Saran_1 0 Junior Poster in Training · Answer 15 · 2015-06-19T21:14:57+00:00

Python is so cool. I am attempting to implement this. Do I need to create another file to import process as well. I receive the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-207-a2b99a0f9418> in <module>()
----> 1 @post_process(dict)
      2 def explode_lists(adict):
      3     for key, value in adict.items():
      4         if isinstance(value, list):
      5             if all(isinstance(x, str) for x in value):

TypeError: 'NoneType' object is not callable

Saran_1 0 Junior Poster in Training · Answer 16 · 2015-06-21T15:36:30+00:00

Thank you. Okay, so let me recap (brace yourself):

To convert from XML to JSON we have used the following function:

import xml.etree.cElementTree as ElementTree 
from xml.etree.ElementTree import XMLParser 
import json 
import csv 
import tokenize 
import token 
try: 
    from collections import OrderedDict 
    import json 
except ImportError: 
    from ordereddict import OrderedDict 
    import simplejson as json 
import itertools 
import six 
import string 
from csvkit import CSVKitWriter 


class XmlListConfig(list): 
    def __init__(self, aList): 
        for element in aList: 
            if element: 
                # treat like dict 
                if len(element) == 1 or element[0].tag != element[1].tag: 
                    self.append(XmlDictConfig(element)) 
                # treat like list 
                elif element[0].tag == element[1].tag: 
                    self.append(XmlListConfig(element)) 
            elif element.text: 
                text = element.text.strip() 
                if text: 
                    self.append(text) 


class XmlDictConfig(dict): 
    ''' 
    Example usage: 

    >>> tree = ElementTree.parse('your_file.xml') 
    >>> root = tree.getroot() 
    >>> xmldict = XmlDictConfig(root) 

    Or, if you want to use an XML string: 

    >>> root = ElementTree.XML(xml_string) 
    >>> xmldict = XmlDictConfig(root) 

    And then use xmldict for what it is..a dictionary. 
    ''' 
    def __init__(self, parent_element): 
        if parent_element.items(): 
            self.update(dict(parent_element.items())) 
        for element in parent_element: 
            if element: 
                # treat like dict - we assume that if the first two tags 
                # in a series are different, then they are all different. 
                if len(element) == 1 or element[0].tag != element[1].tag: 
                    aDict = XmlDictConfig(element) 
                # treat like list - we assume that if the first two tags 
                # in a series are the same, then the rest are the same. 
                else: 
                    # here, we put the list in dictionary; the key is the 
                    # tag name the list elements all share in common, and 
                    # the value is the list itself 
                    aDict = {element[0].tag: XmlListConfig(element)} 
                # if the tag has attributes, add those to the dict 
                if element.items(): 
                    aDict.update(dict(element.items())) 
                self.update({element.tag: aDict}) 
            # this assumes that if you've got an attribute in a tag, 
            # you won't be having any text. This may or may not be a 
            # good idea -- time will tell. It works for the way we are 
            # currently doing XML configuration files... 
            elif element.items(): 
                self.update({element.tag: dict(element.items())}) 
            # finally, if there are no child tags and no attributes, extract 
            # the text 
            else: 
                self.update({element.tag: element.text}) 



def main(): 

    #Lines 88-89stantiate the class Elementree 
    #and applies the method to recursively traverse from the root node 
    #XmlDictConfig is instantiated in line 90 

    with open('C:\\Users\\somefile.xml', 'r', encoding='utf-8') as f: 
        xml_string = f.read() 
    xml_string= xml_string.replace('&#x0;', '') #optional to remove ampersands. 
    root = ElementTree.XML(xml_string) 
    xmldict = XmlDictConfig(root) 
    json_str = json.dumps(xmldict, sort_keys=True, indent=4, separators=(',', ': ')) 
    newly_formatted_data = json.loads(json_str) #encode into JSON 
    with open('data2.json', 'w') as f:  #writing JSON file 
        json.dump(newly_formatted_data, f)

To convert a non-nested dicitonary to CSV, we have used this:

import csv 
import json 
import sys 

def hook(obj): 
    return obj 

def flatten(obj): 
    for k, v in obj: 
        if isinstance(v, list): 
            yield from flatten(v) 
        else: 
            yield k, v 

if __name__ == "__main__": 
    with open("somefileneame.json") as f: 
        data = json.load(f, object_pairs_hook=hook) 

    pairs = list(flatten(data)) 

    writer = csv.writer(sys.stdout) 
    header = writer.writerow([k for k, v in pairs]) 
    row = writer.writerow([v for k, v in pairs]) #writer.writerows for any other iterable object

To de-nest by level the nested dictionaries, I use this function:

def flatten(d, parent_key=''): 
    items = [] 
    for k, v in d.items(): 
        try: 
            items.extend(flatten(v, '%s%s_' % (parent_key, k)).items()) 
        except AttributeError: 
            items.append(('%s%s' % (parent_key, k), v)) 
    return dict(items) 
final = (flatten(data2, parent_key =''))

Finally, after this level of processing of the nested keys as dictionaries with values as dictionaries, we have used a decorator to blow up lists:

from postprocess import post_process 
@post_process(dict) 
def explode_lists(adict): 
    for key, value in adict.items(): 
        if isinstance(value, list): 
            if all(isinstance(x, str) for x in value): 
                for i, x in enumerate(value, 1): 
                    yield ('{}{}'.format(key, i), x) 
                continue 
        yield key, value 


# postprocess.py 
def post_process(*filters): 
    """Decorator to post process a function's return value through a 
    sequence of filters (functions with a single argument). 
    Example: 
        @post_process(f1, f2, f3) 
        def f(*args, **kwd): 
            ... 
            return value 
        then calling f(...) will actually return f3( f2( f1( f(...)))). 
        This can also be used to convert a generator to a function 
        returning a sequence type: 
        @post_process(dict) 
        def my_generator(): 
            ... 
            yield key, value 
    """ 
    def decorate(func): 
        from functools import wraps 
        @wraps(func) 
        def wrapper(*args, **kwd): 
            rv = func(*args, **kwd) 
            for f in filters: 
                rv = f(rv) 
            return rv 
        return wrapper 
return decorate

I know that this is alot of sequential steps. I am wondering if I could insert or conditionally pass these functions when originally parsing the XML, so that the JSON is formatted for more recursive reading of the JSON dictionary and then writing to CSV? I welcome constructive feedback for refactoring...

Gribouillis 1,391 Programming Explorer Team Colleague · Answer 17 · 2015-06-21T17:15:00+00:00

I see that you are using an old activestate recipe to parse xml and transform it into a dictionary. It means that the comments in this code don't have anything to do with your specific xml files.

You could traverse the parsed xml tree directly and generate on the fly the items of your final flat dictionary. It would be much more hard-hitting. This would probably mean two or three generators invoking each other recursively through the use of yield from statements.

The key algorithmic points remain the same: what is your actual xml file's structure and what are your rules to create target key, value pairs ?

Saran_1 0 Junior Poster in Training · Answer 18 · 2015-06-21T17:46:14+00:00

Unfortunately, I cannot share with you the actual contents of the XML file (NDA). However, I can give you this sample file like this assume lots of similar structure:

<Response ID="24856-775" RequestType="Moverview">        
        <MonthDayCount>
            <Int32>0</Int32>
            <Int32>0</Int32>
            <Int32>0</Int32>
            <Int32>0</Int32>
            <Int32>0</Int32>
            <Int32>0</Int32>
            <Int32>0</Int32>
            <Int32>0</Int32>
            <Int32>0</Int32>
            <Int32>0</Int32>
            <Int32>0</Int32>
            <Int32>0</Int32>
            <Int32>0</Int32>
            <Int32>0</Int32>
            <Int32>0</Int32>
            <Int32>0</Int32>
            <Int32>0</Int32>
            <Int32>0</Int32>
            <Int32>0</Int32>
            <Int32>0</Int32>
            <Int32>0</Int32>
            <Int32>0</Int32>
            <Int32>0</Int32>
            <Int32>0</Int32>
            <Int32>0</Int32>
        </MonthDayCount>
        <Warnings />
        <SList />
        <LList />
        <EA>Y</EA>
        <EHA>Y</EHA>
        <EBY>Y</EBY>
        <EOTH>Y</EOTH>
        <EIL>Y</EIL>
        <EM>Y</EM>
        <ED>Y</ED>
        <EQ>Y</EQ>
        <ERS>Y</ERS>
        <ECCS>Y</ECCS>
        <EES>Y</EES>
        <UAS>Y</UAS>
        <PA>False</PA>
        <PL>False</PL>
        <PC>False</PC>
        <PCs>False</PCs>
        <PJ>False</PJ>
        <OITC>0</OITC>
        <MG />
        <R />
        <CCGoods />
    </MO>
</Response>

THe dictionary keys, using the ActiveState recipe convert the above to (I am representing just the first three tags):

{
    "TransactionID": "24856-775",
    "RequestType": "Moverview",
    "MonthDayCount": {
            "Int32": ["0",
            "0",
            "0",
            "0",
            "0",
            "0",
            "0",
            "0",
            "0",
            "0",
            "0",
            "0",
            "0",
            "0",
            "0",
            "0",
            "0",
            "0",
            "0",
            "0",
            "0",
            "0",
            "0",
            "0",
            "0"]
        },

etc.....

My rules for key value pairs are the same as the Activestate recipe. It's just that some of the nested keys have unnecessary nesting of dicitonaries and the conversion of the repetitious items into lists. Hence the requirement for cleaning up with the flatten function and your decorator. I hope that I am clear and that this information is helpful in bringing clarity?

Gribouillis 1,391 Programming Explorer Team Colleague · Answer 19 · 2015-06-21T20:58:59+00:00

I transformed the activestate recipe into a generator which flattens the xml structure. Look at the ouput, then try to define how you would transform the keys to get correct column names

import xml.etree.cElementTree as ElementTree 
from xml.etree.ElementTree import XMLParser 
import json 
import csv 
import tokenize 
import token 
try: 
    from collections import OrderedDict 
    import json 
except ImportError: 
    from ordereddict import OrderedDict 
    import simplejson as json 
import itertools 
import six 
import string 
#from csvkit import CSVKitWriter 


def flatten_list(aList, prefix=''): 
    for element in aList:
        if element: 
            # treat like dict 
            if len(element) == 1 or element[0].tag != element[1].tag: 
                yield from flatten_dict(element, prefix)
            # treat like list 
            elif element[0].tag == element[1].tag: 
                yield from flatten_list(element, prefix)
        elif element.text: 
            text = element.text.strip() 
            if text: 
                yield prefix, text


def flatten_dict(parent_element, prefix=''):
    prefix = prefix + parent_element.tag + '.'
    if parent_element.items():
        for k, v in parent_element.items():
            yield prefix + k, v
    for element in parent_element:
        eprefix = prefix + element.tag + '.'
        if element:
            # treat like dict - we assume that if the first two tags 
            # in a series are different, then they are all different. 
            if len(element) == 1 or element[0].tag != element[1].tag: 
                yield from flatten_dict(element, prefix=prefix)
            # treat like list - we assume that if the first two tags 
            # in a series are the same, then the rest are the same. 
            else: 
                # here, we put the list in dictionary; the key is the 
                # tag name the list elements all share in common, and 
                # the value is the list itself
                yield from flatten_list(element, prefix=eprefix+element[0].tag+'.')
            # if the tag has attributes, add those to the dict
            if element.items():
                for k, v in element.items():
                    yield eprefix+k, v 
        # this assumes that if you've got an attribute in a tag, 
        # you won't be having any text. This may or may not be a 
        # good idea -- time will tell. It works for the way we are 
        # currently doing XML configuration files... 
        elif element.items(): 
            for k, v in element.items():
                yield eprefix+k, v
        # finally, if there are no child tags and no attributes, extract 
        # the text 
        else:
            yield eprefix, element.text


def main():
    with open('source.xml', 'r', encoding='utf-8') as f: 
        xml_string = f.read() 
    xml_string= xml_string.replace('&#x0;', '') #optional to remove ampersands. 
    root = ElementTree.XML(xml_string) 
    for item in flatten_dict(root):
        print(item)

if __name__ == "__main__":
    main()

""" my output -->
('Response.RequestType', 'Moverview')
('Response.ID', '24856-775')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.Warnings.', None)
('Response.SList.', None)
('Response.LList.', None)
('Response.EA.', 'Y')
('Response.EHA.', 'Y')
('Response.EBY.', 'Y')
('Response.EOTH.', 'Y')
('Response.EIL.', 'Y')
('Response.EM.', 'Y')
('Response.ED.', 'Y')
('Response.EQ.', 'Y')
('Response.ERS.', 'Y')
('Response.ECCS.', 'Y')
('Response.EES.', 'Y')
('Response.UAS.', 'Y')
('Response.PA.', 'False')
('Response.PL.', 'False')
('Response.PC.', 'False')
('Response.PCs.', 'False')
('Response.PJ.', 'False')
('Response.OITC.', '0')
('Response.MG.', None)
('Response.R.', None)
('Response.CCGoods.', None)
"""

Saran_1 0 Junior Poster in Training · Answer 20 · 2015-06-21T23:35:04+00:00

Huh...that's an interesting take on this. At this point, I am playing around with the following (I welcome your feedback on this as well):

1.

(**'Response.RequestType'**, 'Moverview') 
(**'Response.ID', **'24856-775')

The ** 'sample.text.in_tag ** corresponds with the column headers. The values would be the second item in the tuple.

All of the "Int32" should be stripped away and replaced with an enumeration (just like the example with the previous generator:

f all(isinstance(x, list) for x in value):
for i, x in enumerate(value, 1):
yield ('{}{}'.format(key, i), x)
continue
yield key, value

Maybe this snippet would suffice?

The Responses for the rest of the elements (save for the first two above could be removed). I am still trying to figure how to do that as well...
The first and second item in the tuple would be the header and corresponding value, respectively.

Thank you again for your help.

Saran_1 0 Junior Poster in Training · Answer 21 · 2015-06-22T11:39:34+00:00

It certainly works for the sample that I gave you. However, as the one moves down the tree, the string increases as more tags are chained to the root element. Here is an example:

('Response.MemberO.PMembers.PMembers.Member.CurrentEmployer.EmployerAddress.TimeAtPreviousAddress.', None)

In theory, "TimeAtPreviousAddress" should be a separate header from "EmployerAddress, and "CurrentEmployer". The parent tags (ie.Response, MO etc) should be separate headers and be repeated only once. What are your thoughts on a wrapper function or decorator that could remove those parent elements, before the final step for writing to a CSV?

Gribouillis 1,391 Programming Explorer Team Colleague · Answer 22 · 2015-06-22T11:57:24+00:00

The problem is that I dont understand your rule for key generation. If you want to keep only the last word, it is very easy to do

    for key, value in flatten_dict(root):
        key = key.rstrip('.').rsplit('.', 1)[-1]
        print(key,  value)

edit: also, you can start with my generator and change the code the way you want to generate a different key.

Saran_1 0 Junior Poster in Training · Answer 23 · 2015-06-22T12:10:07+00:00

I think I need to speak with my database administrator to figure that out as well. Is it possible for you to rephrase "rule for key generation"? From what I understand, we went back to drawing board and scrapped the dicionary as the intermediate format before writing to a CSV (after all, we are now generating lists, correct?)

The output CSV, should be formatted in such a way that the headers are the tags and the text should be a row value:

"a", "b", "b" // headers
"", 1, 2 // data row

Thanks again for your patience and feedback.

Saran_1 0 Junior Poster in Training · Answer 24 · 2015-06-22T15:21:08+00:00

@Gribouillis: For now it seems that they are okay with the following (here is an example):

('Response.MemberO.PMembers.PMembers.Member.CurrentEmployer.EmployerAddress.TimeAtPreviousAddress.', None)

to be written out in CSV format as:

'Response.MemberO.PMembers.PMembers.Member.CurrentEmployer.EmployerAddress.TimeAtPreviousAddress' /// column header

'0' /// value row (data row)

Is it best to wse csv.writerows() or writerow() for the output? (using the sample.xml as a case study):

"
('Response.RequestType', 'Moverview')
('Response.ID', '24856-775')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.Warnings.', None)
('Response.SList.', None)
('Response.LList.', None)
('Response.EA.', 'Y')
('Response.EHA.', 'Y')
('Response.EBY.', 'Y')
('Response.EOTH.', 'Y')
('Response.EIL.', 'Y')
('Response.EM.', 'Y')
('Response.ED.', 'Y')
('Response.EQ.', 'Y')
('Response.ERS.', 'Y')
('Response.ECCS.', 'Y')
('Response.EES.', 'Y')
('Response.UAS.', 'Y')
('Response.PA.', 'False')
('Response.PL.', 'False')
('Response.PC.', 'False')
('Response.PCs.', 'False')
('Response.PJ.', 'False')
('Response.OITC.', '0')
('Response.MG.', None)
('Response.R.', None)
('Response.CCGoods.', None)

Finally, is it possible to assign a specific dialect, like "excel"?

Gribouillis 1,391 Programming Explorer Team Colleague · Answer 25 · 2015-06-22T16:11:04+00:00

If there is still a dot at the end of a column header, it would be better to remove it ('Response.R.' becomes 'Response.R'). For this, use the rstrip('.') method.

By default, the csv module selects the 'excel' dialect.

It is not better to use writerows(). It is only shorter if you only want to call writerow() several times. Shorter code looks better.

Saran_1 0 Junior Poster in Training · Answer 26 · 2015-06-22T16:59:56+00:00

I have attempted to use the following function to after the flatten methods for the lists and dictionary functions, respectively:

def main():
    with open('source.xml', 'r', encoding='utf-8') as f: 
        xml_string = f.read() 
    xml_string= xml_string.replace('&#x0;', '') #optional to remove ampersands. 
    root = ElementTree.XML(xml_string) 

    writer = csv.writer(open("test_out.csv", 'wt'))
    l1 = [k for k in flatten_dict(root)]
    l2 = [v for v in range(len(l1))]
    for row in xrange(len(l1)):
        writer.writerow([l1[row], l2[row]])
    if __name__ == "__main__":
        main()

However, I do not receive an output file...just a printout to the console

Reassigning keys in dictionary of lists and then writing out to CSV file?

Recommended Answers Collapse Answers

All 68 Replies

Recommended Answers