1

You don't understand the output of flatten_dict(). You can do this

def main():
    with open('source.xml', 'r', encoding='utf-8') as f: 
        xml_string = f.read() 
    xml_string= xml_string.replace('�', '') #optional to remove ampersands. 
    root = ElementTree.XML(xml_string) 

    writer = csv.writer(open("test_out.csv", 'wt'))
    writer.writerows(zip(*flatten_dict(root)))

if __name__ == "__main__":
        main()

Also main() was not called in your code, due to indentation.

Edited by Gribouillis

0

Is it possible for you to elaborate on the output; I may not understand the complete output of the flatten_dict(). Thank for catching the indentation - I just debugged that just now.

Edited by Saran_1

1

The flatten_dict() function returns an iterable sequence of (key, value) pairs. You can turn this to a list of pairs with list(flatten_dict(root)).

A list of pairs L (or a iterable of pairs) can be transposed with zip(*L).

>>> L = [('A', 1), ('B', 2), ('C', 3), ('D', 4)]
>>> list(zip(*L))
[('A', 'B', 'C', 'D'), (1, 2, 3, 4)]
0

Ah, I see now. However, if there is duplicate set of items in a list; for example, using your example:

>>> L = [('A', 1), ('B', 2), ('C', 3), ('D', 4),('A', 5), ('B', 6), ('C', 7), ('D', 8)]

The analogy is that the key (header) is the same, but the values (row values are different is it best to use zip(*L).

My goal is that the the second set of values be written below the 1st; however the header should be only written once. Is it best to write out seaprate values adnd headers?

For example, I was thinking:

 root = ElementTree.XML(xml_string) 
 header = writer.writerow([k for k, v in root]) //this should be only called once
 row = writer.writerow([v for k, v in root]) // based off of the number of new values these git printed in different rows
0

I don't understand your problem. If you have the list

L = [('A', 1), ('B', 2), ('C', 3), ('D', 4),('A', 5), ('B', 6), ('C', 7), ('D', 8)]

and you want a excel sheet such as

A B C D A B C D
1 2 3 4 5 6 7 8

you can do either

writer.writerows(zip(*L))

or

writer.writerow[(k for k, v in L)]
writer.writerow([v for k, v in L])

and of course, in your case, do first

L = list(flatten_dict(root))

All this will work very well as long as you dont have very large xml files (but in this case, the code can be adapted)

Edited by Gribouillis

0

Forgive me for the lack of clarity; I had meant the following:

A B C D
1 2 3 4

A B C D //or any column header
5 6 7 8
0

So you mean the headers A B C D, then a first row of values 1 2 3 4, then a blank row, then a row A B C D again and finally a new row with a set of values 5 6 7 8 ?

0

And if we have

L = [('A', 1), ('B', 2), ('C', 3), ('D', 4),('A', 5), ('B', 6), ('D', 8)]

you want

A B C D
1 2 3 4

A B   D
5 6   8

?

What about

L = [('A', 1), ('B', 2), ('C', 3), ('D', 4),('D', 5), ('B', 6), ('A', 8)]

?

Edited by Gribouillis

0

The columns should match the keys - if there are additions and subtractions then the column headers should reflect this.

0

This should work

def makerows(pairs):
    headers = []
    columns = {}
    for k, v in pairs:
        if k in columns:
            columns[k].extend(('', k, v))
        else:
            headers.append(k)
            columns[k] = [k, v]
    m = max(len(c) for c in columns.values())
    for c in columns.values():
        c.extend('' for i in range(len(c), m))
    L = [columns[k] for k in headers]
    rows = list(zip(*L))
    return rows

if __name__ == '__main__':
    XX = [
        [('A', 1), ('B', 2), ('C', 3), ('D', 4),('A', 5), ('B', 6), ('C', 7), ('D', 8)],
        [('A', 1), ('B', 2), ('C', 3), ('D', 4),('A', 5), ('B', 6), ('D', 8)],
        [('A', 1), ('B', 2), ('C', 3), ('D', 4),('D', 5), ('B', 6), ('A', 8)],
    ]
    from pprint import pprint
    for data in XX:
        print(data)
        pprint(makerows(data))

""" my output -->
[('A', 1), ('B', 2), ('C', 3), ('D', 4), ('A', 5), ('B', 6), ('C', 7), ('D', 8)]
[('A', 'B', 'C', 'D'),
 (1, 2, 3, 4),
 ('', '', '', ''),
 ('A', 'B', 'C', 'D'),
 (5, 6, 7, 8)]
[('A', 1), ('B', 2), ('C', 3), ('D', 4), ('A', 5), ('B', 6), ('D', 8)]
[('A', 'B', 'C', 'D'),
 (1, 2, 3, 4),
 ('', '', '', ''),
 ('A', 'B', '', 'D'),
 (5, 6, '', 8)]
[('A', 1), ('B', 2), ('C', 3), ('D', 4), ('D', 5), ('B', 6), ('A', 8)]
[('A', 'B', 'C', 'D'),
 (1, 2, 3, 4),
 ('', '', '', ''),
 ('A', 'B', '', 'D'),
 (8, 6, '', 5)]
 """

Now you work yourself, all this is getting more and more sophisticated.

0

Wow you covered all of the bases! I only got the first test running! How would you write this out to a CSV.

Would this be changed to writer.writerows((makerows(flatten_dict(root))))

Edited by Saran_1

0

I followed the protocol that you ascribe to:

def main():
    with open('2-Response_duplicate.xml', 'r', encoding='utf-8') as f: 
        xml_string = f.read() 
    xml_string= xml_string.replace('�', '') #optional to remove ampersands. 
    root = ElementTree.XML(xml_string) 
    for item in root:
        print(root)
    writer = csv.writer(open("test_out.csv", 'wt'))
    writer.writerows(makerows(flatten_dict(root)))

if __name__ == "__main__":
        main()

output (in iPython Notebook):

<Element 'Response' at 0x00000000057D6728>
<Element 'Response' at 0x00000000057D6728>

Looks okay when opened in Excel

0

Is it possible just to have the values below the initial column headers. My apologies but my ETL process rejected the output CSV because of the repetition of the headers.

    A B C D
    1 2 3 4
    5 6 7 8
    9 10  12 

Edited by Saran_1

0

Why not updating the code by yourself ? Here is the solution

def makerows(pairs):
    headers = []
    columns = {}
    for k, v in pairs:
        if k in columns:
            columns[k].extend((v,))
        else:
            headers.append(k)
            columns[k] = [k, v]
    m = max(len(c) for c in columns.values())
    for c in columns.values():
        c.extend('' for i in range(len(c), m))
    L = [columns[k] for k in headers]
    rows = list(zip(*L))
    return rows
0

I am too mentally fatigued - 72 hours non-stop (30 minutes of shuteye....)

0

Hi Griboullis: I retried the following script (after much R&R):

def makerows(pairs):
    headers = []
    columns = {}
    for k, v in pairs:
        if k in columns:
            columns[k].extend((v,))
        else:
            headers.append(k)
            columns[k] = [k, v]
    m = max(len(c) for c in columns.values())
    for c in columns.values():
        c.extend('' for i in range(len(c), m))
    L = [columns[k] for k in headers]
    rows = list(zip(*L))
    return rows

However the headers (and corresponding values) are repeated side by side. I am not sure what happened between today and yesterday. I started with the original analogy of using the following set of lists:

if __name__ == '__main__':

    lists = [
        [('A', 1), ('B', 2), ('C', 3), ('D', 4),('A', 5), ('B', 6), ('C', 7), ('D', 8)],
        [('A', 1), ('B', 2), ('C', 3), ('D', 4),('A', 5), ('B', 6), ('D', 8)],
        [('A', 1), ('B', 2), ('C', 3), ('D', 4),('D', 5), ('B', 6), ('A', 8)],
    ]
    from pprint import pprint
    for data in lists:
        print(data)
        pprint(makerows(data))

output:

[('A', 1), ('B', 2), ('C', 3), ('D', 4), ('A', 5), ('B', 6), ('C', 7), ('D', 8)]
[('A', 'B', 'C', 'D'), (1, 2, 3, 4), (5, 6, 7, 8)]
[('A', 1), ('B', 2), ('C', 3), ('D', 4), ('A', 5), ('B', 6), ('D', 8)]
[('A', 'B', 'C', 'D'), (1, 2, 3, 4), (5, 6, '', 8)]
[('A', 1), ('B', 2), ('C', 3), ('D', 4), ('D', 5), ('B', 6), ('A', 8)]

[('A', 'B', 'C', 'D'), (1, 2, 3, 4), (8, 6, '', 5)]

However, when I extrapolate that logic to this XML structure with repeating elements with this function call:

def makerows(pairs):
    headers = []
    columns = {}
    for k, v in pairs:
        if k in columns:
            columns[k].extend((v,))
        else:
            headers.append(k)
            columns[k] = [k, v]
    m = max(len(c) for c in columns.values())
    for c in columns.values():
        c.extend(' ' for i in range(len(c), m))
    L = [columns[k] for k in headers]
    rows = list(zip(*L))
    return rows


def main():
    with open('sample.xml', 'r', encoding='utf-8') as f: 
        xml_string = f.read() 
    xml_string= xml_string.replace('&#x0;', '') #optional to remove ampersands. 
    root = ElementTree.XML(xml_string) 
    for item in root:
        print(root)
    writer = csv.writer(open("test_out.csv", 'wt'))
    writer.writerows(makerows(flatten_dict(root)))

if __name__ == "__main__":
        main()

XML sample:

<Response TransactionID="1234567" RequestType="Cfee">
<ChargeFee>
    <FeeAmount>5</FeeAmount>
    <FeeCharged>True</FeeCharged>
    <FeeChargedDate>1/1/0001 12:00:00 AM</FeeChargedDate>
    <Message>Transfer Fee</Message>
    <AccountNumber />
    <AccountType />
    <AccountSuffix />
    <MPIAccountType />
    <Amount>0</Amount>
    <Comment />
    <GLCode />
    <SystemTrace>0</SystemTrace>
    <FeeTypeCode />
    <Locator />
    <InstitutionISO />
    <MemberID />
    <MemberPass />
    <TransactionID />
    <CallID />
    <TellerErrors />
</ChargeFee>

<ChargeFee>
    <FeeAmount>5</FeeAmount>
    <FeeCharged>True</FeeCharged>
    <FeeChargedDate>1/1/0001 12:00:00 AM</FeeChargedDate>
    <Message>Transfer Fee</Message>
    <AccountNumber />
    <AccountType />
    <AccountSuffix />
    <MPIAccountType />
    <Amount>0</Amount>
    <Comment />
    <GLCode />
    <SystemTrace>0</SystemTrace>
    <FeeTypeCode />
    <Locator />
    <InstitutionISO />
    <MemberID />
    <MemberPass />
    <TransactionID />
    <CallID />
    <TellerErrors />
</ChargeFee>


<ChargeFee>
    <FeeAmount>5</FeeAmount>
    <FeeCharged>True</FeeCharged>
    <FeeChargedDate>1/1/0001 12:00:00 AM</FeeChargedDate>
    <Message>Transfer Fee</Message>
    <AccountNumber />
    <AccountType />
    <AccountSuffix />
    <MPIAccountType />
    <Amount>0</Amount>
    <Comment />
    <GLCode />
    <SystemTrace>0</SystemTrace>
    <FeeTypeCode />
    <Locator />
    <InstitutionISO />
    <MemberID />
    <MemberPass />
    <TransactionID />
    <CallID />
    <TellerErrors />
</ChargeFee>  
</Response>

I receive an output CSV with repeating headers and values all printed in one long string. There are no rows, as was represented yesterday. Thoughts?

0

This works very well for me: first a file parsexml2.py

# parsexml2.py

import xml.etree.cElementTree as ElementTree 
import csv


def flatten_list(aList, prefix=''): 
    for element in aList:
        if element: 
            # treat like dict 
            if len(element) == 1 or element[0].tag != element[1].tag: 
                yield from flatten_dict(element, prefix)
            # treat like list 
            elif element[0].tag == element[1].tag: 
                yield from flatten_list(element, prefix)
        elif element.text: 
            text = element.text.strip() 
            if text: 
                yield prefix.rstrip('.'), text


def flatten_dict(parent_element, prefix=''):
    prefix = prefix + parent_element.tag + '.'
    if parent_element.items():
        for k, v in parent_element.items():
            yield prefix + k, v
    for element in parent_element:
        eprefix = prefix + element.tag + '.'
        if element:
            # treat like dict - we assume that if the first two tags 
            # in a series are different, then they are all different. 
            if len(element) == 1 or element[0].tag != element[1].tag: 
                yield from flatten_dict(element, prefix=prefix)
            # treat like list - we assume that if the first two tags 
            # in a series are the same, then the rest are the same. 
            else: 
                # here, we put the list in dictionary; the key is the 
                # tag name the list elements all share in common, and 
                # the value is the list itself
                yield from flatten_list(element, prefix=eprefix+element[0].tag+'.')
            # if the tag has attributes, add those to the dict
            if element.items():
                for k, v in element.items():
                    yield eprefix+k, v 
        # this assumes that if you've got an attribute in a tag, 
        # you won't be having any text. This may or may not be a 
        # good idea -- time will tell. It works for the way we are 
        # currently doing XML configuration files... 
        elif element.items(): 
            for k, v in element.items():
                yield eprefix+k, v
        # finally, if there are no child tags and no attributes, extract 
        # the text 
        else:
            yield eprefix.rstrip('.'), element.text

Then almost the same code as yours in another file

import csv
from parsexml2 import flatten_dict, ElementTree

def makerows(pairs):
    headers = []
    columns = {}
    for k, v in pairs:
        if k in columns:
            columns[k].extend((v,))
        else:
            headers.append(k)
            columns[k] = [k, v]
    m = max(len(c) for c in columns.values())
    for c in columns.values():
        c.extend('' for i in range(len(c), m)) # why do you want to change '' into ' ' ?
    L = [columns[k] for k in headers]
    rows = list(zip(*L))
    return rows


def main():
    with open('sample.xml', 'r', encoding='utf-8') as f: 
        xml_string = f.read() 
    xml_string= xml_string.replace('&#x0;', '') #optional to remove ampersands. 
    root = ElementTree.XML(xml_string) 
    for item in root:
        print(root)
    writer = csv.writer(open("test_out.csv", 'wt'))
    writer.writerows(makerows(flatten_dict(root)))

if __name__ == "__main__":
        main()

I opened the csv with libreoffice by choosing encoding utf8, language english and separator comma.

Minor points: some column are empty because there was a None value in python. One could change things by writing for example

for k, v in pairs:
    v = str(v) # or some other value transformation
    ... etc

in makerows().

Your argument about the values being side to side in makerows() does not make sense. Python lists have no notion of vertical and horizontal direction. Makerows() returns a list of rows for the csv file. Each element is a new row.

Edited by Gribouillis

0

It was the delimiter that I needed! It seems that with iPython Notebook behaves very differently than from the Command Prompt on Windows 7! Sorry for the false alarm. On another note, how do I mark a post solved?

Edited by Saran_1

0

Hi Griboullis:

I am currently attmepting to reverse engineer the Active State recipe in such a way that it maintains the previous requirements but instead does not "chain the root node/element" (key) to the rest of the children/sublelements (e.g):

Instead of

`'Response.MemberO.PMembers.PMembers.Member.CurrentEmployer.EmployerAddress.TimeAtPreviousAddress'

TimeAtPreviousAddress

Right now, using Pdb, I have been to isolate lines 23-34

prefix = prefix + parent_element.tag + '.'
    if parent_element.items():
        for k, v in parent_element.items():
            yield prefix + k, v
    for element in parent_element:
        eprefix = prefix + element.tag + '.'
        if element:

May be I misunderstand how the "yield from" statement works within this contex. I have attempted to deconconstruct the generator, but either get the elements stacked (but have the chained tags removed) or vice versa. I welcome any suggestions/guidance.

Edited by Saran_1

0

The question is what do you want to have instead of 'Response.MemberO.PMembers.PMembers.Member.CurrentEmployer.EmployerAddress.TimeAtPreviousAddress'
? If you want to have only 'TimeAtPreviousAddress' you can obtain this without changing the generator, but instead by wrapping it in a modified generator:

def pairs_from_root(element):
    for k, v in flatten_dict(element):
        kk = k.rsplit('.', 1)[-1]
        yield kk, v

Edited by Gribouillis

0

Where would I call this. I still would like to have out put CSV maintain the structure. Would this not affect how the lists are treated when they are passed in?

0

Instead of calling flatten_dict(root), you would call pairs_from_root(root). You need to do this only once at the bottom of the program. This will change the CSV structure in the sense that if you have two columns A.TimeAtPreviousAddress and B.TimeAtPreviousAddress, you will now have only one column TimeAtPreviousAddress.

Edited by Gribouillis

0

I tried the following:

import xml.etree.cElementTree as ElementTree 
import csv

def flatten_list(aList, prefix=''):

    for i, element in enumerate(aList, 1):
        eprefix = "{}{}".format(prefix, i)
        if element:
            # treat like dict 
            if len(element) == 1 or element[0].tag != element[1].tag: 
                yield from flatten_dict(element, eprefix)
            # treat like list 
            elif element[0].tag == element[1].tag: 
                yield from pairs_from_root(element)
        elif element.text: 
            text = element.text.strip() 
            if text: 
                yield eprefix.rstrip('.'), text     

def flatten_dict(parent_element, prefix=''):
    prefix = prefix + parent_element.tag
    if parent_element.items():
        for k, v in parent_element.items():
            yield prefix + k, v
    for element in parent_element:
        eprefix = prefix + element.tag 
        if element:
            # treat like dict - we assume that if the first two tags 
            # in a series are different, then they are all different. 
            if len(element) == 1 or element[0].tag != element[1].tag: 
                yield from flatten_dict(element, prefix=prefix)
            # treat like list - we assume that if the first two tags 
            # in a series are the same, then the rest are the same. 
            else: 
                # here, we put the list in dictionary; the key is the 
                # tag name the list elements all share in common, and 
                # the value is the list itself
                yield from flatten_list(element, prefix=eprefix)
            # if the tag has attributes, add those to the dict
            if element.items():
                for k, v in element.items():
                    yield eprefix+k, v 
        # this assumes that if you've got an attribute in a tag, 
        # you won't be having any text. This may or may not be a 
        # good idea -- time will tell. It works for the way we are 
        # currently doing XML configuration files... 
        elif element.items(): 
            for k, v in element.items():
                yield eprefix+k, v
        # finally, if there are no child tags and no attributes, extract 
        # the text 
        else:
            yield eprefix, element.text

def pairs_from_root(element):
    for k, v in flatten_dict(element):
        kk = k.rsplit('.', 1)[-1]
        yield kk, v  


def makerows(pairs):
    headers = []
    columns = {}
    for k, v in pairs:
        if k in columns:
            columns[k].extend((v,))
        else:
            headers.append(k)
            columns[k] = [k, v]
    m = max(len(c) for c in columns.values())
    for c in columns.values():
        c.extend(' ' for i in range(len(c), m))
    L = [columns[k] for k in headers]
    rows = list(zip(*L))
    return rows


def main():
    with open('2-Response.xml', 'r', encoding='utf-8') as f: 
        xml_string = f.read() 
    xml_string= xml_string.replace('&#x0;', '') #optional to remove ampersands. 
    root = ElementTree.XML(xml_string) 
    for item in root:
        print(root)
    print(pairs_from_root(root))
    writer = csv.writer(open("test_out.csv", 'wt'))
    writer.writerows(makerows(pairs_from_root(root)))

if __name__ == "__main__":
        main()

However, the output still had the headers as (for example) ResponseRequestType, ResponseTransactionID instead of RequestType and TransactionID

This question has already been answered. Start a new discussion instead.
Have something to contribute to this discussion? Please be thoughtful, detailed and courteous, and be sure to adhere to our posting rules.