What I am trying to achieve is to parse an XML file, break it up into useful components, and push it to a multi-table SQL database. But I cannot get off the ground with the basics.

Take an XML file like this one, which starts with:

<meeting id="35504" barriertrial="0" venue="Hawkesbury" date="2014-05-13T00:00:00" gearchanges="-1" stewardsreport="-1" gearlist="-1" racebook="0" postracestewards="0" meetingtype="TAB" rail="+4m 1300m to Winning Post, True Remainder" weather="Fine      " trackcondition="Dead      " nomsdeadline="2014-05-07T11:00:00" weightsdeadline="2014-05-08T16:00:00" acceptdeadline="2014-05-09T09:00:00" jockeydeadline="2014-05-09T12:00:00">
  <club abbrevname="Hawkesbury Race Club Limited" code="20" associationclass="2" website="http://" />
  <race id="185360" number="1" nomnumber="1" division="0" name="XXXX GOLD BENCHMARK 70 HANDICAP" mediumname="BM70" shortname="BM70" stage="Acceptances" distance="2000" minweight="55" raisedweight="1" class="BM70      " age="~         " grade="0" weightcondition="HCP       " trophy="0" owner="0" trainer="0" jockey="0" strapper="0" totalprize="22000" first="12250" second="4250" third="2100" fourth="1000" fifth="525" time="2014-05-13T13:03:00" bonustype="BX02      " nomsfee="0" acceptfee="0" trackcondition="          " timingmethod="          " fastesttime="          " sectionaltime="          " formavailable="0" racebookprize="Of $22000. First $12250, second $4250, third $2100, fourth $1000, fifth $525, sixth $375, seventh $375, eighth $375, ninth $375, tenth $375">
    <condition line="1">Of $22000. First $12250, second $4250, third $2100, fourth $1000, fifth $525, sixth $375, seventh $375, eighth $375, ninth $375, tenth $375</condition>
    <condition line="2">Starter Subsidy: $200 for non-prize earning runners.</condition>
    <condition line="3">BenchMark 70, Handicap, For No age restriction, No sex restriction (Weights Raised 1.0kg.)</condition>
    <condition line="4">BOBS&amp;BOBS Extra  Bonus available: $5,000</condition>
    <condition line="5">Apprentices can claim. Field Limit: 12 + 4 EM</condition>
    <nomination number="1" saddlecloth="1" horse="Our Uncle Archie" id="170617" idnumber="" regnumber="" blinkers="1" trainernumber="324" trainersurname="Englebrecht" trainerfirstname="Steve" trainertrack="Warwick Farm" rsbtrainername="Steve Englebrecht" jockeynumber="86428" jockeysurname="Pracey-Holmes" jockeyfirstname="Jake" barrier="2" weight="58" rating="68" description="BR G 3 Duke of Marmalade(IRE) x Nena Candida (Canny Lad)" colours="Red And Green Hoops, Black Sleeves, Red Armbands, Black And Red Seams Cap" owners="A J Watson, Mrs S C Watson, J R Watson, P K Watson, R F Watson, J M Cockburn &amp; Mrs J A Cockburn " dob="2010-09-19T00:00:00" age="4" sex="G" career="8-3-0-0 $53605.00" thistrack="1-1-0-0 $17250.00" thisdistance="0-0-0-0" goodtrack="4-1-0-0 $18655.00" heavytrack="1-1-0-0 $17250.00" slowtrack="0-0-0-0" deadtrack="3-1-0-0 $17700.00" fasttrack="0-0-0-0" firstup="3-0-0-0 $955.00" secondup="2-0-0-0 $900.00" mindistancewin="0" maxdistancewin="0" finished="0" weightvariation="0" variedweight="58" decimalmargin="0.00" penalty="0" pricestarting="" sectional200="0" sectional400="0" sectional600="0" sectional800="0" sectional1200="0" bonusindicator="E" />
    <nomination number="2" saddlecloth="2" horse="Montiro" id="158475" idnumber="" regnumber="" blinkers="0" trainernumber="279" trainersurname="Conners" trainerfirstname="Clarry" trainertrack="Warwick Farm" rsbtrainername="Clarry Conners" jockeynumber="965" jockeysurname="Hammersley" jockeyfirstname="Paul" barrier="1" weight="57.5" rating="65" description="CH G 4 Royal Academy(USA) x Stormy Petrel (Flying Spur)" colours="Yellow, Royal Blue Armbands And Cap" owners="Victory Lodge Syndicate (Mgrs: C &amp; M Conners), P J Collier, B E Collier, A W Rohde, D Thom &amp; Mrs M Gelardi" dob="2009-10-09T00:00:00" age="5" sex="G" career="9-2-1-1 $38975.00" thistrack="2-0-1-0 $4625.00" thisdistance="0-0-0-0" goodtrack="3-0-1-0 $5625.00" heavytrack="0-0-0-0" slowtrack="3-1-0-1 $20350.00" deadtrack="3-1-0-0 $13000.00" fasttrack="0-0-0-0" firstup="2-1-0-0 $17625.00" secondup="2-0-1-1 $6350.00" mindistancewin="0" maxdistancewin="0" finished="0" weightvariation="0" variedweight="57.5" decimalmargin="0.00" penalty="0" pricestarting="" sectional200="0" sectional400="0" sectional600="0" sectional800="0" sectional1200="0" bonusindicator="K" />

So I can read it in just fine. I can grab single elements just fine.

In [10]: %paste
import xmltodict
document = open("/home/sayth/Scripts/va_benefits/20140513HAWK0.xml", "r")
read_doc = document.read()
xml_doc = xmltodict.parse(read_doc)

## -- End pasted text --

In [11]: xml_doc['meeting']['@id']
Out[11]: u'35504'

But I cannot get multiple items out into a list so that I can push them into the database table. Well, I can get every item out at once: it is xml_doc['meeting'].

If I try to specify them together:

In [14]: a = []

In [15]: a = xml_doc(['meeting']['@id'],['meeting']['@venue'])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-15-4c706827a308> in <module>()
----> 1 a = xml_doc(['meeting']['@id'],['meeting']['@venue'])

TypeError: list indices must be integers, not str

I can do it manually, but how can I 'automate' it, so that every import is filtered the same way and can easily update the database?

This is how I can manually do it.

In [17]: a.append(xml_doc['meeting']['@id'])

In [18]: a.append(xml_doc['meeting']['@venue'])

In [19]: print a
[u'35504', u'Hawkesbury']

All 9 Replies

You can define a function to do the same easily

@post_process(list)
def gather(xml_doc, paths):
    for p in paths:
        node = xml_doc
        for word in p:
            node = getitem(node, word)
        yield node


a = gather(xml_doc, (('meeting', '@id'), ('meeting', '@venue')))

post_process is defined in this code snippet (very useful bit).

Wow, you're a legend. How does this compare to using a Python function such as filter?

How does this compare to using a Python function such as filter?

Can you be more specific? The role of filter() is to select a subsequence of a sequence. It is a different problem.
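
To make the difference concrete, here is a minimal sketch. The dictionary doc is a hand-written stand-in for what xmltodict would return (an assumption for illustration, not the real file). filter() keeps the items of a sequence that pass a test, whereas the lookup descends through nested mappings along a path of keys.

```python
# Hand-written stand-in for the parsed XML (assumption: the real
# data would come from xmltodict.parse).
doc = {'meeting': {'@id': '35504', '@venue': 'Hawkesbury'}}

# filter() selects a subsequence: it keeps the items for which the
# test function returns a true value.
numbers = [1, 2, 3, 4, 5, 6]
evens = list(filter(lambda n: n % 2 == 0, numbers))

# Following a path of keys is a different operation: each key takes
# us one level deeper into the nested mapping.
node = doc
for key in ('meeting', '@venue'):
    node = node[key]
venue = node

print(evens)
print(venue)
```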

No, it is different. I was trying to use filter last night after I posted this, and then I saw your post use:

a = gather(xml_doc, (('meeting', '@id'), ('meeting', '@venue')))

and thought: what if that were a function (which is what you created)? Don't worry about it, it was just the last thing I did late last night.

def gather(a):
    a = gather(xml_doc, (('meeting', '@id'), ('meeting', '@venue')))
    return a

then you could use it in filter

filter(gather, xml_doc)

But anyway, that aside, and more importantly: I have sort of come to understand @decorators from using Flask and defining routes, but how does the decorator work in relation to the function here?

In this case, the decorator transforms a function which generates values into a function which returns a list of these values. Without the decorator, I would write

def gather(xml_doc, paths):
    result = []
    for p in paths:
        node = xml_doc
        for word in p:
            node = getitem(node, word)
        result.append(node)
    return result

Composing the result list this way is tedious, and it is a recurring pattern. That's why I use the decorator.
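
As a rough sketch of what the decorator saves you from writing: applying list to the generator by hand gives the same result. The names gather_items and doc below are hypothetical, introduced only for this illustration, and doc is a hand-written stand-in for the parsed file.

```python
def gather_items(doc, paths):
    """Generator version: yields one value per path of keys."""
    for path in paths:
        node = doc
        for key in path:
            node = node[key]
        yield node


def gather(doc, paths):
    """Same effect as decorating the generator with post_process(list):
    the generated values are collected into a list before returning."""
    return list(gather_items(doc, paths))


# Hand-written stand-in for the parsed XML (assumption).
doc = {'meeting': {'@id': '35504', '@venue': 'Hawkesbury'}}

result = gather(doc, (('meeting', '@id'), ('meeting', '@venue')))
print(result)
```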

import xmltodict

document = open("/home/sayth/Scripts/va_benefits/20140513HAWK0.xml", "r")
read_doc = document.read()
xml_doc = xmltodict.parse(read_doc)

@post_process(list)
def gather(xml_doc, paths):
    for p in paths:
        node = xml_doc
        for word in p:
            node = getitem(node, word)
        yield node

result = gather(xml_doc, (('meeting', '@id'), ('meeting', '@venue')))
print(result)

collect.py|8 error| E0602 undefined name 'post_process' [pyflakes]
collect.py|13 error| E0602 undefined name 'getitem' [pyflakes]

Sorry I meant

@post_process(list)
def gather(xml_doc, paths):
    for p in paths:
        node = xml_doc
        for word in p:
            node = node[word]
        yield node

getitem() exists, but it is in the operator module. You must also import post_process() from another file. For example, store it in a module post_process.py and write

from post_process import post_process
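
If you prefer to keep getitem(), a minimal sketch of how it could be imported and used follows. The helper name lookup and the sample dictionary are hypothetical, made up for this illustration; operator.getitem and functools.reduce are both in the standard library.

```python
from functools import reduce
from operator import getitem

# Hand-written stand-in for the parsed XML (assumption).
doc = {'meeting': {'@id': '35504', '@venue': 'Hawkesbury'}}


def lookup(doc, path):
    """Follow a path of keys: getitem(a, b) is the same as a[b], and
    reduce applies it once per key, starting from doc."""
    return reduce(getitem, path, doc)


meeting_id = lookup(doc, ('meeting', '@id'))
print(meeting_id)
```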

Do I need to write anything in post_process.py?

Of course. As I said, it is this code snippet: http://www.daniweb.com/software-development/python/code/374530/post-process-generated-values-with-a-decorator
Here are the contents of post_process.py

# python >= 2.6
from functools import update_wrapper

def post_process(*filters):
    """Decorator to post process a function's return value through a
    sequence of filters (functions with a single argument).

    Example:

        @post_process(f1, f2, f3)
        def f(*args, **kwd):
            ...
            return value

        then calling f(...) will actually return f3( f2( f1( f(...)))).

        This can also be used to convert a generator to a function
        returning a sequence type:

        @post_process(dict)
        def my_generator():
            ...
            yield key, value

    """

    def decorate(func):
        def wrapper(*args, **kwd):
            rv = func(*args, **kwd)
            for f in filters:
                rv = f(rv)
            return rv
        update_wrapper(wrapper, func)
        return wrapper
    return decorate
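
Putting the pieces together, here is a hedged end-to-end sketch of how the gathered list might be pushed into a table. It repeats the decorator so that it runs on its own, uses a hand-written dictionary in place of the parsed file, and writes to an in-memory SQLite database. The table name meeting and its columns are assumptions made up for this example, not a recommended schema.

```python
import sqlite3
from functools import update_wrapper


def post_process(*filters):
    """Copy of the decorator above, repeated so this sketch is self-contained."""
    def decorate(func):
        def wrapper(*args, **kwd):
            rv = func(*args, **kwd)
            for f in filters:
                rv = f(rv)
            return rv
        update_wrapper(wrapper, func)
        return wrapper
    return decorate


@post_process(list)
def gather(xml_doc, paths):
    for p in paths:
        node = xml_doc
        for word in p:
            node = node[word]
        yield node


# Stand-in for the result of xmltodict.parse(...) (assumption).
xml_doc = {'meeting': {'@id': '35504', '@venue': 'Hawkesbury'}}

row = gather(xml_doc, (('meeting', '@id'), ('meeting', '@venue')))

# Hypothetical table and column names, for illustration only.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE meeting (id TEXT, venue TEXT)')
# Placeholders (?) let sqlite3 handle quoting of the values.
conn.execute('INSERT INTO meeting (id, venue) VALUES (?, ?)', row)
conn.commit()

stored = conn.execute('SELECT id, venue FROM meeting').fetchall()
print(stored)
conn.close()
```

The same list of paths can be reused for every imported file, so each import is filtered the same way before it reaches the database.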