Hi, I am trying to parse a large XML file and print the tag values to an output file. I am using minidom. My code works fine for files up to 30 MB, but with anything larger I get a memory error. So I tried reading the file in buffered chunks, but I am unable to get the desired output.

XML File

<File>
<TV>Sony</TV>
<FOOD>Burger</FOOD>
<PHONE>Apple</PHONE>
</File>
<File>
<TV>Samsung</TV>
<FOOD>Pizza</FOOD>
<PHONE>HTC</PHONE>
</File>

Desired Output

Sony,Burger,Apple
Samsung,Pizza,HTC

When reading through a buffer, it gives me output like this:

Sony,Bur
Samsung,Pizza,HT

code:

while 1:
    content = File.read(2048)
    if not len(content):
        break
    for lines in StringIO(content):
        lines = lines.lstrip(' ')
        if lines.startswith("<TV>"):
            TV =  lines.strip("<TV>")
            tvVal = TV.split("</TV>")[0]
            #print tvVal
            w2.writelines(str(tvVal)+",")
        elif lines.startswith("<FOOD>"):
            FOOD =  lines.strip("<FOOD>")
            foodVal = FOOD.split("</FOOD>")[0]
            #print foodVal
            w2.writelines(str(foodVal)+",")
            ............................
            ...........................

I am trying with seek(), but if a buffer ends in the middle of a value (for example at "Bur"), I still cannot get the desired output. Thanks in advance.


The memory error probably came from minidom trying to build a whole DOM tree in memory. This has nothing to do with the way the file is read. I think a good solution is to use a SAX parser, which doesn't store a tree. It is very easy to do: write a subclass of ContentHandler and call the parse() function:

import xml.sax

class MyHandler(xml.sax.ContentHandler):
    def startElement(self, name, attrs):
        # this is called when an opening tag such as <TV> or <FOOD> is met
        pass  # some code
    def endElement(self, name):
        # called on a closing tag such as </TV> or </FOOD>
        pass  # some code
    def characters(self, content):
        # some text was read; one text node may arrive in several calls
        pass  # some code
    # ... other methods, if needed

with open('myfile') as ifh:
    xml.sax.parse(ifh, MyHandler())
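
To make the skeleton above concrete, here is a minimal sketch of a handler that produces the comma-separated output asked for in the question. It rests on a few assumptions: the class name RecordHandler and the wrapper element <Files> are made up for illustration (the sample as posted has two top-level <File> elements, which a SAX parser would reject as a single document, so I am assuming the real file has one root around them), and the sample data is fed from memory rather than from a file.

```python
import io
import xml.sax

class RecordHandler(xml.sax.ContentHandler):
    """Collect the text of each child of <File> and write one line per record."""

    def __init__(self, out):
        super().__init__()
        self.out = out        # any object with a write() method
        self.values = []      # values collected for the current <File>
        self.text = None      # text chunks of the current child, or None

    def startElement(self, name, attrs):
        if name == "File":
            self.values = []
        elif name != "Files":  # "Files" is the assumed wrapper root
            self.text = []

    def characters(self, content):
        # May be called several times for one text node, so accumulate.
        if self.text is not None:
            self.text.append(content)

    def endElement(self, name):
        if name == "File":
            self.out.write(",".join(self.values) + "\n")
        elif self.text is not None:
            self.values.append("".join(self.text).strip())
            self.text = None

# Sample data from the question, wrapped in the assumed <Files> root.
sample = b"""<Files>
<File><TV>Sony</TV><FOOD>Burger</FOOD><PHONE>Apple</PHONE></File>
<File><TV>Samsung</TV><FOOD>Pizza</FOOD><PHONE>HTC</PHONE></File>
</Files>"""

out = io.StringIO()
xml.sax.parse(io.BytesIO(sample), RecordHandler(out))
print(out.getvalue(), end="")
```

For a real file, pass the open file object (or the file name) to xml.sax.parse() and an open output file as out. Only the current record is kept in memory, so the file size should not matter much.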

Can anyone help me with this question using ElementTree (iterparse) or lxml?

With lxml, it is as simple as

from lxml import etree
tree = etree.parse('myfile.xml')

but lxml builds a whole tree in memory, which may not fit your large files.

Thanks for your support. I have finally written my code and it is working great; here it is.
lxml is super fast indeed.

from lxml import etree
# iterparse yields (event, element) pairs as the file is read; the default event is 'end'
for event, element in etree.iterparse(the_xml_file):
    if 'TV' in element.tag:
        print(element.text)
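
For anyone who wants the full comma-separated output with iterparse, here is a rough sketch using the standard library's xml.etree.ElementTree instead of lxml. The function name records_to_csv and the wrapper element <Files> are assumptions for illustration (the sample in the question has two top-level <File> elements, so I am assuming the real file has a single root around them). Each record is cleared after it is written, which is what keeps memory use low on large files.

```python
import io
import xml.etree.ElementTree as ET

def records_to_csv(source, out):
    """Write one comma-separated line per <File> element found in source."""
    # "end" events fire when a closing tag is read, so the element is complete.
    for event, element in ET.iterparse(source, events=("end",)):
        if element.tag == "File":
            values = [(child.text or "").strip() for child in element]
            out.write(",".join(values) + "\n")
            element.clear()  # drop the children that were just written

# Sample data from the question, wrapped in the assumed <Files> root.
sample = b"""<Files>
<File><TV>Sony</TV><FOOD>Burger</FOOD><PHONE>Apple</PHONE></File>
<File><TV>Samsung</TV><FOOD>Pizza</FOOD><PHONE>HTC</PHONE></File>
</Files>"""

out = io.StringIO()
records_to_csv(io.BytesIO(sample), out)
print(out.getvalue(), end="")
```

Note that the root element still keeps a reference to each emptied <File>, so memory grows slowly with the number of records; for very large files it may be worth removing the processed elements from the root as well.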