Hi, I am trying to parse a large XML file and print the tags to an output file. I am using minidom. My code works fine for files up to about 30 MB, but with anything larger I get a memory error. So I tried reading the file through a buffer, but I cannot get the desired output.

XML File


Desired Output


When reading through a buffer, it gives me output like this:



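# File is the XML file object, opened earlier
while True:
    content = File.read(2048)
    if not content:
        break  # end of file reached
    # note: a 2048-byte chunk can end in the middle of a tag or value
    for lines in StringIO(content):
        lines = lines.lstrip(' ')
        if lines.startswith("<TV>"):
            # slice off the opening tag; strip("<TV>") would also remove
            # any leading or trailing '<', 'T', 'V', '>' characters of the value
            TV = lines[len("<TV>"):]
            tvVal = TV.split("</TV>")[0]
            #print tvVal
        elif lines.startswith("<FOOD>"):
            FOOD = lines[len("<FOOD>"):]
            foodVal = FOOD.split("</FOOD>")[0]
            #print foodVal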

I tried using seek(), but if the buffer ends partway through a value (for example, it reads only up to "Bur"), I still cannot get the desired output. Thanks in advance.


The memory error probably came from minidom trying to build a whole DOM tree in memory. This has nothing to do with the way the file is read. I think a good solution is to use a SAX parser, which doesn't store a tree. It is very easy to do: write a subclass of ContentHandler and pass an instance of it to xml.sax.parse():

import xml.sax

class MyHandler(xml.sax.ContentHandler):
    def startElement(self, name, attrs):
        # this is called when <TV> or <FOOD> ... are met
        pass  # some code
    def endElement(self, name):
        # called on </TV> or </FOOD> ...
        pass  # some code
    def characters(self, content):
        # some text was read; may be called several times per element
        pass  # some code
    # ... other methods, if needed

# open in binary mode so the parser can detect the encoding itself
with open('myfile', 'rb') as ifh:
    xml.sax.parse(ifh, MyHandler())
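
To make the idea concrete, here is a minimal sketch of a handler that collects the text of the TV and FOOD elements. The XML from the question is not shown in the thread, so the sample document below is invented, and the names TagTextHandler and extract_tags are only illustrative. It assumes the wanted elements are not nested inside each other.

```python
import io
import xml.sax


class TagTextHandler(xml.sax.ContentHandler):
    """Collect the text of selected elements without building a tree."""

    def __init__(self, wanted):
        xml.sax.ContentHandler.__init__(self)
        self.wanted = set(wanted)  # tag names to capture
        self.current = None        # tag currently being captured, if any
        self.parts = []            # text chunks of the current element
        self.values = []           # (tag, text) pairs found so far

    def startElement(self, name, attrs):
        if name in self.wanted:
            self.current = name
            self.parts = []

    def characters(self, content):
        # The parser may deliver one element's text in several pieces,
        # so accumulate the pieces instead of overwriting.
        if self.current is not None:
            self.parts.append(content)

    def endElement(self, name):
        if name == self.current:
            self.values.append((name, "".join(self.parts).strip()))
            self.current = None


def extract_tags(source, wanted=("TV", "FOOD")):
    """source may be a file name or an open (binary) file object."""
    handler = TagTextHandler(wanted)
    xml.sax.parse(source, handler)
    return handler.values


# Invented sample data standing in for the real file.
sample = b"<ROOT><ITEM><TV>Sony</TV><FOOD>Burger</FOOD></ITEM></ROOT>"
print(extract_tags(io.BytesIO(sample)))
```

For a really large file you would probably write each value to the output file inside endElement rather than keep them all in the values list.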

Can anyone help me with this question using ElementTree (iterparse) or lxml?


With lxml, it is as simple as

from lxml import etree
tree = etree.parse('myfile.xml')

but lxml builds a whole tree in memory, which may not fit your large files.
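
For large files, an incremental approach with iterparse may fit better. Below is a rough sketch using the standard library's xml.etree.ElementTree (lxml's etree.iterparse is used in a similar way). The sample XML is invented, since the real file is not shown in this thread, and iter_tag_text is just an illustrative name.

```python
import io
import xml.etree.ElementTree as ET


def iter_tag_text(source, wanted=("TV", "FOOD")):
    """Yield (tag, text) for each wanted element as the file is parsed."""
    # "end" events fire once an element has been read completely,
    # so element.text is available at that point.
    for event, element in ET.iterparse(source, events=("end",)):
        if element.tag in wanted:
            yield element.tag, (element.text or "").strip()
        # Discard the element's text and children once handled. Empty
        # element shells still stay attached to their parents, so memory
        # use is reduced a lot but is not strictly constant.
        element.clear()


# Invented sample data standing in for the real file.
sample = b"<ROOT><ITEM><TV>Sony</TV><FOOD>Burger</FOOD></ITEM></ROOT>"
for tag, text in iter_tag_text(io.BytesIO(sample)):
    print(tag, text)
```

iterparse also accepts a file name instead of a file object, so the same function should work directly on a file on disk.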


Thanks for your support. I have finally written my code and it is working great; lxml is super fast indeed. Here it is:

from lxml import etree

for event, element in etree.iterparse(the_xml_file):
    if 'TV' in element.tag:
        print(element.text)
    # release the element's content once processed to keep memory use low
    element.clear()