Hey guys,

I'm parsing some XML using minidom and whenever a comment has a "--" within it, I get an ExpatError.

For example, a file may be like this:

<Label> Hello!</Label>
<!-- The above label says Hello.
  -- It is clear, no?  Let's try spicing it up a bit.
  -- Add some color to it.
-->
<Label color="#FF0000">HI I'M RED-Y.</Label>

When I run it through the XML parser, I get an ExpatError. Is there any way I could have it simply ignore the comments and continue parsing the rest of the file and run the script as it's supposed to?

Thanks for the help!

Using the minidom module, you can find the type of each node.

Simply use node.__class__.__name__.lower() to get the node type. A comment node will have the value "comment".
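
For what it's worth, here is a minimal sketch of that idea using the nodeType constant instead of the class name. The wrapping Labels root element, the sample text, and the comment_texts helper are my own assumptions for illustration. Note that this only helps once the document parses, so it won't get around an ExpatError raised by a "--" inside a comment.

```python
from xml.dom import minidom, Node

# Hypothetical sample document; the comment deliberately contains no "--".
SAMPLE = """<Labels>
<Label> Hello!</Label>
<!-- The above label says Hello. -->
<Label color="#FF0000">HI I'M RED-Y.</Label>
</Labels>"""


def comment_texts(xml_text):
    """Return the text of every comment directly under the root element."""
    doc = minidom.parseString(xml_text)
    return [child.data.strip()
            for child in doc.documentElement.childNodes
            if child.nodeType == Node.COMMENT_NODE]


print(comment_texts(SAMPLE))
```

Checking child.__class__.__name__.lower() == "comment" should select the same nodes; the constant is just less dependent on class naming.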

Thanks for your help.

Can you provide a code example of what you're talking about?

Let's say that I have the XML in a file and here's the code I use:

from xml.dom import minidom
inputFilePath = 'C:\\XML\\myXML.xml'
openedFile = open(inputFilePath,'r')
#Parse the XML file
print inputFilePath
xmldoc = minidom.parse(openedFile)

The error occurs when it tries to parse. So how do I have it ignore the comments before it tries to parse?
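
To see where the parser gives up, a small sketch like the one below may help. It assumes the labels are wrapped in a single root element (I made up Labels, since minidom wants exactly one root) and uses a hypothetical try_parse helper that catches the ExpatError instead of letting the script die.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical input: same shape as the file in the question, plus a root element.
BAD_XML = """<Labels>
<Label> Hello!</Label>
<!-- The above label says Hello.
  -- It is clear, no?
-->
</Labels>"""


def try_parse(xml_text):
    """Return (document, None) on success or (None, error) on failure."""
    try:
        return minidom.parseString(xml_text), None
    except ExpatError as err:
        return None, err


doc, err = try_parse(BAD_XML)
if err is not None:
    # lineno and offset are attributes of ExpatError that say where expat stopped.
    print("Parse failed at line %d, column %d: %s" % (err.lineno, err.offset, err))
```

This does not make the parser skip the comment; it only reports the position, so the offending text still has to be fixed before parsing.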

The parser should skip the comments.
I did a test with BeautifulSoup (3.0.8); don't use a newer version.

xml = """
<Label> Hello!</Label>
<!-- The above label says Hello.
  -- It is clear, no?  Let's try spicing it up a bit.
  -- Add some color to it.
-->
<Label color="#FF0000">HI I'M RED-Y.</Label>
"""

from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(xml)
matches = soup.findAll('label')
for i,match in enumerate(matches):
    print i, match.text


'''Out-->
0 Hello!
1 HI I'M RED-Y.
'''

I'm trying to use Python's own minidom.

What is BeautifulSoup? (ImportError: No module named BeautifulSoup) How do I get it, where do I learn it, and how can I package everything so my users don't have to install it if they only have python?

Thanks!

Thanks for the links!

I'm confused about one thing.
I have to install BeautifulSoup and add it to the library path to use it.

When I send MyProject.py to my users, how will they be able to run it without having BeautifulSoup...?

You just have to put it, or ask someone to put it, in a folder on the Python path.

>>> import sys
>>> sys.path #list path

Put the one file, BeautifulSoup.py, in one of those paths and it will work.
The normal place for third-party modules is this folder:
Python26\Lib\site-packages
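
If you want the script to check for the module before using it, something along these lines might work. It is only a sketch and assumes Python 3.4 or newer (importlib.util is not available in the Python 2.6 mentioned above); is_importable is a hypothetical helper name.

```python
import importlib.util
import sysconfig


def is_importable(module_name):
    """Return True if Python can find a top-level module with this name."""
    return importlib.util.find_spec(module_name) is not None


# Folder where third-party pure-Python modules normally live.
print(sysconfig.get_paths()["purelib"])

# Check for the module before relying on it.
if not is_importable("BeautifulSoup"):
    print("BeautifulSoup.py was not found on sys.path")
```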

I am afraid you are out of luck. The W3C XML specification asserts that the string "--" (double-hyphen) MUST NOT occur within comments.
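
To find which comments in a file break that rule, a rough check could look like this. It is a sketch that assumes comments can be located with a simple regular expression, which ignores CDATA sections and anything else where "<!--" might appear literally; noncompliant_comments is a name I made up.

```python
import re

# Non-greedy match from "<!--" to the first "-->", across line breaks.
COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)


def noncompliant_comments(xml_text):
    """Return the bodies of comments that contain "--" or end with "-"."""
    return [body for body in COMMENT_RE.findall(xml_text)
            if "--" in body or body.endswith("-")]


print(noncompliant_comments("<a><!-- x -- y --><!-- fine --></a>"))
```

A body that ends with a hyphen is flagged too, because it would make the comment close with "--->", which a strict parser also rejects.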

You can see that my script using BeautifulSoup can deal with "--" without a problem.
I agree that using "--" inside a comment is not the best thing to do.

I hope we don't get into a holy war.
I just want to emphasize that a bug report about the double hyphen was filed against expat and rejected because of the restriction I quoted above. Now, the restriction is uncalled for, counter-intuitive, and even stupid, but it is part of the standard. Which makes BeautifulSoup non-compliant.
Funny, isn't it?

Thank you again for the help.

Most of my users are not very tech-savvy. Is there any way I can have the script automatically go out to the download page, and add it to the correct directory on the client's computer?

Why not clean up the XML comment with:

xml='''<Label> Hello!</Label>
<!-- The above label says Hello.
  -- It is clear, no?  Let's try spicing it up a bit.
  -- Add some color to it.
-->
<Label color="#FF0000">HI I'M RED-Y.</Label>'''
print xml.replace(' -- ', ' - ')

How would I do that for an XML file? This is what I tried:

openedFile = open(inputFilePath,'r')
xml = openedFile.readlines()
print xml.replace(' -- ', ' - ')
#Parse the XML file
print inputFilePath
#xmldoc = minidom.parse(openedFile)
xmldoc = minidom.parseString(xml)

It threw an exception because readlines() returns a list, not a string. How can I replace everything in my XML file with that and then run it through the parser?
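
One way this could be done is to use read(), which returns the whole file as a single string, and hand the cleaned string to parseString. The sketch below is written for Python 3 and makes a few assumptions: the file has a single root element (I added Labels), parse_with_cleanup is a hypothetical helper, and a temporary file stands in for the real inputFilePath.

```python
import os
import tempfile
from xml.dom import minidom


def parse_with_cleanup(path):
    """Read the whole file as one string, soften ' -- ', then parse it."""
    with open(path, "r") as opened_file:
        xml_text = opened_file.read()  # read() returns a single string
    xml_text = xml_text.replace(" -- ", " - ")
    return minidom.parseString(xml_text)


# Demo with a throwaway file standing in for the real XML file.
sample = """<Labels>
<Label> Hello!</Label>
<!-- The above label says Hello.
  -- It is clear, no?
-->
<Label color="#FF0000">HI I'M RED-Y.</Label>
</Labels>"""

with tempfile.NamedTemporaryFile("w", suffix=".xml", delete=False) as tmp:
    tmp.write(sample)
try:
    doc = parse_with_cleanup(tmp.name)
    labels = doc.getElementsByTagName("Label")
    print(len(labels))
finally:
    os.remove(tmp.name)
```

Keep in mind that replace() here touches every ' -- ' in the file, including any in the label text, not only those inside comments.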

Thanks for the help!!!

Thanks!
Some people wrote comments like this:

<!--
--Test
-->

The -- has no space before or after... If I try to replace '--' with '-', I get an ExpatError again because now it's not a properly formatted comment. Is there any way to say something like if the '--' is not a part of '<!--' or '-->', then replace it?

Any help is appreciated. Thanks!

You need something a bit more complicated then. How about:

xml='''<Label> Hello!</Label>
<!-- The above label says Hello.
  -- It is clear, no?  Let's try spicing it up a bit.
  -- Add some color to it.
-->
<Label color="#FF0000">HI I'M RED-Y.</Label>
<!--
--Test
-->'''
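start, end = '<!--', '-->'
result = ''

# Walk the text one comment at a time. partition() returns an empty
# separator once no more '<!--' is found, which ends the loop.
while start:
    # clean: text before the comment, xml: everything after '<!--'
    clean, start, xml = xml.partition(start)
    if start:
        clean += start
        comment, tag, xml = xml.partition(end)
        if not tag:
            raise SyntaxError('Missing %s tag in XML file' % end)
        # Replace '--' only inside the comment body, then close the comment.
        clean += comment.replace('--', '-') + end
    result += clean
xml = result + xml

print xml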
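
A possible variation on the same idea, sketched with a regular expression: it collapses runs of three or more hyphens as well (a plain replace('--', '-') turns '---' into '--') and pads a body that ends in a hyphen so the comment does not close with '--->'. The helper names are made up, and the pattern assumes '<!--' never appears outside a real comment (for example inside CDATA).

```python
import re
from xml.dom import minidom

COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)


def _clean_body(match):
    """Rewrite one comment so its body contains no double hyphen."""
    body = re.sub(r"-{2,}", "-", match.group(1))  # collapse runs of hyphens
    if body.endswith("-"):
        body += " "  # keep the comment from ending in "--->"
    return "<!--" + body + "-->"


def clean_comments(xml_text):
    """Return xml_text with every comment body made parser-friendly."""
    return COMMENT_RE.sub(_clean_body, xml_text)


cleaned = clean_comments("<Labels><!--\n--Test\n--><Label>Hi</Label></Labels>")
print(cleaned)
doc = minidom.parseString(cleaned)  # parses once the comment is cleaned
```

An unterminated comment simply does not match the pattern here, so it is left alone and the parser will still complain about it.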

Why not try BeautifulSoup?

xml = '''
<Label> Hello!</Label>
<!-- The above label says Hello.
  -- It is clear, no?  Let's try spicing it up a bit.
  -- Add some color to it.
-->

#Test
<!--
--Test
-->
<---->
--
<Label>Find me ok?</Label>
<-->
>--<
<-->
------
<Label color="#FF0000">HI I'M RED-Y.</Label>
'''

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(xml)
matches = soup.findAll('label')
for i,match in enumerate(matches):
    print i, match.text

'''Out-->
0 Hello!
1 Find me ok?
2 HI I'M RED-Y.
'''

I don't know how to package up BeautifulSoup so that my users don't have to download it... If you could tell me how to do that, I think it would probably be better to parse this with BeautifulSoup...

tonyjv, thanks for all of your help! Very helpful...

Some questions:
1) How does while start work? Isn't start a string? I've only seen while with numbers...
2) How does this clean,start,xml = xml.partition(start) syntax work?

1) The while loop stops when the string is false, i.e. an empty string, which is what partition returns for the separator when the search fails.
2) partition splits a string into three strings: the part before the searched string (the whole string if the searched string was not found), the searched string itself (empty if it was not found), and the part of the string after the searched string (also empty if it was not found):

>>> ('http://www.python.org').partition('://')
('http', '://', 'www.python.org')
>>> ('file:/usr/share/doc/index.html').partition('://')
('file:/usr/share/doc/index.html', '', '')
>>> (u'Subject: a quick question').partition(':')
(u'Subject', u':', u' a quick question')
>>> 'www.python.org'.rpartition('.')
('www.python', '.', 'org')
>>> 'www.python.org'.rpartition(':')
('', '', 'www.python.org')
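
The same loop pattern, boiled down to counting occurrences, might make the two points easier to see. This is only an illustrative sketch; count_occurrences is a made-up helper, and it assumes marker is a non-empty string (partition raises ValueError for an empty separator).

```python
def count_occurrences(text, marker):
    """Count how many times marker appears, using the while/partition idiom."""
    count = 0
    found = marker  # non-empty, so the loop body runs at least once
    while found:  # an empty string is false, which ends the loop
        # Tuple unpacking: the three results of partition() land in three names.
        _before, found, text = text.partition(marker)
        if found:
            count += 1
    return count


print(count_occurrences("a<!--b<!--c", "<!--"))
```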

I don't know how to package up BeautifulSoup so that my users don't have to download it... If you could tell me how to do that,

I have told you, it's one file, BeautifulSoup.py.
Put that file in the folder with the XML files or on the Python path.

#xml_parser.py
from bs import BeautifulSoup

xml = open('test.xml').read()
soup = BeautifulSoup(xml)
matches = soup.findAll('label')
for i,match in enumerate(matches):
    print i, match.text

'''Out-->
0 Hello!
1 Find me ok?
2 HI I'M RED-Y.
'''

You can call it whatever you want; just as a test I called it bs.py.
Now bs.py, test.xml, and xml_parser.py are in the same folder.
Then of course it uses the renamed bs.py and not the original BeautifulSoup in my Python path.

>>> import bs
>>> dir(bs)
['BeautifulSOAP', 'BeautifulSoup', 'BeautifulStoneSoup', 'CData', 'Comment', 'DEFAULT_OUTPUT_ENCODING', 'Declaration', 'ICantBelieveItsBeautifulSoup', 'MinimalSoup', 'NavigableString', 'PageElement', 'ProcessingInstruction', 'ResultSet', 'RobustHTMLParser', 'RobustInsanelyWackAssHTMLParser', 'RobustWackAssHTMLParser', 'RobustXMLParser', 'SGMLParseError', 'SGMLParser', 'SimplifyingSOAPParser', 'SoupStrainer', 'StopParsing', 'Tag', 'UnicodeDammit', '__author__', '__builtins__', '__copyright__', '__doc__', '__file__', '__license__', '__name__', '__package__', '__version__', '_match_css_class', 'buildTagMap', 'chardet', 'codecs', 'generators', 'markupbase', 'name2codepoint', 're', 'sgmllib', 'types']
>>> bs.__version__
'3.0.8'
