Ignoring Comments When Parsing XML?

Question

PythonNewbie2 0 Light Poster

14 Years Ago

Hey guys,

I'm parsing some XML using minidom and whenever a comment has a "--" within it, I get an ExpatError.

For example, a file may be like this:

<Label> Hello!</Label>
<!-- The above label says Hello.
  -- It is clear, no?  Let's try spicing it up a bit.
  -- Add some color to it.
-->
<Label color="#FF0000">HI I'M RED-Y.</Label>

When I run it through the XML parser, I get an ExpatError. Is there any way I could have it simply ignore the comments and continue parsing the rest of the file and run the script as it's supposed to?

Thanks for the help!

python xml

snippsat 661 Master Poster

14 Years Ago

A parser should skip the comments.
I ran a test with BeautifulSoup (3.0.8); don't use a newer version.

xml = """
<Label> Hello!</Label>
<!-- The above label says Hello.
  -- It is clear, no?  Let's try spicing it up a bit.
  -- Add some color to it.
-->
<Label color="#FF0000">HI I'M RED-Y.</Label>
"""

from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(xml)
matches = soup.findAll('label')
for i,match in enumerate(matches):
    print i, match.text


'''Out-->
0 Hello!
1 HI I'M RED-Y.
'''

PythonNewbie2 commented: Very helpful person! +1

snippsat 661 Master Poster

14 Years Ago

BeautifulSoup is a well-known Python HTML/XML parser.
http://www.crummy.com/software/BeautifulSoup/
BeautifulSoup is just one file, BeautifulSoup.py.

Built-in parsers like minidom and ElementTree should work.
If not, two of the best alternatives are BeautifulSoup and lxml.
http://codespeak.net/lxml/

snippsat 661 Master Poster

14 Years Ago

You just have to put it (or ask someone to put it) in a folder on the Python path.

>>> import sys
>>> sys.path #list path

Put the one file, BeautifulSoup.py, in one of those paths and it will work.
The usual folder for third-party modules is:
Python26\Lib\site-packages

snippsat 661 Master Poster

14 Years Ago

I am afraid you are out of luck. w3 asserts that the string "--" (double-hyphen) MUST NOT occur within comments

You can see that my script using BeautifulSoup deals with "--" without a problem.
I agree that using "--" inside a comment is not the best thing to do.

TrustyTony 888 ex-Moderator

14 Years Ago

Why not clean up the XML comment with:

xml='''<Label> Hello!</Label>
<!-- The above label says Hello.
  -- It is clear, no?  Let's try spicing it up a bit.
  -- Add some color to it.
-->
<Label color="#FF0000">HI I'M RED-Y.</Label>'''
print xml.replace(' -- ', ' - ')

PythonNewbie2 commented: Awesome idea! +1

ultimatebuster 14 Posting Whiz in Training · Answer 1 · 2010-08-10T19:17:45+00:00

Using the minidom module you can find the type of each node.

Simply use node.__class__.__name__.lower() on a node to get its type. A comment node will have the value "comment".
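A minimal sketch of that node-type check, assuming Python 3 and only the standard library. The sample string, the `<root>` wrapper and the helper name `non_comment_children` are made up for illustration. It only helps once the document parses at all, so it does not get around a comment that the parser rejects.

```python
from xml.dom import Node, minidom

# Hypothetical, well-formed sample: one root element, one ordinary comment.
doc = minidom.parseString(
    "<root><Label>Hello!</Label><!-- a note --><Label>Hi</Label></root>"
)

# The class-name trick from the post: one lowercase name per child node.
names = [child.__class__.__name__.lower()
         for child in doc.documentElement.childNodes]


def non_comment_children(parent):
    """Return the child nodes of parent that are not comment nodes."""
    return [child for child in parent.childNodes
            if child.nodeType != Node.COMMENT_NODE]


kept = non_comment_children(doc.documentElement)
```

Comparing `nodeType` against `Node.COMMENT_NODE` is the documented DOM way to do the same test and does not depend on class names.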

PythonNewbie2 0 Light Poster · Answer 2 · 2010-08-10T21:04:37+00:00

Thanks for your help.

Can you provide a code example of what you're talking about?

Let's say that I have the XML in a file and here's the code I use:

from xml.dom import minidom
inputFilePath = 'C:\\XML\\myXML.xml'  # escape every backslash in the path
openedFile = open(inputFilePath,'r')
#Parse the XML file
print inputFilePath
xmldoc = minidom.parse(openedFile)

The error occurs when it tries to parse. So how do I have it ignore the comments before it tries to parse?

PythonNewbie2 0 Light Poster · Answer 3 · 2010-08-10T23:00:50+00:00

I'm trying to use Python's own minidom.

What is BeautifulSoup? (ImportError: No module named BeautifulSoup) How do I get it, where do I learn it, and how can I package everything so my users don't have to install it if they only have python?

Thanks!

PythonNewbie2 0 Light Poster · Answer 4 · 2010-08-10T23:27:16+00:00

Thanks for the links!

I'm confused about one thing.
I have to install BeautifulSoup and add it to the library path to use it.

When I send MyProject.py to my users, how will they be able to run it without having BeautifulSoup...?

nezachem 616 Practically a Posting Shark · Answer 5 · 2010-08-11T00:09:17+00:00

I am afraid you are out of luck. w3 asserts that the string "--" (double-hyphen) MUST NOT occur within comments.
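A small sketch of that restriction in practice, assuming Python 3 and only the standard library. The two sample strings and the helper name `parses` are made up for illustration; the `<root>` wrapper is there because a well-formed document needs a single root element.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical samples: identical except for the hyphens inside the comment.
single_hyphen = "<root><!-- one - hyphen --><Label>Hello!</Label></root>"
double_hyphen = "<root><!-- two -- hyphens --><Label>Hello!</Label></root>"


def parses(text):
    """Return True if minidom (backed by expat) accepts text."""
    try:
        minidom.parseString(text)
        return True
    except ExpatError:
        return False


single_ok = parses(single_hyphen)
double_ok = parses(double_hyphen)
```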

nezachem 616 Practically a Posting Shark · Answer 6 · 2010-08-11T01:03:59+00:00

I hope we won't get into a holy war.
I just want to emphasize that the bug report about double hyphen was filed against expat, and rejected due to the restriction I quoted above. Now, the restriction is uncalled for, counter-intuitive, and even stupid - but it is a part of the Standard. Which makes BeautifulSoup non-compliant.
Funny, isn't it?

snippsat 661 Master Poster · Answer 7 · 2010-08-11T01:19:55+00:00

No war, thanks for the info, nezachem.

PythonNewbie2 0 Light Poster · Answer 8 · 2010-08-11T01:29:23+00:00

You just have to put it (or ask someone to put it) in a folder on the Python path.
>>> import sys
>>> sys.path #list path
Put the one file, BeautifulSoup.py, in one of those paths and it will work.
The usual folder for third-party modules is:
Python26\Lib\site-packages

Thank you again for the help.

Most of my users are not very tech-savvy. Is there any way I can have the script automatically go out to the download page, and add it to the correct directory on the client's computer?

PythonNewbie2 0 Light Poster · Answer 9 · 2010-08-13T00:39:10+00:00

How would I do that for an XML file? This is what I tried:

openedFile = open(inputFilePath,'r')
xml = openedFile.readlines()
print xml.replace(' -- ', ' - ')
#Parse the XML file
print inputFilePath
#xmldoc = minidom.parse(openedFile)
xmldoc = minidom.parseString(xml)

It threw an exception because readlines() returns a list, not a string. How can I replace everything in my XML file with that and then run it through the parser?

Thanks for the help!!!

TrustyTony 888 ex-Moderator Team Colleague Featured Poster · Answer 10 · 2010-08-13T00:50:12+00:00

Use read() not readlines().
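A minimal sketch of that fix, assuming Python 3 and only the standard library. The helper name `parse_with_single_dashes`, the temporary file and its content are made up for illustration; the replacement is the same ' -- ' to ' - ' substitution suggested earlier in the thread, so it has the same limits.

```python
import os
import tempfile
from xml.dom import minidom


def parse_with_single_dashes(path):
    """Read the whole file as one string, soften ' -- ', then parse it."""
    with open(path, "r") as handle:
        text = handle.read()  # read() returns one str; readlines() a list
    return minidom.parseString(text.replace(" -- ", " - "))


# Demo on a throwaway file (hypothetical content with a single root element).
sample = "<root><!-- first line\n  -- second line\n--><Label>Hi</Label></root>"
fd, demo_path = tempfile.mkstemp(suffix=".xml")
with os.fdopen(fd, "w") as handle:
    handle.write(sample)
try:
    doc = parse_with_single_dashes(demo_path)
    label_count = len(doc.getElementsByTagName("Label"))
finally:
    os.remove(demo_path)
```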

PythonNewbie2 0 Light Poster · Answer 11 · 2010-08-13T03:01:04+00:00

Thanks!
Some people wrote comments like this:

<!--
--Test
-->

The -- has no space before or after... If I try to replace '--' with '-', I get an ExpatError again because now it's not a properly formatted comment. Is there any way to say something like: if the '--' is not a part of '<!--' or '-->', then replace it?

Any help is appreciated. Thanks!

TrustyTony 888 ex-Moderator Team Colleague Featured Poster · Answer 12 · 2010-08-13T04:14:15+00:00

You need something a bit more complicated then. How about:

xml='''<Label> Hello!</Label>
<!-- The above label says Hello.
  -- It is clear, no?  Let's try spicing it up a bit.
  -- Add some color to it.
-->
<Label color="#FF0000">HI I'M RED-Y.</Label>
<!--
--Test
-->'''
start, end = '<!--', '-->'
result = ''

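# Walk through the text one comment at a time
while start:
    # clean = text before the next '<!--'; start becomes '' when none is left
    clean,start,xml = xml.partition(start)
    if start:
        clean += start
        comment,tag,xml = xml.partition(end)
        if not tag:
            raise SyntaxError('Missing %s tag in XML file' % end)
        else:
            # collapse '--' inside the comment body only
            clean += comment.replace('--','-')+end
    result += clean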
xml = result + xml

print xml
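The same idea can also be sketched with a regular expression, assuming Python 3 and only the standard library. This is an alternative, not the code posted above: the names `COMMENT`, `_clean` and `sanitize_comments` are made up, and it goes slightly further by collapsing runs of three or more hyphens and padding a comment body that would otherwise end in a hyphen. It treats every '<!--' as the start of a comment, so a CDATA section containing that text would be altered too.

```python
import re
from xml.dom import minidom

# Non-greedy match of one comment; DOTALL lets the body span several lines.
COMMENT = re.compile(r"<!--(.*?)-->", re.DOTALL)


def _clean(match):
    """Rebuild one comment with every run of hyphens reduced to one."""
    body = re.sub(r"-{2,}", "-", match.group(1))
    if body.endswith("-"):
        body += " "  # keep the body from running into the closing '-->'
    return "<!--" + body + "-->"


def sanitize_comments(text):
    """Return text with '--' removed from inside comments only."""
    return COMMENT.sub(_clean, text)


# Hypothetical sample with a single root element and two awkward comments.
sample = "<root><!--\n--Test\n--><Label>Hi</Label><!-- a --- b ---></root>"
cleaned = sanitize_comments(sample)
doc = minidom.parseString(cleaned)
```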

snippsat 661 Master Poster · Answer 13 · 2010-08-13T09:43:46+00:00

Why not try BeautifulSoup?

xml = '''
<Label> Hello!</Label>
<!-- The above label says Hello.
  -- It is clear, no?  Let's try spicing it up a bit.
  -- Add some color to it.
-->

#Test
<!--
--Test
-->
<---->
--
<Label>Find me ok?</Label>
<-->
>--<
<-->
------
<Label color="#FF0000">HI I'M RED-Y.</Label>
'''

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(xml)
matches = soup.findAll('label')
for i,match in enumerate(matches):
    print i, match.text

'''Out-->
0 Hello!
1 Find me ok?
2 HI I'M RED-Y.
'''

TrustyTony 888 ex-Moderator Team Colleague Featured Poster · Answer 14 · 2010-08-13T12:24:36+00:00

xml = result + xml

print xml

These lines are unnecessary: xml == '' at that point, so print result is enough.

PythonNewbie2 0 Light Poster · Answer 15 · 2010-08-13T23:19:25+00:00

I don't know how to package up BeautifulSoup so that my users don't have to download it... If you could tell me how to do that, I think it would probably be better to parse this with BeautifulSoup...

tonyjv, thanks for all of your help! Very helpful...

Some questions:
1) How does while start work? Isn't start a string? I've only seen while with numbers...
2) How does this clean,start,xml = xml.partition(start) syntax work?

TrustyTony 888 ex-Moderator Team Colleague Featured Poster · Answer 16 · 2010-08-14T01:41:56+00:00

1) while stops when the string is false, i.e. an empty string, which is what partition gives back for the separator when it is not found.
2) partition splits a string into three strings: the part before the searched string (the whole string if it was not found), the searched string itself (empty if it was not found), and the part after the searched string (also empty if it was not found):

>>> ('http://www.python.org').partition('://')
('http', '://', 'www.python.org')
>>> ('file:/usr/share/doc/index.html').partition('://')
('file:/usr/share/doc/index.html', '', '')
>>> (u'Subject: a quick question').partition(':')
(u'Subject', u':', u' a quick question')
>>> 'www.python.org'.rpartition('.')
('www.python', '.', 'org')
>>> 'www.python.org'.rpartition(':')
('', '', 'www.python.org')
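Both points can be seen together in a small sketch, assuming Python 3. The function name `split_on` is made up for illustration; it loops while the remaining text is non-empty and peels off one piece per partition call.

```python
def split_on(text, sep):
    """Collect the pieces of text between occurrences of sep."""
    pieces = []
    while text:  # an empty string is false, so the loop ends by itself
        head, found, text = text.partition(sep)
        pieces.append(head)
    return pieces


pieces = split_on("a,b,,c", ",")
nothing = split_on("", ",")
```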

snippsat 661 Master Poster · Answer 17 · 2010-08-14T02:53:51+00:00

I don't know how to package up BeautifulSoup so that my users don't have to download it... If you could tell me how to do that,

As I said, it's one file, BeautifulSoup.py.
Put that file in the folder with your XML files or on the Python path.

#xml_parser.py
from bs import BeautifulSoup

xml = open('test.xml').read()
soup = BeautifulSoup(xml)
matches = soup.findAll('label')
for i,match in enumerate(matches):
    print i, match.text

'''Out-->
0 Hello!
1 Find me ok?
2 HI I'M RED-Y.
'''

You can call the file whatever you want; just as a test I renamed it bs.py.
Now bs.py, test.xml and xml_parser.py are in the same folder.
The script then of course uses the renamed bs.py and not the original BeautifulSoup.py in my Python path.

>>> import bs
>>> dir(bs)
['BeautifulSOAP', 'BeautifulSoup', 'BeautifulStoneSoup', 'CData', 'Comment', 'DEFAULT_OUTPUT_ENCODING', 'Declaration', 'ICantBelieveItsBeautifulSoup', 'MinimalSoup', 'NavigableString', 'PageElement', 'ProcessingInstruction', 'ResultSet', 'RobustHTMLParser', 'RobustInsanelyWackAssHTMLParser', 'RobustWackAssHTMLParser', 'RobustXMLParser', 'SGMLParseError', 'SGMLParser', 'SimplifyingSOAPParser', 'SoupStrainer', 'StopParsing', 'Tag', 'UnicodeDammit', '__author__', '__builtins__', '__copyright__', '__doc__', '__file__', '__license__', '__name__', '__package__', '__version__', '_match_css_class', 'buildTagMap', 'chardet', 'codecs', 'generators', 'markupbase', 'name2codepoint', 're', 'sgmllib', 'types']
>>> bs.__version__
'3.0.8'
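If the bundled file should live in a subfolder rather than right next to the script, one possible sketch is below, assuming Python 3 and only the standard library. The helper name `add_bundled_dir` and the folder name `libs` are made up for illustration; nothing in the thread prescribes them.

```python
import os
import sys


def add_bundled_dir(base_dir, dirname="libs"):
    """Put base_dir/dirname at the front of sys.path once and return it."""
    bundled = os.path.abspath(os.path.join(base_dir, dirname))
    if bundled not in sys.path:
        sys.path.insert(0, bundled)  # searched before installed packages
    return bundled


# Demo relative to the current working directory (hypothetical layout).
bundled_path = add_bundled_dir(os.getcwd())
```

A script would call `add_bundled_dir(os.path.dirname(os.path.abspath(__file__)))` before importing the bundled module. When a script is run directly, its own folder is already the first entry of sys.path, which is why simply placing the file next to the script works without any of this.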