Two good parsers are BeautifulSoup and lxml.
Can you post part of the XML file and say what information you want out of it?
Then I can show a little about how to parse XML.
snippsat
Practically a Posting Shark
808 posts since Aug 2008
Reputation Points: 353
Solved Threads: 294
Just parsing something for a start; I have not read your task in detail.
The XML you have is not the easiest XML I have seen.
Here I want to take out groupe and Thomson.
from BeautifulSoup import BeautifulStoneSoup
import re
xml = '''\
</VPpart>
<w cat="PONCT" ee="PONCT-W" ei="PONCTW" lemma="," subcat="W">,</w>
- <NP fct="SUJ">
<w cat="D" ee="D-def-ms" ei="Dms" lemma="le" mph="ms" subcat="def">le</w>
<w cat="N" ee="N-C-ms" ei="NCms" lemma="groupe" mph="ms" subcat="C">groupe</w>
<w cat="N" ee="N-P-ms" ei="NPms" lemma="Thomson" mph="ms" subcat="P">Thomson</w>
</NP>
- <VN>
<w cat="V" ee="V--P3s" ei="VP3s" lemma="avoir" mph="P3s" subcat="">a</w>
<w cat="V" ee="V--Kms" ei="VKms" lemma="informer" mph="Kms" subcat="">informé</w>
</VN>'''
soup = BeautifulStoneSoup(xml)
# Match only <w> tags whose subcat attribute is exactly "C" or "P"
tags = soup.findAll('w', subcat=re.compile(r'^[CP]$'))
print [tag.string for tag in tags]  #--> [u'groupe', u'Thomson']
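The code above is Python 2 with BeautifulSoup 3. If neither is at hand, the same idea can be sketched in Python 3 with only the standard library. This is a rough illustration, not a replacement for BeautifulSoup or lxml: it assumes the forgiving html.parser module is good enough for this fragment, and the names WordCollector and XML_FRAGMENT are made up for the example.

```python
from html.parser import HTMLParser

# Shortened copy of the fragment from the post, including the stray closing tag.
XML_FRAGMENT = '''\
</VPpart>
<w cat="PONCT" ee="PONCT-W" ei="PONCTW" lemma="," subcat="W">,</w>
<NP fct="SUJ">
<w cat="D" ee="D-def-ms" ei="Dms" lemma="le" mph="ms" subcat="def">le</w>
<w cat="N" ee="N-C-ms" ei="NCms" lemma="groupe" mph="ms" subcat="C">groupe</w>
<w cat="N" ee="N-P-ms" ei="NPms" lemma="Thomson" mph="ms" subcat="P">Thomson</w>
</NP>
<VN>
<w cat="V" ee="V--P3s" ei="VP3s" lemma="avoir" mph="P3s" subcat="">a</w>
</VN>'''


class WordCollector(HTMLParser):
    """Hypothetical helper: collect the text of <w> tags with a wanted subcat."""

    def __init__(self, wanted_subcats):
        super().__init__()
        self.wanted = set(wanted_subcats)
        self.words = []
        self._capturing = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names; values are kept as-is.
        if tag == "w" and dict(attrs).get("subcat") in self.wanted:
            self._capturing = True
            self.words.append("")

    def handle_data(self, data):
        # Text may arrive in several pieces, so append to the current word.
        if self._capturing:
            self.words[-1] += data

    def handle_endtag(self, tag):
        if tag == "w":
            self._capturing = False


collector = WordCollector({"C", "P"})
collector.feed(XML_FRAGMENT)
collector.close()
print(collector.words)
```

The parser does not complain about the unmatched closing tag or the missing root element; it just reports tags as it meets them.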
snippsat
A tilde is a valid file name in Windows, and I do not think that is what you meant. It is quite common to keep the script and the work file in the same directory; then you can use just the file name without the directory part. Otherwise, give the proper name of a directory that exists (you can check with dir directory\path\to\the\file\ at the CMD prompt).
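A small sketch of that advice in Python 3, in case it helps. The function name locate_input and the file name corpus.xml are only placeholders; the idea is to default to the current working directory, expand a leading tilde explicitly, and fail early if the directory does not exist.

```python
import os


def locate_input(filename, directory=None):
    """Hypothetical helper: build an absolute path and check the directory first."""
    if directory is None:
        # No directory given: use the directory the script is run from.
        directory = os.getcwd()
    # Python does not expand "~" on its own; expanduser does it explicitly.
    directory = os.path.expanduser(directory)
    if not os.path.isdir(directory):
        raise FileNotFoundError("no such directory: %s" % directory)
    return os.path.abspath(os.path.join(directory, filename))


print(locate_input("corpus.xml"))
```

Note that this only checks the directory, not the file itself, so a wrong file name would still show up later when the file is opened.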
pyTony
pyMod
5,359 posts since Apr 2010
Reputation Points: 782
Solved Threads: 852
The problem with a parser like xml.dom is that the XML must be perfect.
Parsers like BeautifulSoup and lxml can handle XML/HTML even if it is not correct.
From the BeautifulSoup web page:
You didn't write that awful page.
You're just trying to get some data out of it. Right now, you don't really care what HTML is supposed to look like.
Neither does this parser.
snippsat
Tomboy, Python has BeautifulSoup and lxml, which are free Python tools and can parse XML files even when they do not validate.
As you can see in my example above, it works fine; there is no need to google for non-Python tools.
snippsat
Thanks for stating the obvious. We sometimes get too caught up in problem solving and forget to mention good techniques. Data validation should be the first step.
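For what it is worth, a validation-first step could look roughly like this in Python 3. It only checks well-formedness with the standard library's strict parser (not validity against a DTD or schema), and check_well_formed is a name invented for this sketch.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Hypothetical helper: return (ok, message) for a string of XML."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        # ParseError carries the (line, column) where the parser gave up.
        line, column = err.position
        return False, "line %d, column %d: %s" % (line, column, err)
    return True, "ok"


print(check_well_formed('<NP fct="SUJ"><w subcat="C">groupe</w></NP>'))
print(check_well_formed('<NP fct="SUJ"><w subcat="C">groupe</w>'))
```

If the check fails, the message says where the strict parser stopped, which is usually enough to decide whether to fix the file or fall back to a forgiving parser.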
woooee
Nearly a Posting Maven
2,454 posts since Dec 2006
Reputation Points: 777
Solved Threads: 714
These are not rules hammered in stone, but the PEP 8 format suggestions are generally good to follow, in addition to the usual ordering: imports, global values, definitions, and last the main code (sometimes only a call to the main function).
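A bare-bones Python 3 skeleton of that ordering, just as an illustration. The names WORD_PATTERN, SAMPLE_TEXT, count_words and main are invented for the sketch and do not come from anyone's script in this thread.

```python
"""Hypothetical script showing the usual layout order."""

# 1. Imports
import re

# 2. Global values (constants in capitals, as PEP 8 suggests)
WORD_PATTERN = re.compile(r"\w+")
SAMPLE_TEXT = "le groupe Thomson"


# 3. Definitions
def count_words(text):
    """Return the number of word-like tokens in text."""
    return len(WORD_PATTERN.findall(text))


def main():
    print(count_words(SAMPLE_TEXT))


# 4. Main code last, here only a call to the main function
if __name__ == "__main__":
    main()
```

Keeping the main code behind the __name__ check also means the file can be imported from another script without running anything.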
(Next time, feel free to start a new thread for any new questions, and do not forget to mark solved threads as solved.)
pyTony