parsing an XML file

Question

MaYouSHka 0 Newbie Poster

14 Years Ago

Hi all,
I have been working for two months on a project and have come up with an algorithm that
is a mix of R. Mitkov's anaphora resolution algorithm (the robust, knowledge-poor approach) and
several filters that are first applied to the XML file (a POS-tagged text) to eliminate the expletive (impersonal, non-anaphoric) "it", e.g. "it rains".
(I am working on a French text.)
My algorithm is structured as follows:

Input: XML text with POS tagging

Step 1: search for pronouns that are tagged "CL" or "PRO" and tag them as
"expletive" or "non-expletive" by applying a list of rules.

Step 2: search for the pronouns tagged "non-expletive" and
search for their antecedents by applying Mitkov's algorithm (10 rules
in all), which assigns scores to each candidate antecedent
within a distance of 4 sentences (if it is a personal pronoun),
or within the same phrase (if the pronoun is reflexive or possessive).

Output: a list in which every line contains the pronoun, its position
in the XML text (sentence number), the chosen antecedent, and the number
of the sentence in which it was found.

So I am wondering how to proceed with applying the scores.
Should I build a matrix for the antecedents of every pronoun (if that is
possible in Python)?
Every candidate antecedent will be given several scores,
and at the end of the antecedent search for each non-expletive pronoun
I have to add up the scores given to every candidate and pick
the one with the best total.

Thank you for any advice you can give me on implementing this
algorithm.
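One possible way to hold the scores, as a rough sketch rather than a finished design: a plain dictionary per pronoun, mapping each candidate antecedent to a list of rule scores, can play the role of the "matrix". The rule names, score values and candidate words below are invented for illustration and are not Mitkov's actual indicators. The code is written for Python 3.

```python
# Hypothetical rule names: one column per rule, one row per candidate antecedent.
RULES = ["definiteness", "distance", "repetition"]

def best_antecedent(score_table):
    """Return (candidate, total) for the candidate with the highest summed score."""
    totals = {cand: sum(scores) for cand, scores in score_table.items()}
    best = max(totals, key=totals.get)
    return best, totals[best]

# Invented scores for the candidates of a single pronoun, in the order given by RULES.
scores_for_il = {
    "groupe Thomson": [1, 2, 1],
    "rachat": [1, 1, 0],
    "fabricant": [1, 1, 0],
}

candidate, total = best_antecedent(scores_for_il)
print(candidate, total)
```

With ties, `max` simply returns the first candidate it meets, so a real implementation would probably need an explicit tie-breaking rule.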

P.S.
I have already read the two previous discussions about XML parsing,
but I was told that not all parsers allow detailed searches in the XML file. For example, if I want the
child of a node (the direct child, or the second child), this would be possible with certain parsers
and not with others.
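On the point about reaching the direct child or the second child of a node: the standard-library `xml.etree.ElementTree` module allows this by indexing an element like a list. The following is only a small sketch for Python 3, assuming well-formed input; the fragment is a shortened, made-up version of the tagged text.

```python
import xml.etree.ElementTree as ET

# Made-up fragment in the style of the tagged corpus.
fragment = """
<NP fct="SUJ">
  <w cat="D" lemma="le">le</w>
  <w cat="N" lemma="groupe">groupe</w>
  <w cat="N" lemma="Thomson">Thomson</w>
</NP>
"""

np_node = ET.fromstring(fragment)

first_child = np_node[0]    # direct children are indexed like a list
second_child = np_node[1]
# findall() with a bare tag name looks at direct children only;
# the [@cat='N'] predicate filters them by attribute value.
direct_nouns = np_node.findall("w[@cat='N']")

print(first_child.text, second_child.text)
print([w.get("lemma") for w in direct_nouns])
```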

algorithm python xml


snippsat 661 Master Poster

14 Years Ago

Two good parsers are BeautifulSoup and lxml.
Can you post part of the XML file and say what information you want out of it?
Then maybe I can show a little about how to parse XML.

snippsat 661 Master Poster

14 Years Ago

Just to parse something for a start; I have not read about your task in detail.
The XML you have is not the easiest XML I have seen.
Here I want to take out "groupe" and "Thomson".

from BeautifulSoup import BeautifulStoneSoup
import re

xml = '''\
</VPpart>
<w cat="PONCT" ee="PONCT-W" ei="PONCTW" lemma="," subcat="W">,</w>
- <NP fct="SUJ">
<w cat="D" ee="D-def-ms" ei="Dms" lemma="le" mph="ms" subcat="def">le</w>
<w cat="N" ee="N-C-ms" ei="NCms" lemma="groupe" mph="ms" subcat="C">groupe</w>
<w cat="N" ee="N-P-ms" ei="NPms" lemma="Thomson" mph="ms" subcat="P">Thomson</w>
</NP>
- <VN>
<w cat="V" ee="V--P3s" ei="VP3s" lemma="avoir" mph="P3s" subcat="">a</w>
<w cat="V" ee="V--Kms" ei="VKms" lemma="informer" mph="Kms" subcat="">informé</w>
</VN>'''

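soup = BeautifulStoneSoup(xml)
# Select the <w> elements whose subcat attribute is exactly "C" or "P".
# A compiled pattern states this directly, instead of collecting every
# "C" and "P" character found anywhere in the document.
tags = soup.findAll('w', subcat=re.compile(r'^[CP]$'))

print([t.string for t in tags]) #--> [u'groupe', u'Thomson']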


snippsat 661 Master Poster

14 Years Ago

The problem with a parser like xml.dom is that the XML must be perfect (well-formed).
Parsers like BeautifulSoup and lxml can handle XML/HTML even if it is not correct.
From the BeautifulSoup web page:

You didn't write that awful page.
You're just trying to get some data out of it. Right now, you don't really care what HTML is supposed to look like.
Neither does this parser.
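To illustrate the strictness mentioned above, here is a minimal Python 3 sketch that checks whether a piece of text is accepted by `xml.dom.minidom`, which relies on the strict expat parser. The helper name `is_well_formed` and both sample strings are made up for the example.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

def is_well_formed(xml_text):
    """Return True if the strict expat parser (used by minidom) accepts the text."""
    try:
        minidom.parseString(xml_text)
        return True
    except ExpatError:
        return False

good = "<SENT nb='1'><w cat='N'>groupe</w></SENT>"
bad = "</VPpart><w cat='N'>groupe</w>"   # starts with a stray closing tag

print(is_well_formed(good))
print(is_well_formed(bad))
```

A fragment cut out of the middle of a file, like the second string, is rejected by the strict parser even though a lenient parser may still extract the words from it.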


woooee 814 Nearly a Posting Maven

14 Years Ago

Thanks for stating the obvious. We sometimes get too caught up in problem solving and forget to mention good techniques. Data validation should be the first step.


MaYouSHka 0 Newbie Poster · Answer 1 · 2011-05-08T20:06:29+00:00

Hi,
thanks for your quick reply.

My XML file is made up of sentences. Each sentence has a number, and I need the sentence number to locate the pronoun and the antecedent.

First I have to locate the personal pronouns tagged "CL" or "PRO" in my XML file.
Then I need to tag or classify them as anaphoric or non-anaphoric by applying several filters, which are rules.
As output I need
all the pronouns, both the anaphoric and the non-anaphoric ones,
e.g. pronoun, location (sentence number), anaphoric, antecedent, location of the antecedent (sentence number).
So my result is in list form, with one entry for every pronoun.

But:
1- I don't know whether I should create a new XML file with a tag for the pronouns (anaphoric or non-anaphoric) OR insert a tag into my present XML file. The step of distinguishing anaphoric from non-anaphoric pronouns is essential, because I want to show in my output that my program recognized the non-anaphoric ones, but the rest of the algorithm is applied only to the anaphoric ones.
2- I don't know how to store the scores for every pronoun. If I find an anaphoric pronoun with 4 or more candidate antecedents, I need to apply, say, 9 or 10 scores to each and then choose and print the antecedent with the highest score.
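For point 1, one option (a sketch only, not the only reasonable design) is to add an attribute to each pronoun element in memory and then write the tree out to a new file, so the original file stays untouched. The attribute name `anaphoric`, the toy rule in `looks_expletive` and the two-sentence sample are all assumptions made for this Python 3 example; the real filters would replace the toy rule.

```python
import xml.etree.ElementTree as ET

# Made-up two-sentence sample; real sentences are much more deeply nested.
source = """
<text>
  <SENT nb="1"><w cat="CL" lemma="il">il</w><w cat="V" lemma="pleuvoir">pleut</w></SENT>
  <SENT nb="2"><w cat="N" lemma="groupe">groupe</w><w cat="CL" lemma="il">il</w><w cat="V" lemma="avoir">a</w></SENT>
</text>
"""

def looks_expletive(sentence):
    """Toy stand-in for the real filter rules: the sentence contains a weather verb."""
    weather_verbs = {"pleuvoir", "neiger"}
    return any(w.get("lemma") in weather_verbs for w in sentence.iter("w"))

root = ET.fromstring(source)
for sent in root.iter("SENT"):
    for w in sent.iter("w"):
        if w.get("cat") in ("CL", "PRO"):
            # Record the decision on the element itself (hypothetical attribute name).
            w.set("anaphoric", "no" if looks_expletive(sent) else "yes")

# Only the pronouns marked anaphoric go on to the antecedent search.
anaphoric = [(s.get("nb"), w.text)
             for s in root.iter("SENT")
             for w in s.iter("w") if w.get("anaphoric") == "yes"]
print(anaphoric)

# To save the tagged version under a new name:
# ET.ElementTree(root).write("tagged.xml", encoding="utf-8")
```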

In this sentence I have, for instance, "il", marked as part of a verb phrase 'VN' with the tag ' <w cat="CL" '. Once I find this, I look for antecedents (I go back upwards searching for nouns) within a distance of 4 sentences.
Here the correct antecedent is found in the same sentence: it is "le groupe Thomson".

- <SENT nb="525">
- <VPpart fct="MOD">
- <VN>
- <w cat="V" compound="yes" ee="V--G" ei="VG" lemma="mettre un terme" mph="G" subcat="">
<w catint="V">Mettant</w>
<w catint="D">un</w>
<w catint="N">terme</w>
</w>
</VN>
- <PP fct="A-OBJ">
<w cat="P" ee="P" ei="P" lemma="à">à</w>
- <NP>
<w cat="PRO" ee="PRO-dem-3ms" ei="PRO3ms" lemma="ce" mph="3ms" subcat="dem">ce</w>
- <Srel>
- <NP fct="SUJ">
<w cat="PRO" ee="PRO-rel-3ms" ei="PROR3ms" lemma="qui" mph="3ms" subcat="rel">qui</w>
</NP>
<w cat="PONCT" ee="PONCT-W" ei="PONCTW" lemma="," subcat="W">,</w>
- <PP fct="MOD">
- <w cat="P" compound="yes" ee="P" ei="P" lemma="d'après">
<w catint="P">d'</w>
<w catint="P">après</w>
</w>
- <NP>
<w cat="D" ee="D-def-ms" ei="Dms" lemma="le" mph="ms" subcat="def">le</w>
- <w cat="N" compound="yes" ee="N-P-ms" ei="NPms" lemma="Washington Post" mph="ms" subcat="P">
<w catint="N">Washington</w>
<w catint="N">Post</w>
</w>
</NP>
</PP>
<w cat="PONCT" ee="PONCT-W" ei="PONCTW" lemma="," subcat="W">,</w>
- <VN>
<w cat="V" ee="V--I3s" ei="VI3s" lemma="avoir" mph="I3s" subcat="">avait</w>
<w cat="V" ee="V--Kms" ei="VKms" lemma="constituer" mph="Kms" subcat="">constitué</w>
</VN>
<w cat="PONCT" ee="PONCT-W" ei="PONCTW" lemma="&quot;" subcat="W">"</w>
- <NP fct="OBJ">
<w cat="D" ee="D-def-fs" ei="Dfs" lemma="le" mph="fs" subcat="def">la</w>
<w cat="N" ee="N-C-fs" ei="NCfs" lemma="bataille" mph="fs" subcat="C">bataille</w>
</NP>
- <NP fct="MOD">
<w cat="D" ee="D-def-fs" ei="Dfs" lemma="le" mph="fs" subcat="def">la</w>
- <AP>
<w cat="ADV" ee="ADV" ei="ADV" lemma="plus">plus</w>
<w cat="A" ee="A-qual-fs" ei="Afs" lemma="féroce" mph="fs" subcat="qual">féroce</w>
</AP>
- <COORD>
<w cat="C" ee="C-C" ei="CC" lemma="et" subcat="C">et</w>
- <NP>
<w cat="D" ee="D-def-fs" ei="Dfs" lemma="le" mph="fs" subcat="def">la</w>
- <AP>
<w cat="ADV" ee="ADV" ei="ADV" lemma="plus">plus</w>
<w cat="A" ee="A-qual-fs" ei="Afs" lemma="coûteux" mph="fs" subcat="qual">coûteuse</w>
</AP>
</NP>
</COORD>
</NP>
- <PP fct="MOD">
- <w cat="P" compound="yes" ee="P" ei="P" lemma="en termes de">
<w catint="P">en</w>
<w catint="N">termes</w>
<w catint="P">de</w>
</w>
- <NP>
<w cat="N" ee="N-C-ms" ei="NCms" lemma="lobbying" mph="ms" subcat="C">lobbying</w>
</NP>
</PP>
<w cat="PONCT" ee="PONCT-W" ei="PONCTW" lemma="," subcat="W">,</w>
- <PP fct="MOD">
<w cat="P" ee="P" ei="P" lemma="de">de</w>
- <NP>
<w cat="D" ee="D-dem-mp" ei="Dmp" lemma="ce" mph="mp" subcat="dem">ces</w>
<w cat="A" ee="A-qual-mp" ei="Amp" lemma="dernier" mph="mp" subcat="qual">derniers</w>
<w cat="N" ee="N-C-mp" ei="NCmp" lemma="mois" mph="mp" subcat="C">mois</w>
</NP>
</PP>
<w cat="PONCT" ee="PONCT-W" ei="PONCTW" lemma="&quot;" subcat="W">"</w>
</Srel>
</NP>
</PP>
</VPpart>
<w cat="PONCT" ee="PONCT-W" ei="PONCTW" lemma="," subcat="W">,</w>
- <NP fct="SUJ">
<w cat="D" ee="D-def-ms" ei="Dms" lemma="le" mph="ms" subcat="def">le</w>
<w cat="N" ee="N-C-ms" ei="NCms" lemma="groupe" mph="ms" subcat="C">groupe</w>
<w cat="N" ee="N-P-ms" ei="NPms" lemma="Thomson" mph="ms" subcat="P">Thomson</w>
</NP>
- <VN>
<w cat="V" ee="V--P3s" ei="VP3s" lemma="avoir" mph="P3s" subcat="">a</w>
<w cat="V" ee="V--Kms" ei="VKms" lemma="informer" mph="Kms" subcat="">informé</w>
</VN>
- <NP fct="MOD">
<w cat="D" ee="D-def-ms" ei="Dms" lemma="le" mph="ms" subcat="def">le</w>
<w cat="A" ee="A-card-ms" ei="Ams" lemma="6" mph="ms" subcat="card">6</w>
<w cat="N" ee="N-C-ms" ei="NCms" lemma="juillet" mph="ms" subcat="C">juillet</w>
</NP>
- <NP fct="OBJ">
<w cat="D" ee="D-def-fp" ei="Dfp" lemma="le" mph="fp" subcat="def">les</w>
<w cat="N" ee="N-C-fp" ei="NCfp" lemma="autorité" mph="fp" subcat="C">autorités</w>
- <AP>
<w cat="A" ee="A-qual-fp" ei="Afp" lemma="américain" mph="fp" subcat="qual">américaines</w>
</AP>
</NP>
- <PP fct="MOD">
<w cat="P" ee="P" ei="P" lemma="de">de</w>
- <NP>
<w cat="D" ee="D-poss-3fss" ei="Dfs" lemma="son" mph="3fss" subcat="poss">son</w>
<w cat="N" ee="N-C-fs" ei="NCfs" lemma="intention" mph="fs" subcat="C">intention</w>
- <VPinf>
<w cat="P" ee="P" ei="P" lemma="de">de</w>
- <VN>
<w cat="V" ee="V--W" ei="VW" lemma="renoncer" mph="W" subcat="">renoncer</w>
</VN>
<w cat="PONCT" ee="PONCT-W" ei="PONCTW" lemma="," subcat="W">,</w>
- <PP fct="MOD">
<w cat="P" ee="P" ei="P" lemma="dans">dans</w>
- <NP>
<w cat="D" ee="D-poss-3fss" ei="Dfs" lemma="son" mph="3fss" subcat="poss">sa</w>
<w cat="N" ee="N-C-fs" ei="NCfs" lemma="phase" mph="fs" subcat="C">phase</w>
- <AP>
<w cat="A" ee="A-qual-fs" ei="Afs" lemma="actuel" mph="fs" subcat="qual">actuelle</w>
</AP>
</NP>
</PP>
<w cat="PONCT" ee="PONCT-W" ei="PONCTW" lemma="," subcat="W">,</w>
- <PP fct="A-OBJ">
<w cat="P" ee="P" ei="P" lemma="à">au</w>
- <NP>
<w cat="D" ee="D-def-ms" ei="Dms" lemma="le" mph="ms" subcat="def" />
<w cat="N" ee="N-C-ms" ei="NCms" lemma="rachat" mph="ms" subcat="C">rachat</w>
- <PP>
<w cat="P" ee="P" ei="P" lemma="de">du</w>
- <NP>
<w cat="D" ee="D-def-ms" ei="Dms" lemma="le" mph="ms" subcat="def" />
<w cat="N" ee="N-C-ms" ei="NCms" lemma="fabricant" mph="ms" subcat="C">fabricant</w>
- <PP>
<w cat="P" ee="P" ei="P" lemma="de">de</w>
- <NP>
<w cat="N" ee="N-C-mp" ei="NCmp" lemma="missile" mph="mp" subcat="C">missiles</w>
<w cat="N" ee="N-P-fs" ei="NPfs" lemma="LTV" mph="fs" subcat="P">LTV</w>
- <Srel>
- <PP fct="DE-OBJ">
<w cat="PRO" ee="PRO-rel-3ms" ei="PROR3ms" lemma="dont" mph="3ms" subcat="rel">dont</w>
</PP>
- <VN fct="SUJ">
<w cat="CL" ee="CL-suj-3ms" ei="CL3ms" lemma="il" mph="3ms" subcat="suj">il</w>
<w cat="V" ee="V--I3s" ei="VI3s" lemma="avoir" mph="I3s" subcat="">avait</w>
<w cat="V" ee="V--Kms" ei="VKms" lemma="hériter" mph="Kms" subcat="">hérité</w>
</VN>
- <NP fct="MOD">
<w cat="D" ee="D-def-ms" ei="Dms" lemma="le" mph="ms" subcat="def">le</w>
<w cat="A" ee="A-card-ms" ei="Ams" lemma="10" mph="ms" subcat="card">10</w>
<w cat="N" ee="N-C-ms" ei="NCms" lemma="avril" mph="ms" subcat="C">avril</w>
</NP>
<w cat="PONCT" ee="PONCT-W" ei="PONCTW" lemma="," subcat="W">,</w>
......................
</PP>
<w cat="PONCT" ee="PONCT-S" ei="PONCTS" lemma="." subcat="S">.</w>
</SENT>
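Given a structure like the sentence above, the search described in this post (find each "CL" or "PRO" word, then collect the nouns within a window of sentences) could be sketched as follows in Python 3 with the standard-library ElementTree. The miniature corpus is made up, and treating the 4-sentence window as including the current sentence is an assumption.

```python
import xml.etree.ElementTree as ET

# Made-up miniature corpus; the real sentences are far more deeply nested.
corpus = """
<text>
  <SENT nb="1"><NP><w cat="N" lemma="groupe">groupe</w></NP></SENT>
  <SENT nb="2"><NP><w cat="N" lemma="rachat">rachat</w></NP>
    <VN><w cat="CL" lemma="il">il</w><w cat="V" lemma="avoir">avait</w></VN></SENT>
</text>
"""

WINDOW = 4  # sentences to search, counting the pronoun's own sentence (assumption)

def candidates_for_pronouns(root, window=WINDOW):
    """Yield (pronoun, sentence_nb, [(noun, sentence_nb), ...]) for each CL/PRO word."""
    sentences = list(root.iter("SENT"))
    for i, sent in enumerate(sentences):
        for w in sent.iter("w"):
            if w.get("cat") not in ("CL", "PRO"):
                continue
            nouns = []
            # Slice out the current sentence and up to window - 1 sentences before it.
            for prev in sentences[max(0, i - window + 1): i + 1]:
                for n in prev.iter("w"):
                    if n.get("cat") == "N":
                        nouns.append((n.text, prev.get("nb")))
            yield w.text, sent.get("nb"), nouns

root = ET.fromstring(corpus)
for pronoun, nb, nouns in candidates_for_pronouns(root):
    print(pronoun, nb, nouns)
```

This sketch does not check whether a noun comes before or after the pronoun inside the same sentence, and it does not handle compound words, whose text sits in nested `<w catint="...">` elements; both would need extra handling on the real file.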

MaYouSHka 0 Newbie Poster · Answer 2 · 2011-05-10T23:51:16+00:00

Hi snippsat,
thanks for your quick answer.

MaYouSHka 0 Newbie Poster · Answer 3 · 2011-05-23T15:30:33+00:00

Hi,

I am trying to load an XML file.

First I tried a simple method, using IDLE to load and parse the file, but I get this error:

ActivePython 2.7.1.4 (ActiveState Software Inc.) based on
Python 2.7.1 (r271:86832, Feb 7 2011, 11:30:38) [MSC v.1500 32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> from xml.dom import minidom
>>> xmldoc = minidom.parse('~/Documents/maya/le grand projet/ftb.xml')

Traceback (most recent call last):
File "<pyshell#1>", line 1, in <module>
xmldoc = minidom.parse('~/Documents/maya/le grand projet/ftb.xml')
File "C:\Python27\lib\xml\dom\minidom.py", line 1914, in parse
return expatbuilder.parse(file)
File "C:\Python27\lib\xml\dom\expatbuilder.py", line 922, in parse
fp = open(file, 'rb')
IOError: [Errno 2] No such file or directory: '~/Documents/maya/le grand projet/ftb.xml'

Can someone please tell me how to tell the parser where the file is?
I am working on Windows (before, I worked on Ubuntu).

TrustyTony 888 ex-Moderator Team Colleague Featured Poster · Answer 4 · 2011-05-23T15:58:04+00:00

A tilde is a valid file-name character on Windows, so the path is taken literally, and I do not think that is what you meant. It is quite common to keep the script and the work file in the same directory; then you can use just the file name without the directory part. Or you can give the proper name of a directory that exists (you can check it with dir directory\path\to\the\file\ at the CMD prompt).
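If the home-directory shorthand is still wanted, Python can expand it explicitly with `os.path.expanduser`, since `open()` and the XML parsers do not do so themselves. A short Python 3 sketch using the path from the question:

```python
import os.path

# The path from the question; Python does not expand "~" on its own.
raw_path = "~/Documents/maya/le grand projet/ftb.xml"

# expanduser() replaces the leading "~" with the current user's home directory.
full_path = os.path.expanduser(raw_path)
print(full_path)

# Checking first gives a clearer message than a traceback from inside the parser.
print(os.path.exists(full_path))
```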

MaYouSHka 0 Newbie Poster · Answer 5 · 2011-05-25T17:47:29+00:00

Hi tonyjv,

thanks for your answer.
I saved both files on my Desktop, and now everything is working.

But I have another problem: I am trying to access the 'SENT' part of my .xml file,
knowing that it is not on the same level as the first node:
<?xml version="1.0" encoding="ISO-8859-1" ?>
- <text>
- 
- <SENT nb="511">
- <w cat="ADV" compound="yes" ee="ADV" ei="ADV" lemma="de fait">
<w catint="P">De</w>
<w catint="N">fait</w>
</w>

I am writing

sent=xmldoc.getElementByTagName('SENT')[0].firstchild.data

and I am getting: AttributeError: Document instance has no attribute getElementByTagName
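The AttributeError comes from the spelling: the minidom method is `getElementsByTagName` (plural "Elements"), and the attribute is `firstChild` with a capital C. The method searches all descendants, so it does not matter that `SENT` is nested below the first node. A small Python 3 sketch on a shortened, made-up version of the file:

```python
from xml.dom import minidom

# Shortened, made-up version of the file's structure.
xml_text = """<text>
  <SENT nb="511">
    <w cat="ADV" lemma="de fait"><w catint="P">De</w><w catint="N">fait</w></w>
  </SENT>
</text>"""

xmldoc = minidom.parseString(xml_text)

# Plural "Elements": returns every <SENT> in the document, at any depth.
first_sent = xmldoc.getElementsByTagName("SENT")[0]
print(first_sent.getAttribute("nb"))

# The firstChild of <SENT> is only a whitespace text node here,
# so read the text from the innermost <w> elements instead.
words = [w.firstChild.data
         for w in first_sent.getElementsByTagName("w")
         if w.hasAttribute("catint")]
print(words)
```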

tomboy 0 Newbie Poster · Answer 6 · 2011-05-27T16:19:49+00:00

First, you should validate your XML file before you try to parse it. You should be able to find a suitable XML parser by googling; I work with Liquid XML Editor, but feel free to pick whichever you want.

snippsat 661 Master Poster · Answer 7 · 2011-05-27T17:54:14+00:00

Tomboy, Python has BeautifulSoup and lxml, which are free Python tools and can parse an XML file even if it does not validate.
As you can see in my example above, it works fine; there is no need to google for non-Python tools.

MaYouSHka 0 Newbie Poster · Answer 8 · 2011-05-31T16:03:34+00:00

Thank you, snippsat, for your answer.

I am using the "Dive Into Python" book, and it mainly explains how to use the DOM. That is why I am trying to use a parser that I understand and know how to use (a bit),
rather than a parser I don't know how to use at all!

maya

MaYouSHka 0 Newbie Poster · Answer 9 · 2011-06-01T20:10:46+00:00

Thank you all for your answers.

The problem does not come from my XML file or my parser.

I have a question about Python files: is there a precise structure to follow? For example,
do I always have to have a class and a main part?

Thanks again

TrustyTony 888 ex-Moderator Team Colleague Featured Poster · Answer 10 · 2011-06-01T21:19:42+00:00

There are no rules set in stone, but the PEP 8 formatting suggestions are generally good to follow, along with the usual order: imports, global values, definitions, and last the main code (sometimes only a call to the main function).

(Next time, feel free to start a new thread for any new questions, and do not forget to mark solved threads as solved.)
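The order described above might look like this in a small Python 3 script. It is only an illustrative layout; the names `WINDOW`, `describe` and `main` are placeholders, and no class is needed unless the program benefits from one.

```python
"""Example module layout: imports, global values, definitions, then the main code."""
import sys

WINDOW = 4  # module-level constant (placeholder value)

def describe(args):
    """Return a short summary string; a stand-in for the real work."""
    return "window=%d, %d argument(s)" % (WINDOW, len(args))

def main(argv):
    """Entry point: the script's top-level logic lives here."""
    print(describe(argv[1:]))

# Runs only when the file is executed directly, not when it is imported.
if __name__ == "__main__":
    main(sys.argv)
```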
