
parsing an XML file

Hi all,
I have been working on a project for two months and have come up with an algorithm that
combines R. Mitkov's anaphora resolution algorithm (the robust, knowledge-poor approach) with
several filters. The filters are applied first to the XML file (a POS-tagged text) to eliminate the expletive (impersonal, non-anaphoric) "it", e.g. "it rains".
(I am working on a French text.)
My algorithm is structured as follows:

input: XML text with POS tagging
step 1: search for pronouns that are tagged "CL" or "PRO" and label each one
"expletive" or "non-expletive" by applying a list of rules

step 2: take the pronouns labelled "non-expletive" (anaphoric) and
search for their antecedents by applying Mitkov's algorithm (10 rules
in all), which assigns scores to each candidate antecedent.
Candidates are searched within a distance of 4 sentences if the pronoun is personal,
and within the same phrase if the pronoun is reflexive or possessive

output: a list in which every line contains the pronoun, its position
in the XML text (sentence number), the chosen antecedent, and the number
of the sentence in which the antecedent was found
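
Step 1 could start from something like the sketch below. It assumes the layout visible in the samples posted further down this thread: sentences are `SENT` elements with an `nb` attribute, and words are `w` elements with a `cat` attribute. The miniature document and the name `find_pronouns` are made up for illustration, and the standard library's `xml.etree.ElementTree` is used only because it needs no installation.

```python
import xml.etree.ElementTree as ET

# Made-up miniature document; the real treebank file is much richer.
SAMPLE = """<text>
  <SENT nb="1"><NP><w cat="N">Thomson</w></NP><VN><w cat="CL">il</w><w cat="V">renonce</w></VN></SENT>
  <SENT nb="2"><VN><w cat="CL">il</w><w cat="V">pleut</w></VN></SENT>
</text>"""

PRONOUN_CATS = {"CL", "PRO"}


def find_pronouns(root):
    """Return (sentence number, word) pairs for every w tagged CL or PRO."""
    found = []
    for sent in root.iter("SENT"):          # every sentence, at any depth
        for w in sent.iter("w"):            # every word inside that sentence
            if w.get("cat") in PRONOUN_CATS:
                found.append((int(sent.get("nb")), w.text))
    return found


root = ET.fromstring(SAMPLE)
pronouns = find_pronouns(root)
print(pronouns)  # [(1, 'il'), (2, 'il')]
```

The expletive/non-expletive rules would then run over each pair; they are not shown because the thread does not list them.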

So I am wondering how to proceed with applying the scores.
Should I build a matrix of antecedents for every pronoun (if that is
possible in Python)?
Every candidate antecedent will be assigned scores,
and at the end of the antecedent search for each non-expletive (anaphoric) pronoun
I have to add up the scores assigned to each candidate and pick
the one with the best total.

Thank you for any advice you can give me on implementing this
algorithm.
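
A full matrix is probably not needed for the score bookkeeping described above: a dictionary that maps each candidate antecedent to its running total may be enough. The sketch below is one way to do it, not Mitkov's implementation; the rule names, the score values and the function name `pick_antecedent` are placeholders.

```python
from collections import defaultdict


def pick_antecedent(indicator_scores):
    """Sum the scores per candidate and return (best candidate, all totals).

    indicator_scores is an iterable of (candidate, rule name, score) triples.
    """
    totals = defaultdict(int)
    for candidate, _rule, score in indicator_scores:
        totals[candidate] += score
    # On a tie, max() keeps the candidate that was seen first.
    best = max(totals, key=totals.get)
    return best, dict(totals)


# Hypothetical scores for a single pronoun.
scores = [
    ("le groupe Thomson", "rule_1", 1),
    ("le groupe Thomson", "rule_2", 2),
    ("les autorités", "rule_1", 1),
    ("les autorités", "rule_3", -1),
]
best, totals = pick_antecedent(scores)
print(best, totals)
```

Keeping the rule name in each triple makes it easy to print which rules contributed to a decision. How ties should really be broken is left open here.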

P.S.
I have already read the two previous discussions about XML parsing,
but I was told that not all parsers allow detailed searches in the XML file. For example, getting the
child of a node (the direct child, or the second child) would be possible with certain parsers
and not others.
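
For what it is worth, reaching the direct child or the second child of a node is possible with the standard library alone. The sketch below uses `xml.etree.ElementTree` on a shortened, made-up version of an `NP` element; the attribute values are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

SAMPLE = '<NP fct="SUJ"><w cat="D">le</w><w cat="N">groupe</w><w cat="N">Thomson</w></NP>'

np_node = ET.fromstring(SAMPLE)
children = list(np_node)        # direct children only, in document order
first_child = children[0]
second_child = np_node[1]       # an Element can also be indexed directly
# Direct children named w whose cat attribute equals N
nouns = np_node.findall("w[@cat='N']")

print(first_child.text, second_child.text, [w.text for w in nouns])
```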

MaYouSHka
Newbie Poster
7 posts since May 2011
Reputation Points: 10
Solved Threads: 0
 

Two good parsers are BeautifulSoup and lxml.
Can you post part of the XML file and say what information you want out of it?
Then maybe I can show a little about how to parse XML.

snippsat
Practically a Posting Shark
808 posts since Aug 2008
Reputation Points: 353
Solved Threads: 294
 

Hi,
thanks for your quick reply.

My XML file is made up of sentences. Each sentence has a number, and I need the sentence number to locate the pronoun and the antecedent.

First I have to locate the personal pronouns tagged "CL" or "PRO" in my XML file.
Then I need to classify them as anaphoric or non-anaphoric by applying several filters, which are rules.
As output I need
all the pronouns, both the anaphoric and the non-anaphoric ones,
e.g. pronoun, location (sentence number), anaphoric, antecedent, location of the antecedent (sentence number).
So my answer is in list form for every pronoun.

But:
1- I don't know whether I should create a new XML file with a tag for the pronouns (anaphoric or non-anaphoric) or insert a tag into my present XML file. The step of distinguishing anaphoric from non-anaphoric is essential, because I want my output to show that my program recognized the non-anaphoric ones, but the rest of the algorithm is only applied to the anaphoric ones.
2- I don't know how to store the scores for every pronoun. If I find an anaphoric pronoun with 4 antecedents or more, I need to apply, let's say, 9 or 10 scores to each and then choose and print the antecedent with the highest score.
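
On the first point, one possibility is to record the decision as an extra attribute on the pronoun's own element and to save the result under a new file name, so the original corpus stays untouched. The sketch below assumes that approach. The attribute name `anaphoric` and the stand-in rule function `is_anaphoric` are inventions for illustration, not part of the corpus format.

```python
import xml.etree.ElementTree as ET

SAMPLE = '<SENT nb="2"><VN><w cat="CL">il</w><w cat="V">pleut</w></VN></SENT>'


def is_anaphoric(word):
    """Stand-in for the real filter rules; here every pronoun is rejected."""
    return False


sent = ET.fromstring(SAMPLE)
for w in sent.iter("w"):
    if w.get("cat") in {"CL", "PRO"}:
        # Record the decision on the element itself.
        w.set("anaphoric", "yes" if is_anaphoric(w) else "no")

annotated = ET.tostring(sent, encoding="unicode")
print(annotated)
# To keep the original file intact, the tree could be saved under another name:
# ET.ElementTree(sent).write("annotated.xml", encoding="utf-8")
```

The later steps can then select only the elements marked `anaphoric="yes"`, while the output still lists the ones marked `no`.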


In this sentence, for instance, I have "il", marked as part of a verb phrase 'VN':

<SENT nb="525">
Mettant un terme à ce qui, d'après le Washington Post, avait constitué "la bataille la plus féroce et la plus coûteuse en termes de lobbying, de ces derniers mois", le groupe Thomson a informé le 6 juillet les autorités américaines de son intention de renoncer, dans sa phase actuelle, au rachat du fabricant de missiles LTV dont il avait hérité le 10 avril,
......................

MaYouSHka
 

Just to parse something for a start; I have not read about your task in too much detail.
That XML you have is not the easiest XML I have seen.
I want to take out groupe and Thomson.

from bs4 import BeautifulSoup

xml = '''\
</VPpart>
<w cat="PONCT" ee="PONCT-W" ei="PONCTW" lemma="," subcat="W">,</w>
- <NP fct="SUJ">
<w cat="D" ee="D-def-ms" ei="Dms" lemma="le" mph="ms" subcat="def">le</w>
<w cat="N" ee="N-C-ms" ei="NCms" lemma="groupe" mph="ms" subcat="C">groupe</w>
<w cat="N" ee="N-P-ms" ei="NPms" lemma="Thomson" mph="ms" subcat="P">Thomson</w>
</NP>
- <VN>
<w cat="V" ee="V--P3s" ei="VP3s" lemma="avoir" mph="P3s" subcat="">a</w>
<w cat="V" ee="V--Kms" ei="VKms" lemma="informer" mph="Kms" subcat="">informé</w>
</VN>'''

# html.parser is lenient, so the stray closing tag and the "- " markers are tolerated
soup = BeautifulSoup(xml, "html.parser")
# keep only the w elements whose subcat attribute is exactly C or P
tags = soup.find_all("w", subcat=["C", "P"])

print([str(tag.string) for tag in tags])  # --> ['groupe', 'Thomson']
snippsat
 

Hi snippsat,
thanks for your quick answer.

MaYouSHka
 

Hi,

I am trying to load an XML file.

First I tried a simple method using IDLE to load and parse a file, but I get this error:

ActivePython 2.7.1.4 (ActiveState Software Inc.) based on
Python 2.7.1 (r271:86832, Feb 7 2011, 11:30:38) [MSC v.1500 32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> from xml.dom import minidom
>>> xmldoc = minidom.parse('~/Documents/maya/le grand projet/ftb.xml')

Traceback (most recent call last):
File "", line 1, in
xmldoc = minidom.parse('~/Documents/maya/le grand projet/ftb.xml')
File "C:\Python27\lib\xml\dom\minidom.py", line 1914, in parse
return expatbuilder.parse(file)
File "C:\Python27\lib\xml\dom\expatbuilder.py", line 922, in parse
fp = open(file, 'rb')
IOError: [Errno 2] No such file or directory: '~/Documents/maya/le grand projet/ftb.xml'


Can someone please tell me how to tell the parser where the file is?
I am working on Windows (before, I worked on Ubuntu).

MaYouSHka
 

A tilde is a valid file name character in Windows, so the path is taken literally, and I do not think you meant that. It is quite common to keep the script and the work file in the same directory; then you can just use the file name without the directory part. Or you can give the proper name of a directory that exists (you can check it with dir directory\path\to\the\file\ at the CMD prompt).
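
If the home-directory shorthand is really wanted, Python can expand it explicitly before the path reaches the parser. The sketch below uses the path from the post above; the helper name `resolve_path` is made up, and whether the file exists depends on the machine.

```python
import os


def resolve_path(path):
    """Expand a leading ~ and return an absolute, normalized path."""
    return os.path.abspath(os.path.expanduser(path))


xml_path = resolve_path("~/Documents/maya/le grand projet/ftb.xml")
print(xml_path)
print("exists:", os.path.exists(xml_path))
```

Checking `os.path.exists` first gives a clearer message than the traceback from inside the parser.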

pyTony
pyMod
Moderator
5,359 posts since Apr 2010
Reputation Points: 782
Solved Threads: 852
 

Hi tonyjv,

thanks for your answer.
I saved both files on my Desktop, and now everything is working.

But I have another problem: I am trying to access the 'SENT' part of my .xml file,
knowing that it is not on the same level as the first node:
<?xml version="1.0" encoding="ISO-8859-1" ?>
De fait

i am writing

sent=xmldoc.getElementByTagName('SENT')[0].firstchild.data

and I am getting: AttributeError: Document instance has no attribute getElementByTagName
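
The error message matches the spelling of the call: in `xml.dom.minidom` the method is `getElementsByTagName` (plural "Elements"), and the property is `firstChild` with a capital C. The sketch below shows both on a made-up miniature document; the `cat` values in it are invented.

```python
from xml.dom import minidom

# Made-up miniature document; the attribute values are invented.
SAMPLE = '<text><SENT nb="1"><w cat="P">De</w><w cat="N">fait</w></SENT></text>'

xmldoc = minidom.parseString(SAMPLE)
# getElementsByTagName searches all descendants, so the nesting depth of SENT does not matter.
sents = xmldoc.getElementsByTagName("SENT")
first_sent = sents[0]
first_word = first_sent.getElementsByTagName("w")[0]

print(first_sent.getAttribute("nb"), first_word.firstChild.data)  # 1 De
```

Note that `.data` only exists on text nodes. The first child of a `SENT` element is likely to be another element or a whitespace text node, so the text usually has to be read from the `w` elements.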

MaYouSHka
 

The problem with a parser like xml.dom is that the XML must be perfect.
Parsers like BeautifulSoup and lxml can handle XML/HTML even if it is not correct.
From the BeautifulSoup webpage:
You didn't write that awful page.
You're just trying to get some data out of it. Right now, you don't really care what HTML is supposed to look like.

Neither does this parser.

snippsat
 

First you should validate your XML file before you try to parse it. You should be able to find a suitable XML parser by googling; I work with Liquid XML Editor, but feel free to pick whichever you want.
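
A basic check can also be done from Python itself. The sketch below only tests well-formedness (matching tags, a single root), which is weaker than validation against a DTD or schema. The function name `is_well_formed` is made up.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def is_well_formed(xml_text):
    """Return True if the strict parser accepts the text, False otherwise."""
    try:
        minidom.parseString(xml_text)
    except ExpatError:
        return False
    return True


print(is_well_formed("<SENT><w>il</w></SENT>"))  # True
print(is_well_formed("<SENT><w>il</SENT>"))      # False: w is never closed
```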

tomboy
Newbie Poster
2 posts since Mar 2011
Reputation Points: 10
Solved Threads: 0
 

Tomboy, Python has BeautifulSoup and lxml, which are free Python tools and can parse any XML file even if it does not validate.
As you see in my example above, it works fine; no need to google for non-Python tools.

snippsat
 


Thanks for stating the obvious. We sometimes get too caught up in problem solving and forget to mention good techniques. Data validation should be the first step.

woooee
Nearly a Posting Maven
2,454 posts since Dec 2006
Reputation Points: 777
Solved Threads: 714
 

Thank you, snippsat, for your answer.

I am using the "Dive Into Python" book, which mainly explains how to use DOM. That is why I am trying to use a parser that I understand and know how to use (a bit),
rather than a parser I don't know how to use at all!

maya

MaYouSHka
 

Thank you all for your answers.

The problem does not come from my XML file or my parser.

I have a question about Python files: is there a precise structure to follow? For example,
do I always have to have a class and a main part?

thanks again

MaYouSHka
 

There are no rules set in stone, but the PEP 8 format suggestions are generally good to follow, along with the usual order: imports, global values, definitions, and last the main code (sometimes only a call to a main function).
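
As a rough illustration of that order, a small script might look like the skeleton below. Every name in it is a placeholder, and it is only one common layout, not a required one.

```python
"""Hypothetical skeleton of a small script; all names are placeholders."""
import sys            # 1. imports

WINDOW = 4            # 2. global values


def describe(path):   # 3. definitions (no class is required for a small script)
    """Return a one-line description of what the script would do."""
    return "would parse %s with a window of %d sentences" % (path, WINDOW)


def main(path="ftb.xml"):
    print(describe(path))


if __name__ == "__main__":   # 4. main code, often just a call to main()
    main(*sys.argv[1:2])
```

The `if __name__ == "__main__":` guard lets the file be imported from another module without running the main code.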

(Next time, feel free to start a new thread for any new questions, and do not forget to mark solved threads as solved.)

pyTony
 
