Hi everyone,
I'm trying to extract the text between tags, but only under certain conditions. <title> and <pos> are both children of <page>, and neither is nested inside the other (i.e., they're siblings). Each <page> always has exactly one <title> and zero to five <pos> elements. What I need is to return the text of <title> followed by the text of every <pos> in that page, but only if the page has one or more <pos>. If there aren't any <pos>, the title shouldn't be returned at all. Here's a bit of the sample source:
<page>
<title>dasher</title>
<pos>red</pos>
</page>
<page>
<title>dancer</title>
<pos>red</pos>
<pos>blue</pos>
</page>
<page>
<title>coconut</title>
</page>
<page>
<title>rudolph</title>
<pos>red</pos>
<pos>brown</pos>
<pos>red</pos>
</page>
What I'd like to return is this:
dasher red
dancer red blue
rudolph red brown red
My code currently does almost exactly what I want, but the problem is that it returns every title, whether or not it has any <pos> siblings. This is what I have:
import os, sys
from bs4 import BeautifulSoup
file = open('sourcedata.xml')
fixed = open('ftemp.txt','w')
soup = BeautifulSoup(file, "lxml")
divTag = soup.find_all("page")
for tag in divTag:
    ttl = tag.find_all("title")
    allpos = tag.find_all('pos')
    mypos = ""
    for tag in ttl:
        mytitle = tag.text
    for tag in allpos:
        if len(tag.text) > 0:
            mypos += " " + tag.text
        else:
            pass
    fixed.write(mytitle + " " + mypos + "\n")
fixed.close()
I know why it's returning every title: the write statement runs once per <page> in the outer for loop, whether or not any <pos> was found. But I don't know how to fix it. Moving it elsewhere hasn't worked, either. If I move the write statement inside the "for tag in allpos" loop, I get duplicates ("dasher red" / "dasher red blue"), because it then writes once per <pos>. I was thinking that if a <title> has no corresponding <pos>, I could delete that title, but I don't know how to do that. Can anyone suggest a solution?
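One possible approach, offered only as a rough sketch: rather than deleting titles, collect the <pos> text for each page first and skip the write when that list is empty. The helper names page_line and extract_lines below are made up for illustration, and extract_lines assumes it is handed the same BeautifulSoup object (soup) that the script already builds. Joining with " ".join also avoids the doubled space that mytitle + " " + mypos produces.

```python
def page_line(title, pos_texts):
    """Return "title pos1 pos2 ..." or None if there is no non-empty <pos> text.

    Hypothetical helper: `title` is the <title> text and `pos_texts` is a
    list of the <pos> texts found in the same <page>.
    """
    kept = [p.strip() for p in pos_texts if p.strip()]
    if not kept:
        return None  # no <pos> siblings, so this title is skipped entirely
    return " ".join([title.strip()] + kept)


def extract_lines(soup):
    """Yield one output line per <page> that has at least one <pos>.

    Assumes `soup` is a BeautifulSoup object (or anything else exposing
    find_all/find and tags with a .text attribute).
    """
    for page in soup.find_all("page"):
        title_tag = page.find("title")
        if title_tag is None:
            continue  # defensive: the post says every <page> has a <title>
        pos_texts = [p.text for p in page.find_all("pos")]
        line = page_line(title_tag.text, pos_texts)
        if line is not None:
            yield line
```

With the script above, the outer loop could then be replaced by something like: for line in extract_lines(soup): fixed.write(line + "\n"). The key change is just the emptiness check before writing; the same effect could probably be had in the original loop by wrapping the write in "if mypos:".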