I'm trying to extract text from between tags, but only under certain conditions. <title> and <pos> are both children of <page>, and neither one is nested inside the other (i.e., they're siblings). Each <page> always has one <title> and zero to five <pos> elements. What I need to do is return the text of <title> followed by the text of all its sibling <pos> elements, but only if there are one or more <pos>. If there aren't any <pos>, the title shouldn't be returned. Here's a bit of the sample source:
<page>
  <title>dasher</title>
  <pos>red</pos>
</page>
<page>
  <title>dancer</title>
  <pos>red</pos>
  <pos>blue</pos>
</page>
<page>
  <title>coconut</title>
</page>
<page>
  <title>rudolph</title>
  <pos>red</pos>
  <pos>brown</pos>
  <pos>red</pos>
</page>
What I'd like to return is this:
dancer red blue
rudolph red brown red
My code currently does almost exactly what I want, but the problem is that it returns every title, whether or not it has any <pos> siblings. This is what I have:
import os, sys
from bs4 import BeautifulSoup

file = open('sourcedata.xml')
fixed = open('ftemp.txt','w')
soup = BeautifulSoup(file, "lxml")
divTag = soup.find_all("page")
for tag in divTag:
    ttl = tag.find_all("title")
    allpos = tag.find_all('pos')
    mypos = ""
    for tag in ttl:
        mytitle = tag.text
    for tag in allpos:
        if len(tag.text) > 0:
            mypos += " " + tag.text
        else:
            pass
    fixed.write(mytitle + " " + mypos + "\n")
fixed.close()
I know why it's returning every title - because the write statement is inside a for loop - but I don't know how to fix it. Moving it elsewhere hasn't worked, either. If I move the write statement inside the "for tag in allpos" loop, then I get duplicates ("dasher red" / "dasher red blue"). I was thinking that if a <title> has no corresponding <pos>, I could delete that title, but I don't know how to do that. Can anyone suggest a solution?
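For reference, here is a minimal sketch of one possible approach: collect the <pos> texts for each <page> first, and only emit a line when enough of them were found, rather than deleting anything. It swaps in the standard library's xml.etree.ElementTree for BeautifulSoup so it runs without third-party packages; the same guard (skip the write when the list of <pos> texts is too short) should carry over to the loop above. The names `titles_with_pos`, `min_pos`, and `SAMPLE` are hypothetical, and the synthetic `<root>` wrapper is an assumption made only because the sample shows several top-level <page> elements.

```python
import xml.etree.ElementTree as ET

# Sample data copied from the question (hypothetical constant name).
SAMPLE = """
<page> <title>dasher</title> <pos>red</pos> </page>
<page> <title>dancer</title> <pos>red</pos> <pos>blue</pos> </page>
<page> <title>coconut</title> </page>
<page> <title>rudolph</title> <pos>red</pos> <pos>brown</pos> <pos>red</pos> </page>
"""


def titles_with_pos(xml_fragment, min_pos=1):
    """Return one 'title pos pos ...' string per <page> that has at least
    min_pos non-empty <pos> children; other pages are skipped."""
    # Assumption: wrap the fragment in a synthetic root so it parses as
    # a single well-formed XML document.
    root = ET.fromstring("<root>" + xml_fragment + "</root>")
    lines = []
    for page in root.iter("page"):
        title = page.findtext("title", default="").strip()
        # Keep only <pos> elements that actually contain text.
        pos_texts = [
            p.text.strip()
            for p in page.findall("pos")
            if p.text and p.text.strip()
        ]
        # The guard: skip this page entirely when it has too few <pos>.
        if len(pos_texts) < min_pos:
            continue
        lines.append(" ".join([title] + pos_texts))
    return lines


if __name__ == "__main__":
    for line in titles_with_pos(SAMPLE):
        print(line)
```

With the default `min_pos=1`, which follows the stated rule of "one or more <pos>", the sketch drops only "coconut" and keeps "dasher red". Calling it with `min_pos=2` yields exactly the two lines shown in the desired output ("dancer red blue" and "rudolph red brown red"), so the threshold to use depends on which of the two is actually intended.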