Hi everyone,

I'm trying to extract the text between tags, but only under certain conditions. <title> and <pos> are both children of <page>, and neither is nested inside the other (i.e., they're siblings). Each <page> always has exactly one <title> and zero to five <pos> elements. I need to return the text of <title> followed by the text of every <pos> in that page, but only if the page has at least one <pos>. If there aren't any <pos>, the title shouldn't be returned at all. Here's a bit of the sample source:

<page>
<title>dasher</title>
<pos>red</pos>
</page>
<page>
<title>dancer</title>
<pos>red</pos>
<pos>blue</pos>
</page>
<page>
<title>coconut</title>
</page>
<page>
<title>rudolph</title>
<pos>red</pos>
<pos>brown</pos>
<pos>red</pos>
</page>

What I'd like to return is this:

dasher red
dancer red blue
rudolph red brown red

My code currently does almost exactly what I want, but the problem is that it returns every title, whether or not it has any <pos> siblings. This is what I have:

import os, sys

from bs4 import BeautifulSoup
file = open('sourcedata.xml')
fixed = open('ftemp.txt','w')

soup = BeautifulSoup(file, "lxml")

divTag = soup.find_all("page")
for tag in divTag:
    ttl = tag.find_all("title")
    allpos = tag.find_all('pos')
    mypos = ""
    for tag in ttl:
        mytitle = tag.text
        for tag in allpos:
            if len(tag.text) > 0:
                mypos += "    " + tag.text
            else:
                pass
        fixed.write(mytitle + " " + mypos + "\n")
fixed.close()

I know why it's returning every title: the write statement sits inside the "for tag in ttl" loop, which runs once for every title regardless of whether any <pos> was found. I don't know how to fix it, though. Moving the write statement elsewhere hasn't worked, either. If I move it inside the "for tag in allpos" loop, I get duplicates ("dasher red" / "dasher red blue"). I was thinking that if a <title> has no corresponding <pos>, I could delete that title, but I don't know how to do that. Can anyone suggest a solution?

Perhaps play with the .next_sibling and .next_siblings attributes of the title tag.
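As a rough illustration of that suggestion, here is a minimal sketch that walks .next_siblings for each title. It assumes Python 3, the built-in "html.parser" backend, and a shortened copy of the sample data; the helper name pos_after is made up for this example. Note that .next_siblings also yields the whitespace strings between tags, so the sketch keeps only real <pos> tags.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened copy of the sample data from the question.
xml = """\
<page>
<title>dancer</title>
<pos>red</pos>
<pos>blue</pos>
</page>
<page>
<title>coconut</title>
</page>"""

soup = BeautifulSoup(xml, "html.parser")


def pos_after(title):
    """Collect the text of every <pos> tag that follows `title` under the same parent."""
    return [
        sib.get_text()
        for sib in title.next_siblings
        # Skip the newline strings that sit between the tags.
        if isinstance(sib, Tag) and sib.name == "pos"
    ]


results = {t.get_text(): pos_after(t) for t in soup.find_all("title")}
print(results)
```

A title whose list comes back empty has no <pos> siblings and can simply be skipped when writing the output.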

I was thinking that if a <title> has no corresponding <pos>, delete that title, but I don't know how to do that. Can anyone suggest a solution?

A hint: call fetchNextSiblings() on each title tag. If it returns an empty list, decompose() that title tag.

xml = '''\
<page>
<title>dasher</title>
<pos>red</pos>
</page>
<page>
<title>dancer</title>
<pos>red</pos>
<pos>blue</pos>
</page>
<page>
<title>coconut</title>
</page>
<page>
<title>rudolph</title>
<pos>red</pos>
<pos>brown</pos>
<pos>red</pos>
</page>'''


from bs4 import BeautifulSoup

# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(xml, "html.parser")
title = soup.find_all('title')

for index, item in enumerate(title):
    print(index, item)
'''
0 <title>dasher</title>
1 <title>dancer</title>
2 <title>coconut</title>
3 <title>rudolph</title>
'''

# find_next_siblings() is the current name; fetchNextSiblings() is an older alias.
for index, item in enumerate(title):
    print(index, item.find_next_siblings())
'''
0 [<pos>red</pos>]
1 [<pos>red</pos>, <pos>blue</pos>]
2 []
3 [<pos>red</pos>, <pos>brown</pos>, <pos>red</pos>]
'''

# Destroy every title that has no sibling tags; decompose() returns None.
for index, item in enumerate(title):
    if not item.find_next_siblings():
        print(item.decompose())
'''None'''

# The decomposed tag is still in the list, but it is now empty.
for index, item in enumerate(title):
    print(index, item)
'''
0 <title>dasher</title>
1 <title>dancer</title>
2 <None></None>
3 <title>rudolph</title>
'''

Thanks for the suggestions. I tried something along the lines of what snippsat suggested last night, but clearly I wasn't getting it right. I kept getting an AttributeError telling me that there was no next sibling, even though soup.prettify() clearly showed that <title> and <pos> are siblings.

I was able to fix it in a much simpler way, though. I simply changed the write-out statement to this:

if len(mypos) > 0:
    fixed.write(mytitle + " " + mypos + "\n")
else:
    pass

Thanks for your suggestion, though. I have a lot to learn about Beautiful Soup!
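For reference, here is a minimal sketch of the whole task with that check folded in. It is not the original script: it assumes Python 3 and the "html.parser" backend, reads the sample from a string rather than sourcedata.xml, and uses a hypothetical function name, titles_with_pos. Collecting the <pos> values in a list makes "are there any?" a plain truthiness test.

```python
from bs4 import BeautifulSoup

sample = """\
<page>
<title>dasher</title>
<pos>red</pos>
</page>
<page>
<title>dancer</title>
<pos>red</pos>
<pos>blue</pos>
</page>
<page>
<title>coconut</title>
</page>
<page>
<title>rudolph</title>
<pos>red</pos>
<pos>brown</pos>
<pos>red</pos>
</page>"""


def titles_with_pos(markup):
    """Return one 'title pos pos ...' line per page that has at least one <pos>."""
    soup = BeautifulSoup(markup, "html.parser")
    lines = []
    for page in soup.find_all("page"):
        title = page.find("title")
        # Keep only <pos> tags that actually contain text.
        pos_values = [p.get_text() for p in page.find_all("pos") if p.get_text()]
        # Pages without any <pos> are skipped entirely.
        if title is not None and pos_values:
            lines.append(title.get_text() + " " + " ".join(pos_values))
    return lines


output_lines = titles_with_pos(sample)
for line in output_lines:
    print(line)
```

Writing to a file instead of printing would only change the last loop, for example by opening ftemp.txt in a with block and writing each line followed by "\n".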
