Returning only tags with certain siblings (Beautiful Soup)

Question

Afroula 0 Newbie Poster

10 Years Ago

Hi everyone,

I'm trying to extract text from between tags, but only under certain conditions. <title> and <pos> are both children of <page>, but neither one is nested inside the other (i.e., they're siblings). Each <page> always has one <title> and zero to five <pos> sections. What I need to do is return the text of <title> followed by the text of every <pos>, but only if there is at least one <pos>. If there aren't any <pos>, the title shouldn't be returned. Here's a bit of the sample source:

<page>
<title>dasher</title>
<pos>red</pos>
</page>
<page>
<title>dancer</title>
<pos>red</pos>
<pos>blue</pos>
</page>
<page>
<title>coconut</title>
</page>
<page>
<title>rudolph</title>
<pos>red</pos>
<pos>brown</pos>
<pos>red</pos>
</page>

What I'd like to return is this:

dasher red
dancer red blue
rudolph red brown red

My code currently does almost exactly what I want, but the problem is that it returns every title, whether or not it has any <pos> siblings. This is what I have:

import os, sys

from bs4 import BeautifulSoup
file = open('sourcedata.xml')
fixed = open('ftemp.txt','w')

soup = BeautifulSoup(file, "lxml")

divTag = soup.find_all("page")
for tag in divTag:
    ttl = tag.find_all("title")
    allpos = tag.find_all('pos')
    mypos = ""
    for tag in ttl:
        mytitle = tag.text
        for tag in allpos:
            if len(tag.text) > 0:
                mypos += "    " + tag.text
            else:
                pass
        fixed.write(mytitle + " " + mypos + "\n")
fixed.close()

I know why it's returning every title (the write statement is inside a for loop), but I don't know how to fix it. Moving it elsewhere hasn't worked, either. If I move the write statement inside the "for tag in allpos" loop, then I get duplicates ("dasher red" / "dasher red blue"). I was thinking that if a <title> has no corresponding <pos>, I could delete that title, but I don't know how to do that. Can anyone suggest a solution?

beautiful-soup nested-loop python scraping


Gribouillis 1,391 Programming Explorer

10 Years Ago

Perhaps play with the .next_siblings and .next_sibling attributes of the title tag.
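
As a rough sketch of that idea (the function name and the inline sample are illustrative, not from the thread), one could walk each title's .next_siblings, keep only the <pos> tags, and skip titles that have none. Note that .next_siblings also yields the whitespace strings between tags, so those have to be filtered out. html.parser is assumed here only to avoid depending on lxml.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

SAMPLE = """\
<page>
<title>dasher</title>
<pos>red</pos>
</page>
<page>
<title>dancer</title>
<pos>red</pos>
<pos>blue</pos>
</page>
<page>
<title>coconut</title>
</page>
<page>
<title>rudolph</title>
<pos>red</pos>
<pos>brown</pos>
<pos>red</pos>
</page>"""


def titles_with_pos(markup):
    """Return 'title pos pos ...' lines for titles that have <pos> siblings."""
    soup = BeautifulSoup(markup, "html.parser")
    lines = []
    for title in soup.find_all("title"):
        # .next_siblings yields strings (e.g. newlines) as well as tags,
        # so keep only Tag objects named "pos".
        pos_texts = [
            sib.get_text()
            for sib in title.next_siblings
            if isinstance(sib, Tag) and sib.name == "pos"
        ]
        if pos_texts:  # titles without any <pos> sibling are skipped
            lines.append(" ".join([title.get_text()] + pos_texts))
    return lines


for line in titles_with_pos(SAMPLE):
    print(line)
```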

Edited 10 Years Ago by Gribouillis


snippsat 661 Master Poster · Answer 1 · 2014-04-18T06:26:21+00:00

I was thinking that if a <title> has no corresponding <pos>, delete that title, but I don't know how to do that. Can anyone suggest a solution?

A hint: use fetchNextSiblings() (called find_next_siblings() in current bs4). If it returns an empty list, decompose() that title tag.

xml = '''\
<page>
<title>dasher</title>
<pos>red</pos>
</page>
<page>
<title>dancer</title>
<pos>red</pos>
<pos>blue</pos>
</page>
<page>
<title>coconut</title>
</page>
<page>
<title>rudolph</title>
<pos>red</pos>
<pos>brown</pos>
<pos>red</pos>
</page>'''


from bs4 import BeautifulSoup

# Name the parser explicitly so bs4 does not have to guess one.
soup = BeautifulSoup(xml, "html.parser")
title = soup.find_all('title')

for index, item in enumerate(title):
    print(index, item)
'''
0 <title>dasher</title>
1 <title>dancer</title>
2 <title>coconut</title>
3 <title>rudolph</title>
'''

# find_next_siblings() is the bs4 name for the older fetchNextSiblings().
for index, item in enumerate(title):
    print(index, item.find_next_siblings())
'''
0 [<pos>red</pos>]
1 [<pos>red</pos>, <pos>blue</pos>]
2 []
3 [<pos>red</pos>, <pos>brown</pos>, <pos>red</pos>]
'''

# Destroy every title that has no sibling tags after it.
for index, item in enumerate(title):
    if not item.find_next_siblings():
        print(item.decompose())
'''None'''

for index, item in enumerate(title):
    print(index, item)
'''
0 <title>dasher</title>
1 <title>dancer</title>
2 <None></None>
3 <title>rudolph</title>
'''
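
To connect this back to the original goal: once the titles without siblings are decomposed, a fresh find_all('title') no longer returns them, so the remaining titles can be written out directly. A minimal sketch, assuming html.parser and a shortened inline sample; find_next_siblings() is used here as the current bs4 name for fetchNextSiblings(), and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

xml = """\
<page>
<title>dancer</title>
<pos>red</pos>
<pos>blue</pos>
</page>
<page>
<title>coconut</title>
</page>"""

soup = BeautifulSoup(xml, "html.parser")

# Remove every <title> that has no <pos> sibling after it.
for title in soup.find_all("title"):
    if not title.find_next_siblings("pos"):
        title.decompose()

# Re-query the tree: decomposed titles are gone, so only titles
# with at least one <pos> remain.
lines = []
for title in soup.find_all("title"):
    pos_texts = [pos.get_text() for pos in title.find_next_siblings("pos")]
    lines.append(" ".join([title.get_text()] + pos_texts))

print("\n".join(lines))
```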

Afroula 0 Newbie Poster · Answer 2 · 2014-04-18T14:14:34+00:00

Thanks for the suggestions. I tried something along the lines of what snippsat suggested last night, but clearly I wasn't getting it right. I kept getting AttributeError telling me that there was no next sibling (despite soup.prettify showing clearly that <title> and <pos> are siblings).
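
The post does not show the failing code, so the cause can only be guessed at. One common source of this kind of AttributeError is that .next_sibling returns the whitespace string between two tags rather than the next tag, or None when nothing follows. A small sketch of that behaviour, assuming the built-in html.parser and an inline two-page sample:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

markup = """\
<page>
<title>dasher</title>
<pos>red</pos>
</page>
<page>
<title>coconut</title>
</page>"""

soup = BeautifulSoup(markup, "html.parser")
dasher, coconut = soup.find_all("title")

# The immediate sibling is the newline between </title> and <pos>,
# a NavigableString rather than a Tag.
immediate = dasher.next_sibling
print(repr(immediate), isinstance(immediate, NavigableString))

# find_next_sibling("pos") skips strings and returns the next <pos> tag,
# or None when there is none, so the result must be checked before use.
first_pos = dasher.find_next_sibling("pos")
missing = coconut.find_next_sibling("pos")
print(first_pos.get_text(), missing)
```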

I was able to fix it in a much simpler way, though. I simply changed the write-out statement to this:

if len(mypos) > 0:
    fixed.write(mytitle + " " + mypos + "\n")
else:
    pass

Thanks for your suggestion, though. I have a lot to learn about Beautiful Soup!
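
Putting that guard into the whole loop, the script might look like the sketch below. It is an illustration, not the poster's final code: the helper name is made up, the markup is passed in as a string and the output goes to any file-like object so the example runs without sourcedata.xml, and html.parser stands in for lxml.

```python
import io

from bs4 import BeautifulSoup


def write_titles_with_pos(markup, out):
    """Write 'title pos pos ...' for each <page> that has at least one <pos>."""
    soup = BeautifulSoup(markup, "html.parser")
    for page in soup.find_all("page"):
        title = page.find("title")
        # Collect the non-empty <pos> texts of this page only.
        pos_texts = [p.get_text() for p in page.find_all("pos") if p.get_text()]
        if title is None or not pos_texts:
            continue  # same effect as the len(mypos) > 0 guard
        out.write(title.get_text() + " " + " ".join(pos_texts) + "\n")


sample = (
    "<page><title>dasher</title><pos>red</pos></page>"
    "<page><title>coconut</title></page>"
    "<page><title>rudolph</title><pos>red</pos><pos>brown</pos><pos>red</pos></page>"
)
buffer = io.StringIO()
write_titles_with_pos(sample, buffer)
print(buffer.getvalue(), end="")
```

With real files, the same function could be called with the contents of the source file and an open output file, each opened in a with block so that both are closed automatically.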
