Help with Navigating BeautifulSoup Tree

kshw 3 Newbie Poster

14 Years Ago

Hi,
I'm using BeautifulSoup to parse HTML pages. I wrote a recursive function that traverses the parsed tree, extracts the NavigableStrings, appends them to a string, and then returns that string. The problem is that my recursion skills are weak. I know I'm re-initializing the Text string each time the function is called. How can I fix this so the function builds the complete string and returns it? Thanks

import re
import urllib2
from BeautifulSoup import BeautifulSoup, NavigableString

html = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']


def ParseContent(current):
    
    Text =""

    if hasattr(current,'contents'):
        for next in current.contents:
            ParseContent(next)

    if isinstance(current, NavigableString):
        print str(current)
        Text += str(current)
    return Text


soup = BeautifulSoup(''.join(html))
Page_Text = ParseContent(soup)
print "Text after function call: ", Page_Text
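
One possible fix, offered as a sketch rather than a definitive answer: in the function above, Text is local to each call and the value returned by the recursive call ParseContent(next) is thrown away, so only the outermost call's (empty) string comes back. Collecting what each recursive call returns and joining the pieces avoids that. To keep the example runnable without BeautifulSoup installed, it uses a hypothetical stand-in class Node that only mimics the .contents attribute, and it treats plain str values as the text leaves; with the real library you would keep the isinstance(current, NavigableString) test from the original code. The name parse_content is likewise just an illustrative rename.

```python
class Node(object):
    """Hypothetical stand-in for a parsed tag: it only holds child nodes."""
    def __init__(self, *contents):
        self.contents = list(contents)


def parse_content(current):
    # Text leaf: return it directly. With BeautifulSoup this check would be
    # isinstance(current, NavigableString), as in the original function.
    if isinstance(current, str):
        return current

    # Otherwise gather the text returned by each child instead of discarding it.
    parts = []
    for child in getattr(current, 'contents', []):
        parts.append(parse_content(child))
    return ''.join(parts)


# Hand-built tree mirroring the sample page (an assumption, not parser output).
tree = Node(
    Node(Node("Page title")),
    Node(
        Node("This is paragraph ", Node("one"), "."),
        Node("This is paragraph ", Node("two"), "."),
    ),
)

page_text = parse_content(tree)
print(page_text)  # Page titleThis is paragraph one.This is paragraph two.
```

The key change is that every call returns its own piece of text and the caller is responsible for combining the pieces, so no shared or global string is needed. If all you need is the concatenated text, BeautifulSoup 3 can also do the traversal for you with ''.join(soup.findAll(text=True)).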

python

