Hi,
I'm using BeautifulSoup to parse HTML pages. I wrote a recursive function that traverses the parsed tree, extracts the NavigableStrings, and appends them to a string, which it then returns. The problem is that my recursion skills are weak. I know I'm re-initializing the Text string every time the function is called. How can I fix this so the function builds the complete string and returns it? Thanks
import re
import urllib2
from BeautifulSoup import BeautifulSoup, NavigableString
html = ['<html><head><title>Page title</title></head>',
        '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
        '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
        '</html>']
def ParseContent(current):
    Text = ""
    if hasattr(current, 'contents'):
        for next in current.contents:
            ParseContent(next)
    if isinstance(current, NavigableString):
        print str(current)
        Text += str(current)
    return Text
soup = BeautifulSoup(''.join(html))
Page_Text = ParseContent(soup)
print "Text after function call: ", Page_Text
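
One possible way to fix this, as a rough sketch rather than a definitive answer: the recursive call ParseContent(next) returns a string, but that return value is thrown away, so only the text of the outermost node ever reaches the caller. Collecting what each recursive call returns and joining the pieces should give the complete string. The Node class below is a hypothetical stand-in for a BeautifulSoup tag (it only has .contents), so the example runs without the library. The names parse_content and is_text are also made up for illustration.

```python
class Node(object):
    """Hypothetical stand-in for a BeautifulSoup tag: it only has .contents."""
    def __init__(self, *contents):
        self.contents = list(contents)


def parse_content(current, is_text=lambda n: isinstance(n, str)):
    """Return all text found under `current`, in document order."""
    # Base case: a text leaf contributes its own text.
    if is_text(current):
        return str(current)
    # Recursive case: keep what each child returns instead of discarding it.
    parts = []
    for child in getattr(current, 'contents', []):
        parts.append(parse_content(child, is_text))
    return ''.join(parts)


# Assumed tree shape, mirroring the html in the question:
# html -> head -> title, and body -> two paragraphs with a bold word each.
tree = Node(
    Node(Node('Page title')),
    Node(
        Node('This is paragraph ', Node('one'), '.'),
        Node('This is paragraph ', Node('two'), '.'),
    ),
)

page_text = parse_content(tree)
print(page_text)
```

With the real library, the same function should work on the soup object if the leaf test is swapped for the one from the question, e.g. parse_content(soup, is_text=lambda n: isinstance(n, NavigableString)). Building a list and joining once at the end also avoids repeated string concatenation on large pages.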