I have the source of a web page that reports the weather and I want to extract the data. My only remaining hurdle is to remove all the HTML formatting tags, meaning everything inside and including the <> brackets. I have the page source stored as a string, so maybe the string module would help? I don't know.

Discussion span: 8 years. Last post by bumsfeld.

It would be nice to have an actual example of the HTML source code or at least the web site's URL. As Jeff said, the BeautifulSoup module (an HTML scraper) is great, but it has a steep learning curve. An interesting project, so let us know of any progress.


The tags are just bits in the source like <b> and </b> that make the text in between bold.
I had an idea of how to do it, though. How's this?

code = "source of webpage goes here"
count = code.count('>')
while count:
    start = code.rfind("<")
    end = code.rfind(">")
    code[start:end] = ''

I think that ought to do it!


Sorry for double posting, but I have finished my code. I removed everything inside the brackets (and the brackets themselves) with the following code:

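count = data.count('<')

while count:
    start = data.find('<')
    # look for the closing bracket after the opening one
    end = data.find('>', start)
    rem = data[start:end+1]
    data = data.replace(rem, '', 1)
    # one tag removed; without this the loop never ends
    count -= 1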

The other code I tried didn't work because of the line where I wrote:

data[start:end] = ''

That raised a TypeError, because Python strings are immutable and do not support slice assignment. Anyhow, I have used this to make a script that finds and returns my local weather forecast for the next few days.
Thanks for all the help!
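Since strings cannot be changed in place, the usual workaround is to build a new string from the pieces you want to keep. Here is a minimal sketch of that idea, written for Python 3. The function name strip_tags and the sample string are made up for illustration, and this is a naive bracket scan, not a real HTML parser (it will be fooled by a '>' inside an attribute value, for example).

```python
def strip_tags(data):
    """Return data with every <...> span removed (naive bracket scan)."""
    pieces = []   # the text fragments found between tags
    pos = 0       # where the next search starts
    while True:
        start = data.find('<', pos)
        if start == -1:
            # no more tags: keep the rest of the string
            pieces.append(data[pos:])
            break
        end = data.find('>', start)
        if end == -1:
            # an unclosed '<': keep the rest as-is rather than guess
            pieces.append(data[pos:])
            break
        pieces.append(data[pos:start])   # text before the tag
        pos = end + 1                    # continue after the tag
    return ''.join(pieces)

# hypothetical sample, not taken from a real weather page
sample = '<p>Fine and <b>sunny</b>.<br>Max 24</p>'
print(strip_tags(sample))
```

Collecting the fragments in a list and joining them once avoids rescanning the string from the beginning for every tag, which is what the find/replace loop does.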


Python also has the HTMLParser module, which can help a lot here:

# extract a specified text from web page HTML source code

import urllib2
import HTMLParser
import cStringIO   # acts like file in memory

class HTML2Text(HTMLParser.HTMLParser):
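    """
    extract text from HTML code basically using inherited
    class HTMLParser and some additional custom methods
    """
    def __init__(self):
        # the base class needs its own setup before feed() can be used
        HTMLParser.HTMLParser.__init__(self)
        self.output = cStringIO.StringIO()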

    def get_text(self):
        """get the text output"""
        return self.output.getvalue()

    def handle_starttag(self, tag, attrs):
        """handle <br> tags"""
        if tag == 'br':
            # need to put one new line in
            self.output.write('\n')

    def handle_data(self, data):
        """normal text"""
        self.output.write(data)

    def handle_endtag(self, tag):
        if tag == 'p':
            # end of paragraph add newline
            self.output.write('\n')

def extract(html, sub1, sub2):
    """
    extract string from text between first
    occurrences of substrings sub1 and sub2
    """
    return html.split(sub1, 1)[-1].split(sub2, 1)[0]

# you may need to update this web page for your needs
url = 'http://www.bom.gov.au/products/IDN10060.shtml#HUN'

# get the raw HTML code
try:
    file_handle = urllib2.urlopen(url)
    html1 = file_handle.read()
    print '-'*70
    print 'Data from URL =', url
except IOError:
    print 'Cannot open URL %s for reading' % url
    html1 = 'error!'
#print '-'*70; print html1  # testing

# extract code between sub1 and sub2
# you may need to update sub1 and sub2 for your needs
sub1 = 'www.bom.gov.au/weather/nsw</a></P><P>'
sub2 = 'The next routine forecast'
html2 = extract(html1, sub1, sub2)

#print '-'*70; print html2  # testing

# remove HTML tags to give clean text
p = HTML2Text()
# feed the extracted HTML to the parser, this fills p.output
p.feed(html2)
text = p.get_text()
print '-'*70
print text
print '-'*70

You can process the text further if you need to.
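The code above is for Python 2. If you are on Python 3, the HTMLParser class lives in html.parser and cStringIO is replaced by io.StringIO. Here is a rough sketch of the same tag-stripping class for Python 3. The helper name html_to_text and the sample HTML are made up for illustration, and the page download is left out so the example runs without a network connection.

```python
from html.parser import HTMLParser
from io import StringIO   # acts like a file in memory


class HTML2Text(HTMLParser):
    """Collect the text of an HTML fragment, dropping the tags."""

    def __init__(self):
        super().__init__()
        self.output = StringIO()

    def get_text(self):
        """get the text output"""
        return self.output.getvalue()

    def handle_starttag(self, tag, attrs):
        # tag names arrive in lower case, so <BR> also matches
        if tag == 'br':
            self.output.write('\n')

    def handle_data(self, data):
        """normal text between tags"""
        self.output.write(data)

    def handle_endtag(self, tag):
        # end of paragraph, add a newline
        if tag == 'p':
            self.output.write('\n')


def html_to_text(html):
    """Hypothetical convenience wrapper: feed html and return the text."""
    parser = HTML2Text()
    parser.feed(html)
    parser.close()   # flush any text still buffered by the parser
    return parser.get_text()


# hypothetical sample, not taken from the real forecast page
sample = '<P>Fine.<br>Max 24</P><P>Cloudy &amp; cool.</P>'
print(html_to_text(sample))
```

In Python 3 the parser also converts character references such as &amp;amp; to plain characters before calling handle_data, so the output of the sample contains a bare ampersand.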
