Hi
I have the source of a webpage that reports the weather and I want to extract the data. My only hurdle left is to remove all the HTML formatting tags, that is, everything inside and including the <> brackets. I have the page source stored as a string, so maybe the string module can do it? I don't know.


It would be nice to have an actual example of the HTML source code, or at least the website's URL. As Jeff said, the BeautifulSoup module (an HTML scraper) is great, but it has a steep learning curve. An interesting project, so let us know of any progress.

OK, so give an idea of what the tag looks like, and we might be able to help.

Jeff

Okay.
The tags are just bits in the source like <b> and </b> that make the text in between bold.
I had an idea of how to do it though. How's this?

code = "source of webpage goes here"
count = code.count('>')
while count:
    start = code.rfind("<")
    end = code.rfind(">")
    code[start:end] = ''
    count-=1

I think that ought to do it!

Sorry for double posting, but I have finished my code and have taken out all the stuff inside the brackets with the following:

count = data.count('<')

while count:
    start = data.find('<')
    end = data.find('>')
    rem = data[start:end+1]
    data = data.replace(rem,'',1)
    count-=1

The other code I tried didn't work because of the bit where I wrote:

code[start:end] = ''

That gave me an error, because strings are immutable and cannot be changed in place. Anyhow, I have used this to make a script that finds and returns my local weather forecast for the next few days.
Thanks for all the help!
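
For anyone following along: since a string cannot be modified in place, the usual workaround is to build a new string from the pieces on either side of each tag. Below is a minimal sketch of the same find-and-cut idea written that way. The function name strip_tags and the sample string are made up for illustration, and the sketch assumes every '<' starts a tag, which may not hold for pages containing scripts or a bare '<' in the text.

```python
def strip_tags(data):
    """Remove everything from each '<' up to and including the next '>'."""
    while True:
        start = data.find('<')
        if start == -1:
            break                      # no tags left
        end = data.find('>', start)    # look for '>' only after the '<'
        if end == -1:
            break                      # unclosed '<': leave the rest alone
        # strings are immutable, so join the two halves into a new string
        data = data[:start] + data[end + 1:]
    return data


sample = "Today: <b>sunny</b>, max <i>24</i>"   # made-up sample text
print(strip_tags(sample))
```

Searching for '>' starting at the position of '<' avoids cutting the wrong span when a stray '>' appears earlier in the text.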

Python also has an HTMLParser module that can help you a lot:

# extract a specified text from web page HTML source code

import urllib2
import HTMLParser
import cStringIO   # acts like file in memory

class HTML2Text(HTMLParser.HTMLParser):
    """
    extract text from HTML code basically using inherited
    class HTMLParser and some additional custom methods
    """
    def __init__(self):
        HTMLParser.HTMLParser.__init__(self)
        self.output = cStringIO.StringIO()

    def get_text(self):
        """get the text output"""
        return self.output.getvalue()

    def handle_starttag(self, tag, attrs):
        """handle <br> tags"""
        if tag == 'br':
            # need to put one new line in
            self.output.write('\n')

    def handle_data(self, data):
        """normal text"""
        self.output.write(data)

    def handle_endtag(self, tag):
        """handle </p> tags"""
        if tag == 'p':
            # end of paragraph add newline
            self.output.write('\n')


def extract(html, sub1, sub2):
    """
    extract string from text between first
    occurrences of substrings sub1 and sub2
    """
    return html.split(sub1, 1)[-1].split(sub2, 1)[0]


# you may need to update this web page for your needs
url = 'http://www.bom.gov.au/products/IDN10060.shtml#HUN'

# get the raw HTML code
try:
    file_handle = urllib2.urlopen(url)
    html1 = file_handle.read()
    file_handle.close()
    print '-'*70
    print 'Data from URL =', url
except IOError:
    print 'Cannot open URL %s for reading' % url
    html1 = 'error!'
  
#print '-'*70; print html1  # testing

# extract code between sub1 and sub2
# you may need to update sub1 and sub2 for your needs
sub1 = 'www.bom.gov.au/weather/nsw</a></P><P>'
sub2 = 'The next routine forecast'
html2 = extract(html1, sub1, sub2)

#print '-'*70; print html2  # testing

# remove HTML tags to give clean text
p = HTML2Text()
p.feed(html2)
text = p.get_text()
print '-'*70
print text
print '-'*70

You can process the text further if you need to.
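
The listing above is written for Python 2. In Python 3 the parser class lives in html.parser, so the same idea might look like the sketch below. The sample HTML string and the two marker substrings are invented for illustration and are not taken from the real bom.gov.au page; fetching the page is left out so the example runs offline.

```python
from html.parser import HTMLParser
import io


class HTML2Text(HTMLParser):
    """Collect the text of an HTML fragment, dropping the tags."""

    def __init__(self):
        super().__init__()
        self.output = io.StringIO()   # in-memory text buffer

    def get_text(self):
        return self.output.getvalue()

    def handle_starttag(self, tag, attrs):
        if tag == 'br':
            self.output.write('\n')   # line break becomes a newline

    def handle_data(self, data):
        self.output.write(data)       # plain text between tags

    def handle_endtag(self, tag):
        if tag == 'p':
            self.output.write('\n')   # end of paragraph becomes a newline


def extract(html, sub1, sub2):
    """Return the text between the first occurrences of sub1 and sub2."""
    return html.split(sub1, 1)[-1].split(sub2, 1)[0]


# made-up page source standing in for the downloaded HTML
sample = ("<html><body><h1>Forecast</h1>"
          "<p>Monday: <b>fine</b>.<br>Tuesday: showers.</p>"
          "<p>The next routine forecast is at 4pm.</p></body></html>")

fragment = extract(sample, "</h1>", "The next routine forecast")
parser = HTML2Text()
parser.feed(fragment)
parser.close()
text = parser.get_text()
print(text)
```

Under these assumptions only the import, the class setup, and the print call differ from the Python 2 version; the handler methods are unchanged.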
