Removing everything in between '<>'

Question

lllllIllIlllI 178 Veteran Poster

17 Years Ago

Hi
I have the source of a web page that tells the weather and I want to extract the data. My only hurdle left is to remove all the HTML formatting marks, meaning everything inside and including the <> brackets. I have the web page source stored as a string, so maybe the string module could do it. I don't know.

python

All 7 Replies

jrcagle 77 Practically a Master Poster

17 Years Ago

Well, you might consider the BeautifulSoup module.

link:

http://www.crummy.com/software/BeautifulSoup/

It has the capability to extract tags and values relatively easily.

Jeff

bumsfeld 413 Nearly a Posting Virtuoso

17 Years Ago

Python also has the HTMLParser module, which can help you a lot:

# extract a specified text from web page HTML source code

import urllib2
import HTMLParser
import cStringIO   # acts like file in memory

class HTML2Text(HTMLParser.HTMLParser):
    """
    extract text from HTML code basically using inherited
    class HTMLParser and some additional custom methods
    """
    def __init__(self):
        HTMLParser.HTMLParser.__init__(self)
        self.output = cStringIO.StringIO()

    def get_text(self):
        """get the text output"""
        return self.output.getvalue()

    def handle_starttag(self, tag, attrs):
        """handle <br> tags"""
        if tag == 'br':
            # need to put one new line in
            self.output.write('\n')

    def handle_data(self, data):
        """normal text"""
        self.output.write(data)

    def handle_endtag(self, tag):
        if tag == 'p':
            # end of paragraph add newline
            self.output.write('\n')


def extract(html, sub1, sub2):
    """
    extract string from text between first
    occurrences of substrings sub1 and sub2
    """
    return html.split(sub1, 1)[-1].split(sub2, 1)[0]


# you may need to update this web page for your needs
url = 'http://www.bom.gov.au/products/IDN10060.shtml#HUN'

# get the raw HTML code
try:
    file_handle = urllib2.urlopen(url)
    html1 = file_handle.read()
    file_handle.close()
    print '-'*70
    print 'Data from URL =', url
except IOError:
    print 'Cannot open URL %s for reading' % url
    html1 = 'error!'
  
#print '-'*70; print html1  # testing

# extract code between sub1 and sub2
# you may need to update sub1 and sub2 for your needs
sub1 = 'www.bom.gov.au/weather/nsw</a></P><P>'
sub2 = 'The next routine forecast'
html2 = extract(html1, sub1, sub2)

#print '-'*70; print html2  # testing

# remove HTML tags to give clean text
p = HTML2Text()
p.feed(html2)
text = p.get_text()
print '-'*70
print text
print '-'*70

You can process the text further if you need to.
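
The code above targets Python 2 (urllib2, cStringIO and the print statement). As a rough sketch only, the same tag-stripping idea could look like this on Python 3, where the parser lives in html.parser and io.StringIO replaces cStringIO. The helper name html_to_text and the sample string are made up for illustration and are not part of the original post; the download step is left out so the example runs offline.

```python
from html.parser import HTMLParser
from io import StringIO


class HTML2Text(HTMLParser):
    """Collect the text content of an HTML fragment, dropping the tags."""

    def __init__(self):
        super().__init__()
        self.output = StringIO()

    def handle_starttag(self, tag, attrs):
        # <br> carries no text of its own, so emit a line break for it
        if tag == "br":
            self.output.write("\n")

    def handle_endtag(self, tag):
        # end of a paragraph: start a new line
        if tag == "p":
            self.output.write("\n")

    def handle_data(self, data):
        # plain text between tags
        self.output.write(data)

    def get_text(self):
        return self.output.getvalue()


def html_to_text(html):
    """Hypothetical convenience wrapper: feed the HTML and return the text."""
    parser = HTML2Text()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.get_text()


# made-up sample, not taken from the real weather page
sample = "<p>Fine and <b>sunny</b>.</p>Max 21<br>Min 9"
print(html_to_text(sample))
```

On Python 3 the page itself would be fetched with urllib.request.urlopen instead of urllib2.urlopen, and the bytes it returns need decoding before they are fed to the parser.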

sneekula 969 Nearly a Posting Maven · Answer 1 · 2008-05-25T20:19:09+00:00

It would be nice to have an actual example of the HTML source code or at least the web site's URL. As Jeff said, the BeautifulSoup module (an HTML scraper) is great, but it has a steep learning curve. It's an interesting project, so let us know of any progress.

lllllIllIlllI 178 Veteran Poster · Answer 2 · 2008-05-26T03:16:55+00:00

Yeah, the soup module looks a bit difficult. I might continue trying to find a way to change it in its string format. Oh, and the URL is http://www.bom.gov.au/products/IDN10060.shtml#HUN
It's a local weather forecast for the HUNTER that I am trying to extract.

jrcagle 77 Practically a Master Poster · Answer 3 · 2008-05-26T06:08:41+00:00

OK, so give an idea of what the tag looks like, and we might be able to help.

Jeff

lllllIllIlllI 178 Veteran Poster · Answer 4 · 2008-05-26T13:31:10+00:00

Okay.
The tags are just bits in the source like <b> and </b> that make the text in between bold.
I had an idea of how to do it though. How's this?

code = "source of webpage goes here"
count = code.count('>')
while count:
    start = code.rfind("<")
    end = code.rfind(">")
    code[start:end] = ''
    count-=1

I think that ought to do it!
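
A note on the snippet above: Python strings are immutable, so assigning to a slice of code raises a TypeError. One possible way to keep the same right-to-left idea is to rebuild the string from the pieces on either side of each tag. This is only a sketch; the function name strip_tags_from_right and the sample text are invented for the example.

```python
# Strings cannot be changed in place, so slice assignment fails:
text = "<b>Hunter</b>"
try:
    text[0:3] = ""
except TypeError as exc:
    print("slice assignment failed:", exc)


def strip_tags_from_right(code):
    """Remove every <...> span, working from the end of the string."""
    while True:
        start = code.rfind("<")
        if start == -1:
            break  # no '<' left, nothing more to remove
        end = code.find(">", start)
        if end == -1:
            break  # last '<' has no closing '>': stop and leave the rest as is
        # build a new string from the parts before and after the tag
        code = code[:start] + code[end + 1:]
    return code


print(strip_tags_from_right("<b>Hunter</b>: fine"))
```

Searching for '>' only from start onward keeps each '<' paired with the '>' that follows it, which using rfind for both ends does not guarantee.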

lllllIllIlllI 178 Veteran Poster · Answer 5 · 2008-05-26T15:05:40+00:00

Sorry for double posting, but I have finished my code. I took everything inside the brackets out with the following code:

count = data.count('<')

while count:
    start = data.find('<')
    end = data.find('>')
    rem = data[start:end+1]
    data = data.replace(rem,'',1)
    count-=1

The other code I tried didn't work because of the bit where I said:

data[start:end] = ''

That got me an error about strings being immutable. Anyhow, I have used this to make a script that finds and returns my local weather forecast for the next few days.
Thanks for all the help!
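
For comparison, here is a shorter alternative that was not proposed in the thread: a regular expression that deletes anything from a '<' up to the next '>'. Treat it as a sketch under the assumption that the markup is as simple as the <b> and </b> tags described above; it is not a full HTML parser and can be fooled by comments, scripts, or a '>' inside an attribute value. The names TAG_RE and strip_tags are made up for the example.

```python
import re

# '<', then any run of characters that are not '>', then '>'
TAG_RE = re.compile(r"<[^>]*>")


def strip_tags(data):
    """Return data with every <...> span removed."""
    return TAG_RE.sub("", data)


# made-up sample in the style of the forecast page
print(strip_tags("<P>Fine. <b>Max 21</b></P>"))
```

One difference from the find/replace loop: if a stray '>' appears in the text before the first '<' (for example "5 > 3 <b>ok</b>"), the loop computes an empty slice and removes nothing, whereas the pattern still strips the tags.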
