I have the source of a web page that gives the weather, and I want to extract the data. My only hurdle left is removing all the HTML formatting tags, that is, everything inside and including the <> brackets. I have the web page source stored as a string, so maybe the string module would help? I don't know.
It would be nice to have an actual example of the HTML source code, or at least the web site's URL. As Jeff said, the BeautifulSoup module (an HTML scraper) is great, but it has a steep learning curve. It's an interesting project, so let us know how you progress.
Yeah, the soup module looks a bit difficult. I might keep trying to find a way to do it on the plain string. Oh, and the URL is http://www.bom.gov.au/products/IDN10060.shtml#HUN
It's the local weather forecast for the HUNTER that I am trying to extract.
Sorry for double posting, but I have finished my code, and I have taken out all the stuff inside the brackets with the following code:
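# remove the <...> tags one at a time, once per '<' found
count = data.count('<')
for i in range(count):
    start = data.find('<')
    end = data.find('>', start)  # first '>' after this '<'
    rem = data[start:end+1]
    data = data.replace(rem, '', 1)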
The other code I tried didn't work because of the bit where I said:
data[start:end] = ''
That got me an error about strings being immutable. Anyhow, I have used this to make a script that finds and returns my local weather forecast for the next few days.
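For anyone hitting the same error: a string cannot be changed in place, so slice assignment raises a TypeError, but you can build a new string from the pieces on either side of the tag. Here is a minimal sketch of that idea. The function name strip_tags and the sample string are made up for illustration, and it assumes no tag contains a literal '>' inside an attribute value.

```python
def strip_tags(data):
    """Return data with every <...> span removed (naive, no real HTML parsing)."""
    while True:
        start = data.find('<')
        if start == -1:
            break                    # no more tags
        end = data.find('>', start)
        if end == -1:
            break                    # unclosed '<', leave the rest alone
        # strings are immutable, so build a new one from the two halves
        data = data[:start] + data[end + 1:]
    return data


sample = '<p>Fine. <b>Top 25</b></p>'   # hypothetical input

try:
    sample[0:3] = ''                 # the slice assignment that failed
except TypeError as err:
    print(err)

print(strip_tags(sample))
```

This does the same job as the replace() loop, just without searching the string a second time for each tag.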
Thanks for all the help!
Python also has an HTMLParser module that can help you muchly:
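# extract a specified text from web page HTML source code
import urllib2
import HTMLParser
import cStringIO # acts like file in memory

class HTML2Text(HTMLParser.HTMLParser):
    """extract text from HTML code basically using inherited
    class HTMLParser and some additional custom methods"""
    def __init__(self):
        HTMLParser.HTMLParser.__init__(self)
        self.output = cStringIO.StringIO()

    def get_text(self):
        """get the text output"""
        return self.output.getvalue()

    def handle_starttag(self, tag, attrs):
        """handle <br> tags"""
        if tag == 'br':
            # need to put one new line in
            self.output.write('\n')

    def handle_data(self, data):
        self.output.write(data)

    def handle_endtag(self, tag):
        if tag == 'p':
            # end of paragraph add newline
            self.output.write('\n')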
def extract(html, sub1, sub2):
    """extract string from text between first
    occurrences of substrings sub1 and sub2"""
    return html.split(sub1, 1)[-1].split(sub2, 1)[0]
# you may need to update this web page for your needs
url = 'http://www.bom.gov.au/products/IDN10060.shtml#HUN'
# get the raw HTML code
try:
    file_handle = urllib2.urlopen(url)
    html1 = file_handle.read()
    file_handle.close()
    print 'Data from URL =', url
except IOError:
    print 'Cannot open URL %s for reading' % url
    html1 = 'error!'
#print '-'*70; print html1 # testing
# extract code between sub1 and sub2
# you may need to update sub1 and sub2 for your needs
sub1 = 'www.bom.gov.au/weather/nsw</a></P><P>'
sub2 = 'The next routine forecast'
html2 = extract(html1, sub1, sub2)
#print '-'*70; print html2 # testing
# remove HTML tags to give clean text
p = HTML2Text()
p.feed(html2)
text = p.get_text()
print text
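If you want to try the idea without hitting the BOM site, here is a small offline sketch of the same two steps: cut out the part between two marker strings, then let the parser drop the tags. The class name TextOnly and the sample HTML are made up for this example, and it collects the text in a list instead of cStringIO so that it should run on both Python 2 and Python 3.

```python
try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2


class TextOnly(HTMLParser):
    """Collect text only, adding a newline for <br> and </p>."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'br':
            self.parts.append('\n')

    def handle_data(self, data):
        self.parts.append(data)

    def handle_endtag(self, tag):
        if tag == 'p':
            self.parts.append('\n')

    def get_text(self):
        return ''.join(self.parts)


def extract(html, sub1, sub2):
    """Return the text between the first sub1 and the following sub2."""
    return html.split(sub1, 1)[-1].split(sub2, 1)[0]


# made-up sample standing in for the downloaded page
sample = ('<html><body><P>header</P><P>HUNTER<br>Fine and sunny.</P>'
          '<P>The next routine forecast will be issued later.</P></body></html>')

fragment = extract(sample, 'header</P>', 'The next routine forecast')
parser = TextOnly()
parser.feed(fragment)
parser.close()
text = parser.get_text()
print(text)
```

The parser lowercases tag names before calling the handlers, which is why comparing against 'br' and 'p' still works when the page uses <P> and <BR>.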