I've been tearing my hair out for 2 days over this; hopefully someone here can help me. I'm trying to scrape the price data off the following webpage:

http://www.morningstar.co.uk/UK/snapshot/snapshot.aspx?lang=en-GB&id=F0GBR04S4X

The value I want currently stands at 6.19 (i.e. the NAV value on the right hand side).

I currently have a working macro written in VBA in Excel that uses the following regular expression to do this:

(GBP).\d{1,2}[.]\d\d

but for some reason I can't get this to work in Python, and I want to move to Python for a few reasons I won't go into here. (I repeat this for various unit trusts, hence the {1,2} bit.)

Below is a Python script I've written to download the webpage contents and then prettify them using Beautiful Soup. If I don't do this, the encoding of the webpage is difficult to decipher.

After 2 days I still can't get a Python-compatible regular expression to grab this data. I also use the left() and right() functions in VBA to remove any whitespace and text characters from the resulting string; any ideas on how to do that in Python would be most gratefully received!

How do I grab the 6.19 from this page (or whatever the price is when you look!)?

#!/usr/bin/python

import re
import urllib
import string
from BeautifulSoup import BeautifulSoup  #requires python-beautifulsoup package
# documentation = http://www.crummy.com/software/BeautifulSoup/documentation.html#Quick%20Start

#pattern = '(>GBP).\d{1,2}[.]\d\d'  #this is the VBA regex pattern that works in MS Excel
pattern = r'\d{1,2}[.]\d\d'
urladdress = "http://www.morningstar.co.uk/UK/snapshot/snapshot.aspx?lang=en-GB&id=F0GBR04S4X"

try:
    #get data from web into one string
    url = urllib.urlopen(urladdress)
    htmltext = url.readlines()
    url.close()
    
    #Beautiful Soup bit
    soup = BeautifulSoup(''.join(htmltext))
    soup = soup.prettify()
    
    #use regular expression to search through for price using above pattern
    price = re.search(pattern,  soup)
    
    if price is None:
        print 'no result'
    else:
        print price.group(0)

except StandardError, e:
    print str(e)

I also use the left() and right() functions in VBA to remove any whitespace and text characters from the resulting string; any ideas on how to do that in Python would be most gratefully received!

Use strip() like so:

>>> padded = '             foobar          '
>>> padded1 = 'foo  '
>>> padded2 = '   foo\n'
>>> padded3 = '\tfoo\tbar\n'
>>> padded.strip()
'foobar'
>>> padded1.strip()
'foo'
>>> padded2.strip()
'foo'
>>> padded3.strip()
'foo\tbar'
>>>

As you can see it removes leading and trailing white space.

When I tried your code it grabbed the YTD trailing return...

I just made a minor modification to your regex and got the result. However, there's a strange symbol appearing instead of the space or tab or whatever is actually there on the page:

pattern = 'GBP.*\d{1,2}[.]\d\d'

Also, if you modify the pattern to 'GBP.*(\d{1,2}[.]\d\d)' you can use price.group(1) to pick out only the numbers, but I'm sure you were aware of that.
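To see what the capturing group actually returns, here is a small sketch run against a made-up string rather than the live page (the `sample` markup and the two-digit price are assumptions for illustration). It also shows a possible catch with the greedy `.*`: on a price with two digits before the decimal point it can swallow the first digit, so a non-greedy `\D*?` may be the safer choice.

```python
import re

# Made-up stand-in for one line of the prettified page; the real markup may differ.
sample = u"<td>NAV</td><td>GBP\u00a012.34</td>"

# Pattern from the post above: greedy ".*" followed by a capturing group.
greedy = re.search(r"GBP.*(\d{1,2}[.]\d\d)", sample)
# Hypothetical variant: skip only non-digit characters, as few as possible.
lazy = re.search(r"GBP\D*?(\d{1,2}[.]\d\d)", sample)

print(repr(greedy.group(0)))  # the whole match, "GBP" and separator included
print(greedy.group(1))        # the greedy ".*" has swallowed the leading "1"
print(lazy.group(1))          # both digits before the decimal point survive
print(lazy.groups())          # a tuple of every captured group, not a string
print(float(lazy.group(1)))   # convert once the number has been isolated
```

Note that group(1) returns the captured text itself, while groups() returns a tuple of all captures, so group(1) is the one to pass to float().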

Thanks, I was aware of strip() to remove whitespace, but it can't remove other characters such as the "GBP" in this case. Python doesn't appear to have a direct equivalent of VB's left() or right()?

If the regex here worked I'd end up with "GBP 6.19". In VB I could do CSng(Right("GBP 6.19", 4)) and job done. Not so easy in Python, it would appear?
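For what it's worth, strip(), lstrip() and rstrip() do take an optional argument naming the characters to remove, so they are not limited to whitespace. A minimal sketch, assuming the matched text really is "GBP 6.19" with an ordinary space (the variable names are made up for illustration):

```python
raw = "GBP 6.19"

# lstrip() with an argument removes any leading characters found in that set.
price_text = raw.lstrip("GBP ")
price = float(price_text)  # rough counterpart of VB's CSng()
print(price_text)
print(price)

# Caveat: the argument is a set of characters, not a literal prefix,
# so the letters are stripped in any order and any quantity.
scrambled = "PBG  6.19".lstrip("GBP ")
print(scrambled)
```

Because the argument is treated as a set, this is only safe when the part you want to keep cannot start with one of those characters, which holds for a number following a currency code.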

Yes, I noticed the strange symbol between the GBP and the data as well. '(GBP)\s' wouldn't find anything but '(GBP).' would find the first occurrence of "GBP". Very strange!

I'll try your modified pattern now, and no, I wasn't aware of the .group(1) bit, many thanks for that.
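One possible explanation for the strange symbol, offered as a guess since the page source isn't shown here: the separator may be a non-breaking space (&nbsp;, U+00A0) rather than a plain space. Encoded as UTF-8 it becomes two bytes, which would explain why '\s' finds nothing, why '.' matches, and why a single '.' followed by a digit fails. The sketch below reproduces that behaviour on a made-up string.

```python
import re

# Assumption: "GBP" and the number are separated by a non-breaking space.
text = u"GBP\u00a06.19"
encoded = text.encode("utf-8")  # the separator becomes the two bytes C2 A0

# On raw bytes, \s only covers ASCII whitespace, so this finds nothing.
bytes_ws = re.search(br"GBP\s", encoded)
# "." matches any byte, so it matches the first byte of the separator.
bytes_dot = re.search(br"GBP.", encoded)
# One "." is not enough to reach the digit: a second byte sits in between.
bytes_dot_digit = re.search(br"GBP.\d", encoded)
# On decoded text, \s with re.UNICODE does treat U+00A0 as whitespace.
text_ws = re.search(r"GBP\s(\d{1,2}[.]\d\d)", text, re.UNICODE)

print(bytes_ws)
print(bytes_dot is not None)
print(bytes_dot_digit)
print(text_ws.group(1))
```

If that guess is right, decoding the page to Unicode before searching, or skipping the separator with something like `\D*?`, would sidestep the problem.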

In vb I could do CSng(Right("GBP 6.19", 4)) and job done. Not so easy in python it would appear?

Not so easy? I tend to disagree...

>>> "GBP 6.19"[-4:]
'6.19'
>>> "GBP 6.19"[:3]
'GBP'
>>> "GBP 6.19".split()[1]
'6.19'
>>>

If you want more info on the bracket syntax, it's called slicing, and it works on any built-in Python sequence type (strings, lists, tuples). Try searching the internets for 'python slicing'.
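If you miss the VB names, slicing is easy to wrap. The helpers below are hypothetical conveniences, not anything built into Python; they just give the slices familiar names.

```python
def left(text, count):
    """Return the first `count` characters, like VB's Left()."""
    return text[:count]


def right(text, count):
    """Return the last `count` characters, like VB's Right()."""
    # text[-0:] would return the whole string, so handle zero explicitly.
    return text[-count:] if count > 0 else ""


print(left("GBP 6.19", 3))
print(right("GBP 6.19", 4))
print(float(right("GBP 6.19", 4)))  # rough counterpart of CSng(Right(..., 4))
```

A fixed width of 4 only works for single-digit prices such as 6.19; for prices like 12.34 the split() approach shown above, or a regex capture group, does not depend on the length.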

Your regex pattern coupled with .group(1) works beautifully and I thank you.

I stand corrected on slicing. I'd never come across it despite googling for Python alternatives to left() and right(), so thanks for the pointer.
