'''
In Python 3.4, I'm attempting to parse and print one line (actually a number) from the downloaded Yahoo source code, using a regex to pull the number located at the (.*?). I've tried everything I can think of to get this to work. I expect the problem is my coding somehow. Any help appreciated!! :)
'''

pbr = re.search(r'(Price\/Book (mrq):<\/td><td class="yfnc_tabledata1">)(.*?)<\/td>',str(respData))
print (pbr)


We see the failing regex, but we don't know how it fails. Can you post a fully failing python example with a (short) concrete respData ?

# used to parse values into the url
url = 'https://ca.finance.yahoo.com/q/ks?s=CUS.TO'

values = {'s': 'basics',
         'submit': 'search'}
data = urllib.parse.urlencode(values)
data = data.encode('utf-8')  # data should be bytes
req = urllib.request.Request(url, data)
resp = urllib.request.urlopen(req)
respData = resp.read()

It seems to work up to this point. Below, I have tried numerous things, but I'm very new to programming, so it might be something simple.

pbr = re.findall(r'Price\/Book \(mrq\):<\/td><td class="yfnc_tabledata1">(.*?)</td>',(respData))
print (pbr)
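A side note on the two patterns above: in Python 3, `resp.read()` returns bytes, so a str pattern needs decoded text (`str(respData)` gives the `b'...'` repr, not the page text), and the parentheses around `mrq` have to be escaped to match literally. A minimal sketch, assuming a hypothetical `sample` fragment standing in for the real page and assuming the page is UTF-8:

```python
import re

# Hypothetical stand-in for resp.read(); the real page may differ.
sample = (b'<td class="yfnc_tablehead1" width="74%">Price/Book (mrq):</td>'
          b'<td class="yfnc_tabledata1">1.41</td>')

# Decode bytes to text before applying a str pattern (UTF-8 is an assumption).
text = sample.decode('utf-8')

# \( and \) match literal parentheses; unescaped, (mrq) would be a capture group.
pattern = r'Price/Book \(mrq\):</td><td class="yfnc_tabledata1">(.*?)</td>'

match = re.search(pattern, text)
pbr = float(match.group(1)) if match else None
print(pbr)
```

This only works while the markup matches the pattern character for character, which is one reason a parser tends to be suggested instead.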

I can't find what you're searching for in respData.
Please also post your imports.

import urllib.request, urllib.parse

This is the address you actually get data from.

>>> resp.geturl()
'https://ca.finance.yahoo.com/lookup?s=basics'

Do you find Price/Book or class="yfnc_tabledata1" in the URL or in the returned respData?

Some notes: this site is JavaScript-heavy and not an easy one to start with.
That means you may have to use a method other than urllib to read the site.
I use Selenium to read sites like this.
Then I get the executed JavaScript too, and can parse the result with Beautiful Soup or lxml.

Using regex to parse HTML can be a bad choice.
It can work in some cases, but using a parser (BeautifulSoup) is the first choice.
I usually post this link on why not to use regex.
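To illustrate the point about parsers: a small change in markup, such as an extra attribute, breaks a literal regex but not a parser lookup. This is only a sketch with made-up HTML variants (they are not taken from the real page), and it assumes bs4 is installed:

```python
import re
from bs4 import BeautifulSoup

# Two hypothetical renderings of the same cell; only the attributes differ.
variant_a = '<td class="yfnc_tabledata1">1.41</td>'
variant_b = '<td align="right" class="yfnc_tabledata1">1.41</td>'

pattern = re.compile(r'<td class="yfnc_tabledata1">(.*?)</td>')

def with_regex(html):
    # Matches only when the tag is written exactly as in the pattern.
    m = pattern.search(html)
    return m.group(1) if m else None

def with_parser(html):
    # Looks the cell up by class, whatever other attributes it carries.
    cell = BeautifulSoup(html, 'html.parser').find('td', class_='yfnc_tabledata1')
    return cell.get_text(strip=True) if cell else None

for html in (variant_a, variant_b):
    print(with_regex(html), with_parser(html))
```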


The "Price/Book" text and class="yfnc_tabledata1" are in the returned respData, which is the source code downloaded from yahoo.ca. My goal is to get the number between that and the </td> tag and store it in a float variable. I've yet to try out BeautifulSoup. I'll have a look tonight when I'm home from work. Thank you! :)

> the "Price Book or class="yfnc_tabledata1" is in the return respData which is the source code downloaded from yahoo.ca.

OK, I understand. It's just that I can't find it when I search through respData or the URL.

OK, I downloaded BeautifulSoup4 and installed it after a few attempts. It seems to be working well now :). I've still got some more of the docs to read, but if I am after the "1.41" in the following line of HTML, from the Price/Book row only, what would my soup.findAll('') look like?
I'm still playing around with the code, but I'm still getting lots of miscellaneous characters. Any help appreciated! If I'm asking too many questions on this, let me know. Cheers!

#### this is the HTML line which returns as soup - I'm after the 1.41 only - which I hope to return as valueTable

<td class="yfnc_tablehead1" width="74%">Price/Book (mrq):</td><td class="yfnc_tabledata1">1.41</td>

Below is the full code I've been playing with. It returns close to what I want; I just need to be more specific.

import time
import urllib.error
import urllib.parse
import urllib.request
from urllib.request import urlopen

import requests
from bs4 import BeautifulSoup  # not aliased as "soup": that name is reused for the parsed page below

tsxowned = ('CUS.TO', 'CG.TO', 'S.TO', 'AQN.TO', 'GPS.TO', 'COS.TO', 'CSE.TO', 'CPX.TO', 'ERG.TO', 'CWW.TO', 'LEA.TO', 'WEF.TO')


############# Soup Calls for Yahoo!#########################

#Fetching the Yahoo Finance Page
optionsUrl = 'https://ca.finance.yahoo.com/q/ks?s=CUS.TO'
optionsPage = urlopen(optionsUrl)

# The following code loads the page into BeautifulSoup, naming the parser explicitly:
from bs4 import BeautifulSoup
soup = BeautifulSoup(optionsPage, 'html.parser')

# Need 1.41 from <td class="yfnc_tablehead1" width="74%">Price/Book (mrq):</td><td class="yfnc_tabledata1">1.41</td>

valueTable = [
    [x.text for x in y.parent.contents]
    for y in soup.findAll('td', attrs={'class': 'yfnc_tabledata1', 'nowrap': ''})
]


# print(soup)  # shows all recovered data
print(valueTable)  # shows the variables you're after, e.g. price to book ...

You need to add

value = float([y for x, y in valueTable if x == 'Price/Book (mrq):'][0])
print(value)

You could perhaps find the table first with a findAll('table', ...).
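The "find the table first" idea could look roughly like the following. This is only a sketch: the HTML and the `stats` class name are hypothetical stand-ins, since the real page's table attributes are not shown in this thread.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment; the "stats" class name is an assumption.
html = '''
<table class="other"><tr><td>Price/Book (mrq):</td><td>n/a</td></tr></table>
<table class="stats">
  <tr><td class="yfnc_tablehead1">Price/Sales (ttm):</td><td class="yfnc_tabledata1">0.89</td></tr>
  <tr><td class="yfnc_tablehead1">Price/Book (mrq):</td><td class="yfnc_tabledata1">1.41</td></tr>
</table>
'''

soup = BeautifulSoup(html, 'html.parser')

# Restrict the search to the wanted table(s), then map label -> value per row.
rows = {}
for table in soup.find_all('table', class_='stats'):
    for tr in table.find_all('tr'):
        cells = [td.get_text(strip=True) for td in tr.find_all('td')]
        if len(cells) == 2:
            rows[cells[0]] = cells[1]

value = float(rows['Price/Book (mrq):'])
print(value)
```

Narrowing to one table first keeps similarly labelled cells elsewhere on the page from being picked up by accident.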

> this is the HTML line which returns as soup - I'm after the 1.41 only - which I hope to return as valueTable

Using .next_sibling can be better.

from bs4 import BeautifulSoup

html = '''\
<td class="yfnc_tablehead1" width="74%">Price/Book (mrq):</td><td class="yfnc_tabledata1">1.41</td>'''

soup = BeautifulSoup(html, 'html.parser')
tag = soup.find('td', {'class': 'yfnc_tablehead1'})

Test with parent and nextSibling.

>>> tag
<td class="yfnc_tablehead1" width="74%">Price/Book (mrq):</td>
>>> tag.parent
<td class="yfnc_tablehead1" width="74%">Price/Book (mrq):</td><td class="yfnc_tabledata1">1.41</td>
>>> tag.parent.text
'Price/Book (mrq):1.41'    

>>> tag.nextSibling
<td class="yfnc_tabledata1">1.41</td>
>>> tag.nextSibling.text
'1.41'
>>> float(tag.nextSibling.text) + 1
2.41
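One caveat with the session above: `.nextSibling` is the older BeautifulSoup 3 spelling of `.next_sibling`, and when the real page has whitespace between the two cells, the next sibling is that whitespace string rather than the `<td>`. A hedged sketch of a helper that sidesteps this with `find_next_sibling('td')`; the function name `value_after_label` and the sample HTML are made up for illustration, and the `string=` argument assumes a reasonably recent bs4 (older releases call it `text=`):

```python
from bs4 import BeautifulSoup

def value_after_label(html, label):
    """Return the number in the <td> following the <td> whose text is `label`."""
    soup = BeautifulSoup(html, 'html.parser')
    head = soup.find('td', string=label)
    if head is None:
        return None
    # Skips whitespace text nodes and returns the next <td> tag, if any.
    data = head.find_next_sibling('td')
    return float(data.get_text(strip=True)) if data else None

# Hypothetical row with line breaks between the cells.
html = '''<tr>
  <td class="yfnc_tablehead1" width="74%">Price/Book (mrq):</td>
  <td class="yfnc_tabledata1">1.41</td>
</tr>'''

print(value_after_label(html, 'Price/Book (mrq):'))
```

The same helper could be called once per ticker page in a loop, with a check for `None` when a label is missing.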

Woo Hoo !! working! thank you so much everyone :)
