Hello,

I am a beginner in Python, and I have the following problem: I retrieve an excerpt of an HTML page from the web and want to hold the result in a variable (before processing it with a regexp).

The function does get the HTML source, but when I assign the function's result to the variable t_main_page, the interpreter tells me the variable is of type NoneType.

Here is the code:

#!/usr/bin/env python


# Script to fetch and parse the specific web page of PPI for Manufactured Goods
# on http://www.stats.gov.cn/english/ .

import urllib.request, re
from html.parser import HTMLParser


def fetch_main_page():
    """
    Open the web page and retrieve the HTML code.

    Returns: string UTF-8
    """
    
    main_page = ''
    
    try:
        main_page = urllib.request.urlopen("http://www.stats.gov.cn/english/").read(20000).decode('gb2312')
    except (UnicodeDecodeError, urllib.error.URLError) as e:
        fetch_main_page()
    else:
        return main_page

t_main_page = fetch_main_page()

print(t_main_page)

"""
relevant_links = re.findall('<a href=(.*?)>PPI of Main Manufactured Goods.*?</a>', t_main_page)

for link in relevant_links:
    print(link)

"""

Can someone tell me how to put the string returned by a function into a variable that the regexp can use?

Thanks ! :)

All 6 Replies

>>> import urllib.request, re
>>> from html.parser import HTMLParser
>>> main_page = urllib.request.urlopen("http://www.stats.gov.cn/english/").read(20000).decode('gb2312')
Traceback (most recent call last):
  File "<pyshell#2>", line 1, in <module>
    main_page = urllib.request.urlopen("http://www.stats.gov.cn/english/").read(20000).decode('gb2312')
UnicodeDecodeError: 'gb2312' codec can't decode bytes in position 1-2: illegal multibyte sequence

It returns None because decode('gb2312') raises an error.
So the Chinese data encoded in GB2312 format causes some problems.
Search Google for: python decode('gb2312')

>>> main_page = urllib.request.urlopen("http://www.stats.gov.cn/english/").read(20000)
>>> main_page
<now it works>

I am not sure whether this gives you what you need.
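
If the goal is just to get a str that a regexp can search, one option is to tell decode() how to handle bytes it cannot map instead of letting it raise. The sketch below assumes that a few replaced characters are acceptable. The helper name decode_page and the sample bytes are made up for illustration. The sample simulates what read(20000) can do: cut a two-byte GB2312 character in half at the end of the data.

```python
def decode_page(raw, encoding="gb2312"):
    """Decode bytes to str, replacing undecodable bytes with U+FFFD."""
    return raw.decode(encoding, errors="replace")

# Made-up sample: valid GB2312 text whose last character is cut in half,
# as could happen when read(n) stops in the middle of a two-byte character.
raw = "<title>中国</title>中".encode("gb2312")[:-1]

text = decode_page(raw)
print(type(text).__name__)   # the result is a str, not bytes
print("\ufffd" in text)      # the damaged tail became a replacement character
```

With the default errors="strict", the same call would raise UnicodeDecodeError, which is what sends the original function into its except branch.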

Thanks for the reply !

But read() without decode() returns only a bytes object, and I can't parse that with the regexp.
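
For what it is worth, the re module can also search bytes directly, provided the pattern is a bytes object too. A minimal sketch with a made-up HTML fragment (the href value is invented):

```python
import re

# Made-up fragment standing in for the bytes returned by read().
raw = b'<a href="ppi.html">PPI of Main Manufactured Goods (2009)</a>'

# A bytes pattern (note the b prefix) can be matched against bytes.
pattern = rb'<a href=(.*?)>PPI of Main Manufactured Goods.*?</a>'
links = re.findall(pattern, raw)

for link in links:
    print(link)
```

Mixing the two types (a str pattern on bytes, or the reverse) raises a TypeError, so both sides have to agree.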

I wrote a try-except-else function because urlopen can't fetch the page content immediately; I recurse inside the handler for UnicodeDecodeError and URLError, just to force the function to retrieve the content by trial and error (sometimes it takes 5 minutes, but I always get it in the end).

Then, when I put a print() instead of return, it works.

This works:

def fetch_main_page():
    """
    Open the web page and retrieve the HTML code.

    Returns: string UTF-8
    """
    
    main_page = ''
    
    try:
        main_page = urllib.request.urlopen("http://www.stats.gov.cn/english/").read(20000).decode('gb2312')
    except (UnicodeDecodeError, urllib.error.URLError) as e:
        fetch_main_page()
    else:
        print(main_page)

fetch_main_page()

How can I get the result of a function back into the main flow of the script? If I put a return statement in my function, shouldn't I be able to do this?

variable = function()
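
A likely explanation, sketched here without network access: when the first attempt raises, the except branch calls the function again but throws away what that call returns, so the outer call reaches the end of the function and returns None. The print() version appears to work because the innermost, successful call does the printing itself. The names flaky_fetch, without_return and with_return are hypothetical stand-ins, and the failure is simulated.

```python
attempts = {"count": 0}

def flaky_fetch():
    """Hypothetical stand-in for urlopen(...).read(...).decode(...):
    fails on the first call only."""
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise UnicodeDecodeError("gb2312", b"\xff", 0, 1, "simulated failure")
    return "<html>page</html>"

def without_return():
    try:
        page = flaky_fetch()
    except UnicodeDecodeError:
        without_return()          # the retry's result is thrown away
    else:
        return page
    # falling off the end of a function returns None implicitly

def with_return():
    try:
        page = flaky_fetch()
    except UnicodeDecodeError:
        return with_return()      # hand the retry's result back to the caller
    else:
        return page

attempts["count"] = 0
broken = without_return()
attempts["count"] = 0
fixed = with_return()
print(repr(broken), repr(fixed))
```

So variable = function() does work as expected; the function just has to return a value on every path that the caller can reach.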

Okay, I now have the function result in a global variable...

I find the solution "not very Pythonic", though. I commented the steps.

#!/usr/bin/env python


# Script to fetch and parse the specific web page of PPI for Manufactured Goods
# on http://www.stats.gov.cn/english/ .

import urllib.request, re
from html.parser import HTMLParser

# Declare an empty global variable to act as a container
t_main_page = ''

def fetch_main_page():
    """
    Open the web page and retrieve the HTML code.

    Returns: string UTF-8
    """
    
    main_page = ''
    
    try:
        main_page = urllib.request.urlopen("http://www.stats.gov.cn/english/")\
                    .read(20000).decode('gb2312')
    except (UnicodeDecodeError, urllib.error.URLError) as e:
        fetch_main_page()
    else:
        global t_main_page             # declare the global name first (can't be
        t_main_page = main_page        # combined with the assignment in one statement)
        return t_main_page             # THEN assign, THEN return

fetch_main_page()
# now t_main_page contains the string

So it is solved, but is there another way to do it?
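
The global version works because the call that finally succeeds assigns t_main_page itself. The value handed back by the outermost fetch_main_page() is still None whenever the first attempt fails. One alternative, sketched below under the assumption that a bounded number of retries is acceptable, is a loop that returns as soon as one attempt succeeds. fetch_with_retries, max_attempts, delay and fake_fetch are hypothetical names, and the outage is simulated so the example runs offline.

```python
import time
import urllib.error

def fetch_with_retries(fetch, max_attempts=5, delay=0.0):
    """Call fetch() until it succeeds or max_attempts is reached."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return fetch()                      # success: value goes straight to the caller
        except (UnicodeDecodeError, urllib.error.URLError) as e:
            last_error = e                      # remember the failure and try again
            time.sleep(delay)
    raise RuntimeError("no usable response after %d attempts" % max_attempts) from last_error

# Simulated fetcher: fails twice, then returns a page.
calls = []

def fake_fetch():
    calls.append(1)
    if len(calls) < 3:
        raise urllib.error.URLError("simulated outage")
    return "<html>ok</html>"

t_main_page = fetch_with_retries(fake_fetch)
print(t_main_page)
```

For the real page, fetch could be a small function or lambda wrapping the urlopen(...).read(20000).decode('gb2312') call from the script. Unlike unbounded recursion, the loop cannot hit the recursion limit and it reports the last error if every attempt fails.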

Unless I'm missing something, all you are doing with the variable t_main_page is copying the contents of the variable main_page, which already has the information you want. Just return main_page.

#!/usr/bin/env python


# Script to fetch and parse the specific web page of PPI for Manufactured Goods
# on http://www.stats.gov.cn/english/ .

import urllib.request, re
from html.parser import HTMLParser

def fetch_main_page():
    """
    Open the web page and retrieve the HTML code.

    Returns: string UTF-8
    """
    
    main_page = ''
    
    try:
        main_page = urllib.request.urlopen("http://www.stats.gov.cn/english/")\
                    .read(20000).decode('gb2312')
    except (UnicodeDecodeError, urllib.error.URLError) as e:
        fetch_main_page()
    else:
        return main_page             # return

t_main_page = fetch_main_page()
# now t_main_page contains the string

That is what I did, but when I do it, it returns a NoneType object!

I am running the script from IDLE, and I can't get anything if I do it that way.

Just to be sure, I added a print(t_main_page) after the function call and ran your code, and here is what the interpreter shows:

******
Python 3.1.1 (r311:74483, Aug 17 2009, 17:02:12) [MSC v.1500 32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> ================================ RESTART ================================
>>>
None
>>> type(t_main_page)
<class 'NoneType'>
>>>
******
Is Python 3 "doing the right thing" or what?

Is that a gotcha, or is something going wrong with Python? (Honestly, I think the problem is probably with me, but why is everything on the web coded the way you do it?)

I installed Python 2.6.4 to try the original code.

It works INSTANTLY (i.e. no 5-minute wait to retrieve the HTML source of the page), and without the awkward global variable.

So I won't code in Python 3 anymore, which should spare me a lot of headaches.

This works perfectly in Python 2.6.4, and IS what is said to be the right way to write it.

#!/usr/bin/env python


# Script to fetch and parse the specific web page of PPI for Manufactured Goods
# on http://www.stats.gov.cn/english/ .

import urllib, re


def fetch_main_page():
    """
    Open the web page and retrieve the HTML code.

    Returns: string
    """
     
    try:
        main_page = urllib.urlopen("http://www.stats.gov.cn/english/")\
                    .read(20000).decode('gb2312')
    except (UnicodeDecodeError, IOError) as e:   # Python 2's urllib.urlopen raises IOError
        return fetch_main_page()                 # retry and pass the result back up
    else:
        return main_page

t_main_page = fetch_main_page()

Now I will listen when they say that we had better wait before programming for Python 3K...
