
urllib in python 3.1

 
0
 

Hi, I've searched for examples to understand more about how to use urllib in Python 3.1, but all the tutorials are for Python 2.x.

I just need to do a simple thing: get the text (as a string) and manipulate it.

the code in python 2.x would be like:

import urllib
url = 'http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing=12345'
text = urllib.urlopen(url).read()

but when i tried it in python 3.1 (with urllib.request) like this:

import urllib.request
url = 'http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing='
current = '12345'
response = urllib.request.urlopen(url+current)
text = response.read()
current = text[-5:]
response = urllib.request.urlopen(url+current)

it gives me an error:
Can't convert 'bytes' object to str implicitly

I then looked at the value of 'text', which turns out to be something like b'........', which means it's a bytes object rather than a string.

From what I read and understand from the tutorials about urllib in Python 2.x, when you do the urllib....read(), it gives you back a string.

I tried to convert the bytes to a string with binascii.b2a_uu(html), but the result is even more catastrophic.

Can anyone please show me how to do this? Thank you.

 
1
 

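One rude awakening with Python3 will be the frequent use of bytes objects instead of strings. Python3 strings are Unicode text, so raw data read from a network connection arrives as bytes and has to be decoded explicitly. This is what lets it handle international text properly.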

Here is how you deal with this:

# get code of given URL as html text string
# Python3 uses urllib.request.urlopen()
# instead of Python2's urllib.urlopen() or urllib2.urlopen()

import urllib.request

fp = urllib.request.urlopen("http://www.python.org")

mybytes = fp.read()
# note that Python3 does not read the html code as string
# but as a bytes object, convert to string with
mystr = mybytes.decode("utf8")

fp.close()

print(mystr)
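
Applied to the loop in the original question, the same idea might look like the sketch below. The helper names fetch_text and next_url, the example.com URL and the sample page text are made up for illustration, and utf-8 is an assumption about what the server sends.

```python
import urllib.request


def fetch_text(url, encoding="utf-8"):
    """Fetch url and return the body as str (the encoding is an assumption)."""
    response = urllib.request.urlopen(url)
    try:
        return response.read().decode(encoding)  # bytes -> str
    finally:
        response.close()


def next_url(base_url, text):
    """Append the last five characters of the page text to base_url."""
    return base_url + text[-5:]


# offline demonstration with a made-up page body
sample = b"the next value is 67890".decode("utf-8")
print(next_url("http://example.com/page?nothing=", sample))
```

Once the body is a str, text[-5:] is a str as well, so url + current no longer raises the "Can't convert 'bytes' object to str implicitly" error.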
 
0
 

ugh.. the solution was so easy? *lol*

one more question then:
is it safe to say that every time I want to convert bytes to a string I just need to use .decode('utf8')?
(You are right about Python 3 using bytes more than strings. Until now, whenever that happened, I just tried some other way around the problem, since I couldn't find a way to convert them based on the info in the Python docs.)

 
0
 

UTF8 is pretty much the de facto standard for decoding in Python (it is the default for bytes.decode() in Python 3). However, other encodings are in use on the internet, depending on the country and its native language.

In this particular case ( http://www.python.org ), you can find the encoding used in this line:

<meta http-equiv="content-type" content="text/html; charset=utf-8" />

However, if you go for instance to a German website (http://www.python-forum.de), you will find this line:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
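
To see why the declared charset matters, here is a small sketch of what happens when the same character is decoded with the wrong codec. The helper name try_decode is made up for this example.

```python
def try_decode(raw, encoding):
    """Return raw decoded with encoding, or None if the bytes are not valid for it."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


latin1_bytes = "é".encode("iso-8859-1")  # a single byte: b'\xe9'
utf8_bytes = "é".encode("utf-8")         # two bytes: b'\xc3\xa9'

# iso-8859-1 bytes are not valid utf-8, so decoding fails outright
print(try_decode(latin1_bytes, "utf-8"))

# utf-8 bytes decoded as iso-8859-1 do not fail, they silently turn into garbage
print(try_decode(utf8_bytes, "iso-8859-1"))
```

The second case is the nastier one, because no exception is raised and the wrong text just ends up in your data.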
 
1
 

ugh.. the solution was so easy? *lol*

one more question then:
is it safe to say that every time I want to convert bytes to a string I just need to use .decode('utf8')?...some other stuff...

No, please don't do that.

That is okay if the page you are reading is encoded in ASCII or UTF-8. But if it is encoded in Latin-1 (a fairly common encoding in its own right), you can run into trouble with non-ASCII characters (like é, for example). And this is just for sites in languages based on Latin script.

What you have to do is get the encoding from the server. The web server reports the encoding that it uses* in the charset parameter of the Content-Type HTTP header. The general idea is that the pages it serves would use that encoding.

* If only it were that simple. Sometimes the pages on the web server itself are encoded using a different encoding than what the web server is set to report (blame it on lazy, inconsistent webmasters?). Ideally, these pages have their encoding specified in a meta tag in their HTML head.
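
A minimal sketch of reading the charset from the HTTP header follows. With urllib.request the response's headers attribute is an http.client.HTTPMessage, which inherits get_content_charset() from email.message.Message, so the example builds a plain Message by hand to stay offline. The helper name charset_from_headers is made up.

```python
from email.message import Message


def charset_from_headers(headers, default=None):
    """Return the charset named in the Content-Type header, or default."""
    charset = headers.get_content_charset()
    return charset if charset else default


# stand-in for response.headers from urllib.request.urlopen(...)
headers = Message()
headers["Content-Type"] = "text/html; charset=ISO-8859-1"

print(charset_from_headers(headers))             # iso-8859-1 (lowercased)
print(charset_from_headers(Message(), "utf-8"))  # no header at all -> utf-8
```

With a live response this would be charset_from_headers(response.headers), followed by response.read().decode(...) using the result.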

Oh right, you were looking for a solution? I generally start decoding with the encoding that the HTTP header specifies, but if the page's HTML head specifies a different encoding, I start over again with that new encoding. And if all else fails, I either fall back to utf8 or refuse to decode altogether. I know it sounds overly complicated (it probably is), but it's worth it to keep your application from crashing as soon as it's given a page not encoded with the one codec you hard-coded.

I have a class that does this somewhere, but I can't find it...
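
This is not that class, just a rough sketch of the fallback order described above. It simplifies things by searching the raw bytes for a charset declaration instead of decoding first and starting over. The names sniff_meta_charset and decode_page, and the 4096-byte search window, are choices made for this example.

```python
import re

# matches charset=iso-8859-1 as well as charset="utf-8"
META_CHARSET = re.compile(rb"charset\s*=\s*[\"']?\s*([A-Za-z0-9._:-]+)", re.IGNORECASE)


def sniff_meta_charset(raw):
    """Return the charset declared near the top of the page, or None."""
    match = META_CHARSET.search(raw[:4096])
    return match.group(1).decode("ascii").lower() if match else None


def decode_page(raw, header_charset=None):
    """Try the page's own charset, then the HTTP header's, then utf-8.

    Returns (text, encoding_used); raises ValueError if nothing works.
    """
    tried = []
    for name in (sniff_meta_charset(raw), header_charset, "utf-8"):
        if not name or name in tried:
            continue
        tried.append(name)
        try:
            return raw.decode(name), name
        except (LookupError, UnicodeDecodeError):
            continue  # unknown codec name or invalid bytes, try the next one
    raise ValueError("could not decode page; tried: " + ", ".join(tried))


page = ('<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">'
        '<p>café</p>').encode("iso-8859-1")
text, used = decode_page(page, header_charset="utf-8")
print(used)
```

Here the server claims utf-8 but the page declares iso-8859-1, so the page's own declaration wins. Raising ValueError at the end is the "refuse to decode" option; returning raw.decode("utf-8", "replace") instead would be the lenient one.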

Question Answered as of 4 Years Ago by vegaseat, bumsfeld and scru
 
0
 

ok.. i get the point..

huh, I guess there is no such thing as beginner's luck in Python :-p

 
1
 

As Scru mentioned, let's hope the authors of the website were good guys and included the type of encoding. You can modify Henri's code ...

# get the code of a given URL as html text string
# Python3 uses urllib.request.urlopen()
# get the encoding used first
# tested with Python 3.1 with the Editra IDE

import urllib.request

def extract(text, sub1, sub2):
    """
    extract a substring from text between first
    occurrences of substrings sub1 and sub2
    (returns an empty string if sub1 is not found)
    """
    if sub1 not in text:
        return ""
    return text.split(sub1, 1)[-1].split(sub2, 1)[0]


fp = urllib.request.urlopen("http://www.python.org")

mybytes = fp.read()

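# decode as latin-1 just to search for the charset (latin-1 never fails);
# str(mybytes) would search the b'...' representation instead
encoding = extract(mybytes.decode('latin-1').lower(), 'charset=', '"')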
print('-'*50)
print( "Encoding type = %s" % encoding )
print('-'*50)

if encoding:
    # note that Python3 does not read the html code as string
    # but as a bytes object, convert to string with
    mystr = mybytes.decode(encoding)
    print(mystr)
else:
    print("Encoding type not found!")


fp.close()
 
0
 

amazing..

thanks a lot guys.. really appreciate all your help..

 
0
 

Not sure if this is a proper method or not, but I haven't had any problems so far using....

x = repr(request.urlopen(req).read())
print(x)
 
0
 

Not sure if this is a proper method or not, but I haven't had any problems so far using....

x = repr(request.urlopen(req).read())
print(x)

You may want to take a second look at what that actually does...

 
0
 

It changes the returned request into a string? Seems to work so far.

 
0
 

Not quite. It changes the bytes returned by read() into their Python representation. You therefore end up with this:

"b'request'"

Even worse, when you have non-ASCII characters in the response, it doesn't convert them for you, so you end up with things like:

"b'requestconvert\xa5\xb3\x7f\xc4dumbl\xd3'"

All over it.

Lesson? Do things the right way and stop using cop-out shortcuts. Otherwise you'll end up with buggy code and no way to explain it.

Also, please don't use repr to convert stuff into strings. Use data.decode(encoding) or str(data, encoding) instead (a plain str(data) on bytes gives you the same b'...' representation), unless you want to keep running into silly bugs like the one above.
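
A quick sketch to make the difference concrete. The sample text is made up, and utf-8 is assumed as the encoding of the bytes.

```python
raw = "café".encode("utf-8")      # the kind of bytes read() hands back: b'caf\xc3\xa9'

as_repr = repr(raw)               # Python representation, b'...' wrapper included
as_str = str(raw)                 # without an encoding, str() gives the same thing
as_text = raw.decode("utf-8")     # actual text
also_text = str(raw, "utf-8")     # same result as decode()

print(as_repr)
print(as_text)
print(len(as_repr), len(as_text))
```

The repr version is 14 characters long and contains literal backslashes, while the decoded version is the 4-character word you actually wanted.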

 
0
 

Yeah, good point, thank you for pointing this out. I'm new to Python in general, and with not much documentation on 3.1 it's easy to assume things. I never actually looked at the output of read() closely enough to notice the b'. At least I only have a few programs running with urllib at the moment; better to find out now than later.

 
0
 

Hi all,

I wonder how you figured out that the first 'nothing' is 12345. I have no clue.

Regards.
