urllib in python 3.1

Thread Solved

Lingson
  #1
Aug 22nd, 2009
Hi, I've been searching for examples of how to use urllib in Python 3.1, but all the tutorials I can find are for Python 2.x.

I just need to do something simple: get the text of a page (as a string) and manipulate it.

The code in Python 2.x would look like this:

    import urllib
    url = 'http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing=12345'
    text = urllib.urlopen(url).read()

But when I tried it in Python 3.1 (with urllib.request) like this:

    import urllib.request
    url = 'http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing='
    current = '12345'
    response = urllib.request.urlopen(url+current)
    text = response.read()
    current = text[-5:]
    response = urllib.request.urlopen(url+current)

it gives me an error:

    Can't convert 'bytes' object to str implicitly

I then looked at the value of 'text', which turns out to be something like b'........', which means it's a binary value.

From what I have read in the Python 2.x tutorials about urllib, calling urllib....read() gives you back a string.

I tried to convert the binary value to a string with binascii.b2a_uu(html), but the result was even more catastrophic.

Can anyone please help me with this? Thank you.

bumsfeld
  #2
Aug 22nd, 2009
One rude awakening with Python 3 will be the frequent use of bytes objects instead of strings. It does this to handle international Unicode text better.

Here is how you deal with this:

    # get code of given URL as html text string
    # Python3 uses urllib.request.urlopen()
    # instead of Python2's urllib.urlopen() or urllib2.urlopen()

    import urllib.request

    fp = urllib.request.urlopen("http://www.python.org")

    mybytes = fp.read()
    # note that Python3 does not read the html code as string
    # but as bytes, convert to string with
    mystr = mybytes.decode("utf8")

    fp.close()

    print(mystr)
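
Applied to the loop from the first post, the same decode step might look like the sketch below. SAMPLE_BODY is an illustrative stand-in for what response.read() returns (an assumption, so the example runs without a network connection), and next_id is a hypothetical helper name.

```python
# Sketch: decode the response bytes before slicing and concatenating.
# SAMPLE_BODY is an illustrative stand-in for response.read();
# the real page content may differ.
SAMPLE_BODY = b"and the next nothing is 44827"

BASE_URL = "http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing="


def next_id(body, encoding="utf-8"):
    """Return the last five characters of the decoded response body."""
    text = body.decode(encoding)  # bytes -> str
    return text[-5:]


current = next_id(SAMPLE_BODY)
next_url = BASE_URL + current  # str + str, so no implicit bytes conversion
print(next_url)
```

The slice is taken after decoding, so `current` is a str and can be appended to the URL directly.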
Last edited by bumsfeld; Aug 22nd, 2009 at 11:39 am.

Lingson
  #3
Aug 22nd, 2009
Ugh, the solution was that easy? *lol*

One more question then:
Is it safe to say that every time I want to convert bytes to a string, I just need to use .decode('utf8')?
(You are right that Python 3 uses bytes more than strings. Until now, whenever that happened, I just tried some other way around the problem, since I couldn't find a way to convert them based on the info in the Python docs.)

vegaseat
  #4
Aug 23rd, 2009
UTF-8 is pretty much the de facto standard encoding in Python. However, other encodings are in use on the internet, depending on the country and its native language.

In this particular case (http://www.python.org), you can find the encoding used in this line:

    <meta http-equiv="content-type" content="text/html; charset=utf-8" />

However, if you go, for instance, to a German website (http://www.python-forum.de), you will find this line:

    <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
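
For illustration, the charset value in lines like these can be pulled out of the raw bytes before any decoding happens. This is a minimal sketch, assuming a simplified regular expression rather than a real HTML parser; meta_charset is a hypothetical helper name.

```python
import re

# Matches charset=... inside raw, still-undecoded HTML bytes.
# Simplified pattern for illustration, not a full HTML parser.
CHARSET_RE = re.compile(rb"""charset=["']?([\w-]+)""", re.IGNORECASE)


def meta_charset(html_bytes):
    """Return the declared charset as a lowercase str, or None if absent."""
    match = CHARSET_RE.search(html_bytes)
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()


# The two meta lines quoted above, as bytes.
python_org = b'<meta http-equiv="content-type" content="text/html; charset=utf-8" />'
python_forum = b'<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">'

print(meta_charset(python_org))
print(meta_charset(python_forum))
```

Searching the bytes directly avoids having to guess an encoding just to find out which encoding the page declares.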
Last edited by vegaseat; Aug 23rd, 2009 at 2:37 pm. Reason: html code

scru
  #5
Aug 23rd, 2009
Originally Posted by Lingson:
> Ugh, the solution was that easy? *lol*
>
> One more question then:
> Is it safe to say that every time I want to convert bytes to a string, I just need to use .decode('utf8')? ... some other stuff ...

No, please don't do that.

That is okay if the page you are reading is encoded in ASCII or UTF-8. But if it is encoded in Latin-1 (a fairly common encoding in its own right), you can run into trouble with non-ASCII characters (like é, for example). And this is just for sites in languages based on Latin script.
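
A small sketch of that failure, using a made-up sample word in place of a downloaded page:

```python
# 'é' (U+00E9) encoded as Latin-1 is the single byte 0xE9,
# which is not a valid UTF-8 sequence on its own.
latin1_bytes = "caf\u00e9".encode("latin-1")

try:
    latin1_bytes.decode("utf-8")
    utf8_worked = True
except UnicodeDecodeError:
    utf8_worked = False

print(utf8_worked)
# Decoding with the codec the bytes were written in recovers the text.
print(latin1_bytes.decode("latin-1") == "caf\u00e9")
```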

What you have to do is get the encoding from the server. The web server sends out the encoding that it uses* in the HTTP header (the charset parameter of Content-Type). The general idea is that the pages it serves use that encoding.
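
A minimal sketch of reading that header value, assuming the headers object behaves like email.message.Message (in current Python 3 releases, the headers attribute of a urllib.request response is a subclass of it). charset_from_headers and fake_headers are hypothetical names, and the header is built by hand so the example runs offline.

```python
from email.message import Message


def charset_from_headers(headers, default=None):
    """Return the charset parameter of Content-Type, or default if absent."""
    return headers.get_content_charset() or default


# Hand-built stand-in for the headers of a real response.
fake_headers = Message()
fake_headers["Content-Type"] = "text/html; charset=ISO-8859-1"

# get_content_charset() returns the value in lower case.
print(charset_from_headers(fake_headers))
```

With a live response this would be called as charset_from_headers(response.headers), under the assumption stated above.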

* If only it were that simple. Sometimes the pages on the web server are encoded differently from what the server is set to report (blame it on lazy, inconsistent webmasters?). Ideally, these pages specify their encoding in their HTML headers.

Oh right, you were looking for a solution? I generally start decoding with the encoding that the HTTP header specifies, but if the page's HTML header specifies a different encoding, I start over again with that new encoding. And if all else fails, I either fall back to UTF-8 or refuse to decode altogether. I know it sounds overly complicated (it probably is), but it's worth not having your application crash as soon as it's given a page not encoded with the one codec you hard-coded.
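
One possible reading of that strategy, as a sketch rather than the class mentioned in this post: prefer the charset declared in the page, then the one from the HTTP header, then a fallback, and refuse if none of them works. decode_page and its parameters are hypothetical names, and the regular expression is a simplification.

```python
import re

# Simplified search for charset=... in raw HTML bytes (not a real parser).
META_RE = re.compile(rb"""charset=["']?([\w-]+)""", re.IGNORECASE)


def decode_page(body, header_charset=None, fallback="utf-8"):
    """Decode body and return (text, encoding_used).

    Tries the charset declared in the page, then the HTTP header charset,
    then the fallback. Raises ValueError if none of them can decode body.
    """
    match = META_RE.search(body)
    page_charset = match.group(1).decode("ascii") if match else None

    candidates = []
    for name in (page_charset, header_charset, fallback):
        if name and name.lower() not in candidates:
            candidates.append(name.lower())

    for name in candidates:
        try:
            return body.decode(name), name
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown codec: try the next candidate
    raise ValueError("could not decode page with any of: %s" % candidates)


# Made-up page: the header claims UTF-8 but the page declares Latin-1.
sample = b'<meta charset="iso-8859-1"><p>caf\xe9</p>'
text, used = decode_page(sample, header_charset="utf-8")
print(used)
```

Catching LookupError as well covers pages that declare a codec name Python does not know.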

I have a class that does this somewhere, but I can't find it...
Last edited by scru; Aug 23rd, 2009 at 12:21 pm.

Lingson
  #6
Aug 23rd, 2009
OK, I get the point.

Huh, I guess there's no such thing as beginner's luck in Python :-p
Last edited by Lingson; Aug 23rd, 2009 at 2:06 pm.

vegaseat
  #7
Aug 23rd, 2009
As scru mentioned, let's hope the authors of the website were good guys and included the type of encoding. You can modify Henri's code ...

    # get the code of a given URL as html text string
    # Python3 uses urllib.request.urlopen()
    # get the encoding used first
    # tested with Python 3.1 with the Editra IDE

    import urllib.request

    def extract(text, sub1, sub2):
        """
        extract a substring from text between first
        occurrences of substrings sub1 and sub2
        (returns '' if sub1 is not found)
        """
        if sub1 not in text:
            return ''
        return text.split(sub1, 1)[-1].split(sub2, 1)[0]


    fp = urllib.request.urlopen("http://www.python.org")

    mybytes = fp.read()

    # str(mybytes) is the b'...' form of the bytes, good enough for an ASCII search
    encoding = extract(str(mybytes).lower(), 'charset=', '"')
    print('-'*50)
    print( "Encoding type = %s" % encoding )
    print('-'*50)

    if encoding:
        # note that Python3 does not read the html code as string
        # but as bytes, convert to string with
        mystr = mybytes.decode(encoding)
        print(mystr)
    else:
        print("Encoding type not found!")


    fp.close()

Lingson
  #8
Aug 23rd, 2009
Amazing.

Thanks a lot, guys. I really appreciate all your help.

willygstyle
  #9
Aug 25th, 2009
Not sure if this is a proper method or not, but I haven't had any problems so far using:

    from urllib import request

    # req is a URL string or Request object defined elsewhere
    x = repr(request.urlopen(req).read())
    print(x)

scru
  #10
Aug 25th, 2009
Originally Posted by willygstyle:
> Not sure if this is a proper method or not, but I haven't had any problems so far using:
>
>     x = repr(request.urlopen(req).read())
>     print(x)

You may want to take a second look at what that actually does...
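
To see the difference, here is a small sketch with a made-up sample in place of the network read. repr() on a bytes object returns its source-code form, with the b'...' wrapper and the escapes spelled out as literal backslashes, rather than the decoded text.

```python
# Illustrative stand-in for request.urlopen(req).read()
raw = "caf\u00e9\n".encode("utf-8")

as_repr = repr(raw)            # str holding the b'...' literal form
as_text = raw.decode("utf-8")  # the actual text

print(as_repr)
print(len(as_repr), len(as_text))
```

The repr string still contains the prefix, the quotes and the escape sequences, so any non-ASCII character or newline in the page ends up as backslash codes instead of text.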
Last edited by scru; Aug 25th, 2009 at 8:18 pm.