urllib in python 3.1
Thread Solved |
Hi, I've been searching for examples to understand how to use urllib in Python 3.1, but all the tutorials I can find are for Python 2.x.
I just need to do a simple thing: get the text (as a string) and manipulate it.
The code in Python 2.x would be like:

    import urllib

    url = 'http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing=12345'
    text = urllib.urlopen(url).read()

But when I tried it in Python 3.1 (with urllib.request) like this:

    import urllib.request

    url = 'http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing='
    current = '12345'
    response = urllib.request.urlopen(url+current)
    text = response.read()
    current = text[-5:]
    response = urllib.request.urlopen(url+current)

it gives me an error:
Can't convert 'bytes' object to str implicitly
I then looked at the value of 'text', which turns out to be something like b'........', which means it's a binary value.
From what I read and understand from the tutorials about urllib in Python 2.x, when you do the urllib....read(), it gives you back a string.
I tried to convert the binary to a string with binascii.b2a_uu(html), but the result was even more catastrophic.
Can anyone please help me with this? Thank you.
One rude awakening with Python3 will be the frequent use of bytes objects instead of strings. It does this to handle international unicode better: strings are always Unicode text, and raw data such as a downloaded page stays as bytes until you decode it.
Here is how you deal with this:

    # get code of given URL as html text string
    # Python3 uses urllib.request.urlopen()
    # instead of Python2's urllib.urlopen() or urllib2.urlopen()

    import urllib.request

    fp = urllib.request.urlopen("http://www.python.org")
    mybytes = fp.read()
    # note that Python3 does not read the html code as string
    # but as bytes, convert to string with
    mystr = mybytes.decode("utf8")
    fp.close()

    print(mystr)
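Applied to the loop from the original question, the same idea looks roughly like the sketch below. It runs offline: `fake_body` is a made-up stand-in for what `response.read()` returns, so the page text and the number in it are assumptions, not the real server reply.

```python
# Offline sketch: fake_body stands in for response.read() (hypothetical content)
url = 'http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing='
fake_body = b'and the next nothing is 54321'

# Slicing bytes gives bytes, so joining it to a str fails in Python 3
try:
    next_url = url + fake_body[-5:]
except TypeError as err:
    print("TypeError:", err)

# Decode first, then slice and concatenate as ordinary strings
text = fake_body.decode('utf8')
current = text[-5:]
next_url = url + current
print(next_url)
```

The exact wording of the TypeError differs between Python 3 versions, but the cause is the same: a str and a bytes object cannot be concatenated without an explicit decode.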
Last edited by bumsfeld; Aug 22nd, 2009 at 10:39 am.
Should you find Irony, you can keep her!
Ugh.. the solution was so easy? *lol*
One more question then:
Is it safe to say that every time I want to convert bytes to a string, I just need to use .decode('utf8')?
(You are right that Python 3 uses bytes more than strings. Until now, whenever that happened, I just tried some other way around the problem, since I couldn't find a way to convert them based on the info in the Python docs.)
UTF-8 is pretty much the de facto standard, and it is also the default encoding for decode() in Python 3. However, there may be other encodings in use on the internet, depending on the country and its native language.
In this particular case ( http://www.python.org ), you can find the encoding used in this line:

    <meta http-equiv="content-type" content="text/html; charset=utf-8" />

However, if you go for instance to a German website (http://www.python-forum.de), you will find this line:

    <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
Last edited by vegaseat; Aug 23rd, 2009 at 1:37 pm. Reason: html code
May 'the Google' be with you!
ugh.. the solution was so easy? *lol*
one more question then:
is it safe to say that everytime i want to convert bytearrays to string i just need to use the .decode('utf8')?...some other stuff...
That is okay if the page you are reading is encoded in ASCII or UTF-8. But if it is encoded in latin-1 (a fairly common encoding in its own right), you can run into trouble with non-ASCII characters (like é, for example). And this is just for sites in languages based on the Latin script.
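To make the latin-1 problem concrete, here is a small offline sketch. The word 'café' is just an illustrative sample, not taken from any real page.

```python
# 'é' is one byte in latin-1 but two bytes in utf-8
latin1_bytes = 'café'.encode('latin-1')
utf8_bytes = 'café'.encode('utf-8')

# Decoding latin-1 bytes as utf-8 fails on the lone 0xE9 byte
try:
    latin1_bytes.decode('utf-8')
    failed = False
except UnicodeDecodeError as err:
    failed = True
    print("utf-8 decode failed:", err)

# The reverse does not raise, it silently produces the wrong text
garbled = utf8_bytes.decode('latin-1')
print(garbled)

# With the right codec the text comes back intact
print(latin1_bytes.decode('latin-1'))
```

Note that the second case is arguably the nastier one: nothing crashes, the text is simply wrong.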
What you have to do is get the encoding from the server. The web server sends out the encoding that it uses* in the HTTP header (the charset part of Content-Type). The general idea is that the pages it serves would use that encoding.
* If only it were that simple. Sometimes the pages on the web server itself are encoded using a different encoding than what the web server is set to report (blame it on lazy, inconsistent webmasters?). Ideally, these pages have their encoding specified in their HTML headers.
Oh right, you were looking for a solution? I generally start decoding with the encoding that the HTTP header specifies, but if the page's HTML header specifies a different encoding, I start over again with that new encoding. And if all else fails, either fall back to utf8 or refuse to decode altogether. I know it sounds overly complicated (it probably is), but it's worth not having your application crash as soon as it's given a page not encoded with the one codec you hard-coded.
I have a class that does this somewhere, but I can't find it...
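In the meantime, a rough sketch of that strategy might look like the following. This is not the class mentioned above; `decode_page`, `fetch_text` and `META_CHARSET` are hypothetical names made up for illustration. It is also simplified: rather than decoding and starting over, it checks the page's own charset declaration first, then the HTTP header, then falls back to utf-8, and refuses (raises) if nothing works.

```python
import re
import urllib.request

# Looks for charset=... in the raw page bytes, e.g. inside a <meta> tag
META_CHARSET = re.compile(rb'charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)',
                          re.IGNORECASE)

def decode_page(raw, header_charset=None, fallback='utf-8'):
    """Decode raw page bytes (hypothetical helper, not part of urllib)."""
    candidates = []
    match = META_CHARSET.search(raw[:4096])
    if match:
        # the page's own declaration wins if there is one
        candidates.append(match.group(1).decode('ascii'))
    if header_charset:
        candidates.append(header_charset)
    candidates.append(fallback)
    for name in candidates:
        try:
            return raw.decode(name)
        except (LookupError, UnicodeDecodeError):
            continue  # unknown codec name or wrong codec, try the next one
    raise ValueError("could not decode page with any known encoding")

def fetch_text(url):
    """Fetch url and decode it with decode_page (needs network access)."""
    response = urllib.request.urlopen(url)
    try:
        raw = response.read()
        # charset from the Content-Type HTTP header, or None if absent
        header_charset = response.info().get_content_charset()
    finally:
        response.close()
    return decode_page(raw, header_charset)

# Offline check with made-up page bytes: header says utf-8, page says latin-1
page = (b'<meta http-equiv="Content-Type" '
        b'content="text/html; charset=iso-8859-1"><p>caf\xe9</p>')
print(decode_page(page, header_charset='utf-8'))
```

One design caveat: latin-1 can decode any byte sequence without raising, so a wrong latin-1 declaration will never trigger the fallback. Real pages would call `fetch_text("http://www.python.org")` or similar.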
Last edited by scru; Aug 23rd, 2009 at 11:21 am.
As Scru mentioned, let's hope the authors of the website were good guys and included the type of encoding. You can modify Henri's code ...

    # get the code of a given URL as an html text string
    # Python3 uses urllib.request.urlopen()
    # get the encoding used first
    # tested with Python 3.1 with the Editra IDE

    import urllib.request

    def extract(text, sub1, sub2):
        """
        extract a substring from text between the first occurrences
        of substrings sub1 and sub2, or '' if sub1 is not found
        """
        if sub1 not in text:
            return ''
        return text.split(sub1, 1)[-1].split(sub2, 1)[0]

    fp = urllib.request.urlopen("http://www.python.org")
    mybytes = fp.read()

    encoding = extract(str(mybytes).lower(), 'charset=', '"')
    print('-'*50)
    print( "Encoding type = %s" % encoding )
    print('-'*50)

    if encoding:
        # note that Python3 does not read the html code as string
        # but as bytes, convert to string with
        mystr = mybytes.decode(encoding)
        print(mystr)
    else:
        print("Encoding type not found!")

    fp.close()
May 'the Google' be with you!
Not sure if this is a proper method or not, but I haven't had any problems so far using....

    x = repr(request.urlopen(req).read())
    print(x)
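A caveat with this approach: repr() does not decode anything. It returns the Python literal for the bytes object, so the result still carries the b'...' wrapper and backslash escapes. The offline sketch below compares the two; the `raw` bytes are a made-up sample, assumed to be utf-8 encoded.

```python
raw = b'<p>caf\xc3\xa9</p>\n'   # hypothetical page bytes, utf-8 encoded

as_repr = repr(raw)             # a str, but of the bytes *literal*
as_text = raw.decode('utf-8')   # the actual page text

print(as_repr)   # still shows the b'...' wrapper and backslash escapes
print(as_text)
```

So repr() is handy for a quick look at what came back, but for manipulating the text, decode() is the one to use.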
Last edited by scru; Aug 25th, 2009 at 7:18 pm.