I have a question. I'm trying to get text from a website containing Spanish characters (like ñ or á) using urllib2 (with # -*- coding: latin-1 -*- near the beginning of the Python file). However, when I write the text to a file, I get something else: for example, ñ appears as Ã± (so español appears as espaÃ±ol). Even if I manually add a line like letter="ñ" and print it to the screen, it appears as ±. Any advice? As I mentioned in the title, I'm using Python 2.7 on Windows 7 (though I get the same output in a file on Ubuntu). Thanks in advance!

Are you sure the site isn't using UTF-8, for example?

This prints fine in Python 2.7.2 on Windows XP, both typed from the keyboard and copied from this post on DaniWeb.

# -*- coding: cp1252 -*-
print 'Viva España!'
print 'español'
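
The symptom in the question (ñ turning into two characters) is what you would typically see when UTF-8 bytes are read back as Latin-1. Here is a minimal sketch of that round trip, assuming the page really is UTF-8; the function names are made up for illustration, and the code is written to run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-

def misread_as_latin1(text):
    """Encode text as UTF-8, then decode those bytes with the wrong codec."""
    return text.encode('utf-8').decode('latin-1')


def repair(garbled):
    """Undo the mistake: recover the original bytes, then decode them as UTF-8."""
    return garbled.encode('latin-1').decode('utf-8')


original = u'espa\xf1ol'                 # 'español' as a unicode string
garbled = misread_as_latin1(original)    # ñ becomes the two characters Ã and ±
restored = repair(garbled)               # back to the original text
```

If the restored text matches the original, the file most likely contains correct UTF-8 and is only being viewed with the wrong encoding.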

Thanks! On screen, your code prints the ñ just fine. But if I try to write the output to a file with f.write('español') (even typed from the keyboard), the ñ doesn't come out properly in the file opened by f.

The website I'm accessing is within the www.wordreference.com domain. In links, the characters are written like habr%c3%a1 for habrá, but in regular text (the vast majority of the page), words like habrá appear without any apparent special encoding.
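
The %c3%a1 in those links is a hint about the encoding: C3 A1 is the UTF-8 byte pair for á, written in percent-encoded form. A small sketch that decodes such a link fragment follows; decode_link_word is a hypothetical helper name, and the import fallback is only there so the same code runs on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
try:
    from urllib import unquote  # Python 2: returns a byte string
except ImportError:
    # Python 3: unquote_to_bytes returns the raw bytes
    from urllib.parse import unquote_to_bytes as unquote


def decode_link_word(quoted):
    """Turn a percent-encoded fragment into unicode, assuming UTF-8 bytes."""
    raw = unquote(quoted)        # 'habr%c3%a1' -> bytes 'habr\xc3\xa1'
    return raw.decode('utf-8')   # bytes -> u'habr\xe1' (habrá)


word = decode_link_word('habr%c3%a1')
```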

So I checked it out, and it looks like UTF-8 to me:

# -*- coding: cp1252 -*-
import urllib2
print 'Viva España!'
print 'español'
site = urllib2.urlopen('http://www.wordreference.com/')
test = site.read()
site.close()
print test.decode('utf8')

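For writing to a file, I suggest codecs:

# -*- coding: cp1252 -*-
import codecs
# Pass a unicode literal (u'...') so codecs can encode it as UTF-8
out_file = codecs.open("some_file.txt", 'w', 'utf8')
out_file.write(u'español')
out_file.close()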
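
One thing to watch in Python 2: an encoding-aware writer expects unicode, and a plain byte string containing ñ gets implicitly decoded as ASCII first, which fails or garbles the text. Below is a sketch of the decode-first, encode-on-write pattern using io.open, which behaves the same on Python 2.7 and Python 3. The helper names and the temporary file path are assumptions for illustration only.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def write_text(path, text, encoding='utf-8'):
    """Write a unicode string to path, encoding it on the way out."""
    with io.open(path, 'w', encoding=encoding) as f:
        f.write(text)


def read_bytes(path):
    """Return the raw bytes stored in the file."""
    with io.open(path, 'rb') as f:
        return f.read()


# Pretend these bytes came from site.read(); decode them before writing.
page_bytes = b'espa\xc3\xb1ol'
page_text = page_bytes.decode('utf-8')

demo_path = os.path.join(tempfile.mkdtemp(), 'some_file.txt')
write_text(demo_path, page_text)
stored = read_bytes(demo_path)   # the file holds UTF-8 bytes again
```

The file then has to be opened as UTF-8 in the editor as well, otherwise the ñ will still look wrong even though the bytes are correct.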

That code gave me the following error after the print test.decode('utf8')...

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 6695-6700: character maps to <undefined>

Is there a way around this? Thanks!
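
The traceback shows the decode itself worked; what failed was printing to a console that uses cp437, which has no mapping for some characters on the page. One possible workaround, sketched here with a hypothetical helper name, is to replace the unprintable characters before printing. It runs on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
import sys


def console_safe(text, encoding=None):
    """Return text with characters the console cannot show replaced by '?'."""
    # Fall back to ASCII if the stream does not report an encoding.
    encoding = encoding or getattr(sys.stdout, 'encoding', None) or 'ascii'
    return text.encode(encoding, 'replace').decode(encoding)


# cp437 has a slot for á but not for the euro sign.
sample = console_safe(u'habr\xe1 \u20ac', 'cp437')
```

In the code from the thread this would be used as print console_safe(test.decode('utf8')). Only the screen output loses characters; the decoded text itself is untouched and can still be written to a file in full.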

The code using codecs still gave me...
espa¤ol

(This is based on how it looks in MS Word (opening with utf8) and WordPad.)
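
One possible explanation for the ¤, which the thread does not confirm: the traceback above suggests the code was typed at an interactive console using cp437. In cp437, ñ is stored as the single byte A4, and that same byte displays as ¤ when read as cp1252 or Latin-1. A sketch of that mismatch follows; reinterpret is a made-up name, and the code runs on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-

def reinterpret(char, written_with, read_with):
    """Encode char with one codec and decode the same bytes with another."""
    raw = char.encode(written_with)
    return raw, raw.decode(read_with)


# ñ typed at a cp437 console, then viewed as cp1252
raw, seen = reinterpret(u'\xf1', 'cp437', 'cp1252')
# raw is the single byte A4; seen is u'\xa4', the currency sign
```

If that is what happened, putting the code in a .py file with a matching coding line and using a unicode literal should avoid the problem.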
