Hi. I'm having a little difficulty understanding character sets in Python. Basically, I'm trying to write a function that replaces each non-ASCII character with a similar ASCII equivalent. Given a string like 'ÂBÇD', the function would iterate through the string, replacing selected characters to return a fully ASCII string, 'ABCD'. It would substitute A for Â and C for Ç, while leaving the other characters alone. (The point of all this is to write EXIF metadata in JPEG images, where some fields only allow ASCII characters. So I want to use the real, partially non-ASCII names for some metadata fields and "safe" ASCIIfied versions for others.)

I've tried this a number of ways and all of them have failed. Judging by the debugging output, the issue seems to be a character set problem. I've tried adding encode, decode and unicode calls, but the more I read about character sets in Python, the more confused I get. Right now the solution I'm working on involves a list of tuples like this:

translations = [('Ä', 'A'), ('Å', 'A'), ...]

The program accepts a user-supplied string (called nameString) that may contain non-ASCII characters as an argument on the command line. It passes this nameString to the convertToAscii() function. That function uses the translations list of tuples to swap characters where needed:

def convertToAscii(nameString):
    global translations
    res = ''
    for character in nameString:
        found = False
        for translation in translations:
            if character == translation[0]:
                res = res + translation[1] # replace with ASCII equivalent
                found = True
                break
        if not found:
            res = res + character # just use original character
    return res

Except it doesn't work. The names all come out as they went in, so for some reason Python isn't matching non-ASCII characters in the nameString with strings in translation[0]. Adding debugging print statements shows that it is reading the translations list and its component tuples, however. It's just not recognizing matches. If anyone knows why, I'd be much obliged. In case it's helpful, it's Python 2.5.2 on Linux.

Regards,
Ed Holden

It's just personal preference, but I prefer to work with numbers in this type of code. So the following is an example of using the ord() of the character as the key in a conversion dictionary.

#!/usr/bin/python
# -*- coding: utf-8 -*-

x_list=[u'Ä', u'Â']
print ord(x_list[0]), ord(x_list[1])

ord_dic={194:'A', 196:'A', 197:'A', 199:'C'}
for chr in x_list:
   if ord(chr) in ord_dic:
      print ord(chr), 'converts to', ord_dic[ord(chr)]

print
x=u'ÂBÇD'
new_str=""
for chr in x:
   ord_chr=ord(chr)
   if ord_chr in ord_dic:
      print ord_chr, 'converts to', ord_dic[ord_chr]
      new_str += ord_dic[ord_chr]
   else:
      new_str += chr
print "converted to", new_str

Thanks for getting back to me, woooee. That was a good suggestion. However, I think I'm running into the same problem I always run into. I think the issue is that unicode characters are being interpreted as two characters rather than one, because they take up twice the number of bytes.

First off, I'm doing this:

#!/usr/bin/python
# -*- coding: utf-8 -*-

(I've tried that a couple ways, including with latin-1.) Then I'm establishing my translations dictionary more or less as you suggested:

translations = {
    196: 'A',   # translate 'Ä'
    197: 'A',   # translate 'Å'
}

Finally I'm making my function:

def convertToAscii(nameString):
    global translations
    revisedNameString = ''

    for character in nameString:
        ordCharacter=ord(character)
        print "ordCharacter of", character, "is", ordCharacter # debugging
        if ordCharacter in translations:
            print " ... found" # debugging
            revisedNameString = revisedNameString + translations[ordCharacter]
        else:
            print " ... NOT found" # debugging
            revisedNameString = revisedNameString + character
    return revisedNameString

That all looks good, and the print statements allow me to keep track of what Python is doing. But when I run it, passing a string like ÄBC-ÅBC:

ordCharacter of � is 195
 ... found
ordCharacter of � is 132
 ... NOT found
ordCharacter of B is 66
 ... NOT found
ordCharacter of C is 67
 ... NOT found
ordCharacter of - is 45
 ... NOT found
ordCharacter of � is 195
 ... found
ordCharacter of � is 133
 ... NOT found
ordCharacter of B is 66
 ... NOT found
ordCharacter of C is 67
 ... NOT found

The end result is ÄBC-ÅBC, the original string, rather than a nice ASCIIfied ABC-ABC. But you'll notice that Python gives each of the two A-like characters not one iteration, but two. So Ä and Å are not being identified by their ordinal codes of 196 and 197. Instead, Ä is being identified as both 195 and 132, and later on Å is being identified as 195 and 133. So again, I think Python doesn't grok that I'm passing it a unicode character, and it's iterating over two characters in the string rather than one, preventing a match.

Ever seen this before? I'm sure there's a simple way of getting around it, but I haven't been able to dig one up.

Thanks again,
Ed

unicode characters are being interpreted as two characters rather than one

This usually means that it is not a unicode string, but a standard Python byte string. In 3.0, all strings will be unicode by default, so I have heard, and so will not require conversion. Look at the printout from the following snippet. I am not even close to being an expert on unicode, but I think this will work.

normal_string='ÂBÇD'
for chr in normal_string:
   print ord(chr),
   print "\n--------------------"
#
##unicode_string=u'ÂBÇD'     ## use this or the following line
unicode_string=normal_string.decode('utf-8')
print "\n"
for chr in unicode_string:
   print ord(chr)
   print "--------------------"
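To tie this back to the debugging output earlier in the thread, here is a small sketch of where 195 and 132 come from. It assumes the terminal passes the argument as UTF-8, and it is written with escapes and bytearray() so that it should run on Python 2.6+ as well as Python 3 (on 2.5 the bytearray() call would need replacing). The variable names are made up for illustration.

```python
# -*- coding: utf-8 -*-
# A-umlaut as a unicode string: one character, code point 196 (0xC4).
a_umlaut = u'\xc4'

# The same character encoded as UTF-8, which is what a byte string holds.
utf8_bytes = a_umlaut.encode('utf-8')

# bytearray() yields integers on both Python 2 and Python 3.
byte_values = list(bytearray(utf8_bytes))
print(byte_values)    # [195, 132]

# Decoding turns the two bytes back into a single character.
decoded = utf8_bytes.decode('utf-8')
code_points = [ord(ch) for ch in decoded]
print(code_points)    # [196]
```

So a loop over the undecoded byte string sees two items for Ä, neither of which is 196, while a loop over the decoded string sees one item that matches the dictionary key.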

Yes, I've also heard that about Python 3.0. Sounds like a good idea.

So anyway, that did it. Simply adding this construction:

unicode_string=normal_string.decode('utf-8')

... converted the input into a unicode string, which finally worked: with the decoded string, the function did produce an ASCII version. Thanks again for the assistance.

Best,
Ed
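For anyone finding this thread later, here is a minimal sketch of how the pieces discussed above might fit together. It is not Ed's final code: the name convert_to_ascii, the TRANSLATIONS table and the encoding parameter are hypothetical, the input is assumed to be UTF-8, and the built-in translate() method of unicode strings is swapped in for the hand-written loop. It should run on Python 2.6+ and Python 3.

```python
# -*- coding: utf-8 -*-
# Hypothetical mapping from code point to ASCII replacement; extend as needed.
TRANSLATIONS = {
    0xC2: u'A',  # A with circumflex (194)
    0xC4: u'A',  # A with diaeresis (196)
    0xC5: u'A',  # A with ring above (197)
    0xC7: u'C',  # C with cedilla (199)
}


def convert_to_ascii(name, encoding='utf-8'):
    """Return an ASCII-only version of name.

    name may be a byte string (as a command-line argument is in Python 2)
    or an already decoded unicode string.
    """
    if isinstance(name, bytes):
        # Decode first, so that each item is one character, not one byte.
        name = name.decode(encoding)
    # translate() looks up ord() of each character in the mapping and
    # leaves characters without an entry unchanged.
    ascii_name = name.translate(TRANSLATIONS)
    # Raises UnicodeEncodeError if a non-ASCII character had no entry.
    ascii_name.encode('ascii')
    return ascii_name


# Stand-in for a command-line argument received as UTF-8 bytes.
raw = u'\xc4BC-\xc5BC'.encode('utf-8')
print(convert_to_ascii(raw))    # ABC-ABC
```

The explicit encode('ascii') check is a design choice: it fails loudly when the table is missing a character, rather than letting a non-ASCII name slip into an ASCII-only EXIF field. In Python 3 the command-line arguments arrive already decoded, so the decode step is simply skipped there.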
