Probably a character set problem, trying to substitute characters
Join Date: Jun 2008
Posts: 3
Reputation:
Solved Threads: 0
Hi. I'm having a little difficulty understanding character sets in Python. Basically I'm trying to write a function that will substitute a non-ASCII character with a similar ASCII equivalent. So if given a string like 'ÂBÇD', the function would iterate through the string object, replacing select characters to return a fully-ASCII string, 'ABCD'. It would substitute A for  and C for Ç, while leaving the other characters alone. (The point to all this is to write EXIF metadata in JPEG images, where some fields only allow ASCII characters. So I want to use the real, partially non-ASCII names for some metadata fields and "safe" ASCIIfied versions for others.)
I've tried this a number of ways and all of them have failed. Judging by the debugging output, the issue seems to be a character set problem. I've tried adding encode, decode and unicode functions, but the more I read about character sets in Python, the more confused I get. Right now the solution I'm working on involves a list of tuples like this:
    translations = [('Ä', 'A'), ('Å', 'A'), ...]

The program accepts a user-supplied string (called nameString) that may contain non-ASCII characters as an argument on the command line. It passes this nameString to the convertToAscii() function. That function uses the translations list of tuples to swap characters where needed:
    def convertToAscii(nameString):
        global translations
        res = ''
        for character in nameString:
            found = False
            for translation in translations:
                if character == translation[0]:
                    res = res + translation[1]  # replace with ASCII equivalent
                    found = True
                    break
            if found == False:
                res = res + character  # just use original character
        return res
Except it doesn't work. The names all come out as they went in, so for some reason Python isn't matching non-ASCII characters in the nameString with strings in translation[0]. Adding debugging print statements shows that it is reading the translations list and its component tuples, however. It's just not recognizing matches. If anyone knows why, I'd be much obliged. In case it's helpful, it's Python 2.5.2 on Linux.
Regards,
Ed Holden
Join Date: Dec 2006
Posts: 1,045
Reputation:
Solved Threads: 294
It's just personal preference, but I prefer to work with numbers in this type of code. So the following is an example of using the ord() of the character as the key in a conversion dictionary.
    #!/usr/bin/python
    # -*- coding: utf-8 -*-
    x_list=[u'Ä', u'Â']
    print ord(x_list[0]), ord(x_list[1])
    ord_dic={194:'A', 196:'A', 197:'A', 199:'C'}
    for chr in x_list:
        if ord(chr) in ord_dic:
            print ord(chr), 'converts to', ord_dic[ord(chr)]

    x=u'ÂBÇD'
    new_str=""
    for chr in x:
        ord_chr=ord(chr)
        if ord_chr in ord_dic:
            print ord_chr, 'converts to', ord_dic[ord_chr]
            new_str += ord_dic[ord_chr]
        else:
            new_str += chr
    print "converted to", new_str
Last edited by woooee; Jun 11th, 2008 at 11:32 pm.
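For readers on Python 3, where every `str` is already Unicode, the same ordinal-keyed lookup can be written more compactly. This is only a sketch of the idea above, not woooee's code; the names `ORD_TABLE` and `to_ascii` are illustrative assumptions.

```python
# Python 3 sketch (assumption: the thread itself targets Python 2.5).
# ORD_TABLE and to_ascii are hypothetical names used for illustration.

# Keys are code points, values are the ASCII replacements.
ORD_TABLE = {
    194: "A",  # Â
    196: "A",  # Ä
    197: "A",  # Å
    199: "C",  # Ç
}


def to_ascii(text):
    """Swap each character found in ORD_TABLE; keep all others as they are."""
    return "".join(ORD_TABLE.get(ord(ch), ch) for ch in text)


print(to_ascii("\u00c2B\u00c7D"))  # input is ÂBÇD, prints ABCD
```

Using `dict.get` with the original character as the default replaces the explicit if/else branch in the loop.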
Join Date: Jun 2008
Posts: 3
Reputation:
Solved Threads: 0
Thanks for getting back to me, woooee. That was a good suggestion. However, I think I'm running into the same problem I always run into. I think the issue is that unicode characters are being interpreted as two characters rather than one, because they take up twice the number of bytes.
First off, I'm doing this:
    #!/usr/bin/python
    # -*- coding: utf-8 -*-
(I've tried that a couple ways, including with latin-1.) Then I'm establishing my translations dictionary more or less as you suggested:
    translations = {
        196: 'A',  # translate 'Ä'
        197: 'A',  # translate 'Å'
    }
Finally I'm making my function:
    def convertToAscii(nameString):
        global translations
        revisedNameString = ''
        for character in nameString:
            ordCharacter=ord(character)
            print "ordCharacter of", character, "is", ordCharacter  # debugging
            if ordCharacter in translations:
                print " ... found"  # debugging
                revisedNameString = revisedNameString + translations[ordCharacter]
            else:
                print " ... NOT found"  # debugging
                revisedNameString = revisedNameString + character
        return revisedNameString
That all looks good, and the print statements allow me to keep track of what Python is doing. But when I run it, passing a string like ÄBC-ÅBC:
    ordCharacter of � is 195
     ... found
    ordCharacter of � is 132
     ... NOT found
    ordCharacter of B is 66
     ... NOT found
    ordCharacter of C is 67
     ... NOT found
    ordCharacter of - is 45
     ... NOT found
    ordCharacter of � is 195
     ... found
    ordCharacter of � is 133
     ... NOT found
    ordCharacter of B is 66
     ... NOT found
    ordCharacter of C is 67
     ... NOT found
The end result is ÄBC-ÅBC, the original string, rather than a nice ASCIIfied ABC-ABC. But you'll notice that Python gives each of the two A-like characters not one iteration but two. So Ä and Å are not being correctly identified by their ordinal values of 196 and 197. Instead, Ä is being identified as both 195 and 132, and later on Å is being identified as 195 and 133. So again, I think Python doesn't grok that I'm passing it a unicode character, and it's iterating over two characters in the string rather than one, preventing a match.
Ever seen this before? I'm sure there's a simple way of getting around it, but I haven't been able to dig one up.
Thanks again,
Ed
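The 195/132 and 195/133 pairs in that output are the UTF-8 byte sequences for Ä and Å, which suggests the loop is walking over bytes rather than characters. A small Python 3 sketch makes the difference visible. This is an illustration only: the thread uses Python 2.5, where a plain `str` behaves like Python 3 `bytes` in this respect, and the input is assumed to arrive as UTF-8.

```python
# Python 3 sketch: compare the UTF-8 bytes of a string with its code points.
text = "\u00c4BC-\u00c5BC"  # ÄBC-ÅBC

raw = text.encode("utf-8")  # the encoded form, assumed to be what the shell passes in
byte_values = list(raw)     # iterating over bytes yields integers
code_points = [ord(ch) for ch in raw.decode("utf-8")]

print(byte_values)  # [195, 132, 66, 67, 45, 195, 133, 66, 67]
print(code_points)  # [196, 66, 67, 45, 197, 66, 67]
```

Only the decoded form produces 196 and 197, the keys that the translations dictionary expects.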
Join Date: Dec 2006
Posts: 1,045
Reputation:
Solved Threads: 294
"unicode characters are being interpreted as two characters rather than one"
    #!/usr/bin/python
    # -*- coding: utf-8 -*-
    normal_string='ÂBÇD'
    for chr in normal_string:
        print ord(chr),
    print "\n--------------------"
    #
    ##unicode_string=u'ÂBÇD'   ## use this or the following line
    unicode_string=normal_string.decode('utf-8')
    print "\n"
    for chr in unicode_string:
        print ord(chr)
    print "--------------------"
Last edited by woooee; Jun 14th, 2008 at 5:39 pm.
Join Date: Jun 2008
Posts: 3
Reputation:
Solved Threads: 0
Yes, I've also heard that about Python 3.0. Sounds like a good idea.
So anyway, that did it. Simply adding this construction:
    unicode_string=normal_string.decode('utf-8')
... converted it into a form that finally worked, and the decoding did create an ASCII version. Thanks again for the assistance.
Best,
Ed
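Putting the thread's pieces together, the fix amounts to decoding the incoming bytes once and only then looking characters up. The following is a hedged Python 3 sketch of that flow, not Ed's final code (which the thread does not show). The names `convert_to_ascii` and `TRANSLATIONS`, and the assumption that the input arrives as UTF-8 bytes, are all illustrative.

```python
# Hypothetical end-to-end sketch in Python 3; all names are illustrative.

# str.translate accepts a dict that maps code points to replacement strings.
TRANSLATIONS = {
    ord("\u00c4"): "A",  # Ä
    ord("\u00c5"): "A",  # Å
}


def convert_to_ascii(name, encoding="utf-8"):
    """Decode bytes if necessary, then swap mapped characters for ASCII."""
    if isinstance(name, bytes):
        # Assumption: the caller knows the input encoding (UTF-8 by default).
        name = name.decode(encoding)
    return name.translate(TRANSLATIONS)


raw_name = b"\xc3\x84BC-\xc3\x85BC"  # UTF-8 bytes for ÄBC-ÅBC
print(convert_to_ascii(raw_name))    # prints ABC-ABC
```

Characters absent from the table pass through unchanged, so for fields that must be pure ASCII it may be worth calling `.encode("ascii")` on the result, which raises `UnicodeEncodeError` if anything non-ASCII remains.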