Probably a character set problem, trying to substitute characters


edholden
  #1
Jun 11th, 2008
Hi. I'm having a little difficulty understanding character sets in Python. Basically I'm trying to write a function that will substitute a non-ASCII character with a similar ASCII equivalent. So if given a string like 'ÂBÇD', the function would iterate through the string object, replacing select characters to return a fully-ASCII string, 'ABCD'. It would substitute A for Â and C for Ç, while leaving the other characters alone. (The point to all this is to write EXIF metadata in JPEG images, where some fields only allow ASCII characters. So I want to use the real, partially non-ASCII names for some metadata fields and "safe" ASCIIfied versions for others.)

I've tried this a number of ways and all of them have failed. Debugging output suggests a character set problem, and I've tried adding encode, decode and unicode calls, but the more I read about character sets in Python, the more confused I get. Right now the solution I'm working on involves a list of tuples like this:

    translations = [('Ä', 'A'), ('Å', 'A'), ...]

The program accepts a user-supplied string (called nameString) that may contain non-ASCII characters as an argument on the command line. It passes this nameString to the convertToAscii() function. That function uses the translations list of tuples to swap characters where needed:

    def convertToAscii(nameString):
        global translations
        res = ''
        for character in nameString:
            found = False
            for translation in translations:
                if character == translation[0]:
                    res = res + translation[1]  # replace with ASCII equivalent
                    found = True
                    break
            if not found:
                res = res + character  # just use original character
        return res

Except it doesn't work. The names all come out as they went in, so for some reason Python isn't matching non-ASCII characters in the nameString with strings in translation[0]. Adding debugging print statements shows that it is reading the translations list and its component tuples, however. It's just not recognizing matches. If anyone knows why, I'd be much obliged. In case it's helpful, it's Python 2.5.2 on Linux.
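
For comparison, here is a minimal sketch of the same pair-based lookup written for Python 3, where every str is already Unicode and each accented letter is a single character during iteration. It is not the poster's code: the name convert_to_ascii, the LOOKUP dict, and the four pairs (taken from characters mentioned in this thread) are assumptions for illustration only.

```python
# Python 3 sketch; the pairs below are illustrative, not the poster's full list.
TRANSLATIONS = [
    ("\u00c4", "A"),  # Ä
    ("\u00c5", "A"),  # Å
    ("\u00c2", "A"),  # Â
    ("\u00c7", "C"),  # Ç
]

# A dict gives one lookup per character instead of a scan over the whole list.
LOOKUP = dict(TRANSLATIONS)


def convert_to_ascii(name_string):
    """Replace each character found in LOOKUP; leave the others unchanged."""
    return "".join(LOOKUP.get(ch, ch) for ch in name_string)


print(convert_to_ascii("\u00c2B\u00c7D"))  # input is ÂBÇD, prints ABCD
```

Building the dict once keeps the function body to a single expression and avoids the found flag.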

Regards,
Ed Holden

woooee
  #2
Jun 11th, 2008
It's just personal preference, but I prefer to work with numbers in this type of code. So the following is an example of using the ord() of the character as the key in a conversion dictionary.
    #!/usr/bin/python
    # -*- coding: utf-8 -*-

    x_list = [u'Ä', u'Â']
    print ord(x_list[0]), ord(x_list[1])

    # map the ordinal of each accented character to its ASCII replacement
    ord_dic = {194: 'A', 196: 'A', 197: 'A', 199: 'C'}
    for ch in x_list:
        if ord(ch) in ord_dic:
            print ord(ch), 'converts to', ord_dic[ord(ch)]

    print
    x = u'ÂBÇD'
    new_str = ""
    for ch in x:
        ord_ch = ord(ch)
        if ord_ch in ord_dic:
            print ord_ch, 'converts to', ord_dic[ord_ch]
            new_str += ord_dic[ord_ch]
        else:
            new_str += ch
    print "converted to", new_str
Last edited by woooee; Jun 11th, 2008 at 11:32 pm.
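
An ordinal-keyed dictionary like the one above is also the shape that the built-in str.translate method accepts in Python 3, so the loop can be collapsed into one call. This is a sketch under that assumption, not part of the original reply; the names ORD_TABLE and asciify are made up for the example.

```python
# Python 3 sketch: str.translate accepts a dict keyed by code point (ordinal).
ORD_TABLE = {194: "A", 196: "A", 197: "A", 199: "C"}  # Â, Ä, Å, Ç


def asciify(text):
    """Return text with every mapped code point replaced; others pass through."""
    return text.translate(ORD_TABLE)


print(asciify("\u00c2B\u00c7D"))  # input is ÂBÇD, prints ABCD
```

Characters whose ordinals are missing from the table are copied through unchanged, which matches the else branch of the loop version.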

edholden
  #3
Jun 14th, 2008
Thanks for getting back to me, woooee. That was a good suggestion. However, I think I'm running into the same problem I always run into. I think the issue is that unicode characters are being interpreted as two characters rather than one, because they take up twice the number of bytes.

First off, I'm doing this:

    #!/usr/bin/python
    # -*- coding: utf-8 -*-

(I've tried that a couple ways, including with latin-1.) Then I'm establishing my translations dictionary more or less as you suggested:

    translations = {
        196: 'A',  # translate 'Ä'
        197: 'A',  # translate 'Å'
    }

Finally I'm making my function:


    def convertToAscii(nameString):
        global translations
        revisedNameString = ''

        for character in nameString:
            ordCharacter = ord(character)
            print "ordCharacter of", character, "is", ordCharacter  # debugging
            if ordCharacter in translations:
                print " ... found"  # debugging
                revisedNameString = revisedNameString + translations[ordCharacter]
            else:
                print " ... NOT found"  # debugging
                revisedNameString = revisedNameString + character
        return revisedNameString

That all looks good, and the print statements allow me to keep track of what Python is doing. But when I run it, passing a string like ÄBC-ÅBC:

    ordCharacter of � is 195
    ... found
    ordCharacter of � is 132
    ... NOT found
    ordCharacter of B is 66
    ... NOT found
    ordCharacter of C is 67
    ... NOT found
    ordCharacter of - is 45
    ... NOT found
    ordCharacter of � is 195
    ... found
    ordCharacter of � is 133
    ... NOT found
    ordCharacter of B is 66
    ... NOT found
    ordCharacter of C is 67
    ... NOT found

The end result is ÄBC-ÅBC, the original string, rather than a nice ASCIIfied ABC-ABC. But you'll notice that Python gives each of the two A-like characters not one iteration but two. So Ä and Å are not being identified by their ordinal values of 196 and 197. Instead, Ä is being seen as 195 followed by 132, and later on Å as 195 followed by 133. So again, I think Python doesn't grok that I'm passing it a Unicode character, and it's iterating over two characters in the string rather than one, preventing a match.
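
The pairs 195/132 and 195/133 are the two-byte UTF-8 encodings of Ä and Å, which suggests the function was iterating over bytes rather than characters. The sketch below reproduces the numbers in Python 3; it assumes the command-line argument arrived as UTF-8, and the variable names are made up for the example.

```python
# Python 3 sketch: compare UTF-8 bytes with Unicode code points.
text = "\u00c4BC-\u00c5BC"   # ÄBC-ÅBC
raw = text.encode("utf-8")   # assumed form of the input: UTF-8 encoded bytes

byte_values = list(raw)                 # iterating over bytes yields one int per byte
code_points = [ord(ch) for ch in text]  # iterating over str yields one character each

print(byte_values)   # [195, 132, 66, 67, 45, 195, 133, 66, 67]
print(code_points)   # [196, 66, 67, 45, 197, 66, 67]
```

Nine byte values against seven code points accounts for the two extra iterations seen in the debugging output.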

Ever seen this before? I'm sure there's a simple way of getting around it, but I haven't been able to dig one up.

Thanks again,
Ed

woooee
  #4
Jun 14th, 2008
"unicode characters are being interpreted as two characters rather than one"
This usually means that it is not a unicode string, but a standard python string. In 3.0, all strings will be unicode by default, so I have heard, and so will not require conversion. Look at the print out from the following snippet. I am not even close to being an expert on unicode, but I think this will work.
    # a plain (byte) string: iterating yields one byte at a time
    normal_string = 'ÂBÇD'
    for ch in normal_string:
        print ord(ch),
    print "\n--------------------"
    #
    ##unicode_string=u'ÂBÇD' ## use this or the following line
    # decoding produces a unicode string: one item per character
    unicode_string = normal_string.decode('utf-8')
    print "\n"
    for ch in unicode_string:
        print ord(ch)
    print "--------------------"
Last edited by woooee; Jun 14th, 2008 at 5:39 pm.
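
The decode step only helps when the codec matches how the bytes were actually encoded. As a hedged illustration in Python 3 (the byte literal and variable names are assumptions, not taken from the thread), decoding the same UTF-8 bytes as latin-1 keeps the two-characters-per-letter symptom:

```python
# Python 3 sketch: the same bytes decoded with a matching and a mismatched codec.
raw = b"\xc3\x84BC"                 # UTF-8 bytes for ÄBC

as_utf8 = raw.decode("utf-8")       # matching codec: Ä becomes one character
as_latin1 = raw.decode("latin-1")   # mismatched codec: each byte becomes a character

print(len(as_utf8), [ord(ch) for ch in as_utf8])      # 3 [196, 66, 67]
print(len(as_latin1), [ord(ch) for ch in as_latin1])  # 4 [195, 132, 66, 67]
```

Only the UTF-8 decode yields the ordinal 196 that a lookup table keyed on Ä expects.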

edholden
  #5
Jun 14th, 2008
Yes, I've also heard that about Python 3.0. Sounds like a good idea.

So anyway, that did it. Simply adding this construction:

    unicode_string = normal_string.decode('utf-8')

... converted the input into a Unicode string, so the lookups finally matched and the function produced an ASCII version. Thanks again for the assistance.
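
Putting the pieces of this thread together, a minimal end-to-end sketch in Python 3 might decode first, substitute second, and then force the result to ASCII. The function name to_ascii, the default encoding, the four-entry table, and the '?' fallback for unmapped characters are all assumptions for illustration, not what the final program did.

```python
# Python 3 sketch: decode, substitute, then guarantee an ASCII-only result.
ORD_TABLE = {194: "A", 196: "A", 197: "A", 199: "C"}  # Â, Ä, Å, Ç (illustrative)


def to_ascii(raw, encoding="utf-8"):
    """Decode bytes, substitute mapped characters, and force the result to ASCII."""
    text = raw.decode(encoding)
    replaced = text.translate(ORD_TABLE)
    # Any character still outside ASCII becomes '?' so the result is always safe.
    return replaced.encode("ascii", "replace").decode("ascii")


print(to_ascii("\u00c4BC-\u00c5BC".encode("utf-8")))  # input is ÄBC-ÅBC, prints ABC-ABC
```

The final encode with the "replace" error handler is a design choice for ASCII-only fields such as the EXIF ones mentioned in the first post: a name with an unmapped character degrades to '?' instead of raising an exception.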

Best,
Ed

©2003 - 2009 DaniWeb® LLC