So, I am slamming my head into my desk right now. I am trying to take a string containing unicode character codes and convert it to a python unicode string. I thought it would be simple, but I am having major issues. Any help would be greatly appreciated. This is what I am confused about.

Starting with this: test = "\u2022", I want to convert it to a unicode string, which should look like u'\u2022'. But when I try to convert test with test.encode("utf-8"), it gives me back u'\\u2022', which when printed just shows "\u2022", which is not helpful at all!

Check this out:

>>> test = "\u2022"
>>> test.decode("utf-8")
u'\\u2022'
>>> test.encode("utf-8")
u'\\u2022'
>>> print test.decode("utf-8")
\u2022
>>> print test.encode("utf-8")
\u2022

So, I must be missing something. I am retrieving the original string externally, so I cannot make it unicode from the start; I need to be able to convert it after the fact. I feel like I have tried everything. It would be great if there was a simple fix.

Thanks very much!

All 9 Replies

test = u"\u2022"
print test.encode("utf-8")
ÔÇó

Have you tested this one?

Yes, I have tried this, but it does not solve the problem I am currently working with. I need to be able to start with the plain ASCII string "\u2022" and then, after the fact, convert it to a unicode string that looks like u'\u2022'.

This one is not very clean but it may work till you've got a better solution...

test = '\u2022'
exec 'print u"%s".encode("utf-8")' % (test)
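Since the string comes from an external source, running it through exec will execute whatever it contains. A narrower variant of the same trick can be sketched with ast.literal_eval, which only accepts literals. This sketch is written for Python 3 (the thread itself uses Python 2), the helper name unescape_with_literal_eval is made up for illustration, and it assumes the text contains no double quotes, newlines, or trailing backslash.

```python
import ast


def unescape_with_literal_eval(text):
    # Build the source text of a string literal and let ast.literal_eval parse it.
    # Unlike exec, literal_eval accepts only literals, so it cannot run statements.
    # Assumption: text holds no double quotes or newlines, which would break the literal.
    return ast.literal_eval('"%s"' % text)


test = "\\u2022"  # the six ASCII characters received from outside
print(unescape_with_literal_eval(test) == "\u2022")
```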

That looks like it's gonna work!

Thank you so much jice, you have no idea how happy I am to have this solution.

If you find something better, don't hesitate to post it here... I'd be glad to know it.

I definitely will, but jeez, from the hunting I did before I posted on here, I don't know if there is anything else out there.

One thing though: for my uses, since the "\u2022" is already basically in unicode form, just not in a unicode string, your code does not need the .encode("utf-8"), since it's already being inserted into a string declared as unicode.
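To make the distinction concrete, here is a small sketch of the three forms involved: the escaped text, the unicode string, and its UTF-8 bytes. It is written for Python 3, and the variable names are only illustrative.

```python
escaped = "\\u2022"                # what arrives from outside: six ASCII characters
as_text = "\u2022"                 # the unicode string itself: one code point
as_utf8 = as_text.encode("utf-8")  # bytes, only needed when writing to a file or socket

print(len(escaped), len(as_text), len(as_utf8))
print(as_utf8 == b"\xe2\x80\xa2")
# Encoding the escaped form just gives back the same six ASCII bytes.
print(escaped.encode("utf-8") == b"\\u2022")
```

So encoding to UTF-8 is a separate, later step; it does nothing to turn the escape sequence into the character.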

You may also want to try this:
test.decode('raw_unicode_escape')

>>> test = "\u2022"
>>> test.decode('raw_unicode_escape')
u'\u2022'
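For anyone on Python 3, where str has no .decode method, roughly the same conversion can be sketched as below. The helper name unescape_unicode is made up for illustration, and the sketch assumes the incoming text is plain ASCII, as in the question.

```python
def unescape_unicode(text):
    """Turn literal \\uXXXX sequences in ASCII text into the characters they name."""
    # Python 3 str has no .decode(), so go through bytes first.
    # Assumption: the input is pure ASCII; other input raises UnicodeEncodeError.
    return text.encode("ascii").decode("raw_unicode_escape")


test = "\\u2022"  # six characters: backslash, 'u', '2', '0', '2', '2'
bullet = unescape_unicode(test)
print(len(test), len(bullet))
print(bullet == "\u2022")
```

The raw_unicode_escape codec only interprets \uXXXX and \UXXXXXXXX sequences. If the text may also contain escapes such as \n or \t that should be interpreted, the unicode_escape codec handles those as well.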

meastman, thank you, that is basically what I was looking for in the first place. I could swear that I saw a list of the arguments that could be passed to encode and decode and didn't see this. Anyway, thank you very much! It's nice to have the "proper" solution. jice's is a bit more interesting though, haha. Thank you everyone!
