How do I test a byte string in Python? I want to manually convert (no libraries or functions) a UTF-8 string into UTF-16.

My basic approach is to read some number of UTF-8 bytes from the stream, convert them into code points, then convert those code points into UTF-16 bytes. I want to code this myself, but I don't understand how to test the actual byte sequence.

Let's say I use the following code to ensure I have a UTF-8 encoding (from Evan Jones' Scratch Pad: http://evanjones.ca/python-utf8.html):

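s = "hello normal string"
u = unicode( s, "utf-8" )          # decode the byte string into a unicode object (Python 2)
backToBytes = u.encode( "utf-8" )  # encode it back into a UTF-8 byte string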

Now, I need to test the lead byte of the sequence for each character in "backToBytes", right? Is there a function that does this? Any help would be appreciated.

Last Post by ChrisP_Buffalo

I guess I get to solve my own thread (thanks again to the Natural Language Toolkit's online tutorial). The function repr() appears to give me what I need:

line = u'\u0144'
line_utf = line.encode('utf8')

print 'line = ', line_utf
print 'line repr ', repr(line_utf)

line = ń
line repr '\xc5\x84'

It's the '\xc5\x84' part that I needed.
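
For anyone landing here with the same question, here is a rough sketch of how the lead-byte test and the rest of the manual conversion could look once the raw bytes are visible. The function names (sequence_length, utf8_to_code_points, code_points_to_utf16_be) are made up for illustration, big-endian output without a BOM is an assumption, and the validation is minimal (it does not reject every overlong form or encoded surrogates), so treat it as a starting point rather than a complete converter.

```python
def sequence_length(lead):
    """Return how many bytes a UTF-8 sequence occupies, judged by its lead byte."""
    if lead < 0x80:              # 0xxxxxxx: plain ASCII, one byte
        return 1
    if 0xC2 <= lead <= 0xDF:     # 110xxxxx: two bytes (0xC0 and 0xC1 would be overlong)
        return 2
    if 0xE0 <= lead <= 0xEF:     # 1110xxxx: three bytes
        return 3
    if 0xF0 <= lead <= 0xF4:     # 11110xxx: four bytes, up to U+10FFFF
        return 4
    raise ValueError("invalid UTF-8 lead byte: 0x%02x" % lead)


def utf8_to_code_points(data):
    """Decode UTF-8 bytes into a list of integer code points."""
    data = bytearray(data)       # iterating a bytearray yields integers
    code_points = []
    i = 0
    while i < len(data):
        lead = data[i]
        n = sequence_length(lead)
        if i + n > len(data):
            raise ValueError("truncated UTF-8 sequence at offset %d" % i)
        # Keep only the payload bits of the lead byte.
        cp = lead if n == 1 else lead & (0xFF >> (n + 1))
        for b in data[i + 1:i + n]:
            if b & 0xC0 != 0x80:  # continuation bytes look like 10xxxxxx
                raise ValueError("invalid continuation byte: 0x%02x" % b)
            cp = (cp << 6) | (b & 0x3F)
        code_points.append(cp)
        i += n
    return code_points


def code_points_to_utf16_be(code_points):
    """Encode code points as big-endian UTF-16 bytes (no BOM)."""
    out = bytearray()
    for cp in code_points:
        if cp < 0x10000:
            units = [cp]         # fits in a single 16-bit code unit
        else:
            cp -= 0x10000        # split into a high and a low surrogate
            units = [0xD800 | (cp >> 10), 0xDC00 | (cp & 0x3FF)]
        for unit in units:
            out.append(unit >> 8)
            out.append(unit & 0xFF)
    return bytes(out)


utf8_bytes = b'\xc5\x84'                         # the repr() output from above
code_points = utf8_to_code_points(utf8_bytes)    # [0x144]
utf16_bytes = code_points_to_utf16_be(code_points)
print(repr(utf16_bytes))
```

The lead byte 0xc5 falls in the two-byte range, so it is combined with the single continuation byte 0x84 to give the code point 0x144, the same u'\u0144' I started with.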
