unicode data

Question

funfullson 0 Junior Poster in Training

14 Years Ago

Hi all.
I have Unicode data in a file; it is Persian text.
I want to read it and, after making some changes, write it to another file.
But when I read it, I get some hexadecimal characters.
What should I do with them?

Thanks.

python

All 6 Replies

d5e5 109 Master Poster

14 Years Ago

What you have in your file are bytes of data. When you read a record from the file you get a string of bytes. If these bytes are supposed to represent text characters, they must have been encoded in some format, such as utf8, before being written to the file.
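
To make the bytes-versus-characters point concrete, here is a minimal sketch. It assumes the text is UTF-8, which the thread has not confirmed, and the variable names are only illustrative. The "hexadecimal characters" in the question are most likely what the raw bytes look like before they are decoded.

```python
# -*- coding: utf-8 -*-
# The Persian word for "Iran", written with escapes so the example does not
# depend on how this source file itself is encoded.
text = u"\u0627\u06cc\u0631\u0627\u0646"

raw = text.encode("utf-8")       # characters -> bytes (assumed encoding: UTF-8)
print(repr(raw))                 # hex escapes such as \xd8\xa7 are raw bytes, not damage

restored = raw.decode("utf-8")   # bytes -> characters again
print(len(text), len(raw))       # number of characters vs. number of bytes
```

Decoding with the same encoding that was used for writing gives back the original characters; decoding with a different one gives wrong characters or an error.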

You haven't said whether you use Python 2 or 3, or what encoding the file has. Unicode is not an encoding -- I think there are several encodings that can represent Persian characters. There is a brief slide show at http://farmdev.com/talks/unicode/ which may help you.

griswolf 304 Veteran Poster

14 Years Ago

There is a brief but clear article here: http://effbot.org/zone/unicode-objects.htm. I also liked this one http://diveintopython3.org/strings.html#one-ring-to-rule-them-all; ... from which I have stolen this one important insight:

Bytes are not characters; bytes are bytes. Characters are an abstraction.

For clarity of thinking, you also need to be very aware that "Unicode" is not an encoding. Unicode is a way of ordering characters: every character has a code point (which is an index into a list of abstractions :)). The encoding is a way to translate the code point to and from something that is stored in a file or in memory. Unicode has three common encodings: UTF-8, which uses one to four bytes per character, is very efficient for ASCII (one byte each) and still compact for most European and Middle Eastern scripts (two bytes each); UTF-16, which encodes the roughly 64K characters of the Basic Multilingual Plane in two bytes each and all others as a four-byte surrogate pair; and UTF-32, which encodes every Unicode character in four bytes. There are big-endian and little-endian variations for the two- and four-byte encodings, but UTF-8 is endian-neutral.
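
The size differences can be checked directly. This is a small illustration rather than anything from the linked articles; the sample characters and the names `samples`, `sizes`, `little`, and `big` are chosen here only for the example.

```python
# One character from each of three ranges.
samples = {
    "latin A": u"A",                # ASCII
    "alef": u"\u0627",              # ARABIC LETTER ALEF, used in Persian
    "non-BMP": u"\U0001F600",       # a code point above U+FFFF
}
encodings = ("utf-8", "utf-16-le", "utf-32-le")

# Byte length of each character in each encoding.
sizes = {}
for label, ch in samples.items():
    sizes[label] = [len(ch.encode(name)) for name in encodings]
    print(label, sizes[label])

# Byte order matters for UTF-16 (and UTF-32), but not for UTF-8.
little = u"\u0627".encode("utf-16-le")
big = u"\u0627".encode("utf-16-be")
print(repr(little), repr(big))
```

The explicit `-le` and `-be` codec names are used so that no byte order mark is added to the output.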

As d5e5 says, your file contains bytes of data. Unless you know (or guess correctly) the encoding you cannot translate those bytes to characters. Because of historical importance to the programming community, we often guess that a text file contains ASCII encoded characters, but it is a guess, so when we display text from such files based on that guess, sometimes it doesn't work as hoped. In your case you know it is one of the possible Unicode encodings, so you can guess just a few times to find out... or you can be more thoughtful and find out some other way. The basic pattern for reading non-ASCII encoded files is like this:

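fileencoding = "utf-8"  # or some other encoding
with open("persian.txt", "rb") as f:  # hypothetical file name; "rb" yields raw bytes
    raw_bytes = f.readline()  # splitting on "\n" is safe for UTF-8, but not for UTF-16/32
    decoded_text = raw_bytes.decode(fileencoding)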

On this page is a list of the encodings that Python knows about (and a lot of other stuff): http://docs.python.org/library/codecs.html
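
For the original goal (read, change, write to another file), one possible sketch uses `io.open`, which is available from Python 2.6 onward and decodes and encodes for you. The function name `convert_file`, its parameters, and the idea that the "change" is a simple word replacement are all assumptions made for illustration; the encoding still has to be the one the file was really saved in.

```python
import io


def convert_file(src_path, dst_path, old, new, encoding="utf-8"):
    """Read src_path as text, replace `old` with `new`, write to dst_path.

    `encoding` is an assumption: pass the encoding the file was saved in.
    """
    # Reading in text mode with an explicit encoding returns characters.
    with io.open(src_path, "r", encoding=encoding) as src:
        text = src.read()

    # Any text manipulation goes here; a replacement is only a placeholder.
    changed = text.replace(old, new)

    # Writing in text mode encodes the characters back into bytes.
    with io.open(dst_path, "w", encoding=encoding) as dst:
        dst.write(changed)
    return changed
```

A call might look like `convert_file("in.txt", "out.txt", u"\u0627\u06cc\u0631\u0627\u0646", u"Iran")`, where both file names are placeholders. Using the same encoding for input and output keeps the result readable by whatever editor produced the input.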

funfullson 0 Junior Poster in Training · Answer 1 · 2010-09-26T17:45:33+00:00

I am using Python 2.6 on Windows. I am testing it.
I made a text file and wrote in it "ایران افتخار من است".
Then I tried to read it, but I ran into a problem: I saw some semi-octet characters after reading the file.
Now I don't know how to proceed.
I think it is UTF-8, but I am not sure.
First:
How can I tell what encoding my text is in, whether it is UTF-8 or UTF-16 or...?
And then:
How do I read it correctly, or convert these bytes to the proper form?
Thanks, friends.

griswolf 304 Veteran Poster · Answer 2 · 2010-09-27T00:59:42+00:00

The key is in this line of your explanation: I made a text file and wrote in it "ایران افتخار من است".
When you "made" the text file, you (by accepting the editor's default, probably) were choosing an encoding. You need to find out what the editor did, then you can use the decode function as mentioned in previous posts.

There is no absolutely reliable way to look at the bytes of the file and know the encoding. There are heuristics, but you are much better off if you can know in advance.
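
As an illustration of what such a heuristic can look like, the sketch below first checks for a byte order mark (BOM), which some editors write at the start of a file, and then tries a short list of candidate encodings. The function name `guess_encoding` and the candidate list are assumptions; `cp1256` (the Windows Arabic code page) is included only as a plausible last resort, and because it accepts almost any byte sequence, a match there proves very little.

```python
import codecs

# UTF-32 BOMs are checked first because the UTF-32-LE BOM begins with the
# same two bytes as the UTF-16-LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(raw, candidates=("utf-8", "cp1256")):
    """Return a codec name that can decode `raw`, or None. A guess only."""
    # 1. A BOM, when present, is a strong hint.
    for bom, name in BOMS:
        if raw.startswith(bom):
            return name
    # 2. Otherwise accept the first candidate that decodes without error.
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None
```

To use it, read the file in binary mode, pass the bytes to `guess_encoding`, and then decode with the returned name. A successful decode only shows that the bytes are valid in that encoding, not that the text is right, so looking at the decoded result is still necessary.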

funfullson 0 Junior Poster in Training · Answer 3 · 2010-09-29T13:08:41+00:00

Thanks, all.
I did it in Portable Python, but it did not work in the terminal.
How can I do it in the terminal or in Python IDLE, for example?

Thanks.

griswolf 304 Veteran Poster · Answer 4 · 2010-09-30T00:51:41+00:00

Please, what does 'do it' mean? (Create a text file? Read a text file? Display the characters after reading? More than one of these? Something else?)
