I have a simple question about Unicode and UTF-8.

How does a UTF-8 encoded character know what its uppercase form is? I understand how a UTF-8 sequence carries its Unicode code point embedded in itself, but I fail to see how a UTF-8 sequence gets mapped to an uppercase code point. What is the mechanism that maps UTF-8 sequences to their uppercase equivalents, or to the other properties available in the Unicode universe?

Last Post by deceptikon

I'm not sure I understand the question. Upper and lower case characters each have their own unique code point; they're independent of each other in the same way digits and letters are.



That's the point. How does an uppercase function work when it's used with UTF-8? An uppercase function would be pretty simple arithmetic with ASCII values, but I fail to see how that function would work with UTF-8.

Edited by gerard4143


Let's first differentiate Unicode and UTF-8. UTF-8 is an encoding technique for Unicode code points. So you can break the problem down by removing the encoding aspect and only thinking about code points.
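To make the distinction concrete, here is a small Python sketch (Python is my choice for illustration; the thread doesn't name a language). It shows one character, its code point, and the UTF-8 bytes that encode that code point.

```python
# One character, written with an escape so the source file stays pure ASCII.
ch = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

# The code point is just a number assigned by Unicode.
code_point = ord(ch)

# UTF-8 is one way of serializing that number into bytes.
utf8_bytes = ch.encode("utf-8")

print(hex(code_point))   # the abstract value
print(utf8_bytes.hex())  # the two bytes UTF-8 uses to store it

# Decoding the bytes gives back the same code point.
assert ord(utf8_bytes.decode("utf-8")) == code_point
```

Case conversion is defined on the number, not on the bytes, which is why the encoding can be set aside while thinking about it.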

For simplicity's sake, consider the code points that fit in a single UTF-8 byte. These correspond to ASCII, and the case conversion is identical in concept to ASCII. The only real difference in the general case is the extra logic needed to decode and re-encode the value, because UTF-8 is a variable-width encoding.
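As a rough sketch of the single-byte case, the function below (the name `ascii_upper` is made up for this example) uppercases only the ASCII letters in a UTF-8 byte string using plain arithmetic. It relies on the fact that every byte of a multi-byte UTF-8 sequence is 0x80 or higher, so those bytes can never be mistaken for 'a' through 'z'.

```python
def ascii_upper(data: bytes) -> bytes:
    """Uppercase only the ASCII letters in a UTF-8 byte string.

    Hypothetical helper for illustration; non-ASCII characters pass
    through untouched.
    """
    out = bytearray()
    for b in data:
        if 0x61 <= b <= 0x7A:      # 'a' .. 'z'
            out.append(b - 0x20)   # 'A' is 0x20 below 'a' in ASCII
        else:
            out.append(b)          # everything else is copied as-is
    return bytes(out)


print(ascii_upper(b"hello, World"))
```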

Extending that a bit to the non-ASCII code points, a simple arithmetic transform may not work. At that point the upper case function would use a lookup table that maps each lower case code point to its upper case counterpart.
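A minimal sketch of the lookup approach might look like the following. The table `UPPER_TABLE` and the function `upper_utf8` are hypothetical names, and the table holds only three entries for illustration; a real implementation would use a table generated from the Unicode character data. Note that U+00FF maps to U+0178, which no fixed offset would produce.

```python
# Tiny, hand-picked excerpt of a lower -> upper code point table.
# Assumption: a real library generates the full table from Unicode data.
UPPER_TABLE = {
    0x00E9: 0x00C9,  # e with acute  -> E with acute
    0x00FF: 0x0178,  # y with diaeresis -> Y with diaeresis (not a fixed offset)
    0x03B1: 0x0391,  # Greek alpha   -> Greek capital alpha
}


def upper_utf8(data: bytes) -> bytes:
    """Decode UTF-8, map each code point to upper case, re-encode."""
    out = []
    for ch in data.decode("utf-8"):       # step 1: bytes -> code points
        cp = ord(ch)
        if 0x61 <= cp <= 0x7A:            # ASCII letters: plain arithmetic
            cp -= 0x20
        else:                             # everything else: table lookup
            cp = UPPER_TABLE.get(cp, cp)  # unknown code points are unchanged
        out.append(chr(cp))
    return "".join(out).encode("utf-8")   # step 3: code points -> bytes


sample = "h\u00e9llo \u00ff \u03b1".encode("utf-8")
print(upper_utf8(sample).decode("utf-8"))
```

In practice you would not write this yourself. Python's built-in `str.upper()`, for example, already applies the Unicode case mappings after the text has been decoded.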
