I have a simple question about Unicode and UTF-8.

How does a UTF-8 encoding know what its uppercase encoding is? I understand how a UTF-8 encoding carries its Unicode value embedded in itself, but I fail to see how it maps a UTF-8 encoding to an uppercase Unicode value. What is the mechanism that maps UTF-8 encodings to uppercase encodings, or to the other features available in the Unicode universe?

I'm not sure I understand the question. Uppercase and lowercase characters each have their own unique encoding; they're independent of each other in the same way digits and letters are.


That's the point. How does an uppercase function work when it's used with UTF-8? An uppercase function would be pretty simple arithmetic with ASCII values, but I fail to see how that function would work with UTF-8.

Let's first differentiate Unicode and UTF-8. UTF-8 is an encoding technique for Unicode code points. So you can break the problem down by removing the encoding aspect and only thinking about code points.
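To make the distinction concrete, here is a minimal Python sketch that shows a character's code point next to its UTF-8 bytes. The helper name `describe` is just something I made up for illustration.

```python
def describe(ch: str) -> tuple[int, bytes]:
    """Return (code point, UTF-8 bytes) for a single character."""
    return ord(ch), ch.encode("utf-8")

for ch in ["a", "é", "€"]:
    code_point, utf8_bytes = describe(ch)
    # The code point is one number; the UTF-8 form is 1 to 4 bytes.
    print(f"U+{code_point:04X} is encoded as {len(utf8_bytes)} byte(s): {utf8_bytes.hex(' ')}")
```

The code point is what case mapping operates on; the bytes are only how that number is stored.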

For simplicity's sake, consider the code points that fit in a single UTF-8 byte. These correspond to ASCII, and case conversion for them is conceptually identical to ASCII case conversion. The only real difference in general is the more complex encoding logic for the value, because UTF-8 is a variable-width encoding.
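As a rough sketch of the single-byte case (the function name `ascii_upper_utf8` is hypothetical), the ASCII arithmetic can be applied directly to UTF-8 bytes. This is safe because every byte of a multi-byte UTF-8 sequence is 0x80 or higher, so it can never be mistaken for an ASCII letter.

```python
def ascii_upper_utf8(data: bytes) -> bytes:
    """Uppercase only the ASCII letters in UTF-8 data; other bytes pass through."""
    out = bytearray(data)
    for i, b in enumerate(out):
        if 0x61 <= b <= 0x7A:      # 'a'..'z'
            out[i] = b - 0x20      # 'A'..'Z' sit exactly 0x20 lower
    return bytes(out)

print(ascii_upper_utf8("héllo".encode("utf-8")).decode("utf-8"))
```

Note that the non-ASCII character is left alone here; handling it needs more than arithmetic.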

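Extending that a bit to the non-ASCII code points, a simple arithmetic transform may not work. At that point the uppercase function would decode the bytes to a code point, use a lookup table to map the lowercase code point to its uppercase counterpart, and then encode the result back to UTF-8.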
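Here is a minimal sketch of that decode, look up, re-encode idea in Python. The names `UPPER_TABLE` and `upper_utf8` are made up for illustration, and the table is a tiny hand-picked excerpt; I'm assuming a real implementation would take its mappings from the Unicode Character Database rather than a hand-written dict.

```python
# Hypothetical, tiny excerpt of a case-mapping table (code point -> code point).
UPPER_TABLE = {
    0x00E9: 0x00C9,  # é -> É
    0x00FF: 0x0178,  # ÿ -> Ÿ (not a fixed offset from the lowercase value)
    0x0131: 0x0049,  # ı -> I (two UTF-8 bytes become one)
}

def upper_utf8(data: bytes) -> bytes:
    """Uppercase UTF-8 data: decode, map each code point, re-encode."""
    text = data.decode("utf-8")             # bytes -> code points
    mapped = []
    for ch in text:
        cp = ord(ch)
        if 0x61 <= cp <= 0x7A:              # ASCII: plain arithmetic
            cp -= 0x20
        else:                               # everything else: table lookup
            cp = UPPER_TABLE.get(cp, cp)    # unknown code points pass through
        mapped.append(chr(cp))
    return "".join(mapped).encode("utf-8")  # code points -> bytes

print(upper_utf8("café ÿ".encode("utf-8")).decode("utf-8"))
```

The byte length can change after mapping, which is why the work is done on code points rather than on the encoded bytes. This sketch only covers one-to-one mappings; some characters uppercase to more than one character (for example ß to SS), so full case conversion needs a bit more than a single table.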