Yes, it is an encoding issue. I suspect that it comes from the way your
editor is saving the text file.
There are several ways to 'encode', or store, character data.
There is the old char-sized ASCII encoding, but ASCII itself defines only 7-bit characters (codes 0 through 127); any character codes above 127 are system dependent. Microsoft calls these 8-bit extensions "ANSI", and the exact selection of extended characters depends on your code page. Obviously, this is not very convenient for languages using anything but straight-up Roman characters.
Then came (eventually) Unicode, which aims to handle the graphemes of every language. (This doesn't mean it is complete; additions are still being made, but most industrialized nations can express their native language[s] with Unicode.)
There are several ways to store Unicode, three of which are of interest to us.
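UTF-8 uses our venerable char as its storage unit. Only those graphemes that need more than one byte use more than one byte: plain ASCII characters take one byte each, and everything else takes two to four.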
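UTF-16/UCS-2 uses a 16-bit word instead of a byte as its smallest element. UTF-16 is variable-width, like UTF-8: characters outside the Basic Multilingual Plane take two 16-bit words (a surrogate pair). UCS-2 is the older fixed-width form, which cannot represent those characters at all. It is considered deprecated, but it is still very much in use.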
UTF-32/UCS-4 simply stores every character in a 32-bit word. This is how GCC treats Unicode (wchar_t) values on Linux, where wchar_t is 32 bits wide. As such, wide-character code on modern Linux systems generally works in this encoding, even though text files and terminals there are usually UTF-8.
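To make the size difference between the three encodings concrete, here is a minimal sketch. The helper name byte_size is made up for illustration, and the figures in the comments assume a typical platform with 8-bit bytes, a 2-byte char16_t and a 4-byte char32_t. The strings use escapes so the result does not depend on how the source file itself is saved.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: storage cost in bytes of a string, computed as
// (number of code units) * (size of one code unit).
template <class String>
std::size_t byte_size( const String& s )
{
  return s.size() * sizeof( typename String::value_type );
}

// "Español" (7 characters) in each encoding, on a typical platform:
//   byte_size( std::string( "Espa\xC3\xB1ol" ) )     UTF-8:   8 bytes
//   byte_size( std::u16string( u"Espa\u00F1ol" ) )   UTF-16: 14 bytes
//   byte_size( std::u32string( U"Espa\u00F1ol" ) )   UTF-32: 28 bytes
//
// A character outside the Basic Multilingual Plane, U+1F600, needs
// four bytes in all three: four UTF-8 bytes, two UTF-16 words
// (a surrogate pair), or one UTF-32 word.
```

Note that only the UTF-32 count divides evenly into "one unit per character"; the other two vary with the text.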
So, now that you've had the lecture, on to the point: your text editor is using UTF-8, which you will recall is variable-width. I don't have Portuguese installed, but I do have Spanish, so I hope you'll forgive the language choice in the examples. The file I've encoded contains two lines:
Espanol
Español
"ANSI" (Microsoft's way) produces the following byte sequence (escapes are either C-style or hex, and the code page is Notepad's default):
E s p a n o l \r \n
E s p a \F1 o l \r \n
UTF-16 produces (Notepad's "Unicode" option; notice the byte-order mark at the beginning):
\FF \FE
E \0 s \0 p \0 a \0 n \0 o \0 l \0 \r \0 \n \0
E \0 s \0 p \0 a \0 \F1 \0 o \0 l \0 \r \0 \n \0
UTF-8 produces (I removed Notepad's weird BOM prefix):
E s p a n o l \r \n
E s p a \C3 \B1 o l \r \n
Notice how the second line is a different length than the first, due to the two-byte code for 'ñ'.
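If you want to check what your own editor wrote, you can dump the bytes yourself. The following is only a sketch; hex_dump is a made-up helper name, and it assumes you have already read the text into a std::string without any translation.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: renders each byte of s as two uppercase hex
// digits, separated by spaces, so the stored encoding can be inspected.
std::string hex_dump( const std::string& s )
{
  static const char digits[] = "0123456789ABCDEF";
  std::string out;
  for (std::size_t i = 0; i < s.size(); ++i)
  {
    // Go through unsigned char so bytes above 0x7F are not sign-extended.
    unsigned char byte = static_cast<unsigned char>( s[i] );
    if (i != 0) out += ' ';
    out += digits[byte >> 4];
    out += digits[byte & 0x0F];
  }
  return out;
}
```

For the UTF-8 bytes of "Español" this returns "45 73 70 61 C3 B1 6F 6C", while the same word in the "ANSI" encoding shown above gives "45 73 70 61 F1 6F 6C".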
You are using UTF-8. And you have found UTF-8's limitation: the standard C and C++ string length functions (strlen, std::string::size and friends) count bytes, not characters, so you can't use them to count the characters in a UTF-8 string. You must either roll your own or use a library of some kind. Here is one using the STL:
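#include <algorithm>
#include <cstddef>
#include <string>

// Counts the characters (code points) in a UTF-8 string by counting
// every byte that is NOT a continuation byte. Continuation bytes have
// the bit pattern 10xxxxxx. Assumes s holds well-formed UTF-8.
std::size_t UTF8_length( const std::string& s )
{
  return static_cast<std::size_t>( std::count_if(
    s.begin(),
    s.end(),
    // The cast avoids sign problems where plain char is signed.
    []( char c ) { return (static_cast<unsigned char>( c ) & 0xC0) != 0x80; }
  ) );
}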
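If you need more than the length (indexing by character, say), one option is to decode the UTF-8 into the fixed-width UTF-32 form described earlier. This is a rough sketch rather than a production decoder: utf8_to_utf32 is a hypothetical name, the input is assumed to be mostly well-formed, and anything it cannot decode is simply replaced with U+FFFD. It does not reject overlong sequences or surrogate values, so use a real library if you need strict validation.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decodes UTF-8 bytes into UTF-32 code points.
// Malformed input yields U+FFFD; no strict validation is attempted.
std::u32string utf8_to_utf32( const std::string& s )
{
  std::u32string out;
  std::size_t i = 0;
  while (i < s.size())
  {
    unsigned char lead = static_cast<unsigned char>( s[i] );
    char32_t cp;
    std::size_t extra;  // number of continuation bytes expected

    // The high bits of the lead byte give the sequence length.
    if      (lead < 0x80)           { cp = lead;        extra = 0; }
    else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
    else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
    else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
    else                            { cp = 0xFFFD;      extra = 0; }
    ++i;

    // Each continuation byte (10xxxxxx) contributes six more bits.
    for (; extra > 0; --extra)
    {
      if (i >= s.size() ||
          (static_cast<unsigned char>( s[i] ) & 0xC0) != 0x80)
      {
        cp = 0xFFFD;  // truncated or broken sequence
        break;
      }
      cp = (cp << 6) | (static_cast<unsigned char>( s[i] ) & 0x3F);
      ++i;
    }
    out.push_back( cp );
  }
  return out;
}
```

With this, utf8_to_utf32( "Espa\xC3\xB1ol" ).size() is 7, and element 4 is U+00F1, the 'ñ'. The cost is four bytes per character in memory.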
Hope this helps.