string.size and accented words

Thread Solved

#1 onemanclapping, Sep 1st, 2008
hi,

when I try string.size on accented words, the result is bigger than it is "supposed" to be, because each accented character counts as 2 size units instead of one.

how can I count them as one?

cheers

#2 Ancient Dragon, Sep 1st, 2008
Please post example code. Are you compiling for UNICODE?

#3 onemanclapping, Sep 1st, 2008
Originally Posted by Ancient Dragon:
>> Please post example code.
the code is irrelevant, I think, as my question applies to any code calling this function, but here it is:
    #include <iostream>
    #include <string>
    using namespace std;

    void geraMenu(string titulo, string versao)
    {
        // 19 stars plus one space on each side of the title = 40 extra columns
        int numEstrelas = titulo.size() + 40;
        string linha = string(numEstrelas, '*');
        string meiaLinha = string(19, '*');
        cout << linha << endl << linha << endl;
        cout << meiaLinha << " " << titulo << " " << meiaLinha << endl;
        cout << linha << endl << linha << endl;
        // 5 stars, two spaces and the version text, then stars up to the full width
        cout << string(5, '*') << " " << versao << " " << string(numEstrelas - 7 - versao.size(), '*') << endl;
    }

the output with "PROJECTO GESTÃO" as 'titulo':
    ********************************************************
    ********************************************************
    ******************* PROJECTO GESTÃO *******************
    ********************************************************
    ********************************************************
    ***** beta 1 *******************************************

the output with "PROJECTO GESTAO" as 'titulo':
    *******************************************************
    *******************************************************
    ******************* PROJECTO GESTAO *******************
    *******************************************************
    *******************************************************
    ***** beta 1 ******************************************

as you see, the '*' aren't aligned in the first case, as I use 'Ã' instead of 'A' in the word "GESTÃO".
in the first case titulo.size() returns 16 and in the second case it returns 15 (the correct number of letters).
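
A minimal sketch of the mismatch described here, assuming the title is stored as UTF-8. The two bytes of 'Ã' are spelled out as escapes so the result does not depend on how the editor saves the file; byte_count, accented_title and plain_title are made-up names for illustration.

```cpp
#include <cstddef>
#include <string>

// std::string::size() reports the number of char units (bytes),
// not the number of characters shown on screen.
std::size_t byte_count(const std::string& s)
{
    return s.size();
}

// "\xC3\x83" is the UTF-8 encoding of 'Ã' (U+00C3), written as explicit bytes.
const std::string accented_title = "PROJECTO GEST\xC3\x83O";
const std::string plain_title    = "PROJECTO GESTAO";
```

Under that assumption, byte_count(accented_title) is one larger than byte_count(plain_title), even though both titles occupy the same number of columns on screen.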

what I want to know is how I can count the right number of letters, regardless of whether they are accented or not.

Originally Posted by Ancient Dragon:
>> Are you compiling for UNICODE?
I'm sorry, but I'm a beginner C++ programmer to such a "noob" level I don't know what you're talking about
all I can do is tell you that I'm using Eclipse SDK under Ubuntu and show you this picture of the properties menu: http://img366.imageshack.us/img366/9537/help1yh9.png
Last edited by onemanclapping; Sep 1st, 2008 at 9:31 am.

#4 iamthwee, Sep 1st, 2008
Yes, I have no issue on Windows XP using Dev-C++, but on Ubuntu (under a VMware environment) I do.
    #include <iostream>
    #include <string>

    using namespace std;

    int main()
    {
        string a = "Ã";
        string b = "A";

        //cout << a << " " << b << endl;
        cout << a.length();
        cout << "\n";
        cout << b.length();

        cin.get();
    }

Output in ubuntu
    user@ubuntu804desktop:~$ g++ -Wall pedantic.cc
    user@ubuntu804desktop:~$ ./a.out
    2
    1

Output in windowsxp using dev-cpp
Last edited by iamthwee; Sep 1st, 2008 at 10:35 am.

#5 Ancient Dragon, Sep 1st, 2008
I don't have Ubuntu, but Microsoft VC++ 2008 Express reports 1 for the program that iamthwee posted. The compiler stored -61 in that byte. crappy *nix
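
For what it's worth, the -61 is consistent with a single-byte encoding: in Latin-1 and Windows-1252, 'Ã' is the byte 0xC3, and read as a signed 8-bit value that byte is -61. A small sketch of that arithmetic follows; as_signed is a hypothetical helper, not something from the posted program.

```cpp
// Interpret a byte value (0..255) as a signed 8-bit two's-complement number.
int as_signed(unsigned char byte)
{
    return byte < 0x80 ? static_cast<int>(byte)
                       : static_cast<int>(byte) - 256;
}
```

Here as_signed(0xC3) evaluates to -61. In UTF-8 the same letter takes two bytes (0xC3 0x83), which would explain the length of 2 reported on Ubuntu if the source file was saved as UTF-8.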

>>I'm sorry, but I'm a beginner C++ programmer to such a "noob" level I don't know what you're talking about

Unicode is a standard character set that covers the world's writing systems, not just English. In Windows programming, "compiling for UNICODE" means using the wide character type wchar_t instead of char. wchar_t is 16 bits wide under MS-Windows, while under GCC on *nix it is 32 bits wide. Wide characters make room for languages such as Chinese, whose symbols do not fit in a single char. In order to compile for UNICODE you have to set specific flags in the makefile -- I have no clue what those flags are for your compiler.
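
As a rough illustration of the wide-character approach, here is a sketch that reports the width of wchar_t on the current platform and measures a wide string. The function names are invented for this example, and the accented letter is written as the universal character name \u00C3 so that the source file's encoding does not matter.

```cpp
#include <climits>
#include <cstddef>
#include <string>

// Width of wchar_t in bits on the platform this is compiled for.
std::size_t wchar_bits()
{
    return sizeof(wchar_t) * CHAR_BIT;
}

// Length of a wide string, counted in wchar_t units.
// \u00C3 is 'Ã', which fits in a single wchar_t on both
// 16-bit and 32-bit wchar_t platforms.
std::size_t wide_title_length()
{
    std::wstring title = L"PROJECTO GEST\u00C3O";
    return title.size();
}
```

With a wide string the accented letter counts as one unit, so wide_title_length() matches the number of visible characters. Printing wide strings needs std::wcout and a suitable locale, which this sketch does not cover.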

[edit]Considering iamthwee's test I would not bother with the UNICODE described above. It appears to be a compiler issue.[/edit]
Last edited by Ancient Dragon; Sep 1st, 2008 at 10:51 am.

#6 Duoas, Sep 1st, 2008
Yes, it is an encoding issue. I suspect that it comes from the way your editor is saving the text file.

There are several ways to 'encode', or store, character data.

There is the old char-sized ASCII encoding, which covers only the 7-bit ASCII characters plus whatever system-dependent character codes sit above 127. Microsoft calls this "ANSI" and the exact selection of extended characters depends on your output code page. Obviously, this is not very convenient for languages using anything but straight-up Roman characters.

Then came (eventually) Unicode, which handles all language graphemes. (This doesn't mean it is complete --additions are still being made, but most industrialized nations can express their native language[s] with Unicode.)

There are several ways to store Unicode: three of which are of interest to us.

UTF-8 uses our venerable char. ASCII characters still take a single byte; only those graphemes that need more than one byte use more than one byte (up to four).
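
To make the variable width concrete, here is a sketch of how one Unicode code point maps to UTF-8 bytes. encode_utf8 is a hypothetical helper written for this illustration; it does no validation (surrogates and values above U+10FFFF are not rejected).

```cpp
#include <string>

// Encode one Unicode code point as a sequence of 1 to 4 UTF-8 bytes.
// Sketch only: the input is assumed to be a valid code point.
std::string encode_utf8(unsigned long cp)
{
    std::string out;
    if (cp < 0x80) {
        // 0xxxxxxx: plain ASCII, one byte
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, encode_utf8(0x41) ('A') is one byte long, while encode_utf8(0xF1) ('ñ') yields the two bytes 0xC3 0xB1.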

UTF-16 uses variable-width characters, like UTF-8, but the smallest element is a 16-bit word instead of a byte. Its fixed-width predecessor, UCS-2, is considered obsolete, but UTF-16 itself is still very much in use.
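
A sketch of where the variable width comes from in UTF-16: code points below U+10000 fit in one 16-bit unit, and anything above is split into a surrogate pair. encode_utf16 is a name made up for this example, and the code does no validation.

```cpp
#include <vector>

// Encode one code point as one or two 16-bit UTF-16 code units.
// Sketch only: the input is assumed to be a valid code point.
std::vector<unsigned short> encode_utf16(unsigned long cp)
{
    std::vector<unsigned short> units;
    if (cp < 0x10000) {
        // Fits in a single 16-bit unit.
        units.push_back(static_cast<unsigned short>(cp));
    } else {
        // Split the 20 remaining bits into a high and a low surrogate.
        cp -= 0x10000;
        units.push_back(static_cast<unsigned short>(0xD800 + (cp >> 10)));
        units.push_back(static_cast<unsigned short>(0xDC00 + (cp & 0x3FF)));
    }
    return units;
}
```

So 'ñ' (U+00F1) takes one unit, while a code point such as U+1F600 takes two.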

UTF-32/UCS-4 simply stores every character in a 32-bit word. This is how GCC on Linux treats wide (wchar_t) values. For files and terminals, though, modern Linux systems generally use UTF-8.
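
One way to see the fixed-width property is to decode UTF-8 into 32-bit code points: the resulting sequence then has exactly one element per character. The sketch below uses a hypothetical decode_utf8 helper and assumes its input is valid UTF-8.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Decode a UTF-8 string into 32-bit code points (i.e. UTF-32).
// Sketch only: malformed input is not detected.
std::vector<unsigned long> decode_utf8(const std::string& s)
{
    std::vector<unsigned long> out;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char lead = static_cast<unsigned char>(s[i]);
        unsigned long cp;
        std::size_t extra;  // number of continuation bytes that follow
        if (lead < 0x80)      { cp = lead;        extra = 0; }
        else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; }
        else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; }
        else                  { cp = lead & 0x07; extra = 3; }
        ++i;
        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 0; k < extra && i < s.size(); ++k, ++i) {
            cp = (cp << 6) | (static_cast<unsigned char>(s[i]) & 0x3F);
        }
        out.push_back(cp);
    }
    return out;
}
```

Decoding the UTF-8 bytes of "Español" this way gives seven code points, the fifth being 0xF1.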


So, now that you've had the lecture, on to the point: your text editor is using UTF-8, which you will recall is variable-width. I don't have Portuguese installed, but I do have Spanish, so I hope you'll forgive the language choice in the examples. The file I've encoded is
    Espanol
    Español
"ANSI" (Microsoft's way) produces the following byte sequence
(escapes are either C-style or HEX, and the code page is Notepad's default)
    E s p a n o l \r \n
    E s p a \F1 o l \r \n
UTF-16 produces
(Notepad's "Unicode" option; notice the byte-order mark at the beginning)
    \FF \FE
    E \0 s \0 p \0 a \0 n \0 o \0 l \0 \r \0 \n \0
    E \0 s \0 p \0 a \0 \F1 \0 o \0 l \0 \r \0 \n \0
UTF-8 produces
(I removed Notepad's weird BOM prefix)
    E s p a n o l \r \n
    E s p a \C3 \B1 o l \r \n
Notice how the second line is a different length than the first, due to the two-byte code for 'ñ'.
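
Byte listings like these can be reproduced with a few lines of code. This is only a sketch with an invented helper name (hex_dump), and it prints every byte in hex rather than mixing characters and escapes.

```cpp
#include <cstddef>
#include <string>

// Render each byte of s as two uppercase hex digits, separated by spaces.
std::string hex_dump(const std::string& s)
{
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        unsigned char byte = static_cast<unsigned char>(s[i]);
        if (i != 0) out += ' ';
        out += digits[byte >> 4];    // high nibble
        out += digits[byte & 0x0F];  // low nibble
    }
    return out;
}
```

Passing the UTF-8 bytes of "Español" shows the C3 B1 pair, whereas the single-byte "ANSI" form shows one F1 in the same position.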

You are using UTF-8. And you have found UTF-8's limitation: the standard C and C++ string length functions count bytes, not characters, so you can't use them to measure a UTF-8 string. You must either roll your own or use a library of some kind. Here is one using the STL:
    #include <algorithm>
    #include <functional>
    #include <string>

    std::size_t UTF8_length( const std::string& s )
    {
        return std::count_if(
            s.begin(),
            s.end(),
            std::bind2nd( std::less <char> (), 0x80 )
        );
    }
Hope this helps.
Last edited by Duoas; Sep 1st, 2008 at 12:58 pm.

#7 onemanclapping, Sep 1st, 2008
Originally Posted by Duoas:
>> Hope this helps.
Thank you very much for your help in explaining this problem to me! It's now very clear why it happens.

The only problem is that the code you gave me for counting characters does not work.

here's my code:
    #include <iostream>
    using std::cout;
    using std::cin;
    using std::endl;

    #include <string>
    using std::string;

    #include <fstream>
    using std::ifstream;

    #include <algorithm>

    #include <functional>

    std::size_t UTF8_length(const string& s )
    {
        return std::count_if(s.begin(),s.end(),std::bind2nd(std::less <char> (), 0x80));
    }

    void geraMenu(const string& titulo,const string& versao)
    {
        int numEstrelas = titulo.size() + 40;
        string linha = string(numEstrelas,'*');
        string meiaLinha = string(19,'*');
        cout << linha << endl << linha << endl;
        cout << meiaLinha << " " << titulo << " " << meiaLinha << endl;
        cout << linha << endl << linha << endl;
        cout << string(5,'*') << " " << versao << " " << string(numEstrelas - 7 - versao.size(),'*') << endl;

        // UTF8_length tests
        cout << titulo.size() << endl;
        cout << UTF8_length(titulo) << endl;
        cout << UTF8_length("coco") << endl;
        cout << UTF8_length("cocó") << endl;
    }

here's the output when I call geraMenu("PROJECTO GESTÃO", "beta 1"):
    ********************************************************
    ********************************************************
    ******************* PROJECTO GESTÃO *******************
    ********************************************************
    ********************************************************
    ***** beta 1 *******************************************
    16
    0
    0
    0

I can't see where the problem is as I've no idea what this [ std::bind2nd(std::less <char> (), 0x80) ] means.

thanks!

#8 Duoas, Sep 1st, 2008
Argh! I'm so sorry! (Recent med changes have made my brain work worse than usual...)

I forgot a couple of things:
  1. force proper type comparison
  2. non-ASCII characters
This will work. (I tested it to be sure!)
    #include <algorithm>
    #include <ciso646>
    #include <functional>
    #include <string>

    struct UTF8_ischar
    {
        bool operator () ( unsigned char c ) const
        {
            return (c < 0x80) or (c >= 0xC0);
        }
    };

    std::size_t UTF8_length( const std::string& s )
    {
        return std::count_if( s.begin(), s.end(), UTF8_ischar() );
    }
The above is an optimized version of
    std::size_t UTF8_length( const std::string& s )
    {
        return std::count_if(
                   s.begin(),
                   s.end(),
                   std::bind2nd( std::less <unsigned char> (), 0x80 )
               )
             + std::count_if(
                   s.begin(),
                   s.end(),
                   std::bind2nd( std::greater_equal <unsigned char> (), 0xC0 )
               );
    }
Don't worry too much about the weird stuff. You'll learn about it soon enough. It is just C++'s way of giving the user simple lambdas.
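
For readers puzzled by the bind2nd expression, here is a hand-written sketch of what std::bind2nd( std::less <unsigned char> (), 0x80 ) amounts to: a small function object that answers "is this byte less than 0x80?". The names IsBelow0x80 and count_ascii are invented for this illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Rough equivalent of std::bind2nd(std::less<unsigned char>(), 0x80):
// the second argument of the comparison is fixed at 0x80.
struct IsBelow0x80
{
    bool operator()(unsigned char c) const
    {
        return c < 0x80;
    }
};

// Count the plain ASCII bytes in s.
std::size_t count_ascii(const std::string& s)
{
    return static_cast<std::size_t>(
        std::count_if(s.begin(), s.end(), IsBelow0x80()));
}
```

The unsigned char matters. Where plain char is signed, as it is with GCC on x86 Linux, 0x80 does not fit in a char, so a comparison written with std::less <char> can never be true; that would account for the zeros reported for the first version. Note also that std::bind2nd was deprecated in C++11 and removed in C++17.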

Essentially it says "count every byte that has the msb == 0 or the two msbs == 11", which are the UTF-8 prefix codes that start an individual character sequence [1].
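
The same rule can be phrased the other way round: skip every continuation byte, i.e. every byte whose top two bits are 10. A sketch using a C++11 lambda follows, assuming the string holds valid UTF-8; utf8_length and banner_width are illustrative names, not part of the code posted above.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Count UTF-8 code points by counting every byte that is NOT a
// continuation byte (10xxxxxx). Assumes s holds valid UTF-8.
std::size_t utf8_length(const std::string& s)
{
    return static_cast<std::size_t>(
        std::count_if(s.begin(), s.end(), [](unsigned char c) {
            return (c & 0xC0) != 0x80;
        }));
}

// Hypothetical helper: width of a banner line that is 40 columns
// wider than the title, as in the geraMenu function from this thread.
std::size_t banner_width(const std::string& title)
{
    return utf8_length(title) + 40;
}
```

Using utf8_length(titulo) in place of titulo.size() when computing the star count should give the same banner width for "GESTÃO" and "GESTAO". This counts code points, so text that uses separate combining accents would still be over-counted.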

Sorry again!
Have fun now!
Last edited by Duoas; Sep 1st, 2008 at 6:01 pm. Reason: Fixed another stupid typo

#9 onemanclapping, Sep 1st, 2008
Originally Posted by Duoas:
>> Argh! I'm so sorry!
Oh! Don't be sorry at all, you've been very kind to explain all this to me...

The new code works perfectly! Thank you for your precious help!

Best regards,
André
This thread has been marked solved.
©2003 - 2009 DaniWeb® LLC