Escape and Unescape / Handling

Question

dev.cplusplus 0 Junior Poster in Training

18 Years Ago

Explanation:
Ever notice how the URLs of search engines and other sites are cluttered with % signs and numbers? This is known as URI encoding (percent-encoding): each character that cannot appear literally in a URL is replaced by a % sign followed by the two-digit hexadecimal value of each of its bytes.
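
To make the encoding concrete, here is a minimal sketch of percent-encoding in C. The function name percent_encode and the choice to leave only ASCII letters, digits, '-', '.', '_' and '~' unescaped are assumptions for illustration, not what any particular browser or library does.

```c
#include <stddef.h>

/* Hypothetical helper: percent-encode `src` into `dst` (capacity `cap`).
   Bytes other than ASCII letters, digits, '-', '.', '_' and '~' are
   written as %XX. Returns the length written, or -1 if dst is too small. */
int percent_encode(const char *src, char *dst, size_t cap)
{
    static const char hex[] = "0123456789ABCDEF";
    size_t out = 0;

    if (cap == 0)
        return -1;
    for (; *src != '\0'; ++src) {
        unsigned char c = (unsigned char)*src;
        int unreserved = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
                         (c >= '0' && c <= '9') ||
                         c == '-' || c == '.' || c == '_' || c == '~';
        size_t need = unreserved ? 1 : 3;

        if (out + need + 1 > cap)      /* keep room for the terminator */
            return -1;
        if (unreserved) {
            dst[out++] = (char)c;
        } else {
            dst[out++] = '%';
            dst[out++] = hex[c >> 4];    /* high nibble */
            dst[out++] = hex[c & 0x0F];  /* low nibble */
        }
    }
    dst[out] = '\0';
    return (int)out;
}
```

With this sketch, "Here is" becomes "Here%20is", and the three bytes E6 97 A5 become "%E6%97%A5".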

Hi all, I have the following problem. I'm writing an application that receives an escaped string (like "Here%20is") and returns it unescaped, like "Here is". This can be done with the function UrlUnescapeInPlace, but I'm working with Unicode characters (Chinese).
Does somebody know how to do this for Unicode characters?
Thanks

c


Dave Sinkula 2,398 long time no c

18 Years Ago

What have you tried so far?

There are RFCs and standards out there that may help describe the encodings, but I have not worked much with them.

Rashakil Fol 978 Super Senior Demiposter

18 Years Ago

I'd guess a good starting place is http://en.wikipedia.org/wiki/IDNA

WolfPack 491 Posting Virtuoso

18 Years Ago

Thank you for your help. I read it and it was very useful, but I still need a function that can convert an escaped URL to an unescaped one with Unicode characters, if this is possible at all. Or maybe I'm trying to do something that is not possible.
Thanks

UrlUnescapeInPlace takes an LPTSTR as its URL pointer. LPTSTR is a pointer to a TCHAR. A TCHAR is defined as wchar_t when you have defined the UNICODE constant in your project. Since wchar_t is a wide character type, you only have to define UNICODE as a preprocessor symbol to make this function work with wide strings. So either you write #define UNICODE in a header file common to all your source files, or you add UNICODE to the preprocessor definitions in your project settings. The second method is the easiest.

More on unicode in windows.
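
As a rough illustration of what the UNICODE symbol changes, here is a portable imitation of the mapping. The names my_tchar, MY_TEXT and my_tcslen are hypothetical stand-ins for TCHAR, TEXT() and _tcslen; the real definitions live in the Windows headers.

```c
#include <stddef.h>
#include <wchar.h>

/* Imitation of the Windows TCHAR mapping (hypothetical names). */
#ifdef UNICODE
typedef wchar_t my_tchar;        /* wide build: one wchar_t per character */
#define MY_TEXT(s) L##s
#else
typedef char my_tchar;           /* narrow build: one char per character */
#define MY_TEXT(s) s
#endif

/* Count characters up to the terminator, whichever width is in use. */
size_t my_tcslen(const my_tchar *s)
{
    size_t n = 0;
    while (s[n] != 0)
        ++n;
    return n;
}
```

Either way the escaped string "%E6%97%A5" is nine characters long; defining UNICODE changes the width of each character, not how the escapes are interpreted.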

WolfPack 491 Posting Virtuoso

18 Years Ago

I'm receiving the string like:
%E6%97%A5
this is equal to one Chinese character:
星

The Unicode code point for '星' is 0x661F;
you can confirm that by running

#include <tchar.h>

int _tmain(int argc, _TCHAR* argv[])
{
    wchar_t blah = 0x661F;  /* U+661F; inspect this at the return below */
    return 0;
}

under a debugger and inspecting what is stored in blah by setting a breakpoint at the return statement. It should be '星'.

I receive the string "%E6%97%A5" and I should return the string like this '星', that is, convert the escaped string to an unescaped string.

From what I understand, all that UrlUnescape does is turn each %XX escape into the character with that value. For example, %20 is the space character (Unicode 0x0020). So what is expected for "%E6%97%A5" is 3 characters with the Unicode values
0x00E6, 0x0097, and 0x00A5. Now if you look up the characters for those values in a Unicode table, you get
0x00E6 = 'æ'
0x0097 = <control char>
0x00A5 = '¥' (yen sign)

So I think what you get IS the correct unicode string.
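
The byte-for-byte behaviour described above can be sketched in portable C. percent_decode is a hypothetical helper, not the shlwapi implementation; it only shows that each %XX escape yields one byte.

```c
#include <stddef.h>

/* Value of one hex digit, or -1 if `c` is not a hex digit. */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Hypothetical helper: decode %XX escapes in `src` into raw bytes in `dst`.
   Malformed escapes are copied through unchanged. Returns the number of
   bytes written (dst is also NUL-terminated), or -1 if dst is too small. */
int percent_decode(const char *src, unsigned char *dst, size_t cap)
{
    size_t out = 0;

    if (cap == 0)
        return -1;
    while (*src != '\0') {
        int hi = -1, lo = -1;

        if (out + 1 >= cap)            /* keep room for the terminator */
            return -1;
        if (src[0] == '%' && (hi = hex_value(src[1])) >= 0 &&
                             (lo = hex_value(src[2])) >= 0) {
            dst[out++] = (unsigned char)(hi * 16 + lo);
            src += 3;
        } else {
            dst[out++] = (unsigned char)*src++;
        }
    }
    dst[out] = '\0';
    return (int)out;
}
```

For "%E6%97%A5" this gives the three bytes E6 97 A5. Read as three separate characters they are 'æ', a control character and '¥'; read as a UTF-8 sequence they are a single character, so a second decoding step is needed when the sender used UTF-8.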


dev.cplusplus 0 Junior Poster in Training · Answer 1 · 2006-06-19T02:44:33+00:00

I already tried writing my own function, but I think I'm missing how to handle the characters when they are in Unicode.
I looked on the web (Google) and didn't find anything. Also, when searching for "unicode unescape", I found websites in Chinese, and I don't know Chinese, so that didn't help.
Does anyone have any ideas?
Thanks

dev.cplusplus 0 Junior Poster in Training · Answer 2 · 2006-06-19T16:10:25+00:00

Thank you for your help. I read it and it was very useful, but I still need a function that can convert an escaped URL to an unescaped one with Unicode characters, if this is possible at all. Or maybe I'm trying to do something that is not possible.
Thanks

dev.cplusplus 0 Junior Poster in Training · Answer 3 · 2006-06-19T17:52:57+00:00

Thank you, I already did that.
I already compiled my project in Unicode and it is working fine. My problem is that I'm receiving the string like:
%E6%97%A5
this is equal to one Chinese character:
星
I receive the string "%E6%97%A5" and I should return the string like this '星', that is, convert the escaped string to an unescaped string.
I use the function UrlUnescapeInPlace, but it works only for non-Unicode characters. When I use the following code, it returns the wrong character: the function returns 'æ', which is not the character I'm expecting.

TCHAR str[] = TEXT("%E6%97%A5");  /* writable buffer; needs <windows.h> and <shlwapi.h> */
UrlUnescapeInPlace(str, 0);

Ideas?
Thank you

dev.cplusplus 0 Junior Poster in Training · Answer 4 · 2006-06-21T00:52:22+00:00

Thank you for your answer, you are right, but let me explain myself:
I have an HTML page that sends a form with the POST method. The character is sent to my ISAPI extension (DLL), and I need to take the character and insert it into the DB. The problem is that I have to insert one Chinese character, not the escaped string. When I send one Chinese character, I receive '%E6%97%A5', three characters. I read about Unicode, and I found that when working with Unicode every character is a wchar (2 bytes).
Why do I receive three characters?
I need to write a function that takes the three characters the extension receives (e.g. '%E6%97%A5'), converts them to one Chinese character and saves it into the DB. My DB supports Chinese; my problem is converting the 3 characters into one Chinese character.
Is this possible?
Help :cry:
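
One likely answer to "why three characters", assuming the browser submitted the form as UTF-8: UTF-8 stores code points from U+0800 to U+FFFF in three bytes, and each byte is escaped separately. A minimal sketch of the encoding direction follows; utf8_encode is a hypothetical helper name.

```c
#include <stddef.h>

/* Hypothetical helper: encode one Unicode code point as UTF-8.
   Writes 1 to 4 bytes into `out` and returns the count, or 0 for
   surrogates (U+D800..U+DFFF) and values above U+10FFFF. */
size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    if (cp < 0x80) {                              /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                             /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                           /* 1110xxxx + 2 more */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                         /* 11110xxx + 3 more */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

U+65E5 (日) encodes to E6 97 A5, which matches the escaped string in the post; U+661F (星) would encode to E6 98 9F. On Windows, the reverse conversion is usually done with MultiByteToWideChar and CP_UTF8 after unescaping the bytes.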

Dave Sinkula 2,398 long time no c Team Colleague · Answer 5 · 2006-06-21T01:00:54+00:00

Can you specify an encoding?

[edit]For example, if it were using UTF-8, I might begin cobbling something together along these lines.

#include <stdio.h>
#include <wchar.h>

void foo(const char *text)
{
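   wchar_t wch = 0;      /* decoded character, built up 6 bits at a time */
   unsigned int byte;    /* value of the current %XX escape */
   int n;                /* characters consumed by the last sscanf */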
   if ( sscanf(text, "%%%2x%n", &byte, &n) == 1 )
   {
      if ( (byte & 0x80) == 0 )
      {
         wch += byte & 0x7F;
      }
      else if ( (byte & 0xE0) == 0xC0 )
      {
         wch += byte & 0x1F;
         wch <<= 6;
         text += n;
         if ( sscanf(text, "%%%2x%n", &byte, &n) == 1 )
         {
            if ( (byte & 0xC0) == 0x80 )
            {
               wch += byte & 0x3F;
            }
         }
      }
      else if ( (byte & 0xF0) == 0xE0 )
      {
         wch += byte & 0xF;
         wch <<= 6;
         text += n;
         if ( sscanf(text, "%%%2x%n", &byte, &n) == 1 )
         {
            if ( (byte & 0xC0) == 0x80 )
            {
               wch += byte & 0x3F;
               wch <<= 6;
               text += n;
               if ( sscanf(text, "%%%2x%n", &byte, &n) == 1 )
               {
                  if ( (byte & 0xC0) == 0x80 )
                  {
                     wch += byte & 0x3F;
                  }
               }
            }
         }
      }
      else if ( (byte & 0xF8) == 0xF0 )
      {
         /* exercise */
      }
   }
   printf("wch = %X = %lc\n", (unsigned int)wch, (wint_t)wch);
}

int main(void)
{
   foo("%E6%97%A5");
   return 0;
}

/* my output
wch = 65E5 = 半*/

Now, I can almost guarantee that this is buggy as well as being incomplete.

dev.cplusplus 0 Junior Poster in Training · Answer 6 · 2006-06-21T02:12:56+00:00

Thanks a lot. I don't know if it's too much to ask, but I don't understand the code, and I'd like to understand what it does.
Again, thank you

Dave Sinkula 2,398 long time no c Team Colleague · Answer 7 · 2006-06-21T02:26:41+00:00

It's trying to pick off the particular value bits of each byte, as shown in the UTF-8 description, and rebuild the wide character. In short, decoding.

It's not something I'm happy with as it's still a work in progress. It was just meant to show you what might be involved with decoding your incoming text.

Now, can you specify the encoding? UTF-8 seems to be 1 of 9 possibilities, and perhaps there are more. Since each may be different, knowing the encoding is probably the place to start -- otherwise you may have an order of magnitude more code to write (and this may be one of the simpler ones).
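
Assuming the encoding does turn out to be UTF-8, the bit-picking can be written as one loop that also covers the four-byte case left as an exercise. This is a sketch, not a replacement for the code above: utf8_decode is a hypothetical name, and it works on bytes that have already been unescaped rather than on the %XX text.

```c
#include <stddef.h>

/* Hypothetical helper: decode one UTF-8 sequence from `s` (length `len`).
   Stores the code point in *cp and returns the number of bytes consumed,
   or 0 if the input is truncated or malformed. */
size_t utf8_decode(const unsigned char *s, size_t len, unsigned long *cp)
{
    /* Smallest code point allowed for each sequence length. */
    static const unsigned long min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t n, i;
    unsigned long v;

    if (len == 0)
        return 0;
    if (s[0] < 0x80)                { n = 1; v = s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { n = 2; v = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { n = 3; v = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { n = 4; v = s[0] & 0x07; }
    else
        return 0;                   /* stray continuation or invalid lead */

    if (len < n)
        return 0;                   /* truncated sequence */
    for (i = 1; i < n; ++i) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;               /* not a 10xxxxxx continuation byte */
        v = (v << 6) | (s[i] & 0x3F);
    }
    /* Reject overlong forms, values past U+10FFFF and surrogates. */
    if (v < min_cp[n] || v > 0x10FFFF || (v >= 0xD800 && v <= 0xDFFF))
        return 0;
    *cp = v;
    return n;
}
```

Code points above U+FFFF do not fit in a 16-bit wchar_t, as on Windows, and would need a UTF-16 surrogate pair there.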

dev.cplusplus 0 Junior Poster in Training · Answer 8 · 2006-06-30T14:09:01+00:00

Hi again. After a long time I returned to working on my Unicode encoding project. I was very excited because I tried the answer you gave in Visual Studio 2005 and it works great. But then I built another (new) project with Visual Studio 6 (old version) and took the function over, and it seems that in Visual Studio 6 it doesn't work. Is this possible?
I set up my project with Unicode support for Visual Studio 6 as described at: http://www.differentpla.net/node/135
but when I debug the program or write the character to a file that supports Unicode, I see that it is not the same character.

Again, help
* You can see the attached picture to understand what I'm talking about
:eek:

dev.cplusplus 0 Junior Poster in Training · Answer 9 · 2006-06-30T20:33:26+00:00

I noticed that the variable receives the right number according to the table of Unicode characters, so the problem is probably in storing the character. Is it possible to convert the number to the letter?
Thanks
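
If the variable holds the right number, then as far as the program is concerned it already holds the letter: on Windows a wchar_t is one UTF-16 code unit, so 0x65E5 is the character itself. What the debugger or a text editor shows depends on fonts and on how the bytes were written to the file. As a sketch of the storage side, assuming the file is meant to be UTF-16, the hypothetical helper below turns a code point into UTF-16 code units.

```c
#include <stddef.h>

/* Hypothetical helper: encode one code point as UTF-16 code units.
   Returns 1 or 2 (units written to `out`), or 0 if `cp` is a
   surrogate value or lies above U+10FFFF. */
size_t utf16_encode(unsigned long cp, unsigned short out[2])
{
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;
    if (cp < 0x10000) {                 /* fits in a single code unit */
        out[0] = (unsigned short)cp;
        return 1;
    }
    if (cp > 0x10FFFF)
        return 0;
    cp -= 0x10000;                      /* 20 bits left to split */
    out[0] = (unsigned short)(0xD800 | (cp >> 10));    /* high surrogate */
    out[1] = (unsigned short)(0xDC00 | (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

A UTF-16LE text file normally starts with the byte order mark FF FE, followed by each code unit low byte first, so U+65E5 is stored as E5 65.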