Hello,

I have a problem with htmlentities(). I'm missing something here, but I don't know what it is.
In my webpage, which is set to utf-8, the result of

echo htmlentities('éè')

is &Atilde;&copy;&Atilde;&uml;.
Why isn't the result &eacute;&egrave;?
The thing is: I want to put the name 'Hélène' in my database. After stripslashes() and validating the string, I prepare it as follows:

$name = htmlentities(mysqli_real_escape_string($dbc, trim($name)));

In my database, I have set the column "name" to VARCHAR(30). If I enter 'éè' in this column, it results in &Atilde;&copy;&Atilde;&uml;
In other words, the name 'Hélène' is too long to be entered in a column set to VARCHAR(30).
Can someone tell me what I did wrong here ?
Thanks a lot.

Try this code I found a while back

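function charset_decode_utf_8($string) {
    // Plain ASCII needs no conversion. (ereg() and the /e modifier are gone in PHP 7.)
    if (!preg_match('/[\x80-\xFF]/', $string)) {
        return $string;
    }
    // Three-byte UTF-8 sequences become numeric character references.
    $string = preg_replace_callback('/([\xE0-\xEF])([\x80-\xBF])([\x80-\xBF])/', function ($m) {
        return '&#' . ((ord($m[1]) - 224) * 4096 + (ord($m[2]) - 128) * 64 + (ord($m[3]) - 128)) . ';';
    }, $string);
    // Two-byte UTF-8 sequences become numeric character references.
    $string = preg_replace_callback('/([\xC0-\xDF])([\x80-\xBF])/', function ($m) {
        return '&#' . ((ord($m[1]) - 192) * 64 + (ord($m[2]) - 128)) . ';';
    }, $string);
    return $string;
}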
echo charset_decode_utf_8('é');
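
For comparison, here is a minimal sketch of another route, assuming the PHP file itself is saved as UTF-8: htmlentities() takes the charset as its third argument. Before PHP 5.4 the default was ISO-8859-1, which is why each UTF-8 byte was being turned into its own entity.

```php
<?php
// Assumption: this file is saved as UTF-8, as recommended in the thread.
$text = 'éè';

// Tell htmlentities() that the input is UTF-8.
$utf8 = htmlentities($text, ENT_QUOTES, 'UTF-8');

// The same bytes treated as ISO-8859-1: every byte becomes its own entity.
$latin1 = htmlentities($text, ENT_QUOTES, 'ISO-8859-1');

echo $utf8, "\n";   // &eacute;&egrave;
echo $latin1, "\n"; // &Atilde;&copy;&Atilde;&uml;
```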

Thanks Metalix !

Does this mean that, if your webpages are UTF-8 encoded, you don't really need to use the htmlentities() function?

diafol

I've had problems with multibyte characters since before I could walk!

Firstly, ensure everything is set to UTF-8 (head encoding and DB charset).
Make sure that your files are saved as "UTF-8 without BOM". You can check this with a free-use editor like Notepad++.

Non-US-ASCII (or whatever you want to call them) characters take up 2 bytes (or even 3 for some Asian sets), so if you're using UTF-8 throughout and your field sizes are counted in bytes, they should be x3 as long as you envisaged. (MySQL 4.1 and later count VARCHAR lengths in characters, so check which applies to your setup.) There's some talk of 4-byte characters, so perhaps we should amend our max sizes to x4. VARCHAR will be preferable to CHAR datatypes for this, otherwise you'll end up wasting storage.

Don't store text in HTML-encoded form (e.g. &copy;). That'll really mess up your field widths.

When performing string functions, ensure that you use the multibyte versions if they are available: mb_strlen() as opposed to strlen() and mb_substr() as opposed to substr(). There are also 'count' functions such as mb_substr_count().
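
A minimal sketch of the difference, assuming the mbstring extension is loaded and the file is saved as UTF-8. The name comes from the original question; the variable names are only for illustration.

```php
<?php
// Assumptions: mbstring is available and this file is saved as UTF-8.
$name = 'Hélène';

$bytes  = strlen($name);                        // counts bytes
$chars  = mb_strlen($name, 'UTF-8');            // counts characters
$first3 = mb_substr($name, 0, 3, 'UTF-8');      // cuts on character boundaries
$count  = mb_substr_count($name, 'è', 'UTF-8'); // counts occurrences of a character

echo $bytes, "\n";  // 8
echo $chars, "\n";  // 6
echo $first3, "\n"; // Hél
echo $count, "\n";  // 1
```

With plain substr($name, 0, 2) the cut would land in the middle of the 'é' and leave a broken byte behind.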

Hey again.
Yes, you definitely need to use htmlentities() when outputting any user data.
Especially if that data is going inside a tag (an attribute value, for example), you will need to use htmlentities('', ENT_QUOTES) so it doesn't break your site.
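
Roughly what that looks like, as a sketch rather than a drop-in solution. The helper name escape_for_html() and the sample input are made up for illustration, and the 'UTF-8' argument assumes the page and data are UTF-8.

```php
<?php
// Hypothetical helper (not from the thread): escape a value for HTML output.
function escape_for_html($value) {
    // ENT_QUOTES converts both double and single quotes.
    return htmlentities($value, ENT_QUOTES, 'UTF-8');
}

// Sample user input that tries to break out of the attribute.
$name = 'Hélène" onmouseover="alert(1)';

$html = '<input type="text" name="name" value="' . escape_for_html($name) . '">';

echo $html, "\n";
// The injected double quotes arrive as &quot;, so the value stays inside the attribute.
```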

diafol

If you use mysql_real_escape_string(), you've got all the quotes covered.
htmlentities() or strip_tags() can be used when outputting, to protect yourself from scripting attacks. I don't think you necessarily need them for input. As I've mentioned previously, if everything is in UTF-8, I can't see the benefit of using htmlentities() just to store non-ASCII characters. Does anybody have any different ideas?

Is it necessary to manipulate user-input with htmlentities before inserting it into the database ?

$name = strip_tags($_POST['name']);
// now check if the content of $name is valid... and then prepare to insert in db
$name = htmlentities(mysqli_real_escape_string($dbc, trim($name)));
// next insert data in db

I've seen bits of code where 'htmlentities' is used, and 'strip_tags' is not used.
(and not 'stripslashes', as I wrongly mentioned in my original question.)
Metalix, you are talking about 'outputting any user data'. I assume to a browser ?

Hey Ardav,

That's the point. In my case, if I use htmlentities before storing data in my db, I have to change the size of my db-fields to 3 times the size they have now.
If I don't use htmlentities, I save a lot of memory...

diafol

Yes, I'd leave out the htmlentities() personally. However, take heed of the need to increase your field lengths x3 anyway, as some multibyte chars can take 3 bytes. If you use varchar(18) instead of varchar(6) and the data ends up being just 6 chars, you don't lose out. The problem comes when you use char(18) instead of char(6): now that really does bite.

Iñtërnâtiônàlizætiøn is stored as 27 bytes, although it's only 20 chars.

Counted in bytes, a varchar(20) would store: Iñtërnâtiônàliz (obviously not enough)
Counted in bytes, a varchar(27) would store: Iñtërnâtiônàlizætiøn (on the button!)

The htmlentities('Iñtërnâtiônàlizætiøn') gives a whopping 118 bytes when the UTF-8 input is treated as ISO-8859-1 (the default before PHP 5.4): over 4x the amount req'd by multibyte storage.
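
If you want to check those numbers yourself, here is a small sketch, assuming mbstring is loaded and the file is saved as UTF-8. The variable names are only for illustration.

```php
<?php
// Assumptions: mbstring is available and this file is saved as UTF-8.
$word = 'Iñtërnâtiônàlizætiøn';

$chars = mb_strlen($word, 'UTF-8'); // characters
$bytes = strlen($word);             // bytes as stored in UTF-8

// Entity-encoded size when the UTF-8 bytes are treated as ISO-8859-1.
$asLatin1 = strlen(htmlentities($word, ENT_QUOTES, 'ISO-8859-1'));

// Entity-encoded size when the charset is given as UTF-8.
$asUtf8 = strlen(htmlentities($word, ENT_QUOTES, 'UTF-8'));

echo $chars, "\n";    // 20
echo $bytes, "\n";    // 27
echo $asLatin1, "\n"; // 118
echo $asUtf8, "\n";   // 64
```

Even with the right charset, the entity-encoded form is more than twice the size of the raw UTF-8 string.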

This has helped me a lot, thanks !