I'm using ASP to dynamically generate an XML file (for an RSS feed) that pulls information from a SQL DB. Users input the data into the DB using a UI that is a straightforward html form. However, there is a textarea field in that form in which most users seem to paste text from Microsoft word (or some other microsoft product that uses the ms word character set i.e. ms outlook).

My problem is that when this form submits to the database the ms characters are saved in the db as the microsoft character set. When the ASP GETs the information from the DB it seems that it pulls the same microsoft characters that are stored in the db out (as you would expect it would). However, the browser cannot interpret these characters (maybe they are not UTF-8 compatible?) and in place of these characters SOME browsers put a "unidentifiable character" character symbol and other browsers have the xml file (rss feed) break entirely.

I am using MS SQL and there is an option to set a field as xml which requires all information stored in the field to be xml compatible. However, this field does not perform any type of implicit conversion. It spits back an error message to the user on the web page which is unacceptable feedback.

Turning "Paste" off in the textarea is not an option. Users must still retain the ability to paste from something that uses an MS character set.

The input would look something like this:

(in ms character set obviously)
Hello. You're the biggest foo-bar I know.

The output would look something like this:

Hello� You�re the biggest foo�bar I know�


Does anybody have any ideas as to how to address this problem?

Is there some type of converting function in SQL that will convert the microsoft character set into something UTF-8 compatible?
Or is there some way server side to convert the information in this field.

Does the textarea support rich text, or is it the standard textarea. If it's rich text then you may be getting MSWord font formatting which isn't noticed until checking in code view. If it's a standard text area then some characters like apostrophes and other punctuation marks can be non standard and not supported in other applications. For example I have seen Word produce at least 3 different characters for an apostrophe.

When providing textareas it's always good practice to filter the input anyway, otherwise some nasty code can be sent to the database that will break the function and expose the database to attack.

Be a part of the DaniWeb community

We're a friendly, industry-focused community of developers, IT pros, digital marketers, and technology enthusiasts meeting, networking, learning, and sharing knowledge.