Sanitize HTMLString

Troy III

This is a simple HTML String Sanitizing tool.

It allows rich, but safe, HTML content input to be published on your pages.

The script is very light and, to some degree, customizable.

  • The function first takes care of blacklisted tags, which are discarded without further processing.
  • For all other elements, it checks link protocols for code injection and neutralizes the link if JavaScript is encountered.
  • Further on, it removes all event handler attributes from all tags.
  • Lastly, it restores images after their event attributes have been stripped. This security step is taken directly on the source string, because an image triggers the onerror event on a faulty source as soon as it is converted to a DOM element, making it possible for an attacker to execute malicious code even before the content is appended to the document. This is the only part of the measure that operates at the string level, but it is an absolutely necessary one!
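The string-level image step can be sketched on its own, without any DOM. This is only a minimal illustration under my own assumptions: the helper names neutralizeImages and restoreImages are hypothetical, and the lookahead pattern is a slightly stricter variant of the plain "<img " match.

```javascript
// Hypothetical helpers; the names are illustrative and not part of the snippet.

// Rename every <img ...> opening tag to <imga ...>, so the parser builds an
// unknown, inert element instead of an image that could fire onerror.
function neutralizeImages(html) {
  // The lookahead accepts whitespace, "/" or ">" after "img",
  // so "<img\nsrc=...>" and "<img/src=...>" are caught as well.
  return html.replace(/<img(?=[\s\/>])/gi, "<imga");
}

// Undo the renaming on the serialized, already cleaned markup.
function restoreImages(html) {
  return html
    .replace(/<imga(?=[\s\/>])/gi, "<img")
    .replace(/<\/imga>/gi, ""); // img is a void element: drop the closing tag
}

var dirty = "<IMG src=x onerror=alert(1)> <img\nsrc=y> <imgs>";
var inert = neutralizeImages(dirty);
console.log(inert);
console.log(restoreImages('<imga src="x">text</imga>'));
```

Between the two calls, the markup can be parsed and cleaned as usual, since an imga element never requests a source and so never raises onerror.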

All questions, suggestions or remarks are welcome...

p.s.: just before posting, I decided to also restrict the use of inline style, since earlier versions of IE support JavaScript-computed values (CSS expressions) on this property.
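To see which attributes this rule drops, here is a small sketch of the name test in isolation. The helper name isStrippedAttribute is hypothetical; the pattern is the one used in the function for event attributes and inline style.

```javascript
// Hypothetical helper mirroring the attribute pattern used in sterilize().
function isStrippedAttribute(name) {
  // "^on" catches event handlers (onclick, onerror, ...);
  // "style" catches the inline style attribute.
  return /(^on|style)/i.test(name);
}

var names = ["onclick", "ONERROR", "style", "href", "id", "contenteditable"];
var kept = names.filter(function (n) { return !isStrippedAttribute(n); });
console.log(kept); // the attribute names that would survive
```

Note that "style" is matched anywhere in the name, so a custom attribute such as data-style would be dropped too.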

JorgeM commented: nicely done. +10
Taywin commented: Clear description +11
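function sterilize(HTMLString){
		// Rename <img> before parsing, so no image loads and no onerror can fire.
		HTMLString=HTMLString.replace( /<img(?=[\s\/>])/gi, "<imga");
		var	att, x, y, coll, c=[],
			probe = document.createElement("div");
			probe.innerHTML = HTMLString;
			coll = probe.getElementsByTagName("*");
		// Copy the live collection into a static array before changing the tree.
		for( x=0; x<coll.length; x++ ) c.push(coll[x]);

		for( x=0; x<c.length; x++ )
		if( /(script|object|embed|iframe)/i.
		/*you can blacklist more tags here!*/
			test( c[x].tagName ) ){
			// Remove the element together with its content.
			c[x].parentNode.removeChild( c[x] );
 		} else {
			// Test c[x], not coll[x]: coll is live and its indexes shift after removals.
			if( c[x].href && /java/i.test(c[x].protocol) ) c[x].href="#";
			att = c[x].attributes;
			// Go backwards, because removing an attribute re-indexes this live list.
			for( y=att.length-1; y>=0; y-- )
			if( /(^on|style)/i.test(att[y].name) )
				c[x].removeAttribute( att[y].name );
		}
		// Restore the images, now free of event attributes.
		c=probe.innerHTML.replace( /<imga(?=[\s\/>])/gi, "<img" );
		return  c.replace( /<\/imga>/gi, "" );
	}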
	
	//Test HTMLString...
	
	var inputString="Input start:<a href=javascript:alert('javascript') onclick=\"alert('link clicked')\">klik</a> <div id=sentence>This sentence is without errors.</div><script src='javascript.js'><\/script><div contenteditable='true' onkeyup=\"abc(this, sentence['innerText'||'textContent'])\"></div>The image:<img src=deception onerror='alert()'><a href=sanitize.htm>some href</a><object></object><script><\/script><iframe><\/iframe>The End!";

	//exec:
	console.log( sterilize(inputString) );
Taywin 312 Posting Virtuoso

Nice snippet.

One note for practicality: white-list sanitization is better than black-list sanitization when it comes to security. In other words, reject anything that is not among the expected values, rather than attempting to remove unwanted values that may or may not have been foreseen. It is less confusing and more intuitive.

Troy III 272 Posting Pro

Hi Taywin, thanks for your suggestion.
I think there was a reason behind this decision...

It's about freedom

And freedom is about more choices, more opportunities, more solutions and so forth.
In my humble opinion, there are always more good people than bad.

Hence, contrary to the widespread policy/opinion that treats them all as equally evil, yet privileges the most unworthy ones, in reality and in practice it is always just, and easier than anything else, to restrain only the evil ones and let the others enjoy.

The freedom where you are free to do nothing is no freedom at all.
-It's a quarantine!
My natural-born idea was/is to make freedom enjoyable. (For what good is freedom that you can't enjoy?!)

Anyway,
as mentioned earlier, the code is also customizable "to some degree",
and so is its policy, which can easily be inverted:

// just add the "!" [NOT] mark here and fill in your whitelist
// (anchor it with ^ and $, so that only whole tag names match):
for( x=0; x<c.length; x++ )
    if( ! /^(your|white|list|here)$/i.
    //and continue;

then leave the "else" part as it is:

} else {

so that the allowed tags still get their links and attributes cleaned, and you've got yourself a privileged list of allowed tags, instead.
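A minimal sketch of the inverted tag test, taken on its own. The list of tags and the helper name isAllowedTag are only assumptions for illustration; put in whatever you consider safe. Since images are renamed before parsing, they show up as "imga" at this point.

```javascript
// Hypothetical allow-list; pick your own tags. "imga" stands in for "img",
// because images are renamed before the markup is parsed.
var ALLOWED = /^(a|b|i|em|strong|p|div|span|imga)$/i;

function isAllowedTag(tagName) {
  return ALLOWED.test(tagName);
}

console.log(isAllowedTag("DIV"));    // true
console.log(isAllowedTag("IFRAME")); // false
// Without ^ and $ the same list would let IFRAME through,
// because "iframe" contains the letters "i" and "a".
console.log(/(a|b|i|em|strong|p|div|span|imga)/i.test("IFRAME")); // true
```

Elements that pass this test should still go through the link and attribute cleanup, otherwise an allowed tag could carry an onclick or a javascript: link.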

p.s.
You might notice that, depending on "how you look at freedom", the privileged list will tend to extend far beyond the length of its opposite.

My Best Regards.
