Need help with stripping HTML and encoding output

SelArom 9 Junior Poster in Training

16 Years Ago

I wasn't quite sure where to put this, since it's more of an HTML/encoding question, but since I'm working in .NET I figured this is the best place to start...

I have a site whose content-type meta tag is set to UTF-8. I receive an RSS feed and display its contents on the page. Since the feed is external and not from my own site, I want to strip all of the HTML tags and encode the output so there is no chance of XSS...

This is the regex I found for stripping HTML: <(.|\n)*?>

It doesn't seem to work, as img tags are still getting through. Does anyone have a better regex for stripping HTML tags?
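
For reference, here is a minimal sketch of how a pattern like this might be applied in C#. It rests on an assumption I have not verified against the actual feed: RSS descriptions often carry their markup as escaped text (&lt;img ...&gt; rather than a literal tag), in which case the regex has nothing to match unless the text is decoded first. FeedText and StripTags are hypothetical names, and System.Net.WebUtility is used here only so the sketch needs no System.Web reference.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class FeedText
{
    // Hypothetical helper, not part of any library.
    public static string StripTags(string feedHtml)
    {
        // Assumption: the feed may deliver markup as escaped text (&lt;img ...&gt;),
        // so unescape it first; otherwise the pattern below finds no tags.
        string decoded = WebUtility.HtmlDecode(feedHtml ?? string.Empty);

        // Same idea as <(.|\n)*?>, written with Singleline so '.' also matches newlines.
        return Regex.Replace(decoded, "<.*?>", string.Empty, RegexOptions.Singleline);
    }
}
```

A regex like this is only a rough filter for well-formed tags, so the encoding step afterwards is still what the XSS protection should rely on.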

Anyway, the bigger problem is that after stripping the tags, I pass the returned string to HttpUtility.HtmlEncode(). This works, but the page then shows things like apostrophes, quotes, ampersands, and even spaces in their encoded form (for example, &amp; instead of &)...

Now this is probably a DUH... I mean, I know I'm encoding it, but I don't want the characters to show up like that. Do I have to do a String.Replace for each of these characters, or is there another way?
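
One possible explanation, stated as an assumption rather than a diagnosis: the feed text may already contain entities, so encoding it again turns &amp; into &amp;amp; and the browser displays the entity literally. A minimal sketch of decoding first and then encoding exactly once is shown here. FeedOutput and EncodeOnce are hypothetical names, and WebUtility stands in for HttpUtility so the sketch is self-contained.

```csharp
using System;
using System.Net;

public static class FeedOutput
{
    // Hypothetical helper: normalize to raw text, then encode a single time.
    public static string EncodeOnce(string text)
    {
        // Undo any entities the feed already applied (assumption: it may have).
        string raw = WebUtility.HtmlDecode(text ?? string.Empty);

        // Encode once, so the browser renders each entity as the intended character.
        return WebUtility.HtmlEncode(raw);
    }
}
```

If the result is then assigned to something that encodes on its own, the text would be encoded a second time, so it is worth checking that only one layer does the encoding.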

Sorry if this is a noob question. I've been developing for a long time, but admittedly never took the time to really learn about encoding!

Many thanks in advance for your help!
