Guys, I have created a test Windows Form to try a few of the HAP (Html Agility Pack) functionalities.
I have the following code:

The function below parses the HTML and removes unwanted or potentially harmful markup.

Function SanitizeHtml(ByVal html As String) As String

        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        'Remove potentially harmful elements
        Dim nc As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//script|//link|//iframe|//frameset|//frame|//applet|//object")
        If nc IsNot Nothing Then
            For Each node As HtmlNode In nc
                'False = do not keep the node's children; they are removed along with it
                node.ParentNode.RemoveChild(node, False)
            Next
        End If

        'replace hrefs that point to javascript/jscript/vbscript URLs with a harmless value
        nc = doc.DocumentNode.SelectNodes("//a[starts-with(@href, 'javascript')]|//a[starts-with(@href, 'jscript')]|//a[starts-with(@href, 'vbscript')]")
        If nc IsNot Nothing Then

            For Each node As HtmlNode In nc
                node.SetAttributeValue("href", "protected")
            Next
        End If

        'replace img src values that point to javascript/jscript/vbscript URLs with a harmless value
        nc = doc.DocumentNode.SelectNodes("//img[starts-with(@src, 'javascript')]|//img[starts-with(@src, 'jscript')]|//img[starts-with(@src, 'vbscript')]")
        If nc IsNot Nothing Then
            For Each node As HtmlNode In nc
                node.SetAttributeValue("src", "protected")
            Next
        End If

        'remove the listed on<Event> handlers from all tags
        '(the HTML attribute for a double click is ondblclick, not ondoubleclick)
        nc = doc.DocumentNode.SelectNodes("//*[@onclick or @onmouseover or @onfocus or @onblur or @onmouseout or @ondblclick or @onload or @onunload]")
        If nc IsNot Nothing Then
            For Each node As HtmlNode In nc
                node.Attributes.Remove("onFocus")
                node.Attributes.Remove("onBlur")
                node.Attributes.Remove("onClick")
                node.Attributes.Remove("onMouseOver")
                node.Attributes.Remove("onMouseOut")
                node.Attributes.Remove("onDblClick")
                node.Attributes.Remove("onLoad")
                node.Attributes.Remove("onUnload")
            Next
        End If
        Return doc.DocumentNode.WriteTo()
    End Function
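
The on<Event> step above only covers the handlers it lists by name. As a rough sketch (not part of the original function), the same idea can be applied to every attribute whose name starts with "on". The module name, the RemoveEventHandlers helper and the sample markup are made up for illustration, and the code assumes the HtmlAgilityPack assembly is referenced.

```vbnet
Imports System
Imports System.Linq
Imports HtmlAgilityPack

Module EventHandlerDemo
    ' Hypothetical helper: strips every attribute whose name starts with "on",
    ' instead of naming each event handler individually.
    Function RemoveEventHandlers(ByVal html As String) As String
        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        ' Descendants() yields an empty sequence when there are no nodes,
        ' so no Nothing check is needed here.
        For Each node As HtmlNode In doc.DocumentNode.Descendants().ToList()
            ' Work on a copy of the attribute list, because attributes are
            ' removed from the live collection inside the loop.
            For Each attr As HtmlAttribute In node.Attributes.ToList()
                If attr.Name.StartsWith("on", StringComparison.OrdinalIgnoreCase) Then
                    node.Attributes.Remove(attr.Name)
                End If
            Next
        Next

        Return doc.DocumentNode.WriteTo()
    End Function

    Sub Main()
        ' Assumed sample input, not taken from the original post.
        Console.WriteLine(RemoveEventHandlers("<a href=""x"" onclick=""evil()"" ondblclick=""evil()"">go</a>"))
    End Sub
End Module
```

This trades precision for coverage: an ordinary attribute that merely happens to start with "on" would be removed as well, so it is a judgment call whether that is acceptable.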

Here is how I test the function using a WebBrowser control:

Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
        Dim url As String = "http://htmlagilitypack.codeplex.com/discussions/24346"

        WebBrowser1.Navigate(url)

    End Sub

    Private Sub Button2_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button2.Click
        TextBox1.Text = SanitizeHtml(TextBox2.Text)
    End Sub

    Private Sub WebBrowser1_DocumentCompleted(ByVal sender As System.Object, ByVal e As System.Windows.Forms.WebBrowserDocumentCompletedEventArgs) Handles WebBrowser1.DocumentCompleted
        TextBox2.Text = WebBrowser1.Document.Body.OuterHtml
    End Sub

Q: The problem I am having is that the function is not removing any of the URLs, tags, scripts, etc., so the result is left in its raw state.

Please advise.

All 2 Replies

>The problem I am having is that the function is not removing any of the URLs, tags, scripts etc hence leaving the result still in raw state

Use the Remove() or Clear() method.

If nc IsNot Nothing Then
    For Each node As HtmlNode In nc
        node.Remove()
        'node.ParentNode.RemoveChild(node, False)
    Next
    'Remove all nodes
    'nc.Clear()
End If
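
To check whether removal works at all outside the form, a minimal console sketch like the one below may help. It assumes the HtmlAgilityPack assembly is referenced; the module name, the StripScripts helper and the sample markup are invented for illustration and are not from the original code.

```vbnet
Imports System
Imports HtmlAgilityPack

Module RemoveDemo
    ' Hypothetical helper: removes every <script> element with node.Remove()
    ' and returns the remaining markup.
    Function StripScripts(ByVal html As String) As String
        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        ' SelectNodes returns Nothing when the XPath matches no nodes.
        Dim nc As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//script")
        If nc IsNot Nothing Then
            For Each node As HtmlNode In nc
                ' Detaches the node (and its content) from its parent.
                node.Remove()
            Next
        End If

        Return doc.DocumentNode.WriteTo()
    End Function

    Sub Main()
        ' Assumed sample input with one script element.
        Console.WriteLine(StripScripts("<div><script>alert(1)</script><p>hi</p></div>"))
    End Sub
End Module
```

If the printed markup no longer contains the script element, the removal itself is working, and it would be worth checking what HTML actually reaches SanitizeHtml (for example, what TextBox2 holds when Button2 is clicked).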

I tried what you suggested above but it has no effect whatsoever.

But if I change the return value to InnerText like this:

Return doc.DocumentNode.InnerText

It sort of works.
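
A note on why InnerText only "sort of" works: it returns the concatenated text of the nodes and drops all tags, so the formatting markup that should survive is lost as well, whereas WriteTo() writes the markup back out. A small sketch of the difference, assuming the HtmlAgilityPack assembly is referenced (the module name, helper names and sample markup are made up for illustration):

```vbnet
Imports System
Imports HtmlAgilityPack

Module InnerTextDemo
    ' Hypothetical helper: returns only the text content, with every tag dropped.
    Function TextOnly(ByVal html As String) As String
        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)
        Return doc.DocumentNode.InnerText
    End Function

    ' Hypothetical helper: returns the document written back out as markup.
    Function Markup(ByVal html As String) As String
        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)
        Return doc.DocumentNode.WriteTo()
    End Function

    Sub Main()
        ' Assumed sample input, not taken from the original post.
        Dim sample As String = "<p>Hello <b>world</b></p>"
        Console.WriteLine(TextOnly(sample)) ' text without the <p> and <b> tags
        Console.WriteLine(Markup(sample))   ' markup including the <p> and <b> tags
    End Sub
End Module
```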

But I can't understand these code snippets (I put this code together from a different forum).

Why is it set to False? If it is set to True, it doesn't work.

node.ParentNode.RemoveChild(node, False)
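
On the False question: the second argument of RemoveChild is the keepGrandChildren flag. With False the node is removed together with everything inside it; with True only the node itself goes and its children are moved up into the parent. For a script element that would leave the script's content behind in the page, which is probably why True appears not to work. A small sketch of the difference, assuming the HtmlAgilityPack assembly is referenced (the module name, the RemoveBold helper and the sample markup are invented for illustration):

```vbnet
Imports System
Imports HtmlAgilityPack

Module RemoveChildDemo
    ' Hypothetical helper: removes the <b> element from a fixed sample
    ' and returns the resulting markup.
    Function RemoveBold(ByVal keepGrandChildren As Boolean) As String
        Dim doc As New HtmlDocument()
        ' Assumed sample input: a <b> element with a text child and an <i> child.
        doc.LoadHtml("<div><b>bold <i>text</i></b></div>")

        Dim bold As HtmlNode = doc.DocumentNode.SelectSingleNode("//b")
        bold.ParentNode.RemoveChild(bold, keepGrandChildren)

        Return doc.DocumentNode.WriteTo()
    End Function

    Sub Main()
        ' False: the <b> element and everything inside it are dropped.
        Console.WriteLine(RemoveBold(False))
        ' True: only the <b> tag is dropped; its children stay in the <div>.
        Console.WriteLine(RemoveBold(True))
    End Sub
End Module
```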

Why isn't RemoveChild used here?

node.SetAttributeValue("href", "protected")
node.SetAttributeValue("src", "protected")

Please explain.
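
On the SetAttributeValue question: RemoveChild would delete the whole link or image, including the visible link text, while overwriting the attribute keeps the element and only neutralises the dangerous URL. One caveat: XPath starts-with is case-sensitive, so a value such as "JavaScript:..." would not be matched by the original expressions. The sketch below does the comparison in VB instead, ignoring case. It assumes the HtmlAgilityPack assembly is referenced; the module name, the NeutraliseLinks helper, the scheme list with trailing colons and the sample markup are made up for illustration.

```vbnet
Imports System
Imports HtmlAgilityPack

Module HrefDemo
    ' Hypothetical helper: overwrites script-style hrefs but keeps the <a>
    ' element and its text, comparing the URL scheme case-insensitively.
    Function NeutraliseLinks(ByVal html As String) As String
        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        ' Assumed list of schemes, mirroring the prefixes used in the original XPath.
        Dim schemes As String() = {"javascript:", "jscript:", "vbscript:"}

        ' SelectNodes returns Nothing when no <a> element has an href.
        Dim links As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//a[@href]")
        If links IsNot Nothing Then
            For Each link As HtmlNode In links
                Dim href As String = link.GetAttributeValue("href", "").Trim()
                For Each scheme As String In schemes
                    If href.StartsWith(scheme, StringComparison.OrdinalIgnoreCase) Then
                        ' Same replacement value as in the original function.
                        link.SetAttributeValue("href", "protected")
                        Exit For
                    End If
                Next
            Next
        End If

        Return doc.DocumentNode.WriteTo()
    End Function

    Sub Main()
        ' Assumed sample input with a mixed-case scheme.
        Console.WriteLine(NeutraliseLinks("<a href=""JavaScript:alert(1)"">click</a>"))
    End Sub
End Module
```

The same loop could be pointed at //img[@src] to cover images. If the attribute should disappear entirely rather than be overwritten, link.Attributes.Remove("href") is the alternative.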
