Hi all,

I'm looking to extract all the HTML tags from a dump of HTML data and put them in a ListBox.

I currently have the following code.

It shows me tags such as HTML, HEAD, TITLE and BODY.

But I want things like the IMG and ALT tags.

    ' Obtain the document interface
    Dim htmlDocument As mshtml.IHTMLDocument2 = DirectCast(New mshtml.HTMLDocument(), mshtml.IHTMLDocument2)
    ' Construct the document
    htmlDocument.write(htmlDocument)
    ListBox1.Items.Clear()
    ' Extract all elements
    Dim allElements As mshtml.IHTMLElementCollection = htmlDocument.all
    ' Iterate all the elements and display tag names
    For Each element As mshtml.IHTMLElement In allElements
        ListBox1.Items.Add(element.tagName)
    Next
    ' Extract all image elements
    Dim imgElements As mshtml.IHTMLElementCollection = htmlDocument.images
    ' Iterate through each image element
    For Each img As mshtml.IHTMLImgElement In imgElements
        ListBox2.Items.Add(img.src)
    Next
End Sub

If you don't absolutely have to use the mshtml interface, you could try this:

Imports System.IO
Public Class Form1

    Private Sub Form1_Load(ByVal sender As Object, ByVal e As System.EventArgs) Handles Me.Load
        'Set the webbrowser control visible property to false if you don't need it for anything else.
        WebBrowser1.Url = New Uri("C:\Test1.htm")
    End Sub

    Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
        Dim htmlDocument As HtmlDocument = WebBrowser1.Document
        ListBox1.Items.Clear()
        ' Iterate all the elements and display tag names
        For Each element As HtmlElement In htmlDocument.All
            ListBox1.Items.Add(element.TagName)
            If element.TagName.ToUpper = "IMG" Then
                ' DomElement is typed as Object, so .src is late-bound (needs Option Strict Off)
                ListBox2.Items.Add(element.DomElement.src)
            End If
        Next

    End Sub
End Class

Hi,

Thanks so much for your code, it is working very well. I have one more problem to ask about, sorry.

I can open a new thread if you would like me to.

The code uses IMG as the tag name. What if I wanted the tag to be, for example,

TITLE or ALT?

    Dim htmlDocument As HtmlDocument = WebBrowser1.Document
    ListBox1.Items.Clear()
    ' Iterate all the elements and display tag names
    For Each element As HtmlElement In htmlDocument.All
        ListBox1.Items.Add(element.TagName)
        If element.TagName.ToUpper = "TITLE" Then
            ListBox2.Items.Add(element.DomElement.src)
        End If
    Next
End Sub

I get the following error...

Public member 'src' on type 'HTMLTitleElementClass' not found.

Thanks so much!!

Sorry, after thinking a little more, I should be clearer about what I want to do.

If the TITLE element has nothing in it, for example "", then I want that printed in ListBox2.

If it has something in it, I want that printed in ListBox2. For example, if the TITLE element has

" Welcome to Amazon " in it, I want that in ListBox2.

Thanks.

I'll be honest, I don't know how to do that. Do you have any sample code?

Thanks.

I tested this by casting WebBrowser.Document.DomDocument, so hopefully it will work for you.

You used: Dim htmlDocument As mshtml.IHTMLDocument2 = DirectCast(New mshtml.HTMLDocument(), mshtml.IHTMLDocument2)

Recast it as mshtml.IHTMLDocument3:

    Dim doc3 As mshtml.IHTMLDocument3 = DirectCast(htmlDocument, mshtml.IHTMLDocument3)
    ' getElementsByTagName returns only the IMG elements, so each one has a src
    For Each img As mshtml.IHTMLImgElement In doc3.getElementsByTagName("img")
        Debug.WriteLine(img.src)
    Next
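
For the WebBrowser variant mentioned above, a minimal sketch of the same cast might look like the following. It assumes a form with WebBrowser1 and ListBox2, a project reference to Microsoft.mshtml, and a page that has already finished loading; ListImageSources is a hypothetical helper name, not something from the original code.

```vbnet
' Sketch only: ListImageSources is a hypothetical helper inside the Form class.
' Assumes WebBrowser1 has finished loading and Microsoft.mshtml is referenced.
Private Sub ListImageSources()
    ' DomDocument exposes the underlying COM document as Object
    Dim doc3 As mshtml.IHTMLDocument3 = DirectCast(WebBrowser1.Document.DomDocument, mshtml.IHTMLDocument3)
    For Each img As mshtml.IHTMLImgElement In doc3.getElementsByTagName("img")
        ' Guard against a missing src, since ListBox items cannot be Nothing
        ListBox2.Items.Add(If(img.src, String.Empty))
    Next
End Sub
```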

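src is the file path of an image, so it only exists on IMG elements, and my original code works for that. A TITLE element has no src member, which is why you get the error. To get the inner text of the TITLE element, use: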

        If element.TagName.ToUpper = "TITLE" Then
            ListBox2.Items.Add(element.InnerText)
        End If
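
Putting the pieces together, here is a hedged sketch of how the TITLE text and the ALT text could both be listed. ALT is an attribute of IMG rather than a tag of its own, so this reads it with HtmlElement.GetAttribute. It assumes WebBrowser1 has finished loading and that ListBox2 exists on the form; ListTitleAndAlt is a hypothetical helper name.

```vbnet
' Sketch only: ListTitleAndAlt is a hypothetical helper inside the Form class.
Private Sub ListTitleAndAlt()
    ListBox2.Items.Clear()
    For Each element As HtmlElement In WebBrowser1.Document.All
        Select Case element.TagName.ToUpperInvariant()
            Case "TITLE"
                ' InnerText may be Nothing for an empty title, so fall back to ""
                Dim titleText As String = If(element.InnerText, String.Empty)
                ListBox2.Items.Add("TITLE: """ & titleText & """")
            Case "IMG"
                ' ALT is read as an attribute of the IMG element
                ListBox2.Items.Add("ALT: """ & element.GetAttribute("alt") & """")
        End Select
    Next
End Sub
```

The quotes around each value are there so that an empty TITLE still shows up in the list as "".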