How can I extract repeated paragraphs of data from an HTML document? Every paragraph is preceded by the line:
<p><i>Summary as passed House:</i> <br>

Thanks.

Why don't you post the HTML in a .txt file on the thread so we can see what you're looking at? Line breaks, white space, etc. all affect how you scrape the file.

I've attached the HTML file as a .txt file, as you requested.

Thanks.

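Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Public Class frmRegexParagraph

	Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
		' Read the whole file and collect every <p>...</p> block.
		Dim str As String() = GetParagraphs(System.IO.File.ReadAllText("C:\testdata.txt"))
		' Pause here to inspect str in the debugger.
		System.Diagnostics.Debugger.Break()
	End Sub

	Private Shared Function GetParagraphs(ByVal data As String) As String()
		Dim result As New List(Of String)
		' Singleline lets "." match line breaks, so paragraphs that span several lines are matched;
		' IgnoreCase also accepts <P>...</P>.
		Dim options As RegexOptions = RegexOptions.Singleline Or RegexOptions.IgnoreCase
		Dim m As Match = Regex.Match(data, "<p>\s*(.+?)\s*</p>", options)
		While (m.Success)
			' m.Value is the whole paragraph including tags; m.Groups(1).Value is only the inner text.
			result.Add(m.Value)
			m = m.NextMatch()
		End While
		Return result.ToArray()
	End Function
End Class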

You can't really parse HTML with regular expressions. It's too complex. Regular expressions won't handle <![CDATA[ sections and entity references correctly.
I recommend the Html Agility Pack.

SUMMARY:
This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).
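For anyone who wants to try this route, here is a minimal sketch of how the summary paragraphs might be pulled out with the Html Agility Pack. It assumes the HtmlAgilityPack assembly is referenced in the project and that each paragraph in the file is closed with </p>; the attached file has not been checked against it. The module name SummaryExtractor and the function GetSummaries are made up for illustration.

```vbnet
Imports System.Collections.Generic
Imports HtmlAgilityPack

' Hypothetical helper module; the names are not part of the library.
Public Module SummaryExtractor

    ' Returns the text of every <p> that has an <i> child containing the label.
    Public Function GetSummaries(ByVal html As String) As List(Of String)
        Dim result As New List(Of String)
        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        ' SelectNodes returns Nothing (not an empty collection) when nothing matches.
        Dim nodes As HtmlNodeCollection =
            doc.DocumentNode.SelectNodes("//p[i[contains(., 'Summary as passed House:')]]")
        If nodes Is Nothing Then Return result

        For Each p As HtmlNode In nodes
            ' InnerText still includes the label itself; entities such as &amp; are decoded here.
            result.Add(HtmlEntity.DeEntitize(p.InnerText).Trim())
        Next
        Return result
    End Function

End Module
```

Calling SummaryExtractor.GetSummaries(System.IO.File.ReadAllText("C:\testdata.txt")) should then give one string per summary paragraph, each still starting with the label text. If the file leaves its <p> tags unclosed, the XPath would probably need adjusting.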

Thanks for your help. I appreciate it.

You can get all the paragraph elements from a WebBrowser control with the following technique:

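Dim oElements As HtmlElementCollection = WebBrowser1.Document.GetElementsByTagName("p")
For Each oElement As HtmlElement In oElements
     ' InnerHtml can be Nothing for an empty element, and the control may
     ' return tags in a different case, so compare case-insensitively.
     Dim html As String = oElement.InnerHtml
     If html IsNot Nothing AndAlso html.IndexOf("<i>Summary as passed House:</i>", StringComparison.OrdinalIgnoreCase) >= 0 Then
        Debug.Print("FOUND")
     End If
Next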

You can then parse the rest of each matching element to get the specific data within the <P> tags.
Hope this helps.
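As a rough illustration of that last step, the sketch below turns the HTML of one matched paragraph into plain text without the label. ExtractSummary and the module name SummaryText are hypothetical, and the approach assumes the label sits at the very start of the paragraph, as in the line quoted in the question. It strips tags with a simple pattern, which is only reasonable for a small fragment like this, not for a whole page.

```vbnet
Imports System.Net
Imports System.Text.RegularExpressions

' Hypothetical helper module for cleaning up one matched paragraph.
Public Module SummaryText

    Public Function ExtractSummary(ByVal paragraphHtml As String) As String
        ' Drop the leading "<i>Summary as passed House:</i> <br>" label, whatever its tag case.
        Dim body As String = Regex.Replace(paragraphHtml,
            "^\s*(<p[^>]*>)?\s*<i>\s*Summary as passed House:\s*</i>\s*(<br\s*/?>)?",
            "", RegexOptions.IgnoreCase)
        ' Replace any remaining tags with a space, then decode entities such as &amp;.
        body = Regex.Replace(body, "<[^>]+>", " ")
        body = WebUtility.HtmlDecode(body)
        ' Collapse the whitespace left behind by line breaks and removed tags.
        Return Regex.Replace(body, "\s+", " ").Trim()
    End Function

End Module
```

Inside the loop above, SummaryText.ExtractSummary(oElement.InnerHtml) could be called in place of the "FOUND" message to collect the text of each summary.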

I hope one of these solutions solved your issue.

Please mark this thread as solved if you have found an answer to your question and good luck!
