how to parse html file

Question

aseeman 0 Newbie Poster

10 Years Ago

Hi everybody,
I have a file with some HTML-like tags:
<instance id="bass.1000000" docsrc = "BNC/A0C">
<answer instance="bass.1000000" senseid="bass%fish"/>
<context>
Try it with grilled sea <head>bass</head> and fennel.
</context>
</instance>
I need to parse it and retrieve some information, such as the id and the senseid in the answer tag.
How can I do this?
I added <html> and <body> tags and wrote this code, but it does not work.

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(sourcefile);
HtmlNodeCollection collection = doc.DocumentNode.SelectNodes("//instance");
foreach (HtmlNode n in collection)
{
    label1.Text += n.Attributes["id"].Value;
}

html-css

5 Contributors
4 Replies
297 Views
2 Months Discussion Span
Latest Post 10 Years Ago by JOSheaIV

All 4 Replies

Fenrir() 52 Newbie Poster

10 Years Ago

You can use a simple regex to achieve what you want, but from past experience this can get quite verbose and inefficient. I would suggest trying Click Here

RichardGalaviz 15 Newbie Poster · Answer 1 · 2014-10-01T09:05:50+00:00

Parsing HTML content without first interpreting the HTML into a data structure is bound to be fragile across pages written in different ways. At the same time, HTML found in the wild is frequently invalid. Worse, it is often actually XHTML, while documents that claim to be XHTML regularly contain HTML quirks such as missing closing tags, uppercase tag names, and so forth.

AleMonteiro 238 Can I pick my title? · Answer 2 · 2014-10-01T10:04:06+00:00

I suggest using http://htmlagilitypack.codeplex.com to properly parse the HTML and then navigate through it.
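
For reference, here is a minimal sketch of how that could look for the sample in the question. It assumes the HtmlAgilityPack NuGet package is installed; the `markup` variable and the console output are only for illustration. One thing worth checking in the original code: `LoadHtml` expects the markup itself as a string, while `Load` takes a file path, so if `sourcefile` held a path (an assumption, the question does not say), `SelectNodes` would find nothing and return null.

```csharp
using System;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack"

class Program
{
    static void Main()
    {
        // Sample markup copied from the question.
        string markup = @"<instance id=""bass.1000000"" docsrc = ""BNC/A0C"">
<answer instance=""bass.1000000"" senseid=""bass%fish""/>
<context>
Try it with grilled sea <head>bass</head> and fennel.
</context>
</instance>";

        var doc = new HtmlDocument();
        doc.LoadHtml(markup); // markup text; use doc.Load(path) to read from a file

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection instances = doc.DocumentNode.SelectNodes("//instance");
        if (instances == null)
        {
            Console.WriteLine("No <instance> elements found.");
            return;
        }

        foreach (HtmlNode instance in instances)
        {
            // GetAttributeValue falls back to the default when the attribute is missing.
            string id = instance.GetAttributeValue("id", "");
            HtmlNode answer = instance.SelectSingleNode(".//answer");
            string senseId = answer == null ? "" : answer.GetAttributeValue("senseid", "");
            Console.WriteLine(id + " -> " + senseId);
        }
    }
}
```

The null check matters because iterating a null collection in `foreach` throws a NullReferenceException, which may be what "does not work" meant in the question.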

JOSheaIV 119 C# Addict · Answer 3 · 2014-10-01T21:33:46+00:00

Well, I feel my suggestion might be a little biased because I love doing it, but: Regex.

A while back I actually wrote a custom library that parsed HTML and relied heavily on Regex to perform the operations (the thing was a little weird; I need to clean it up a little). With regex you could easily parse out whole nodes or attributes. I guess the next question, though, is: how familiar are you with Regex?

If not, there is the HTML Agility Pack mentioned by AleMonteiro. While I have never used it myself, it popped up A LOT when I was looking into parsing HTML (which resulted in me building my own library).
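
To give an idea of the regex route, here is a rough sketch using only System.Text.RegularExpressions. It is not the custom library mentioned above. The `TagAttributes` class and its `Read` method are made-up names for illustration, and it assumes attribute values are always double-quoted, as in the sample from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class TagAttributes
{
    // Matches name="value" pairs, allowing spaces around '=' as in docsrc = "BNC/A0C".
    static readonly Regex AttributePattern = new Regex(@"([\w:-]+)\s*=\s*""([^""]*)""");

    // Returns the attributes of the first <tagName ...> opening tag in text,
    // or an empty dictionary when the tag is absent.
    public static Dictionary<string, string> Read(string text, string tagName)
    {
        var result = new Dictionary<string, string>();
        Match tag = Regex.Match(text, "<" + Regex.Escape(tagName) + @"\b([^>]*)>");
        if (!tag.Success) return result;

        foreach (Match m in AttributePattern.Matches(tag.Groups[1].Value))
            result[m.Groups[1].Value] = m.Groups[2].Value;
        return result;
    }
}

class Program
{
    static void Main()
    {
        string text = "<instance id=\"bass.1000000\" docsrc = \"BNC/A0C\">\n" +
                      "<answer instance=\"bass.1000000\" senseid=\"bass%fish\"/>\n";

        Console.WriteLine(TagAttributes.Read(text, "instance")["id"]);    // bass.1000000
        Console.WriteLine(TagAttributes.Read(text, "answer")["senseid"]); // bass%fish
    }
}
```

This only handles the simple case. Unquoted values, a `>` inside an attribute, or nested tags of the same name would need more work, which is where a real parser starts to pay off.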
