Hello guys!

At school we were given a homework assignment to read some information from a web page and display it. While searching the net for how to approach this, I came across HTML Agility Pack and decided to use it. But I have some problems parsing the content.

Here is the part of the HTML that I have to get values from (the DATUM and Maloprodajna cena [EUR/L] rows).

<h3>marec - maj 2009</h3>
<div class="graf_table">
	<table summary="layout table">
		<tr>
			<th>DATUM</th>
			<td class="datum">10.03.2009</td>
			<td class="datum">24.03.2009</td>
			<td class="datum">07.04.2009</td>
			<td class="datum">21.04.2009</td>
			<td class="datum">05.05.2009</td>
			<td class="datum">06.05.2009</td>
		</tr>
		<tr>
			<th>Maloprodajna cena [EUR/L]</th>
			<td>0,96000</td>
			<td>0,97000</td>
			<td>0,99600</td>
			<td>1,00800</td>
			<td>1,00800</td>
			<td>1,01000</td>
		</tr>
		<tr>
			<th>Maloprodajna cena [SIT/L]</th>
			<td>230,054</td>
			<td>232,451</td>
			<td>238,681</td>
			<td>241,557</td>
			<td>241,557</td>
			<td>242,036</td>
		</tr>
		<tr>
			<th>Prodajna cena brez dajatev</th>
			<td>0,33795</td>
			<td>0,34628</td>
			<td>0,36795</td>
			<td>0,37795</td>
			<td>0,37795</td>
			<td>0,37962</td>
		</tr>
		<tr>
			<th>Trošarina</th>
			<td>0,46205</td>
			<td>0,46205</td>
			<td>0,46205</td>
			<td>0,46205</td>
			<td>0,46205</td>
			<td>0,46205</td>
		</tr>
		<tr>
			<th>DDV</th>
			<td>0,16000</td>
			<td>0,16167</td>
			<td>0,16600</td>
			<td>0,16800</td>
			<td>0,16800</td>
			<td>0,16833</td>
		</tr>
	</table>
</div>

So far I have managed to write this, which gives me all the values from the table.
So my question is: what should I add or change in the query so that it returns only the values from cells where the table header is DATUM or Maloprodajna cena [EUR/L]?

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
// Set the options before Load so they apply while the file is parsed.
doc.OptionCheckSyntax = true;
doc.OptionFixNestedTags = true;
doc.OptionAutoCloseOnEnd = true;
doc.OptionOutputAsXml = true;
doc.OptionDefaultStreamEncoding = Encoding.Default;
doc.Load(@"C:\Users\User\Desktop\petrol.celota.htm");

// ".//table" is relative to the current div; a leading "//" searches the whole document.
var query = from html in doc.DocumentNode.SelectNodes("//div[@class='graf_table']").Cast<HtmlNode>()
            from table in html.SelectNodes(".//table").Cast<HtmlNode>()
            from row in table.SelectNodes("tr").Cast<HtmlNode>()
            from cell in row.SelectNodes("th|td").Cast<HtmlNode>()
            select new { Table = table.Id, CellText = cell.InnerHtml };

All 5 Replies

You have pretty much answered your own question:
"return only the values from cells where the table header is DATUM or Maloprodajna cena [EUR/L]"

Check out the LINQ where clause.

I know that, but I have trouble setting up the where clause.
I'd be thankful if someone could show me an example of how to do this. I understand how to write one that filters numbers, strings, etc., but for this particular case I don't have a clue.

I tried something new, but this gives me duplicates. I kind of understand why, but I don't know how to fix it.

var query = from html in doc.DocumentNode.SelectNodes("//div[@class='graf_table']/table")
            from date in html.SelectNodes("tr[1]/td")
            from price in html.SelectNodes("tr[2]/td")
            select new { Date = date.InnerText, Price = price.InnerHtml };

I solved it:)

var query = from table in doc.DocumentNode.SelectNodes("//div[@class='graf_table']//table")
            from date in table.SelectNodes("tr[1]/td")
            from price in table.SelectNodes("tr[2]/td")
            select new { Date = date.InnerText, Price = price.InnerHtml };

However, can someone show me how I would accomplish this with the first approach I took, using a where clause (where the table header is DATUM or Maloprodajna ...)?
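One possible way to write the where clause is to look at each row's <th> cell and keep the row only when its text is one of the wanted headers. The following is a minimal sketch, not a definitive answer: the class name HeaderFilterSketch and the method RowsByHeader are hypothetical, it assumes the HtmlAgilityPack package is referenced, and it assumes every row of interest has exactly one <th>.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // assumes the HtmlAgilityPack package is referenced

public static class HeaderFilterSketch
{
    // Hypothetical helper: returns (header text, cell texts) for every row
    // whose <th> text is one of the wanted headers.
    public static List<KeyValuePair<string, string[]>> RowsByHeader(
        string html, params string[] wantedHeaders)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var rows = doc.DocumentNode.SelectNodes("//div[@class='graf_table']//table//tr");
        if (rows == null)
            return new List<KeyValuePair<string, string[]>>();

        var query = from row in rows
                    let header = row.SelectSingleNode("th")
                    where header != null
                          && wantedHeaders.Contains(header.InnerText.Trim())
                    select new KeyValuePair<string, string[]>(
                        header.InnerText.Trim(),
                        row.Elements("td").Select(td => td.InnerText.Trim()).ToArray());

        return query.ToList();
    }
}
```

Calling RowsByHeader(html, "DATUM", "Maloprodajna cena [EUR/L]") on the table above should give two entries, one per matching row, each holding the six cell texts of that row.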

Oops, I was too fast ... it does not work the way I want.

Instead of:
Date: 10.03.2009 Price: 0,96000
Date: 24.03.2009 Price: 0,97000
Date: 07.04.2009 Price: 0,99600
.
.
.
etc.

I get:
Date: 10.03.2009 Price: 0,96000
Date: 10.03.2009 Price: 0,97000
Date: 10.03.2009 Price: 0,99600
Date: 10.03.2009 Price: 1,00800
Date: 10.03.2009 Price: 1,00800
Date: 10.03.2009 Price: 1,01000
Date: 24.03.2009 Price: 0,96000
Date: 24.03.2009 Price: 0,97000
Date: 24.03.2009 Price: 0,99600
Date: 24.03.2009 Price: 1,00800
Date: 24.03.2009 Price: 1,00800
Date: 24.03.2009 Price: 1,01000
Date: 07.04.2009 Price: 0,96000
Date: 07.04.2009 Price: 0,97000
Date: 07.04.2009 Price: 0,99600
Date: 07.04.2009 Price: 1,00800
Date: 07.04.2009 Price: 1,00800
Date: 07.04.2009 Price: 1,01000
.
.
.
etc.

I am getting closer to a solution compared with my previous tries ... any help, hint, or explanation would be appreciated though.
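The repeated dates come from the two chained from clauses: LINQ combines every date with every price, so six dates and six prices give 36 rows instead of 6. A minimal sketch of the difference, using plain string arrays in place of the HTML nodes (the class PairingSketch and both method names are hypothetical):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PairingSketch
{
    // Two chained 'from' clauses: every date is combined with every price
    // (dates.Length * prices.Length results).
    public static List<string> CrossJoin(string[] dates, string[] prices)
    {
        return (from d in dates
                from p in prices
                select "Date: " + d + " Price: " + p).ToList();
    }

    // Pairing by position: the i-th date goes with the i-th price.
    public static List<string> PairByIndex(string[] dates, string[] prices)
    {
        return dates
            .Take(prices.Length) // guard against a shorter price row
            .Select((d, i) => "Date: " + d + " Price: " + prices[i])
            .ToList();
    }
}
```

With the node collections from the query above, the same idea would apply to the InnerText values of tr[1]/td and tr[2]/td.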

Is it possible to combine those two queries?

var date = (from d in htmlDoc.DocumentNode.SelectNodes("//div[@class='graf_table']//table//tr[1]/td")
            select DateTime.Parse(d.InnerText)).ToArray();

var price = (from p in htmlDoc.DocumentNode.SelectNodes("//div[@class='graf_table']//table//tr[2]/td")
             select double.Parse(p.InnerText)).ToArray();
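
One way the two sequences could be combined is Enumerable.Zip, which pairs elements by position and stops at the shorter sequence. This is only a sketch under some assumptions: Zip needs .NET 4.0 or later, the class ZipSketch and the method Combine are hypothetical, and the inputs are the InnerText strings of the two rows. It also parses with an explicit format instead of the current culture, because DateTime.Parse and double.Parse depend on the machine's regional settings, while the page uses dd.MM.yyyy dates and a decimal comma.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

public static class ZipSketch
{
    // The page writes prices with a decimal comma, e.g. "0,96000".
    private static readonly NumberFormatInfo CommaDecimal =
        new NumberFormatInfo { NumberDecimalSeparator = ",", NumberGroupSeparator = "." };

    // Hypothetical helper: pairs the i-th date text with the i-th price text.
    public static List<KeyValuePair<DateTime, decimal>> Combine(
        IEnumerable<string> dateTexts, IEnumerable<string> priceTexts)
    {
        var dates = dateTexts.Select(t =>
            DateTime.ParseExact(t.Trim(), "dd.MM.yyyy", CultureInfo.InvariantCulture));
        var prices = priceTexts.Select(t =>
            decimal.Parse(t.Trim(), NumberStyles.AllowDecimalPoint, CommaDecimal));

        // Zip walks both sequences in step and stops when either one runs out.
        return dates
            .Zip(prices, (d, p) => new KeyValuePair<DateTime, decimal>(d, p))
            .ToList();
    }
}
```

The string inputs could be produced from the two SelectNodes calls above by selecting InnerText from each node before passing them to Combine.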