Extract all links from table in a html page

Question

khan17 -1 Junior Poster in Training

13 Years Ago

Hello all,

I need to extract the contents of a table from an HTML page. One of the table's columns contains links, and I need to extract those links along with the other column values. Is that possible? I then need to store the extracted data in a database.

The HTML for the table rows looks like this:

<tr> 
 
<td style="width:100px;">aaaaa</td> 
<td>bbbbb</td> 
<td>ccccc</td> 
<td style="width:75px;">dddddd</td> 
<td align="center">&nbsp;</td> 
<td align="center">&nbsp;</td> 
<td align="center">&nbsp;</td> 
 
<td align="center"> 
<a id="samid" class="samclass" href="samfile.aspx?id=11111"><img src="images\samimg.png" style="border-width:0px;" /></a> 
</td> 
</tr>
<tr> 
 
<td style="width:100px;">eeeee</td> 
<td>fffff</td> 
<td>ggggg</td> 
<td style="width:75px;">hhhhh</td> 
<td align="center">&nbsp;</td> 
<td align="center">&nbsp;</td> 
<td align="center">&nbsp;</td> 
 
<td align="center"> 
<a id="samid1" class="samclass" href="samfile.aspx?id=22222"><img src="images\samimg.png" style="border-width:0px;" /></a> 
</td> 
</tr>

Right now, when I run this code:

HtmlElement pageElement = webBrowser1.Document.GetElementById("tableid");
textResult = pageElement.Children[0].InnerText;
this.textBox1.Text = textResult;

I get output like this:

"aaaa     bbbbb      ccccc      ddddd
 eeee     fffff      ggggg      hhhhh"

But I need the output to look like this:

"aaaa     bbbbb      ccccc      ddddd                samfile.aspx?id=11111
 eeee     fffff      ggggg      hhhhh                samfile.aspx?id=22222"

farooqaaa 48 Enthusiast

13 Years Ago

You can get it like this:

The int current holds the index of the current row.

int current = 0;

// Get the row text ("aaaa, bbbbb ....")
HtmlElement pageElement = wb.Document.GetElementById("tableid");
this.textBox1.Text = pageElement.Children[current].InnerText;

// Get the href of the link at the same index
HtmlElement href = pageElement.GetElementsByTagName("a")[current];
MessageBox.Show(href.GetAttribute("href"));

Alexpap 1 Junior Poster

13 Years Ago

I agree with farooqaaa, but there are simpler ways to do this. If you search Google for 'HTML parser C#', you will find some good results.

Hope I helped :)

kvprajapati 1,826 Posting Genius

13 Years Ago

Have you tried http://htmlagilitypack.codeplex.com/ ? I think you have already asked this question, haven't you?

kvprajapati 1,826 Posting Genius

13 Years Ago

>i dont know how to merge htmlagilitypack with webbrowser control

The HtmlDocument.LoadHtml method takes the HTML text as a string. Have a look at this sample.

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
List<string> links = new List<string>();

doc.LoadHtml(textBox1.Text);
foreach (HtmlAgilityPack.HtmlNode nd in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    links.Add(nd.Attributes["href"].Value);
}
foreach (HtmlAgilityPack.HtmlNode nd in doc.DocumentNode.SelectNodes("//img[@src]"))
{
    links.Add(nd.Attributes["src"].Value);
}

kvprajapati 1,826 Posting Genius

13 Years Ago

Download and extract HtmlAgilityPack. Open the project, select Project Menu + Add Reference + Browse + Select HtmlAgilityPack.dll.

khan17 -1 Junior Poster in Training · Answer 1 · 2010-05-12T13:17:07+00:00

Thanks for your reply.

Your code works fine, but all the links from the table end up on a single line in the text file.
I want link1 on the first line, link2 on the second line, and so on.

How can I add a line break after each link?

I need help.
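
One way to get each link on its own line is to join the collected hrefs with Environment.NewLine, or to write them with File.WriteAllLines, instead of concatenating them directly. This is a minimal sketch that assumes the hrefs have already been collected into a List<string>; the class name LinkWriter and its method names are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class LinkWriter
{
    // Joins the links with the platform line break so each one ends up on its own line.
    public static string ToLines(List<string> links)
    {
        return string.Join(Environment.NewLine, links.ToArray());
    }

    // Writes one link per line to a text file; the path is whatever the caller chooses.
    public static void Save(string path, List<string> links)
    {
        File.WriteAllLines(path, links.ToArray());
    }
}
```

If the result is shown in a TextBox, the line breaks only appear when the TextBox has Multiline set to true.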

khan17 -1 Junior Poster in Training · Answer 2 · 2010-05-12T15:44:08+00:00

Thanks for the help so far, but I have another problem. As you can see in my HTML, some columns contain only an empty space.

<tr> 
 
<td style="width:100px;">eeeee</td> 
<td>fffff</td> 
<td>ggggg</td> 
<td style="width:75px;">hhhhh</td> 
<td align="center">&nbsp;</td> 
<td align="center">&nbsp;</td> 
<td align="center">&nbsp;</td> 
 
<td align="center"> 
<a id="samid1" class="samclass" href="samfile.aspx?id=22222"><img src="images\samimg.png" style="border-width:0px;" /></a> 
</td> 
</tr>

I am getting each row's output as a single line without any spaces. For example, the table above gives me this result:

aaaabbbbccccdddddsamfile.aspx?id=11111
eeeeffffgggghhhhhsamfile.aspx?id=22222

But I need the spaces as well. I want the table to look exactly as it does on that website. Is that possible?
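
A possible way to keep the columns apart is to build each row cell by cell and pad every cell to a fixed width, rather than taking the text of the whole row at once. The sketch below assumes the cell values of one row have already been read into a list; RowFormatter, FormatRow, and the column width are hypothetical names and values chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class RowFormatter
{
    // Pads every cell to the same width so the columns line up in a monospaced font.
    public static string FormatRow(IList<string> cells, int columnWidth)
    {
        StringBuilder line = new StringBuilder();
        foreach (string cell in cells)
        {
            // Treat the non-breaking space that comes from &nbsp; as an ordinary space.
            string text = (cell ?? string.Empty).Replace('\u00A0', ' ').Trim();
            line.Append(text.PadRight(columnWidth));
        }
        // Drop the padding after the last non-empty cell.
        return line.ToString().TrimEnd();
    }
}
```

Empty cells still take up a full column, so the link in the last column stays in the same position on every row. The alignment only holds if no cell is longer than the chosen width.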

khan17 -1 Junior Poster in Training · Answer 3 · 2010-05-12T17:14:43+00:00

>Have you tried http://htmlagilitypack.codeplex.com/ ? I think this question already been asked by you. Isn't it?

I am using the WebBrowser control to log in and click certain buttons on that website, sir, and then I have to download the data. The main reason I am not using HtmlAgilityPack is that I don't know how to combine it with the WebBrowser control. Also, when I tried the sample code from the link you provided, I got these errors:

The type 'System.Windows.Forms.HtmlDocument' has no constructors defined
The type or namespace name 'HtmlNode' could not be found (are you missing a using directive or an assembly reference?)
The type or namespace name 'HtmlAttribute' could not be found (are you missing a using directive or an assembly reference?)

I don't know which assembly reference to add to resolve these errors.
Thanks, sir.

khan17 -1 Junior Poster in Training · Answer 4 · 2010-05-13T10:39:51+00:00

Thanks, sir.

When I tried using HtmlAgilityPack with the code sample you gave here, I got this error:

The type or namespace name 'HtmlAgilityPack' could not be found (are you missing a using directive or an assembly reference?)

I also tried to add a reference, but I can't find HtmlAgilityPack anywhere. I am using Visual Studio 2008. It would be helpful if you could guide me on how to reference HtmlAgilityPack.

Thank you.
Good day.

khan17 -1 Junior Poster in Training · Answer 5 · 2010-05-13T11:15:17+00:00

>Download and extract HtmlAgilityPack. Open the project, select Project Menu + Add Reference + Browse + Select HtmlAgilityPack.dll.

Sir, I have downloaded HtmlAgilityPack and added a reference to it. I no longer get any compile errors, but when I debug I get this exception:

NullReferenceException was unhandled: Object reference not set to an instance of an object.

It is thrown in the foreach loop of your code. What should I do?

kvprajapati 1,826 Posting Genius Team Colleague · Answer 6 · 2010-05-13T12:37:51+00:00

Post your ~complete code~ here please.

khan17 -1 Junior Poster in Training · Answer 7 · 2010-05-13T13:05:21+00:00

>Post your ~complete code~ here please.

I am simply trying your code as it is.
The code is:

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Windows.Forms;
using HtmlAgilityPack;

namespace WindowsFormsApplication1
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }

        private void button1_Click(object sender, EventArgs e)
        {
            string textResult="";
            HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
            List<string> links = new List<string>();

            doc.LoadHtml(@"d:\sam.html");
           
          
            foreach (HtmlAgilityPack.HtmlNode nd in doc.DocumentNode.SelectNodes("//a[@href]"))
            {
                links.Add(nd.Attributes["href"].Value);
            }
            foreach (HtmlAgilityPack.HtmlNode nd in doc.DocumentNode.SelectNodes("//img[@src]"))
            {
                links.Add(nd.Attributes["src"].Value);
            }
            foreach (string str in links)
                textResult = textResult + str;
            this.textBox1.Text = textResult;
        }
    }
}

kvprajapati 1,826 Posting Genius Team Colleague · Answer 8 · 2010-05-13T15:21:55+00:00

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
List<string> links = new List<string>();

doc.Load(@"d:\sam.html");

if (doc.DocumentNode == null)
    return;

HtmlAgilityPack.HtmlNodeCollection coll = doc.DocumentNode.SelectNodes("//a[@href]");

if (coll != null)
{
    foreach (HtmlAgilityPack.HtmlNode nd in coll)
    {
        links.Add(nd.Attributes["href"].Value);
    }
}

coll = doc.DocumentNode.SelectNodes("//img[@src]");
if (coll != null)
{
    foreach (HtmlAgilityPack.HtmlNode nd in coll)
    {
        links.Add(nd.Attributes["src"].Value);
    }
}

kim1987 0 Newbie Poster · Answer 9 · 2010-05-13T15:26:20+00:00

Hi,

HTML Table Extractor is an add-in for Internet Explorer (IE) allowing you to extract tables from web pages in an effective and quick manner. HTML Table Extractor also allows you to select tabular data online and easily convert it into files for Microsoft Excel. This Internet Explorer add-in makes it possible to find and extract data from tables into a new window, or to look through the HTML code of the selected table. With this add-in, you will have access to any tables on any web pages, on the Internet or on your hard disk.
This software can be used to extract text elements that are arranged in a certain order on some page on the Internet. HTML Table Extractor is a convenient system for collecting tabular data because it collects this data in an effective way focusing on the particular elements of a web page. This add-in can also be useful when you need to quickly explore the internal structure of a web page because most web developers use HTML tables to position data displayed on pages.

Thanks
David

khan17 -1 Junior Poster in Training · Answer 10 · 2010-05-13T15:37:55+00:00

Thank you, sir.

Now the code works fine. It extracts the links and images on the page, but it also displays the extracted items on a single line without any spaces.

I actually did the same thing with the WebBrowser control, and there too the results are on a single line without any spaces.

My problem is that I need spaces between the columns of the table data extracted from the website.
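
With HtmlAgilityPack, one possible approach is to walk the table row by row and cell by cell, so that each row becomes one line and each cell is separated from the next. The sketch below is only an illustration under some assumptions: the table has the id "tableid" as in the original question, the HTML is passed in as a string (for example from webBrowser1.DocumentText), and the class TableRowReader and its method are hypothetical names.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class TableRowReader
{
    // Returns one tab-separated line per table row.
    // A cell that contains a link contributes the link's href instead of its text.
    public static List<string> ReadRows(string html, string tableId)
    {
        HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(html);

        List<string> lines = new List<string>();
        HtmlNodeCollection rows =
            doc.DocumentNode.SelectNodes("//table[@id='" + tableId + "']//tr");
        if (rows == null)
            return lines; // SelectNodes returns null when nothing matches

        foreach (HtmlNode row in rows)
        {
            HtmlNodeCollection cells = row.SelectNodes("./td");
            if (cells == null)
                continue; // skip rows without td cells

            List<string> values = new List<string>();
            foreach (HtmlNode cell in cells)
            {
                HtmlNode link = cell.SelectSingleNode(".//a[@href]");
                if (link != null)
                {
                    values.Add(link.GetAttributeValue("href", string.Empty));
                }
                else
                {
                    // Decode entities such as &nbsp; and treat them as plain spaces.
                    string text = HtmlEntity.DeEntitize(cell.InnerText);
                    values.Add(text.Replace('\u00A0', ' ').Trim());
                }
            }
            lines.Add(string.Join("\t", values.ToArray()));
        }
        return lines;
    }
}
```

The tab separator is an arbitrary choice here; any other separator or a fixed column width would work the same way.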

khan17 -1 Junior Poster in Training · Answer 11 · 2010-05-13T15:40:23+00:00

Thanks for replying, David.

I am a student, and I am doing this project as part of my curriculum. Since I am doing the project for a company, I can't use such tools. I have to do it programmatically.

Once again, thanks for your interest.

khan17 -1 Junior Poster in Training · Answer 12 · 2010-05-14T12:55:47+00:00

How can I extract particular columns from all rows of a table?
For example, if I have a table like this in an HTML page:

no name age address
1  aa   11  xxxxx
2  bb   13  yyyyy
3  cc   15  zzzzz

the output I need is only the name and address, like this:

name address
aa   xxxxx
bb   yyyyy
cc   zzzzz

How can I do that?
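
One way to pick out particular columns is to read the cells of each row into a collection and keep only the wanted indexes. This is a rough sketch using HtmlAgilityPack, assuming the table can be found by its id and the HTML is available as a string; ColumnExtractor, Extract, and the zero-based column indexes are illustrative names and conventions, not part of any library.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class ColumnExtractor
{
    // For every row, returns only the cells at the requested zero-based column indexes.
    public static List<string[]> Extract(string html, string tableId, int[] columns)
    {
        HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(html);

        List<string[]> result = new List<string[]>();
        HtmlNodeCollection rows =
            doc.DocumentNode.SelectNodes("//table[@id='" + tableId + "']//tr");
        if (rows == null)
            return result; // no such table, or the table has no rows

        foreach (HtmlNode row in rows)
        {
            // Header cells (th) and data cells (td) are both counted as columns.
            HtmlNodeCollection cells = row.SelectNodes("./th|./td");
            if (cells == null)
                continue;

            string[] picked = new string[columns.Length];
            for (int i = 0; i < columns.Length; i++)
            {
                int index = columns[i];
                // A row that is shorter than expected yields an empty string.
                picked[i] = index < cells.Count
                    ? cells[index].InnerText.Trim()
                    : string.Empty;
            }
            result.Add(picked);
        }
        return result;
    }
}
```

For the example table, passing new int[] { 1, 3 } as the columns would select name and address, since the indexes start at zero.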
