get ALL hyperlinks from ANY page

Question

sfrider0 6 Junior Poster

15 Years Ago

I recently wrote a program that gets the links from a webpage and then checks whether they are valid. I used HtmlElement to read the "HREF" attribute and stored all the links in a list. I also tried using HttpWebRequest, reading the response stream into a string and parsing the links out of it. Both approaches seem to work, but neither gives me all the links. I guess the missing ones are generated by script.

The only way I could get all of the links was to open the page in Firefox, choose View Page Source, save the source to a .txt file, and then have my program read the file into a string and parse the links out of that.

What I need to know is whether this can be done automatically, and if so, how. I'm just using a WebBrowser control. I know there are several built-in methods that let you refresh a page and so on, but is there a way to view and grab the page source?

file-stream open-source perl


All 5 Replies

kvprajapati 1,826 Posting Genius

15 Years Ago

>is there a way to view and grab the page source?

HttpWebRequest.

apegram 302 LINQ!

15 Years Ago

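// Requires: using System.IO; using System.Net;
string url = "http://www.google.com";
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
string output;
// Dispose the response and reader so the connection is released.
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
using (StreamReader reader = new StreamReader(response.GetResponseStream()))
{
    output = reader.ReadToEnd();
}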

That's a quick way to get your content. As for parsing it out, you can probably look into regular expressions or, as you said, simply doing it yourself and finding href="blah" and extracting the blah.
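
For the regular-expression route, here is a minimal sketch, assuming the page source is already in a string such as output above. The names HrefExtractor and Extract are made up for illustration, and a regex is not a real HTML parser: this pattern only handles quoted href values and will also pick up href attributes on non-anchor tags such as link.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original thread.
public static class HrefExtractor
{
    // Matches href="..." or href='...' and captures the value in the group "url".
    // .NET allows the same group name in both branches of the alternation.
    private static readonly Regex HrefPattern = new Regex(
        @"href\s*=\s*(?:""(?<url>[^""]*)""|'(?<url>[^']*)')",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Returns every quoted href value found in the given HTML, in document order.
    public static List<string> Extract(string html)
    {
        var links = new List<string>();
        foreach (Match match in HrefPattern.Matches(html))
        {
            links.Add(match.Groups["url"].Value);
        }
        return links;
    }
}
```

Calling HrefExtractor.Extract(output) would then give the list of href values to validate. Relative values such as "/about" still need to be resolved against the page URL, for example with new Uri(baseUri, relative).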


sfrider0 6 Junior Poster · Answer 1 · 2010-02-03T20:36:21+00:00

Thanks for the reply, adatapost, but I'm already using that and it doesn't give me all the links. I tried it using this example I found:

private void button1_Click(object sender, EventArgs e)
{
    // used to build entire input
    StringBuilder sb = new StringBuilder();

    // used on each read operation
    byte[] buf = new byte[8192];

    // prepare the web page we will be asking for
    HttpWebRequest request = (HttpWebRequest)
        WebRequest.Create("http://www.google.com");

    // execute the request; the using blocks release the connection when done
    using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
    // we will read data via the response stream
    using (Stream resStream = response.GetResponseStream())
    {
        int count;

        do
        {
            // fill the buffer with data
            count = resStream.Read(buf, 0, buf.Length);

            // make sure we read some data
            if (count != 0)
            {
                // translate from bytes to ASCII text
                // (non-ASCII bytes are decoded as '?')
                sb.Append(Encoding.ASCII.GetString(buf, 0, count));
            }
        }
        while (count > 0); // any more data to read?
    }

    // print out page source
    textBox1.Text = sb.ToString();
}

sfrider0 6 Junior Poster · Answer 2 · 2010-02-03T23:19:25+00:00

>string url = "http://www.google.com";
>HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
>HttpWebResponse response = (HttpWebResponse)request.GetResponse();
>StreamReader reader = new StreamReader(response.GetResponseStream());
>string output = reader.ReadToEnd();
>
>That's a quick way to get your content. As for parsing it out, you can probably look into regular expressions or, as you said, simply doing it yourself and finding href="blah" and extracting the blah.

Thanks, but I have already parsed the links out and everything. Links within JavaScript won't show up. The only way I can get to these links is to manually view the source in IE or Firefox.

apegram 302 LINQ! Team Colleague · Answer 3 · 2010-02-03T23:51:04+00:00

>Thanks, but I have already parsed the links out and everything. Links within JavaScript won't show up. The only way I can get to these links is to manually view the source in IE or Firefox.

I'm afraid I'm missing something. There should be nothing different in the source you would pull down from the server versus what you would see via View Source in the browser. Links created by JavaScript aren't going to be in either source, only the manner of their creation (which may very well include a legible URL). JavaScript isn't going to change the source HTML, though, only the in-memory HTML that the browser is rendering.
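
To illustrate the "legible URL" point, here is a rough sketch that scans the raw source for absolute URLs written out inside script blocks. The names ScriptUrlScanner and FindUrlsInScripts are hypothetical, and the approach is an assumption about what might be useful here, not a complete solution: it cannot find URLs that a script assembles at run time through concatenation or an AJAX call. Those exist only in the rendered document, which the WebBrowser control exposes through its Document property once DocumentCompleted has fired.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original thread.
public static class ScriptUrlScanner
{
    // Isolates the body of each <script>...</script> element.
    private static readonly Regex ScriptBlock = new Regex(
        @"<script\b[^>]*>(?<body>.*?)</script>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Absolute http(s) URLs, stopping at whitespace, quotes, or angle brackets.
    private static readonly Regex UrlLiteral = new Regex(
        @"https?://[^\s""'<>]+",
        RegexOptions.IgnoreCase);

    // Returns the distinct URL literals found inside script blocks, in document order.
    public static List<string> FindUrlsInScripts(string html)
    {
        var urls = new List<string>();
        foreach (Match script in ScriptBlock.Matches(html))
        {
            foreach (Match url in UrlLiteral.Matches(script.Groups["body"].Value))
            {
                if (!urls.Contains(url.Value))
                {
                    urls.Add(url.Value);
                }
            }
        }
        return urls;
    }
}
```

Merging these results with the href values parsed from the markup covers links that are spelled out in full inside a script, which is the case where the URL is in the downloaded source but not in any href attribute.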
