I recently wrote a program that gets the links from a web page and then checks whether they are valid. I used HtmlElement to read the "HREF" attribute and stored all the links in a list. I also tried HttpWebRequest, reading the response stream into a string and parsing the links out of it. Both approaches seem to work, but neither gives me all the links. I guess the missing ones are generated by scripts.

The only way I could get all of the links was to open the page in Firefox, view the page source, save it to a text file, and then have my program read that file into a string and parse the links out of it.

What I need to know is whether this can be done automatically, and if so, how. I'm just using a WebBrowser control. I know it has several built-in methods that let you refresh a page and so on, but is there a way to view and grab the page source?

Thanks for the reply, adatapost, but I'm already using that and it doesn't give me all the links. I tried it using this example I found:

private void button1_Click(object sender, EventArgs e)
{
    // used to build entire input
    StringBuilder sb = new StringBuilder();

    // used on each read operation
    byte[] buf = new byte[8192];

    // prepare the web page we will be asking for
    HttpWebRequest request = (HttpWebRequest)
        WebRequest.Create("http://www.google.com");

    // execute the request; the using blocks release the
    // connection and the stream when we are done
    using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
    using (Stream resStream = response.GetResponseStream())
    {
        int count;

        do
        {
            // fill the buffer with data
            count = resStream.Read(buf, 0, buf.Length);

            // make sure we read some data
            if (count != 0)
            {
                // translate from bytes to ASCII text
                // (non-ASCII bytes come out as '?')
                string tempString = Encoding.ASCII.GetString(buf, 0, count);

                // continue building the string
                sb.Append(tempString);
            }
        }
        while (count > 0); // any more data to read?
    }

    // print out page source
    textBox1.Text = sb.ToString();
}

string url = "http://www.google.com";
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
StreamReader reader = new StreamReader(response.GetResponseStream());
string output = reader.ReadToEnd();

That's a quick way to get your content. As for parsing it out, you can probably look into regular expressions or, as you said, simply doing it yourself and finding href="blah" and extracting the blah.
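
As a rough illustration of the regular-expression route, here is a minimal sketch. The class name LinkExtractor, the method ExtractHrefs, and the sample HTML are made up for this example, and the pattern only handles quoted href values, so treat it as a starting point rather than a full HTML parser.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not from the original thread.
static class LinkExtractor
{
    // Matches href="..." or href='...' (case-insensitive) and
    // captures the quoted value in the group named "url".
    private static readonly Regex HrefPattern = new Regex(
        @"href\s*=\s*(?:""(?<url>[^""]*)""|'(?<url>[^']*)')",
        RegexOptions.IgnoreCase);

    public static List<string> ExtractHrefs(string html)
    {
        var links = new List<string>();
        foreach (Match m in HrefPattern.Matches(html))
        {
            links.Add(m.Groups["url"].Value);
        }
        return links;
    }
}

static class Program
{
    static void Main()
    {
        // Sample input invented for the example.
        string html = "<a href=\"http://example.com/a\">A</a> <a HREF='/b'>B</a>";

        foreach (string link in LinkExtractor.ExtractHrefs(html))
        {
            Console.WriteLine(link);
        }
    }
}
```

Unquoted attribute values and hrefs assembled inside script code would be missed by this pattern, which is one reason a simple text scan can come up short.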

Thanks, but I have already parsed the links out and everything. Links within JavaScript won't show up. The only way I can get to these links is to manually view the source in IE or Firefox.

I'm afraid I'm missing something. There should be nothing different between the source you pull down from the server and the source you see via the browser. Links created by JavaScript aren't going to be in either one, only the code that creates them (which may very well include a legible URL). JavaScript isn't going to change the source HTML, though, only the in-memory HTML that the browser is rendering.
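
If the goal is to read that in-memory HTML rather than the original source, one possible approach with the WinForms WebBrowser control is to wait for DocumentCompleted and then walk the live document. This is only a sketch: the form name LinkForm and the URL are placeholders, and links that a script adds after the event fires would still be missed.

```csharp
using System;
using System.Windows.Forms;

// Hypothetical form, not from the original thread.
public class LinkForm : Form
{
    private readonly WebBrowser browser = new WebBrowser { Dock = DockStyle.Fill };

    public LinkForm()
    {
        Controls.Add(browser);
        browser.ScriptErrorsSuppressed = true;
        browser.DocumentCompleted += OnDocumentCompleted;
        browser.Navigate("http://www.google.com"); // placeholder URL
    }

    // Note: this event can fire more than once on pages that contain frames.
    private void OnDocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
        HtmlDocument doc = browser.Document;
        if (doc == null || doc.Body == null)
        {
            return;
        }

        // Links as they exist in the browser's current in-memory document.
        foreach (HtmlElement link in doc.Links)
        {
            Console.WriteLine(link.GetAttribute("href"));
        }

        // Markup of the current document body, as opposed to
        // browser.DocumentText, which holds the page contents as loaded.
        string renderedHtml = doc.Body.OuterHtml;
        Console.WriteLine(renderedHtml.Length);
    }

    [STAThread]
    private static void Main()
    {
        Application.EnableVisualStyles();
        Application.Run(new LinkForm());
    }
}
```

Comparing the links found this way with the ones parsed from the HttpWebRequest output should show whether the missing links really are created by script.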
