Hello... I'm working on a program that gets some info about a certain web page. One of its features needs to find all the links to other pages on THE SAME site (I think that's called an internal link... not sure). I already figured out how to get the page's HTML code; now I just need to know how to find things like "<a href="/members/AndyPants/919208">" in the HTML and then output "http://www.daniweb.com/members/AndyPants/919208".

Thanks in advance!

Here's one way to do it.
I massage the data a little before running it through the Regex. How you eventually modify this will depend on how the server delivers the HTML page.

Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Linq
Imports System.Net
Imports System.Text.RegularExpressions

Module DW_396137
   Sub Main()
      Dim strURL As String = "http://www.daniweb.com/certificates/stats.php"
      Dim wc As New WebClient()
      Dim rxMemberLinks As New Regex("href=.(?<mem_link>/members/.{1,20}/\d{1,}).>")
      Dim lst_strFullLinks As New List(Of String)

      Try
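         ' Using closes the reader even if the download or read throws.
         Using fileIn As New StreamReader(wc.OpenRead(strURL))
            ' Split on "<" so each tag lands in its own chunk. VB.NET has no "\n" escape:
            ' "\n" is a literal backslash plus n, and splitting on it cuts names containing "n".
            fileIn.ReadToEnd() _
               .Split(New Char() {"<"c}, StringSplitOptions.RemoveEmptyEntries) _
               .Where(Function(s) rxMemberLinks.IsMatch(s)) _
               .ToList() _
               .ForEach(Sub(s) lst_strFullLinks.Add("http://www.DaniWeb.com" & rxMemberLinks.Match(s).Groups("mem_link").Value))
         End Using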

      Catch exc As Exception
         Console.WriteLine("Exception: " + exc.Message)
         Return
      End Try

      ' Print the collected links, one per line.
      lst_strFullLinks.ForEach(Sub(s) Console.WriteLine(s))
      Console.WriteLine("Finished")
   End Sub
End Module

Here's a more general-purpose regular expression:

href=(['"])(?!.+://)(?<url>.+?)\1

It finds hrefs to relative URLs; from your example, those seem to be what you're looking for.

Sorry it took me so long... What you gave me was exactly what I needed, but VB.NET gives me an error because of the quotation mark in the regular expression... Could you please fix that? Much appreciated.

You should do some reading on the String data type:

If you must include a quotation mark as one of the characters in the string, you use two contiguous quotation marks ("").

It looks like this: "href=(['""])(?!.+://)(?<url>.+?)\1"
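
To put that escaped pattern to work, here is a minimal sketch of how it could turn relative hrefs into full URLs. The module name, the FindInternalLinks helper, and the sample HTML are made up for illustration and are not from the posts above. It also resolves each match with System.Uri instead of gluing strings together, which is my substitution rather than something suggested in this thread.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module InternalLinkSketch
   ' Hypothetical helper: returns the absolute form of every relative href
   ' found in html, resolved against the address of the page it came from.
   Function FindInternalLinks(ByVal html As String, ByVal pageUrl As String) As List(Of String)
      ' Same pattern as above; the doubled "" is one " inside a VB.NET string literal.
      Dim rxRelative As New Regex("href=(['""])(?!.+://)(?<url>.+?)\1")
      Dim baseUri As New Uri(pageUrl)
      Dim links As New List(Of String)

      For Each m As Match In rxRelative.Matches(html)
         ' Uri(baseUri, relative) combines the site address with the relative path.
         Dim absolute As New Uri(baseUri, m.Groups("url").Value)
         ' Skip links that were already collected.
         If Not links.Contains(absolute.AbsoluteUri) Then links.Add(absolute.AbsoluteUri)
      Next

      Return links
   End Function

   Sub Main()
      ' Made-up sample HTML: one relative link and one external link, on separate lines.
      Dim html As String = "<a href=""/members/AndyPants/919208"">AndyPants</a>" & _
                           Environment.NewLine & _
                           "<a href='http://example.com/elsewhere'>external</a>"

      For Each link As String In FindInternalLinks(html, "http://www.daniweb.com/certificates/stats.php")
         Console.WriteLine(link)
      Next
   End Sub
End Module
```

With the sample above this prints http://www.daniweb.com/members/AndyPants/919208 and skips the external link. One thing to watch: the (?!.+://) lookahead scans to the end of the line, so a relative href is also skipped when any "://" appears later on the same line. Splitting the HTML into one tag per chunk first, as in the earlier reply, avoids that.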
