Finding links with regex

Hello... I'm working on a program that gets some info about a certain web page. One of its features needs to find all the links to other pages on THE SAME site (I think it's called an internal link... not sure). I already figured out how to get the page's HTML code; now I just need to know how to find stuff like "" in the HTML code and then output "http://www.daniweb.com/members/AndyPants/919208".

Thanks in advance!

AndyPants
Light Poster
35 posts since Oct 2011
Reputation Points: 15
Solved Threads: 1
 

Here's an example.
I massage the data a little before running it through the Regex: the page is split at every "<" so that each tag ends up in its own string. How you will eventually modify this will depend on how the server delivers the HTML page.

Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Linq
Imports System.Net
Imports System.Text.RegularExpressions

Module DW_396137
   Sub Main()
      Dim strURL As String = "http://www.daniweb.com/certificates/stats.php"
      ' Captures the relative member link, e.g. /members/AndyPants/919208
      Dim rxMemberLinks As New Regex("href=.(?<mem_link>/members/.{1,20}/\d+).>")
      Dim lst_strFullLinks As New List(Of String)

      Try
         ' Using blocks release the connection and stream even if an exception is thrown.
         Using wc As New WebClient()
            Using fileIn As New StreamReader(wc.OpenRead(strURL))
               ' VB does not treat "\n" as an escape sequence, so split on "<" directly
               ' to put each tag in its own string.
               lst_strFullLinks = fileIn.ReadToEnd() _
                  .Split({"<"c}, StringSplitOptions.RemoveEmptyEntries) _
                  .Where(Function(s) rxMemberLinks.IsMatch(s)) _
                  .Select(Function(s) "http://www.DaniWeb.com" & rxMemberLinks.Match(s).Groups("mem_link").Value) _
                  .ToList()
            End Using
         End Using

      Catch exc As Exception
         Console.WriteLine("Exception: " & exc.Message)
         Return
      End Try

      ' Print the links that were found.
      lst_strFullLinks.ForEach(Sub(s) Console.WriteLine(s))
      Console.WriteLine("Finished")
   End Sub
End Module
thines01
Postaholic
Team Colleague
2,425 posts since Oct 2009
Reputation Points: 445
Solved Threads: 402
 

Here's a more general-purpose regular expression: href=(['"])(?!.+://)(?<url>.+?)\1

It finds hrefs to relative URLs; from your example, they seem to be what you're looking for.

gusano79
Posting Pro
521 posts since May 2004
Reputation Points: 182
Solved Threads: 77
 

Sorry it took me so long... What you gave me was exactly what I needed, but VB.NET gives me an error because of the quotation mark in the regular expression... Could you please fix that? Much appreciated.

AndyPants
You should do some reading on the String data type:

If you must include a quotation mark as one of the characters in the string, you use two contiguous quotation marks ("").

It looks like this: "href=(['""])(?!.+://)(?<url>.+?)\1"
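
As a minimal sketch of how that string might be used, the snippet below runs the pattern over some made-up markup and turns each relative match into a full address with System.Uri. The module name, the FindRelativeLinks helper, and the sample HTML are assumptions for illustration only; they are not taken from the original program.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module RelativeLinkSketch
   ' Hypothetical helper: returns every relative href found in html,
   ' resolved against baseUrl.
   Function FindRelativeLinks(html As String, baseUrl As String) As List(Of String)
      ' The quotation mark inside the character class is doubled for VB.
      Dim rxHref As New Regex("href=(['""])(?!.+://)(?<url>.+?)\1")
      Dim baseUri As New Uri(baseUrl)
      Dim links As New List(Of String)

      For Each m As Match In rxHref.Matches(html)
         ' Uri combines the base address with the relative path.
         links.Add(New Uri(baseUri, m.Groups("url").Value).ToString())
      Next

      Return links
   End Function

   Sub Main()
      ' Made-up sample markup, one tag per line.
      Dim html As String =
         "<a href=""/members/AndyPants/919208"">AndyPants</a>" & vbCrLf &
         "<a href='http://www.example.com/'>elsewhere</a>"

      For Each link As String In FindRelativeLinks(html, "http://www.daniweb.com/")
         Console.WriteLine(link)
      Next
   End Sub
End Module
```

With this sample input, only the first link should be printed, as http://www.daniweb.com/members/AndyPants/919208. One thing to watch: the (?!.+://) lookahead scans to the end of the current line, so a relative link that is followed by an absolute one on the same line would be skipped. Splitting the page into one tag per string first, as in the earlier reply, avoids that.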

gusano79
