Hi, I'm writing a web crawling program for my personal site, and I'm looking at using regex to extract the URLs. However, I have both absolute and relative URLs, and I want to match URLs only on my site (mysite.com).

So it would match:

/index.php
image1.jpg
page1.html
Http://mysite.com/
Http://mysite.com/page1.html
Http://Wiki.mysite.com/
Wiki.mysite.com/

but it wouldn't match:

Bob
Www.google.com
Mailto:Admin@mysite.com

Can anyone give me assistance? I'd post what I have so far, but it is this:

Nothing.

In every case, you will have "<a href=[^>]*>" but that just finds all links. I'm not sure I agree with your list of matches and non-matches. You can spend quite a while digging through, for instance, the Relative Uniform Resource Locators RFC. For my take, then:

I think this matches all the fully qualified URLs for 'mysite.com':
http://[optional.]mysite.com[/optional] (where the constant "http:" part may be in any case)

I think anything that does not start with a scheme ("http:" for http links) is considered relative; and if I understand the RFC, it is relative to a "BASE URL" which is possibly specified in the header of the html doc; or derived by a heuristic in the course of handling the http request.

Subdomains ("wiki.mysite.com", "FAQ.mysite.com", etc) are, if I understand correctly, only accessible from the root domain or other subdomains as a fully qualified URL that starts with 'http:'. Maybe. And it may matter what your web server is configured to do.
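To make the relative-versus-absolute distinction concrete, here is a minimal Python sketch using the standard library's `urljoin`. The base URL is a made-up example (an assumption, not something from the original post); in a real crawler it would be the address of the page the link was found on, or whatever the page declares as its base.

```python
from urllib.parse import urljoin

# Hypothetical base URL: the page on which the links were found.
BASE_URL = "http://mysite.com/docs/page1.html"

def resolve(href, base=BASE_URL):
    """Resolve a link against the base URL; absolute URLs pass through unchanged."""
    return urljoin(base, href)

print(resolve("/index.php"))               # http://mysite.com/index.php
print(resolve("image1.jpg"))               # http://mysite.com/docs/image1.jpg
print(resolve("http://wiki.mysite.com/"))  # http://wiki.mysite.com/
# Without a scheme, a subdomain is just another relative path:
print(resolve("Wiki.mysite.com/"))         # http://mysite.com/docs/Wiki.mysite.com/
```

The last line shows why a link like "Wiki.mysite.com/" probably does not do what its author intended: with no scheme, it is resolved as a path under the current directory.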

Anyway, if I'm right, you want to match "http://.*mysite.com.*" for fully qualified references (strictly, the dots in the domain should be escaped as "\." so they match only a literal dot).
And you also want to match anything at all that doesn't start with ".*://". This is not ideally suited to a regex, because 'anything but' is hard unless you are specifying a single character. Since you are writing a program to do this, you can handle it more easily using if/else:

if the_url_string.matches(starts_with_http_case_insensitive) then
  if the_url_string.matches(contains_my_site_case_insensitive) then
    found_match(the_url_string) # fully qualified URL
  endif
else
  if the_url_string.matches(starts_with_a_scheme) then
    # not a match
  else
    found_match(the_url_string) # relative URL
  endif
endif
endif
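
The pseudocode above could be sketched in Python roughly as follows. The pattern names and the `is_internal` function are illustrative assumptions, not anything from a particular library. Two details go slightly beyond the pseudocode and are assumptions as well: `https` is accepted alongside `http`, and the domain check is anchored to the host part rather than a plain "contains".

```python
import re

SITE = "mysite.com"  # the domain from the question

# Assumed patterns; all are matched case-insensitively from the start of the string.
STARTS_WITH_HTTP = re.compile(r"https?://", re.IGNORECASE)
# Host is SITE itself or any subdomain of it, followed by a delimiter or end of string.
ON_SITE = re.compile(
    r"https?://([^/?#;:@]+\.)?" + re.escape(SITE) + r"(?=[/?#;:]|$)",
    re.IGNORECASE,
)
# Any scheme prefix such as "mailto:" or "ftp:".
HAS_SCHEME = re.compile(r"[a-z][a-z0-9+.-]*:", re.IGNORECASE)

def is_internal(url):
    """Return True for on-site absolute URLs and for all relative URLs."""
    if STARTS_WITH_HTTP.match(url):
        return ON_SITE.match(url) is not None  # fully qualified URL
    if HAS_SCHEME.match(url):
        return False                           # some other scheme: not a match
    return True                                # no scheme: relative URL

for link in ["/index.php", "Http://Wiki.mysite.com/", "Mailto:Admin@mysite.com",
             "http://www.google.com/", "Www.google.com"]:
    print(link, is_internal(link))
```

Note that "Www.google.com" and "Bob" come out as matches here, because without a scheme they are relative references. That is the disagreement with the original list mentioned earlier.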

Be aware that case-insensitive matching officially applies only to the scheme and domain, that is, to whatever comes before the first '/', ';', '#' or '?' that follows the domain. In your case, I doubt it matters: you are looking internally, so if the spider fails (loudly), that is actually a good thing, because it tells you about a badly spelled or otherwise broken link. Which brings me to another point: you might want to collect external links, attempt to follow them only one step, and report failures, so as to keep your own served pages current with respect to the rest of the web.
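
As a small illustration of the case-sensitivity point, Python's `urlsplit` lowercases the scheme and host but leaves the path alone. The `same_site` helper is a hypothetical name used only for this sketch.

```python
from urllib.parse import urlsplit

def same_site(url, site="mysite.com"):
    """True if the URL's host is `site` or one of its subdomains (case-insensitive)."""
    host = urlsplit(url).hostname or ""  # hostname is already lowercased, or None
    return host == site or host.endswith("." + site)

parts = urlsplit("HTTP://Wiki.MySite.com/Page1.html")
print(parts.scheme)    # http
print(parts.hostname)  # wiki.mysite.com
print(parts.path)      # /Page1.html  (case preserved)

print(same_site("Http://Wiki.mysite.com/"))   # True
print(same_site("http://www.google.com/"))    # False
```

A helper like this could also be used to sort links into internal and external piles, if you decide to check the external ones one step deep.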

Edited by griswolf: n/a

Thanks for your reply, you've given me some food for thought. Anyway, I have a fairly decent one going:

href="(((Http://)?((.*\.)?mysite\.com))?(/)?.*\.php[^"&;]*)

Unfortunately, there's a major problem. It uses the href to detect a link, but returns it in the match, creating

href="index.php

from

href="index.php"

So for now, I'll just use a second regex replace to strip out the href=" prefix.

Also, it doesn't return any other file extensions, but they could be included with a "(.*)".
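
One way to avoid the second replace step, sketched here in Python as an assumption about the language in use: keep the href=" part in the pattern so it still anchors the match, but put only the URL inside a capture group. With a single group, `re.findall` returns just the group contents. The pattern below is deliberately simplified (double-quoted attributes only, any file extension) and `extract_hrefs` is a hypothetical name.

```python
import re

# href=" anchors the match, but only the part in parentheses is returned.
HREF = re.compile(r'href="([^"]*)"', re.IGNORECASE)

def extract_hrefs(html):
    """Return the values of all double-quoted href attributes in the text."""
    return HREF.findall(html)

sample = '<a href="index.php">Home</a> <a href="Http://mysite.com/page1.html">Page 1</a>'
print(extract_hrefs(sample))  # ['index.php', 'Http://mysite.com/page1.html']
```

The extracted values could then be filtered for on-site links in a separate step, which keeps each regex simple.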

Edited by Xcelled194: n/a
