
Extract URL

How can I extract URLs from a web page, keeping only the URLs from one specific site, such as "www.abc.com/32432/file.zip"? It should search for abc.com, and the extension can be zip, rar, or 001.
Any help?

xor83
Newbie Poster
8 posts since Feb 2007
Reputation Points: 10
Solved Threads: 0
 

In Firefox on Windows, you would put http://www.abc.com into the address bar and, once the site has loaded, press CTRL-U to bring up the source, then CTRL-C/CTRL-V whatever URLs you want. :twisted:

I think the concept you're looking for is a website scraper. There are a lot of different options for doing this, from regular expressions to XPath, which is one of my personal favorites.

Come back with some conceptual code and I'll be more than happy to help you work through it.

mschroeder
Work Harder
Team Colleague
666 posts since Jul 2008
Reputation Points: 279
Solved Threads: 131
 

If you want to extract the URLs from the page, I have an existing script that extracts not only links to other pages but also links to pictures and other media. My script is as follows:

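// Note: despite the parameter name, the splitting below only makes sense
// when $url holds the page's HTML source rather than its address.
function getlinks($url) {
    // Split the source at every href= or src= attribute, quoted or not.
    $media=preg_split('/(href\=\"|href\=\'|href\=|src\=\"|src\=\'|src\=)/i',$url);
    // Cut each piece at the first quote or '>' so only the attribute value remains.
    $media=preg_replace("/([^\'])\'(.*)/is",'$1',$media);
    $media=preg_replace("/([^\"])\"(.*)/is",'$1',$media);
    $media=preg_replace("/([^\>])\>(.*)/is",'$1',$media);
    $media=preg_replace("/([^\'])\'(.*)/i",'$1',$media);
    $media=preg_replace("/([^\"])\"(.*)/i",'$1',$media);
    $media=preg_replace("/([^\>])\>(.*)/i",'$1',$media);
    // Cut at a space that is followed by a digit, quote, '>' or '/'.
    $media=preg_replace("/([^ ])\ [0-9\'\"\>\/](.*)/is",'$1',$media);
    $media=@preg_replace("/([^ ])\ [0-9\'\"\>\/](.*)/i",'$1',$media);
    // File extension of each entry (computed here but never returned).
    $mediaext=preg_replace("/.*[.]([^.]+)/",'$1',$media);
    return $media;
    }
//above function returns an array; its first element is the text before the first attribute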

It may be badly written, but it does the job. I shall see if I can write a preg_match version.

=======================
Edit:
I have now written a function that will extract the links more efficiently and is as follows:

<?php
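function getlinks($url) {
    // Download the page source.
    $data=file_get_contents($url);
    // Collect every quoted href="..." or src="..." attribute.
    preg_match_all('/(href|src)\=(\"|\')[^\"\'\>]+/i',$data,$media);
    unset($data);
    // Strip the leading href=" or src=" so only the URL remains.
    $data=preg_replace('/(href|src)(\"|\'|\=\"|\=\')(.*)/i',"$3",$media[0]);
    return $data;
    }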

//now to use the function
echo "<xmp>";
var_dump(getlinks('http://www.google.com.au'));
echo "</xmp>";
?>

As you can see, the function returns an array of the links.
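For the filtering part of the original question (only abc.com, only zip, rar, or 001), something along these lines might work on the array that getlinks() returns. This is only a sketch and was not posted in the thread: filterlinks() and its parameters are made-up names, and it assumes that links without a scheme, such as www.abc.com/32432/file.zip, begin with a host name.

```php
<?php
// Hypothetical helper (not from the thread): keep only the links that sit on
// $host (or one of its subdomains) and whose path ends in one of $extensions.
function filterlinks(array $links, string $host, array $extensions): array {
    $kept = [];
    foreach ($links as $link) {
        $candidate = $link;
        // parse_url() only recognises a host after "scheme://" or "//",
        // so assume a bare link starts with a host name.
        if (!preg_match('~^([a-z][a-z0-9+.\-]*:)?//~i', $candidate)) {
            $candidate = 'http://' . $candidate;
        }
        $parts = parse_url($candidate);
        if ($parts === false || !isset($parts['host'], $parts['path'])) {
            continue; // relative or malformed link: nothing to compare against
        }
        $linkhost = strtolower($parts['host']);
        // Accept abc.com itself and subdomains such as www.abc.com.
        $hostok = $linkhost === $host
            || substr($linkhost, -strlen($host) - 1) === '.' . $host;
        $ext = strtolower(pathinfo($parts['path'], PATHINFO_EXTENSION));
        if ($hostok && in_array($ext, $extensions, true)) {
            $kept[] = $link; // return the link exactly as it was found
        }
    }
    return $kept;
}
```

Usage would be something like filterlinks(getlinks('http://www.abc.com'), 'abc.com', ['zip', 'rar', '001']). Relative links are skipped here because they carry no host to check.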

cwarn23
Occupation: Genius
Team Colleague
3,033 posts since Sep 2007
Reputation Points: 413
Solved Threads: 259
 

Alright, since you responded with a great example of how to do it with regular expressions, I guess I can provide an XPath example using the DOM, as I mentioned previously.

<?php
$sUrl = 'http://www.google.com';

$oDom = new DOMDocument();
// @ suppresses the warnings that malformed real-world HTML would otherwise trigger.
@$oDom->loadHTMLFile( $sUrl );

$oXpath = new DOMXPath($oDom);

// Could also be //@href | //@src; I just think the query used here gives you finer control over the result set.
$oRes = $oXpath->query("//a/@href | //img/@src | //script/@src");

$i=0;
foreach($oRes as $h1) {
	// Print one URL per line.
	echo $h1->nodeValue . "\n";
	$i++;
}

echo $i.' urls found in page.';
?>
http://images.google.com/imghp?hl=en&tab=wi
http://maps.google.com/maps?hl=en&tab=wl
http://news.google.com/nwshp?hl=en&tab=wn
http://video.google.com/?hl=en&tab=wv
http://mail.google.com/mail/?hl=en&tab=wm
http://www.google.com/intl/en/options/
http://www.google.com/prdhp?hl=en&tab=wf
http://groups.google.com/grphp?hl=en&tab=wg
http://books.google.com/bkshp?hl=en&tab=wp
http://scholar.google.com/schhp?hl=en&tab=ws
http://www.google.com/finance?hl=en&tab=we
http://blogsearch.google.com/?hl=en&tab=wb
http://www.youtube.com/?hl=en&tab=w1
http://www.google.com/calendar/render?hl=en&tab=wc
http://picasaweb.google.com/home?hl=en&tab=wq
http://docs.google.com/?hl=en&tab=wo
http://www.google.com/reader/view/?hl=en&tab=wy
http://sites.google.com/?hl=en&tab=w3
http://www.google.com/intl/en/options/
/url?sa=p&pref=ig&pval=3&q=http://www.google.com/ig%3Fhl%3Den%26source%3Diglk&usg=AFQjCNFA18XPfgb7dKnXfKz7x7g1GDH1tg
https://www.google.com/accounts/Login?continue=http://www.google.com/&hl=en
/intl/en_ALL/images/logo.gif
/advanced_search?hl=en
/preferences?hl=en
/language_tools?hl=en
/intl/en/ads/
/services/
/intl/en/about.html
/intl/en/privacy.html
29 links found in page.


The only thing to be aware of here is URLs that are relative rather than full paths. You would need to put some logic in place to add the domain back to them if it's not there already.
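A minimal sketch of that logic, assuming you know the address of the page the links came from. absolutize() is a made-up name, not part of the DOM extension, and it is deliberately simplified: it does not resolve "../" or "./" segments and it ignores any port in the base address.

```php
<?php
// Hypothetical helper: turn a link found on the page at $base into a full URL.
function absolutize(string $link, string $base): string {
    // Already absolute: starts with a scheme such as http: or https:.
    if (preg_match('~^[a-z][a-z0-9+.\-]*:~i', $link)) {
        return $link;
    }
    $parts  = parse_url($base);
    $scheme = $parts['scheme'] ?? 'http';
    $root   = $scheme . '://' . $parts['host'];
    // Protocol-relative link (//host/path): only the scheme is missing.
    if (substr($link, 0, 2) === '//') {
        return $scheme . ':' . $link;
    }
    // Root-relative link (/path): add the scheme and domain back.
    if (substr($link, 0, 1) === '/') {
        return $root . $link;
    }
    // Anything else is taken relative to the directory of the base page.
    $path = $parts['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);
    return $root . $dir . $link;
}
```

With the output above, absolutize('/intl/en/about.html', 'http://www.google.com') would give http://www.google.com/intl/en/about.html, while links that already carry a scheme pass through unchanged.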

mschroeder
 

Wow! Thanks, all, for your replies... sorry for the late reply... :)
I was searching for a similar problem a year later and found this... hahaha... thanks again.

xor83
 
