How can I extract URLs from a webpage? I want all the URLs to be from one specific site only, like "www.abc.com/32432/file.zip": it should search for abc.com, and the extension can be zip, rar, or 001.
Any help?


In Firefox/Windows you would put http://www.abc.com into the address bar; once the site has loaded, press CTRL-U to bring up the source, then CTRL-C/CTRL-V whatever URLs you want. :twisted:

I think the concept you're looking for is a website scraper. There are a lot of different options for doing this, from regular expressions to XPath, which is one of my personal favorites.

Come back with some conceptual code and I'll be more than happy to help you work through it.

If you want to extract the URLs from the page, I have an existing script that extracts not only links to other pages but also links to pictures and other media. My script is as follows:

function getlinks($html) {
    // $html is the page source (e.g. from file_get_contents()), not a URL.
    // Split at every href=/src= so each piece begins with a link value.
    $media=preg_split('/(?:href|src)\s*=\s*["\']?/i',$html);
    array_shift($media); // discard the markup before the first attribute
    // Cut each piece at the first quote or '>' that closes the value.
    $media=preg_replace('/["\'>].*/s','',$media);
    // Unquoted values end at the first whitespace.
    $media=preg_replace('/\s.*/s','',$media);
    return $media;
    }
//above function returns an array

It may be badly written, but it does the job. I'll see if I can do the same with a preg_match function.

=======================
Edit:
I have now written a function that will extract the links more efficiently and is as follows:

<?php
function getlinks($url) {
    // Fetch the page; file_get_contents() returns false on failure.
    $data=file_get_contents($url);
    if ($data===false) {
        return array();
    }
    // Capture the quoted value of every href/src attribute.
    preg_match_all('/(?:href|src)\s*=\s*["\']([^"\'>]+)/i',$data,$media);
    return $media[1];
    }

//now to use the function
echo "<pre>";
echo htmlspecialchars(print_r(getlinks('http://www.google.com.au'),true));
echo "</pre>";
?>

And the function as you can see returns an array of the links.
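
Coming back to the original question, the returned array still has to be narrowed down to one site and a few file extensions. Here is a minimal sketch of how that could look. The `filterlinks()` helper and its sample links are made up for illustration and are not part of the script above. It assumes the links are already absolute (or at least start with the host name), so relative links are simply skipped.

```php
<?php
// Hypothetical helper, not part of the scripts above: keep only links on
// $host (with or without "www.") whose path ends in an allowed extension.
function filterlinks(array $links, $host, array $extensions) {
    $wanted = array();
    foreach ($links as $link) {
        // Links without a scheme ("www.abc.com/...") confuse parse_url(),
        // so give them one before parsing.
        $full = preg_match('#^[a-z][a-z0-9+.-]*://#i', $link) ? $link : 'http://' . $link;
        $parts = parse_url($full);
        if ($parts === false || !isset($parts['host'], $parts['path'])) {
            continue; // unparsable, or no host/path to compare
        }
        $linkHost = preg_replace('/^www\./i', '', strtolower($parts['host']));
        $ext = strtolower(pathinfo($parts['path'], PATHINFO_EXTENSION));
        if ($linkHost === strtolower($host) && in_array($ext, $extensions, true)) {
            $wanted[] = $link;
        }
    }
    return $wanted;
}

// Sample data (assumed, for illustration only).
$links = array(
    'http://www.abc.com/32432/file.zip',
    'http://abc.com/1/part.001',
    'http://www.abc.com/index.html',
    'http://www.xyz.com/file.rar',
    '/images/logo.gif',
);
print_r(filterlinks($links, 'abc.com', array('zip', 'rar', '001')));
```

With the sample data, only the first two links survive the filter. In practice you would pass in the result of `getlinks()` instead of the hard-coded array.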

Alright, since you responded with a great example of how to do it with regular expressions, I guess I can provide an XPath example using the DOM, as I mentioned previously.

<?php
$sUrl = 'http://www.google.com';

$oDom = new DOMDocument();
// Real-world HTML is rarely valid; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$oDom->loadHTMLFile( $sUrl );
libxml_clear_errors();

$oXpath = new DOMXPath($oDom);

//Could also be //@href | //@src; I just think the one used gives you finer control over the result set.
$oRes = $oXpath->query("//a/@href | //img/@src | //script/@src");

$i=0;
foreach($oRes as $oAttr) {
	echo $oAttr->nodeValue . '<br>';
	$i++;
}

echo $i.' links found in page.<br /><br />';
?>

Output:
http://images.google.com/imghp?hl=en&tab=wi
http://maps.google.com/maps?hl=en&tab=wl
http://news.google.com/nwshp?hl=en&tab=wn
http://video.google.com/?hl=en&tab=wv
http://mail.google.com/mail/?hl=en&tab=wm
http://www.google.com/intl/en/options/
http://www.google.com/prdhp?hl=en&tab=wf
http://groups.google.com/grphp?hl=en&tab=wg
http://books.google.com/bkshp?hl=en&tab=wp
http://scholar.google.com/schhp?hl=en&tab=ws
http://www.google.com/finance?hl=en&tab=we
http://blogsearch.google.com/?hl=en&tab=wb
http://www.youtube.com/?hl=en&tab=w1
http://www.google.com/calendar/render?hl=en&tab=wc
http://picasaweb.google.com/home?hl=en&tab=wq
http://docs.google.com/?hl=en&tab=wo
http://www.google.com/reader/view/?hl=en&tab=wy
http://sites.google.com/?hl=en&tab=w3
http://www.google.com/intl/en/options/
/url?sa=p&pref=ig&pval=3&q=http://www.google.com/ig%3Fhl%3Den%26source%3Diglk&usg=AFQjCNFA18XPfgb7dKnXfKz7x7g1GDH1tg
https://www.google.com/accounts/Login?continue=http://www.google.com/&hl=en
/intl/en_ALL/images/logo.gif
/advanced_search?hl=en
/preferences?hl=en
/language_tools?hl=en
/intl/en/ads/
/services/
/intl/en/about.html
/intl/en/privacy.html
29 links found in page.

The only thing to be aware of here is URLs that are relative rather than full paths. You would need to put some logic in place to add the scheme and domain back to them if they're not there already.
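
That logic could be sketched roughly as follows. The `absolutize()` function is a hypothetical helper, not something from the scripts above. It assumes the base URL is absolute, and it is a simplification of full URL resolution: it does not collapse `../` segments or treat query-only links specially.

```php
<?php
// Hypothetical helper: turn a link found on the page at $base into an
// absolute URL. Handles absolute, protocol-relative ("//host/x"),
// root-relative ("/x") and plain relative ("x") links.
function absolutize($link, $base) {
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $link)) {
        return $link;                       // already has a scheme
    }
    $parts  = parse_url($base);
    $scheme = isset($parts['scheme']) ? $parts['scheme'] : 'http';
    if (substr($link, 0, 2) === '//') {
        return $scheme . ':' . $link;       // protocol-relative
    }
    $root = $scheme . '://' . $parts['host'] . (isset($parts['port']) ? ':' . $parts['port'] : '');
    if (substr($link, 0, 1) === '/') {
        return $root . $link;               // root-relative
    }
    // Plain relative: resolve against the directory of the base path.
    $path = isset($parts['path']) ? $parts['path'] : '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);
    return $root . $dir . $link;
}

echo absolutize('/intl/en/about.html', 'http://www.google.com/'), "\n";
echo absolutize('logo.gif', 'http://www.google.com/images/index.html'), "\n";
```

You would run every extracted link through it with the page's own URL as `$base` before filtering or downloading.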

Wow! Thanks, all, for your replies... sorry for the late reply... :)
I was searching for a similar problem a year later and found this... hahaha... thanks again.
