Extract URL

Question

xor83 0 Newbie Poster

15 Years Ago

How can I extract URLs from a webpage, keeping only the URLs from one specific site? For example, for "www.abc.com/32432/file.zip" it should match abc.com, and the extension can be zip, rar, or 001.
Any help?

php


All 4 Replies

mschroeder 251 Bestower of Knowledge

15 Years Ago

Alright, since you responded with a great example of how to do it with regular expressions, I guess I can provide an XPath example using the DOM, as I mentioned previously.

<?php
$sUrl = 'http://www.google.com';

$oDom = new DomDocument();
@$oDom->loadHTMLFile( $sUrl );

$oXpath = new DomXpath($oDom);

// Could also be "//@href | //@src"; I just think the query used here gives you finer control over the result set.
$oRes = $oXpath->query("//a/@href | //img/@src | //script/@src");

$i=0;
foreach($oRes as $h1) {
	echo $h1->nodeValue . '<br>';
	$i++;
}

echo $i.' urls found in page.<br /><br />';

http://images.google.com/imghp?hl=en&tab=wi
http://maps.google.com/maps?hl=en&tab=wl
http://news.google.com/nwshp?hl=en&tab=wn
http://video.google.com/?hl=en&tab=wv
http://mail.google.com/mail/?hl=en&tab=wm
http://www.google.com/intl/en/options/
http://www.google.com/prdhp?hl=en&tab=wf
http://groups.google.com/grphp?hl=en&tab=wg
http://books.google.com/bkshp?hl=en&tab=wp
http://scholar.google.com/schhp?hl=en&tab=ws
http://www.google.com/finance?hl=en&tab=we
http://blogsearch.google.com/?hl=en&tab=wb
http://www.youtube.com/?hl=en&tab=w1
http://www.google.com/calendar/render?hl=en&tab=wc
http://picasaweb.google.com/home?hl=en&tab=wq
http://docs.google.com/?hl=en&tab=wo
http://www.google.com/reader/view/?hl=en&tab=wy
http://sites.google.com/?hl=en&tab=w3
http://www.google.com/intl/en/options/
/url?sa=p&pref=ig&pval=3&q=http://www.google.com/ig%3Fhl%3Den%26source%3Diglk&usg=AFQjCNFA18XPfgb7dKnXfKz7x7g1GDH1tg
https://www.google.com/accounts/Login?continue=http://www.google.com/&hl=en
/intl/en_ALL/images/logo.gif
/advanced_search?hl=en
/preferences?hl=en
/language_tools?hl=en
/intl/en/ads/
/services/
/intl/en/about.html
/intl/en/privacy.html
29 urls found in page.

The only thing to be aware of here is URLs that are relative rather than full paths. You would need to put some logic in place to add the domain back to them if it's not already there.
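As a rough illustration of that "add the domain back" logic, here is a minimal sketch. The function name `makeAbsolute` and its parameters are hypothetical (they are not part of the script above), and it only covers the common cases: links that are already absolute, protocol-relative links, root-relative links, and plain path-relative links. It does not handle `../` segments or links that consist only of a query string or fragment.

```php
<?php
// Hypothetical helper (not from the thread): turn a link found on a page
// into an absolute URL, using the address of the page it was found on.
function makeAbsolute(string $link, string $pageUrl): string
{
    // Already absolute: starts with a scheme such as http:, https: or mailto:.
    if (preg_match('#^[a-z][a-z0-9+.\-]*:#i', $link)) {
        return $link;
    }

    $base   = parse_url($pageUrl);
    $scheme = $base['scheme'] ?? 'http';
    $host   = $base['host'] ?? '';
    $port   = isset($base['port']) ? ':' . $base['port'] : '';
    $root   = $scheme . '://' . $host . $port;

    // Protocol-relative link, e.g. //cdn.example.com/a.zip
    if (strpos($link, '//') === 0) {
        return $scheme . ':' . $link;
    }

    // Root-relative link, e.g. /intl/en/about.html
    if (strpos($link, '/') === 0) {
        return $root . $link;
    }

    // Path-relative link: resolve against the directory of the current page.
    $path = $base['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);

    return $root . $dir . $link;
}
```

In the loop above, you could then print `makeAbsolute($h1->nodeValue, $sUrl)` instead of the raw `nodeValue`.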


mschroeder 251 Bestower of Knowledge Team Colleague · Answer 1 · 2009-03-03T03:57:31+00:00

In Firefox/Windows you would put http://www.abc.com into the address bar, once the site was loaded, press CTRL-U to bring up the source, then CTRL-C/CTRL-V on whatever urls you want. :twisted:

I think the concept you're looking for is a website scraper. There are a lot of different options for doing this, from regular expressions to XPath, which is one of my personal favorites.

Come back with some conceptual code and I'll be more than happy to help you work through it.

cwarn23 387 Occupation: Genius Team Colleague Featured Poster · Answer 2 · 2009-03-03T12:26:17+00:00

If you want to extract the URLs from the page, I have an existing script that extracts not only links to other pages but also links to pictures and other media. My script is as follows:

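// Note: despite its name, $url must hold the page's HTML source, not its address.
function getlinks($url) {
    // Split the source at every href= or src= attribute (quoted or unquoted).
    $media=preg_split('/(href\=\"|href\=\'|href\=|src\=\"|src\=\'|src\=)/i',$url);
    // Cut each piece at the first quote or > so only the attribute value remains.
    $media=preg_replace("/([^\'])\'(.*)/is",'$1',$media);
    $media=preg_replace("/([^\"])\"(.*)/is",'$1',$media);
    $media=preg_replace("/([^\>])\>(.*)/is",'$1',$media);
    $media=preg_replace("/([^\'])\'(.*)/i",'$1',$media);
    $media=preg_replace("/([^\"])\"(.*)/i",'$1',$media);
    $media=preg_replace("/([^\>])\>(.*)/i",'$1',$media);
    // Cut at a space that is followed by a digit, quote, > or /.
    $media=preg_replace("/([^ ])\ [0-9\'\"\>\/](.*)/is",'$1',$media);
    $media=@preg_replace("/([^ ])\ [0-9\'\"\>\/](.*)/i",'$1',$media);
    // Text after the last dot of each link (the file extension); computed but never used.
    $mediaext=preg_replace("/.*[.]([^.]+)/",'$1',$media);
    return $media;
    }
//above function returns an array; its first element is whatever precedes the first link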

It may be badly written, but it does the job. So I shall see if I can do it with a preg_match function.

=======================
Edit:
I have now written a function that will extract the links more efficiently and is as follows:

<?php
function getlinks($url) {
    $data=file_get_contents($url);
    preg_match_all('/(href|src)\=(\"|\')[^\"\'\>]+/i',$data,$media);
    unset($data);
    $data=preg_replace('/(href|src)(\"|\'|\=\"|\=\')(.*)/i',"$3",$media[0]);
    return $data;
    }

//now to use the function
echo "<xmp>";
var_dump(getlinks('http://www.google.com.au'));
echo "</xmp>";
?>

As you can see, the function returns an array of the links.
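Neither function above restricts the result to one site or to particular file types, which is what the original question asks for. Here is a minimal sketch of that last step, assuming the links have already been collected into an array. The function name `filterLinks` and its parameters are hypothetical, and relative links are simply dropped unless they have been turned into absolute URLs first.

```php
<?php
// Hypothetical helper (not from the thread): keep only the links that point
// at a given site and whose path ends in one of the wanted extensions.
function filterLinks(array $links, string $domain, array $extensions): array
{
    $domain  = strtolower($domain);
    $suffix  = '.' . $domain;
    $matches = [];

    foreach ($links as $link) {
        // parse_url() only detects the host when a scheme is present, so add
        // one for links written like "www.abc.com/32432/file.zip".
        $hasScheme = preg_match('#^[a-z][a-z0-9+.\-]*://#i', $link);
        $candidate = $hasScheme ? $link : 'http://' . ltrim($link, '/');

        $host = strtolower((string) parse_url($candidate, PHP_URL_HOST));
        $path = (string) parse_url($candidate, PHP_URL_PATH);
        $ext  = strtolower(pathinfo($path, PATHINFO_EXTENSION));

        // Accept the domain itself and any subdomain of it (www.abc.com, ...).
        $onSite = ($host === $domain) || (substr($host, -strlen($suffix)) === $suffix);

        if ($onSite && in_array($ext, $extensions, true)) {
            $matches[] = $link;
        }
    }

    return $matches;
}
```

For example, `filterLinks(getlinks('http://www.abc.com'), 'abc.com', array('zip', 'rar', '001'))` would keep only the zip, rar, and 001 links hosted on abc.com or one of its subdomains.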

xor83 0 Newbie Poster · Answer 3 · 2010-08-28T20:02:42+00:00

Wow! Thanks, all, for your replies... sorry for the late reply. :)
I was searching for a similar problem a year later and found this... hahaha... thanks again.
