Extract URL
Join Date: Jul 2008
Posts: 148
Reputation:
Solved Threads: 25
In Firefox on Windows you would put http://www.abc.com into the address bar and, once the site has loaded, press Ctrl+U to bring up the source, then use Ctrl+C/Ctrl+V on whatever URLs you want.
I think the concept you're looking for is a website scraper. There are a lot of different options for doing this, from regular expressions to XPath, which is one of my personal favorites.
Come back with some conceptual code and I'll be more than happy to help you work through it.
If your question/problem is solved, don't forget to mark the thread as Solved!
-- Code I post is usually but not always tested. If it is tested it will be against 5.2.11 or 5.3.0
If you want to extract the URLs from the page, I have an existing script that extracts not only links to other pages but also links to pictures and other media. As you can see, the function returns an array of the links. It may be badly written, but it does the job, so I shall see if I can do a version with a preg_match function. My script is as follows:
```php
// Note: despite its name, $url must hold the page's HTML source, not its address.
function getlinks($url) {
    // Split at every href= or src= attribute, quoted or unquoted.
    // Element 0 is whatever preceded the first attribute, not a link.
    $media=preg_split('/(href\=\"|href\=\'|href\=|src\=\"|src\=\'|src\=)/i',$url);
    // Cut each piece at the first closing quote or ">".
    $media=preg_replace("/([^\'])\'(.*)/is",'$1',$media);
    $media=preg_replace("/([^\"])\"(.*)/is",'$1',$media);
    $media=preg_replace("/([^\>])\>(.*)/is",'$1',$media);
    $media=preg_replace("/([^\'])\'(.*)/i",'$1',$media);
    $media=preg_replace("/([^\"])\"(.*)/i",'$1',$media);
    $media=preg_replace("/([^\>])\>(.*)/i",'$1',$media);
    // Cut at a space followed by a digit, quote, ">" or "/".
    $media=preg_replace("/([^ ])\ [0-9\'\"\>\/](.*)/is",'$1',$media);
    $media=@preg_replace("/([^ ])\ [0-9\'\"\>\/](.*)/i",'$1',$media);
    return $media;
}
//above function returns an array
```
=======================
Edit:
I have now written a function that will extract the links more efficiently and is as follows:
```php
<?php
function getlinks($url) {
    // Download the page source.
    $data=file_get_contents($url);
    // Match every quoted href="..." or src="..." attribute.
    preg_match_all('/(href|src)\=(\"|\')[^\"\'\>]+/i',$data,$media);
    unset($data);
    // Strip the attribute name, "=" and opening quote, leaving only the URL.
    $data=preg_replace('/(href|src)(\"|\'|\=\"|\=\')(.*)/i',"$3",$media[0]);
    return $data;
}

//now to use the function
echo "<xmp>";
var_dump(getlinks('http://www.google.com.au'));
echo "</xmp>";
?>
```
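For anyone who wants to try the regular expression idea without fetching a live page, here is a minimal sketch that works on an HTML string and pulls the URL straight out of a capture group, so no second preg_replace pass is needed. The function name `extractLinksFromHtml` and the sample markup are made up for illustration, and the type declarations assume PHP 7 or later; treat it as one possible variation rather than a replacement for the function above.

```php
<?php
// Hypothetical helper (not from the thread): collect href/src values from an
// HTML string with a single preg_match_all call.
function extractLinksFromHtml(string $html): array
{
    // Group 1 captures the opening quote, group 2 the URL itself;
    // \1 requires the same kind of quote to close the attribute.
    preg_match_all('/\b(?:href|src)\s*=\s*(["\'])(.*?)\1/is', $html, $matches);

    // Drop duplicate URLs and reindex so the result is a plain list.
    return array_values(array_unique($matches[2]));
}

// Made-up sample markup: two links to the same page and one image.
$sample = '<a href="http://www.abc.com/page.html">Page</a>'
        . "<img src='/images/logo.gif'>"
        . '<a href="http://www.abc.com/page.html">Same page again</a>';

print_r(extractLinksFromHtml($sample));
```

Like the function above, this only sees quoted attributes, and a regular expression can still be fooled by markup inside comments or scripts, so it is best kept for quick jobs.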
Last edited by cwarn23; Mar 3rd, 2009 at 2:37 am. Reason: Added info
Try not to bump 10 year old threads as it can be really annoying.
Like php then read my website at http://syntax.cwarn23.net/
Star-Trek-Atlantis - now that's what I call a movie ^_^
My favourite PC. - MacGyver Fan
Bad english note: dis-iz-2b4u
Alright, since you responded with a great example of how to do it with regular expressions, I guess I can provide an XPath example using the DOM, as I mentioned previously.
The only thing to be aware of here is URLs that are relative rather than full paths. You would need to put some logic in place to add the domain back to them if it's not there already.
```php
<?php
$sUrl = 'http://www.google.com';
$oDom = new DOMDocument();
@$oDom->loadHTMLFile( $sUrl );
$oXpath = new DOMXPath($oDom);
//Could also be //@href | //@src i just think the one used gives you more finite control over the result set.
$oRes = $oXpath->query("//a/@href | //img/@src | //script/@src");
$i=0;
foreach($oRes as $h1) {
    echo $h1->nodeValue . '<br>';
    $i++;
}
echo $i.' urls found in page.<br /><br />';
```
```
http://images.google.com/imghp?hl=en&tab=wi
http://maps.google.com/maps?hl=en&tab=wl
http://news.google.com/nwshp?hl=en&tab=wn
http://video.google.com/?hl=en&tab=wv
http://mail.google.com/mail/?hl=en&tab=wm
http://www.google.com/intl/en/options/
http://www.google.com/prdhp?hl=en&tab=wf
http://groups.google.com/grphp?hl=en&tab=wg
http://books.google.com/bkshp?hl=en&tab=wp
http://scholar.google.com/schhp?hl=en&tab=ws
http://www.google.com/finance?hl=en&tab=we
http://blogsearch.google.com/?hl=en&tab=wb
http://www.youtube.com/?hl=en&tab=w1
http://www.google.com/calendar/render?hl=en&tab=wc
http://picasaweb.google.com/home?hl=en&tab=wq
http://docs.google.com/?hl=en&tab=wo
http://www.google.com/reader/view/?hl=en&tab=wy
http://sites.google.com/?hl=en&tab=w3
http://www.google.com/intl/en/options/
/url?sa=p&pref=ig&pval=3&q=http://www.google.com/ig%3Fhl%3Den%26source%3Diglk&usg=AFQjCNFA18XPfgb7dKnXfKz7x7g1GDH1tg
https://www.google.com/accounts/Login?continue=http://www.google.com/&hl=en
/intl/en_ALL/images/logo.gif
/advanced_search?hl=en
/preferences?hl=en
/language_tools?hl=en
/intl/en/ads/
/services/
/intl/en/about.html
/intl/en/privacy.html
29 links found in page.
```
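Several entries in that output, such as /intl/en/about.html, are relative paths. As a rough sketch of the logic for adding the domain back, something like the following might do. The names `makeAbsolute` and `extractAbsoluteLinks` and the sample markup are invented for illustration, the code assumes PHP 7 or later, and it is deliberately simplified: "../" and "./" segments are not normalised, and fragment-only or query-only links are just appended to the page's directory.

```php
<?php
// Hypothetical helper (not from the thread): turn a link found in a page into
// an absolute URL. Simplified: "../" and "./" segments are NOT normalised.
function makeAbsolute(string $link, string $pageUrl): string
{
    // Already absolute, e.g. http://..., https://..., mailto:...
    if (preg_match('/^[a-z][a-z0-9+.\-]*:/i', $link)) {
        return $link;
    }

    $parts  = parse_url($pageUrl);
    $scheme = $parts['scheme'] ?? 'http';
    $origin = $scheme . '://' . ($parts['host'] ?? '')
            . (isset($parts['port']) ? ':' . $parts['port'] : '');

    // Protocol-relative link: //host/path
    if (strpos($link, '//') === 0) {
        return $scheme . ':' . $link;
    }

    // Root-relative link: /path
    if (strpos($link, '/') === 0) {
        return $origin . $link;
    }

    // Document-relative link: resolve against the directory of the page.
    $path = $parts['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);

    return $origin . $dir . $link;
}

// Hypothetical wrapper: same XPath query as in the post, run on an HTML string.
function extractAbsoluteLinks(string $html, string $pageUrl): array
{
    // Collect libxml parse warnings quietly instead of using the @ operator.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $links = array();
    foreach ($xpath->query('//a/@href | //img/@src | //script/@src') as $attr) {
        $links[] = makeAbsolute($attr->nodeValue, $pageUrl);
    }

    return $links;
}

// Made-up sample markup mixing one absolute and two relative URLs.
$html = '<html><body>'
      . '<a href="http://maps.google.com/maps?hl=en&amp;tab=wl">Maps</a>'
      . '<img src="/intl/en_ALL/images/logo.gif">'
      . '<a href="/intl/en/about.html">About</a>'
      . '</body></html>';

foreach (extractAbsoluteLinks($html, 'http://www.google.com') as $link) {
    echo $link, "\n";
}
```

With the sample above, the absolute link should pass through unchanged and the two relative paths should come back prefixed with http://www.google.com. For anything beyond simple cases, a resolver that follows the full URI resolution rules would be a safer choice.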
Last edited by mschroeder; Mar 3rd, 2009 at 10:16 am.