Extract URL
Join Date: Jul 2008
Posts: 147
Reputation:
Solved Threads: 25
In Firefox on Windows you would put http://www.abc.com into the address bar and, once the site has loaded, press CTRL-U to bring up the source, then CTRL-C/CTRL-V whatever URLs you want.
I think the concept you're looking for is a website scraper. There are a lot of different options for doing this, from regular expressions to XPath, which is one of my personal favorites.
Come back with some conceptual code and I'll be more than happy to help you work through it.
If your question/problem is solved don't forget to mark the thread as Solved!
-- Code I post is usually but not always tested. If it is tested it will be against 5.2.11 or 5.3.0
If you want to extract the URLs from the page, I have an existing script that extracts not only links to other pages but also links to pictures and other media. My script is as follows:

```php
<?php
// Note: preg_split() is applied directly to $url, so the argument presumably
// has to be the page's HTML source rather than its address.
function getlinks($url) {
    // Split at every href=/src= so that each piece starts with a link.
    $media=preg_split('/(href\=\"|href\=\'|href\=|src\=\"|src\=\'|src\=)/i',$url);
    // Cut each piece at the first quote or '>' that follows the link.
    $media=preg_replace("/([^\'])\'(.*)/is",'$1',$media);
    $media=preg_replace("/([^\"])\"(.*)/is",'$1',$media);
    $media=preg_replace("/([^\>])\>(.*)/is",'$1',$media);
    $media=preg_replace("/([^\'])\'(.*)/i",'$1',$media);
    $media=preg_replace("/([^\"])\"(.*)/i",'$1',$media);
    $media=preg_replace("/([^\>])\>(.*)/i",'$1',$media);
    // Cut at a space followed by a digit, quote, '>' or '/'.
    $media=preg_replace("/([^ ])\ [0-9\'\"\>\/](.*)/is",'$1',$media);
    $media=@preg_replace("/([^ ])\ [0-9\'\"\>\/](.*)/i",'$1',$media);
    // File extensions of the links (computed but not used here).
    $mediaext=preg_replace("/.*[.]([^.]+)/",'$1',$media);
    return $media;
}
//above function returns an array
```

As you can see, the function returns an array of the links. It may be badly written, but it does the job. So I shall see if I can do it with a preg_match function.
=======================
Edit:
I have now written a function that will extract the links more efficiently and is as follows:
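```php
<?php
function getlinks($url) {
    // Download the page; file_get_contents() returns false on failure.
    $data=file_get_contents($url);
    if ($data===false) {
        return array();
    }
    // Match every quoted href="..." or src='...' attribute.
    preg_match_all('/(href|src)\=(\"|\')[^\"\'\>]+/i',$data,$media);
    unset($data);
    // Strip the attribute name and opening quote, leaving only the link.
    $data=preg_replace('/(href|src)(\"|\'|\=\"|\=\')(.*)/i','$3',$media[0]);
    return $data;
}

//now to use the function
echo "<xmp>";
var_dump(getlinks('http://www.google.com.au'));
echo "</xmp>";
?>
```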
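For anyone who wants to try the regular expression approach without hitting a live site, here is a minimal sketch that works on an HTML string instead of a URL. The function name `extractLinksFromHtml` and the sample markup are made up for illustration, and it assumes PHP 7 or later. It only picks up quoted attribute values, so unquoted `href=foo` links are missed.

```php
<?php
// Hypothetical helper (not from the post above): returns the values of all
// quoted href/src attributes found in an HTML string.
function extractLinksFromHtml(string $html): array
{
    // Group 1 is the opening quote, group 2 is the attribute value.
    // The \1 backreference makes the closing quote match the opening one.
    $pattern = '/\b(?:href|src)\s*=\s*(["\'])(.*?)\1/is';
    if (!preg_match_all($pattern, $html, $matches)) {
        return [];
    }
    return $matches[2];
}

// Made-up sample markup.
$sample = '<a href="http://www.abc.com/">Home</a>'
        . "<img src='/images/logo.gif'>"
        . '<script src="js/app.js"></script>';

print_r(extractLinksFromHtml($sample));
```

Taking the value straight from a capture group avoids the second `preg_replace` pass, at the cost of a slightly denser pattern.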
Last edited by cwarn23; Mar 3rd, 2009 at 2:37 am. Reason: Added info
Try not to bump 10 year old threads as it can be really annoying.
Like php then read my website at http://syntax.cwarn23.net/
Star-Trek-Atlantis - now that's what I call a movie ^_^
My favourite PC. - MacGyver Fan
Bad english note: dis-iz-2b4u
Alright, since you responded with a great example of how to do it with regular expressions, I guess I can provide an XPath example using the DOM, as I mentioned previously.
```php
<?php
$sUrl = 'http://www.google.com';

$oDom = new DOMDocument();
// @ hides the parser warnings that imperfect real-world HTML triggers.
@$oDom->loadHTMLFile( $sUrl );

$oXpath = new DOMXPath($oDom);
// Could also be //@href | //@src; I just think the one used gives you
// finer control over the result set.
$oRes = $oXpath->query("//a/@href | //img/@src | //script/@src");

$i=0;
foreach($oRes as $h1) {
    echo $h1->nodeValue . '<br>';
    $i++;
}
echo $i.' urls found in page.<br /><br />';
```
```
http://images.google.com/imghp?hl=en&tab=wi
http://maps.google.com/maps?hl=en&tab=wl
http://news.google.com/nwshp?hl=en&tab=wn
http://video.google.com/?hl=en&tab=wv
http://mail.google.com/mail/?hl=en&tab=wm
http://www.google.com/intl/en/options/
http://www.google.com/prdhp?hl=en&tab=wf
http://groups.google.com/grphp?hl=en&tab=wg
http://books.google.com/bkshp?hl=en&tab=wp
http://scholar.google.com/schhp?hl=en&tab=ws
http://www.google.com/finance?hl=en&tab=we
http://blogsearch.google.com/?hl=en&tab=wb
http://www.youtube.com/?hl=en&tab=w1
http://www.google.com/calendar/render?hl=en&tab=wc
http://picasaweb.google.com/home?hl=en&tab=wq
http://docs.google.com/?hl=en&tab=wo
http://www.google.com/reader/view/?hl=en&tab=wy
http://sites.google.com/?hl=en&tab=w3
http://www.google.com/intl/en/options/
/url?sa=p&pref=ig&pval=3&q=http://www.google.com/ig%3Fhl%3Den%26source%3Diglk&usg=AFQjCNFA18XPfgb7dKnXfKz7x7g1GDH1tg
https://www.google.com/accounts/Login?continue=http://www.google.com/&hl=en
/intl/en_ALL/images/logo.gif
/advanced_search?hl=en
/preferences?hl=en
/language_tools?hl=en
/intl/en/ads/
/services/
/intl/en/about.html
/intl/en/privacy.html
29 links found in page.
```
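To see what the difference between the two XPath queries mentioned in the code comment amounts to, here is a small sketch that runs both against an inline HTML string. The markup and the helper name `attributeValues` are invented for the example, and it assumes PHP 7 or later with the DOM extension enabled.

```php
<?php
// Made-up sample page: one <link>, one <a>, one <img>.
$html = '<html><head><link rel="stylesheet" href="style.css"></head>'
      . '<body><a href="/about.html">About</a><img src="logo.gif"></body></html>';

$dom = new DOMDocument();
// @ hides parser warnings, as in the example above.
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Hypothetical helper: collects the values of the attribute nodes
// selected by an XPath query.
function attributeValues(DOMXPath $xpath, string $query): array
{
    $values = [];
    foreach ($xpath->query($query) as $attr) {
        $values[] = $attr->nodeValue;
    }
    return $values;
}

// Only anchors, images and scripts.
$narrow = attributeValues($xpath, '//a/@href | //img/@src | //script/@src');
// Every href or src attribute on any element, including <link>.
$broad  = attributeValues($xpath, '//@href | //@src');

echo count($narrow), ' vs ', count($broad), "\n"; // prints "2 vs 3"
```

The broader query also returns the stylesheet `href` from the `<link>` element, which is the kind of result the narrower query filters out.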
The only thing to be aware of here is URLs that are relative rather than full paths. You would need to put some logic in place to add the domain back to them if it's not there already.
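A rough sketch of that logic could look like the following. The function name `resolveUrl` is hypothetical, and it assumes PHP 7 or later. It is deliberately simplified: it does not normalise `../` or `./` segments, and links that consist only of a query string or fragment are treated like ordinary relative paths, so treat it as a starting point rather than a complete resolver.

```php
<?php
// Hypothetical helper: turns a link found in a page into an absolute URL,
// using the URL the page was loaded from as the base.
function resolveUrl(string $base, string $link): string
{
    // Already absolute: starts with a scheme such as http: or https:.
    if (preg_match('/^[a-z][a-z0-9+.-]*:/i', $link)) {
        return $link;
    }

    $parts  = parse_url($base);
    $scheme = $parts['scheme'] ?? 'http';
    $host   = $parts['host'] ?? '';
    $port   = isset($parts['port']) ? ':' . $parts['port'] : '';
    $origin = $scheme . '://' . $host . $port;

    // Protocol-relative link, e.g. //maps.google.com/maps
    if (strpos($link, '//') === 0) {
        return $scheme . ':' . $link;
    }
    // Root-relative link, e.g. /preferences?hl=en
    if (strpos($link, '/') === 0) {
        return $origin . $link;
    }
    // Path-relative link: append to the directory part of the base path.
    $path = $parts['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $link;
}

echo resolveUrl('http://www.google.com', '/preferences?hl=en'), "\n";
// prints http://www.google.com/preferences?hl=en
echo resolveUrl('http://www.google.com/intl/en/about.html', 'privacy.html'), "\n";
// prints http://www.google.com/intl/en/privacy.html
```

You would call this on each `nodeValue` inside the `foreach` loop, passing the URL the page was loaded from as `$base`.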
Last edited by mschroeder; Mar 3rd, 2009 at 10:16 am.