Extract URL

xor83 (Newbie Poster), #1, Mar 2nd, 2009
How can I extract URLs from a web page, keeping only the URLs that belong to one specific site? For example, given "www.abc.com/32432/file.zip", it should match abc.com, and the extension can be zip, rar, or 001.
Any help?

mschroeder (Junior Poster), #2, Mar 2nd, 2009
In Firefox on Windows, you would put http://www.abc.com into the address bar and, once the site has loaded, press CTRL-U to bring up the source. Then use CTRL-C/CTRL-V on whatever URLs you want.

I think the concept you're looking for is a website scraper. There are a lot of different ways to build one, from regular expressions to XPath, which is one of my personal favorites.

Come back with some conceptual code and I'll be more than happy to help you work through it.
If your question/problem is solved, don't forget to mark the thread as Solved!

-- Code I post is usually but not always tested. If it is tested it will be against 5.2.11 or 5.3.0

cwarn23 (Nearly a Posting Virtuoso), #3, Mar 3rd, 2009
If you want to extract the URLs from the page, I have an existing script that extracts not only links to other pages but also links to pictures and other media. My script is as follows:
    function getlinks($html) {
        // The input is the page source itself, not a URL.
        // Split the source at every href= or src= attribute (quoted or unquoted).
        $media = preg_split('/(?:href|src)=["\']?/i', $html);
        // The first chunk is the markup before the first attribute, not a link.
        array_shift($media);
        // Cut each chunk at the first quote or ">" that ends the attribute value.
        $media = preg_replace("/([^'])'.*/s", '$1', $media);
        $media = preg_replace('/([^"])".*/s', '$1', $media);
        $media = preg_replace('/([^>])>.*/s', '$1', $media);
        // Cut unquoted values at a space followed by a digit, quote, ">" or "/".
        $media = preg_replace('/([^ ]) [0-9\'">\/].*/s', '$1', $media);
        return $media;
    }
    // The function above returns an array.
It may be badly written, but it does the job. I'll see if I can write a version that uses a preg_match function instead.

=======================
Edit:
I have now written a function that extracts the links more efficiently. It is as follows:
    <?php
    function getlinks($url) {
        // Download the page; file_get_contents() returns false on failure.
        $data = file_get_contents($url);
        if ($data === false) {
            return array();
        }
        // Capture the value of every quoted href or src attribute.
        preg_match_all('/(?:href|src)=["\']([^"\'>]+)/i', $data, $media);
        return $media[1];
    }

    // Now use the function.
    echo '<pre>';
    var_dump(getlinks('http://www.google.com.au'));
    echo '</pre>';
    ?>
As you can see, the function returns an array of the links.
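
To get back to the original question (only links on abc.com that end in zip, rar, or 001), the array returned above could be filtered afterwards. This is a minimal sketch, not something from the posts in this thread: the function name filterlinks and its parameters are made up for illustration, and it assumes that subdomains such as www.abc.com should count as abc.com. Relative links carry no host, so they are skipped.

```php
<?php
// Hypothetical helper (not from the thread): keep only the links that point
// at a given host and end in one of the allowed extensions.
function filterlinks(array $links, $host, array $extensions) {
    $host = strtolower($host);
    $suffix = '.' . $host;
    $kept = array();
    foreach ($links as $link) {
        // parse_url() only recognises the host when a scheme is present,
        // so add one to links written like "www.abc.com/32432/file.zip".
        $candidate = $link;
        if (!preg_match('#^[a-z][a-z0-9+.-]*://#i', $candidate)) {
            $candidate = 'http://' . $candidate;
        }
        $parts = parse_url($candidate);
        if ($parts === false || !isset($parts['host'], $parts['path'])) {
            continue; // malformed, relative, or no path to take an extension from
        }
        // Assumption: accept the bare host and any subdomain of it.
        $linkHost = strtolower($parts['host']);
        $hostOk = $linkHost === $host
            || substr($linkHost, -strlen($suffix)) === $suffix;
        $ext = strtolower(pathinfo($parts['path'], PATHINFO_EXTENSION));
        if ($hostOk && in_array($ext, $extensions, true)) {
            $kept[] = $link; // return the link exactly as it appeared in the page
        }
    }
    return $kept;
}

// Example with a hand-written list; in practice pass in getlinks($url).
$sample = array(
    'www.abc.com/32432/file.zip',
    'http://www.xyz.com/file.zip',
    'http://www.abc.com/page.html',
);
print_r(filterlinks($sample, 'abc.com', array('zip', 'rar', '001')));
```

Parsing each link with parse_url and comparing the host is a little longer than one big regular expression, but it avoids matching "abc.com" when it only appears somewhere in the path or query string of another site's URL.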
Last edited by cwarn23; Mar 3rd, 2009 at 2:37 am. Reason: Added info

mschroeder (Junior Poster), #4, Mar 3rd, 2009
Alright, since you responded with a great example of how to do it with regular expressions, I guess I can provide an XPath example using the DOM, as I mentioned previously.

    <?php
    $sUrl = 'http://www.google.com';

    // Real-world HTML is rarely valid, so collect parser warnings instead of printing them.
    libxml_use_internal_errors(true);
    $oDom = new DOMDocument();
    $oDom->loadHTMLFile($sUrl);

    $oXpath = new DOMXPath($oDom);

    // "//@href | //@src" would also work; naming the elements gives finer control over the result set.
    $oRes = $oXpath->query('//a/@href | //img/@src | //script/@src');

    foreach ($oRes as $oAttr) {
        echo $oAttr->nodeValue . '<br>';
    }

    // DOMNodeList::length holds the number of matched attributes.
    echo $oRes->length . ' links found in page.<br /><br />';

Output:

    http://images.google.com/imghp?hl=en&tab=wi
    http://maps.google.com/maps?hl=en&tab=wl
    http://news.google.com/nwshp?hl=en&tab=wn
    http://video.google.com/?hl=en&tab=wv
    http://mail.google.com/mail/?hl=en&tab=wm
    http://www.google.com/intl/en/options/
    http://www.google.com/prdhp?hl=en&tab=wf
    http://groups.google.com/grphp?hl=en&tab=wg
    http://books.google.com/bkshp?hl=en&tab=wp
    http://scholar.google.com/schhp?hl=en&tab=ws
    http://www.google.com/finance?hl=en&tab=we
    http://blogsearch.google.com/?hl=en&tab=wb
    http://www.youtube.com/?hl=en&tab=w1
    http://www.google.com/calendar/render?hl=en&tab=wc
    http://picasaweb.google.com/home?hl=en&tab=wq
    http://docs.google.com/?hl=en&tab=wo
    http://www.google.com/reader/view/?hl=en&tab=wy
    http://sites.google.com/?hl=en&tab=w3
    http://www.google.com/intl/en/options/
    /url?sa=p&pref=ig&pval=3&q=http://www.google.com/ig%3Fhl%3Den%26source%3Diglk&usg=AFQjCNFA18XPfgb7dKnXfKz7x7g1GDH1tg
    https://www.google.com/accounts/Login?continue=http://www.google.com/&hl=en
    /intl/en_ALL/images/logo.gif
    /advanced_search?hl=en
    /preferences?hl=en
    /language_tools?hl=en
    /intl/en/ads/
    /services/
    /intl/en/about.html
    /intl/en/privacy.html
    29 links found in page.

The only thing to be aware of here is that some URLs are relative rather than full paths. You would need to put some logic in place to add the domain back to them if it's not there already.
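
One way that "add the domain back" logic might look is sketched below. It is only a rough illustration under stated assumptions: the function name absoluteurl and its parameters are hypothetical, the page URL is assumed to be a well-formed absolute URL, and only the common cases are handled (absolute, protocol-relative, root-relative, and plain path-relative links). Segments such as "../" and links consisting only of a query string or fragment are not normalised.

```php
<?php
// Hypothetical helper (not from the thread): turn a link found in a page
// into an absolute URL, using the URL of the page it was found on.
function absoluteurl($link, $pageUrl) {
    // Already absolute (starts with a scheme such as http: or https:).
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $link)) {
        return $link;
    }
    // Assumption: $pageUrl is absolute, so scheme and host are present.
    $base = parse_url($pageUrl);
    $root = $base['scheme'] . '://' . $base['host']
        . (isset($base['port']) ? ':' . $base['port'] : '');
    // Protocol-relative link ("//host/path"): reuse the page's scheme.
    if (substr($link, 0, 2) === '//') {
        return $base['scheme'] . ':' . $link;
    }
    // Root-relative link ("/path"): append to scheme and host.
    if (substr($link, 0, 1) === '/') {
        return $root . $link;
    }
    // Path-relative link ("file.zip"): resolve against the page's directory.
    $path = isset($base['path']) ? $base['path'] : '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $root . $dir . $link;
}

// Example using one of the relative links from the output above.
echo absoluteurl('/intl/en/about.html', 'http://www.google.com'), "\n";
```

In the XPath loop, each nodeValue could be passed through a function like this before it is printed or filtered by domain.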
Last edited by mschroeder; Mar 3rd, 2009 at 10:16 am.

©2003 - 2009 DaniWeb® LLC