webcrawler help

#1  MDanz, Sep 22nd, 2009
I'm trying to create a simple web spider/crawler to populate my search engine.

<form action="crawl.php" method="get">
  <center>
    <span class="style1">Crawl Website</span>
    <input name="search" type="text" value="" size="25" />
    <input type="submit" name="submit" value="Go!" />
  </center>
</form>

I type in the name of a website, extract the <title> and <head> from that website, and then store them in MySQL.

I need some direction on how to do the bolded part; I can't find a tutorial on this.
Last edited by MDanz; Sep 22nd, 2009 at 4:02 pm.

#2  kylegetson, Sep 22nd, 2009
You would need to start with fopen, curl, or any other HTTP request function. You'll then have to parse the returned HTML with a regular expression search to find the pieces of the header you're looking for.

hope that helps.
-Kyle Getson
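
A minimal sketch of the two steps described above (request the page, then pull a piece out with a regular expression), assuming the cURL extension is available. The names `fetch_html` and `extract_title` are made up for illustration and are not part of any library.

```php
<?php
// Hypothetical helper: fetch a page body with cURL, or return null on failure.
function fetch_html(string $url): ?string
{
    $ch = curl_init($url);
    if ($ch === false) {
        return null;
    }
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // give up after 10 seconds
    $body = curl_exec($ch);
    return $body === false ? null : $body;
}

// Hypothetical helper: return the text of the first <title> element, or null.
function extract_title(string $html): ?string
{
    // "i" = case-insensitive, "s" = let "." match newlines, ".*?" = shortest match
    if (preg_match('#<title[^>]*>(.*?)</title>#is', $html, $m)) {
        return trim($m[1]);
    }
    return null;
}
```

Usage would look something like `$title = extract_title(fetch_html('http://www.realgm.com') ?? '');`. Regular expressions are fine for a single tag like this, but they get fragile quickly on messy real-world HTML.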

#3  hireaprogrammer, Sep 23rd, 2009
Visit my recent work: www.searchnaukri.com

If you are looking for something similar, let me know and I will provide you with the code.

Mail me: info@hire-phpdeveloper.com

#4  cwarn23, Sep 23rd, 2009
I just love making bots. You can view my article at: http://www.syntax.cwarn23.info/PHP_M..._search_engine
The script is as follows:
<form method="post">Scan site: <input type="text" name="site" value="http://" style="width:300px">
<input value="Scan" type="submit"></form>
<?php
set_time_limit(0); // a crawl can run for a long time, so lift the execution time limit
if (isset($_POST['site']) && !empty($_POST['site'])) {
    /* Formats Allowed */
    $formats=array('html'=>true,'htm'=>true,'xhtml'=>true,'xml'=>true,'mhtml'=>true,'xht'=>true,
        'mht'=>true,'asp'=>true,'aspx'=>true,'adp'=>true,'bml'=>true,'cfm'=>true,'cgi'=>true,
        'ihtml'=>true,'jsp'=>true,'las'=>true,'lasso'=>true,'lassoapp'=>true,'pl'=>true,'php'=>true,
        'php1'=>true,'php2'=>true,'php3'=>true,'php4'=>true,'php5'=>true,'php6'=>true,'phtml'=>true,
        'shtml'=>true,'search'=>true,'query'=>true,'forum'=>true,'blog'=>true,'1'=>true,'2'=>true,
        '3'=>true,'4'=>true,'5'=>true,'6'=>true,'7'=>true,'8'=>true,'9'=>true,'10'=>true,'11'=>true,
        '12'=>true,'13'=>true,'14'=>true,'15'=>true,'16'=>true,'17'=>true,'18'=>true,'19'=>true,
        '20'=>true,'01'=>true,'02'=>true,'03'=>true,'04'=>true,'05'=>true,'06'=>true,'07'=>true,
        '08'=>true,'09'=>true,'go'=>true,'page'=>true,'file'=>true);

    // Returns the scheme-plus-host part of a URL, e.g. http://example.com
    function domain($ddomain) {
        return preg_replace('/^((http(s)?:\/\/)?([^\/]+))(.*)/','$1',$ddomain);
    }

    // Sends a header-only request and reports whether the URL answers with "200 OK".
    function url_exists($durl)
    {
        // Version 4.x supported
        $handle = curl_init($durl);
        if (false === $handle)
        {
            return false;
        }
        curl_setopt($handle, CURLOPT_HEADER, true);
        curl_setopt($handle, CURLOPT_FAILONERROR, true); // this works
        curl_setopt($handle, CURLOPT_HTTPHEADER,
            array("User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.15) Gecko/20080623 Firefox/2.0.0.15"));
        curl_setopt($handle, CURLOPT_NOBODY, true);
        curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
        $connectable = curl_exec($handle);
        curl_close($handle);
        // Only the first 30 characters (the status line) are inspected.
        return stripos(substr_replace($connectable,'',30),'200 OK') !== false;
    }

    $f_data='';
    // The function below only collects links within the page's own domain, not links outside the site.
    function getlinks($generateurlf) {
        global $formats;
        global $f_data;
        $f_data=file_get_contents($generateurlf);
        $datac=$f_data;
        // Collect every href="..." and src="..." value on the page.
        preg_match_all('/(href|src)\=(\"|\')([^\"\'\>]+)/i',$datac,$media);
        unset($datac);
        $datac=$media[3];
        unset($media);
        $datab=array();
        $str_start=array('http'=>true,'www.'=>true);
        foreach($datac AS $dfile) {
            $generateurle=$generateurlf;
            // File extension of the link target, used as the key into $formats.
            $format=strtolower(preg_replace('/(.*)[.]([^.\?]+)(\?(.*))?/','$2',basename($generateurle.$dfile)));
            if (!isset($str_start[substr_replace($dfile,'',4)])) {
                // Relative link: reduce the current URL to its directory first.
                if (substr_replace($generateurle,'',0, -1)!=='/') {
                    $generateurle=preg_replace('/(.*)\/[^\/]+/is', "$1", $generateurle);
                } else {
                    $generateurle=substr_replace($generateurle,'',-1);
                }

                if (substr_replace($dfile,'',1)=='/') {
                    if (domain($generateurle)==domain($generateurle.$dfile)) {
                        if (isset($formats[$format])
                            || substr($generateurle.$dfile,-1)=='/' || substr_count(basename($generateurle.$dfile),'.')==0) {
                            $datab[]=$generateurle.$dfile;
                        }
                    }
                } else if (substr($dfile,0,2)=='./') {
                    $dfile=substr($dfile,2);
                    if (isset($formats[$format])) {$datab[]=$generateurle.'/'.$dfile;}
                } else if (substr_replace($dfile,'',1)=='.') {
                    // Each leading ../ moves one directory up.
                    while (preg_match('/\.\.\/(.*)/i', $dfile)) {
                        $dfile=substr_replace($dfile,'',0,3);
                        $generateurle=preg_replace('/(.*)\/[^\/]+/i', "$1", $generateurle);
                    }
                    if (domain($generateurle)==domain($generateurle.'/'.$dfile)) {
                        if (isset($formats[$format]) || substr($generateurle.'/'.$dfile,-1)=='/'
                            || substr_count(basename($generateurle.'/'.$dfile),'.')==0) {
                            $datab[]=$generateurle.'/'.$dfile;
                        }
                    }
                } else {
                    if (domain($generateurle)==domain($generateurle.'/'.$dfile)) {
                        if (isset($formats[$format]) || substr($generateurle.'/'.$dfile,-1)=='/'
                            || substr_count(basename($generateurle.'/'.$dfile),'.')==0) {
                            $datab[]=$generateurle.'/'.$dfile;
                        }
                    }
                }
            } else {
                // Absolute link: keep it only if it points at the same domain.
                if (domain($generateurle)==domain($dfile)) {
                    if (isset($formats[$format]) || substr($dfile,-1)=='/' || substr_count(basename($dfile),'.')==0) {
                        $datab[]=$dfile;
                    }
                }
            }
            unset($format);
        }
        unset($datac);
        unset($dfile);
        return $datab;
    }

    //=============================================
    /* Modify only code between these two lines and $formats variable above. */

    function generate($url) {
        echo $url.'<br>';
        global $f_data; //Data of file contents
        //do something with webpage $f_data.
        unset($f_data);
    }

    //=============================================
    // Below is what actually processes the search engine
    $sites=array();
    $sites[]=stripslashes($_POST['site']);
    // Numeric keys form the crawl queue; URL string keys mark pages already queued.
    for ($i=0;isset($sites[$i]);$i++) {
        foreach (getlinks(stripslashes($sites[$i])) AS $val) {
            if (!isset($sites[$val])) {
                $sites[]=$val;
                $sites[$val]=true;
            }
        }
        unset($val);
        if (url_exists($sites[$i])) {
            generate($sites[$i]);
            flush();
        }
    }
}
?>
Be warned: bots like this can chew through a lot of CPU and bandwidth. Good luck.
Try not to bump 10 year old threads as it can be really annoying.
http://syntax.cwarn23.net/
Smilies: ^_* +_+ v_v -_- *~*`
My favourite PC. - Oopy Doopy Do 2U2!
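
On the CPU and bandwidth warning: one common way to keep a crawler in check is to cap the number of pages it visits and to pause between requests. The sketch below is not cwarn23's script. `crawl_limited`, `$maxPages`, and `$delaySeconds` are hypothetical names, and the link extractor is passed in as a callable so that any function returning an array of URLs (such as `getlinks` above) could be plugged in.

```php
<?php
/**
 * Breadth-first crawl that stops after $maxPages pages.
 * $getLinks is any callable that takes a URL and returns an array of URLs.
 */
function crawl_limited(string $startUrl, callable $getLinks, int $maxPages = 50, int $delaySeconds = 1): array
{
    $queue   = array($startUrl);
    $seen    = array($startUrl => true); // URLs already queued, so nothing is visited twice
    $visited = array();

    while ($queue && count($visited) < $maxPages) {
        $url = array_shift($queue);
        $visited[] = $url;

        foreach ($getLinks($url) as $link) {
            if (!isset($seen[$link])) {
                $seen[$link] = true;
                $queue[] = $link;
            }
        }

        if ($delaySeconds > 0) {
            sleep($delaySeconds); // be polite to the remote server
        }
    }
    return $visited;
}
```

A call such as `crawl_limited('http://example.com/', 'getlinks', 100, 2)` would then visit at most 100 pages with a two-second pause after each one; both numbers are arbitrary starting points.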

#5  MDanz, Sep 23rd, 2009
OK, here is my start. It's basically a quick add, not a spider.

<?php
// create curl resource
$ch = curl_init();

// set url
curl_setopt($ch, CURLOPT_URL, "www.realgm.com");

// return the transfer as a string
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);

// $output contains the output string
$output = curl_exec($ch);

// close curl resource to free up system resources
curl_close($ch);
?>

Can someone help me adjust this code so I get the <title> and <head>?

#6  cwarn23, Sep 23rd, 2009
Try running this on the $output variable:

preg_match_all('#<head>.*(<title>.*</title>|).*</head>#',$output,$header);
echo '<xmp>';
print_r($header);
echo '</xmp>';
Last edited by cwarn23; Sep 23rd, 2009 at 5:41 pm.

#7  MDanz, Sep 24th, 2009
Like this?

<?php
// create curl resource
$ch = curl_init();

// set url
curl_setopt($ch, CURLOPT_URL, "www.realgm.com");

// return the transfer as a string
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);

// $output contains the output string
$output = curl_exec($ch);

preg_match_all('#<head>.*(<title>.*</title>|).*</head>#',$output,$header);

echo '<xmp>';
print_r($header);
echo '</xmp>';

// close curl resource to free up system resources
curl_close($ch);
?>

I tried this and it prints:

Array ( [0] => Array ( ) [1] => Array ( ) )

#8  cwarn23, Sep 24th, 2009
My previous code was off the top of my head. I have now tested it, and it should be as follows:

<?php
// create curl resource
$ch = curl_init();

// set url
curl_setopt($ch, CURLOPT_URL, "www.realgm.com");

// return the transfer as a string
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);

// $output contains the output string
$output = curl_exec($ch);

// U = ungreedy, s = "." also matches newlines, i = case-insensitive
preg_match_all('#<head>.*<title>(.*)</title>.*</head>#Usi',$output,$header);

echo '<xmp>';
print_r($header);
echo '</xmp>';

// close curl resource to free up system resources
curl_close($ch);
?>
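
Why the earlier pattern returned empty arrays while this one works most likely comes down to the modifiers. Without `s`, the `.` does not match newlines, so a pattern that has to span a multi-line `<head>` cannot match. `i` makes the match case-insensitive, and `U` makes the quantifiers ungreedy. Here is a small self-contained check on a made-up HTML string (not the real realgm.com page):

```php
<?php
// Made-up sample page; a real <head> almost always spans several lines.
$sample = "<html>\n<head>\n<title>Example Page</title>\n"
        . "<meta name=\"description\" content=\"demo\">\n</head>\n<body></body>\n</html>";

// Pattern from the first attempt: no modifiers, so "." stops at every newline.
$without = preg_match_all('#<head>.*(<title>.*</title>|).*</head>#', $sample, $a);

// Pattern from the tested version, with the U, s and i modifiers.
$with = preg_match_all('#<head>.*<title>(.*)</title>.*</head>#Usi', $sample, $b);

echo "matches without modifiers: $without\n";
echo "matches with Usi: $with\n";
echo "title: " . $b[1][0] . "\n";
```

On this sample the first pattern finds no match and the second captures "Example Page" in `$b[1][0]`.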

#9  MDanz, Sep 24th, 2009
Hi, thanks, I got it working. Just one more thing.

How do I echo a single part of this? I just want the <title></title>; how do I echo that alone?

<title>RealGM: Sports Is Our Business</title>

<meta name="description" content="Real GM">

<meta name="keywords" content="trade checker, draft simulator, nba, simulator, nba news, nba trades, wiretap, nba transactions, nba draft, nba salaries, basketball, rumors, sports, jordan, hill, carter, shaq, mcgrady, kobe, duncan, kidd, garnett, payton, lebron, carmelo, wade, bosh, hawks, celtics, hornets, bulls, cavs, mavericks, nuggets, pistons, warriors, rockets, pacers, lakers, clippers, heat, bucks, timberwolves, nets, knicks, magic, trailblazers, suns, kings, supersonics, spurs, raptors, jazz, grizzlies, wizards, collective bargaining agreement, trade, sign, free agent, renounce, waive, realgm, general manager, gm">

<meta NAME="description" CONTENT="The only site on the web that allows you to sign, trade, waive, and renouce players from NBA teams. Come along and see what it is like to be a GM of a NBA team. All based on the real rules that the big boys must play by."> <meta http-equiv="Content-Style-Type" content="text/css">



By the way, are all websites built like this, with keywords, description, and title?
Last edited by MDanz; Sep 24th, 2009 at 2:50 pm.

#10  cwarn23, Sep 24th, 2009
In my script, use echo $header[1][0]; to print just the title, or echo $header[0][0]; to print the entire header. Also, virtually all web pages have a title tag, but not all pages have the meta tags.
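
Since not every page has the meta tags, a parser-based approach that returns null for missing pieces can be easier to work with than indexing into regex matches. This is an alternative to the regex above, not what the script in this thread does. It assumes PHP's DOM extension is enabled, and `page_summary` is a hypothetical name.

```php
<?php
// Hypothetical helper: title, description and keywords of a page, null where missing.
function page_summary(string $html): array
{
    $summary = array('title' => null, 'description' => null, 'keywords' => null);
    if ($html === '') {
        return $summary; // nothing to parse
    }

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // real-world HTML is rarely valid; collect warnings quietly
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $titles = $doc->getElementsByTagName('title');
    if ($titles->length > 0) {
        $summary['title'] = trim($titles->item(0)->textContent);
    }

    foreach ($doc->getElementsByTagName('meta') as $meta) {
        $name = strtolower($meta->getAttribute('name'));
        // Keep the first value seen for each name; pages sometimes repeat a tag.
        if (($name === 'description' || $name === 'keywords') && $summary[$name] === null) {
            $summary[$name] = $meta->getAttribute('content');
        }
    }
    return $summary;
}
```

With the cURL code above, usage would be along the lines of `$info = page_summary($output); echo $info['title'];`, checking each entry for null before storing it.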
©2003 - 2009 DaniWeb® LLC