Sep 22nd, 2009

webcrawler help

I'm trying to create a simple web spider/crawler to populate my search engine.

    <form action="crawl.php" method="get">
      <center>
        <span class="style1">Crawl Website</span>
        <input name="search" type="text" value="" size="25" />
        <input type="submit" name="submit" value="Go!" />
      </center>
    </form>

I type in the name of a website, then extract the <title> and <header> from that website.

Then I store the results in MySQL.

I need some direction on how to do the bolded part; I can't find a tutorial on this.
Last edited by MDanz; Sep 22nd, 2009 at 4:02 pm.
Posted by MDanz (Junior Poster; reputation points: 10; solved threads: 0; 146 posts since Sep 2009)

Sep 22nd, 2009

Re: webcrawler help

You would need to start with fopen, cURL, or any other HTTP request function. You'll then have to parse the returned HTML with a regular expression search to find the pieces of the header you're looking for.

Hope that helps.
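
A minimal sketch of the parsing step, assuming the page has already been fetched into a string (for example with file_get_contents() or cURL). The function name extract_title() and the sample HTML are made up for illustration. Regular expressions are fragile on real-world HTML, so treat this as a starting point rather than a robust parser.

```php
<?php
// Hypothetical helper: returns the text of the first <title> element, or null if none is found.
function extract_title(string $html): ?string
{
    // i = case-insensitive, s = let "." match newlines;
    // ".*?" is a lazy match, so it stops at the first </title>
    if (preg_match('#<title[^>]*>(.*?)</title>#is', $html, $m)) {
        return trim(html_entity_decode($m[1], ENT_QUOTES, 'UTF-8'));
    }
    return null;
}

// Sample input standing in for a fetched page
$html = "<html><head>\n<TITLE>Example &amp; Co</TITLE>\n</head><body></body></html>";
echo extract_title($html), "\n"; // prints: Example & Co
```

Returning null for a missing title lets the caller decide whether to skip the page or store a placeholder.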
Posted by kylegetson (Junior Poster in Training; reputation points: 26; solved threads: 12; 89 posts since Sep 2009)

Sep 23rd, 2009

Re: webcrawler help

I just love making bots. You can view my article at: http://www.syntax.cwarn23.info/PHP_M..._search_engine
The script is as follows:
    <form method="post">Scan site: <input type="text" name="site" value="http://" style="width:300px">
    <input value="Scan" type="submit"></form>
    <?php
    set_time_limit(0); // a crawl can run for a long time, so lift the execution time limit
    if (isset($_POST['site']) && !empty($_POST['site'])) {
        /* Formats Allowed */
        $formats=array('html'=>true,'htm'=>true,'xhtml'=>true,'xml'=>true,'mhtml'=>true,'xht'=>true,
            'mht'=>true,'asp'=>true,'aspx'=>true,'adp'=>true,'bml'=>true,'cfm'=>true,'cgi'=>true,
            'ihtml'=>true,'jsp'=>true,'las'=>true,'lasso'=>true,'lassoapp'=>true,'pl'=>true,'php'=>true,
            'php1'=>true,'php2'=>true,'php3'=>true,'php4'=>true,'php5'=>true,'php6'=>true,'phtml'=>true,
            'shtml'=>true,'search'=>true,'query'=>true,'forum'=>true,'blog'=>true,'1'=>true,'2'=>true,
            '3'=>true,'4'=>true,'5'=>true,'6'=>true,'7'=>true,'8'=>true,'9'=>true,'10'=>true,'11'=>true,
            '12'=>true,'13'=>true,'14'=>true,'15'=>true,'16'=>true,'17'=>true,'18'=>true,'19'=>true,
            '20'=>true,'01'=>true,'02'=>true,'03'=>true,'04'=>true,'05'=>true,'06'=>true,'07'=>true,
            '08'=>true,'09'=>true,'go'=>true,'page'=>true,'file'=>true);

        // Returns the scheme and host part of a URL
        function domain ($ddomain) {
            return preg_replace('/^((http(s)?:\/\/)?([^\/]+))(.*)/','$1',$ddomain);
        }

        // Sends a header-only request and reports whether the URL answers with HTTP 200
        function url_exists($durl)
        {
            // Version 4.x supported
            $handle = curl_init($durl);
            if (false === $handle)
            {
                return false;
            }
            curl_setopt($handle, CURLOPT_HEADER, true);
            curl_setopt($handle, CURLOPT_FAILONERROR, true); // this works
            curl_setopt($handle, CURLOPT_HTTPHEADER,
                array("User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.15) Gecko/20080623 Firefox/2.0.0.15") );
            curl_setopt($handle, CURLOPT_NOBODY, true);
            curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
            $connectable = curl_exec($handle);
            // Read the status code directly; searching the headers for the text "200 OK"
            // misses responses that omit the reason phrase (e.g. HTTP/2)
            $status = curl_getinfo($handle, CURLINFO_HTTP_CODE);
            curl_close($handle);
            return ($connectable !== false && $status == 200);
        }
        $f_data='';
        //below function will only get links within own domain and not links outside the site.
        function getlinks($generateurlf) {
            global $formats;
            global $f_data;
            $f_data=file_get_contents($generateurlf);
            $datac=$f_data;
            // collect every href/src attribute value on the page
            preg_match_all('/(href|src)\=(\"|\')([^\"\'\>]+)/i',$datac,$media);
            unset($datac);
            $datac=$media[3];
            unset($media);
            $datab=array();
            $str_start=array('http'=>true,'www.'=>true);
            foreach($datac AS $dfile) {
                $generateurle=$generateurlf;
                // file extension of the link target, used to filter by $formats
                $format=strtolower(preg_replace('/(.*)[.]([^.\?]+)(\?(.*))?/','$2',basename($generateurle.$dfile)));
                if (!isset($str_start[substr_replace($dfile,'',4)])) {
                    // relative link: strip the file name or trailing slash from the base URL
                    if (substr_replace($generateurle,'',0, -1)!=='/') {
                        $generateurle=preg_replace('/(.*)\/[^\/]+/is', "$1", $generateurle);
                    } else {
                        $generateurle=substr_replace($generateurle,'',-1);
                    }

                    if (substr_replace($dfile,'',1)=='/') {
                        if (domain($generateurle)==domain($generateurle.$dfile)) {
                            if (isset($formats[$format])
                                || substr($generateurle.$dfile,-1)=='/' || substr_count(basename($generateurle.$dfile),'.')==0) {
                                $datab[]=$generateurle.$dfile;
                            }
                        }
                    } else if (substr($dfile,0,2)=='./') {
                        $dfile=substr($dfile,2);
                        if (isset($formats[$format])) {$datab[]=$generateurle.'/'.$dfile;}
                    } else if (substr_replace($dfile,'',1)=='.') {
                        // resolve each leading "../" by going up one directory
                        while (preg_match('/\.\.\/(.*)/i', $dfile)) {
                            $dfile=substr_replace($dfile,'',0,3);
                            $generateurle=preg_replace('/(.*)\/[^\/]+/i', "$1", $generateurle);
                        }
                        if (domain($generateurle)==domain($generateurle.'/'.$dfile)) {
                            if (isset($formats[$format]) || substr($generateurle.'/'.$dfile,-1)=='/'
                                || substr_count(basename($generateurle.'/'.$dfile),'.')==0) {
                                $datab[]=$generateurle.'/'.$dfile;
                            }
                        }
                    } else {
                        if (domain($generateurle)==domain($generateurle.'/'.$dfile)) {
                            if (isset($formats[$format]) || substr($generateurle.'/'.$dfile,-1)=='/'
                                || substr_count(basename($generateurle.'/'.$dfile),'.')==0) {
                                $datab[]=$generateurle.'/'.$dfile;
                            }
                        }
                    }
                } else {
                    // absolute link: keep it only if it stays on the same domain
                    if (domain($generateurle)==domain($dfile)) {
                        if (isset($formats[$format]) || substr($dfile,-1)=='/' || substr_count(basename($dfile),'.')==0) {
                            $datab[]=$dfile;
                        }
                    }
                }
                unset($format);
            }
            unset($datac);
            unset($dfile);
            return $datab;
        }

        //=============================================
        /* Modify only code between these two lines and $formats variable above. */

        function generate($url) {
            echo htmlspecialchars($url).'<br>';
            global $f_data; //Data of file contents
            //do something with webpage $f_data.
            unset($f_data);
        }

        //=============================================
        // Below is what actually processes the search engine
        $sites=array();
        $sites[]=stripslashes($_POST['site']);
        for ($i=0;isset($sites[$i]);$i++) {
            foreach (getlinks(stripslashes($sites[$i])) AS $val) {
                if (!isset($sites[$val])) {
                    $sites[]=$val;
                    $sites[$val]=true;
                }
            } unset($val);
            if (url_exists($sites[$i])) {
                generate($sites[$i]);
                flush();
            }
        }
    }
    ?>
Be warned: bots like this can chew through a lot of CPU and bandwidth. Good luck.
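
One way to keep that cost bounded is to cap the number of pages visited. Stripped of its URL handling, the crawl loop in the script above is a breadth-first traversal with a "seen" set. The sketch below shows only that idea; crawl(), $fetchLinks, $maxPages, and the in-memory $site array are hypothetical names introduced here so the example runs without network access.

```php
<?php
// Breadth-first crawl. $fetchLinks is a hypothetical callable that returns
// the outgoing links of a URL; $maxPages caps how many pages are visited.
function crawl(string $start, callable $fetchLinks, int $maxPages = 100): array
{
    $queue   = [$start];          // URLs waiting to be visited, in discovery order
    $seen    = [$start => true];  // URLs already queued, so none is visited twice
    $visited = [];

    while ($queue && count($visited) < $maxPages) {
        $url = array_shift($queue);
        $visited[] = $url;
        foreach ($fetchLinks($url) as $link) {
            if (!isset($seen[$link])) {
                $seen[$link] = true;
                $queue[] = $link;
            }
        }
    }
    return $visited;
}

// Made-up site structure for illustration
$site = [
    'http://example.com/'  => ['http://example.com/a', 'http://example.com/b'],
    'http://example.com/a' => ['http://example.com/', 'http://example.com/b'],
    'http://example.com/b' => [],
];
$pages = crawl('http://example.com/', function ($url) use ($site) {
    return $site[$url] ?? [];
});
print_r($pages);
```

In a real crawler, $fetchLinks would wrap something like getlinks(), and a short delay between requests would ease the load on the target site.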
Posted by cwarn23 (Featured Poster, Sponsor; reputation points: 410; solved threads: 258; occupation: Genius; 3,004 posts since Sep 2007)

Sep 23rd, 2009

Re: webcrawler help

OK, here is my start... it's basically a quick add, not a spider.


    <?php
    // create curl resource
    $ch = curl_init();

    // set url
    curl_setopt($ch, CURLOPT_URL, "www.realgm.com");

    // return the transfer as a string
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);

    // $output contains the output string
    $output = curl_exec($ch);

    // close curl resource to free up system resources
    curl_close($ch);
    ?>
Can someone help me adjust this code so I get the <title> and <head>?
Posted by MDanz

Sep 23rd, 2009

Re: webcrawler help

Try using this on the $output variable:
    preg_match_all('#<head>.*(<title>.*</title>|).*</head>#',$output,$header);
    echo '<xmp>';
    print_r($header);
    echo '</xmp>';
Last edited by cwarn23; Sep 23rd, 2009 at 5:41 pm.
Posted by cwarn23

Sep 24th, 2009

Re: webcrawler help

Like this?


    <?php
    // create curl resource
    $ch = curl_init();

    // set url
    curl_setopt($ch, CURLOPT_URL, "www.realgm.com");

    // return the transfer as a string
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);

    // $output contains the output string
    $output = curl_exec($ch);

    preg_match_all('#<head>.*(<title>.*</title>|).*</head>#',$output,$header);

    echo '<xmp>';

    print_r($header);

    echo '</xmp>';

    // close curl resource to free up system resources
    curl_close($ch);
    ?>

I tried this and it prints:

Array ( [0] => Array ( ) [1] => Array ( ) )
Posted by MDanz

Sep 24th, 2009

Re: webcrawler help

My previous code was off the top of my head, but I have now tested it and it should be as follows:
    <?php
    // create curl resource
    $ch = curl_init();

    // set url
    curl_setopt($ch, CURLOPT_URL, "www.realgm.com");

    // return the transfer as a string
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);

    // $output contains the output string
    $output = curl_exec($ch);
    preg_match_all('#<head>.*<title>(.*)</title>.*</head>#Usi',$output,$header);

    echo '<xmp>';
    print_r($header);
    echo '</xmp>';

    // close curl resource to free up system resources
    curl_close($ch);
    ?>
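
The earlier pattern most likely returned empty arrays because, without the s modifier, "." does not match newlines, so a <head> that spans several lines never matches. The comparison below runs both patterns against a made-up multi-line string standing in for the cURL $output; the variable names $plain, $fixed, $a, and $b are introduced only for this illustration.

```php
<?php
// Made-up multi-line page, standing in for the cURL $output
$output = "<html>\n<head>\n<title>Demo page</title>\n"
        . "<meta name=\"description\" content=\"x\">\n</head>\n</html>";

// No modifiers: "." stops at every newline, so the multi-line <head> cannot match
$plain = preg_match_all('#<head>.*(<title>.*</title>|).*</head>#', $output, $a);

// U = ungreedy quantifiers, s = "." also matches newlines, i = case-insensitive
$fixed = preg_match_all('#<head>.*<title>(.*)</title>.*</head>#Usi', $output, $b);

var_dump($plain);     // int(0)
var_dump($fixed);     // int(1)
echo $b[1][0], "\n";  // Demo page
```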
Posted by cwarn23

Sep 24th, 2009

Re: webcrawler help

Hi, thanks, got it working. Just one more thing:

How do I echo a single part of this? I just want the <title></title>; how do I echo that alone?

<title>RealGM: Sports Is Our Business</title>

<meta name="description" content="Real GM">

<meta name="keywords" content="trade checker, draft simulator, nba, simulator, nba news, nba trades, wiretap, nba transactions, nba draft, nba salaries, basketball, rumors, sports, jordan, hill, carter, shaq, mcgrady, kobe, duncan, kidd, garnett, payton, lebron, carmelo, wade, bosh, hawks, celtics, hornets, bulls, cavs, mavericks, nuggets, pistons, warriors, rockets, pacers, lakers, clippers, heat, bucks, timberwolves, nets, knicks, magic, trailblazers, suns, kings, supersonics, spurs, raptors, jazz, grizzlies, wizards, collective bargaining agreement, trade, sign, free agent, renounce, waive, realgm, general manager, gm">

<meta NAME="description" CONTENT="The only site on the web that allows you to sign, trade, waive, and renouce players from NBA teams. Come along and see what it is like to be a GM of a NBA team. All based on the real rules that the big boys must play by."> <meta http-equiv="Content-Style-Type" content="text/css">



By the way, are all websites built like this, with keywords, description, and title?
Last edited by MDanz; Sep 24th, 2009 at 2:50 pm.
Posted by MDanz

Sep 24th, 2009

Re: webcrawler help

In my script, to echo the title simply use echo $header[1][0]; to echo the entire header, use echo $header[0][0]. Also, virtually all webpages have the title tag, but not all pages have the meta tags.
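
Because a page may lack a title, or a head block altogether, it can help to check that the index exists before echoing it. A small sketch, assuming the same pattern as in the script; page_title() is a hypothetical wrapper and the sample strings are made up.

```php
<?php
// Hypothetical wrapper: returns the title text, or null when the pattern finds nothing.
function page_title(string $output): ?string
{
    preg_match_all('#<head>.*<title>(.*)</title>.*</head>#Usi', $output, $header);
    // $header[0] holds the full matches (the whole <head> block),
    // $header[1] holds the first capture group (the text inside <title>)
    return isset($header[1][0]) ? trim($header[1][0]) : null;
}

$withTitle    = "<head>\n<title>RealGM: Sports Is Our Business</title>\n</head>";
$withoutTitle = "<head>\n<meta name=\"description\" content=\"Real GM\">\n</head>";

echo page_title($withTitle) ?? '(no title found)', "\n";
echo page_title($withoutTitle) ?? '(no title found)', "\n";
```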
Posted by cwarn23
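
The thread never reaches the last step of the original question, storing the extracted title in MySQL. A minimal sketch of that step using PDO with a prepared statement follows. Everything named here is an assumption for illustration: the save_page() helper, the pages table and its columns, and the connection details in the commented usage lines.

```php
<?php
// Hypothetical helper: stores one crawled page. Assumes a table created with something like
//   CREATE TABLE pages (url VARCHAR(255) PRIMARY KEY, title VARCHAR(255));
function save_page(PDO $pdo, string $url, string $title): void
{
    // A prepared statement keeps page content from being interpreted as SQL
    $stmt = $pdo->prepare('INSERT INTO pages (url, title) VALUES (:url, :title)');
    $stmt->execute([':url' => $url, ':title' => $title]);
}

// Example usage (placeholder host, database name, and credentials):
// $pdo = new PDO('mysql:host=localhost;dbname=search;charset=utf8mb4', 'user', 'password',
//                [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]);
// save_page($pdo, 'http://www.realgm.com', 'RealGM: Sports Is Our Business');
```

Binding the values as parameters matters here because titles come from arbitrary web pages and may contain quotes.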
