I'm trying to create a simple web spider/crawler to populate my search engine.

<form action="crawl.php" method="get">
 <center>
   <span class="style1">Crawl Website</span> 
   <input name="search" type="text" value='' size="25" />
   <input type="submit" name="submit" value="Go!">

</center>

</form>

I type in the name of a website and extract the <title> and <head> from it.

Then I store them in MySQL.


I need some direction on how to do this; I can't find a tutorial on it.

0

You would need to start with fopen, cURL, or any other HTTP request function. You'll then have to parse the returned HTML, using a regular expression search to find the pieces of the header you're looking for.

Hope that helps.
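
The two steps above (fetch, then parse) could be sketched roughly like this. It is only a minimal sketch: fetch_page() and extract_title() are hypothetical helper names, not part of PHP or of anyone's script in this thread, and the timeout and redirect settings are assumptions you may want to change.

```php
<?php
// Hypothetical helper: fetch a page body with cURL, or return null on failure.
function fetch_page($url) {
    $ch = curl_init($url);
    if ($ch === false) {
        return null;
    }
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects (assumed to be wanted)
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // give up after 10 seconds (arbitrary choice)
    $html = curl_exec($ch);
    curl_close($ch);
    return ($html === false) ? null : $html;
}

// Hypothetical helper: return the text of the first <title> element, or null if there is none.
function extract_title($html) {
    // s: let "." match newlines; i: ignore tag case; .*? stops at the first </title>
    if (preg_match('#<title[^>]*>(.*?)</title>#si', $html, $m)) {
        return trim($m[1]);
    }
    return null;
}
```

You would call it as $html = fetch_page('http://www.example.com/'); and then, if $html is not null, echo extract_title($html);. A regular expression is fine for pulling out one tag like this, but it will not cope with every malformed page.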

0

I just love making bots. You can view my article at: http://www.syntax.cwarn23.info/PHP_Making_a_search_engine
The script is as follows:

<form method="post">Scan site: <input type="text" name="site" value="http://" style="width:300px">
<input value="Scan" type="submit"></form>
<?php
set_time_limit (0);
if (isset($_POST['site']) && !empty($_POST['site'])) {
/* Formats Allowed */
$formats=array('html'=>true,'htm'=>true,'xhtml'=>true,'xml'=>true,'mhtml'=>true,'xht'=>true,
'mht'=>true,'asp'=>true,'aspx'=>true,'adp'=>true,'bml'=>true,'cfm'=>true,'cgi'=>true,
'ihtml'=>true,'jsp'=>true,'las'=>true,'lasso'=>true,'lassoapp'=>true,'pl'=>true,'php'=>true,
'php1'=>true,'php2'=>true,'php3'=>true,'php4'=>true,'php5'=>true,'php6'=>true,'phtml'=>true,
'shtml'=>true,'search'=>true,'query'=>true,'forum'=>true,'blog'=>true,'1'=>true,'2'=>true,
'3'=>true,'4'=>true,'5'=>true,'6'=>true,'7'=>true,'8'=>true,'9'=>true,'10'=>true,'11'=>true,
'12'=>true,'13'=>true,'14'=>true,'15'=>true,'16'=>true,'17'=>true,'18'=>true,'19'=>true,
'20'=>true,'01'=>true,'02'=>true,'03'=>true,'04'=>true,'05'=>true,'06'=>true,'07'=>true,
'08'=>true,'09'=>true,'go'=>true,'page'=>true,'file'=>true);
 
function domain ($ddomain) {
return preg_replace('/^((http(s)?:\/\/)?([^\/]+))(.*)/','$1',$ddomain);
}
 
function url_exists($durl)
		{
		// Version 4.x supported
		$handle   = curl_init($durl);
		if (false === $handle)
			{
			return false;
			}
		curl_setopt($handle, CURLOPT_HEADER, true);
		curl_setopt($handle, CURLOPT_FAILONERROR, true);  // this works
		curl_setopt($handle, CURLOPT_HTTPHEADER, 
Array("User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.15) Gecko/20080623 Firefox/2.0.0.15") );
		curl_setopt($handle, CURLOPT_NOBODY, true);
		curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
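		$connectable = curl_exec($handle);
		// Read the status code directly; the "200 OK" text is not present in every response (e.g. HTTP/2)
		$status = curl_getinfo($handle, CURLINFO_HTTP_CODE);
		curl_close($handle);
		return ($connectable !== false && $status == 200);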
		}
 $f_data='';
//below function will only get links within own domain and not links outside the site.
function getlinks($generateurlf) {
    global $formats;
    global $f_data;
    $f_data=file_get_contents($generateurlf);
    $datac=$f_data;
    preg_match_all('/(href|src)\=(\"|\')([^\"\'\>]+)/i',$datac,$media);
    unset($datac);
    $datac=$media[3];
    unset($media);
    $datab=array();
    $str_start=array('http'=>true,'www.'=>true);
    foreach($datac AS $dfile) {
        $generateurle=$generateurlf;
		$format=strtolower(preg_replace('/(.*)[.]([^.\?]+)(\?(.*))?/','$2',basename($generateurle.$dfile)));
        if (!isset($str_start[substr_replace($dfile,'',4)])) {
            if (substr_replace($generateurle,'',0, -1)!=='/') {
                $generateurle=preg_replace('/(.*)\/[^\/]+/is', "$1", $generateurle);
                } else {
                $generateurle=substr_replace($generateurle,'',-1);
                }
 
            if (substr_replace($dfile,'',1)=='/') {
                if (domain($generateurle)==domain($generateurle.$dfile)) {
                    if (isset($formats[$format]) 
                        || substr($generateurle.$dfile,-1)=='/' || substr_count(basename($generateurle.$dfile),'.')==0) {
                        $datab[]=$generateurle.$dfile;
                        }
                    }
                } else if (substr($dfile,0,2)=='./') {
                $dfile=substr($dfile,2);
                if (isset($formats[$format])) {$datab[]=$generateurle.'/'.$dfile;}
                } else if (substr_replace($dfile,'',1)=='.') {
                while (preg_match('/\.\.\/(.*)/i', $dfile)) {
                $dfile=substr_replace($dfile,'',0,3);
                $generateurle=preg_replace('/(.*)\/[^\/]+/i', "$1", $generateurle);
                }
                if (domain($generateurle)==domain($generateurle.'/'.$dfile)) {
                    if (isset($formats[$format]) || substr($generateurle.'/'.$dfile,-1)=='/' 
                        || substr_count(basename($generateurle.'/'.$dfile),'.')==0) {
                        $datab[]=$generateurle.'/'.$dfile;
                        }
                    }
                } else {
                if (domain($generateurle)==domain($generateurle.'/'.$dfile)) {
                    if (isset($formats[$format]) || substr($generateurle.'/'.$dfile,-1)=='/' 
                        || substr_count(basename($generateurle.'/'.$dfile),'.')==0) {
                        $datab[]=$generateurle.'/'.$dfile;
                        }
                    }
                }
            } else {
            if (domain($generateurle)==domain($dfile)) {
                if (isset($formats[$format]) || substr($dfile,-1)=='/' || substr_count(basename($dfile),'.')==0) {
                    $datab[]=$dfile;
                    }
                }
            }
		unset($format);
        }
    unset($datac);
    unset($dfile);
    return $datab;
    }
 
 
 
 
 
//=============================================
/* Modify only code between these two lines and $formats variable above. */
 
function generate($url) {
    echo $url.'<br>';
    global $f_data; //Data of file contents
    //do something with webpage $f_data.
    unset($f_data);
    }
 
 
//=============================================
// Below is what actually processes the search engine
$sites=array();
$sites[]=stripslashes($_POST['site']);
for ($i=0;isset($sites[$i]);$i++) {
    foreach (getlinks(stripslashes($sites[$i])) AS $val) {
        if (!isset($sites[$val])) {
            $sites[]=$val;
            $sites[$val]=true;
            }
        } unset($val);
    if (url_exists($sites[$i])) {
        generate($sites[$i]);
        flush();
        }
    }
}
?>

Be warned: bots like this can chew up a lot of CPU and bandwidth. Good luck.

0

OK, here is my start. It's basically a quick add, not a spider.

<?php
        // create curl resource
        $ch = curl_init();

        // set url
        curl_setopt($ch, CURLOPT_URL, "www.realgm.com");

        //return the transfer as a string
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);

        // $output contains the output string
        $output = curl_exec($ch);

        // close curl resource to free up system resources
        curl_close($ch);     
?>

Can someone help me adjust this code so I get the <title> and <head>?

0

Try using this on the $output variable:

preg_match_all('#<head>.*(<title>.*</title>|).*</head>#',$output,$header);
echo '<xmp>';
print_r($header);
echo '</xmp>';

0

Like this?

<?php
        // create curl resource
        $ch = curl_init();

        // set url
        curl_setopt($ch, CURLOPT_URL, "www.realgm.com");

        //return the transfer as a string
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);

        // $output contains the output string
        $output = curl_exec($ch);

        preg_match_all('#<head>.*(<title>.*</title>|).*</head>#',$output,$header);

        echo '<xmp>';
        print_r($header);
        echo '</xmp>';
      
         // close curl resource to free up system resources
        curl_close($ch);
      ?>

I tried this and it prints:

Array ( [0] => Array ( ) [1] => Array ( ) )

0

My previous code was off the top of my head. I have now tested it, and it should be as follows:

<?php
        // create curl resource
        $ch = curl_init();

        // set url
        curl_setopt($ch, CURLOPT_URL, "www.realgm.com");

        //return the transfer as a string
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);

        // $output contains the output string
        $output = curl_exec($ch);
      preg_match_all('#<head>.*<title>(.*)</title>.*</head>#Usi',$output,$header);

      echo '<xmp>';
      print_r($header);
      echo '</xmp>';
      
         // close curl resource to free up system resources
        curl_close($ch);
      ?>
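
For anyone wondering why the earlier pattern returned empty arrays: most likely because it had no modifiers, so "." stopped at every line break and the tags had to be lowercase. The small sketch below compares the two patterns on a made-up sample page ($sample is invented for illustration, not the real site's HTML).

```php
<?php
// Made-up sample page; real pages usually spread <head> over several lines too.
$sample = "<html>\n<HEAD>\n<title>Example Site</title>\n"
        . "<meta name=\"description\" content=\"demo\">\n</HEAD>\n<body></body>\n</html>";

// Earlier pattern: no modifiers, so "." never crosses a newline and <HEAD> is not matched.
$plain = preg_match_all('#<head>.*(<title>.*</title>|).*</head>#', $sample, $before);

// Tested pattern: U = ungreedy, s = "." also matches newlines, i = case-insensitive.
$fixed = preg_match_all('#<head>.*<title>(.*)</title>.*</head>#Usi', $sample, $after);

echo "without modifiers: $plain match(es)\n";
echo "with Usi: $fixed match(es)\n";
```

Note that the pattern still expects a bare <head> tag, so a page that writes <head profile="..."> would need something like <head[^>]*> instead.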
0

Hi, thanks, I got it working. Just one more thing.

How do I echo a single part of this?

I just want the <title></title>, so how do I echo that alone?

<title>RealGM: Sports Is Our Business</title>

<meta name="description" content="Real GM">

<meta name="keywords" content="trade checker, draft simulator, nba, simulator, nba news, nba trades, wiretap, nba transactions, nba draft, nba salaries, basketball, rumors, sports, jordan, hill, carter, shaq, mcgrady, kobe, duncan, kidd, garnett, payton, lebron, carmelo, wade, bosh, hawks, celtics, hornets, bulls, cavs, mavericks, nuggets, pistons, warriors, rockets, pacers, lakers, clippers, heat, bucks, timberwolves, nets, knicks, magic, trailblazers, suns, kings, supersonics, spurs, raptors, jazz, grizzlies, wizards, collective bargaining agreement, trade, sign, free agent, renounce, waive, realgm, general manager, gm">

<meta NAME="description" CONTENT="The only site on the web that allows you to sign, trade, waive, and renouce players from NBA teams. Come along and see what it is like to be a GM of a NBA team. All based on the real rules that the big boys must play by."> <meta http-equiv="Content-Style-Type" content="text/css">

By the way, are all websites built like this, with keywords, a description, and a title?

0

In my script, to echo the title, simply use echo $header[1][0]; and to echo the entire header, use echo $header[0][0]; instead. Also, virtually all web pages have the title tag, but not all pages have the meta tags.
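
As a rough illustration of those two indexes, here is a minimal sketch that runs the same pattern on a made-up $output string (standing in for what curl_exec() returns) and checks that a match exists before echoing it. The isset() guards are an addition for pages where nothing matches.

```php
<?php
// Made-up sample input standing in for the $output returned by curl_exec().
$output = "<head>\n<title>RealGM: Sports Is Our Business</title>\n"
        . "<meta name=\"description\" content=\"Real GM\">\n</head>";

preg_match_all('#<head>.*<title>(.*)</title>.*</head>#Usi', $output, $header);

// $header[0] holds the full matches; $header[1] holds the first capture group (the title text).
$title = isset($header[1][0]) ? $header[1][0] : null;
$head  = isset($header[0][0]) ? $header[0][0] : null;

if ($title !== null) {
    echo $title, "\n";
} else {
    echo "No <title> found\n";
}
```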
