webcrawler help

Question

MDanz 0 Junior Poster

15 Years Ago

just a simple web spider/crawler i'm trying to create to populate my search engine.

<form action="crawl.php" method="get">
 <center>
   <span class="style1">Crawl Website</span> 
   <input name="search" type="text" value='' size="25" />
   <input type="submit" name="submit" value="Go!">

</center>

</form>

i type the name of a website, i extract the <title>, <header> from the website.

then i store in mysql.

i need some direction on how to do the bolded, i can't find a tutorial on this

php

Edited 15 Years Ago by MDanz because: n/a

3 Contributors
8 Replies
162 Views
2 Days Discussion Span
Latest Post 15 Years Ago Latest Post by cwarn23

All 8 Replies

cwarn23 387 Occupation: Genius

15 Years Ago

I just love making bots. You can view my article at: http://www.syntax.cwarn23.info/PHP_Making_a_search_engine
The script is as follows:

<form method="post">Scan site: <input type="text" name="site" value="http://" style="width:300px">
<input value="Scan" type="submit"></form>
<?
set_time_limit (0);
if (isset($_POST['site']) && !empty($_POST['site'])) {
/* Formats Allowed */
$formats=array('html'=>true,'htm'=>true,'xhtml'=>true,'xml'=>true,'mhtml'=>true,'xht'=>true,
'mht'=>true,'asp'=>true,'aspx'=>true,'adp'=>true,'bml'=>true,'cfm'=>true,'cgi'=>true,
'ihtml'=>true,'jsp'=>true,'las'=>true,'lasso'=>true,'lassoapp'=>true,'pl'=>true,'php'=>true,
'php1'=>true,'php2'=>true,'php3'=>true,'php4'=>true,'php5'=>true,'php6'=>true,'phtml'=>true,
'shtml'=>true,'search'=>true,'query'=>true,'forum'=>true,'blog'=>true,'1'=>true,'2'=>true,
'3'=>true,'4'=>true,'5'=>true,'6'=>true,'7'=>true,'8'=>true,'9'=>true,'10'=>true,'11'=>true,
'12'=>true,'13'=>true,'14'=>true,'15'=>true,'16'=>true,'17'=>true,'18'=>true,'19'=>true,
'20'=>true,'01'=>true,'02'=>true,'03'=>true,'04'=>true,'05'=>true,'06'=>true,'07'=>true,
'08'=>true,'09'=>true,'go'=>true,'page'=>true,'file'=>true);
 
function domain ($ddomain) {
return preg_replace('/^((http(s)?:\/\/)?([^\/]+))(.*)/','$1',$ddomain);
}
 
function url_exists($durl)
		{
		// Version 4.x supported
		$handle   = curl_init($durl);
		if (false === $handle)
			{
			return false;
			}
		curl_setopt($handle, CURLOPT_HEADER, true);
		curl_setopt($handle, CURLOPT_FAILONERROR, true);  // this works
		curl_setopt($handle, CURLOPT_HTTPHEADER, 
Array("User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.15) Gecko/20080623 Firefox/2.0.0.15") );
		curl_setopt($handle, CURLOPT_NOBODY, true);
		curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
		$connectable = curl_exec($handle);
		curl_close($handle);  
        if (stripos(substr_replace($connectable,'',30),'200 OK')) {
            return true;
            } else {
            return false;
            }
		}
 $fdata='';
//below function will only get links within own domain and not links outside the site.
function getlinks($generateurlf) {
    global $formats;
    global $f_data;
    $f_data=file_get_contents($generateurlf);
    $datac=$f_data;
    preg_match_all('/(href|src)\=(\"|\')([^\"\'\>]+)/i',$datac,$media);
    unset($datac);
    $datac=$media[3];
    unset($media);
    $datab=array();
    $str_start=array('http'=>true,'www.'=>true);
    foreach($datac AS $dfile) {
        $generateurle=$generateurlf;
		$format=strtolower(preg_replace('/(.*)[.]([^.\?]+)(\?(.*))?/','$2',basename($generateurle.$dfile)));
        if (!isset($str_start[substr_replace($dfile,'',4)])) {
            if (substr_replace($generateurle,'',0, -1)!=='/') {
                $generateurle=preg_replace('/(.*)\/[^\/]+/is', "$1", $generateurle);
                } else {
                $generateurle=substr_replace($generateurle,'',-1);
                }
 
            if (substr_replace($dfile,'',1)=='/') {
                if (domain($generateurle)==domain($generateurle.$dfile)) {
                    if (isset($formats[$format]) 
                        || substr($generateurle.$dfile,-1)=='/' || substr_count(basename($generateurle.$dfile),'.')==0) {
                        $datab[]=$generateurle.$dfile;
                        }
                    }
                } else if (substr($dfile,0,2)=='./') {
                $dfile=substr($dfile,2);
                if (isset($formats[$format])) {$datab[]=$generateurle.'/'.$dfile;}
                } else if (substr_replace($dfile,'',1)=='.') {
                while (preg_match('/\.\.\/(.*)/i', $dfile)) {
                $dfile=substr_replace($dfile,'',0,3);
                $generateurle=preg_replace('/(.*)\/[^\/]+/i', "$1", $generateurle);
                }
                if (domain($generateurle)==domain($generateurle.'/'.$dfile)) {
                    if (isset($formats[$format]) || substr($generateurle.'/'.$dfile,-1)=='/' 
                        || substr_count(basename($generateurle.'/'.$dfile),'.')==0) {
                        $datab[]=$generateurle.'/'.$dfile;
                        }
                    }
                } else {
                if (domain($generateurle)==domain($generateurle.'/'.$dfile)) {
                    if (isset($formats[$format]) || substr($generateurle.'/'.$dfile,-1)=='/' 
                        || substr_count(basename($generateurle.'/'.$dfile),'.')==0) {
                        $datab[]=$generateurle.'/'.$dfile;
                        }
                    }
                }
            } else {
            if (domain($generateurle)==domain($dfile)) {
                if (isset($formats[$format]) || substr($dfile,-1)=='/' || substr_count(basename($dfile),'.')==0) {
                    $datab[]=$dfile;
                    }
                }
            }
		unset($format);
        }
    unset($datac);
    unset($dfile);
    return $datab;
    }
 
 
 
 
 
//=============================================
/* Modify only code between these two lines and $formats variable above. */
 
function generate($url) {
    echo $url.'<br>';
    global $f_data; //Data of file contents
    //do something with webpage $f_data.
    unset($f_data);
    }
 
 
//=============================================
// Below is what actually process the search engine
$sites=array();
$sites[]=stripslashes($_POST['site']);
for ($i=0;isset($sites[$i]);$i++) {
    foreach (getlinks(stripslashes($sites[$i])) AS $val) {
        if (!isset($sites[$val])) {
            $sites[]=$val;
            $sites[$val]=true;
            }
        } unset($val);
    if (url_exists($sites[$i])) {
        generate($sites[$i]);
        flush();
        }
    }
}
?>

Be warned they can chew a lot of cpu and bandwidth. Good luck.

cwarn23 387 Occupation: Genius

15 Years Ago

Try using this on this output variable:

preg_match_all('#<head>.*(<title>.*</title>|).*</head>#',$output,$header);
echo '<xmp>';
print_r($header);
echo '</xmp>';

Edited 15 Years Ago by cwarn23 because: n/a

Reply to this topic

Be a part of the DaniWeb community

We're a friendly, industry-focused community of developers, IT pros, digital marketers, and technology enthusiasts meeting, networking, learning, and sharing knowledge.

kylegetson 16 Junior Poster in Training · Answer 1 · 2009-09-23T08:39:46+00:00

You would need to start with fopen, curl, or any other http request function. Youll then have to parse the html returned using a regular expression search to find the pieces of the header your looking for.

hope that helps.

MDanz 0 Junior Poster · Answer 2 · 2009-09-23T21:40:54+00:00

ok here is my start... its basicaly a quick add not a spider.

<?php
        // create curl resource
        $ch = curl_init();

        // set url
        curl_setopt($ch, CURLOPT_URL, "www.realgm.com");

        //return the transfer as a string
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);

        // $output contains the output string
        $output = curl_exec($ch);

        // close curl resource to free up system resources
        curl_close($ch);     
?>

can someone help me adjust this code so i get the <title> and <head>

MDanz 0 Junior Poster · Answer 3 · 2009-09-24T17:52:15+00:00

like this?

<?php
        // create curl resource
        $ch = curl_init();

        // set url
        curl_setopt($ch, CURLOPT_URL, "www.realgm.com");

        //return the transfer as a string
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);

        // $output contains the output string
        $output = curl_exec($ch);
        






      preg_match_all('#<head>.*(<title>.*</title>|).*</head>#',$output,$header);

      echo '<xmp>';

      print_r($header);

      echo '</xmp>';
      
         // close curl resource to free up system resources
        curl_close($ch);
      ?>

i tried this and it says

Array ( [0] => Array ( ) [1] => Array ( ) )

cwarn23 387 Occupation: Genius Team Colleague Featured Poster · Answer 4 · 2009-09-24T18:30:31+00:00

My previous code was from the top of my head but I have tested it and should be as follows:

<?php
        // create curl resource
        $ch = curl_init();

        // set url
        curl_setopt($ch, CURLOPT_URL, "www.realgm.com");

        //return the transfer as a string
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);

        // $output contains the output string
        $output = curl_exec($ch);
      preg_match_all('#<head>.*<title>(.*)</title>.*</head>#Usi',$output,$header);

      echo '<xmp>';
      print_r($header);
      echo '</xmp>';
      
         // close curl resource to free up system resources
        curl_close($ch);
      ?>

MDanz 0 Junior Poster · Answer 5 · 2009-09-24T23:27:39+00:00

hi thanx got it working.. just one more thing..

how do i echo a single part of this

so i just want the <title></title> how do i echo that alone?

<title>RealGM: Sports Is Our Business</title>

btw are all websites built like this with keywords, description, title?

cwarn23 387 Occupation: Genius Team Colleague Featured Poster · Answer 6 · 2009-09-25T04:04:36+00:00

In my script, to echo the title simply use echo $header[1][0]; or to echo the entire header use echo $header[0][0]; Also virtually all webpages have the title tag but not all pages have the meta tags.

webcrawler help

Recommended Answers Collapse Answers

All 8 Replies

Recommended Answers