Although you didn't entirely answer my question (scan the whole website or a single webpage?), I will assume you want to scan the whole website, in which case you will need a bot. I recently wrote a bot to scan sites for security holes, and its template is as follows:
<?php
set_time_limit(0);
function domain($domainb) {
$bits = explode('/', $domainb);
if ($bits[0]=='http:' || $bits[0]=='https:')
{
return $bits[0].'//'.$bits[2].'/';
} else {
return 'http://'.$bits[0].'/';
}
}
if (isset($_GET['site'])) {
echo '<head><title>Bot scanning website - '.htmlspecialchars(domain($_GET['site']), ENT_QUOTES).'</title></head><body>';
} else {
echo '<head><title>Bot scanner</title></head><body>';
}
echo '<center><font size=5 face=\'arial black\'><b>PHP Bot Scanner</b></font><br><form method=\'get\' style=\'margin:0px; padding:0px;\'><input type=\'text\' name=\'site\' size=64 value="'.htmlspecialchars(isset($_GET['site']) ? $_GET['site'] : '', ENT_QUOTES).'"><input type=\'submit\' value=\'Scan\'></form></center>';
// Stop here if no site was submitted, before $_GET['site'] is read.
if (!isset($_GET['site'])) { exit(''); }
if (substr($_GET['site'],0,3)=='ftp') {
exit('You may not connect to the ftp protocol');
}
$_GET['site']=domain($_GET['site']);
function url_exists($durl)
{
// Version 4.x supported
$handle = curl_init($durl);
if (false === $handle)
{
return false;
}
curl_setopt($handle, CURLOPT_HEADER, true);
curl_setopt($handle, CURLOPT_FAILONERROR, true); // this works
curl_setopt($handle, CURLOPT_HTTPHEADER, Array("User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.15) Gecko/20080623 Firefox/2.0.0.15") ); // request as if Firefox
curl_setopt($handle, CURLOPT_NOBODY, true);
curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
curl_exec($handle);
// Read the status code directly rather than searching the header text for "200 OK",
// which HTTP/2 responses do not include.
$status = curl_getinfo($handle, CURLINFO_HTTP_CODE);
curl_close($handle);
return $status == 200;
}
//below function will only get links within own domain and not links outside the site.
function getlinks($generateurlf) {
$datac=file_get_contents($generateurlf);
preg_match_all('/(href|src)\=(\"|\')[^\"\'\>]+/i',$datac,$media);
unset($datac);
$datac=preg_replace('/(href|src)(\"|\'|\=\"|\=\')(.*)/i',"$3",$media[0]);
$datab=array();
foreach($datac AS $dfile) {
$generateurle=$generateurlf;
if (!in_array(substr_replace($dfile,'',4),array('http','www.'))) {
if (substr_replace($generateurle,'',0, -1)!=='/') {
$generateurle=preg_replace('/(.*)\/[^\/]+/is', "$1", $generateurle);
} else {
$generateurle=substr_replace($generateurle,'',-1);
}
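if (substr($dfile,0,1)=='/') {
// A leading slash means the link is relative to the site root, not to the current directory.
$generateurle=rtrim(domain($generateurle),'/');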
if (domain($generateurle)==domain($generateurle.$dfile)) {
if (in_array(strtolower(preg_replace('/(.*)[.]([^.\?]+)(\?(.*))?/','$2',basename($generateurle.$dfile))),array('html','htm','xhtml','xml','mhtml','xht','mht','asp','aspx','adp','bml','cfm','cgi','ihtml','jsp','las','lasso','lassoapp','pl','php','php1','php2','php3','php4','php5','php6','phtml','shtml','search','query','forum','blog','1','2','3','4','5','6','7','8','9','10','11','12','13','14','15','16','17','18','19','20','01','02','03','04','05','06','07','08','09','go','page','file')) || substr($generateurle.$dfile,-1)=='/' || !preg_match('/[\.]/i',basename($generateurle.$dfile))) {
$datab[]=$generateurle.$dfile;
}
}
} else if (substr_replace($dfile,'',1)=='.') {
while (preg_match('/\.\.\/(.*)/i', $dfile)) {
$dfile=substr_replace($dfile,'',0,3);
$generateurle=preg_replace('/(.*)\/[^\/]+/i', "$1", $generateurle);
}
if (domain($generateurle)==domain($generateurle.'/'.$dfile)) {
if (in_array(strtolower(preg_replace('/(.*)[.]([^.\?]+)(\?(.*))?/','$2',basename($generateurle.'/'.$dfile))),array('html','htm','xhtml','xml','mhtml','xht','mht','asp','aspx','adp','bml','cfm','cgi','ihtml','jsp','las','lasso','lassoapp','pl','php','php1','php2','php3','php4','php5','php6','phtml','shtml','search','query','forum','blog','1','2','3','4','5','6','7','8','9','10','11','12','13','14','15','16','17','18','19','20','01','02','03','04','05','06','07','08','09','go','page','file')) || substr($generateurle.'/'.$dfile,-1)=='/' || !preg_match('/[\.]/i',basename($generateurle.'/'.$dfile))) {
$datab[]=$generateurle.'/'.$dfile;
}
}
} else {
if (domain($generateurle)==domain($generateurle.'/'.$dfile)) {
if (in_array(strtolower(preg_replace('/(.*)[.]([^.\?]+)(\?(.*))?/','$2',basename($generateurle.'/'.$dfile))),array('html','htm','xhtml','xml','mhtml','xht','mht','asp','aspx','adp','bml','cfm','cgi','ihtml','jsp','las','lasso','lassoapp','pl','php','php1','php2','php3','php4','php5','php6','phtml','shtml','search','query','forum','blog','1','2','3','4','5','6','7','8','9','10','11','12','13','14','15','16','17','18','19','20','01','02','03','04','05','06','07','08','09','go','page','file')) || substr($generateurle.'/'.$dfile,-1)=='/' || !preg_match('/[\.]/i',basename($generateurle.'/'.$dfile))) {
$datab[]=$generateurle.'/'.$dfile;
}
}
}
} else {
if (domain($generateurle)==domain($dfile)) {
if (in_array(strtolower(preg_replace('/(.*)[.]([^.\?]+)(\?(.*))?/','$2',basename($dfile))),array('html','htm','xhtml','xml','mhtml','xht','mht','asp','aspx','adp','bml','cfm','cgi','ihtml','jsp','las','lasso','lassoapp','pl','php','php1','php2','php3','php4','php5','php6','phtml','shtml','search','query','forum','blog','1','2','3','4','5','6','7','8','9','10','11','12','13','14','15','16','17','18','19','20','01','02','03','04','05','06','07','08','09','go','page','file')) || substr($dfile,-1)=='/' || !preg_match('/[\.]/i',basename($dfile))) {
$datab[]=$dfile;
}
}
}
}
unset($datac);
unset($dfile);
return $datab;
}
$loopurl['sites']=array($_GET['site']);
foreach (getlinks($_GET['site']) AS $link) {
if (!in_array($link,$loopurl['sites'])) {
$loopurl['sites'][]=$link;
}
}
unset($link);
function generate($genurl) {
$data=file_get_contents($genurl);
// Add here whatever you want to do with the page contents held in $data.
}
for ($loopid=0;isset($loopurl['sites'][$loopid]);$loopid++) {
if (url_exists($loopurl['sites'][$loopid])) {
foreach (getlinks($loopurl['sites'][$loopid]) AS $link) {
if (!in_array($link,$loopurl['sites'])) {
$loopurl['sites'][]=$link;
}
}
unset($link);
echo generate($loopurl['sites'][$loopid]);
flush();
}
usleep (5000);
}
echo '<br><b>Bot scan complete.</b></body>';
?>
To adapt this, just put the code you want run on each page inside the generate() function. The only parts that should need modifying are those after the giant space in the code.
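As a rough illustration of what could go inside generate(), here is a minimal sketch that pulls the title and visible text out of each page and returns one line of HTML for the main loop to echo. The helper name summarize_page() and the output format are assumptions made for this example, not part of the original bot.

```php
<?php
// Hypothetical helper (not in the original bot): extracts the <title>
// and the visible text from a page's HTML.
function summarize_page($html) {
    $title = '';
    if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $m)) {
        $title = trim(html_entity_decode(strip_tags($m[1]), ENT_QUOTES, 'UTF-8'));
    }
    // Remove script and style blocks first so their contents are not treated as text.
    $body = preg_replace('/<(script|style)\b[^>]*>.*?<\/\1>/is', ' ', $html);
    $text = html_entity_decode(strip_tags($body), ENT_QUOTES, 'UTF-8');
    $text = trim(preg_replace('/\s+/', ' ', $text));
    return array('title' => $title, 'text' => $text);
}

// One possible generate(): fetch the page, summarize it, and return
// an escaped "URL - title" line. Returns '' if the page cannot be read.
function generate($genurl) {
    $data = @file_get_contents($genurl);
    if ($data === false) {
        return '';
    }
    $page = summarize_page($data);
    return htmlspecialchars($genurl . ' - ' . $page['title'], ENT_QUOTES, 'UTF-8') . "<br>\n";
}
```

Because the main loop already does echo generate(...), returning a string from generate() is enough to get a running report of the pages visited.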
Just a note on the theory behind this. The idea is that the bot scans the website and indexes all of its pages into a database; then, whenever somebody searches a website that is in the index, the most relevant pages within that website can be looked up. Alternatively, you can piggyback off Google.
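The index-then-search idea can be sketched roughly as follows. This is only an illustration: a plain PHP array stands in for the database, the function names index_page() and search_index() are made up for the example, and relevance is simply a count of how often the query words occur on a page.

```php
<?php
// Hypothetical in-memory index: word => array(url => occurrence count).
// A real bot would keep this in a database table rather than a PHP array.
function index_page(array &$index, $url, $text) {
    $words = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word) {
        if (!isset($index[$word][$url])) {
            $index[$word][$url] = 0;
        }
        $index[$word][$url]++;
    }
}

// Rank indexed URLs by how many times the query words occur on them.
function search_index(array $index, $query) {
    $scores = array();
    $words = preg_split('/[^a-z0-9]+/', strtolower($query), -1, PREG_SPLIT_NO_EMPTY);
    foreach (array_unique($words) as $word) {
        if (!isset($index[$word])) {
            continue;
        }
        foreach ($index[$word] as $url => $count) {
            $scores[$url] = (isset($scores[$url]) ? $scores[$url] : 0) + $count;
        }
    }
    arsort($scores); // highest score first, URLs kept as keys
    return array_keys($scores);
}

// Example usage with made-up pages.
$index = array();
index_page($index, 'http://example.com/', 'Welcome to the example site');
index_page($index, 'http://example.com/php.html', 'PHP bot tutorial: writing a PHP crawler');
print_r(search_index($index, 'php crawler'));
```

In the bot itself, index_page() would be called from generate() with the text of each page, and search_index() would run later when somebody submits a query.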