bots topic for user Crohole

Question

cwarn23 387 Occupation: Genius

15 Years Ago

I have been PMed about this question by Crohole so many times that I thought I would set up a topic where others can join in. Crohole has asked many times about setting up a bot like mine, but his questions keep looping back to the beginning. So I'll explain briefly how to make a bot, and if Crohole has any more problems he can post here instead of sending me lots of PMs. First, the basic script that I have set up for a bot is as follows:

<form method="post">Scan site: <input type="text" name="site" value="http://" style="width:300px">
<input value="Scan" type="submit"></form>
<?php
set_time_limit (0);

if (!function_exists('stripos')) {
  function stripos($str,$needle,$offset=0) {
      return strpos(strtolower($str),strtolower($needle),$offset);
  }
}

if (isset($_POST['site']) && !empty($_POST['site'])) {
/* Formats Allowed */
$formats=array('html'=>true,'htm'=>true,'xhtml'=>true,'xml'=>true,'mhtml'=>true,'xht'=>true,
'mht'=>true,'asp'=>true,'aspx'=>true,'adp'=>true,'bml'=>true,'cfm'=>true,'cgi'=>true,
'ihtml'=>true,'jsp'=>true,'las'=>true,'lasso'=>true,'lassoapp'=>true,'pl'=>true,'php'=>true,
'php1'=>true,'php2'=>true,'php3'=>true,'php4'=>true,'php5'=>true,'php6'=>true,'phtml'=>true,
'shtml'=>true,'search'=>true,'query'=>true,'forum'=>true,'blog'=>true,'1'=>true,'2'=>true,
'3'=>true,'4'=>true,'5'=>true,'6'=>true,'7'=>true,'8'=>true,'9'=>true,'10'=>true,'11'=>true,
'12'=>true,'13'=>true,'14'=>true,'15'=>true,'16'=>true,'17'=>true,'18'=>true,'19'=>true,
'20'=>true,'01'=>true,'02'=>true,'03'=>true,'04'=>true,'05'=>true,'06'=>true,'07'=>true,
'08'=>true,'09'=>true,'go'=>true,'page'=>true,'file'=>true);

function domain ($ddomain) {
return preg_replace('/^((http(s)?:\/\/)?([^\/]+))(.*)/','$1',$ddomain);
}

function url_exists($durl)
		{
		// Version 4.x supported
		$handle   = curl_init($durl);
		if (false === $handle)
			{
			return false;
			}
		curl_setopt($handle, CURLOPT_HEADER, true);
		curl_setopt($handle, CURLOPT_FAILONERROR, true);  // this works
		curl_setopt($handle, CURLOPT_HTTPHEADER, 
Array("User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.15) Gecko/20080623 Firefox/2.0.0.15") );
		curl_setopt($handle, CURLOPT_NOBODY, true);
		curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
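		$connectable = curl_exec($handle);
		// Read the numeric status before closing the handle; matching the text
		// '200 OK' fails on HTTP/2 responses, which carry no reason phrase.
		$status = curl_getinfo($handle, CURLINFO_HTTP_CODE);
		curl_close($handle);
		return ($connectable !== false && $status == 200);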
		}
 $f_data=''; //page contents shared between getlinks() and generate()
//below function will only get links within own domain and not links outside the site.
function getlinks($generateurlf) {
    global $formats;
    global $f_data;
    $f_data=file_get_contents($generateurlf);
    $datac=$f_data;
    preg_match_all('/(href|src)\=(\"|\')([^\"\'\>]+)/i',$datac,$media);
    unset($datac);
    $datac=$media[3];
    unset($media);
    $datab=array();
    $str_start=array('http'=>true,'www.'=>true);
    foreach($datac AS $dfile) {
        $generateurle=$generateurlf;
		$format=strtolower(preg_replace('/(.*)[.]([^.\?]+)(\?(.*))?/','$2',basename($generateurle.$dfile)));
        if (!isset($str_start[substr_replace($dfile,'',4)])) {
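            if (substr_replace($generateurle,'',0, -1)!=='/') {
                //strip the file name, but leave a bare domain such as http://example.com untouched
                if ($generateurle!==domain($generateurle)) {
                    $generateurle=preg_replace('/(.*)\/[^\/]+/is', "$1", $generateurle);
                    }
                } else {
                $generateurle=substr_replace($generateurle,'',-1);
                }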
 
            if (substr_replace($dfile,'',1)=='/') {
                //a link starting with / resolves against the domain root, not the current directory
                $rooturl=domain($generateurle).$dfile;
                if (isset($formats[$format]) 
                    || substr($rooturl,-1)=='/' || substr_count(basename($rooturl),'.')==0) {
                    $datab[]=$rooturl;
                    }
                } else if (substr($dfile,0,2)=='./') {
                $dfile=substr($dfile,2);
                if (isset($formats[$format])) {$datab[]=$generateurle.'/'.$dfile;}
                } else if (substr_replace($dfile,'',1)=='.') {
                while (preg_match('/\.\.\/(.*)/i', $dfile)) {
                $dfile=substr_replace($dfile,'',0,3);
                $generateurle=preg_replace('/(.*)\/[^\/]+/i', "$1", $generateurle);
                }
                if (domain($generateurle)==domain($generateurle.'/'.$dfile)) {
                    if (isset($formats[$format]) || substr($generateurle.'/'.$dfile,-1)=='/' 
                        || substr_count(basename($generateurle.'/'.$dfile),'.')==0) {
                        $datab[]=$generateurle.'/'.$dfile;
                        }
                    }
                } else {
                if (domain($generateurle)==domain($generateurle.'/'.$dfile)) {
                    if (isset($formats[$format]) || substr($generateurle.'/'.$dfile,-1)=='/' 
                        || substr_count(basename($generateurle.'/'.$dfile),'.')==0) {
                        $datab[]=$generateurle.'/'.$dfile;
                        }
                    }
                }
            } else {
            if (domain($generateurle)==domain($dfile)) {
                if (isset($formats[$format]) || substr($dfile,-1)=='/' || substr_count(basename($dfile),'.')==0) {
                    $datab[]=$dfile;
                    }
                }
            }
		unset($format);
        }
    unset($datac);
    unset($dfile);
    return $datab;
    }
 
 
 
 
  
//=============================================
/* Modify only code between these two lines and $formats variable above. */

function generate($url) {
    echo $url.'<br>';
    global $f_data; //Data of file contents
    //do something with webpage $f_data.
    unset($f_data);
    }


//=============================================
// Below is what actually processes the search engine
$sites=array();
$sites[]=stripslashes($_POST['site']);
for ($i=0;isset($sites[$i]);$i++) {
    foreach (getlinks(stripslashes($sites[$i])) AS $val) {
        if (!isset($sites[$val])) {
            $sites[]=$val;
            $sites[$val]=true;
            }
        } unset($val);
    if (url_exists($sites[$i])) {
        generate($sites[$i]);
        flush();
        }
    }
}
?>

Now the only part that really needs changing to record results is the generate() function. It is defined in the section clearly marked between two long comment bars. That is how it works, so if you have any problems, Crohole, post here instead of PMing me.
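As a rough sketch of what a customised generate() could look like, the following records every crawled page that contains a link to a given host. The helper page_links_to(), the output file matches.txt and the target host daniweb.com are assumptions made for illustration; they are not part of the script above.

```php
<?php
// Hypothetical helper (not in the original script): returns true when $html
// contains an href pointing at $targetHost or at one of its subdomains.
function page_links_to($html, $targetHost) {
    if (!preg_match_all('/href\s*=\s*["\']([^"\'>]+)/i', $html, $m)) {
        return false; // no href attributes found
    }
    $target = strtolower($targetHost);
    foreach ($m[1] as $href) {
        $host = parse_url($href, PHP_URL_HOST);
        if (!is_string($host)) {
            continue; // relative link or unparsable URL
        }
        $host = strtolower($host);
        $suffix = '.' . $target;
        if ($host === $target || substr($host, -strlen($suffix)) === $suffix) {
            return true;
        }
    }
    return false;
}

// Sketch of a generate() that records matching pages instead of only echoing them.
function generate($url) {
    global $f_data; // page contents set by getlinks()
    if (is_string($f_data) && page_links_to($f_data, 'daniweb.com')) {
        // matches.txt is an assumed output location
        file_put_contents('matches.txt', $url . "\n", FILE_APPEND);
    }
    unset($f_data);
}
```

Comparing parsed hosts rather than searching the raw HTML for the domain text avoids counting pages that merely mention the name without linking to it.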

php

crohole 0 Light Poster · Answer 1 · 2009-08-04T09:45:54+00:00

OK, thanks for your help.
Now I want to build a web crawler that searches for websites that link to my website.
For example:
search for the websites that link to daniweb.com.
The result would be every website that links to daniweb.com.

Please tell me how to do it. I hope you can help.

cwarn23 387 Occupation: Genius Team Colleague Featured Poster · Answer 2 · 2009-08-14T16:25:26+00:00

If you mean for the bot to stick to one domain, I believe the above script does that. As for how, it is fairly simple. First you get the domain of the site entered, then the domain of the URL being scanned, and if they match the page is added to the array loop interface. Always wanted to use that term. Below is an example of what the above script does to stick to the one domain.

function domain ($ddomain) {
return preg_replace('/^((http(s)?:\/\/)?([^\/]+))(.*)/','$1',$ddomain);
}
$site='http://www.google.com/test/';
$link_inside_homepage='http://www.google.com/image.png';
if (domain($site)==domain($link_inside_homepage)) {
    //then checks if the file extension is in the array list
    //if that passes the link is added into the loop interface
    //where it will be processed.
}

Hope that helps you understand, as the script in the first post already has the requested functionality embedded.
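To show how strict that comparison is, here is a small sketch. domain() is copied from the script; same_host() is a hypothetical alternative, not something the script uses, that compares only the host name and ignores the scheme and a leading "www.".

```php
<?php
function domain($ddomain) {
    return preg_replace('/^((http(s)?:\/\/)?([^\/]+))(.*)/', '$1', $ddomain);
}

// Hypothetical alternative: compare hosts only, ignoring scheme and "www."
function same_host($a, $b) {
    $ha = parse_url($a, PHP_URL_HOST);
    $hb = parse_url($b, PHP_URL_HOST);
    if (!is_string($ha) || !is_string($hb)) {
        return false; // one of the URLs has no host part
    }
    $strip = function ($h) {
        return preg_replace('/^www\./', '', strtolower($h));
    };
    return $strip($ha) === $strip($hb);
}

// Same scheme and host: domain() treats these as the same site.
var_dump(domain('http://www.google.com/test/') == domain('http://www.google.com/image.png')); // bool(true)
// Same host but different scheme: domain() treats these as different sites.
var_dump(domain('http://www.google.com/') == domain('https://www.google.com/')); // bool(false)
// The host-only comparison accepts them.
var_dump(same_host('http://www.google.com/', 'https://google.com/a')); // bool(true)
```

Because domain() keeps the scheme and the full host in the string it returns, http and https pages (or www and non-www pages) of one site count as different domains.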

crohole 0 Light Poster · Answer 3 · 2009-08-16T14:14:44+00:00

Mr. cwarn23, that script checks the links within one site. I am asking how to find all the websites that contain a link to my site. For example:
my link is: crohole.com
I want to detect which websites link to crohole.com. It should check every website in the world. The result would look like this:

Your site : crohole.com

Reciprocal links :
1. idu.info
2. rock.com
3. hash.net

and so on. I hope you understand my question. Sorry if my English is bad.

cwarn23 387 Occupation: Genius Team Colleague Featured Poster · Answer 4 · 2009-08-16T14:49:43+00:00

The script in the first post does exactly that. Give it a try and you will see what I mean.

crohole 0 Light Poster · Answer 5 · 2009-08-16T15:20:30+00:00

Mr. cwarn23, that script checks the links within one site. I am asking how to find all the websites that contain a link to my site. For example:
my link is: crohole.com
I want to detect which websites link to crohole.com. It should check every website in the world. The result would look like this:

Your site : crohole.com

Reciprocal links :
1. idu.info
2. rock.com
3. hash.net

and so on. I hope you understand my question. Sorry if my English is bad.

cwarn23 387 Occupation: Genius Team Colleague Featured Poster · Answer 6 · 2009-08-16T15:32:09+00:00

If you mean you want to see who has linked to your website, using an API would probably be best, because scanning the entire internet is no easy task.

crohole 0 Light Poster · Answer 7 · 2009-08-18T12:44:22+00:00

I have a class like this, but I don't know how to make it do what I want. Please tell me.

<?php
// include the class nusoap. This class can be obtained from http://dietrich.ganx4.com/nusoap/index.php
// Once downloaded, put it in somewhere in your site tree, and change the next line to reflect that

include("nusoap.php");


// create a instance of the SOAP client object

// remember that this script is the client,
// accessing the web service provided by Google

$soapclient = new soapclient("http://api.google.com/search/beta2");

// uncomment the next line to see debug messages
// $soapclient->debug_flag = 1;

// set up an array containing input parameters to be
// passed to the remote procedure

class clsGoogleApi {

	// These properties are used in the class:
	var $theResultSet; // holds the results of the search as given by google api
	var $theResults=array(); //holds the results, and is intended to do the traversing
	var $theRowShown=0; // internal field. Holds the index to the last row shown
	var $theMaxResults; // internal field. Holds the given max results parameter to the constructor
	var $flgError = false; // indicates if was there error or not
	var $theSearchQuery; // the Search query as returned by Google Api
	var $theEstimatedResultsCount; // The number of results found by the Api

	function clsGoogleApi($search_what,$start,$maxResults) {

		global $soapclient;
		$params = array(
			 'key' => 'yourkeyhere',   // Google license key placeholder; get your own license by going to www.google.com/api
			 'q'   => $search_what,                         // search term
			 'start' => $start,                             // start from result n
			 'maxResults' => $maxResults,                   // show a total of n results
			 'filter' => false,                             // remove similar results
			 'restrict' => '',                              // restrict by topic
			 'safeSearch' => false,                         // remove adult links
			 'lr' => '',                                    // restrict by language
			 'ie' => '',                                    // input encoding
			 'oe' => ''                                     // output encoding
		);
		// invoke the method on the server
		$this->theResultSet=$soapclient->call("doGoogleSearch", $params, "urn:GoogleSearch", "urn:GoogleSearch");
		$this->theMaxResults=$maxResults;

		// print the results of the search
		if (!empty($this->theResultSet['faultstring'])) { // !empty() avoids a notice when no fault is returned
			echo $this->theResultSet['faultstring']."<br>";
			$this->flgError=true;
		} else  {
			$this->flgError=false;
			$this->theRowShown=0;
			$this->theSearchQuery=$this->theResultSet['searchQuery'];
			$this->theEstimatedResultsCount=$this->theResultSet['estimatedTotalResultsCount'];
			if (is_array($this->theResultSet['resultElements'])) {
				$this->theResults=array();
				foreach ($this->theResultSet['resultElements'] as $r) {
					$result["URL"]=$r['URL'];
					$result["cached-size"]=$r['cachedSize'];
					$result["snippet"]=$r['snippet'];
					$result["directory category"]=$r['directoryCategory'];
					$result["related information present"]=$r['relatedInformationPresent'];
					$result["directory title"]=$r['DirectoryTitle'];
					$result["summary"]=utf8_decode($r['summary']);
					$result["title"]=utf8_decode($r['title']);
					$this->theResults[]=$result;
				}
			}
		}
	}
	
	function getResultNextItem() {
		// Stop once every stored result has been returned; this also avoids
		// reading an undefined offset when fewer results came back than requested
		if ($this->theRowShown >= $this->theMaxResults || !isset($this->theResults[$this->theRowShown])) {
			return false;
		}
		return $this->theResults[$this->theRowShown++];
	}
}

/*
This is an example on how to use the class.

  $myQuery=new clsGoogleApi("michael jackson",0,25); // Search for Michael Jackson, starting on the first found record, and getting a max of 25 items

    if ($myQuery->flgError) { // if error found do something
        echo "Error!";
    } else {
		echo "Search of ". $myQuery->theSearchQuery." got ".$myQuery->theEstimatedResultsCount." results<hr>";
		$item=0;
		echo "<ul>";
        while ($result=$myQuery->getResultNextItem()) {
			$item++;
			echo "<li> $item - ".$result["title"]." (".$result["URL"].")<br>".$result["snippet"]."(".$result["cached-size"].")";
        }
		echo "</ul>";
    }
*/
?>

cwarn23 387 Occupation: Genius Team Colleague Featured Poster · Answer 8 · 2009-08-20T15:34:32+00:00

I managed to find the following page where you can enter your website (and possibly a single webpage), and it will report back every link to your site/page, with a count at the top.
http://www.tech-faq.com/who-links-to-me.shtml
Hope that is of some use to you.

crohole 0 Light Poster · Answer 9 · 2009-08-21T13:20:23+00:00

That's one example. What I have in mind would look like this:
http://www.wholinks2me.com/

Please tell me how to do something like that.

cwarn23 387 Occupation: Genius Team Colleague Featured Poster · Answer 10 · 2009-08-21T15:56:59+00:00

Try something like the following script:

$site='www.daniweb.com'; //no http and no slashes
echo file_get_contents('http://www.wholinks2me.com/link/'.$site);

It would even be possible to use regex to format/style the page and, optionally, to store each group of data in variables/arrays. That's the best I can think of, because the API code in your earlier post makes no sense without documentation.
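As a minimal sketch of the regex idea, the following pulls the distinct hosts out of the links in a block of HTML and stores them in an array. The helper name extract_link_hosts() and the sample markup are made up for illustration; the real page returned by that site may be structured differently, so the pattern would need adjusting.

```php
<?php
// Hypothetical helper: collect the distinct hosts of absolute http(s) links in $html.
function extract_link_hosts($html) {
    $hosts = array();
    $pattern = '/<a\s[^>]*href\s*=\s*["\'](https?:\/\/[^"\'>]+)["\']/i';
    if (preg_match_all($pattern, $html, $m)) {
        foreach ($m[1] as $href) {
            $host = parse_url($href, PHP_URL_HOST);
            if (is_string($host)) {
                $hosts[strtolower($host)] = true; // array keys de-duplicate the hosts
            }
        }
    }
    return array_keys($hosts);
}

// Made-up sample markup standing in for a fetched results page.
$html = '<ul><li><a href="http://idu.info/links.html">idu</a></li>'
      . '<li><a class="ext" href="http://rock.com/">rock</a></li>'
      . '<li><a href="http://idu.info/other.html">idu again</a></li></ul>';

print_r(extract_link_hosts($html));
```

With the sample markup this yields the two hosts idu.info and rock.com, each listed once. In practice $html would come from the file_get_contents() call shown earlier.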