My friend and I are working on an Internet bot. We want to make a bot that, given a website, indexes its pages into a table.

Example --

Given the website: www.daniweb.com

Add to table:

www.daniweb.com/c++
www.daniweb.com/c++/forum
www.daniweb.com/java
etc...

Any suggestions on how to do this?

I have a template that I can give you and it is as follows:

<form method="post">Scan site: <input type="text" name="site" value="http://" style="width:300px">
<input value="Scan" type="submit"></form>
<?php
set_time_limit (0);
if (isset($_POST['site']) && !empty($_POST['site'])) {
/* Formats Allowed */
$formats=array('html'=>true,'htm'=>true,'xhtml'=>true,'xml'=>true,'mhtml'=>true,'xht'=>true,
'mht'=>true,'asp'=>true,'aspx'=>true,'adp'=>true,'bml'=>true,'cfm'=>true,'cgi'=>true,
'ihtml'=>true,'jsp'=>true,'las'=>true,'lasso'=>true,'lassoapp'=>true,'pl'=>true,'php'=>true,
'php1'=>true,'php2'=>true,'php3'=>true,'php4'=>true,'php5'=>true,'php6'=>true,'phtml'=>true,
'shtml'=>true,'search'=>true,'query'=>true,'forum'=>true,'blog'=>true,'1'=>true,'2'=>true,
'3'=>true,'4'=>true,'5'=>true,'6'=>true,'7'=>true,'8'=>true,'9'=>true,'10'=>true,'11'=>true,
'12'=>true,'13'=>true,'14'=>true,'15'=>true,'16'=>true,'17'=>true,'18'=>true,'19'=>true,
'20'=>true,'01'=>true,'02'=>true,'03'=>true,'04'=>true,'05'=>true,'06'=>true,'07'=>true,
'08'=>true,'09'=>true,'go'=>true,'page'=>true,'file'=>true);
 
function domain ($ddomain) {
return preg_replace('/^((http(s)?:\/\/)?([^\/]+))(.*)/','$1',$ddomain);
}
 
function url_exists($durl)
		{
		// Asks the server for headers only and reports whether it answers 200.
		$handle   = curl_init($durl);
		if (false === $handle)
			{
			return false;
			}
		curl_setopt($handle, CURLOPT_FAILONERROR, true);  // treat HTTP codes >= 400 as failures
		curl_setopt($handle, CURLOPT_USERAGENT,
			"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.15) Gecko/20080623 Firefox/2.0.0.15");
		curl_setopt($handle, CURLOPT_NOBODY, true);  // headers only, no body
		curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
		curl_exec($handle);
		// Read the numeric status rather than searching the header text for
		// "200 OK", which fails on status lines such as "HTTP/2 200".
		$status = curl_getinfo($handle, CURLINFO_HTTP_CODE);
		curl_close($handle);
		return $status == 200;
		}
 $f_data='';
//below function will only get links within own domain and not links outside the site.
function getlinks($generateurlf) {
    global $formats;
    global $f_data;
    $f_data=file_get_contents($generateurlf);
    $datac=$f_data;
    preg_match_all('/(href|src)\=(\"|\')([^\"\'\>]+)/i',$datac,$media);
    unset($datac);
    $datac=$media[3];
    unset($media);
    $datab=array();
    $str_start=array('http'=>true,'www.'=>true);
    foreach($datac AS $dfile) {
        $generateurle=$generateurlf;
		$format=strtolower(preg_replace('/(.*)[.]([^.\?]+)(\?(.*))?/','$2',basename($generateurle.$dfile)));
        if (!isset($str_start[substr_replace($dfile,'',4)])) {
            if (substr_replace($generateurle,'',0, -1)!=='/') {
                $generateurle=preg_replace('/(.*)\/[^\/]+/is', "$1", $generateurle);
                } else {
                $generateurle=substr_replace($generateurle,'',-1);
                }
 
            if (substr_replace($dfile,'',1)=='/') {
                if (domain($generateurle)==domain($generateurle.$dfile)) {
                    if (isset($formats[$format]) 
                        || substr($generateurle.$dfile,-1)=='/' || substr_count(basename($generateurle.$dfile),'.')==0) {
                        $datab[]=$generateurle.$dfile;
                        }
                    }
                } else if (substr($dfile,0,2)=='./') {
                $dfile=substr($dfile,2);
                if (isset($formats[$format])) {$datab[]=$generateurle.'/'.$dfile;}
                } else if (substr_replace($dfile,'',1)=='.') {
                while (preg_match('/\.\.\/(.*)/i', $dfile)) {
                $dfile=substr_replace($dfile,'',0,3);
                $generateurle=preg_replace('/(.*)\/[^\/]+/i', "$1", $generateurle);
                }
                if (domain($generateurle)==domain($generateurle.'/'.$dfile)) {
                    if (isset($formats[$format]) || substr($generateurle.'/'.$dfile,-1)=='/' 
                        || substr_count(basename($generateurle.'/'.$dfile),'.')==0) {
                        $datab[]=$generateurle.'/'.$dfile;
                        }
                    }
                } else {
                if (domain($generateurle)==domain($generateurle.'/'.$dfile)) {
                    if (isset($formats[$format]) || substr($generateurle.'/'.$dfile,-1)=='/' 
                        || substr_count(basename($generateurle.'/'.$dfile),'.')==0) {
                        $datab[]=$generateurle.'/'.$dfile;
                        }
                    }
                }
            } else {
            if (domain($generateurle)==domain($dfile)) {
                if (isset($formats[$format]) || substr($dfile,-1)=='/' || substr_count(basename($dfile),'.')==0) {
                    $datab[]=$dfile;
                    }
                }
            }
		unset($format);
        }
    unset($datac);
    unset($dfile);
    return $datab;
    }
 
 
 
 
 
//=============================================
/* Modify only code between these two lines and $formats variable above. */
 
function generate($url) {
    echo $url.'<br>';
    global $f_data; //Data of file contents
    //do something with webpage $f_data.
    unset($f_data);
    }
 
 
//=============================================
// Below is the loop that actually crawls the site
$sites=array(); // queue of URLs to visit, in discovery order
$seen=array();  // URLs already queued, keyed by URL
$sites[]=stripslashes($_POST['site']);
$seen[$sites[0]]=true; // do not queue the start page a second time
for ($i=0;isset($sites[$i]);$i++) {
    foreach (getlinks($sites[$i]) AS $val) {
        if (!isset($seen[$val])) {
            $sites[]=$val;
            $seen[$val]=true;
            }
        } unset($val);
    if (url_exists($sites[$i])) {
        generate($sites[$i]);
        flush();
        }
    }
}
?>

Thanks mate. I actually saw your post from 2007. I was going through some of the code on your website.

I'll try to understand what you're trying to do, and I'll post here if I have any questions.

I appreciate your help in advance.

Cheers

drjay

To explain how this is done:

First you make an HTTP request to the page you want to parse.

eg:

$url = 'http://example.com/';
$content = file_get_contents($url);

Then you parse the HTML. A good option is using DOMDocument.
http://docs.php.net/manual/en/domdocument.loadhtml.php

The loadHTML() method of DOMDocument will parse HTML into a DOM tree. You can then use DOM methods, or hand the DOM over to SimpleXML if you find that easier.

$url = 'http://example.com/';
$content = file_get_contents($url);
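$Dom = new DOMDocument(); // loadHTML() is an instance method, so create the object first
@$Dom->loadHTML($content); // @ hides warnings about malformed markup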
// you can traverse the DOM
$Xml = simplexml_import_dom($Dom);
// or use the simpleXML methods, which may be simpler
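
To make the DOM route a bit more concrete, here is a minimal sketch that pulls the href out of every <a> element. The function name extract_links and the sample markup are made up for illustration; on a real page you would pass in the string returned by file_get_contents().

```php
<?php
// Sample markup standing in for a fetched page (hypothetical content).
$content = '<html><body>'
    . '<a href="/c++">C++</a>'
    . '<a href="/java">Java</a>'
    . '<a name="top">no href here</a>'
    . '</body></html>';

// Collect the href of every <a> element in an HTML string.
function extract_links($html) {
    $dom = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings quietly
    // instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = array();
    foreach ($dom->getElementsByTagName('a') as $anchor) {
        if ($anchor->hasAttribute('href')) {
            $links[] = $anchor->getAttribute('href');
        }
    }
    return $links;
}

$links = extract_links($content);
print_r($links); // the two href values: /c++ and /java
```

The links come back exactly as written in the page, so relative ones such as /c++ still need to be joined to the site's address before you fetch them.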

You can also just use regular expressions.

$url = 'http://example.com/';
$content = file_get_contents($url);

// parse what you want with a regular expression
$regex = '/href=["\'](.*?)["\']/i';
preg_match_all($regex, $content, $matches);

var_dump($matches);
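
If it helps to see what ends up in $matches, here is a small sketch run against a made-up string instead of a downloaded page. The sample markup is an assumption; the regex is the one from the example above.

```php
<?php
// Hypothetical markup standing in for a downloaded page.
$content = '<a href="/c++">C++</a> <A HREF=\'/java\'>Java</a>';

$regex = '/href=["\'](.*?)["\']/i';
$count = preg_match_all($regex, $content, $matches);

// $matches[0] holds each full match, e.g. href="/c++".
// $matches[1] holds only the part captured by the parentheses.
$links = $matches[1];
print_r($links); // /c++ and /java
```

So for a link index, $matches[1] is the array you want to loop over.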

My script is a little more complex than that because it scans every page in the specified website. It goes to the specified page, grabs all the links, does whatever it needs to do with the page, and then repeats the same process for every internal link it finds. Also note that the regex should really be as follows, so that src attributes are picked up as well as href:

$regex = '/(href|src)=["\'](.*?)["\']/i';

With the extra group in front, the URL ends up in $matches[2] rather than $matches[1].
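
The crawl loop described here boils down to a queue of URLs plus a record of which ones have already been queued. The sketch below is a simplified illustration, not the script posted earlier: the function name crawl and the $fetchLinks callback are hypothetical, and the "site" is an in-memory array so the example runs without touching the network.

```php
<?php
// Visit every page reachable from $start, each exactly once.
// $fetchLinks is a hypothetical callback: given a URL, it returns the
// internal links found on that page.
function crawl($start, $fetchLinks) {
    $queue = array($start);         // URLs waiting to be visited
    $seen = array($start => true);  // URLs already queued
    $visited = array();             // URLs in the order they were visited

    for ($i = 0; $i < count($queue); $i++) {
        $url = $queue[$i];
        $visited[] = $url;          // per-page work (indexing) would go here
        foreach ($fetchLinks($url) as $link) {
            if (!isset($seen[$link])) {
                $seen[$link] = true;
                $queue[] = $link;
            }
        }
    }
    return $visited;
}

// A made-up three-page site, with links that point back to the home page.
$fakeSite = array(
    'http://example.com/'  => array('http://example.com/a', 'http://example.com/b'),
    'http://example.com/a' => array('http://example.com/', 'http://example.com/b'),
    'http://example.com/b' => array(),
);
$fetchLinks = function ($url) use ($fakeSite) {
    return isset($fakeSite[$url]) ? $fakeSite[$url] : array();
};

$visited = crawl('http://example.com/', $fetchLinks);
print_r($visited);
```

Without the $seen check, the link from /a back to the home page would keep the loop running forever. In a real crawler $fetchLinks would download the page and extract its links.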

It is daunting to try and understand a large piece of code. That is why I posted the process, in a few simple lines.

Great script by the way.. :D

And yes, the regex can be improved, that is just a simple example.

Thanks to both of you. I have a better idea of what to do now.

I don't really understand the regular expression example.

$regex = '/(href|src)=["\'](.*?)["\']/i';

href and src are the HTML attributes for links.

so href|src means href OR src?
equals to (=) ...

well I don't know what this is: ["\'](.*?)["\']/i'?

I'd appreciate it if one of you could explain when time permits.

cheers

drjay

A good resource on regex is: http://www.regular-expressions.info/
The special characters in regex are documented here: http://www.regular-expressions.info/reference.html

so href|src means href OR src?
yes, correct.

equals to (=) ...
The = is a literal character here (no special meaning), so it does not need to be escaped. The escape character in regex is \, the same as in PHP strings and most languages.

well I don't know what this is: ["\'](.*?)["\']/i'?

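The / at the start and the / near the end are just delimiters that mark where the pattern begins and ends. The i after the closing delimiter makes the match case-insensitive, so HREF matches as well as href.

Square brackets [] define a character class: the pattern matches any one of the characters inside them. So the pattern [ab] will match either a or b, and ["\'] matches either a double quote or a single quote.

The (.*?) matches any number of any characters and, because of the parentheses, captures them so you can read them back from $matches. For example:

<div>(.*?)</div> will match a <div>, any number of characters, followed by </div>.

In detail: (.*?) consists of a dot . which represents any character (except a newline, unless you add the s modifier), a * which means any number of repetitions of the pattern to its left, and a ? which makes the * lazy, so it stops as soon as the pattern to its right can match.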
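
To see what the ? changes, here is a small sketch comparing .*? with plain .* on the <div> example. The sample string is made up for illustration.

```php
<?php
$html = '<div>one</div><div>two</div>';

// Lazy: .*? stops at the first </div> it can reach, giving two matches.
preg_match_all('/<div>(.*?)<\/div>/', $html, $lazy);

// Greedy: .* runs on to the last </div> in the string, giving one match.
preg_match_all('/<div>(.*)<\/div>/', $html, $greedy);

print_r($lazy[1]);   // "one" and "two"
print_r($greedy[1]); // "one</div><div>two"
```

The same thing happens with links: without the ?, a pattern like the href one would swallow everything up to the last quote on the line.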

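The \ you see in there is, as I mentioned, the escape character. A special character that follows a \ loses its special meaning, so \* would match a literal *. In this particular regex, though, the \' is not a regex escape at all: the pattern sits inside a single-quoted PHP string, so the ' has to be escaped for PHP, and the regex engine only ever sees a plain '.

The exceptions are letters. Sequences such as \n, \t and \r stand for the newline, tab and carriage-return characters, and shorthand classes such as \s (whitespace), \d (digit) and \w (word character) match a whole set of characters.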
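
A few quick experiments with escaping, as a sketch you can paste into a file and run. The sample strings are made up; preg_quote() is PHP's built-in helper for escaping text that you want matched literally.

```php
<?php
// \* matches a literal asterisk; unescaped, * would mean "repeat".
$literalStar = preg_match('/a\*b/', 'a*b'); // matches
$noStar = preg_match('/a\*b/', 'aab');      // does not match

// \s is a shorthand class: any whitespace character.
$parts = preg_split('/\s+/', "one two\tthree\nfour");

// preg_quote() escapes the special characters in a string for you,
// which is handy for a path such as c++ where + is a regex operator.
$quoted = preg_quote('c++', '/');
$found = preg_match('/' . $quoted . '/', 'www.daniweb.com/c++/forum');

var_dump($literalStar, $noStar, $parts, $quoted, $found);
```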

Regex may be a bit difficult to grasp at first, though if you try writing a few simple ones of your own you'll see how powerful it can be.
