Apologies first if this is covered elsewhere - I searched but could not find it.

I am looking for a way to search the web for the presence of a JavaScript code snippet within the HTML <body> of a web page. I would specify the code snippet and send the bot on its way, and it would come back with either a number of results or a list of pages.

I realise there are billions of web pages, so I don't know whether this is feasible or not.

The purpose is to determine the number of participating sites in a particular network. (It's currently uncontrolled, so any site could have the code on it.)

Any ideas on this?

Thanks

I don't think you will find too many PHP web spider scripts, only because PHP is an interpreted language and will be quite slow. I would probably look into Java.

Hi,

I agree with R0bb0b here, PHP just isn't powerful enough to be doing this sort of task. PHP is not a compiled language, so it is interpreted into machine-readable code on the fly. This is very slow in comparison to compiled languages such as C and C++, which naturally operate at a level closer to the hardware. Even if you had unlimited power and resources, PHP is very restrictive, as it only runs once, and thanks to execution timeouts, you can't do very much.

If you're still interested in making a spider, I must say that it's no easy feat. You would be much better off looking into C and C++ in the subsidiary forums, with most of your focus spent on understanding networking.

Thanks,
Christopher Lord

How about trying a Google search? You could search within the HTML for your JavaScript code. If you would like it in script form, try using the Google AJAX Search widget and simply count the results!
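
A rough sketch of that idea in PHP (the AJAX Search endpoint and response fields below are from memory and may have changed, and the query string is a hypothetical unique fragment of your snippet, so treat this as illustrative only):

<?php
// Ask Google's AJAX Search API roughly how many pages match a query.
$q = urlencode('"your-unique-snippet.js"');
$json = file_get_contents('http://ajax.googleapis.com/ajax/services/search/web?v=1.0&q='.$q);
$data = json_decode($json, true);
echo $data['responseData']['cursor']['estimatedResultCount']." results\n";
?>

One caveat: Google matches visible text and URLs rather than raw HTML source, so this works best when the snippet includes a distinctive script URL.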

I don't think you will find too many PHP web spider scripts, only because PHP is an interpreted language and will be quite slow.

I agree with R0bb0b here, PHP just isn't powerful enough to be doing this sort of task.

Well, from the quotes above, I may just be able to prove all of that wrong with a PHP bot series I am working on. This post is not an advertisement but is to make the users 'R0bb0b', 'heenix' and 'monsterpot' aware of what PHP is really capable of.
I have managed to make a bot which will index all of the URLs that are linked to (in a tree-type format) and will keep on indexing the web until it runs out of links. I will provide you with the source for you to make your JavaScript finder; currently the script below only indexes URLs, which is needed to scan more than one page.

<?php
if (isset($_GET['url']))
	{
	include("db.php");
	mysql_connect($dbhost,$accountname,$password)
	or die("Could not connect to MySQL server");
	mysql_select_db($database) or die(mysql_error()."Could not select database");

	// Fetch the starting page and split it at every href= attribute.
	$file=file_get_contents($_GET['url']);
	$links=preg_split('/(href\=\'|href\=\"|href\=)/is',$file);

	// Record the starting URL as already crawled (stage 1).
	mysql_query("INSERT INTO `indextemp` SET `url`='".mysql_real_escape_string($_GET['url'])."', `stage`='1'");
	$id=1;
	while (isset($links[$id]))
		{
		// Trim each fragment back to a bare URL by cutting at the first
		// closing quote, angle bracket or space.
		$links[$id]=preg_replace("/([^\'])\'(.*)/is",'$1',$links[$id]);
		$links[$id]=preg_replace("/([^\"])\"(.*)/is",'$1',$links[$id]);
		$links[$id]=preg_replace("/([^\>])\>(.*)/is",'$1',$links[$id]);
		$links[$id]=preg_replace("/([^ ])\ (.*)/is",'$1',$links[$id]);

		// Queue the URL (stage 0 = not yet crawled) if we haven't seen it.
		$ifexists=mysql_query("SELECT * FROM `indextemp` WHERE `url`='".mysql_real_escape_string($links[$id])."'");
		if (mysql_num_rows($ifexists)==0 && strlen($links[$id])>16)
			{
			mysql_query("INSERT INTO `indextemp` SET `url`='".mysql_real_escape_string($links[$id])."', `stage`='0'");
			echo $links[$id]."<br>";
			}
		$id+=1;
		}
	unset($links);

	// Keep crawling until there are no uncrawled (stage 0) URLs left.
	while (true)
		{
		$sqllinksa=mysql_query("SELECT * FROM `indextemp` WHERE `stage`='0'");
		while ($sqllinks=mysql_fetch_array($sqllinksa))
			{
			$file=file_get_contents($sqllinks['url']);
			$links=preg_split('/(href\=\'|href\=\"|href\=)/is',$file);

			// Mark this page as crawled before extracting its links.
			mysql_query("UPDATE `indextemp` SET `stage`='1' WHERE `url`='".mysql_real_escape_string($sqllinks['url'])."'");
			$id=1;
			while (isset($links[$id]))
				{
				$links[$id]=preg_replace("/([^\'])\'(.*)/is",'$1',$links[$id]);
				$links[$id]=preg_replace("/([^\"])\"(.*)/is",'$1',$links[$id]);
				$links[$id]=preg_replace("/([^\>])\>(.*)/is",'$1',$links[$id]);
				$links[$id]=preg_replace("/([^ ])\ (.*)/is",'$1',$links[$id]);

				$ifexists=mysql_query("SELECT * FROM `indextemp` WHERE `url`='".mysql_real_escape_string($links[$id])."'");
				if (strlen($links[$id])>5 && mysql_num_rows($ifexists)==0)
					{
					mysql_query("INSERT INTO `indextemp` SET `url`='".mysql_real_escape_string($links[$id])."', `stage`='0'");
					echo $links[$id]."<br>";
					}
				$id+=1;
				}
			unset($links);
			}
		// Stop once every queued URL has been visited.
		$checkcontinue=mysql_query("SELECT * FROM `indextemp` WHERE `stage`='0'");
		if (mysql_num_rows($checkcontinue)==0)
			{
			break;
			}
		}
	}
echo "<form><input type='text' name='url' size=50><input type='submit' value='index'></form>";
?>

Below is a second file, named db.php, with the SQL configuration.

<?php
$accountname='root';
$password='';
$dbhost='localhost';
$database='mydatabasename';
?>

Also, in the MySQL database there is a table named indextemp, with the columns url and stage.
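
In case it saves anyone a minute, a one-off setup for that table could look like the following (the column types are my own guess; the script only requires that url and stage exist):

<?php
// Hypothetical setup for the table the crawler expects.
mysql_query("CREATE TABLE `indextemp` (
	`url` VARCHAR(255) NOT NULL PRIMARY KEY,
	`stage` TINYINT NOT NULL DEFAULT 0)");  // 0 = queued, 1 = crawled
?>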

Comment to monsterpot:
If you are still interested in making a PHP bot to find your JavaScript on the web, then just let me know, as I can help you there. Also, I would need to know exactly what that JavaScript code is and which parts of it can change.
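
In the meantime, here is a minimal sketch of the matching step itself, assuming the snippet has a fixed file name and a per-site variable part (both names below are made up for illustration):

<?php
// Return true if the page's <body> contains the network's script tag,
// where site_id stands in for whatever part changes between sites.
function page_has_snippet($html) {
	if (!preg_match('/<body[^>]*>(.*)<\/body>/is', $html, $m)) return false;
	return preg_match('/networkwidget\.js\?site_id=\d+/i', $m[1]) == 1;
	}
?>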

Well, from the quotes above, I may just be able to prove all of that wrong with a PHP bot series I am working on. This post is not an advertisement but is to make the users 'R0bb0b', 'heenix' and 'monsterpot' aware of what PHP is really capable of.
I have managed to make a bot which will index all of the URLs that are linked to (in a tree-type format) and will keep on indexing the web until it runs out of links. I will provide you with the source for you to make your JavaScript finder; currently the script below only indexes URLs, which is needed to scan more than one page.

Sure, nobody said it couldn't be done, it'll just take all day, possibly all week, depending on how many sites you plan to spider. Google actually claims to index 2e9 pages. Let's say you get a fourth of that at some point (500,000,000), and let me be generous and say that each page has an average of 10 links. If I am reading your script right, that would be 5,000,000,000 loops. Last time I clocked PHP it was running about 600,000 loops per minute on a 3g processor with 1g ram. That works out to just over 8,333 minutes, given that you had absolutely no memory leaks while the script is running, and this doesn't include the file_get_contents() and mysql_query() calls that you will be making. I personally wouldn't do it with PHP, because PHP is not a multi-threading language (which would be extremely helpful in this case) and PHP is several times slower than C++.

But for a side project, maybe.

Sure, nobody said it couldn't be done, it'll just take all day, possibly all week, depending on how many sites you plan to spider. Google actually claims to index 2e9 pages.

Just to add to the statistics, I remember reading in a Google announcement that Google now claims to have 1 trillion pages indexed. Also, from the Google web alerts I have (e.g. inurl:cwarn23.info), I have found that Google usually revisits the same website within 8 days, though once in the past year it took 14 days. I have also tested my script to index an average of 16 pages per second (at the most). So I hope you find those statistics useful.

The stupidity in this thread made me register an account. Originally I was googling for the same thing the original poster is looking for.

When you crawl the web, you will spend most of your time waiting for network packets and saving the data someplace. A spider is a perfect example of a piece of code where execution time does not matter at all. You could write it in Commodore BASIC 2.0 and wouldn't notice a difference.

Creating a spider in an unsuitable language like C++ will double your development effort for an actual performance gain in the first percentile.

The execution time limit in PHP is actually configurable. (Doh.) It's usually disabled for command-line execution. PHP only runs once? What happens then? Does the script self-destruct?
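
For the record, lifting the limit is a one-liner in stock PHP:

<?php
set_time_limit(0);  // remove the execution time limit for this script
// or from the shell: php -d max_execution_time=0 spider.php
?>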

Recommending java over PHP for performance reasons only makes sense if you are religious and worship The Java.

Naturally, running a web crawler is not a task for days or even weeks. It's closer to years. If the OP was looking to crawl a single site or two, he'd probably use one of the perfectly fine Windows client applications and not look for a script.

There are opcode caches for PHP, which cache the compiled bytecode and eliminate the re-parsing overhead on every run.

Assuming that there are no memory leaks is quite generous. Can we also assume that the world is round? When you "clock" a programming language, it would be kinda helpful to know what that loop was running, which operating system you were using, the bus width, and the compiler flags for the executable. The amount of memory doesn't really matter.

Oh, and ...

<?php
// Count how many empty loop iterations fit into one second.
$count = 0;
$now = microtime(true);
while ( ($now+1) > microtime(true)) $count++;
print "Loops per second: ".number_format($count)."\n";
?>

workhorse:~# php loop.php
Loops per second: 2,222,026

... you are full of it.

But let's assume you actually benchmarked the script in question. Let's also assume an average text weight of 50 KB for a web page. Then your 3g processor (mobile phone?) could spider 30 gigabytes per minute. That's ~500 megabytes per second. Phat subsystem there. MySQL cluster with memory tables on 10GbE?

There are simple ways to split the websites to crawl between several instances of the script. You do not need threads. You can multi-task.
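
To illustrate (my own sketch, not part of any script in this thread): number your crawler processes and let each claim a deterministic slice of the URL queue.

<?php
// Hypothetical sharding: run, say, 4 copies of the crawler and give each
// its own $instance number (0..3); every URL belongs to exactly one copy.
function is_mine($url, $instance, $instances = 4) {
	return (abs(crc32($url)) % $instances) == $instance;
	}
?>

Each instance simply skips queued URLs where is_mine() is false; no threads and no locking needed.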

You remind me of that dude who threatened to "hack my website" and backed that claim with a traceroute. Please stop giving technical advice. Thank you.

But to answer the original question:

http://vision-media.ca/resources/php/create-a-php-web-crawler-or-scraper-5-minutes

Just modify one of the regexps to identify the script tags you are looking for and use the get_links function to identify the next target.

Dump the links into a MySQL table together with a flag for whether they've been crawled, and feed the spider from that table. Dump positives into another table with the URL. You can then query the results from that one at any time.
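
A minimal sketch of that feed cycle (the table names, the marker string, and the plain strpos() test are placeholders of mine, not anything from the linked tutorial):

<?php
// Take one pending URL from the queue, mark it crawled, and record it
// in the hits table if the page contains the snippet.
$row = mysql_fetch_array(mysql_query("SELECT url FROM crawl_queue WHERE crawled=0 LIMIT 1"));
if ($row) {
	mysql_query("UPDATE crawl_queue SET crawled=1 WHERE url='".mysql_real_escape_string($row['url'])."'");
	$html = file_get_contents($row['url']);
	if (strpos($html, 'your-snippet-marker') !== false) {
		mysql_query("INSERT INTO snippet_hits SET url='".mysql_real_escape_string($row['url'])."'");
		}
	}
?>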

If you run this on an external server or webspace, observe your bandwidth/volume limits. The internet is big.

I tried the script, but it only crawled the homepage, and it also crawled external links, like Google, etc.

How do I make it crawl only the internal links across all pages, and never crawl the same internal link twice? Please tell me how to do that. It's urgent.

Please reply. I'm using the script above to crawl all the links on all pages, but it crawls every link.

How do I make it crawl only the internal links and save each distinct link to the database?

Please, I'm waiting for an answer.

Sorry I haven't been tuned in for the past couple of days since I've just switched from Windows to Linux, but you might like this link: http://syntax.cwarn23.info/PHP:_Making_a_search_engine. Also, there is an update to my previous code, which is as follows:

<form method="post">Scan site: <input type="text" name="site" value="http://" style="width:300px">
<input value="Scan" type="submit"></form>
<?php
set_time_limit (0);
if (isset($_POST['site']) && !empty($_POST['site'])) {
/* Formats Allowed */
$formats=array('html'=>true,'htm'=>true,'xhtml'=>true,'xml'=>true,'mhtml'=>true,'xht'=>true,
'mht'=>true,'asp'=>true,'aspx'=>true,'adp'=>true,'bml'=>true,'cfm'=>true,'cgi'=>true,
'ihtml'=>true,'jsp'=>true,'las'=>true,'lasso'=>true,'lassoapp'=>true,'pl'=>true,'php'=>true,
'php1'=>true,'php2'=>true,'php3'=>true,'php4'=>true,'php5'=>true,'php6'=>true,'phtml'=>true,
'shtml'=>true,'search'=>true,'query'=>true,'forum'=>true,'blog'=>true,'1'=>true,'2'=>true,
'3'=>true,'4'=>true,'5'=>true,'6'=>true,'7'=>true,'8'=>true,'9'=>true,'10'=>true,'11'=>true,
'12'=>true,'13'=>true,'14'=>true,'15'=>true,'16'=>true,'17'=>true,'18'=>true,'19'=>true,
'20'=>true,'01'=>true,'02'=>true,'03'=>true,'04'=>true,'05'=>true,'06'=>true,'07'=>true,
'08'=>true,'09'=>true,'go'=>true,'page'=>true,'file'=>true);

function domain ($ddomain) {
return preg_replace('/^((http(s)?:\/\/)?([^\/]+))(.*)/','$1',$ddomain);
}

function url_exists($durl)
		{
		// Version 4.x supported
		$handle   = curl_init($durl);
		if (false === $handle)
			{
			return false;
			}
		curl_setopt($handle, CURLOPT_HEADER, true);
		curl_setopt($handle, CURLOPT_FAILONERROR, true);  // treat HTTP codes >= 400 as failure
		curl_setopt($handle, CURLOPT_HTTPHEADER, 
Array("User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.15) Gecko/20080623 Firefox/2.0.0.15") );
		curl_setopt($handle, CURLOPT_NOBODY, true);
		curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
		$connectable = curl_exec($handle);
		curl_close($handle);  
        if (stripos(substr_replace($connectable,'',30),'200 OK')) {
            return true;
            } else {
            return false;
            }
		}
$f_data='';
// The function below will only collect links within the site's own domain, not external links.
function getlinks($generateurlf) {
    global $formats;
    global $f_data;
    $f_data=file_get_contents($generateurlf);
    $datac=$f_data;
    preg_match_all('/(href|src)\=(\"|\')([^\"\'\>]+)/i',$datac,$media);
    unset($datac);
    $datac=$media[3];
    unset($media);
    $datab=array();
    $str_start=array('http'=>true,'www.'=>true);
    foreach($datac AS $dfile) {
        $generateurle=$generateurlf;
		$format=strtolower(preg_replace('/(.*)[.]([^.\?]+)(\?(.*))?/','$2',basename($generateurle.$dfile)));
        if (!isset($str_start[substr_replace($dfile,'',4)])) {
            if (substr_replace($generateurle,'',0, -1)!=='/') {
                $generateurle=preg_replace('/(.*)\/[^\/]+/is', "$1", $generateurle);
                } else {
                $generateurle=substr_replace($generateurle,'',-1);
                }
 
            if (substr_replace($dfile,'',1)=='/') {
                if (domain($generateurle)==domain($generateurle.$dfile)) {
                    if (isset($formats[$format]) 
                        || substr($generateurle.$dfile,-1)=='/' || substr_count(basename($generateurle.$dfile),'.')==0) {
                        $datab[]=$generateurle.$dfile;
                        }
                    }
                } else if (substr($dfile,0,2)=='./') {
                $dfile=substr($dfile,2);
                if (isset($formats[$format])) {$datab[]=$generateurle.'/'.$dfile;}
                } else if (substr_replace($dfile,'',1)=='.') {
                while (preg_match('/\.\.\/(.*)/i', $dfile)) {
                $dfile=substr_replace($dfile,'',0,3);
                $generateurle=preg_replace('/(.*)\/[^\/]+/i', "$1", $generateurle);
                }
                if (domain($generateurle)==domain($generateurle.'/'.$dfile)) {
                    if (isset($formats[$format]) || substr($generateurle.'/'.$dfile,-1)=='/' 
                        || substr_count(basename($generateurle.'/'.$dfile),'.')==0) {
                        $datab[]=$generateurle.'/'.$dfile;
                        }
                    }
                } else {
                if (domain($generateurle)==domain($generateurle.'/'.$dfile)) {
                    if (isset($formats[$format]) || substr($generateurle.'/'.$dfile,-1)=='/' 
                        || substr_count(basename($generateurle.'/'.$dfile),'.')==0) {
                        $datab[]=$generateurle.'/'.$dfile;
                        }
                    }
                }
            } else {
            if (domain($generateurle)==domain($dfile)) {
                if (isset($formats[$format]) || substr($dfile,-1)=='/' || substr_count(basename($dfile),'.')==0) {
                    $datab[]=$dfile;
                    }
                }
            }
		unset($format);
        }
    unset($datac);
    unset($dfile);
    return $datab;
    }

//=============================================
/* Modify only the code between these two lines, plus the $formats variable above. */

function generate($url) {
    echo $url.'<br>';
    global $f_data; //Data of file contents
    //do something with webpage $f_data.
    unset($f_data);
    }


//=============================================
// Below is what actually process the search engine
$sites=array();
$sites[]=stripslashes($_POST['site']);
for ($i=0;isset($sites[$i]);$i++) {
    foreach (getlinks(stripslashes($sites[$i])) AS $val) {
        if (!isset($sites[$val])) {
            $sites[]=$val;
            $sites[$val]=true;
            }
        } unset($val);
    if (url_exists($sites[$i])) {
        generate($sites[$i]);
        flush();
        }
    }
}
?>
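
To tie this back to the original question: generate() is the hook where the JavaScript check belongs. A possible body, where the marker string is a placeholder for whatever fixed part of your snippet never changes:

<?php
function generate($url) {
	global $f_data; // page contents, already fetched by getlinks()
	// Report the page if it carries the network's script.
	if (strpos($f_data, 'your-unique-snippet.js') !== false) {
		echo 'Found on: '.$url.'<br>';
		}
	unset($f_data);
	}
?>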

You can try this.

<?php
//set_time_limit (0);
function crawl_page($url, $depth = 5){
    static $seen = array();                     // shared across recursive calls
    if(($depth == 0) or isset($seen[$url])){    // stop at max depth or on repeats
        return;
    }
    $seen[$url] = true;
    echo $url."<br>";
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
    $result = curl_exec ($ch);
    curl_close ($ch);
    if( $result ){
        // Keep only the anchor tags, then pull each href value out.
        $stripped_file = strip_tags($result, "<a>");
        preg_match_all("/<a[\s]+[^>]*?href[\s]?=[\s\"\']+"."(.*?)[\"\']+.*?>"."([^<]+|.*?)?<\/a>/", $stripped_file, $matches, PREG_SET_ORDER );
        foreach($matches as $match){
            crawl_page($match[1], $depth - 1);  // follow each (absolute) link one level deeper
        }
    }
}
crawl_page("http://www.sitetobecrawled.com/",3);
?>

You can then pass it through a loop as explained in this crawl bot tutorial.
