Looking for PHP Web Spider Script
Join Date: Sep 2008
Posts: 1
Reputation:
Solved Threads: 0
Apologies first if this is covered elsewhere - I searched but could not find anything.
I am looking for a way to search the web for the presence of a JavaScript code snippet within the HTML <body> of a web page. I would specify the code snippet and send the bot on its way; it would come back with either a number of results or a list of pages.
I realise there are billions of web pages, so I don't know whether this is feasible or not.
The purpose is to determine the number of participating sites in a particular network. (It is currently uncontrolled, so any site could have the code on it.)
Any ideas on this?
Thanks
Last edited by R0bb0b; Sep 27th, 2008 at 4:37 pm.
“Be who you are and say what you feel because those who mind don't matter and those who matter don't mind.” - Dr. Seuss
-- The documentation is inevitable, you may get away with it for a little while but eventually you too will have to do the deed.
Join Date: Sep 2008
Posts: 15
Reputation:
Solved Threads: 1
Hi,
I agree with RobRob here, PHP just isn't powerful enough to be doing this sort of task. PHP is not a compiled language, so it is interpreted into machine-readable code on-the-fly. This is very slow in comparison to compiled languages such as C and C++ which naturally operate at a closer level to the hardware. Even if you had unlimited power and resources, PHP is very restrictive as it only runs once, and thanks to execution timeouts, you can't do very much.
If you're still interested in making a spider, I must say that it's no easy feat. You would be much better off looking into C and C++ in the subsidiary forums, with most of your focus spent on understanding networking.
Thanks,
Christopher Lord
How about trying a Google search? You could search within the HTML for your JavaScript code. If you would like it in script form, try using the Google Ajax Search widget and simply count the results!
Last edited by FlashCreations; Sep 28th, 2008 at 4:43 pm.
"I don't think you will find too many php web spider scripts only because php is an interpreted language and will be quite slow."
"I agree with RobRob here, PHP just isn't powerful enough to be doing this sort of task."
I have managed to make a bot which will index all of the URLs that are linked to (in a tree-type format) and will keep on indexing the web until it runs out of links. I will provide you with the source so you can make your JavaScript finder. Currently the script below only indexes URLs, which is needed to scan more than one page.
php Syntax (Toggle Plain Text)
<?php
// Simple link indexer. Every URL it finds is stored in the `indextemp` table:
// stage 0 = queued, stage 1 = already fetched.
// The mysql_* functions used originally were removed in PHP 7, so this version
// uses mysqli with prepared statements (which also stops URLs containing
// quotes from breaking the queries).

// Return the raw href targets found in a page.
function extract_links($html) {
    $parts = preg_split('/href=["\']?/i', $html);
    array_shift($parts); // drop the text before the first href
    $links = array();
    foreach ($parts as $part) {
        // A link ends at the first quote, '>' or whitespace.
        $links[] = preg_replace('/["\'>\s].*/s', '', $part);
    }
    return $links;
}

// Queue a link unless it is too short, not an http(s) URL, or already known.
function queue_link($db, $url, $minLength) {
    // Relative links are skipped: file_get_contents() would treat them as local paths.
    if (strlen($url) <= $minLength || !preg_match('#^https?://#i', $url)) {
        return;
    }
    $check = mysqli_prepare($db, "SELECT 1 FROM `indextemp` WHERE `url`=?");
    mysqli_stmt_bind_param($check, 's', $url);
    mysqli_stmt_execute($check);
    mysqli_stmt_store_result($check);
    $known = mysqli_stmt_num_rows($check) > 0;
    mysqli_stmt_close($check);
    if (!$known) {
        $insert = mysqli_prepare($db, "INSERT INTO `indextemp` SET `url`=?, `stage`='0'");
        mysqli_stmt_bind_param($insert, 's', $url);
        mysqli_stmt_execute($insert);
        mysqli_stmt_close($insert);
        echo htmlspecialchars($url)."<br>";
    }
}

if (isset($_GET['url']) && preg_match('#^https?://#i', $_GET['url'])) {
    include("db.php");
    $db = mysqli_connect($dbhost, $accountname, $password, $database)
        or die("Could not connect to MySQL server");

    // Record the start page as fetched and queue its links.
    $start = $_GET['url'];
    $stmt = mysqli_prepare($db, "INSERT INTO `indextemp` SET `url`=?, `stage`='1'");
    mysqli_stmt_bind_param($stmt, 's', $start);
    mysqli_stmt_execute($stmt);
    mysqli_stmt_close($stmt);
    // Length thresholds (16 here, 5 below) are kept from the original script.
    foreach (extract_links((string) @file_get_contents($start)) as $link) {
        queue_link($db, $link, 16);
    }

    // Keep fetching queued pages until no link is left at stage 0.
    // (The original tested `$continue=1`, an assignment, so only `break` ended the loop.)
    $done = mysqli_prepare($db, "UPDATE `indextemp` SET `stage`='1' WHERE `url`=?");
    while (true) {
        $queued = mysqli_query($db, "SELECT `url` FROM `indextemp` WHERE `stage`='0'");
        if (mysqli_num_rows($queued) == 0) {
            break;
        }
        while ($row = mysqli_fetch_assoc($queued)) {
            $url = $row['url'];
            mysqli_stmt_bind_param($done, 's', $url);
            mysqli_stmt_execute($done);
            foreach (extract_links((string) @file_get_contents($url)) as $link) {
                queue_link($db, $link, 5);
            }
        }
    }
}
echo "<form><input type='text' name='url' size=50><input type='submit' value='index'></form>";
?>
Below is a second file named 'db.php' with sql configurations.
php Syntax (Toggle Plain Text)
<?php
$accountname = 'root';
$password = '';
$dbhost = 'localhost';
$database = 'mydatabasename';
?>
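The script expects a table called `indextemp` to exist already, but the thread never shows it. Here is a rough sketch of what it might look like. Only the table name and the column names `url` and `stage` come from the script; the column types, the index, and the helper name `indextemp_schema()` are my assumptions.

```php
<?php
// Hypothetical schema for the `indextemp` table used by the indexer.
// Column names come from the script; the types and the index are assumptions.
function indextemp_schema(): string
{
    return "CREATE TABLE IF NOT EXISTS `indextemp` (
        `url`   VARCHAR(255) NOT NULL,
        `stage` TINYINT NOT NULL DEFAULT 0,  -- 0 = queued, 1 = fetched
        KEY (`url`)
    )";
}

echo indextemp_schema(), "\n";
```

The statement only needs to be run once, through whichever MySQL client or connection you normally use.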
Comment to monsterpot:
If you are still interested in making a PHP bot to find your javascript on the web then just let me know as I can help you there. Also I would need to know exactly what that javascript code is and what parts of that javascript code can change.
Try not to bump 10 year old threads as it can be really annoying.
http://syntax.cwarn23.net/
My favourite PC. - Oopy Doopy Do 2U2!
Well, from the quotes above, I may just be able to prove all of that wrong with a PHP bot series I am working on. This post is not an advertisement but is meant to make the users 'R0bb0b', 'heenix' and 'monsterpot' aware of what PHP is really capable of.
I have managed to make a bot which will index all of the URLs that are linked to (in a tree-type format) and will keep on indexing the web until it runs out of links. I will provide you with the source so you can make your JavaScript finder. Currently the script below only indexes URLs, which is needed to scan more than one page.
But for a side project, maybe.
Last edited by R0bb0b; Sep 29th, 2008 at 1:34 am.
“Be who you are and say what you feel because those who mind don't matter and those who matter don't mind.” - Dr. Seuss
-- The documentation is inevitable, you may get away with it for a little while but eventually you too will have to do the deed.
Sure, nobody said it couldn't be done, it'll just take all day, possibly all week depending on how many sites you plan to spider. Google actually claims to index 2e9 pages.
(e.g. inurl:cwarn23.info) I have found that Google usually revisits the same website within 8 days, although once in the past year it took 14 days. I have also tested my script, and it indexes an average of 16 pages per second (at the most). I hope you find those statistics useful.
Try not to bump 10 year old threads as it can be really annoying.
http://syntax.cwarn23.net/
My favourite PC. - Oopy Doopy Do 2U2!
Join Date: Dec 2008
Posts: 2
Reputation:
Solved Threads: 0
The stupidity in this thread made me register an account. Originally I was googling for the same thing the original poster is looking for.
When you crawl the web, you will spend most of the time waiting for network packets and saving the data some place. A spider is a perfect example of a piece of code where execution time does not matter at all. You could write it in Commodore BASIC 2.0 and wouldn't notice a difference.
Creating a spider in an unsuitable language like C++ will double your development effort for an actual performance gain in the first percentile.
The execution time limit in PHP is actually configurable. (Doh). It's usually disabled for command line execution. PHP only runs once? What happens then? The script self destructs?
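To make the time-limit point concrete, here is a minimal sketch, assuming the crawler is started from the command line. `set_time_limit()` and the `max_execution_time` setting are standard PHP; the helper name `disable_time_limit()` is just made up for the example.

```php
<?php
// Hypothetical helper: lift the execution time limit before a long crawl.
function disable_time_limit(): int
{
    // 0 means "no limit"; the command-line SAPI already defaults to 0.
    set_time_limit(0);
    return (int) ini_get('max_execution_time');
}

echo "SAPI: " . PHP_SAPI . "\n";
echo "max_execution_time: " . disable_time_limit() . "\n";
```

Under a web server, other timeouts (for example in the server itself) may still cut a long request short, which is another reason to run a crawler from the command line.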
Recommending Java over PHP for performance reasons only makes sense if you are religious and worship The Java.
Naturally, running a web crawler is not a task for days or even weeks. It's closer to years. If the OP was looking to crawl a single site or two, he'd probably use one of the perfectly fine Windows client applications and not look for a script.
There are opcode caches for PHP, which keep the compiled bytecode in memory so that a script is not recompiled on every run.
Assuming that there are no memory leaks is quite generous. Can we also assume that the world is round? When you "clock" a programming language, it would be kinda helpful to know what that loop was running, which operating system you were using, the bus width, and the compiler flags for the executable. The amount of memory, on the other hand, hardly matters.
Oh, and ...
PHP Syntax (Toggle Plain Text)
<?php
$count = 0;
$now = microtime(true);
while ( ($now+1) > microtime(true)) $count++;
print "Loops per second: ".number_format($count)."\n";
?>

workhorse:~# php loop.php
Loops per second: 2,222,026
... you are full of it.
But let's assume you actually benchmarked the script in question. Let's also assume an average text weight of 50 KB for a web page. Then your 3g processor (mobile phone?) could spider 30 gigabytes per minute. That's ~500 megabytes per second. Phat subsystem there. MySQL cluster with memory tables on 10GbE?
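For anyone who wants to check the arithmetic, here is a small sketch using decimal units (1 GB = 1000 MB). The 50 KB page weight and the 30 GB per minute figure are the ones from the paragraph above, and the 16 pages per second is the rate reported earlier in the thread; the variable names are only illustrative.

```php
<?php
// Back-of-the-envelope throughput check, decimal units (1 GB = 1000 MB).
$pageSizeKb  = 50;  // assumed average page weight from the post
$gbPerMinute = 30;  // crawl volume discussed in the post

$mbPerSecond    = $gbPerMinute * 1000 / 60;            // 500 MB/s
$pagesPerSecond = $mbPerSecond * 1000 / $pageSizeKb;   // 10,000 pages/s

// For comparison: 16 pages/s is the rate reported earlier in the thread.
$kbPerSecondAt16 = 16 * $pageSizeKb;                   // 800 KB/s

printf("%d MB/s, %d pages/s, %d KB/s at 16 pages/s\n",
    $mbPerSecond, $pagesPerSecond, $kbPerSecondAt16);
```

So 30 GB per minute would mean about 10,000 pages every second, while 16 pages per second at the same page weight is under 1 MB per second.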
There are simple ways to split the websites to crawl between several instances of the script. You do not need threads. You can multi-task.
You remind me of that dude who threatened to "hack my website" and backed that claim with a traceroute. Please stop giving technical advice. Thank you.
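On the point above about splitting the websites between several instances of the script: one possible way (a sketch, not the only approach) is to hash the host name and let each instance take the hosts that map to its own number. The function name `instance_for_url()` and the example URLs are made up for illustration.

```php
<?php
// Hypothetical helper: decide which crawler instance owns a URL.
// All URLs on the same host map to the same instance.
function instance_for_url(string $url, int $instances): int
{
    $host = strtolower((string) parse_url($url, PHP_URL_HOST));
    // crc32() gives a stable hash; the modulo picks one of N instances.
    return abs(crc32($host)) % $instances;
}

// Usage: start N copies of the script, each with its own number (0..N-1),
// and let each copy skip URLs that belong to another instance.
$myInstance = 1;
$total = 4;
foreach (['http://example.com/a', 'http://example.com/b', 'http://example.org/'] as $url) {
    $action = instance_for_url($url, $total) === $myInstance ? 'crawl' : 'skip';
    echo "$url: $action\n";
}
```

Each copy runs as an ordinary separate process, so no threads are involved.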
Join Date: Dec 2008
Posts: 2
Reputation:
Solved Threads: 0
But to answer the original question:
http://vision-media.ca/resources/php...aper-5-minutes
Just modify one of the regexps to identify the script tags you are looking for and use the get_links function to identify the next target.
Dump the links into a MySQL table together with a flag if/when crawled and feed the spider from that table. Dump positives into another table with the URL. You can then query the results from that one any time.
If you run this on an external server or webspace, observe your bandwidth/volume limits. The internet is big.
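As a rough idea of what the detection step could look like, here is a sketch that checks whether the <body> of a page contains a given snippet. It does not use the code from the linked article; the function name `body_contains_snippet()` and the `network.track('abc');` snippet are invented for the example, and whitespace is collapsed so that reformatted copies of the snippet still match.

```php
<?php
// Hypothetical helper: does the <body> of $html contain $snippet?
// Whitespace is collapsed on both sides so reformatted copies still match.
function body_contains_snippet(string $html, string $snippet): bool
{
    // Keep only what follows the opening <body ...> tag, if there is one.
    if (preg_match('/<body\b[^>]*>(.*)/is', $html, $m)) {
        $html = $m[1];
    }
    $normalize = function (string $s): string {
        return trim((string) preg_replace('/\s+/', ' ', $s));
    };
    $needle = $normalize($snippet);
    return $needle !== '' && strpos($normalize($html), $needle) !== false;
}

$page = "<html><head><script>var a = 1;</script></head>"
      . "<body><script>\n  network.track('abc');\n</script></body></html>";
var_dump(body_contains_snippet($page, "network.track('abc');")); // bool(true)
var_dump(body_contains_snippet($page, "var a = 1;"));            // bool(false)
```

A URL for which this returns true would be one of the "positives" to store in the second table.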
Last edited by Mantriur; Dec 30th, 2008 at 9:33 am.