webcrawler help
Join Date: Sep 2009
Posts: 22
Reputation:
Solved Threads: 0
I'm trying to create a simple web spider/crawler to populate my search engine.
I type the name of a website, then extract the <title> and <header> from the website.
Then I store them in MySQL.
I need some direction on how to do the bolded part; I can't find a tutorial on this.
PHP Syntax
<form action="crawl.php" method="get">
  <center>
    <span class="style1">Crawl Website</span>
    <input name="search" type="text" value="" size="25" />
    <input type="submit" name="submit" value="Go!" />
  </center>
</form>
Last edited by MDanz; Sep 22nd, 2009 at 4:02 pm.
You would need to start with fopen, cURL, or any other HTTP request function. You'll then have to parse the returned HTML, using a regular expression search to find the pieces of the header you're looking for.
Hope that helps.
-Kyle Getson
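A minimal sketch of the fetch step suggested above, assuming the cURL extension is installed. The function names `normalize_url` and `fetch_html` are made up for illustration, and the timeout and redirect settings are assumptions rather than anything from the thread. Since the form accepts a bare site name, the first helper adds a scheme when one is missing.

```php
<?php
// Hypothetical helpers; the names are not from the thread.

// Prepend "http://" when the user typed a bare host such as "www.example.com".
function normalize_url(string $input): string
{
    $url = trim($input);
    if ($url !== '' && !preg_match('#^https?://#i', $url)) {
        $url = 'http://' . $url;
    }
    return $url;
}

// Fetch a page body with cURL; returns null on a transport or HTTP error.
function fetch_html(string $url): ?string
{
    $ch = curl_init($url);
    if ($ch === false) {
        return null;
    }
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects (assumed to be wanted)
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // assumed limit: give up after 10 seconds
    curl_setopt($ch, CURLOPT_FAILONERROR, true);    // treat HTTP status >= 400 as a failure
    $body = curl_exec($ch);
    curl_close($ch);
    return $body === false ? null : $body;
}
```

Typical use would be `$html = fetch_html(normalize_url($_GET['search']));`, followed by a null check before any parsing.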
Visit my recent work: www.searchnaukri.com
If you are looking for something similar, let me know and I will provide you the code.
Mail me: info@hire-phpdeveloper.com
hire-phpdeveloper.com
Start from $3.99 Per Hour
Yahoo IM: hirephpdeveloper
Web Development with php | Hire a dedicated php developer | Hire indian Php developer
I just love making bots. You can view my article at: http://www.syntax.cwarn23.info/PHP_M..._search_engine
The script is as follows. Be warned: bots like this can chew up a lot of CPU and bandwidth. Good luck.
PHP Syntax
<form method="post">
  Scan site: <input type="text" name="site" value="http://" style="width:300px">
  <input value="Scan" type="submit">
</form>
<?php
set_time_limit(0);
if (isset($_POST['site']) && !empty($_POST['site'])) {
    /* Formats Allowed */
    $formats = array('html'=>true,'htm'=>true,'xhtml'=>true,'xml'=>true,'mhtml'=>true,'xht'=>true,
        'mht'=>true,'asp'=>true,'aspx'=>true,'adp'=>true,'bml'=>true,'cfm'=>true,'cgi'=>true,
        'ihtml'=>true,'jsp'=>true,'las'=>true,'lasso'=>true,'lassoapp'=>true,'pl'=>true,'php'=>true,
        'php1'=>true,'php2'=>true,'php3'=>true,'php4'=>true,'php5'=>true,'php6'=>true,'phtml'=>true,
        'shtml'=>true,'search'=>true,'query'=>true,'forum'=>true,'blog'=>true,'1'=>true,'2'=>true,
        '3'=>true,'4'=>true,'5'=>true,'6'=>true,'7'=>true,'8'=>true,'9'=>true,'10'=>true,'11'=>true,
        '12'=>true,'13'=>true,'14'=>true,'15'=>true,'16'=>true,'17'=>true,'18'=>true,'19'=>true,
        '20'=>true,'01'=>true,'02'=>true,'03'=>true,'04'=>true,'05'=>true,'06'=>true,'07'=>true,
        '08'=>true,'09'=>true,'go'=>true,'page'=>true,'file'=>true);

    // Returns the scheme and host part of a URL.
    function domain($ddomain) {
        return preg_replace('/^((http(s)?:\/\/)?([^\/]+))(.*)/', '$1', $ddomain);
    }

    // Sends a header-only request and checks the start of the response for "200 OK".
    function url_exists($durl) {
        // Version 4.x supported
        $handle = curl_init($durl);
        if (false === $handle) {
            return false;
        }
        curl_setopt($handle, CURLOPT_HEADER, true);
        curl_setopt($handle, CURLOPT_FAILONERROR, true); // this works
        curl_setopt($handle, CURLOPT_HTTPHEADER, array("User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.15) Gecko/20080623 Firefox/2.0.0.15"));
        curl_setopt($handle, CURLOPT_NOBODY, true);
        curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
        $connectable = curl_exec($handle);
        curl_close($handle);
        if (stripos(substr_replace($connectable, '', 30), '200 OK')) {
            return true;
        } else {
            return false;
        }
    }

    $f_data = '';

    // below function will only get links within own domain and not links outside the site.
    function getlinks($generateurlf) {
        global $formats;
        global $f_data;
        $f_data = file_get_contents($generateurlf);
        $datac = $f_data;
        preg_match_all('/(href|src)\=(\"|\')([^\"\'\>]+)/i', $datac, $media);
        unset($datac);
        $datac = $media[3];
        unset($media);
        $datab = array();
        $str_start = array('http'=>true, 'www.'=>true);
        foreach ($datac AS $dfile) {
            $generateurle = $generateurlf;
            $format = strtolower(preg_replace('/(.*)[.]([^.\?]+)(\?(.*))?/', '$2', basename($generateurle.$dfile)));
            if (!isset($str_start[substr_replace($dfile, '', 4)])) {
                // Relative link: reduce the current URL to its directory first.
                if (substr_replace($generateurle, '', 0, -1) !== '/') {
                    $generateurle = preg_replace('/(.*)\/[^\/]+/is', "$1", $generateurle);
                } else {
                    $generateurle = substr_replace($generateurle, '', -1);
                }
                if (substr_replace($dfile, '', 1) == '/') {
                    if (domain($generateurle) == domain($generateurle.$dfile)) {
                        if (isset($formats[$format]) || substr($generateurle.$dfile, -1) == '/' || substr_count(basename($generateurle.$dfile), '.') == 0) {
                            $datab[] = $generateurle.$dfile;
                        }
                    }
                } else if (substr($dfile, 0, 2) == './') {
                    $dfile = substr($dfile, 2);
                    if (isset($formats[$format])) {
                        $datab[] = $generateurle.'/'.$dfile;
                    }
                } else if (substr_replace($dfile, '', 1) == '.') {
                    // Strip each leading "../" and climb one directory per step.
                    while (preg_match('/\.\.\/(.*)/i', $dfile)) {
                        $dfile = substr_replace($dfile, '', 0, 3);
                        $generateurle = preg_replace('/(.*)\/[^\/]+/i', "$1", $generateurle);
                    }
                    if (domain($generateurle) == domain($generateurle.'/'.$dfile)) {
                        if (isset($formats[$format]) || substr($generateurle.'/'.$dfile, -1) == '/' || substr_count(basename($generateurle.'/'.$dfile), '.') == 0) {
                            $datab[] = $generateurle.'/'.$dfile;
                        }
                    }
                } else {
                    if (domain($generateurle) == domain($generateurle.'/'.$dfile)) {
                        if (isset($formats[$format]) || substr($generateurle.'/'.$dfile, -1) == '/' || substr_count(basename($generateurle.'/'.$dfile), '.') == 0) {
                            $datab[] = $generateurle.'/'.$dfile;
                        }
                    }
                }
            } else {
                // Absolute link: keep it only if it stays on the same domain.
                if (domain($generateurle) == domain($dfile)) {
                    if (isset($formats[$format]) || substr($dfile, -1) == '/' || substr_count(basename($dfile), '.') == 0) {
                        $datab[] = $dfile;
                    }
                }
            }
            unset($format);
        }
        unset($datac);
        unset($dfile);
        return $datab;
    }

    //=============================================
    /* Modify only code between these two lines and $formats variable above. */
    function generate($url) {
        echo $url.'<br>';
        global $f_data; // Data of file contents
        // do something with webpage $f_data.
        unset($f_data);
    }
    //=============================================

    // Below is what actually processes the search engine
    $sites = array();
    $sites[] = stripslashes($_POST['site']);
    for ($i = 0; isset($sites[$i]); $i++) {
        foreach (getlinks(stripslashes($sites[$i])) AS $val) {
            if (!isset($sites[$val])) {
                $sites[] = $val;
                $sites[$val] = true;
            }
        }
        unset($val);
        if (url_exists($sites[$i])) {
            generate($sites[$i]);
            flush();
        }
    }
}
?>
Try not to bump 10 year old threads as it can be really annoying.
http://syntax.cwarn23.net/
My favourite PC. - Oopy Doopy Do 2U2!
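For readers who only want the "links within own domain" idea from the script above, here is a much smaller sketch built on `parse_url`. It is not a replacement for `getlinks()`: the helper name `same_host_links` is hypothetical, it only looks at `href` attributes, and it skips relative links (which the script above does try to resolve).

```php
<?php
// Hypothetical helper, not part of the script above.
// Returns the links in $html whose host matches the host of $baseUrl.
function same_host_links(string $html, string $baseUrl): array
{
    $baseHost = parse_url($baseUrl, PHP_URL_HOST);
    if (!is_string($baseHost)) {
        return [];
    }
    // Capture the value of every quoted href attribute.
    preg_match_all('/href\s*=\s*(["\'])(.*?)\1/i', $html, $m);
    $links = [];
    foreach ($m[2] as $href) {
        $host = parse_url($href, PHP_URL_HOST);
        // Links without a host (relative paths, mailto:, ...) are skipped.
        if (is_string($host) && strcasecmp($host, $baseHost) === 0) {
            $links[$href] = true; // array keys de-duplicate repeated links
        }
    }
    return array_keys($links);
}
```

Comparing hosts with `strcasecmp` treats `EXAMPLE.com` and `example.com` as the same site, since host names are case-insensitive.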
OK, here is my start. It's basically a quick add, not a spider.
Can someone help me adjust this code so I get the <title> and <head>?
PHP Syntax
<?php
// create curl resource
$ch = curl_init();
// set url
curl_setopt($ch, CURLOPT_URL, "www.realgm.com");
// return the transfer as a string
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// $output contains the output string
$output = curl_exec($ch);
// close curl resource to free up system resources
curl_close($ch);
?>
Try using this on this output variable:
PHP Syntax
preg_match_all('#<head>.*(<title>.*</title>|).*</head>#', $output, $header);
echo '<xmp>';
print_r($header);
echo '</xmp>';
Last edited by cwarn23; Sep 23rd, 2009 at 5:41 pm.
Like this?
I tried this and it says:
Array ( [0] => Array ( ) [1] => Array ( ) )
PHP Syntax
<?php
// create curl resource
$ch = curl_init();
// set url
curl_setopt($ch, CURLOPT_URL, "www.realgm.com");
// return the transfer as a string
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// $output contains the output string
$output = curl_exec($ch);
preg_match_all('#<head>.*(<title>.*</title>|).*</head>#', $output, $header);
echo '<xmp>';
print_r($header);
echo '</xmp>';
// close curl resource to free up system resources
curl_close($ch);
?>
My previous code was off the top of my head, but I have now tested it and it should be as follows:
PHP Syntax
<?php
// create curl resource
$ch = curl_init();
// set url
curl_setopt($ch, CURLOPT_URL, "www.realgm.com");
// return the transfer as a string
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// $output contains the output string
$output = curl_exec($ch);
// U = ungreedy, s = "." also matches newlines, i = case-insensitive
preg_match_all('#<head>.*<title>(.*)</title>.*</head>#Usi', $output, $header);
echo '<xmp>';
print_r($header);
echo '</xmp>';
// close curl resource to free up system resources
curl_close($ch);
?>
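To see why the first pattern returned empty arrays while the corrected one works, it may help to run both against a small multi-line string. The sample HTML below is made up for illustration and is not the real page. Without the `s` modifier, `.` does not match a newline, so the first pattern cannot get from `<head>` to `</head>` when they sit on different lines.

```php
<?php
// Made-up sample input; note the newlines and the uppercase HEAD tag.
$html = "<html>\n<HEAD>\n<title>Example Page</title>\n"
      . "<meta name=\"description\" content=\"demo\">\n</HEAD>\n</html>";

// First attempt from the thread: no modifiers, so "." stops at every newline
// and "<head>" is matched case-sensitively.
$first = preg_match_all('#<head>.*(<title>.*</title>|).*</head>#', $html, $a);

// Corrected pattern: U = ungreedy, s = "." also matches newlines, i = case-insensitive.
$second = preg_match_all('#<head>.*<title>(.*)</title>.*</head>#Usi', $html, $b);

print_r($a);           // two empty arrays, as in the output reported above
echo $second, "\n";    // number of matches for the corrected pattern
echo $b[1][0], "\n";   // text captured between <title> and </title>
```

One limitation to keep in mind: both patterns expect a bare `<head>` tag, so a page whose head tag carries attributes would not match either of them.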
Hi, thanks, got it working. Just one more thing:
how do I echo a single part of this?
I just want the <title></title>; how do I echo that alone?
<title>RealGM: Sports Is Our Business</title>
<meta name="description" content="Real GM">
<meta name="keywords" content="trade checker, draft simulator, nba, simulator, nba news, nba trades, wiretap, nba transactions, nba draft, nba salaries, basketball, rumors, sports, jordan, hill, carter, shaq, mcgrady, kobe, duncan, kidd, garnett, payton, lebron, carmelo, wade, bosh, hawks, celtics, hornets, bulls, cavs, mavericks, nuggets, pistons, warriors, rockets, pacers, lakers, clippers, heat, bucks, timberwolves, nets, knicks, magic, trailblazers, suns, kings, supersonics, spurs, raptors, jazz, grizzlies, wizards, collective bargaining agreement, trade, sign, free agent, renounce, waive, realgm, general manager, gm">
<meta NAME="description" CONTENT="The only site on the web that allows you to sign, trade, waive, and renouce players from NBA teams. Come along and see what it is like to be a GM of a NBA team. All based on the real rules that the big boys must play by."> <meta http-equiv="Content-Style-Type" content="text/css">
By the way, are all websites built like this, with keywords, description, and title?
Last edited by MDanz; Sep 24th, 2009 at 2:50 pm.
In my script, to echo the title simply use
echo $header[1][0];
or to echo the entire header use
echo $header[0][0];
Also, virtually all webpages have the title tag, but not all pages have the meta tags.
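Because the meta tags may be missing, code that stores the description or keywords needs a fallback. The sketch below uses PHP's `DOMDocument` instead of the regex from this thread, which is a different technique swapped in for illustration; it assumes the DOM extension is available, and the function name `extract_page_info` is hypothetical. Missing fields come back as `null` so the caller can decide what to store.

```php
<?php
// Hypothetical helper; it uses DOMDocument rather than the thread's regex.
function extract_page_info(string $html): array
{
    $info = ['title' => null, 'description' => null, 'keywords' => null];
    if (trim($html) === '') {
        return $info; // nothing to parse
    }

    // Real-world HTML is often invalid, so collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $titles = $doc->getElementsByTagName('title');
    if ($titles->length > 0) {
        $info['title'] = trim($titles->item(0)->textContent);
    }

    foreach ($doc->getElementsByTagName('meta') as $meta) {
        $name = strtolower($meta->getAttribute('name'));
        // Keep only the first description and the first keywords tag.
        if (($name === 'description' || $name === 'keywords') && $info[$name] === null) {
            $info[$name] = trim($meta->getAttribute('content'));
        }
    }
    return $info;
}
```

With the `$output` string from the cURL code above, `extract_page_info($output)['title']` would play the same role as `$header[1][0]`.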