Making a simple indexer
Hi, I would like to know of a resource or a simple solution for this:
I have a MySQL database with lots of links. I need a script that visits each link and downloads its HTML file, like an indexer that is told what to index.
To do this the correct way, you have to create a spider. Downloading web pages can be intensive for both your server and the remote server, so you need to leave an interval between connections when they go to the same host.
You also have to adhere to the remote host's robots.txt policy.
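A rough sketch of the "interval between connections to the same host" idea, in case it helps. The function names, the 5-second default and the `$fetch` callback are all made up for illustration; pick a delay that suits the sites you visit.

```php
<?php
// Hypothetical helper: how many seconds to wait before contacting the host
// of $url again. $lastHit maps host name => time of the last request.
function seconds_to_wait(string $url, array $lastHit, float $now, float $minDelay = 5.0): float
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || !isset($lastHit[strtolower($host)])) {
        return 0.0; // no host, or host never contacted: no wait needed
    }
    $elapsed = $now - $lastHit[strtolower($host)];
    return max(0.0, $minDelay - $elapsed);
}

// Hypothetical loop: $fetch is any callable that downloads one URL,
// for example 'file_get_contents'.
function fetch_politely(array $urls, callable $fetch, float $minDelay = 5.0): array
{
    $lastHit = [];
    $pages = [];
    foreach ($urls as $url) {
        $wait = seconds_to_wait($url, $lastHit, microtime(true), $minDelay);
        if ($wait > 0) {
            usleep((int) round($wait * 1000000)); // pause only when needed
        }
        $pages[$url] = $fetch($url);
        $host = parse_url($url, PHP_URL_HOST);
        if (is_string($host)) {
            $lastHit[strtolower($host)] = microtime(true);
        }
    }
    return $pages;
}
```

Calling `fetch_politely($urls, 'file_get_contents')` would then pause only between requests that hit the same host. It does not read robots.txt; that check still has to be added separately.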
Simple solution: how to save a single file
$text = file_get_contents($url);
file_put_contents($save_location, $text);
file_put_contents() requires PHP 5. On older versions you can create an equivalent function:
if (!function_exists('file_put_contents')) {
    // create a simple user function to emulate PHP5's file_put_contents()
    function file_put_contents($file, $contents) {
        $fp = fopen($file, 'w'); // open for writing, truncating any existing file
        if (!$fp) {
            return false;
        }
        $bytes = fwrite($fp, $contents);
        fclose($fp);
        return $bytes; // number of bytes written, or false on failure
    }
}
If you are doing this frequently and on a large scale, though, it's good to have a spider. This can be written in PHP (though it's not the best language for it). The difference between a spider and a normal script is that a PHP spider should be run with the CLI version of PHP, from the command line, rather than through the web server. This allows it to run as a service, or daemon. The daemon can then download files for an indefinite period without timing out, while being considerate of the resources it uses on the remote server (allowing breaks between downloads).
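Something like the following is what I mean by a CLI loop. It is only a sketch: `run_spider`, `$nextUrl` and `$handle` are hypothetical names, and here the loop simply stops when `$nextUrl` returns null, whereas a real long-running service would more likely sleep and poll its queue again.

```php
<?php
// Hypothetical worker loop for a command-line spider.
// $nextUrl returns the next URL to process, or null when there is none.
// $handle downloads and stores one URL.
function run_spider(callable $nextUrl, callable $handle, int $pauseSeconds = 5): int
{
    if (PHP_SAPI !== 'cli') {
        throw new RuntimeException('Run this script with the PHP CLI binary.');
    }
    set_time_limit(0); // the CLI default is already unlimited; this makes it explicit

    $done = 0;
    while (($url = $nextUrl()) !== null) {
        $handle($url);
        $done++;
        sleep($pauseSeconds); // take a break between downloads
    }
    return $done; // number of URLs processed
}
```

You would start it from a shell with the `php` command-line binary and leave it running, instead of requesting it through the web server.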
You would also use cURL or fsockopen(), as they give you more control over the connection (you can keep an HTTP/1.1 session open with the server while downloading several pages), let you tell the remote host that you are a robot/spider (through the User-Agent header), and let you follow the spidering policies in the server's robots.txt.
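A minimal cURL sketch of the above, assuming the curl extension is installed. The user agent string, the timeouts and the function names are placeholders I made up; reusing one handle for several requests lets libcurl keep the connection to a host open between them.

```php
<?php
// Hypothetical spider identity; replace with your own name and info URL.
const SPIDER_USER_AGENT = 'ExampleSpider/0.1 (+http://example.com/spider-info)';

// Create one handle and reuse it for many requests.
function make_handle()
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow redirects
        CURLOPT_MAXREDIRS      => 5,
        CURLOPT_CONNECTTIMEOUT => 10,    // seconds (arbitrary choice)
        CURLOPT_TIMEOUT        => 30,    // seconds (arbitrary choice)
        CURLOPT_USERAGENT      => SPIDER_USER_AGENT,
    ]);
    return $ch;
}

// Returns the page body, or null on a transport error or non-2xx status.
function fetch_page($ch, string $url): ?string
{
    curl_setopt($ch, CURLOPT_URL, $url);
    $body = curl_exec($ch);
    if ($body === false) {
        return null;
    }
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    return ($status >= 200 && $status < 300) ? $body : null;
}
```

Usage would be `$ch = make_handle();` once, then `fetch_page($ch, $url)` for each URL. This sketch does not read robots.txt; that still needs its own check before each host is crawled.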
Spiders can also be implemented when PHP is compiled as an Apache module and you cannot register a service or run a daemon on the server. The script then needs an interval-based trigger, such as web page hits, incoming email received over SMTP, or even an external website that pings your script.
Hope that helps a bit.
www.fijiwebdesign.com - web design and development and fun
Cpanel Email - Let users Register email accounts on your website upon registration
Ajax Chat - Fully browser based chat!
Thanks so much for the help! Is it possible to save only the data between the <body> tags?
Something like:
preg_match("/<body([^>]*)>(.*?)<\/body>/is", $txt, $matches);
// $matches[2] now holds the content between the <body> tags
You need the s modifier (PCRE_DOTALL) so that . also matches newlines; without it the pattern fails whenever the <body> content spans more than one line. The m modifier only changes how ^ and $ behave, so it does not help here.
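If the regex turns out to be too fragile on real-world pages, an HTML parser is another option. This is only a sketch using PHP's DOM extension (assuming it is enabled); `extract_body_html` is a made-up name. Note that loadHTML() guesses the character encoding when the page does not declare one, so non-ASCII text may need converting first.

```php
<?php
// Hypothetical helper: returns the markup inside <body>, or null if the
// document cannot be parsed or has no body.
function extract_body_html(string $html): ?string
{
    if ($html === '') {
        return null; // loadHTML() rejects empty input
    }
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // keep sloppy markup from emitting warnings
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return null;
    }
    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return null;
    }
    $inner = '';
    foreach ($body->childNodes as $child) {
        $inner .= $doc->saveHTML($child); // serialise each child of <body>
    }
    return $inner;
}
```

The result is the parser's re-serialised markup, so whitespace and attribute quoting can differ slightly from the original source.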
Thanks so much. Another thing, for the original code, would it work like this:
connect to database
specify $save_location (Database info, which table etc)
<?php
//assuming you're connected to db
$getlist=mysql_query("SELECT url FROM url_table");
while($row=mysql_fetch_array($getlist)){
$text = file_get_contents($row[0]);
file_put_contents($save_location, $row[0]);
}
?>
Would this save pages from URLs specified in a DB?
You'll have to specify a directory where your local files will be saved, and make sure each local copy gets a valid filename. Also note that your loop passes $row[0] (the URL) to file_put_contents() as the data, so it would write the URL rather than the downloaded $text.
What I usually do is make a cryptographic hash of the URL and use that as the filename of the local file. The hash can be MD5, SHA-1, etc.
E.g.:
<?php
// a directory for saving the downloaded pages
$dir = 'cached_sites/';
// assuming you're connected to the db
$getlist = mysql_query("SELECT url FROM url_table");
while ($row = mysql_fetch_array($getlist)) {
    $text = file_get_contents($row[0]);
    // make sure we have something...
    if ($text) {
        // hash the URL (not the content) to get a stable, valid filename
        $filename = sha1($row[0]);
        file_put_contents($dir.$filename.'.html', $text);
    }
}
?>
You can also create a new column in your url_table, called `file` or similar. Then update the table when you create the local copy.
E.g.:
// a directory for saving the downloaded pages
$dir = 'cached_sites/';
// assuming you're connected to the db
$getlist = mysql_query("SELECT url FROM url_table");
while ($row = mysql_fetch_array($getlist)) {
    $text = file_get_contents($row[0]);
    // make sure we have something...
    if ($text) {
        $filename = sha1($row[0]);
        // if we succeed, update db row
        if (file_put_contents($dir.$filename.'.html', $text)) {
            // escape the URL before putting it into the query
            $url = mysql_real_escape_string($row[0]);
            $query = "UPDATE `url_table` SET `file` = '$filename' WHERE `url` = '$url' LIMIT 1";
            mysql_query($query);
        }
    }
}
This lets you know which URLs have a local copy, so your code can handle failures such as HTTP errors or site downtime, or retry only the failed downloads.
Otherwise, you have to do a
file_exists($dir . sha1($url) . '.html')
each time you want to check whether a local copy of a URL exists (which is slow).
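With the extra column in place, finding the URLs that still need downloading becomes a single query. The sketch below uses PDO with prepared statements, since the mysql_* functions used earlier in this thread were removed in PHP 7.0. The function names are hypothetical, the `url_table` and `file` names follow the examples above, and it assumes the connection is set to throw exceptions on errors.

```php
<?php
// Hypothetical helper: URLs that have no local copy recorded yet.
function urls_missing_copy(PDO $db): array
{
    $stmt = $db->query(
        "SELECT url FROM url_table WHERE file IS NULL OR file = '' ORDER BY url"
    );
    return $stmt->fetchAll(PDO::FETCH_COLUMN);
}

// Hypothetical helper: record the local filename for a URL.
// Bound parameters avoid having to escape the URL by hand.
function record_copy(PDO $db, string $url, string $filename): void
{
    $stmt = $db->prepare('UPDATE url_table SET file = :file WHERE url = :url');
    $stmt->execute([':file' => $filename, ':url' => $url]);
}
```

A retry pass would then loop over `urls_missing_copy($db)`, download each page, and call `record_copy($db, $url, sha1($url))` after a successful save.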
Isn't there a simpler way to do this? All it has to do is save the HTML file to a different table.
To save to a table, just save the $text from each URL:
$text = file_get_contents($row[0]);
Other notes:
Use a BLOB column if you will be saving any binary data, or a TEXT column if you are only saving HTML and the like.
I'd make the table column UTF-8 and convert the data from each URL to UTF-8 before saving, so that multilingual data can be stored together. (The php-utf8 library may come in handy if you do any parsing like you mentioned earlier, such as saving only the <body>.)
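A small sketch of the convert-then-save step, assuming the mbstring extension is available. The `page_table` table with `url` and `html` columns, the function names and the ISO-8859-1 fallback are all assumptions for illustration; in practice the source charset should come from the page's Content-Type header or meta tag when it has one.

```php
<?php
// Hypothetical helper: leave valid UTF-8 alone, otherwise convert from a
// fallback charset (ISO-8859-1 here is just a guess for legacy pages).
function to_utf8(string $html, string $fallbackCharset = 'ISO-8859-1'): string
{
    if (mb_check_encoding($html, 'UTF-8')) {
        return $html;
    }
    return mb_convert_encoding($html, 'UTF-8', $fallbackCharset);
}

// Hypothetical helper: store one page in a table of (url, html) rows.
// Assumes $db is a PDO connection that throws exceptions on errors.
function save_page(PDO $db, string $url, string $html): void
{
    $stmt = $db->prepare('INSERT INTO page_table (url, html) VALUES (:url, :html)');
    $stmt->execute([':url' => $url, ':html' => to_utf8($html)]);
}
```

Inside the download loop this would be called as `save_page($db, $row[0], $text)` instead of writing a file.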