Hi, I'm looking for a resource or a simple solution for this.
I have a MySQL database with lots of links. I need something that visits each link and downloads the HTML file for it, like an indexer that is told where to index.
If you have cURL installed with PHP, using it is more efficient.
To do this the correct way, you have to create a spider. Downloading web pages can be intensive for both your server and the remote server, so you need to leave an interval between connections when they go to the same host.
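To illustrate the interval idea, here is a minimal sketch (assuming PHP 7 or later). The function name `secondsToWait` and the 2-second default gap are my own assumptions, not a standard; pick whatever delay suits the sites you visit.

```php
<?php
// Hypothetical helper: how many seconds to sleep before contacting $host again.
// $lastHit maps host name => Unix timestamp (float) of the previous request.
function secondsToWait(array $lastHit, string $host, float $now, float $minGap = 2.0): float
{
    if (!isset($lastHit[$host])) {
        return 0.0; // first request to this host, no wait needed
    }
    $elapsed = $now - $lastHit[$host];
    return max(0.0, $minGap - $elapsed);
}

// Usage sketch inside a download loop:
// $host = parse_url($url, PHP_URL_HOST);
// usleep((int) (secondsToWait($lastHit, $host, microtime(true)) * 1000000));
// $lastHit[$host] = microtime(true);
```

Keeping the timestamps per host means links to different sites are not slowed down by each other.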
You also have to adhere to the remote host's robots.txt policy.
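A rough sketch of a robots.txt check could look like the following (assuming PHP 7 or later). It is deliberately simplified: the hypothetical `isDisallowed()` only looks at groups for `User-agent: *` and plain `Disallow` prefixes, and ignores `Allow` rules, wildcards and agent-specific groups, which a real spider should handle.

```php
<?php
// Simplified check: true if $path starts with a Disallow prefix that
// applies to all robots ("User-agent: *") in the given robots.txt text.
function isDisallowed(string $robotsTxt, string $path): bool
{
    $applies = false;      // does the current group apply to us?
    $inAgentLines = false; // are we inside a run of User-agent lines?
    foreach (preg_split('/\r\n|\r|\n/', $robotsTxt) as $line) {
        $line = trim(preg_replace('/#.*$/', '', $line)); // strip comments
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }
        list($field, $value) = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);
        if ($field === 'user-agent') {
            if (!$inAgentLines) {
                $applies = false; // a new group starts here
            }
            $applies = $applies || $value === '*';
            $inAgentLines = true;
            continue;
        }
        $inAgentLines = false;
        if ($applies && $field === 'disallow' && $value !== '' && strpos($path, $value) === 0) {
            return true;
        }
    }
    return false;
}
```

You would fetch `/robots.txt` once per host, cache the text, and call this before each download from that host.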
Simple solution: how to save a single file
$text = file_get_contents($url);
file_put_contents($save_location, $text);
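file_put_contents() requires PHP 5. On older versions you can create an equivalent function:
if (!function_exists('file_put_contents')) {
    // create a simple user function to emulate PHP5's file_put_contents()
    function file_put_contents($file, $contents) {
        $fp = fopen($file, 'wb'); // open for writing, truncating any existing file
        if ($fp === false) {
            return false;
        }
        $bytes = fwrite($fp, $contents);
        fclose($fp);
        return $bytes; // number of bytes written, or false on failure
    }
}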
(lol, copy and paste from my last post)
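To apply the single-file approach to a whole list of links, something like this sketch might do (assuming PHP 7.1 or later). The helper names `saveLocationFor()` and `saveAll()` are made up for illustration, and the DSN, credentials, table and column names in the usage comment are assumptions about your database.

```php
<?php
// Hypothetical helper: derive a unique local file name from a URL.
function saveLocationFor(string $url, string $dir): string
{
    return rtrim($dir, '/') . '/' . md5($url) . '.html';
}

// Download every URL into $dir and return the list of URLs that failed.
// $fetch is any callable returning the page body or false,
// for example 'file_get_contents'.
function saveAll(iterable $urls, string $dir, callable $fetch): array
{
    $failed = [];
    foreach ($urls as $url) {
        $text = $fetch($url);
        if ($text === false || file_put_contents(saveLocationFor($url, $dir), $text) === false) {
            $failed[] = $url;
        }
    }
    return $failed;
}

// Usage sketch; connection details and schema are assumptions:
// $pdo    = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
// $urls   = $pdo->query('SELECT url FROM links')->fetchAll(PDO::FETCH_COLUMN);
// $failed = saveAll($urls, '/path/to/pages', 'file_get_contents');
```

Passing the fetcher in as a callable lets you swap file_get_contents() for a cURL-based function later without touching the loop.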
If you are doing this frequently and on a large scale, though, it's good to have a spider. This can be written in PHP (though it is not the best language for it). The difference between a spider and a normal script is that a PHP spider should be run with the CLI version of PHP from the command line rather than through the web server. This allows it to run as a service, or daemon. The daemon can then download files for an indefinite period without timing out, while being considerate of the resources it uses on the remote server (allowing breaks between downloads).
You would also use cURL or fsockopen(), as they let you open sockets with more control (you can keep an HTTP/1.1 session open with the server while downloading several pages), let the remote host know you are a robot/spider, and let you follow the spidering policies in the server's robots.txt.
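As a sketch of the cURL route (assuming the cURL extension is loaded): reusing one handle for several requests lets cURL keep the connection to a host alive where the server permits it. The function names and the User-Agent string are made-up examples; use a string that identifies your own spider.

```php
<?php
// Create one cURL handle to be reused across requests.
// The default User-Agent below is a hypothetical example.
function createSpiderHandle(string $userAgent = 'MySpider/0.1 (+http://example.com/bot.html)')
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body instead of printing it
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent); // tell the remote host we are a robot
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);           // give up after 30 seconds
    return $ch;
}

// Fetch one page with an existing handle.
// Returns the body, or false on a transport error or non-2xx status.
function fetchPage($ch, string $url)
{
    curl_setopt($ch, CURLOPT_URL, $url);
    $body = curl_exec($ch);
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    return ($body !== false && $code >= 200 && $code < 300) ? $body : false;
}

// Usage sketch:
// $ch   = createSpiderHandle();
// $html = fetchPage($ch, 'http://example.com/');
// curl_close($ch);
```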
Spiders can also be implemented with PHP compiled as an Apache module, for when you are unable to register a service or run a daemon on the server. This requires an interval-based trigger for the script, such as web page hits, email receipts from SMTP, or even an external website that pings your script.
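One way to sketch the interval-based trigger (assuming PHP 7 or later) is a timestamp file that each page hit checks. The name `shouldRun()` and the stamp file path are hypothetical, and two hits arriving at the same moment could both pass the check, so treat this as a starting point only.

```php
<?php
// Returns true (and records the time) if at least $interval seconds
// have passed since the last recorded run.
function shouldRun(string $stampFile, int $interval, int $now): bool
{
    $last = is_file($stampFile) ? (int) file_get_contents($stampFile) : 0;
    if ($now - $last < $interval) {
        return false; // too soon since the previous run
    }
    file_put_contents($stampFile, (string) $now, LOCK_EX);
    return true;
}

// Usage sketch at the top of a page that gets regular hits:
// if (shouldRun('/tmp/spider.stamp', 300, time())) {
//     // download a small batch of links here
// }
```

Each triggered run should only fetch a small batch, since the script still has to finish within the web server's time limit.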
Hope that helps a bit.