
Making a simple indexer

Hi, I would like to know of a resource or simple solution for this:
I have a MySQL database with lots of links. I need the solution to visit each link and download the HTML file for it, like an indexer which is told where to index.

redZERO
Junior Poster in Training
82 posts since Dec 2007
Reputation Points: 10
Solved Threads: 2
 

If you have cURL installed with PHP, using it is more efficient than file_get_contents().

To do this the correct way, you have to create a spider. Downloading web pages can be intensive both for your server and the remote server, so you need to have intervals between making connections if they are to the same host.

You also have to adhere to the remote host's robots.txt policy.
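A rough sketch of what a robots.txt check could look like, assuming you have already downloaded the file's text. The function name isPathAllowed() is made up for illustration, and it only understands `User-agent: *` groups with plain `Disallow:` prefixes. Real robots.txt files can also contain Allow rules, wildcards and groups for specific bots, which this ignores.

```php
<?php
// Hypothetical helper: returns false if $path is disallowed for all robots ("*").
// Simplified on purpose: only "User-agent: *" groups and literal Disallow prefixes.
function isPathAllowed($robotsTxt, $path) {
    $applies = false;        // does the current group apply to every robot?
    $readingAgents = false;  // are we inside a run of User-agent lines?

    foreach (preg_split('/\r\n|\r|\n/', $robotsTxt) as $line) {
        // strip comments and surrounding whitespace
        $line = trim(preg_replace('/#.*$/', '', $line));
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }
        list($field, $value) = explode(':', $line, 2);
        $field = strtolower(trim($field));
        $value = trim($value);

        if ($field === 'user-agent') {
            if (!$readingAgents) {
                $applies = false; // a new group starts here
            }
            $readingAgents = true;
            if ($value === '*') {
                $applies = true;
            }
            continue;
        }

        $readingAgents = false;
        if ($field === 'disallow' && $applies && $value !== '' && strpos($path, $value) === 0) {
            return false; // the path starts with a disallowed prefix
        }
    }
    return true;
}
```

You would call it with the path part of each URL (for example the result of parse_url($url, PHP_URL_PATH)) before downloading the page.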

Simple solution: how to save a single file

$text = file_get_contents($url);
file_put_contents($save_location, $text);


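file_put_contents() requires PHP 5. On older versions you can create an equivalent function.

if (!function_exists('file_put_contents')) {
    // create a simple user function to emulate PHP 5's file_put_contents()
    function file_put_contents($file, $contents) {
        // open for writing (binary-safe), truncating any existing file
        $fp = fopen($file, 'wb');
        if ($fp === false) {
            return false;
        }
        $bytes = fwrite($fp, $contents);
        fclose($fp);
        // like the built-in: number of bytes written, or false on failure
        return $bytes;
    }
}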

(lol, copy and paste from my last post)

If you are doing this frequently and on a large scale, though, it's good to have a spider. This can be written in PHP (though it's not the best language for it). The difference between a spider and a normal script is that a PHP spider should be written for the CLI version of PHP and run from the command line, not through the web server. This allows it to run as a service, or daemon. The daemon can then download files for an indefinite period without timing out, while being considerate of the resources it uses on the remote server (allowing breaks between downloads).
You would also use cURL or fsockopen(), as they give you more control over the connection (you can keep an HTTP/1.1 connection to the server open while downloading several pages), let the remote host know you are a robot/spider (via the User-Agent header), and let you follow the spidering policies in robots.txt on the server.
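To make the "breaks between downloads" idea concrete, here is a minimal sketch of a loop that waits before contacting the same host again. The names secondsToWait() and crawl(), the 5 second default and the $fetch callback are all assumptions for illustration; $fetch stands for whatever download function you use.

```php
<?php
// Hypothetical helper: how many seconds to sleep before requesting $url,
// given the timestamp at which each host was last contacted.
function secondsToWait($url, array $lastHit, $now, $minInterval = 5) {
    $host = (string) parse_url($url, PHP_URL_HOST);
    if ($host === '' || !isset($lastHit[$host])) {
        return 0; // host not contacted yet, no wait needed
    }
    $elapsed = $now - $lastHit[$host];
    return max(0, $minInterval - $elapsed);
}

// Sketch of the spider loop, meant to be run from the CLI (php spider.php).
// $fetch is any callable that takes a URL and returns the page text.
function crawl(array $urls, $fetch, $minInterval = 5) {
    set_time_limit(0); // never time out
    $lastHit = array();
    $pages = array();
    foreach ($urls as $url) {
        $wait = secondsToWait($url, $lastHit, time(), $minInterval);
        if ($wait > 0) {
            sleep($wait); // be considerate of the remote server
        }
        $pages[$url] = call_user_func($fetch, $url);
        $host = (string) parse_url($url, PHP_URL_HOST);
        $lastHit[$host] = time();
    }
    return $pages;
}
```

A real daemon would pull URLs from the database in batches rather than from an array, but the timing logic stays the same.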

Spiders can also be implemented when PHP is compiled as an Apache module and you are unable to register a service or run a daemon on the server. That would require an interval-based trigger for the script, such as web page hits, email receipts from SMTP, or even an external website pinging your script.

Hope that helps a bit.

digital-ether
Nearly a Posting Virtuoso
Moderator
1,293 posts since Sep 2005
Reputation Points: 461
Solved Threads: 101
 

Thanks so much for the help! Is it possible to save only the data between the <body> tags?

redZERO
 

Use regular expressions.

Look here:
http://ca3.php.net/preg-match-all

FireNet
Posting Whiz in Training
258 posts since May 2004
Reputation Points: 108
Solved Threads: 7
 

As said, a regex will do the job.

Something like:

preg_match("/<body([^>]*)>(.*?)<\/body>/is", $txt, $matches);


The s modifier (PCRE_DOTALL) makes the dot match newlines, which you need because the content between the tags almost always spans more than one line. The m modifier would not help here; it only changes how ^ and $ match. The body content ends up in $matches[2].
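For what it's worth, here is the same idea wrapped in a small function, as a sketch only. The name extractBody() is made up, and the pattern uses the i and s modifiers so that it ignores case and lets the dot match newlines. A regex like this can be fooled by unusual markup, so treat it as a quick approach rather than a real HTML parser.

```php
<?php
// Hypothetical helper: returns the text between <body ...> and </body>,
// or null when the page has no matching pair of tags.
function extractBody($html) {
    // i = case-insensitive, s = dot also matches newlines
    if (preg_match('/<body[^>]*>(.*?)<\/body>/is', $html, $matches)) {
        return $matches[1];
    }
    return null;
}
```

You would then save extractBody($text) instead of $text, after checking that it is not null.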

digital-ether
 

Thanks so much. Another thing, for the original code, would it work like this:

connect to database
specify $save_location (Database info, which table etc)
<?php
//assuming you're connected to db
$getlist=mysql_query("SELECT url FROM url_table");
while($row=mysql_fetch_array($getlist)){
$text = file_get_contents($row[0]);
file_put_contents($save_location, $row[0]);
}
?>

Would this save pages from URLs specified in a DB?

redZERO
 


The file_put_contents($save_location, $row[0]); line is wrong: it writes the URL instead of the downloaded $text, and it writes every page to the same $save_location.

You'll have to specify a directory where your local files will be saved. Then make sure you save each local version with a valid filename.
What I usually do is make a cryptographic hash of the URL and use that as the filename of the local file. The hash can be MD5, SHA-1, etc.

Eg:

<?php

// a directory for saving the downloaded pages
$dir = 'cached_sites/';

// assuming you're connected to db
$getlist = mysql_query("SELECT url FROM url_table");
while ($row = mysql_fetch_array($getlist)) {
    $text = file_get_contents($row[0]);

    // make sure we have something...
    if ($text) {
        // hash the URL (not the page content) to get a stable, valid filename
        $filename = sha1($row[0]);
        file_put_contents($dir.$filename.'.html', $text);
    }
}
?>


You can also create a new column in your url_table, called `file` or similar. Then update the table when you create the local copy.

eg:

// a directory for saving the downloaded pages
$dir = 'cached_sites/';

// assuming you're connected to db
$getlist = mysql_query("SELECT url FROM url_table");
while ($row = mysql_fetch_array($getlist)) {
    $text = file_get_contents($row[0]);

    // make sure we have something...
    if ($text) {
        $filename = sha1($row[0]);

        // if we succeed, update db row
        if (file_put_contents($dir.$filename.'.html', $text)) {
            // escape the URL before putting it back into SQL
            $url = mysql_real_escape_string($row[0]);
            $query = "UPDATE `url_table` SET `file` = '$filename' WHERE `url` = '$url' LIMIT 1";
            mysql_query($query);
        }
    }
}


This allows you to know which URLs have a local copy, and thus make adjustments in your code for failures such as HTTP errors, site downtime, etc., or update only the failed downloads.
Otherwise, you have to do a
file_exists($dir.sha1($url).'.html')
each time you want to check if a local copy of a URL exists (which is slow).
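If you do go the file-check route, it helps to build the path in one place so that saving and checking can never disagree. A small sketch, where cachePath() and hasLocalCopy() are made-up names and the ".html" suffix follows the examples above:

```php
<?php
// Hypothetical helper: maps a URL to the local file that caches it.
function cachePath($dir, $url) {
    return rtrim($dir, '/') . '/' . sha1($url) . '.html';
}

// Hypothetical helper: true if a local copy of $url has already been saved.
function hasLocalCopy($dir, $url) {
    return is_file(cachePath($dir, $url));
}
```

The download loop would then write to cachePath($dir, $row[0]) and skip any URL for which hasLocalCopy() is already true.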

digital-ether
Solved Threads: 101
 

Isn't there a simpler way to do this? All it has to do is save the HTML file to a different table.

redZERO
 

That's actually a simple way.

To save to a table, just save the $text from each URL:

$text = file_get_contents($row[0]);

to a table, or even to a column on the same table.

Other notes:
Use a BLOB field if you will be saving any binary data, or just TEXT if only HTML etc.
I'd make the table column UTF-8, and convert the data from each URL to UTF-8, in order to save multilingual data together. (The php-utf8 lib may come in handy if you will do any parsing like you mentioned earlier, such as saving only the data between the tags.)
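As an illustration of the conversion step, here is a minimal sketch using mb_convert_encoding(), which assumes the mbstring extension is installed. The function name toUtf8() is made up, and working out the page's real charset (from the Content-Type header or a meta tag) is left out; the charset is simply passed in.

```php
<?php
// Hypothetical helper: convert downloaded text to UTF-8 before storing it.
// $sourceCharset is assumed to be known already.
function toUtf8($text, $sourceCharset) {
    if (strcasecmp($sourceCharset, 'UTF-8') === 0) {
        return $text; // already UTF-8, nothing to do
    }
    return mb_convert_encoding($text, 'UTF-8', $sourceCharset);
}
```

For example, toUtf8($text, 'ISO-8859-1') would be applied to a Latin-1 page before inserting it into the UTF-8 column.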

digital-ether
 

I heard about a library called cURL. Wasn't this designed for this sort of thing, therefore maybe easier?

redZERO
 

cURL is more efficient and faster.

cURL is not simpler to work with, though (if simple means in terms of programming).

This is about the simplest you can get:

$text = file_get_contents($row[0]);


With cURL that would be 10 or so lines.
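For comparison, a cURL version might look roughly like the sketch below. It assumes the curl extension is loaded; the function name fetchUrl(), the user agent string and the timeout values are placeholders chosen for illustration, not recommendations.

```php
<?php
// Hypothetical helper: fetch a URL with cURL.
// Returns the response body, or false on a transfer error or HTTP status >= 400.
function fetchUrl($url, $userAgent = 'ExampleSpider/0.1') {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow redirects
    curl_setopt($ch, CURLOPT_MAXREDIRS, 5);          // but not forever
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);    // seconds to wait for the connection
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);           // seconds for the whole transfer
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent); // tell the host who is asking
    $body = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    // the handle is released when $ch goes out of scope
    if ($body === false || $status >= 400) {
        return false;
    }
    return $body;
}
```

In the loops above you would replace file_get_contents($row[0]) with fetchUrl($row[0]). The extra lines buy you timeouts, redirect limits and a user agent, which file_get_contents() does not give you in a single call.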

digital-ether
 
