Hi, I would like to know a resource or simple solution for this:
I have a MySQL database with lots of links. I need said solution to visit each link and download the HTML file for it, like an indexer which is told where to index.


If you have cURL installed with PHP, it is more efficient.

To do this the correct way, you have to create a spider. Downloading web pages can be intensive both for your server and the remote server, so you need to have intervals between making connections if they are to the same host.

You also have to adhere to the remote host's robots.txt policy.
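
A minimal sketch of that polite-crawling idea, assuming all the links point at one host (the example URLs, the 5-second delay and the naive Disallow check are placeholders, not a full robots.txt parser):

<?php
// rough sketch: fetch robots.txt once, then pause between requests to the same host
$urls = array('http://www.example.com/page1', 'http://www.example.com/page2');
$robots = @file_get_contents('http://www.example.com/robots.txt');

foreach ($urls as $url) {
    $path = parse_url($url, PHP_URL_PATH);
    // crude check: skip the URL if its path appears on a Disallow line
    if ($robots && strpos($robots, 'Disallow: ' . $path) !== false) {
        continue;
    }
    $html = file_get_contents($url);
    // ... save $html somewhere ...
    sleep(5); // be considerate: wait between requests to the same host
}
?>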

Simple solution: how to save a single file

$text = file_get_contents($url);
file_put_contents($save_location, $text);

file_put_contents() requires PHP 5, but you can create an equivalent function:

if (!function_exists('file_put_contents')) {
    // create a simple user function to emulate PHP5's file_put_contents()
    function file_put_contents($file, $contents) {
        $fp = fopen($file, 'wb'); // open for writing
        fwrite($fp, $contents);
        fclose($fp);
    }
}

(lol, copy and paste from my last post)

If you are doing this frequently and on a large scale though, it's good to have a spider. This can be written in PHP (though it is not the best language for it). The difference between a spider and a normal script is that a PHP spider should be written for the CLI version of PHP and run from the command line. This allows it to run as a service, or daemon. The daemon can then download files for an indefinite period without timing out, while being considerate of the resources it uses on the remote server (allowing breaks between downloads).
You would also use cURL or fsockopen(), as they give you more control over the connection (you can keep an HTTP/1.1 keep-alive session open to the server while downloading several pages), let the remote host know you are a robot/spider, and follow the spidering policies in the server's robots.txt.
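
For instance, a hedged sketch of the cURL side, reusing one handle so keep-alive connections can be reused and identifying itself as a robot ("MyBot/0.1" is made up, and $urls is assumed to hold the links from your table):

<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return the page as a string
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);   // follow redirects
curl_setopt($ch, CURLOPT_USERAGENT, 'MyBot/0.1 (+http://www.example.com/bot.html)');

foreach ($urls as $url) {
    curl_setopt($ch, CURLOPT_URL, $url);
    $html = curl_exec($ch);
    // ... save $html ...
    sleep(5); // pause between downloads
}
curl_close($ch);
?>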

Spiders can also be implemented with PHP compiled as an Apache module, even when you can't register a service or run a daemon on the server. That requires an interval-based trigger for the script, such as web page hits, email receipts from SMTP, or even an external website pinging your script.

Hope that helps a bit.

Thanks so much for the help! Is it possible to save only the data between the <body> tags?

A regex will do the job.

Something like:

preg_match("/<body([^>]*)>(.*?)<\/body>/i", $txt, $matches);

You will have to check for multiple lines if the <body> tag spans more than one line. That's the s (DOTALL) modifier, which makes . match newlines.
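
For example, against the $txt you downloaded earlier:

// the s modifier makes . match newlines, so a multi-line <body> is captured too
if (preg_match("/<body([^>]*)>(.*?)<\/body>/is", $txt, $matches)) {
    $body = $matches[2]; // everything between the <body> tags
}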

Thanks so much. Another thing, for the original code, would it work like this:

connect to database
specify $save_location (Database info, which table etc)
<?php
// assuming you're connected to db
$getlist = mysql_query("SELECT url FROM url_table");
while ($row = mysql_fetch_array($getlist)) {
    $text = file_get_contents($row[0]);
    file_put_contents($save_location, $row[0]);
}
?>

Would this save pages from URLs specified in a DB?


The file_put_contents($save_location, $row[0]); line is wrong: it writes the URL itself rather than the downloaded $text, and $save_location never changes between iterations.

You'll have to specify a directory where your local files will be saved, then make sure you save each local copy with a valid filename.
What I usually do is make a cryptographic hash of the URL and use that as the filename of the local file. The hashes can be MD5s or SHA-1s, etc.

Eg:

<?php

// a directory for saving the URLs
$dir = 'cached_sites/';

// assuming you're connected to db
$getlist = mysql_query("SELECT url FROM url_table");
while ($row = mysql_fetch_array($getlist)) {
    $text = file_get_contents($row[0]);

    // make sure we have something...
    if ($text) {
        // hash the URL and use it as the local filename
        $filename = sha1($row[0]);
        file_put_contents($dir . $filename . '.html', $text);
    }
}
?>

You can also create a new column in your url_table, called `file` or similar. Then update the table when you create the local copy.

eg:

// a directory for saving the URLs
$dir = 'cached_sites/';

// assuming you're connected to db
$getlist = mysql_query("SELECT url FROM url_table");
while ($row = mysql_fetch_array($getlist)) {
    $text = file_get_contents($row[0]);

    // make sure we have something...
    if ($text) {
        $filename = sha1($row[0]);

        // if we succeed, update db row
        if (file_put_contents($dir . $filename . '.html', $text)) {
            $query = "UPDATE `url_table` SET `file` = '$filename' WHERE `url` = '{$row[0]}' LIMIT 1";
            mysql_query($query);
        }
    }
}

This allows you to know which URLs have a local copy, and thus adjust your code for failures such as HTTP errors, site downtime, etc., or re-download only the ones that failed.
Otherwise, you have to do a
file_exists($dir . sha1($url) . '.html')
each time you want to check whether a local copy of a URL exists (which is slow).
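
For example, with the `file` column in place you can select only the URLs that still need downloading:

// fetch only the URLs that don't have a local copy yet
$getlist = mysql_query("SELECT url FROM url_table WHERE file IS NULL OR file = ''");
while ($row = mysql_fetch_array($getlist)) {
    // download and save as in the loop above
}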

Isn't there a simpler way to do this? All it has to do is save the HTML file to a different table.

That's actually a simple way.

To save to a table, just take the $text fetched from each URL:

$text = file_get_contents($row[0]);

and insert it into a table, or even into a column on the same table.

Other notes:
Use a BLOB column if you will be saving any binary data, or a TEXT column if it's only HTML.
I'd make the table column UTF-8 and convert the data fetched from each URL, so that multilingual pages can be stored together. (The php-utf8 lib may come in handy if you do any parsing like you mentioned earlier, such as saving only the <body>.)
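
A rough sketch of saving straight into the table, assuming url_table has a TEXT column called `html` (the column name and the encoding detection are assumptions):

// assuming you're connected to db and url_table has a `html` TEXT/BLOB column
$getlist = mysql_query("SELECT url FROM url_table");
while ($row = mysql_fetch_array($getlist)) {
    $text = file_get_contents($row[0]);
    if ($text) {
        // convert to UTF-8 before storing (encoding detection here is naive)
        $from = mb_detect_encoding($text, 'UTF-8, ISO-8859-1', true);
        if ($from && $from != 'UTF-8') {
            $text = mb_convert_encoding($text, 'UTF-8', $from);
        }
        $text = mysql_real_escape_string($text);
        mysql_query("UPDATE `url_table` SET `html` = '$text' WHERE `url` = '{$row[0]}' LIMIT 1");
    }
}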

I heard about a library called cURL. Wasn't it designed for this sort of thing, and therefore maybe easier?

cURL is more efficient and faster.

cURL is not simpler to work with, though (if simple means simpler in terms of programming).

This is about the simplest you can get:

$text = file_get_contents($row[0]);

with cURL that would be 10 or so lines.
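
For reference, a sketch of those 10 or so lines:

// cURL equivalent of $text = file_get_contents($row[0]);
$ch = curl_init($row[0]);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the page instead of printing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
curl_setopt($ch, CURLOPT_TIMEOUT, 30);          // give up after 30 seconds
$text = curl_exec($ch);
if ($text === false) {
    $text = ''; // or log curl_error($ch)
}
curl_close($ch);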
