Dec 20th, 2007

Making a simple indexer

Hi, I would like to know of a resource or a simple solution for this:
I have a MySQL database with lots of links. I need something that visits each link and downloads the HTML file for it, like an indexer that is told what to index.
redZERO

Dec 25th, 2007

Re: Making a simple indexer

If you have cURL installed with PHP, using it is more efficient.

To do this the correct way, you have to create a spider. Downloading web pages can be intensive both for your server and the remote server, so you need to have intervals between making connections if they are to the same host.
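The per-host interval described above could be sketched roughly as follows. This is only an illustration: the HostThrottle class and the fetch_all_politely() function are hypothetical names, and the delay value is whatever you decide is polite for the hosts involved.

```php
<?php
// Hypothetical helper: remembers when each host was last contacted and
// reports how long to wait before contacting that host again.
class HostThrottle
{
    private $minDelay;           // minimum seconds between hits on one host
    private $lastHit = array();  // host => timestamp of the last request

    public function __construct($minDelaySeconds)
    {
        $this->minDelay = (float) $minDelaySeconds;
    }

    // Seconds still to wait before the host of $url may be contacted at time $now.
    public function secondsToWait($url, $now)
    {
        $host = strtolower((string) parse_url($url, PHP_URL_HOST));
        if (!isset($this->lastHit[$host])) {
            return 0.0;
        }
        return max(0.0, $this->lastHit[$host] + $this->minDelay - $now);
    }

    // Record that the host of $url was contacted at time $now.
    public function markHit($url, $now)
    {
        $host = strtolower((string) parse_url($url, PHP_URL_HOST));
        $this->lastHit[$host] = (float) $now;
    }
}

// Downloads each URL, sleeping first whenever the same host was hit too recently.
function fetch_all_politely(array $urls, $minDelaySeconds)
{
    $throttle = new HostThrottle($minDelaySeconds);
    $pages = array();
    foreach ($urls as $url) {
        $wait = $throttle->secondsToWait($url, microtime(true));
        if ($wait > 0) {
            usleep((int) ceil($wait * 1000000));
        }
        $pages[$url] = file_get_contents($url); // false on failure
        $throttle->markHit($url, microtime(true));
    }
    return $pages;
}
```

URLs on different hosts are fetched back to back; only repeated hits on the same host are spaced out.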

You also have to adhere to the remote host's robots.txt policy.
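As a rough sketch of what a robots.txt check could look like: the two functions below are hypothetical helpers, and they are deliberately simplified. They only understand User-agent and prefix-style Disallow lines, ignore Allow lines and wildcards, and apply the rules of every group that matches (the `*` group as well as a group naming your agent), which is stricter than a full parser would be.

```php
<?php
// Collects the Disallow prefixes that apply to $agent.
// Assumption: prefix-only Disallow rules, no Allow lines, no wildcards.
function robots_disallowed_prefixes($robotsTxt, $agent)
{
    $prefixes = array();
    $applies = false;       // does the current group apply to $agent?
    $inAgentLines = false;  // still reading the User-agent lines of a group?

    foreach (preg_split('/\r\n|\r|\n/', $robotsTxt) as $line) {
        $line = trim(preg_replace('/#.*$/', '', $line)); // strip comments
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }
        list($field, $value) = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);

        if ($field === 'user-agent') {
            if (!$inAgentLines) {
                $applies = false; // a new group starts here
            }
            $inAgentLines = true;
            if ($value === '*' || ($value !== '' && stripos($agent, $value) !== false)) {
                $applies = true;
            }
        } else {
            $inAgentLines = false;
            if ($field === 'disallow' && $applies && $value !== '') {
                $prefixes[] = $value;
            }
        }
    }
    return $prefixes;
}

// True if $path does not start with any Disallow prefix that applies to $agent.
function robots_allows($robotsTxt, $agent, $path)
{
    foreach (robots_disallowed_prefixes($robotsTxt, $agent) as $prefix) {
        if (strpos($path, $prefix) === 0) {
            return false;
        }
    }
    return true;
}
```

You would download http://host/robots.txt once per host, keep the text, and call robots_allows() with the path of each URL before fetching it.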

Simple solution: how to save a single file

    // download the page; file_get_contents() returns false on failure
    $text = file_get_contents($url);
    if ($text !== false) {
        file_put_contents($save_location, $text);
    }

file_put_contents() requires PHP 5. On PHP 4 you can create an equivalent function:

    if (!function_exists('file_put_contents')) {
        // create a simple user function to emulate PHP5's file_put_contents()
        function file_put_contents($file, $contents) {
            $fp = fopen($file, 'w');
            if ($fp === false) {
                return false;
            }
            $bytes = fwrite($fp, $contents);
            fclose($fp);
            return $bytes;
        }
    }
(lol, copy and paste from my last post)

If you are doing this frequently on a large scale, though, it's good to have a spider. This can be written in PHP (though it is not the best language for it). The difference between a spider and a normal script is that a PHP spider should be run with the CLI version of PHP from the command line rather than through the web server. This allows it to run as a service, or daemon. The daemon can then download files for an indefinite period without timing out, while being considerate of the resources it uses on the remote server (allowing breaks between downloads).
You would also use cURL or fsockopen(), as they give you more control over the connection: you can keep an HTTP/1.1 connection to the server open while downloading several pages, identify yourself to the remote host as a robot/spider (through the User-Agent header), and follow the spidering policies in the server's robots.txt.
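A minimal cURL sketch of that idea, assuming the cURL extension is loaded. The function name fetch_pages_with_curl() is hypothetical, and the User-Agent string you pass in is up to you (it is good practice to include a contact URL). Reusing one handle for all requests lets cURL keep the connection to a host open between downloads.

```php
<?php
// Fetches several pages over one cURL handle so connections can be reused.
// Returns url => body, with null for pages that failed or did not return 200.
function fetch_pages_with_curl(array $urls, $userAgent)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow redirects...
    curl_setopt($ch, CURLOPT_MAXREDIRS, 5);          // ...but not forever
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent); // tell the host who is crawling

    $pages = array();
    foreach ($urls as $url) {
        curl_setopt($ch, CURLOPT_URL, $url);
        $body = curl_exec($ch);
        $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        $pages[$url] = ($body !== false && $status === 200) ? $body : null;
    }
    // the handle is released when $ch goes out of scope
    return $pages;
}
```

For example, fetch_pages_with_curl($urls, 'MySpider/1.0 (+http://example.com/spider.html)') would send that string with every request; both the spider name and the URL are placeholders.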

Spiders can also be implemented when PHP is compiled as an Apache module and you cannot register a service or run a daemon on the server. That requires an interval-based trigger for the script, such as web page hits, incoming email from SMTP, or even an external website that pings your script.
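One way such an interval-based trigger might look, as a sketch only: the claim_run_slot() function and the stamp file are hypothetical. Every page hit calls the function, but it returns true at most once per interval, and the file lock keeps two simultaneous hits from both starting a run.

```php
<?php
// Returns true (and records $now) if at least $interval seconds have passed
// since the last run recorded in $stampFile; otherwise returns false.
function claim_run_slot($stampFile, $interval, $now)
{
    $fp = fopen($stampFile, 'c+'); // create if missing, do not truncate
    if ($fp === false) {
        return false;
    }
    $claimed = false;
    if (flock($fp, LOCK_EX | LOCK_NB)) { // skip if another request holds the lock
        $last = (int) stream_get_contents($fp);
        if ($now - $last >= $interval) {
            ftruncate($fp, 0);
            rewind($fp);
            fwrite($fp, (string) $now);
            fflush($fp);
            $claimed = true;
        }
        flock($fp, LOCK_UN);
    }
    fclose($fp);
    return $claimed;
}
```

A page that gets regular traffic could then call claim_run_slot('/path/to/spider.stamp', 300, time()) and download a small batch of URLs whenever it returns true. The path and the 300-second interval are placeholders.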

Hope that helps a bit.
digital-ether

Dec 25th, 2007

Re: Making a simple indexer

Thanks so much for the help! Is it possible to save only the data between the <body> tags?
redZERO

Dec 25th, 2007

Re: Making a simple indexer

Use regular expressions.

Look here:
http://ca3.php.net/preg-match-all
FireNet

Dec 25th, 2007

Re: Making a simple indexer

As said, a regex will do the job.

Something like:
    preg_match("/<body([^>]*)>(.*?)<\/body>/is", $txt, $matches);

The content between the <body> tags almost always spans more than one line, so the pattern needs the s modifier (PCRE_DOTALL), which makes . match newlines as well. The m modifier only changes how ^ and $ behave. The body content ends up in $matches[2].
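Wrapped in a small function, the same idea could look like this. extract_body() is a hypothetical name, and the sketch assumes reasonably well-formed HTML with a single <body> element; for messy markup a real parser such as PHP's DOM extension is usually more reliable than a regex.

```php
<?php
// Returns the markup between <body ...> and </body>, or null if there is none.
function extract_body($html)
{
    // i = case-insensitive, s = "." also matches newlines
    if (preg_match('/<body\b[^>]*>(.*?)<\/body>/is', $html, $matches)) {
        return trim($matches[1]);
    }
    return null;
}
```

You would call it on the downloaded page, for example $body = extract_body($text);, and save $body instead of $text when it is not null.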
digital-ether

Dec 26th, 2007

Re: Making a simple indexer

Thanks so much. Another thing, for the original code, would it work like this:

connect to database
specify $save_location (Database info, which table etc)
<?php
//assuming you're connected to db
$getlist=mysql_query("SELECT url FROM url_table");
while($row=mysql_fetch_array($getlist)){
$text = file_get_contents($row[0]);
file_put_contents($save_location, $row[0]);
}
?>

Would this save pages from URLs specified in a DB?
redZERO

Dec 26th, 2007

Re: Making a simple indexer

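The file_put_contents($save_location, $row[0]); line is wrong: it writes the URL rather than the downloaded $text, and it saves every page to the same $save_location.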

You'll have to specify a directory where your local files will be saved. Then make sure you save the local version with a valid filename.
What I usually do is make a cryptographic hash of the URL, and use that as the filename of the local file. The hashes can be MD5's or SHA1's etc.

Eg:

    <?php

    // a directory for saving the downloaded pages
    $dir = 'cached_sites/';

    // assuming you're connected to the db
    // (the mysql_* functions were removed in PHP 7; use mysqli or PDO there)
    $getlist = mysql_query("SELECT url FROM url_table");
    while ($row = mysql_fetch_array($getlist)) {
        $text = file_get_contents($row[0]);

        // make sure we have something...
        if ($text) {
            // hash the URL (not the page content) to get a stable, valid filename
            $filename = sha1($row[0]);
            file_put_contents($dir . $filename . '.html', $text);
        }
    }
    ?>

You can also create a new column in your url_table, called `file` or similar. Then update the table when you create the local copy.

eg:

    // a directory for saving the downloaded pages
    $dir = 'cached_sites/';

    // assuming you're connected to the db
    $getlist = mysql_query("SELECT url FROM url_table");
    while ($row = mysql_fetch_array($getlist)) {
        $text = file_get_contents($row[0]);

        // make sure we have something...
        if ($text) {
            $filename = sha1($row[0]);

            // if we succeed, update the db row
            if (file_put_contents($dir . $filename . '.html', $text)) {
                // escape the URL before putting it into the query
                $url = mysql_real_escape_string($row[0]);
                $query = "UPDATE `url_table` SET `file` = '$filename' WHERE `url` = '$url' LIMIT 1";
                mysql_query($query);
            }
        }
    }
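On current PHP versions the same loop could be sketched with PDO and a prepared statement, which takes care of escaping the URL. This is an assumption-laden illustration, not a drop-in replacement: cache_pages() is a hypothetical name, $db is assumed to be an open PDO connection in exception error mode, and $fetch is any callable that returns the page body or false (for example 'file_get_contents').

```php
<?php
// Downloads every URL in url_table, saves it under $dir as sha1(url).html,
// and stores the hash in the `file` column. Returns the number of pages saved.
function cache_pages(PDO $db, $dir, $fetch)
{
    $update = $db->prepare('UPDATE url_table SET file = :file WHERE url = :url');
    $saved = 0;

    // read all URLs first so the SELECT is finished before we start updating
    $urls = $db->query('SELECT url FROM url_table')->fetchAll(PDO::FETCH_COLUMN);
    foreach ($urls as $url) {
        $text = call_user_func($fetch, $url);
        if ($text === false || $text === '') {
            continue; // download failed; `file` stays empty so it can be retried
        }
        $hash = sha1($url);
        $path = rtrim($dir, '/') . '/' . $hash . '.html';
        if (file_put_contents($path, $text) !== false) {
            $update->execute(array(':file' => $hash, ':url' => $url));
            $saved++;
        }
    }
    return $saved;
}
```

A call might look like cache_pages($db, 'cached_sites', 'file_get_contents');. Passing the fetch function in also makes it easy to swap in a cURL-based downloader later.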

This lets you know which URLs have a local copy, so your code can handle failures such as HTTP errors or site downtime, or retry only the failed downloads.
Otherwise, you have to do a
file_exists($dir . sha1($url) . '.html')
each time you want to check whether a local copy of a URL exists (which is slow).
digital-ether

Dec 28th, 2007

Re: Making a simple indexer

Isn't there a simpler way to do this? All it has to do is save the HTML file to a different table.
redZERO

Dec 28th, 2007

Re: Making a simple indexer

That's actually a simple way.

To save to a table, just store the $text from each URL:
    $text = file_get_contents($row[0]);
in a table of its own, or even in a column on the same table.

Other notes:
Use a BLOB column if you will be saving any binary data, or a TEXT column if you are only saving HTML and the like (in MySQL, TEXT holds up to 64 KB; MEDIUMTEXT holds more).
I'd make the table column UTF-8 and convert the data from each URL to UTF-8, so that multilingual data can be stored together. (The php-utf8 lib may come in handy if you do any parsing like you mentioned earlier, such as saving only the <body>.)
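The conversion step could be sketched like this, assuming the mbstring extension is available and that you have the page's Content-Type header (for example from $http_response_header after file_get_contents(), or from curl_getinfo() with CURLINFO_CONTENT_TYPE). Both function names are hypothetical, and the ISO-8859-1 fallback is only a guess for pages that declare no charset.

```php
<?php
// Pulls the charset out of a Content-Type value such as
// "text/html; charset=ISO-8859-1"; returns null if none is declared.
function charset_from_content_type($contentType)
{
    if (preg_match('/charset\s*=\s*"?([A-Za-z0-9._:-]+)/i', $contentType, $m)) {
        return strtoupper($m[1]);
    }
    return null;
}

// Converts $html to UTF-8 based on the declared charset.
// $fallback is used when the header declares no charset.
function html_to_utf8($html, $contentType, $fallback = 'ISO-8859-1')
{
    $charset = charset_from_content_type($contentType);
    if ($charset === null) {
        $charset = $fallback;
    }
    if ($charset === 'UTF-8') {
        return $html; // already in the target encoding
    }
    return mb_convert_encoding($html, 'UTF-8', $charset);
}
```

Note that mb_convert_encoding() rejects charset names it does not know, so a real spider would want to handle that case, and pages that only declare their charset in a <meta> tag are not covered here.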
digital-ether

Dec 29th, 2007

Re: Making a simple indexer

I heard about a library called cURL. Wasn't this designed for this sort of thing, therefore maybe easier?
redZERO