PHP HTTP Screen-Scraping Class with Caching

Question

Troy 35 Posting Whiz

19 Years Ago

http://www.tgreer.com/class_http_php.html

I've written what I think is a high-quality PHP class for screen-scraping external (or internal) web content. The class includes features to cache scraped content for any number of seconds. So for example, if you want to show stock market data on your site that you scrape from a third-party, you can easily set it so your site hits the source site for fresh content no more than once every 5 minutes. This caching is seamless to your script--you don't have to worry about it.
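To make the caching idea concrete, here is a minimal sketch of time-to-live caching. It is not the code inside the class; the function name cached_fetch, the md5-based file naming, and the $fetcher callback are all assumptions made for illustration.

```php
<?php
// Hypothetical helper, not part of class_http: return cached content if it is
// younger than $ttl seconds, otherwise call $fetcher and refresh the cache file.
function cached_fetch(string $url, int $ttl, string $dir, callable $fetcher): string
{
    // One cache file per URL (naming scheme is an assumption).
    $file = rtrim($dir, '/\\') . DIRECTORY_SEPARATOR . md5($url) . '.cache';

    // Make sure we read the file's current modification time.
    clearstatcache(true, $file);

    // Serve from cache while the file is still fresh.
    if ($ttl > 0 && is_file($file) && (time() - filemtime($file)) < $ttl) {
        return file_get_contents($file);
    }

    // Cache miss or stale entry: hit the source and store the result.
    $content = $fetcher($url);
    if ($ttl > 0) {
        file_put_contents($file, $content, LOCK_EX);
    }
    return $content;
}
```

With a ttl of 300, a second request inside five minutes is answered from the cache file and the source site is not contacted at all.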

The class includes a companion script named image_cache.php that can be used as the src attribute in img elements to cache images from external sites. This is useful if you want to incorporate an image from an external site that is updated dynamically on a regular basis. For example, many websites regenerate their stock charts every 60 seconds. This allows you to use their image on your site, but with the caching feature, you don't have to hit their site every time somebody hits yours.
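As a rough sketch of how the img src might be built, assuming image_cache.php reads url and ttl from the query string (the helper name image_cache_src and the example chart URL are made up for illustration):

```php
<?php
// Build the src value for an <img> element that goes through image_cache.php.
// The 'url' and 'ttl' query parameter names are an assumption about the script.
function image_cache_src(string $imageUrl, int $ttl, string $script = 'image_cache.php'): string
{
    // http_build_query() URL-encodes the remote address for us.
    return $script . '?' . http_build_query(['url' => $imageUrl, 'ttl' => $ttl]);
}

// Cache a remote chart for 60 seconds.
$src = image_cache_src('http://example.com/chart.png', 60);
echo '<img src="' . htmlspecialchars($src) . '" alt="Stock chart">', "\n";
```

Escaping the src with htmlspecialchars() keeps the ampersand between the two parameters valid inside the HTML attribute.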

The class also has 2 static methods that make it simple to extract data from HTML tables. One extracts a table into an array and the other into XML.
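For readers who want to see the general idea of pulling a table into an array, here is a small sketch using PHP's DOM extension. It is a stand-in for the concept, not the class's own table_into_array(); the function name table_to_rows is hypothetical, and nested tables are not handled separately.

```php
<?php
// Hypothetical stand-in, not the class's table_into_array():
// find the first table whose text contains $needle and return its cells as rows.
function table_to_rows(string $html, string $needle): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('table') as $table) {
        if (strpos($table->textContent, $needle) === false) {
            continue; // Not the table we are looking for.
        }
        $rows = [];
        foreach ($table->getElementsByTagName('tr') as $tr) {
            $cells = [];
            foreach ($tr->childNodes as $cell) {
                // Keep only direct td/th children of the row.
                if ($cell instanceof DOMElement && in_array($cell->tagName, ['td', 'th'], true)) {
                    $cells[] = trim($cell->textContent);
                }
            }
            $rows[] = $cells;
        }
        return $rows;
    }
    return []; // No matching table found.
}
```

The needle plays the same role as a search string: it picks out the one table on the page that contains a known piece of text.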

The class can perform basic authentication allowing you to scrape protected content. It also cloaks itself as the User Agent of the user requesting your script. This allows you to access content that may normally be blocked to non-browser agents.
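A minimal sketch of what those two request headers look like on the wire. The helper name build_request_headers and the fallback agent string are assumptions for illustration; the class itself may assemble its headers differently.

```php
<?php
// Hypothetical helper: build request headers that carry basic-auth
// credentials and pass along the visitor's own User-Agent string.
function build_request_headers(string $user, string $pass, array $server): array
{
    $headers = [];
    // Basic authentication is "user:pass", base64-encoded.
    $headers[] = 'Authorization: Basic ' . base64_encode($user . ':' . $pass);
    // Fall back to a generic agent when the visitor sent none (assumed default).
    $agent = $server['HTTP_USER_AGENT'] ?? 'Mozilla/5.0 (compatible)';
    // Strip CR/LF so a crafted agent string cannot inject extra headers.
    $headers[] = 'User-Agent: ' . str_replace(["\r", "\n"], '', $agent);
    return $headers;
}

// In a web request you would pass $_SERVER as the third argument.
print_r(build_request_headers('user', 'secret', ['HTTP_USER_AGENT' => 'ExampleBrowser/1.0']));
```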

The article explains in detail how to use the class, and is itself a good tutorial for many techniques in PHP.

If you have any comments or questions about this class or the article, let's discuss them here. I'm always wanting to learn more. I hope you find the class and the article useful.

php


world_weapon 3 Junior Poster in Training

17 Years Ago

Hello, I am actually quite interested in this code, but I do not have that much experience with PHP. If I wanted to access an HTTPS page that receives POST vars from a form, could I use $_POST= to set up the vars? I was wondering if you could put an example for such a scenario. What I actually have to work with is an HTTPS site that accepts a login and password via a POST form, and once the session is established I want to pass POST vars to a page that uses them to create a table with the info I want.
I can extract the info from an array, but I am a little shaky on fetching the pages, and I am unclear as to how to do that with the script. I read about sendToHost() and fsockopen(), but I don't know exactly how to implement that with this code. Any insight would be helpful.

world_weapon 3 Junior Poster in Training

17 Years Ago

Thanks a lot, I will check it out. This is something that I absolutely must learn to do. Glad that you have this site out there helping folks like me get around to doing crafty stuff like this.

world_weapon 3 Junior Poster in Training

17 Years Ago

Hey Troy, if you check up on this: I come across a 302 status on one of the pages I try to scrape. I use the URLs the way they show up in the browser. In the headers there is also a LoginRedir={stuffgoeshere}, and I haven't been able to find much about LoginRedir on Google. How does your class handle a 302 status? What I mean is, does it follow the redirect? I assume it doesn't, because the body content is not what I expect; it is something along the lines of "object moved". Maybe I have something set improperly? The LoginRedir=XXXX is in the Set-Cookie: header, so the problem may also be in the way my script handles the cookie after receiving it. It could be the data stored in the cookie, or it could be the class not following the 302. I will try to play with this some more. Let me know what you think.

world_weapon 3 Junior Poster in Training

17 Years Ago

Just to let you know, I successfully did it with a cURL implementation, but I would still like to figure out how your class handles a 302 status. cURL has a boolean setting to follow redirects. Well, I will play with it some more. Would love to have a non-cURL implementation.
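For anyone after the non-cURL route, one possible shape for redirect handling is sketched below. Nothing in this thread confirms whether the class follows redirects, so this is only an illustration: the function name, the response array layout, and the $request callback are all hypothetical, and relative Location values are not resolved.

```php
<?php
// Hypothetical sketch: follow 3xx responses by re-requesting the Location URL.
// $request is any callable returning
//   ['status' => int, 'headers' => array (lowercase names), 'body' => string].
function fetch_following_redirects(string $url, callable $request, int $maxHops = 5): array
{
    for ($hop = 0; $hop <= $maxHops; $hop++) {
        $response = $request($url);
        $isRedirect = in_array($response['status'], [301, 302, 303, 307, 308], true);

        // Anything that is not a redirect with a target is the final answer.
        if (!$isRedirect || empty($response['headers']['location'])) {
            return $response + ['url' => $url];
        }

        // Otherwise request the new location on the next pass.
        $url = $response['headers']['location'];
    }
    // Guard against redirect loops.
    throw new RuntimeException('Too many redirects');
}
```

The hop limit matters: a login page that redirects back to itself when a cookie is missing would otherwise loop forever.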


aasahil 0 Newbie Poster · Answer 1 · 2005-07-23T14:27:36+00:00

Hi, I am Ravi.
I downloaded your screen-scraping code,
but I am having a problem storing the scraped file in a folder. Please guide me.
Also, please tell me how I can extract content from the scraped file.

Thanks,
Ravi

Troy 35 Posting Whiz · Answer 2 · 2005-07-23T19:46:32+00:00

The class has an option to save the content you scrape onto your local webserver's hard drive. This is called "caching". You tell the class where to store the cache files by one of 2 methods:

1. Modify the class file directly to change the dir property within the class constructor.

/*
Set the 'dir' property to the directory where you want to store the cached
content. End this value with a "/".
*/
$this->dir = "/var/www/cache/";

2. You can also modify the dir property per each instance of the class like so:

$h = new http();
$h->dir = "c:\\inetpub\\cache\\"; // backslashes must be escaped inside double quotes

As far as what you do with the content once you've scraped it, well, that's up to you. It's just a big string of page source, so do whatever parsing and extracting you want. The class does include a method to make it relatively simple to extract data out of an HTML table structure. It's all in the documentation at http://www.troywolf.com/articles.

aasahil 0 Newbie Poster · Answer 3 · 2005-07-23T20:51:24+00:00

Thanks, sir.
Your reply really encourages me, but unfortunately the scraped file still does not show up in my folder, which is "img".
img is a folder I created in the location where I have placed all the pages.
My location is "mock/scraping/img".
Please tell me how I can modify the above path.
Sorry to disturb you.
I am new to PHP and do not know enough about it yet, so please help me.

Thanks for encouraging me.
Your friend,
Ravi

Troy 35 Posting Whiz · Answer 4 · 2005-07-23T21:03:59+00:00

Does everything work fine if you do not use caching? (Set ttl = 0.) If so, your only problem is almost certainly with the dir property.

This will tell http_class to cache to a directory that is relative to the current path.

$h->dir = "mock/scraping/img/";

For this to work, the directory img must have write privs for the web service user. I'm going to guess you are using Linux based on the forward slashes in your path example. So please check out the chmod command. If you do not have access to the shell to run the chmod command, you may be able to set write privs for the folder using your FTP client. You can always contact your hosting company (if applicable) to have them make your img folder globally writable.
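A quick way to check the directory from PHP itself before blaming the class is sketched here. The helper name cache_dir_problem is made up for this example; is_dir() and is_writable() are standard PHP functions, and the check runs as whatever user executes the script.

```php
<?php
// Hypothetical pre-flight check for a cache directory.
// Returns a description of the problem, or null when the directory looks usable.
function cache_dir_problem(string $dir): ?string
{
    if (!is_dir($dir)) {
        return "Not a directory: $dir";
    }
    if (!is_writable($dir)) {
        return "Not writable by the current (web server) user: $dir";
    }
    return null;
}

// Relative paths are resolved against the script's current working directory.
$problem = cache_dir_problem('mock/scraping/img/');
echo $problem ?? 'Cache directory is OK', "\n";
```

Run it through the web server rather than from your own shell login, since the two usually run as different users.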

Troy 35 Posting Whiz · Answer 5 · 2007-10-02T23:50:31+00:00

I was wondering if you could put an example for such a scenario.

The code is designed for PHP programmers to integrate into their apps. So if you are not very familiar with PHP programming, it will probably be difficult to understand my class. However, over the years, a lot of newbies have been successful with it, so I guess take heart! :)

Yes, the class can be used to access HTTPS and password protected content. However, there are some hoops you have to jump through. I don't have time to teach this step by step, but I can offer a complete code example.

Go to my articles page at:
http://www.troywolf.com/articles/

Check out my class_sbdns.php (Server Beach DNS Tool API). It uses class_http to automate logging into an HTTPS site. I will give you this hint to help you know what to look for. Basically, you'll use the postvars to send your username and password. The remote server will respond with the next page and the HTTP headers will contain a session id cookie. Every subsequent hit you make to the server, you must pass that cookie back to the server. You do this in the headers. Again, class_sbdns does all this, so look in there for examples. It will be tedious for you I know, but not as tedious as me taking the time to walk you through step by step. (tedious for me that is ;))
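The cookie round trip described above can be sketched roughly as follows. This is not taken from class_sbdns.php; the function names and the sample header lines are hypothetical, and only the name=value part of each cookie is kept.

```php
<?php
// Hypothetical helpers: collect name=value pairs from Set-Cookie response
// header lines, then build the Cookie request header for the next hit.
function cookies_from_headers(array $headerLines): array
{
    $cookies = [];
    foreach ($headerLines as $line) {
        // Header names are case-insensitive.
        if (stripos($line, 'Set-Cookie:') !== 0) {
            continue;
        }
        // Keep only "name=value"; attributes such as path follow the first ";".
        $pair = trim(explode(';', substr($line, strlen('Set-Cookie:')), 2)[0]);
        if (strpos($pair, '=') === false) {
            continue;
        }
        [$name, $value] = explode('=', $pair, 2);
        $cookies[trim($name)] = $value;
    }
    return $cookies;
}

function cookie_header(array $cookies): string
{
    $pairs = [];
    foreach ($cookies as $name => $value) {
        $pairs[] = $name . '=' . $value;
    }
    return 'Cookie: ' . implode('; ', $pairs);
}
```

After the login POST, pass the response header lines to cookies_from_headers() and send the result of cookie_header() with every later request to the same site.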

Enjoy!

dcasso 0 Newbie Poster · Answer 6 · 2007-10-28T23:28:56+00:00

It's a really nice class, but I have one problem with it, that I can't seem to solve (but I don't think it's caused by the class that you made, but the classes it calls).

When I visit some pages I get a:
getFromUrl() called
Could not open connection. Error 61: Connection refused

Is there any way to get around this problem? When I visit the site through a web browser, I can access it without any problems.

Thanks in advance
Dennis

savoo4u 0 Newbie Poster · Answer 7 · 2008-01-04T16:52:32+00:00

Everything is working fine except that I'm not able to save the content to the cache or to any file location. How can I do it?
There is a warning like this:
"Warning: Missing argument 1 for http::getFromUrl(), called in C:\wamp\www\fetch\class\class_http.php on line 78 and defined in C:\wamp\www\fetch\class\class_http.php on line 127"

chocoholic 0 Newbie Poster · Answer 8 · 2008-01-05T04:36:33+00:00

when i used your examples, it all worked well, but when i tried using a different url, i got the error:

Warning: Missing argument 1 for http::getFromUrl(), called in /home/web/class_http.php on line 90 and defined in /home/web/class_http.php on line 139

also, before, when i was using your example, it didn't appear to be caching. when i looked in the current directory or when i specified a different directory, i did not see any cached data.

thanks

savoo4u 0 Newbie Poster · Answer 9 · 2008-01-08T12:37:41+00:00

by using

echo "<pre>";
print_r($msft_stats);
echo "</pre>";
or
/*
Static method table_into_xml()
to parse the elements from allgame.com
*/
function table_into_xml($rawHTML,$needle="",$needle_within=0,$allowedTags="") {
    if (!$aryTable = http::table_into_array($rawHTML,$needle,$needle_within,$allowedTags)) { return false; }
    // "?>" needs no backslashes inside a PHP string; escaping it would emit literal backslashes.
    $xml = "<?xml version=\"1.0\" standalone=\"yes\" ?>\n";
    $xml .= "<TABLE>\n";
    $rowIdx = 0;
    foreach ($aryTable as $row) {
        $xml .= "\t<ROW id=\"".$rowIdx."\">\n";
        $colIdx = 0;
        foreach ($row as $col) {
            $xml .= "\t\t<COL id=\"".$colIdx."\">".trim(utf8_encode(htmlspecialchars($col)))."</COL>\n";
            $colIdx++;
        }
        $xml .= "\t</ROW>\n";
        $rowIdx++;
    }
    $xml .= "</TABLE>";
    return $xml;
}
} // closes the http class

In which location is the XML file created, or where can we see the array?
Thank you

asadalim1 0 Light Poster · Answer 10 · 2008-12-13T03:54:58+00:00

I'm trying to run the example but encounter this problem.

Warning: Missing argument 1 for http::getFromUrl(), called in C:\wamp\www\scrape\troy\class_http.php on line 88 and defined in C:\wamp\www\scrape\troy\class_http.php on line 137

any ideas?

cheers

asadalim1 0 Light Poster · Answer 11 · 2008-12-15T11:27:12+00:00

Has this class worked successfully for anybody?

WebSnail 0 Newbie Poster · Answer 12 · 2009-01-08T22:58:06+00:00

Has this class worked successfully for anybody?

I've just loaded it up on a *nix test site and it works pretty well.

It's literally just a class to grab the content; everything else is down to the coder to parse using regex, etc. (Oh the joy! *sob*).

I did notice a couple of things when first running it, though.

1. The example script has two sites in it that you need to disable (comment out) before you run it (see the end of the code, from line 130 onward).

2. The whole table_into_array() thing uses an old title. Change this:

$msft_stats = http::table_into_array($h->body, "Avg Daily Volume", 1, null);

to this..

$msft_stats = http::table_into_array($h->body, "Avg. Daily Vol.", 1, null);

.. and you're laughing.

Anyhoo... this is useful for a project I've started looking at so here's hoping it stays the course.

WebSnail 0 Newbie Poster · Answer 13 · 2009-01-08T23:23:49+00:00

Realised there was a syntax error in the image_cache.php as well..
Find:

$h->fetch($_GET['url'], $_GET['ttl'];);

Replace with:

$h->fetch($_GET['url'], $_GET['ttl']);

(or just delete the extra semicolon) :)