PHP HTTP Screen-Scraping Class with Caching
http://www.tgreer.com/class_http_php.html
I've written what I think is a high-quality PHP class for screen-scraping external (or internal) web content. The class can cache scraped content for any number of seconds. For example, if you want to show stock market data on your site that you scrape from a third party, you can easily set it so your site hits the source site for fresh content no more than once every 5 minutes. This caching is seamless to your script; you don't have to worry about it.
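To make the caching idea concrete, here is a minimal sketch of time-based file caching. It is not the class's actual code: the function name cached_fetch, the md5-based file naming, and the $fetch callback are all assumptions made for illustration.

```php
<?php
// Hypothetical helper, not part of the class: keep a fetched string on disk for $ttl seconds.
function cached_fetch(string $key, int $ttl, string $dir, callable $fetch): string
{
    $file = rtrim($dir, '/\\') . DIRECTORY_SEPARATOR . md5($key) . '.cache';

    // Make sure we see the file's current modification time, not a cached stat result.
    clearstatcache(true, $file);

    // Serve the cached copy while it is younger than $ttl seconds.
    if ($ttl > 0 && is_file($file) && (time() - filemtime($file)) < $ttl) {
        return file_get_contents($file);
    }

    // Otherwise hit the source and refresh the cache file (a ttl of 0 disables caching).
    $content = $fetch($key);
    if ($ttl > 0) {
        file_put_contents($file, $content, LOCK_EX);
    }
    return $content;
}
```

With a ttl of 300, the callback that actually contacts the source site runs at most once every 5 minutes per key; every other request is served from the cache file.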
The class includes a companion script named image_cache.php that can be used as the src attribute in img elements to cache images from external sites. This is useful if you want to incorporate an image from an external site that is dynamically updated on a regular basis, such as the stock charts that many websites regenerate every 60 seconds. You can use their image on your site, but with the caching feature you don't have to hit their site every time somebody hits yours.
The class also has 2 static methods that make it simple to extract data from HTML tables. One extracts a table into an array and the other into XML.
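As a rough illustration of the table-to-array idea, here is a sketch built on PHP's DOM extension. It is a stand-in, not the class's own method: the name table_to_array is hypothetical, it assumes the DOM extension is enabled, and it only reads the first table in the markup.

```php
<?php
// Hypothetical stand-in for the class's table-to-array method (the real name and signature are not shown here).
function table_to_array(string $html): array
{
    $doc = new DOMDocument();
    // Scraped markup is rarely valid, so collect parser warnings instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $rows = [];
    $table = $doc->getElementsByTagName('table')->item(0);
    if ($table === null) {
        return $rows; // no table found in the markup
    }
    foreach ($table->getElementsByTagName('tr') as $tr) {
        $cells = [];
        foreach ($tr->childNodes as $cell) {
            // Keep only td/th elements; whitespace text nodes between cells are skipped.
            if ($cell instanceof DOMElement && in_array($cell->tagName, ['td', 'th'], true)) {
                $cells[] = trim($cell->textContent);
            }
        }
        $rows[] = $cells;
    }
    return $rows;
}
```

Each row comes back as a plain list of trimmed cell strings, so a header row followed by data rows becomes a simple two-dimensional array.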
The class can perform basic authentication allowing you to scrape protected content. It also cloaks itself as the User Agent of the user requesting your script. This allows you to access content that may normally be blocked to non-browser agents.
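For anyone curious what those two features amount to at the HTTP level, here is a small sketch. The helper build_request_headers is hypothetical and is not how the class is necessarily written; it only shows the two headers involved.

```php
<?php
// Hypothetical helper: build raw request headers with Basic auth and a forwarded User-Agent.
function build_request_headers(string $host, string $user, string $pass, ?string $visitorAgent): array
{
    $headers = [
        'Host: ' . $host,
        // Basic authentication is base64("user:password") sent in an Authorization header.
        'Authorization: Basic ' . base64_encode($user . ':' . $pass),
    ];
    // Pass along the visitor's browser string when there is one; strip CR/LF to prevent header injection.
    if ($visitorAgent !== null && $visitorAgent !== '') {
        $headers[] = 'User-Agent: ' . str_replace(["\r", "\n"], '', $visitorAgent);
    }
    return $headers;
}

// The visitor's own User-Agent, if the script is running behind a web server.
$agent = $_SERVER['HTTP_USER_AGENT'] ?? null;
$headers = build_request_headers('www.example.com', 'user', 'secret', $agent);
```

Keep in mind that Basic authentication only encodes the credentials, it does not encrypt them, so it is best used over HTTPS.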
The article explains in detail how to use the class and is itself a good tutorial for many PHP techniques.
If you have any comments or questions about this class or the article, let's discuss them here. I'm always looking to learn more. I hope you find the class and the article useful.
The class has an option to save the content you scrape onto your local web server's hard drive. This is called "caching". You tell the class where to store the cache files in one of 2 ways:
1. Modify the class file directly to change the dir property within the class constructor.
[PHP]
/*
Set the 'dir' property to the directory where you want to store the cached
content. End this value with a "/".
*/
$this->dir = "/var/www/cache/";
[/PHP]
2. You can also set the dir property for each instance of the class like so:
[PHP]
$h = new http();
$h->dir = "c:\\inetpub\\cache\\"; // backslashes must be escaped inside double quotes
[/PHP]
As for what you do with the content once you've scraped it, that's up to you. It's just a big string of page source, so do whatever parsing and extracting you want. The class does include a method that makes it relatively simple to extract data out of an HTML table structure. It's all in the documentation at http://www.troywolf.com/articles.
Thanks, sir.
Your reply really encourages me, but unfortunately it doesn't put the scraped file in my folder, which is "img".
img is the folder I created in the location where I placed all the pages.
My location is "mock/scraping/img".
Please tell me how I can modify the above path.
Sorry to disturb you.
I am new to PHP and don't know much about it yet, so please help me.
Thanks for encouraging me.
Your friend,
ravi
Does everything work fine if you do not use caching? (Set ttl = 0.) If so, your only problem is almost certainly with the dir property.
This tells class_http to cache to a directory that is relative to the current path: [PHP]$h->dir = "mock/scraping/img/";[/PHP] For this to work, the img directory must be writable by the web server user. I'm guessing you are on Linux, based on the forward slashes in your path example, so please check out the chmod command. If you do not have shell access to run chmod, you may be able to set write permissions for the folder using your FTP client. You can also contact your hosting company (if applicable) and have them make your img folder globally writable.
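If you want to narrow it down from PHP itself before touching chmod, a quick check along these lines can help. This is only a sketch: cache_dir_problem is a hypothetical helper, not something the class provides.

```php
<?php
// Hypothetical diagnostic, separate from the class: explain why a cache directory may not be usable.
function cache_dir_problem(string $dir): ?string
{
    if (substr($dir, -1) !== '/') {
        return 'path should end with "/"';
    }
    if (!is_dir($dir)) {
        // Relative paths are resolved against the script's current working directory.
        return 'directory does not exist (relative paths resolve against ' . getcwd() . ')';
    }
    if (!is_writable($dir)) {
        return 'directory is not writable by the web server user';
    }
    return null; // no problem found
}

$problem = cache_dir_problem('mock/scraping/img/');
echo $problem === null ? "Cache directory looks usable\n" : "Cache problem: $problem\n";
```

Run it from the same script that does the scraping, so the relative path is resolved from the same place and checked as the same user.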
Hello, I am quite interested in this code, but I do not have much experience with PHP. If I want to access an HTTPS page that receives POST vars from a form, can I use $_POST['']= to set up the vars? Could you post an example for such a scenario? What I actually have to work with is an HTTPS site that accepts a login and password via a POST form. Once the session is established, I want to pass POST vars to a page that uses them to create a table with the info I want.
I can extract the info from an array, but I am a little shaky on how fetching the pages works, and I am unclear on how to do that with the script. I read about sendToHost() and fsockopen(), but I don't know exactly how to implement that with this code. Any insight would be helpful.
I was wondering if you could put an example for such a scenario.

Yes, the class can be used to access HTTPS and password protected content. However, there are some hoops you have to jump through. I don't have time to teach this step by step, but I can offer a complete code example.
Go to my articles page at:
http://www.troywolf.com/articles/
Check out my class_sbdns.php (Server Beach DNS Tool API). It uses class_http to automate logging into an HTTPS site. Here is a hint to help you know what to look for. Basically, you'll use the postvars to send your username and password. The remote server will respond with the next page, and the HTTP headers will contain a session id cookie. On every subsequent hit you make to the server, you must pass that cookie back to the server. You do this in the headers. Again, class_sbdns does all this, so look in there for examples. It will be tedious for you, I know, but not as tedious as me taking the time to walk you through step by step (tedious for me, that is).
Enjoy!
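To show the cookie step from the hint above in isolation, here is a minimal sketch. It is not taken from class_sbdns or class_http: the helper cookie_header_from_response and the form field names username and password are assumptions, and the real field names depend on the remote site's login form.

```php
<?php
// Hypothetical helper: turn the Set-Cookie lines of a raw response header block into one Cookie request header.
function cookie_header_from_response(string $rawHeaders): string
{
    $pairs = [];
    foreach (preg_split('/\r\n|\n/', $rawHeaders) as $line) {
        // Header names are case-insensitive; keep only "name=value" and drop attributes such as path or expires.
        if (preg_match('/^Set-Cookie:\s*([^=;\s]+)=([^;]*)/i', $line, $m)) {
            $pairs[$m[1]] = $m[1] . '=' . trim($m[2]);
        }
    }
    return $pairs ? 'Cookie: ' . implode('; ', $pairs) : '';
}

// URL-encoded body for the login POST (assumed field names).
$postBody = http_build_query(['username' => 'me', 'password' => 'secret']);
```

The string returned from the login response's headers is what you send back as a request header on every later hit, so the remote server keeps recognizing your session.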
It's a really nice class, but I have one problem with it, that I can't seem to solve (but I don't think it's caused by the class that you made, but the classes it calls).
When I visit some pages I get a:
getFromUrl() called
Could not open connection. Error 61: Connection refused
Is there any way I can get around this problem? When I visit the site through a web browser, I can get access without any problems.
Thanks in advance
Dennis
Everything is working fine, except I'm not able to save the content to the cache or to any file location. How can I do it?
There is a warning like this:
"Warning: Missing argument 1 for http::getFromUrl(), called in C:\wamp\www\fetch\class\class_http.php on line 78 and defined in C:\wamp\www\fetch\class\class_http.php on line 127"