Views: 10568 | Replies: 13
Join Date: Jun 2005
Location: Kansas City, Missouri, USA
Posts: 344
Reputation:
Rep Power: 4
Solved Threads: 4
http://www.tgreer.com/class_http_php.html
I've written what I think is a high-quality PHP class for screen-scraping external (or internal) web content. The class includes features to cache scraped content for any number of seconds. For example, if you want to show stock market data on your site that you scrape from a third party, you can easily set it so your site hits the source site for fresh content no more than once every 5 minutes. This caching is seamless to your script: you don't have to worry about it.
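To make the caching idea concrete, here is a minimal sketch of time-based (TTL) caching. It is not the class's actual code: the function name `fetch_with_cache`, the `.cache` file naming, and the `$fetcher` callback are assumptions made up for illustration.

```php
<?php
// Hypothetical helper, not part of class_http: returns the cached copy if it is
// younger than $ttl seconds, otherwise calls $fetcher and refreshes the cache.
function fetch_with_cache(string $url, int $ttl, string $dir, callable $fetcher): string
{
    // One cache file per URL; md5 keeps the file name filesystem-safe.
    $file = rtrim($dir, "/\\") . DIRECTORY_SEPARATOR . md5($url) . ".cache";

    // Serve from the cache while the file is still fresh.
    if ($ttl > 0 && is_file($file) && (time() - filemtime($file)) < $ttl) {
        return file_get_contents($file);
    }

    // Cache miss or stale entry: hit the source and store the result.
    $content = $fetcher($url);
    if ($ttl > 0) {
        file_put_contents($file, $content, LOCK_EX);
    }
    return $content;
}

// Example (placeholder URL): refresh at most once every 5 minutes (300 seconds).
// $html = fetch_with_cache("http://example.com/quotes", 300, "/var/www/cache/", "file_get_contents");
```

Passing the fetch step in as a callback keeps the sketch independent of how the page is actually retrieved, and a TTL of 0 bypasses the cache entirely.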
The class includes a companion script named image_cache.php that can be used as the src attribute in img elements to cache images from external sites. This is useful if you want to incorporate an image from an external site that is dynamically updated on a regular basis, for example, the stock charts that many websites regenerate every 60 seconds. This allows you to use their image on your site, but with the caching feature, you don't have to hit their site every time somebody hits yours.
The class also has two static methods that make it simple to extract data from HTML tables. One extracts a table into an array and the other into XML.
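As a rough illustration of what pulling a table into an array involves, here is a sketch built on PHP's DOM extension. It is not the class's own method: `table_to_rows` is a made-up name, and the sketch only looks at the first table in the markup and does not treat nested tables specially.

```php
<?php
// Hypothetical helper, not the class's own method: converts the first <table>
// found in $html into an array of rows, each row an array of cell strings.
function table_to_rows(string $html): array
{
    if (trim($html) === "") {
        return [];
    }

    $doc = new DOMDocument();
    // Scraped markup is rarely valid; keep libxml warnings out of the output.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $table = $doc->getElementsByTagName("table")->item(0);
    if ($table === null) {
        return [];
    }

    $rows = [];
    foreach ($table->getElementsByTagName("tr") as $tr) {
        $cells = [];
        foreach ($tr->childNodes as $cell) {
            // Keep only header and data cells; skip whitespace text nodes.
            if ($cell instanceof DOMElement && in_array($cell->nodeName, ["td", "th"], true)) {
                $cells[] = trim($cell->textContent);
            }
        }
        $rows[] = $cells;
    }
    return $rows;
}
```

Each row comes back as a plain list of cell texts, so the first row can be used as column names if the table has a header row.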
The class can perform basic authentication allowing you to scrape protected content. It also cloaks itself as the User Agent of the user requesting your script. This allows you to access content that may normally be blocked to non-browser agents.
The article explains in detail how to use the class, and is itself a good tutorial for many techniques in PHP.
If you have any comments or questions about this class or the article, let's discuss them here. I'm always wanting to learn more. I hope you find the class and the article useful.
Join Date: Jun 2005
Location: Kansas City, Missouri, USA
Posts: 344
Reputation:
Rep Power: 4
Solved Threads: 4
The class has an option to save the content you scrape onto your local web server's hard drive. This is called "caching". You tell the class where to store the cache files in one of two ways:
1. Modify the class file directly to change the dir property within the class constructor.
[PHP]
/*
Set the 'dir' property to the directory where you want to store the cached
content. End this value with a "/".
*/
$this->dir = "/var/www/cache/";
[/PHP]
2. You can also set the dir property for each instance of the class like so:
[PHP]
$h = new http();
// Backslashes must be escaped inside a double-quoted PHP string.
$h->dir = "c:\\inetpub\\cache\\";
[/PHP]
As far as what you do with the content once you've scraped it, well, that's up to you. It's just a big string of page source, so do whatever parsing and extracting you want. The class does include a method to make it relatively simple to extract data out of an HTML table structure. It's all in the documentation at http://www.troywolf.com/articles.
Thanks, sir.
Your reply really encourages me, but unfortunately I can't get the scraped file into my folder, which is "img".
img is the folder I created in the location where I have placed all the pages.
My location is "mock/scraping/img".
Please tell me how I can modify the above path.
Sorry to disturb you.
I am new to PHP and don't have much knowledge of it, so please help me.
Thanks for encouraging me.
Your friend,
ravi
Join Date: Jun 2005
Location: Kansas City, Missouri, USA
Posts: 344
Reputation:
Rep Power: 4
Solved Threads: 4
Does everything work fine if you do not use caching? (Set ttl = 0.) If so, your only problem is almost certainly with the dir property.
This will tell class_http to cache to a directory that is relative to the current path:
[PHP]$h->dir = "mock/scraping/img/";[/PHP]
For this to work, the directory img must be writable by the web service user. I'm going to guess you are using Linux, based on the forward slashes in your path example, so please check out the chmod command. If you do not have access to the shell to run chmod, you may be able to set write permissions for the folder using your FTP client. You can always contact your hosting company (if applicable) to have them make your img folder globally writable.
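If you would rather check the directory from PHP before handing it to the class, something like the following sketch may help. The function name `prepare_cache_dir` and the 0775 mode are assumptions for illustration; nothing here is part of the class.

```php
<?php
// Hypothetical helper: makes sure $dir exists and is writable by the PHP
// process, and returns it with exactly one trailing "/".
function prepare_cache_dir(string $dir): string
{
    $dir = rtrim($dir, "/\\") . "/";

    // Create the directory (and any missing parents) if it is not there yet.
    if (!is_dir($dir) && !mkdir($dir, 0775, true) && !is_dir($dir)) {
        throw new RuntimeException("Cannot create cache directory: $dir");
    }
    // The web server user needs write permission, otherwise caching fails.
    if (!is_writable($dir)) {
        throw new RuntimeException("Cache directory is not writable: $dir");
    }
    return $dir;
}

// Example, using the relative path from this thread:
// $h->dir = prepare_cache_dir("mock/scraping/img");
```

If the second exception is thrown, the fix is still on the server side (chmod, the FTP client, or the hosting company); the script can only report the problem.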
Join Date: Apr 2004
Location: Brownsville or Austin, TX or Faber, VA
Posts: 59
Reputation:
Rep Power: 5
Solved Threads: 2
Hello, I am actually quite interested in this code, but I do not have much experience with PHP. If I want to access an HTTPS page that receives POST vars from a form, could I use $_POST['']= to set up the vars? I was wondering if you could put an example for such a scenario. What I actually have to work with is an HTTPS site that accepts a login and password via a POST form. Once the session is established, I want to pass POST vars to a page that uses them to create a table with the info I want.
I can extract the info from an array, but I am a little shaky on fetching the pages, and I am unclear how to do that with the script. I read about sendToHost() and about fsockopen(), but I don't know exactly how to implement that with this code. Any insight would be helpful.
The purpose of my existence is why I am here.
Join Date: Jun 2005
Location: Kansas City, Missouri, USA
Posts: 344
Reputation:
Rep Power: 4
Solved Threads: 4
I was wondering if you could put an example for such a scenario.
The code is designed for PHP programmers to integrate into their apps. So if you are not very familiar with PHP programming, it will probably be difficult to understand my class. However, over the years, a lot of newbies have been successful with it, so I guess take heart!

Yes, the class can be used to access HTTPS and password protected content. However, there are some hoops you have to jump through. I don't have time to teach this step by step, but I can offer a complete code example.
Go to my articles page at:
http://www.troywolf.com/articles/
Check out my class_sbdns.php (Server Beach DNS Tool API). It uses class_http to automate logging into an HTTPS site. I will give you this hint to help you know what to look for. Basically, you'll use the postvars to send your username and password. The remote server will respond with the next page, and the HTTP headers will contain a session id cookie. On every subsequent hit you make to the server, you must pass that cookie back in the request headers. Again, class_sbdns does all this, so look in there for examples. It will be tedious for you, I know, but not as tedious as me taking the time to walk you through step by step (tedious for me, that is).
Enjoy!
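The cookie round trip described above can be sketched roughly like this. These helpers are not taken from class_http or class_sbdns: the names `extract_cookies` and `cookie_header` are made up, and the URL and form field names in the usage outline are placeholders.

```php
<?php
// Hypothetical helper: collects name=value pairs from Set-Cookie response
// headers, ignoring attributes such as path or expires.
function extract_cookies(array $headerLines): array
{
    $cookies = [];
    foreach ($headerLines as $line) {
        // Header names are case-insensitive, so match without regard to case.
        if (stripos($line, "Set-Cookie:") !== 0) {
            continue;
        }
        // Keep only the first "name=value" segment before any ";".
        $pair = trim(explode(";", substr($line, strlen("Set-Cookie:")), 2)[0]);
        $parts = explode("=", $pair, 2);
        if (count($parts) === 2 && $parts[0] !== "") {
            $cookies[$parts[0]] = $parts[1];
        }
    }
    return $cookies;
}

// Builds the Cookie request header to send back on every later request.
function cookie_header(array $cookies): string
{
    $pairs = [];
    foreach ($cookies as $name => $value) {
        $pairs[] = $name . "=" . $value;
    }
    return "Cookie: " . implode("; ", $pairs);
}

// Usage outline (placeholder URL and field names, assuming PHP's http stream wrapper):
// $ctx = stream_context_create(["http" => [
//     "method"  => "POST",
//     "header"  => "Content-Type: application/x-www-form-urlencoded",
//     "content" => http_build_query(["username" => "me", "password" => "secret"]),
// ]]);
// $page = file_get_contents("https://example.com/login", false, $ctx);
// $cookies = extract_cookies($http_response_header);
// Later requests then include cookie_header($cookies) in their "header" option.
```

The sketch keeps only the cookie name and value, which is usually enough to carry a session id from the login response to the following requests.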
Join Date: Oct 2007
Posts: 1
Reputation:
Rep Power: 0
Solved Threads: 0
It's a really nice class, but I have one problem with it that I can't seem to solve (I don't think it's caused by the class you made, but by the classes it calls).
When I visit some pages I get:
getFromUrl() called
Could not open connection. Error 61: Connection refused
Is there any way to get around this problem? When I visit the site through a web browser, I can access it without any problems.
Thanks in advance,
Dennis
Join Date: Jan 2008
Posts: 2
Reputation:
Rep Power: 0
Solved Threads: 0
Everything is working fine, except I'm not able to save the content to the cache or to any file location. How can I do it?
There is a warning like this:
"Warning: Missing argument 1 for http::getFromUrl(), called in C:\wamp\www\fetch\class\class_http.php on line 78 and defined in C:\wamp\www\fetch\class\class_http.php on line 127"