Views: 10568 | Replies: 13
Join Date: Jun 2005
Location: Kansas City, Missouri, USA
Posts: 344
Reputation: Troy is an unknown quantity at this point 
Rep Power: 4
Solved Threads: 4
Troy's Avatar
Troy Troy is offline Offline
Posting Whiz

PHP HTTP Screen-Scraping Class with Caching

  #1  
Jun 21st, 2005
http://www.tgreer.com/class_http_php.html

I've written what I think is a high-quality PHP class for screen-scraping external (or internal) web content. The class can cache scraped content for any number of seconds. For example, if you want to show stock market data on your site that you scrape from a third party, you can easily set it so your site hits the source site for fresh content no more than once every 5 minutes. This caching is seamless to your script; you don't have to worry about it.
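
To illustrate the idea of time-limited caching, here is a minimal sketch. It is not the code from the class: cached_fetch(), its parameters, and the example URL are hypothetical names chosen for this illustration, and the fetcher is passed in as a callable so the caching logic stands on its own.

```php
<?php
/**
 * Hypothetical helper, not part of class_http: return the content for $url,
 * re-fetching only when the cached copy is older than $ttl seconds.
 * $fetch is any callable that takes a URL and returns a string or false.
 */
function cached_fetch(string $url, int $ttl, string $dir, callable $fetch)
{
    // One cache file per URL; md5 keeps the file name filesystem-safe.
    $file = rtrim($dir, '/\\') . DIRECTORY_SEPARATOR . md5($url) . '.cache';

    // Serve from the cache while the file is younger than $ttl seconds.
    if ($ttl > 0 && is_file($file) && (time() - filemtime($file)) < $ttl) {
        return file_get_contents($file);
    }

    // Otherwise fetch fresh content and store it (a ttl of 0 disables caching).
    $body = $fetch($url);
    if ($body !== false && $ttl > 0) {
        file_put_contents($file, $body, LOCK_EX);
    }
    return $body;
}

// Example (placeholder URL): hit the source site at most once every 5 minutes.
// $html = cached_fetch('http://example.com/quotes', 300, '/var/www/cache', 'file_get_contents');
```

The calling script never sees whether the content came from the cache or from the remote site, which is what "seamless" means here.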

The class includes a companion script named image_cache.php that can be used as the src attribute in img elements to cache images from external sites. This is useful if you want to incorporate an image from an external site that is updated dynamically on a regular basis, such as the stock charts that many websites regenerate every 60 seconds. You can use their image on your site, and thanks to the caching you don't have to hit their site every time somebody hits yours.

The class also has 2 static methods that make it simple to extract data from HTML tables. One extracts a table into an array and the other into XML.
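
For readers who want to see what table extraction involves, here is a minimal sketch using PHP's bundled DOM extension. It is not the class's own static method: html_table_to_array() is a hypothetical name, and it only handles the first table in the page (rows of nested tables would be included too).

```php
<?php
/**
 * Hypothetical helper (not the class's own method): convert the first
 * <table> in an HTML string into an array of rows, each row being an
 * array of trimmed cell texts.
 */
function html_table_to_array(string $html): array
{
    if (trim($html) === '') {
        return [];
    }

    $doc = new DOMDocument();
    // Real-world markup is rarely valid, so collect parse warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $table = $doc->getElementsByTagName('table')->item(0);
    if ($table === null) {
        return [];
    }

    $rows = [];
    foreach ($table->getElementsByTagName('tr') as $tr) {
        $cells = [];
        foreach ($tr->childNodes as $cell) {
            // Keep only <td> and <th> elements, skipping whitespace nodes.
            if ($cell instanceof DOMElement && in_array($cell->nodeName, ['td', 'th'], true)) {
                $cells[] = trim($cell->textContent);
            }
        }
        $rows[] = $cells;
    }
    return $rows;
}
```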

The class can perform basic authentication allowing you to scrape protected content. It also cloaks itself as the User Agent of the user requesting your script. This allows you to access content that may normally be blocked to non-browser agents.
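
As a rough sketch of those two ideas using PHP's built-in HTTP stream wrapper rather than the class itself: build_http_options() is a hypothetical helper, and the URL and credentials in the usage comment are placeholders.

```php
<?php
/**
 * Hypothetical helper: build HTTP stream-context options that send Basic
 * authentication credentials and pass along the visitor's User-Agent.
 */
function build_http_options(string $user, string $password, string $fallbackAgent = 'PHP'): array
{
    // Basic authentication is "user:password", base64-encoded, in a header.
    $headers = ['Authorization: Basic ' . base64_encode($user . ':' . $password)];

    // Reuse the browser's User-Agent when the script runs behind a web server.
    $agent = $_SERVER['HTTP_USER_AGENT'] ?? $fallbackAgent;

    return ['http' => [
        'method'     => 'GET',
        'header'     => implode("\r\n", $headers),
        'user_agent' => $agent,
        'timeout'    => 10,
    ]];
}

// Usage (placeholder URL and credentials):
// $context = stream_context_create(build_http_options('alice', 'secret'));
// $html = file_get_contents('http://example.com/protected/', false, $context);
```

Basic authentication only encodes the credentials, it does not encrypt them, so this is best used over HTTPS.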

The article explains in detail how to use the class and is itself a good tutorial for many techniques in PHP.

If you have any comments or questions about this class or the article, let's discuss them here. I'm always wanting to learn more. I hope you find the class and the article useful.
Troy Wolf is the author of SnippetEdit. "Website editing as easy as it gets." IX Web Hosting
Join Date: Jul 2005
Posts: 10
Reputation: aasahil is an unknown quantity at this point 
Rep Power: 4
Solved Threads: 0
aasahil's Avatar
aasahil aasahil is offline Offline
Newbie Poster

Re: PHP HTTP Screen-Scraping Class with Caching

  #2  
Jul 23rd, 2005
Hi, I am Ravi.
I downloaded your screen-scraping code,
but I am having a problem storing the scraped file in the folder. Please guide me.
Also, please tell me how I can extract contents from the scraped file.

thanks
ravi
Troy

Re: PHP HTTP Screen-Scraping Class with Caching

  #3  
Jul 23rd, 2005
The class has an option to save the content you scrape onto your local web server's hard drive. This is called "caching". You tell the class where to store the cache files in one of two ways:

1. Modify the class file directly to change the dir property within the class constructor.
[PHP]
/*
Set the 'dir' property to the directory where you want to store the cached
content. End this value with a "/".
*/
$this->dir = "/var/www/cache/";
[/PHP]

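2. You can also set the dir property on each instance of the class, like so:
[PHP]
$h = new http();
// Use forward slashes for Windows paths. Written with backslashes inside
// double quotes, the final \" would escape the closing quote and cause a
// parse error.
$h->dir = "c:/inetpub/cache/";
[/PHP]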

As for what you do with the content once you've scraped it, well, that's up to you. It's just a big string of page source, so do whatever parsing and extracting you want. The class does include a method that makes it relatively simple to extract data out of an HTML table structure. It's all in the documentation at http://www.troywolf.com/articles.
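
As one example of "whatever parsing you want", here is a minimal sketch that pulls every link out of a string of page source with PHP's DOM extension. extract_links() is a hypothetical helper for illustration, not something the class provides.

```php
<?php
/**
 * Hypothetical helper: return the href of every <a> element in a string
 * of page source, in document order.
 */
function extract_links(string $html): array
{
    if (trim($html) === '') {
        return [];
    }

    $doc = new DOMDocument();
    // Suppress warnings about imperfect markup while parsing.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    $xpath = new DOMXPath($doc);
    // Select only anchors that actually carry an href attribute.
    foreach ($xpath->query('//a[@href]') as $a) {
        $links[] = $a->getAttribute('href');
    }
    return $links;
}
```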
aasahil

Re: PHP HTTP Screen-Scraping Class with Caching

  #4  
Jul 23rd, 2005
Thanks, sir.
Your reply really encourages me, but unfortunately I still can't get the scraped file into my folder, which is "img".
img is a folder I created in the location where I have placed all the pages.
My location is "mock/scraping/img".
Please tell me how I can modify the above path.
Sorry to disturb you.
I am new to PHP and don't have much knowledge of it yet, so please help me.

Thanks for encouraging me
your friend
ravi
Troy

Re: PHP HTTP Screen-Scraping Class with Caching

  #5  
Jul 23rd, 2005
Does everything work fine if you do not use caching? (Set ttl = 0.) If so, your only problem is almost certainly with the dir property.

This tells class_http to cache to a directory that is relative to the current path: [PHP]$h->dir = "mock/scraping/img/";[/PHP] For this to work, the img directory must be writable by the user the web server runs as. I'm guessing you are using Linux, based on the forward slashes in your path example, so please check out the chmod command. If you do not have shell access to run chmod, you may be able to set write permissions on the folder using your FTP client. You can also contact your hosting company (if applicable) and have them make your img folder globally writable.
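
A quick way to confirm whether the directory is the problem is to ask PHP directly. This is a minimal sketch: cache_dir_problem() is a hypothetical diagnostic, not part of the class, and the relative path is the one from the question, resolved against the script's current working directory.

```php
<?php
/**
 * Hypothetical diagnostic: describe why a cache directory cannot be used,
 * or return null when it looks usable.
 */
function cache_dir_problem(string $dir): ?string
{
    if (!is_dir($dir)) {
        return "Not a directory: $dir";
    }
    if (!is_writable($dir)) {
        return "Directory is not writable by the web server user: $dir";
    }
    return null;
}

// Relative paths are resolved against the current working directory.
$problem = cache_dir_problem('mock/scraping/img/');
if ($problem !== null) {
    echo $problem, "\n";
}
```

Run it through the web server rather than from a shell, because the two often run as different users with different permissions.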
Join Date: Apr 2004
Location: Brownsville or Austin, TX or Faber, VA
Posts: 59
Reputation: world_weapon is an unknown quantity at this point 
Rep Power: 5
Solved Threads: 2
world_weapon's Avatar
world_weapon world_weapon is offline Offline
Junior Poster in Training

Re: PHP HTTP Screen-Scraping Class with Caching

  #6  
Oct 2nd, 2007
Hello, I am actually quite interested in this code, but I do not have much experience with PHP. If I want to access an HTTPS page that receives POST vars from a form, could I use $_POST['']= to set up the vars? Could you post an example for such a scenario? What I actually have to work with is an HTTPS site that accepts a login and password via a POST form. Once the session is established, I want to pass POST vars to a page that uses them to create a table with the info I want.
I can extract the info from an array, but I am a little shaky on fetching the pages, and I am unclear as to how to do that with the script. I read about sendToHost() and fsockopen(), but I don't know exactly how to implement that with this code. Any insight would be helpful.
The purpose of my existence is why I am here.
Troy

Solution Re: PHP HTTP Screen-Scraping Class with Caching

  #7  
Oct 2nd, 2007
Originally posted by world_weapon:
I was wondering if you could put an example for such a scenario.

The code is designed for PHP programmers to integrate into their apps. So if you are not very familiar with PHP programming, it will probably be difficult to understand my class. However, over the years, a lot of newbies have been successful with it, so I guess take heart!

Yes, the class can be used to access HTTPS and password protected content. However, there are some hoops you have to jump through. I don't have time to teach this step by step, but I can offer a complete code example.

Go to my articles page at:
http://www.troywolf.com/articles/

Check out my class_sbdns.php (Server Beach DNS Tool API). It uses class_http to automate logging into an HTTPS site. Here is a hint to help you know what to look for. Basically, you'll use the postvars to send your username and password. The remote server will respond with the next page, and the HTTP headers will contain a session ID cookie. On every subsequent hit you make to the server, you must pass that cookie back in the request headers. Again, class_sbdns does all this, so look in there for examples. It will be tedious for you, I know, but not as tedious as me taking the time to walk you through step by step (tedious for me, that is).
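
The cookie hand-off described above can be sketched without the class, using PHP's HTTP stream wrapper. This is an illustration only, not the code in class_sbdns: cookie_header_from() is a hypothetical helper, the URL and form field names in the usage comment are placeholders, and HTTPS requests assume the openssl extension is enabled.

```php
<?php
/**
 * Hypothetical helper: collect "name=value" pairs from the Set-Cookie
 * lines of a response and join them into a Cookie request header.
 * $responseHeaders is a list of raw header lines, such as the
 * $http_response_header array PHP fills in after an HTTP stream request.
 */
function cookie_header_from(array $responseHeaders): string
{
    $cookies = [];
    foreach ($responseHeaders as $line) {
        // Header names are case-insensitive; attributes after ";" are dropped.
        if (preg_match('/^Set-Cookie:\s*([^=;\s]+)=([^;]*)/i', $line, $m)) {
            $cookies[$m[1]] = rtrim($m[2]); // a later value for the same name wins
        }
    }

    $pairs = [];
    foreach ($cookies as $name => $value) {
        $pairs[] = $name . '=' . $value;
    }
    return $pairs ? 'Cookie: ' . implode('; ', $pairs) : '';
}

// Sketch of the flow (placeholder URL and field names):
// $login = stream_context_create(['http' => [
//     'method'  => 'POST',
//     'header'  => 'Content-Type: application/x-www-form-urlencoded',
//     'content' => http_build_query(['username' => 'me', 'password' => 'secret']),
// ]]);
// file_get_contents('https://example.com/login', false, $login);
// $cookie = cookie_header_from($http_response_header);
// ...then send $cookie in the 'header' option of every later request.
```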

Enjoy!
world_weapon

Re: PHP HTTP Screen-Scraping Class with Caching

  #8  
Oct 2nd, 2007
Thanks a lot, I will check it out. This is something that I absolutely must learn to do. Glad that you have this site out there helping folks like me get around to doing crafty stuff like this.
Join Date: Oct 2007
Posts: 1
Reputation: dcasso is an unknown quantity at this point 
Rep Power: 0
Solved Threads: 0
dcasso dcasso is offline Offline
Newbie Poster

Re: PHP HTTP Screen-Scraping Class with Caching

  #9  
Oct 28th, 2007
It's a really nice class, but I have one problem with it that I can't seem to solve (I don't think it's caused by the class you made, but by the functions it calls).

When I visit some pages I get a:
getFromUrl() called
Could not open connection. Error 61: Connection refused

Is there any way to get around this problem? When I visit the site through a web browser, I can get access without any problems.

Thanks in advance
Dennis
Join Date: Jan 2008
Posts: 2
Reputation: savoo4u is an unknown quantity at this point 
Rep Power: 0
Solved Threads: 0
savoo4u savoo4u is offline Offline
Newbie Poster

Re: PHP HTTP Screen-Scraping Class with Caching

  #10  
Jan 4th, 2008
Everything is working fine, except I'm not able to save the content to the cache or to any file location. How can I do it?
There is a warning like this:
"Warning: Missing argument 1 for http::getFromUrl(), called in C:\wamp\www\fetch\class\class_http.php on line 78 and defined in C:\wamp\www\fetch\class\class_http.php on line 127"