Jun 21st, 2005

PHP HTTP Screen-Scraping Class with Caching

http://www.tgreer.com/class_http_php.html

I've written what I think is a high-quality PHP class for screen-scraping external (or internal) web content. The class can cache scraped content for any number of seconds. For example, if you want to show stock market data on your site that you scrape from a third party, you can easily set it so your site hits the source site for fresh content no more than once every 5 minutes. This caching is seamless to your script; you don't have to worry about it.
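
To make the caching idea concrete, here is a minimal sketch of time-based file caching. It is not the code from class_http; the function name cached_fetch, its parameters, and the md5-based file naming are assumptions made for illustration only.

```php
<?php
// Hypothetical helper, not part of class_http: returns the cached copy of
// $url if the cache file is younger than $ttl seconds, otherwise calls
// $fetch to get fresh content and rewrites the cache file.
function cached_fetch(string $url, int $ttl, string $dir, callable $fetch): string
{
    $file = rtrim($dir, '/\\') . DIRECTORY_SEPARATOR . md5($url) . '.cache';

    // Serve from the cache while the file is still fresh.
    if ($ttl > 0 && is_file($file) && (time() - filemtime($file)) < $ttl) {
        return file_get_contents($file);
    }

    // Otherwise hit the source; a ttl of 0 disables caching entirely.
    $content = $fetch($url);
    if ($ttl > 0) {
        file_put_contents($file, $content, LOCK_EX);
    }
    return $content;
}
```

With a ttl of 300, the callable passed as $fetch would run at most once every 5 minutes per URL, no matter how many visitors hit the page in between.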

The class includes a companion script named image_cache.php that can be used as the src attribute in img elements to cache images from external sites. This is useful if you want to incorporate an image from an external site that is dynamically updated on a regular basis, for example the stock charts that many websites regenerate every 60 seconds. This allows you to use their image on your site, but with the caching feature, you don't have to hit their site every time somebody hits yours.

The class also has 2 static methods that make it simple to extract data from HTML tables. One extracts a table into an array and the other into XML.
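
For a rough idea of what extracting a table into an array can look like, the sketch below uses PHP's DOM extension. It is not the class's own static method; table_to_array is a hypothetical name, and it only reads the rows of the first table it finds.

```php
<?php
// Illustrative only: converts the first <table> in $html into an array of
// rows, each row being an array of trimmed cell texts.
function table_to_array(string $html): array
{
    $doc = new DOMDocument();
    // Scraped HTML is rarely valid, so collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $rows = [];
    $table = $doc->getElementsByTagName('table')->item(0);
    if ($table === null) {
        return $rows;
    }
    // Note: this also picks up rows of any tables nested inside the first one.
    foreach ($table->getElementsByTagName('tr') as $tr) {
        $cells = [];
        foreach ($tr->childNodes as $cell) {
            if ($cell instanceof DOMElement && in_array($cell->tagName, ['td', 'th'], true)) {
                $cells[] = trim($cell->textContent);
            }
        }
        $rows[] = $cells;
    }
    return $rows;
}
```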

The class can perform basic authentication allowing you to scrape protected content. It also cloaks itself as the User Agent of the user requesting your script. This allows you to access content that may normally be blocked to non-browser agents.
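
As an illustration of the two ideas in this paragraph, the sketch below builds the request headers for HTTP Basic authentication and passes along the visitor's User-Agent string. The helper name build_request_headers and the demo credentials are assumptions; how class_http does this internally may differ.

```php
<?php
// Hypothetical helper: builds request headers carrying HTTP Basic
// credentials and a caller-supplied User-Agent string.
function build_request_headers(string $user, string $pass, string $userAgent): array
{
    return [
        // Basic auth is the base64 encoding of "user:password".
        'Authorization: Basic ' . base64_encode($user . ':' . $pass),
        'User-Agent: ' . $userAgent,
    ];
}

// In a web request the visitor's agent string is available in $_SERVER;
// the fallback value is only a placeholder for command-line runs.
$agent = $_SERVER['HTTP_USER_AGENT'] ?? 'Mozilla/5.0 (compatible)';
$headers = build_request_headers('demo', 'secret', $agent);
```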

The article explains in detail how to use the class and is itself a good tutorial for many techniques in PHP.

If you have any comments or questions about this class or the article, let's discuss them here. I'm always wanting to learn more. I hope you find the class and the article useful.
Posted by Troy (Posting Whiz, 354 posts, member since Jun 2005)

Jul 23rd, 2005

Re: PHP HTTP Screen-Scraping Class with Caching

Hi, I am Ravi.
I downloaded your screen-scraping code,
but I am having a problem storing the scraped file in a folder. Please guide me.
Also, please tell me how I can extract content from the scraped file.

Thanks,
Ravi
Posted by aasahil (Newbie Poster, 10 posts, member since Jul 2005)

Jul 23rd, 2005

Re: PHP HTTP Screen-Scraping Class with Caching

The class has an option to save the content you scrape onto your local web server's hard drive. This is called "caching". You tell the class where to store the cache files by one of two methods:

1. Modify the class file directly to change the dir property within the class constructor.
[PHP]
/*
Set the 'dir' property to the directory where you want to store the cached
content. End this value with a "/".
*/
$this->dir = "/var/www/cache/";
[/PHP]

2. Set the dir property on each instance of the class, like so:
[PHP]
$h = new http();
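// A single backslash before the closing quote would escape it, so the
// backslashes are doubled here (forward slashes also work on Windows).
$h->dir = "c:\\inetpub\\cache\\";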
[/PHP]

As for what you do with the content once you've scraped it, that's up to you. It's just a big string of page source, so do whatever parsing and extracting you want. The class does include a method that makes it relatively simple to extract data out of an HTML table structure. It's all in the documentation at http://www.troywolf.com/articles.
Posted by Troy (Posting Whiz, 354 posts, member since Jun 2005)

Jul 23rd, 2005

Re: PHP HTTP Screen-Scraping Class with Caching

Thank you, sir.
Your reply really encouraged me, but unfortunately I still cannot get the scraped file saved in my folder, which is "img".
img is a folder I created in the location where I placed all the pages.
My location is "mock/scraping/img".
Please tell me how I should modify the path above.
Sorry to disturb you.
I am new to PHP and do not have much knowledge of it yet, so please help me.

Thanks for encouraging me.
Your friend,
Ravi
Posted by aasahil (Newbie Poster, 10 posts, member since Jul 2005)

Jul 23rd, 2005

Re: PHP HTTP Screen-Scraping Class with Caching

Does everything work fine if you do not use caching? (Set ttl = 0.) If so, your only problem is almost certainly with the dir property.

This will tell class_http to cache to a directory that is relative to the current path:
[PHP]
$h->dir = "mock/scraping/img/";
[/PHP]
For this to work, the directory img must be writable by the web server user. I'm going to guess you are using Linux based on the forward slashes in your path example, so please check out the chmod command. If you do not have shell access to run chmod, you may be able to set write permissions for the folder using your FTP client. You can also contact your hosting company (if applicable) to have them make your img folder globally writable.
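
A quick way to narrow down a dir problem is to check the path from PHP itself before handing it to the class. The helper below is a hypothetical sketch, not part of class_http; it only tests the three conditions mentioned in this thread (trailing slash, directory exists, directory is writable).

```php
<?php
// Hypothetical helper: returns null if $dir looks usable as a cache
// directory, or a short description of the first problem found.
function cache_dir_problem(string $dir): ?string
{
    $last = substr($dir, -1);
    if ($last !== '/' && $last !== '\\') {
        return 'path must end with a slash';
    }
    if (!is_dir($dir)) {
        // Relative paths are resolved against the current working directory.
        return 'directory does not exist (relative to ' . getcwd() . ')';
    }
    if (!is_writable($dir)) {
        return 'directory is not writable by the web server user';
    }
    return null;
}

$problem = cache_dir_problem('mock/scraping/img/');
echo $problem === null ? "cache dir OK\n" : "cache dir problem: $problem\n";
```

Running this from the same script that uses the class shows whether the relative path resolves where you expect and whether the permissions are the issue.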
Posted by Troy (Posting Whiz, 354 posts, member since Jun 2005)

Oct 2nd, 2007

Re: PHP HTTP Screen-Scraping Class with Caching

Hello, I am quite interested in this code, but I do not have much experience with PHP. If I want to access an HTTPS page that receives POST variables from a form, can I use $_POST[''] = to set up the variables? Could you post an example for such a scenario? What I actually have to work with is an HTTPS site that accepts a login and password via a POST form. Once the session is established, I want to pass POST variables to a page that uses them to create a table with the info I want.
I can extract the info from an array, but I am a little shaky on fetching the pages and unclear on how to do that with the script. I read about sendToHost() and fsockopen(), but I don't know exactly how to implement that with this code. Any insight would be helpful.
Posted by world_weapon (Junior Poster in Training, 63 posts, member since Apr 2004)

Oct 2nd, 2007

Re: PHP HTTP Screen-Scraping Class with Caching

[QUOTE]I was wondering if you could put an example for such a scenario.[/QUOTE]
The code is designed for PHP programmers to integrate into their apps. So if you are not very familiar with PHP programming, it will probably be difficult to understand my class. However, over the years, a lot of newbies have been successful with it, so I guess take heart!

Yes, the class can be used to access HTTPS and password protected content. However, there are some hoops you have to jump through. I don't have time to teach this step by step, but I can offer a complete code example.

Go to my articles page at:
http://www.troywolf.com/articles/

Check out my class_sbdns.php (Server Beach DNS Tool API). It uses class_http to automate logging into an HTTPS site. Here is a hint to help you know what to look for. Basically, you'll use the postvars to send your username and password. The remote server will respond with the next page, and the HTTP headers will contain a session id cookie. On every subsequent hit you make to the server, you must pass that cookie back in the request headers. Again, class_sbdns does all this, so look in there for examples. It will be tedious for you, I know, but not as tedious as me taking the time to walk you through step by step (tedious for me, that is).
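
The cookie round trip described above can be sketched roughly as follows. This is not taken from class_sbdns or class_http; the function name cookies_from_headers, the sample response headers, and the form field names username and password are made up for illustration.

```php
<?php
// Illustrative only: pulls the name=value pairs out of Set-Cookie response
// headers and joins them into a value for the Cookie request header.
function cookies_from_headers(string $rawHeaders): string
{
    $pairs = [];
    foreach (preg_split('/\r\n|\n/', $rawHeaders) as $line) {
        if (stripos($line, 'Set-Cookie:') === 0) {
            // Keep only "name=value"; drop attributes such as path or expires.
            $value = trim(substr($line, strlen('Set-Cookie:')));
            $pairs[] = trim(explode(';', $value, 2)[0]);
        }
    }
    return implode('; ', $pairs);
}

// Step 1: the login form fields go URL-encoded into the POST body.
$body = http_build_query(['username' => 'demo', 'password' => 'secret']);

// Step 2: the login response (sample data here) carries the session cookie.
$response = "HTTP/1.1 302 Found\r\nSet-Cookie: PHPSESSID=abc123; path=/\r\nLocation: /home\r\n";

// Step 3: send this header with every later request in the same session.
$cookieHeader = 'Cookie: ' . cookies_from_headers($response);
```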

Enjoy!
Posted by Troy (Posting Whiz, 354 posts, member since Jun 2005)

Oct 2nd, 2007

Re: PHP HTTP Screen-Scraping Class with Caching

Thanks a lot, I will check it out. This is something that I absolutely must learn to do. I'm glad you have this site out there helping folks like me get around to doing crafty stuff like this.
Posted by world_weapon (Junior Poster in Training, 63 posts, member since Apr 2004)

Oct 28th, 2007

Re: PHP HTTP Screen-Scraping Class with Caching

It's a really nice class, but I have one problem with it that I can't seem to solve (I don't think it's caused by the class you made, but by the functions it calls).

When I visit some pages I get a:
getFromUrl() called
Could not open connection. Error 61: Connection refused

Can I get around this problem in any way? When I visit the site through a web browser, I can access it without any problems.

Thanks in advance
Dennis
Posted by dcasso (Newbie Poster, 1 post, member since Oct 2007)

Jan 4th, 2008

Re: PHP HTTP Screen-Scraping Class with Caching

Everything is working fine, except I'm not able to save the content to the cache or any file location. How can I do it?
There is a warning like this:
"Warning: Missing argument 1 for http::getFromUrl(), called in C:\wamp\www\fetch\class\class_http.php on line 78 and defined in C:\wamp\www\fetch\class\class_http.php on line 127"
Posted by savoo4u (Newbie Poster, 2 posts, member since Jan 2008)