PHP HTTP Screen-Scraping Class with Caching
Join Date: Jan 2008
Posts: 1
Reputation:
Solved Threads: 0
When I used your examples, it all worked well, but when I tried a different URL, I got this error:
Warning: Missing argument 1 for http::getFromUrl(), called in /home/web/class_http.php on line 90 and defined in /home/web/class_http.php on line 139
Also, while I was using your example, it didn't appear to be caching. When I looked in the current directory, or in a different directory that I specified, I did not see any cached data.
Thanks
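On the caching part: one thing worth ruling out is whether the PHP process can write to the cache directory at all. The sketch below is only an assumption about a likely cause, not a diagnosis of class_http.php itself; `cache_dir_problem()` is a hypothetical helper that is not part of the class, and the system temp directory stands in for whatever cache directory you configured.

```php
<?php
// Hypothetical helper, not part of class_http.php: reports an obvious
// reason why nothing could be written to a cache directory.
function cache_dir_problem(string $dir): ?string
{
    if (!is_dir($dir)) {
        return "not a directory: $dir";
    }
    if (!is_writable($dir)) {
        return "not writable by the PHP process: $dir";
    }
    return null; // no obvious problem found
}

// The temp directory is used here only as a stand-in for your cache directory.
$problem = cache_dir_problem(sys_get_temp_dir());
echo $problem ?? "cache directory looks usable", "\n";
```

If the helper reports a problem, fixing the directory permissions for the web server user is the first thing to try before digging into the class.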
Join Date: Jan 2008
Posts: 2
Reputation:
Solved Threads: 0
By using
    echo "<pre>";
    print_r($msft_stats);
    echo "</pre>";
or
    /*
    Static method table_into_xml()
    to parse the elements from allgame.com
    */
    function table_into_xml($rawHTML,$needle="",$needle_within=0,$allowedTags="") {
        if (!$aryTable = http::table_into_array($rawHTML,$needle,$needle_within,$allowedTags)) { return false; }
        // "\?\>" would print the backslashes literally, so the declaration ends with a plain "?>"
        $xml = "<?xml version=\"1.0\" standalone=\"yes\" ?>\n";
        $xml .= "<TABLE>\n";
        $rowIdx = 0;
        foreach ($aryTable as $row) {
            $xml .= "\t<ROW id=\"".$rowIdx."\">\n";
            $colIdx = 0;
            foreach ($row as $col) {
                // Escape markup characters in each cell before wrapping it in a COL element
                $xml .= "\t\t<COL id=\"".$colIdx."\">".trim(utf8_encode(htmlspecialchars($col)))."</COL>\n";
                $colIdx++;
            }
            $xml .= "\t</ROW>\n";
            $rowIdx++;
        }
        $xml .= "</TABLE>";
        return $xml;
    }
    }
In which location is the XML file created, or where can we see the array?
Thank you
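As posted, table_into_xml() only builds the XML in a string and returns it; nothing in that function writes a file, and print_r() just prints the array into the page output. If you want a file, you would have to save the returned string yourself. A minimal sketch, assuming you already hold the returned string in `$xml`; the sample XML and the output path below are stand-ins for illustration, not values from the class.

```php
<?php
// Stand-in for the string returned by http::table_into_xml(); the real
// call needs class_http.php and some fetched HTML.
$xml = "<?xml version=\"1.0\" standalone=\"yes\" ?>\n"
     . "<TABLE>\n\t<ROW id=\"0\">\n\t\t<COL id=\"0\">example</COL>\n\t</ROW>\n</TABLE>";

// Hypothetical output path; any location writable by PHP works.
$path = sys_get_temp_dir() . DIRECTORY_SEPARATOR . 'table.xml';

// file_put_contents() returns the number of bytes written, or false on failure.
$bytes = file_put_contents($path, $xml);
if ($bytes === false) {
    echo "Could not write $path\n";
} else {
    echo "Wrote $bytes bytes to $path\n";
}
```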
Hey Troy, if you check up on this: I come across a 302 status on one of the pages I try to scrape. I use the URLs the way they show up in the browser. The response header also contains a LoginRedir={stuffgoeshere}, and I haven't been able to find much on LoginRedir on Google. How does your class handle a 302 status? That is, does it follow the redirect? I assume it doesn't, because the body is not what I expect; it is something along the lines of "object moved". Maybe I have something set improperly? The LoginRedir=XXXX is in the Set-Cookie header, so the problem may also be the way my script handles the cookie after receiving it. It could be the data stored in the cookie, or it could be the class not following the 302. I will play with this some more. Let me know what you think.
The purpose of my existence is why I am here.
Just to let you know, I successfully did it with a cURL implementation, but I would still like to figure out how your class handles a 302 status. cURL has a boolean setting to follow redirects. Well, I will play with it some more. I would love to have a non-cURL implementation.
The purpose of my existence is why I am here.
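For a non-cURL approach, following a 302 by hand comes down to reading the Location header, carrying over any cookies from Set-Cookie, and issuing a second request. The sketch below covers only the header-parsing half and makes no claim about what class_http.php does internally; `redirect_target()` and `cookies_to_send()` are hypothetical helpers, and the sample headers are invented. It also assumes Location holds an absolute URL; a relative one would need resolving against the original URL.

```php
<?php
// Hypothetical helper, not part of class_http.php.
// Returns the redirect target of a 3xx response, or null if there is none.
function redirect_target(int $status, string $rawHeaders): ?string
{
    if ($status < 300 || $status >= 400) {
        return null;
    }
    // Header names are case-insensitive; match "Location:" at the start of a line.
    if (preg_match('/^Location:\s*(\S+)/mi', $rawHeaders, $m)) {
        return $m[1];
    }
    return null;
}

// Hypothetical helper: collects the name=value part of each Set-Cookie line
// into a single value suitable for a "Cookie:" request header.
function cookies_to_send(string $rawHeaders): string
{
    preg_match_all('/^Set-Cookie:\s*([^;\r\n]+)/mi', $rawHeaders, $m);
    return implode('; ', array_map('trim', $m[1]));
}

// Invented sample response headers for illustration.
$headers = "HTTP/1.1 302 Found\r\n"
         . "Location: http://example.com/login\r\n"
         . "Set-Cookie: LoginRedir=XXXX; path=/\r\n";

echo redirect_target(302, $headers), "\n";
echo cookies_to_send($headers), "\n";
```

With the target and the cookie string in hand, the second request can be made with whatever fetch mechanism you already use, sending the cookie value back in a Cookie header.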
Join Date: Jan 2009
Posts: 2
Reputation:
Solved Threads: 0
I've just loaded it up on a *nix test site and it works pretty well.
It's literally just a class to grab the content; everything else is down to the coder to parse using regex, etc. (Oh the joy! *sob*).
A couple of things I noticed at first, though:
1. The example script has two sites in it that you need to disable (comment out) before you run it (see the end of the code, from line 130 onward).
2. The whole table_into_array() example uses an old title. Change this:
$msft_stats = http::table_into_array($h->body, "Avg Daily Volume", 1, null);
to this:
$msft_stats = http::table_into_array($h->body, "Avg. Daily Vol.", 1, null);
... and you're laughing.
Anyhoo... this is useful for a project I've started looking at, so here's hoping it stays the course.
Last edited by WebSnail; Jan 8th, 2009 at 1:12 pm.
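Since the fix above amounts to the needle text no longer matching the page, it can help to confirm that the needle actually occurs in the fetched HTML before calling table_into_array(). A small sketch, assuming an exact, case-sensitive match is what matters (how the class itself matches the needle is not shown in this thread); `needle_present()` is a hypothetical helper and the HTML is a stand-in.

```php
<?php
// Hypothetical helper: true if the needle text occurs in the fetched HTML.
function needle_present(string $html, string $needle): bool
{
    return $needle !== '' && strpos($html, $needle) !== false;
}

// Stand-in for $h->body after a fetch.
$html = '<table><tr><td>Avg. Daily Vol.</td><td>1,234</td></tr></table>';

var_dump(needle_present($html, 'Avg Daily Volume')); // old title, not found
var_dump(needle_present($html, 'Avg. Daily Vol.'));  // new title, found
```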
Join Date: Jan 2009
Posts: 2
Reputation:
Solved Threads: 0
Realised there was a syntax error in image_cache.php as well.
Find:
$h->fetch($_GET['url'], $_GET['ttl'];);
Replace with:
$h->fetch($_GET['url'], $_GET['ttl']);
(or just delete the extra semicolon)