Scrape page content?

Question

brianjoe 0 Newbie Poster

14 Years Ago

Hi people!

I'm having an issue, and it is really bothering me.

I want to get some content (only 2 lines) from another site and paste it into my own site. The problem is that the content on the other site keeps changing, and I need it to be auto updated on my site.

How can I scrape these few lines from another site with PHP?

Thanks in advance.

php


All 11 Replies

mschroeder 251 Bestower of Knowledge

14 Years Ago

You have to first get the contents of that page. Some options for this would be cURL, file_get_contents, sockets, fopen/fread/fclose, etc.
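As a rough illustration of the first step, here is a minimal sketch using file_get_contents with a timeout. The helper name fetchPage, the user-agent string, and the timeout value are made up for this example and are not from the thread; cURL would work just as well.

```php
<?php
// Hypothetical helper (not from the thread): return the body of $url,
// or null if the request fails.
function fetchPage(string $url, int $timeoutSeconds = 10): ?string
{
    // These options only apply to http(s) URLs; other wrappers ignore them.
    $context = stream_context_create([
        'http' => [
            'timeout'    => $timeoutSeconds,
            'user_agent' => 'MyScraper/1.0', // placeholder value
        ],
    ]);

    // @ hides the warning on failure; the return value is checked instead.
    $body = @file_get_contents($url, false, $context);

    return $body === false ? null : $body;
}

// Usage (placeholder URL):
// $html = fetchPage('http://www.example.com/');
// if ($html === null) { /* handle the failure */ }
```

Returning null instead of false keeps the failure case easy to test for before any parsing starts.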

Once you have the content, you have a few options for extracting that particular piece of information. You can use a regular expression with the preg functions to match on the tags around it. You can also try loading it into SimpleXML and using XPath to select the node from the document; the drawback to this method is that SimpleXML does not like invalid documents (which most HTML is).
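The regular-expression option could look roughly like this. The sample markup and the id "stats" are assumptions standing in for a real page; the pattern would need adjusting to whatever tags surround the content you want.

```php
<?php
// Sample markup standing in for the fetched page (an assumption for illustration).
$html = '<html><body><div id="stats">1,024 members online</div></body></html>';

$stats = null;

// (.*?) is lazy, so it stops at the first closing </div>.
// The s flag lets . match newlines in case the content spans several lines.
// This breaks if the div contains nested divs, which is one reason a DOM parser is more robust.
if (preg_match('~<div id="stats">(.*?)</div>~s', $html, $match)) {
    $stats = trim(strip_tags($match[1]));
}

echo $stats, "\n";
```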

Or, my personal favorite: use the PHP DOM object (DOMDocument), set it to load HTML so it only throws warnings about invalid markup, suppress or trap the warnings so they're not displayed, and then use XPath to grab just the nodes that contain the content you want.
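One way to trap the warnings rather than suppress them with @ is libxml_use_internal_errors. This is only a sketch: the sample markup (with a deliberately unclosed p tag) and the id "stats" are assumptions for illustration.

```php
<?php
// Deliberately sloppy sample markup (unclosed <p>) standing in for a real page.
$html = '<html><body><div id="stats"><p>42 threads today</div></body></html>';

// Collect libxml warnings internally instead of printing them.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// Discard the collected warnings and restore the previous setting.
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Select every div whose id attribute is "stats".
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//div[@id="stats"]');

$texts = [];
foreach ($nodes as $node) {
    $texts[] = trim($node->textContent);
}

print_r($texts);
```

If you want to log the markup problems instead of discarding them, libxml_get_errors() returns them as an array before you call libxml_clear_errors().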

Once you have your scraped data, cache it so you don't need to make the request on every single page load, and you'll have a pretty effective solution.
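A very simple file-based cache might look like the sketch below. The helper name cachedValue, the cache file location, and the 300-second lifetime are all assumptions for this example; the closure stands in for the real fetch-and-extract step.

```php
<?php
// Hypothetical helper: return the cached string if it is still fresh,
// otherwise call $producer, store its result, and return that.
function cachedValue(string $cacheFile, int $maxAgeSeconds, callable $producer): string
{
    if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $maxAgeSeconds) {
        $cached = file_get_contents($cacheFile);
        if ($cached !== false) {
            return $cached;
        }
    }

    $fresh = $producer();

    // LOCK_EX keeps two simultaneous requests from writing the file at once.
    file_put_contents($cacheFile, $fresh, LOCK_EX);

    return $fresh;
}

// Usage: the closure is a stand-in for the scraping code.
$cacheFile = sys_get_temp_dir() . '/scrape-cache-example.txt';
$value = cachedValue($cacheFile, 300, function (): string {
    return 'scraped value';
});

echo $value, "\n";
```

With this in place, the remote site is only requested when the cache file is missing or older than the chosen lifetime.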

mschroeder 251 Bestower of Knowledge

14 Years Ago

In this example I load the HTML from the DaniWeb homepage and then query for ALL divs with id='stats', of which there should be only one. See the links below for some good crash-course tutorials on XPath.

<?php
$html = file_get_contents('http://www.daniweb.com');

$dom = new DOMDocument('1.0', 'iso-8859-1');

//Suppress any warnings from invalid html markup
@$dom->loadHTML( $html );
$xpath = new DOMXPath( $dom );
$query = '//div[@id="stats"]';

$nodes = $xpath->query( $query );
foreach( $nodes as $node ){
	echo $node->nodeValue;
}

XPath: http://www.w3schools.com/xpath/default.asp
DomDocument: http://www.php.net/manual/en/class.domdocument.php
DomXpath: http://www.php.net/manual/en/class.domxpath.php

mschroeder 251 Bestower of Knowledge

14 Years Ago

You can navigate to any node anywhere in an XML-based (HTML) document with XPath. But the path to get to that script node depends entirely on the rest of the document it is contained within.

For example, if it were the first script tag in the head: /html/head/script[1]. Once you have your selection narrowed down to that node, you can easily parse the resulting content for those specific values.
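Here is a sketch of that idea: select the script node with XPath, then pull the quoted key/value pairs out of its text with a regular expression. The sample page, its structure, and the values in it are assumptions for illustration; the real path depends on the actual document.

```php
<?php
// Sample page standing in for the real one; the structure is an assumption.
$html = <<<'HTML'
<html><head>
<script type="text/javascript">
  var flashvars = {
    'feedback_url':'http://site.com/test.php?id=1',
    'file': 'clips/trailer.flv',
  };
</script>
</head><body></body></html>
HTML;

// Load the markup while keeping libxml warnings off the page.
$previous = libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Narrow the selection down to the first script tag in the head.
$xpath  = new DOMXPath($dom);
$script = $xpath->query('/html/head/script[1]')->item(0);

$values = [];
if ($script !== null) {
    // Match each 'key': 'value' pair inside the script text.
    preg_match_all("~'(\w+)'\s*:\s*'([^']*)'~", $script->textContent, $pairs, PREG_SET_ORDER);
    foreach ($pairs as $pair) {
        $values[$pair[1]] = $pair[2];
    }
}

print_r($values);
```

If the script is not the first one in the head, an expression such as //script[contains(., "flashvars")] selects it by its content instead of its position.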

yoge911 1 Newbie Poster · Answer 1 · 2011-02-18T21:10:33+00:00

hi,
well, how frequently does the content change?

Just a thought: maybe you can set an approximate time interval for the page refresh using a meta tag, or use an AJAX request to refresh the page every X seconds.

brianjoe 0 Newbie Poster · Answer 2 · 2011-02-18T21:23:24+00:00

I know that I can make a page refresh, but that doesn't get me anywhere if I haven't imported the content from the other site into my own. How do I import that?

It's not my own site I want the content from, it's another site (maybe located in the US? I don't know). I heard that I could do a page scrape, but how?

brianjoe 0 Newbie Poster · Answer 3 · 2011-02-18T21:57:47+00:00

So it would be something like this? $html = file_get_html('http://www.google.com/'); But how do I use xpath (which I have never heard of) to get the specific data from that site?

brianjoe 0 Newbie Poster · Answer 4 · 2011-02-18T22:47:20+00:00

Ah okay, I'll read more about it, thanks! BUT I think there is a problem: the content I want to grab is not inside a div, but a table.

These are the lines I want:

<script type="text/javascript">
  var flashvars = {
    'feedback_url':'http://site.com/test.php?args=558065,31,1298047277:11xM1GGRWCA/trailer.flv',
    'file': '1298047277:11xM1GGRWCA/trailer.flv',
  };
</script>

feedback_url and file, as you can see above, are the ones I want.

brianjoe 0 Newbie Poster · Answer 5 · 2011-02-18T23:23:15+00:00

Thanks for your help, but I don't seem to get it. I guess I now know how difficult it would be.

For example, this is not working; it doesn't display anything:

<?php
$html = file_get_contents('http://www.daniweb.com');

$dom = new DOMDocument('1.0', 'iso-8859-1');

//Suppress any warnings from invalid html markup
@$dom->loadHTML( $html );
$xpath = new DOMXPath( $dom );
$query = '//html/head/script[1]';

$nodes = $xpath->query( $query );
foreach( $nodes as $node ){
	echo $node->nodeValue;
}
?>

mschroeder 251 Bestower of Knowledge Team Colleague · Answer 6 · 2011-02-19T00:12:49+00:00

You need to read about XPath if that is the route you want to take. It really isn't difficult, but your XPath expression up there is wrong, and it is not the one I posted previously.

Walk through the simple tutorials I posted earlier: http://www.w3schools.com/xpath/default.asp

brianjoe 0 Newbie Poster · Answer 7 · 2011-02-19T02:11:41+00:00

Thanks man, really appreciate it !

smantscheff 265 Veteran Poster · Answer 8 · 2011-02-19T20:10:07+00:00

You could also use a regular expression to extract the desired content:

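// $sourcecode is assumed to hold the HTML of the fetched page.
// Captures the two quoted values; \s* allows the line break and indentation between them.
if (preg_match( '~\'feedback_url\':\s*\'([^\']+)\',\s*\'file\':\s*\'([^\']+)\'~', $sourcecode, $match )) {
  $feedback_url = $match[1];
  $file = $match[2];
}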
