If I have an external website, how can I pull data from it? I need to pull data from the following HTML snippet:

<div class="headlinesBox">

    <div class="headline currentHeadline">
        <div class="headlinesClipping">
            <img src="/common/images/thumbnails/source/1320614405d.jpg" style="float: left; width: 262px; height: 236px;"/>
        </div>
        <div class="headlinesText">
            <h3><a href="/details/news/1325223/Template-Assisted_Fabrication_for_Polymer_Solar_Cells.html" title="Template-Assisted Fabrication for Polymer Solar Cells">Template-Assisted Fabrication for Polymer Solar Cells</a></h3>
            <p>
                Bulk heterojunction films with nanostructured donor/acceptor interfaces have been fabricated for photovoltaic devices by means of anodic aluminum oxide (AAO) templates.
                <!-- -->
            </p>
        </div>
    </div>

    <div class="headline additionalHeadline">
        <div class="headlinesClipping">
            <img src="/common/images/thumbnails/source/13215dec9a2.jpg" style="float: left; width: 156px; height: 236px;"/>
        </div>
        <div class="headlinesText">
            <h3><a href="/details/news/1328303/First_Understand_Absorber_Layers_Then_Improve_Solar_Cell_Efficiency.html" title="First Understand Absorber Layers, Then Improve Solar Cell Efficiency">First Understand Absorber Layers, Then Improve Solar Cell Efficiency</a></h3>
            <p>
                A thorough understanding of photovoltaic materials is crucial if thin-film solar cell efficiency is to be improved.
                <!-- -->
            </p>
        </div>
    </div>

I need to get the title, image location, and page location for both articles in that snippet and put them into an array. How can I do this?

Thanks,

James

Look up web scraping or site crawling. There are several ways to accomplish this, so first figure out what type of data you will be scraping from the site(s), then build something that fits that goal.

diafol

If your host allows file_get_contents() on external sites (some don't; check the allow_url_fopen setting in phpinfo()), use that to fetch the page, then use substr() or some of the preg functions to pull out the bits you need.
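As a rough sketch of the fetching step: the helper below tries file_get_contents() when allow_url_fopen is enabled and falls back to cURL if that extension is installed. The function name fetchHtml() is made up for illustration, and the 10-second timeout is an arbitrary choice.

```php
<?php
// Hypothetical helper: returns the page body as a string, or false on failure.
function fetchHtml(string $url)
{
    // file_get_contents() can only open remote URLs when allow_url_fopen is on.
    if (filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN)) {
        $html = @file_get_contents($url);
        if ($html !== false) {
            return $html;
        }
    }

    // Fall back to cURL if the extension is available.
    if (function_exists('curl_init')) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // give up after 10 seconds
        return curl_exec($ch);                          // string on success, false on failure
    }

    return false;
}
```

You would call it as `$html = fetchHtml('http://www.example.com/');` (placeholder URL) and check for `false` before going any further.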

Alternatively, cURL could fetch the page and XPath could extract the data from it.
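To give an idea of the XPath route, here is a minimal sketch using PHP's DOMDocument and DOMXPath on markup trimmed from the question. The function name extractHeadlines() and the array keys are my own choices, and it assumes every headline div contains one img and one h3 link.

```php
<?php
// Sample markup trimmed from the question; in practice $html would hold the fetched page.
$html = <<<'HTML'
<div class="headlinesBox">
  <div class="headline currentHeadline">
    <div class="headlinesClipping">
      <img src="/common/images/thumbnails/source/1320614405d.jpg"/>
    </div>
    <div class="headlinesText">
      <h3><a href="/details/news/1325223/Template-Assisted_Fabrication_for_Polymer_Solar_Cells.html" title="Template-Assisted Fabrication for Polymer Solar Cells">Template-Assisted Fabrication for Polymer Solar Cells</a></h3>
    </div>
  </div>
  <div class="headline additionalHeadline">
    <div class="headlinesClipping">
      <img src="/common/images/thumbnails/source/13215dec9a2.jpg"/>
    </div>
    <div class="headlinesText">
      <h3><a href="/details/news/1328303/First_Understand_Absorber_Layers_Then_Improve_Solar_Cell_Efficiency.html" title="First Understand Absorber Layers, Then Improve Solar Cell Efficiency">First Understand Absorber Layers, Then Improve Solar Cell Efficiency</a></h3>
    </div>
  </div>
</div>
HTML;

// Hypothetical helper: one array entry (title, image, page) per headline div.
function extractHeadlines(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // real-world HTML is rarely valid; silence parser warnings
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $articles = [];

    // Select divs whose class list contains the exact word "headline"
    // (so "headlinesBox" and "headlinesText" are not matched).
    $query = '//div[contains(concat(" ", normalize-space(@class), " "), " headline ")]';
    foreach ($xpath->query($query) as $node) {
        $link = $xpath->query('.//h3/a', $node)->item(0);
        $img  = $xpath->query('.//img', $node)->item(0);
        if ($link === null || $img === null) {
            continue; // skip blocks that do not have both pieces
        }
        $articles[] = [
            'title' => trim($link->textContent),
            'image' => $img->getAttribute('src'),
            'page'  => $link->getAttribute('href'),
        ];
    }

    return $articles;
}

print_r(extractHeadlines($html));
```

The paths returned are relative to the site root, so you would prepend the site's domain before using them elsewhere.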

I'm a little confused about what I would need to do to use preg_match. Or is that the wrong function?

Have a look at this: http://uk.php.net/manual/en/function.preg-match.php

In this example:

<?php
// get host name from URL
preg_match('@^(?:http://)?([^/]+)@i',
    "http://www.php.net/index.html", $matches);
$host = $matches[1];

// get last two segments of host name
preg_match('/[^.]+\.[^.]+$/', $host, $matches);
echo "domain name is: {$matches[0]}\n";
?>

How would I alter that for my needs? All the brackets and symbols are confusing.
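One possible adaptation, as a sketch rather than a robust solution: preg_match() stops at the first match, so preg_match_all() is the closer fit when there are several articles. In the patterns below, the `@` characters are just delimiters, each pair of parentheses captures a value, and `[^"]+` means "one or more characters that are not a double quote". This assumes the attributes always appear in the order shown in the question and that each article has exactly one image followed by one link; regular expressions on HTML break easily if the markup changes.

```php
<?php
// Markup trimmed from the question; in practice $html would hold the fetched page.
$html = <<<'HTML'
<div class="headlinesClipping">
<img src="/common/images/thumbnails/source/1320614405d.jpg" style="float: left; width: 262px; height: 236px;"/>
</div>
<h3><a href="/details/news/1325223/Template-Assisted_Fabrication_for_Polymer_Solar_Cells.html" title="Template-Assisted Fabrication for Polymer Solar Cells">Template-Assisted Fabrication for Polymer Solar Cells</a></h3>
<div class="headlinesClipping">
<img src="/common/images/thumbnails/source/13215dec9a2.jpg" style="float: left; width: 156px; height: 236px;"/>
</div>
<h3><a href="/details/news/1328303/First_Understand_Absorber_Layers_Then_Improve_Solar_Cell_Efficiency.html" title="First Understand Absorber Layers, Then Improve Solar Cell Efficiency">First Understand Absorber Layers, Then Improve Solar Cell Efficiency</a></h3>
HTML;

// $images[1] will hold every captured src value.
preg_match_all('@<img src="([^"]+)"@i', $html, $images);

// $links[1] will hold every href, $links[2] every title.
preg_match_all('@<h3><a href="([^"]+)" title="([^"]+)"@i', $html, $links);

// Pair the results up by position: first image with first link, and so on.
$articles = [];
foreach ($links[1] as $i => $href) {
    $articles[] = [
        'title' => $links[2][$i],
        'image' => $images[1][$i] ?? null, // null if an image is missing
        'page'  => $href,
    ];
}

print_r($articles);
```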

I have done quite a bit of screen scraping. Some of this has been run on a daily basis to extract data and move it to somewhere else. I decided to capture what I know about this in my help file. You can see it here.