
Using cURL to download an entire web page.

Hi,

I have been searching here and on Google for the past few days, but I haven't been able to find an answer.

I want a script that will download one page of a website along with all of its content, i.e. images, CSS, JS, etc.

I have been able to save the html (text) like this:

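// Fetch the body of $url as a string; returns false if the request fails.
function get_data($url)
{
	$ch = curl_init();
	$timeout = 5; // seconds allowed for establishing the connection
	curl_setopt($ch, CURLOPT_URL, $url);
	curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the response instead of printing it
	curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
	$data = curl_exec($ch);
	curl_close($ch);
	return $data;
}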

$returned_content = get_data('http://example.com/page.htm');
if ($returned_content === false) {
	die('Download failed');
}

$my_file = 'file.htm';
$handle = fopen($my_file, 'w') or die('Cannot open file:  '.$my_file);
fwrite($handle, $returned_content);
fclose($handle);

This saves a file called 'file.htm' containing all the HTML, but none of the images, CSS, JS, etc.

I have also been able to do this:

$img[]='http://example.com/image.jpg';

foreach($img as $i){
	save_image($i);
	if(getimagesize(basename($i))){
		echo 'Image ' . basename($i) . ' Downloaded OK';
	}else{
		echo 'Image ' . basename($i) . ' Download Failed';
	}
}

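function save_image($img,$fullpath='basename'){
	// Default: save under the image's own file name in the current directory
	if($fullpath=='basename'){
		$fullpath = basename($img);
	}
	$ch = curl_init($img);
	curl_setopt($ch, CURLOPT_HEADER, 0);
	curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the raw bytes as a string
	$rawdata = curl_exec($ch);
	curl_close($ch);
	// Remove any stale copy so the caller's getimagesize() check reflects this download
	if(file_exists($fullpath)){
		unlink($fullpath);
	}
	if($rawdata === false){
		return; // download failed: nothing to write
	}
	$fp = fopen($fullpath,'wb'); // binary-safe write
	if($fp === false){
		return; // could not create the file
	}
	fwrite($fp, $rawdata);
	fclose($fp);
}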

This will save that specific image but I haven't found anything that will save the entire HTML with all the content behind it.


Thanks for your help in advance!

jambla

It's a simple problem with a non-trivial solution.
How a browser works: it downloads the .html file (as you are doing with cURL), then *parses* that file into element nodes and linked resources (images, JavaScript, etc.), along with layout and all that jazz. It then requests each of these linked resources as individual files and uses them to build the display presented to the user.

The step you are missing is that second-to-last bit, "It then requests each of these linked resources as individual files...", and I think you recognize this.
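One practical detail of that step: the src and href values in a page are often relative, so they have to be turned into absolute URLs before cURL can request them. Here is a rough sketch of how that might look. The helper name resolve_url is made up for illustration, and it only covers the common cases (it does not collapse "../" segments, for instance).

```php
<?php
// Hypothetical helper: turn a src/href value into an absolute URL.
// Covers common cases only; it does not collapse "../" segments.
function resolve_url($base, $relative)
{
	// Already absolute (starts with a scheme such as http: or https:)
	if (preg_match('#^[a-z][a-z0-9+.\-]*:#i', $relative)) {
		return $relative;
	}

	$parts  = parse_url($base);
	$scheme = isset($parts['scheme']) ? $parts['scheme'] : 'http';
	$host   = isset($parts['host']) ? $parts['host'] : '';
	$port   = isset($parts['port']) ? ':' . $parts['port'] : '';
	$root   = $scheme . '://' . $host . $port;

	// Protocol-relative, e.g. //cdn.example.com/x.js
	if (substr($relative, 0, 2) === '//') {
		return $scheme . ':' . $relative;
	}
	// Root-relative, e.g. /js/app.js
	if (substr($relative, 0, 1) === '/') {
		return $root . $relative;
	}
	// Document-relative: resolve against the directory of the base path
	$path = isset($parts['path']) ? $parts['path'] : '/';
	$dir  = substr($path, 0, strrpos($path, '/') + 1);
	return $root . $dir . $relative;
}
```

Each value pulled out of the page would be passed through something like resolve_url('http://example.com/page.htm', $value) before being handed to a download function such as save_image().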

cURL doesn't have a built-in mechanism for parsing the page you request through it. You will need to pass the page's source to DOMDocument or another HTML/XML parser and pull out the types of linked resources you want from there. Alternatively, you can run a regex search over the page for strings such as "http" and "src" to extract the URLs that way.
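As a minimal sketch of the DOMDocument route, something like the following could collect the URLs to fetch. The function name extract_resource_urls and the choice of tags (img, script, and stylesheet link elements) are assumptions for illustration; resources referenced from inside CSS files or loaded by JavaScript would not be found this way.

```php
<?php
// Hypothetical helper: collect the URLs of linked resources from an HTML string.
function extract_resource_urls($html)
{
	if (trim($html) === '') {
		return array();
	}

	$doc = new DOMDocument();
	// Real-world HTML is rarely valid, so keep libxml's parse warnings quiet.
	$previous = libxml_use_internal_errors(true);
	$doc->loadHTML($html);
	libxml_clear_errors();
	libxml_use_internal_errors($previous);

	// tag name => attribute that holds the resource URL
	$sources = array('img' => 'src', 'script' => 'src', 'link' => 'href');

	$urls = array();
	foreach ($sources as $tag => $attr) {
		foreach ($doc->getElementsByTagName($tag) as $node) {
			// Among <link> elements, keep only stylesheets
			if ($tag === 'link' && stripos($node->getAttribute('rel'), 'stylesheet') === false) {
				continue;
			}
			$value = trim($node->getAttribute($attr));
			if ($value !== '') {
				$urls[] = $value;
			}
		}
	}
	// Drop duplicates and reindex
	return array_values(array_unique($urls));
}
```

The returned values are whatever appears in the markup, so relative paths still need to be made absolute before downloading. Each URL could then be saved with a function like the save_image() from the question.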

Cheers

langsor

© 2013 DaniWeb® LLC