I have been searching here and on Google for the past few days, but I haven't been able to find an answer.

I want a script that will download one page of a website along with all of its content, i.e. images, CSS, JS, etc.

I have been able to save the html (text) like this:

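function get_data($url)
{
	$ch = curl_init();
	$timeout = 5;
	curl_setopt($ch, CURLOPT_URL, $url);
	// Return the response as a string instead of printing it
	curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
	curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
	$data = curl_exec($ch);
	curl_close($ch);
	return $data;
}

$returned_content = get_data('');

$my_file = 'file.htm';
$handle = fopen($my_file, 'w') or die('Cannot open file:  '.$my_file);
fwrite($handle, $returned_content);
fclose($handle);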

This saves a file called 'file.htm' with all the HTML but no images, CSS, JS, etc.

I have also been able to do this:


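// $img is assumed to be an array of image URLs defined earlier
foreach($img as $i){
	if(save_image($i)){
		echo 'Image ' . basename($i) . ' Downloaded OK';
	}else{
		echo 'Image ' . basename($i) . ' Download Failed';
	}
}

function save_image($img,$fullpath='basename'){
	if($fullpath=='basename'){
		$fullpath = basename($img);
	}
	$ch = curl_init ($img);
	curl_setopt($ch, CURLOPT_HEADER, 0);
	curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
	curl_setopt($ch, CURLOPT_BINARYTRANSFER,1);
	$rawdata = curl_exec($ch);
	curl_close ($ch);
	if($rawdata === false){
		return false;
	}
	// Mode 'x' creates the file and fails if it already exists
	$fp = fopen($fullpath,'x');
	fwrite($fp, $rawdata);
	fclose($fp);
	return true;
}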

This will save that specific image, but I haven't found anything that will save the entire HTML page together with all the content behind it.

Thanks for your help in advance!

It's a simple problem with a non-trivial solution.
Here is how a browser works: it downloads the .html file (like you are doing in cURL). It then *parses* that file into element nodes and linked resources (like images, JavaScript, etc.) as well as layout and all that jazz. It then requests each of these linked resources as individual files and uses them in the display presented to the client.

It is that (second to) last bit which is the step you are missing: "It then requests each of these linked resources as individual files..." and I think you recognize this.

cURL doesn't have a built-in parsing mechanism for the page that you request through it. You will need to pass that page's source to a DOMDocument or another HTML/XML parser and then grab each type of linked resource you want from there (for example, the src attributes of img and script tags and the href attributes of link tags). Alternatively, you can run a regex search over the page for strings such as "http" and "src" to pull the URLs out that way.
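
As a rough illustration of the DOMDocument route, here is a minimal sketch rather than a finished solution. The helper names extract_resource_urls() and resolve_url() and the example.com URLs are made up for this example. It assumes you only care about img, script and stylesheet link tags, and its URL resolution is deliberately simplified: it ignores ports, the base tag, and "../" segments.

```php
<?php
// Sketch only: collect the URLs of resources linked from an HTML page.
// extract_resource_urls() and resolve_url() are hypothetical helper names.

// Turn a possibly relative URL into an absolute one (simplified).
function resolve_url($url, $base)
{
    // Already absolute (http:// or https://): use as-is.
    if (preg_match('#^https?://#i', $url)) {
        return $url;
    }
    $parts = parse_url($base);
    $root = $parts['scheme'] . '://' . $parts['host'];
    // Protocol-relative, e.g. //cdn.example.org/x.js
    if (substr($url, 0, 2) === '//') {
        return $parts['scheme'] . ':' . $url;
    }
    // Root-relative, e.g. /css/site.css
    if (substr($url, 0, 1) === '/') {
        return $root . $url;
    }
    // Document-relative: resolve against the directory of the base path.
    $path = isset($parts['path']) ? $parts['path'] : '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $root . $dir . $url;
}

// Parse the HTML and return absolute URLs of images, scripts and stylesheets.
function extract_resource_urls($html, $base_url)
{
    if (trim($html) === '') {
        return array();
    }
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings quietly.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    // tag name => attribute that holds the resource URL
    $sources = array('img' => 'src', 'script' => 'src', 'link' => 'href');
    $urls = array();
    foreach ($sources as $tag => $attr) {
        foreach ($doc->getElementsByTagName($tag) as $node) {
            // For <link>, keep only stylesheets.
            if ($tag === 'link' && strtolower($node->getAttribute('rel')) !== 'stylesheet') {
                continue;
            }
            $value = trim($node->getAttribute($attr));
            // Skip missing attributes and inline data: URIs.
            if ($value === '' || strpos($value, 'data:') === 0) {
                continue;
            }
            $urls[] = resolve_url($value, $base_url);
        }
    }
    return array_values(array_unique($urls));
}
```

Each URL returned could then be passed to a downloader such as the save_image() function from the question. To view the saved page offline, the links inside the saved HTML would most likely also need rewriting to point at the local copies, which this sketch does not attempt.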

