I've been researching this all day and am having trouble finding something that works. I have installed and used Sphider (PHP/MySQL); it crawls my site successfully, gives me the URLs, and lets me search for any text on any page, but it doesn't pick up the image filenames in the <img> tags. What I need is a way to get a list of the image filenames in the <img> tags, so that I can search for one (for example IMAGE_1234.JPG) and find the URL of the page that image appears on, out of perhaps 100 pages. Optionally, but not necessarily, I'd also like a list of all the images on the site with the full URL, for example:
mysite.com/aaa.html/image1.jpg
mysite.com/aaa.html/image491.jpg
mysite.com/bbb.html/image534.jpg
mysite.com/bbb.html/image123.jpg
What else I've tried:
simplehtmldom, and another script using curl (errors)
Some programs on the web: they retrieve the pictures but don't give me a list
Google image search for my site: it only has about 5% of my photos or less, and old versions
So basically I'm looking for a simple spider script in PHP (my hosting service does not support Perl) to get a list of all my images and their paths.
All the HTML pages are in the top-level directory, but the images are pulled from dozens of subfolders, each containing about 20 pictures, and there are almost 3000 JPG files. So if I'm looking for a specific picture, I need an easy way to search for it, or click on a link to see it (optional). One easy way would be to download a copy of my entire site to my computer and do a file search there, but I'd like to have a search box on the actual main web page for this.
Thanks...

OK, I have a start on something that works, using SimpleHTMLDom:

http://simplehtmldom.sourceforge.net/

This works and gives me a list of images for one page:

<?php
include_once('simple_html_dom.php');

// Create DOM from URL (the chosen page)
$page = 'http://viewoftheblue.com/photography/njrails09_1.html';
$html = file_get_html($page);
if (!$html) {
    die("Could not load $page");
}

// Collect the src attribute of every <img> tag
$images = array();
foreach ($html->find('img') as $element) {
    $images[] = $element->src;
}

// Print one image path per line
foreach ($images as $out) {
    echo htmlspecialchars($out) . "<br />\n";
}
?>

It also shows the subdirectory where the images are located (very cool), but so far I need to plug in each URL manually and run it again. So now I need to put together something that will find all the URLs (rather than the images) that appear on the main page as links, or further down the tree, put them in an array, and then go through that array of URLs and run the above as a sub-loop for each URL.
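
The "array of URLs, then a sub-loop" idea can be sketched as a breadth-first queue. This is only a rough sketch: crawl_pages(), $getLinks and $maxDepth are made-up names, and $getLinks stands in for whatever returns the hrefs on a page (for instance a wrapper around file_get_html() and find('a')). A small hard-coded link table replaces the real site so it runs without network access.

```php
<?php
// Breadth-first crawl sketch. $getLinks is a hypothetical callback that
// returns the hrefs found on one page.
function crawl_pages($start, callable $getLinks, $maxDepth = 2)
{
    $seen  = array($start => true);     // pages already queued
    $queue = array(array($start, 0));   // pairs of (url, depth)
    $pages = array();

    while ($queue) {
        list($url, $depth) = array_shift($queue);
        $pages[] = $url;
        if ($depth >= $maxDepth) {
            continue;                   // do not follow links any deeper
        }
        foreach ($getLinks($url) as $link) {
            if (!isset($seen[$link])) { // queue each page only once
                $seen[$link] = true;
                $queue[] = array($link, $depth + 1);
            }
        }
    }
    return $pages;
}

// Stand-in link table (invented page names), used instead of real fetching.
$site = array(
    'index.html' => array('a.html', 'b.html'),
    'a.html'     => array('b.html', 'c.html'),
    'b.html'     => array('index.html'),
    'c.html'     => array(),
);
$getLinks = function ($url) use ($site) {
    return isset($site[$url]) ? $site[$url] : array();
};

$pages = crawl_pages('index.html', $getLinks);
echo implode("\n", $pages), "\n";
```

The image-listing loop would then run once per entry in $pages. The $seen array is what keeps pages that link back to each other from being fetched over and over.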

OK, I took it one step further, and it's pretty much what I need now. I hard-code one URL and one base URL, and it lists all the links on the page, then goes into those links and lists all the images on the resulting pages in clickable form. A few more perks to make this a nice script would be to have it go into deeper levels and to exclude any offsite URLs, as well as mailto: links and other stuff like links to the hit counter site, Twitter, etc.
Feel free to add more ideas if you'd like.

<?php
include('simple_html_dom.php');

// Start page and the base URL that relative links and image paths hang off
$base = "http://viewoftheblue.com/photography/";
$page = $base . "rail.html";

// Create DOM from URL
$html = new simple_html_dom();
$html->load_file($page);

// Find all links (keep only the href of each <a> tag)
$links = array();
foreach ($html->find('a') as $element) {
    $links[] = $element->href;
}
echo "Links found on $page:<br /><br />";
foreach ($links as $out) {
    echo "$out<br />";
}
echo "<br />";

// Parse the individual linked pages for images
foreach ($links as $subpage) {
    // Create DOM from URL; skip pages that fail to load
    $page = $base . $subpage;
    $html = file_get_html($page);
    if (!$html) {
        echo "Could not load $page<br /><br />";
        continue;
    }

    // Find all images
    $images = array();
    foreach ($html->find('img') as $element) {
        $images[] = $element->src;
    }
    echo "Images found on $page:<br /><br />";
    foreach ($images as $out) {
        echo "<a href=\"$base$out\">$out</a><br />";
    }
    echo "<br />";
    $html->clear(); // free the DOM before loading the next page
}
?>
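
For the exclusions mentioned above (offsite URLs, mailto: and the like), one possible approach is a small filter that runs on each href before it is crawled. This is a sketch, not tested against the real site: crawlable_url() is a made-up name, the Twitter and mailto addresses are placeholders, and it assumes relative links are plain file names relative to $base, so root-relative paths (starting with "/") and "../" segments are not resolved.

```php
<?php
// Returns the absolute URL to crawl, or null when the link should be skipped.
// Assumption: relative hrefs are plain file names such as "njrails09_1.html".
function crawlable_url($href, $base)
{
    $href = trim((string) $href);
    if ($href === '' || $href[0] === '#') {
        return null;                    // empty link or in-page anchor
    }
    if (strpos($href, '//') === 0) {
        $href = 'http:' . $href;        // protocol-relative link
    }
    if (parse_url($href) === false) {
        return null;                    // seriously malformed URL
    }
    $scheme = parse_url($href, PHP_URL_SCHEME);
    if (!$scheme) {
        return $base . $href;           // relative link: prefix the base
    }
    $scheme = strtolower($scheme);
    if ($scheme !== 'http' && $scheme !== 'https') {
        return null;                    // mailto:, javascript:, ftp:, ...
    }
    // Exact host match, so "www." and non-"www." count as different sites.
    $host     = strtolower((string) parse_url($href, PHP_URL_HOST));
    $baseHost = strtolower((string) parse_url($base, PHP_URL_HOST));
    return $host === $baseHost ? $href : null;
}

$base  = 'http://viewoftheblue.com/photography/';
$hrefs = array(
    'njrails09_1.html',
    'mailto:someone@example.com',       // placeholder address
    'http://twitter.com/example',       // placeholder offsite link
    'http://viewoftheblue.com/photography/rail.html',
    '#top',
);
foreach ($hrefs as $href) {
    $url = crawlable_url($href, $base);
    echo $href, ' => ', ($url === null ? 'skipped' : $url), "\n";
}
```

In the script above, the second loop could call this on each href and skip the entry whenever it returns null, instead of building $base . $subpage directly.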