I'm creating a bot scraper that gathers information from other websites, and I am using the PHP Simple HTML DOM Parser to do it.

I have found a bug, though: I ran into one website that doesn't parse.

Here is a sample of the code that it cannot parse:

<div
class="header"><div
class="container"><ul
id="nav"><li><a
id="home" href="http://thinkclay.com"
class="selected" title="Return to the home page">Return to home</a></li><li><a
id="about" href="http://thinkclay.com/about"
title="Read more about Clay McIlrath">About Clay McIlrath</a></li><li><a
id="design" href="http://thinkclay.com/graphic-design"
title="View my Graphic Design Portfolio">Web Design Portfolio</a></li><li><a
id="development" href="http://thinkclay.com/web-development"
title="View my Web Development Portfolio">Web Development Portfolio</a></li><li><a
id="photography" href="http://thinkclay.com/photography"
title="View my Photography Portfolio">Photography Portfolio</a></li><li><a
id="wallpaper" href="http://thinkclay.com/desktop-wallpapers"
title="Download free desktop wallpapers">Free Desktop Wallpapers</a></li><li><a
id="wordpress" href="http://thinkclay.com/wordpress"
title="Download free wordpress themes">Free Wordpress Themes</a></li></ul><div
style="clear:both;"></div><p>My name is Clayton McIlrath and I am an entrepreneur currently living in CO. I personally enjoy the process of learning, exploring, and doing all things creative as well as sharing my experiences with others. Being an entrepreneur and <a
href="http://bychosen.com">business owner</a>, I hope that my experiences may help someone else start their own venture and find success and freedom as I have! Feel free to <a
href="http://bychosen.com/contact">contact me</a> anytime for questions or opportunities.</p> <a
class="close" href="#close" title="Close the Cloud"><img
src="http://thinkclay.com/wp-content/themes/thinkclay_v2/images/close.png" alt="close" /></a></div></div><div
class="container"> <a

It seems that in this markup there is a line break after each tag name and before the first attribute.

I have tried using str_replace and preg_replace to replace whitespace characters with a single space, and that still doesn't seem to work. Does anybody have any idea why this is happening and how I can fix it?
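
For reference, the whitespace normalization described above might look like the minimal sketch below. The helper name collapse_whitespace() is made up for illustration and is not part of Simple HTML DOM. Whether this step alone fixes the parsing problem is not established here, and it also collapses whitespace inside <pre> and <script> blocks, which may not be wanted.

```php
<?php
// Hypothetical helper (not part of Simple HTML DOM): collapse every run of
// whitespace, including newlines that fall inside a tag, into one space.
function collapse_whitespace(string $html): string
{
    // preg_replace() returns null on a regex error; fall back to the input.
    $collapsed = preg_replace('/\s+/', ' ', $html) ?? $html;
    return trim($collapsed);
}

// A fragment shaped like the sample above: a newline right after the tag name.
$sample = "<li><a\nid=\"home\" href=\"http://thinkclay.com\"\nclass=\"selected\">Return to home</a></li>";
echo collapse_whitespace($sample), "\n";
```

The cleaned string can then be handed to the parser in place of the raw response.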

Thanks


OK, so what is parsing it? I'm sure the page that won't parse is not the thing that's broken.

Hi,

Please hold on to this topic, and I will show how to scrape that thing down to its bone. I am in the middle of something really important.

Thanks, I really appreciate it a lot. Just as a side note once more:

I need to be able to parse any website that I throw at it.

I'm really after the meta description, keywords, title, images, and any canonical links or shortlinks.
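
To make that list concrete, here is a rough sketch of pulling those fields out of an HTML string. It deliberately uses PHP's bundled DOM extension (DOMDocument) rather than Simple HTML DOM, so it is a swapped-in technique and not the library this thread is about. The function name extract_page_info() and the sample markup are assumptions made up for illustration.

```php
<?php
// Sketch using PHP's bundled DOM extension instead of Simple HTML DOM.
// extract_page_info() is a hypothetical helper name.
function extract_page_info(string $html): array
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally; real-world markup is rarely valid.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $info = [
        'title' => null, 'description' => null, 'keywords' => null,
        'canonical' => null, 'shortlink' => null, 'images' => [],
    ];

    $titles = $doc->getElementsByTagName('title');
    if ($titles->length > 0) {
        $info['title'] = trim($titles->item(0)->textContent);
    }

    // <meta name="description"> and <meta name="keywords">
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        $name = strtolower($meta->getAttribute('name'));
        if ($name === 'description' || $name === 'keywords') {
            $info[$name] = $meta->getAttribute('content');
        }
    }

    // <link rel="canonical"> and <link rel="shortlink">
    foreach ($doc->getElementsByTagName('link') as $link) {
        $rel = strtolower($link->getAttribute('rel'));
        if ($rel === 'canonical' || $rel === 'shortlink') {
            $info[$rel] = $link->getAttribute('href');
        }
    }

    foreach ($doc->getElementsByTagName('img') as $img) {
        $info['images'][] = $img->getAttribute('src');
    }

    return $info;
}

// Made-up page with a newline after each tag name, like the sample above.
$html = "<html><head><title>Demo</title><meta\nname=\"description\" content=\"A demo page\"><link\nrel=\"canonical\" href=\"http://example.com/\"></head><body><img\nsrc=\"a.png\" alt=\"\"></body></html>";
print_r(extract_page_info($html));
```

Fields that are missing from the page stay null, so the caller can tell "not present" apart from an empty value.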

Thanks again

OK, sorry for the wait. To parse any website you want to throw at it, all you need to do is use the code as shown on SourceForge.

Please allow me to focus on your question above. This is how I would parse any website: I would use the object-oriented approach.

Save this as parsethis.php

<div class="header">
<div class="container">
<ul id="nav">
<li><a id="home" href="http://thinkclay.com" class="selected" title="Return to the home page">Return to home</a>
</li>
<li><a id="about" href="http://thinkclay.com/about" title="Read more about Clay McIlrath">About Clay McIlrath</a>
</li>
<li><a id="design" href="http://thinkclay.com/graphic-design" title="View my Graphic Design Portfolio">Web Design Portfolio</a>
</li>
<li><a id="development" href="http://thinkclay.com/web-development" title="View my Web Development Portfolio">Web Development Portfolio</a>
</li>
<li><a id="photography" href="http://thinkclay.com/photography" title="View my Photography Portfolio">Photography Portfolio</a></li>
<li><a id="wallpaper" href="http://thinkclay.com/desktop-wallpapers" title="Download free desktop wallpapers">Free Desktop Wallpapers</a></li>
<li><a id="wordpress" href="http://thinkclay.com/wordpress" title="Download free wordpress themes">Free Wordpress Themes</a></li>
</ul>
<div style="clear:both;"></div>
<p>My name is Clayton McIlrath and I am an entrepreneur currently living in CO. I personally enjoy the process of learning, exploring, and doing all things creative as well as sharing my experiences with others. Being an entrepreneur and <a
href="http://bychosen.com">business owner</a>, I hope that my experiences may help someone else start their own venture and find success and freedom as I have! Feel free to <a href="http://bychosen.com/contact">contact me</a> anytime for questions or opportunities.</p> <a class="close" href="#close" title="Close the Cloud"><img src="http://thinkclay.com/wp-content/themes/thinkclay_v2/images/close.png" alt="close" /></a>
</div>
</div>

Save this as your parser.php

<?php
require_once 'simplehtmldom/simple_html_dom.php';

$html = file_get_html('parsethis.php');

# let's find the ul with id "nav"
foreach ($html->find('ul[id=nav]') as $ul) {
    # for every matching ul, find its li elements
    foreach ($ul->find('li') as $li) {
        # for every li, find its a elements
        foreach ($li->find('a') as $item) {
            # output the full link
            echo $item . "<br/>";
            # the href attribute holds the URL
            echo $item->href . "<br/>";
            # the title attribute of the link
            echo $item->title . "<br/>";
        }
    }
}

# attempt to parse the only <p> on the page
foreach ($html->find('p') as $p) {
    echo $p . "<br/>";
}

# these two calls prevent memory leaks
$html->clear();
unset($html);
?>

For an extremely difficult website, you must use cURL, save the output as an HTML document, and then feed that document to the HTML DOM object.

Right now I am using cURL, but I'm not saving the response to a file. I skip that step and use the load() method in Simple HTML DOM.

I get the URL, fetch it with cURL, take the response, load it into the DOM parser object, and then attempt to parse it for the description, title, keywords, and images.
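
Roughly, that fetch-then-parse flow without an intermediate file can be sketched as follows. The helper names fetch_html() and html_to_dom() are hypothetical, and the sketch swaps in PHP's bundled DOMDocument so it runs without the Simple HTML DOM library; with Simple HTML DOM the fetched string would go to load() as described above. It assumes the cURL and DOM extensions are enabled.

```php
<?php
// fetch_html() and html_to_dom() are hypothetical helper names.

// Fetch a page into a string with cURL; nothing is written to disk.
function fetch_html(string $url): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_ENCODING       => '',   // accept and decode compressed responses
        CURLOPT_TIMEOUT        => 15,   // seconds
    ]);
    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('cURL request failed: ' . curl_error($ch));
    }
    return $body;
}

// Parse an HTML string in memory (DOMDocument stands in for Simple HTML DOM).
function html_to_dom(string $html): DOMDocument
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate invalid markup
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $doc;
}

// Offline demo with a literal string, so no network access is needed.
$doc = html_to_dom('<html><head><title>Offline demo</title></head><body></body></html>');
echo $doc->getElementsByTagName('title')->item(0)->textContent, "\n";

// Real usage would chain the two helpers (performs a network request):
// $doc = html_to_dom(fetch_html('http://thinkclay.com'));
```

Keeping the fetch and the parse in separate functions also leaves room to cache the fetched string later without touching the parsing code.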

For some reason it doesn't want to grab anything but the title from the website I posted.

Thank you for your response.

I'm just not sure why I would need to place it into a file first. I might want to cache it for a while, but that would be about it. I think my way saves an extra step and a little bit of memory.

Thanks again!
