<?php
include_once('simple_html_dom.php');

// Fetch the raw HTML of a page; returns false if the request fails.
function get_url_contents($url){
    $crl = curl_init();
    $timeout = 5; // seconds allowed for connecting
    curl_setopt($crl, CURLOPT_URL, $url);
    curl_setopt($crl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($crl, CURLOPT_CONNECTTIMEOUT, $timeout);
    $ret = curl_exec($crl);
    curl_close($crl);
    return $ret;
}

$url = 'http://books.rediff.com/categories';
$outhtml = get_url_contents($url);
if ($outhtml === false) {
    die('Could not fetch ' . $url);
}
$html = str_get_html($outhtml);
if (!$html) {
    die('Could not parse ' . $url);
}

// Collect every absolute http:// link on the page.
$urlarray = array();
foreach ($html->find('a') as $link) {
    if (strpos($link->href, 'http://') === 0) {
        $urlarray[] = $link->href;
    }
}
print_r($urlarray);
?>

This is a simple web crawler that extracts all the URLs on a page. I cannot understand how to apply BFS/DFS in this crawler.
Please help!

Last Post by Traevel

Hi,

I take it this is some sort of assignment, in which case my bet would be that they want you to find the URLs using BFS and DFS rather than simply dumping all link elements into an array. A flat list of links needs no traversal, but you could traverse the DOM in search of links with BFS or DFS. I presume the page with all the books is your data, in which case you could search the elements of the page for link elements in a structured traversal. Page elements can be seen as nodes in a (DOM) tree. Nodes have siblings and children, so with breadth first you visit all the siblings on a level before moving down to their children, whereas with depth first you go down through a node's children before moving on to its siblings.
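
To make the sibling/children difference concrete, here is a minimal sketch on a made-up page skeleton. The nested-array tree, the labels and the function names bfsOrder and dfsOrder are all my own assumptions for illustration, not part of simple_html_dom.

```php
<?php
// Hypothetical page skeleton: each node has a label and a list of children.
$page = [
    'label' => 'html',
    'children' => [
        ['label' => 'head', 'children' => [
            ['label' => 'title', 'children' => []],
        ]],
        ['label' => 'body', 'children' => [
            ['label' => 'div', 'children' => [
                ['label' => 'a#1', 'children' => []],
            ]],
            ['label' => 'p', 'children' => [
                ['label' => 'a#2', 'children' => []],
            ]],
        ]],
    ],
];

// Breadth first: finish a whole level (all siblings) before descending.
function bfsOrder(array $root): array {
    $order = [];
    $queue = [$root];                 // first in, first out
    while ($queue) {
        $node = array_shift($queue);  // take the oldest waiting node
        $order[] = $node['label'];
        foreach ($node['children'] as $child) {
            $queue[] = $child;        // children wait behind the siblings
        }
    }
    return $order;
}

// Depth first (pre-order): follow each child all the way down
// before moving on to its next sibling.
function dfsOrder(array $node): array {
    $order = [$node['label']];
    foreach ($node['children'] as $child) {
        $order = array_merge($order, dfsOrder($child));
    }
    return $order;
}

echo implode(' ', bfsOrder($page)), "\n"; // html head body title div p a#1 a#2
echo implode(' ', dfsOrder($page)), "\n"; // html head title body div a#1 p a#2
```

Both functions visit the same eight nodes; only the order differs.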

Take a look at the structure of a web page and draw a simple one out on paper (note how it can be made to look like a tree). Then figure out the steps you would have to take to traverse that structure in search of links with both methods. Once you find a link you can put it in a list, as your foreach loop does.

breadth first > http://upload.wikimedia.org/wikipedia/commons/3/33/Breadth-first-tree.svg
depth first > http://upload.wikimedia.org/wikipedia/commons/1/1f/Depth-first-tree.svg
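
Applied to real markup, the steps could look roughly like the sketch below. Note that I am using PHP's built-in DOMDocument here instead of simple_html_dom, purely so the example runs on its own; the sample markup, the example.com URLs and the collectLinks function are assumptions for illustration. The only thing that changes between the two methods is whether the next node is taken from the front (queue) or the back (stack) of the list of nodes still to visit.

```php
<?php
// Hypothetical sample markup standing in for the fetched page.
$sample = '<html><body>'
    . '<div><a href="http://example.com/a">A</a>'
    . '<p><a href="http://example.com/b">B</a></p></div>'
    . '<a href="http://example.com/c">C</a>'
    . '</body></html>';

// Collect the href of every <a> element under $root.
// $breadthFirst = true gives BFS order, false gives DFS order.
function collectLinks(DOMElement $root, bool $breadthFirst): array {
    $links = [];
    $frontier = [$root]; // nodes discovered but not yet visited
    while ($frontier) {
        // Queue behaviour (front) for BFS, stack behaviour (back) for DFS.
        $node = $breadthFirst ? array_shift($frontier) : array_pop($frontier);
        if ($node->tagName === 'a' && $node->hasAttribute('href')) {
            $links[] = $node->getAttribute('href');
        }
        $children = [];
        foreach ($node->childNodes as $child) {
            if ($child instanceof DOMElement) { // skip text and comment nodes
                $children[] = $child;
            }
        }
        // A stack hands back the last item first, so reverse the children
        // for DFS to keep them in document order.
        $next = $breadthFirst ? $children : array_reverse($children);
        foreach ($next as $child) {
            $frontier[] = $child;
        }
    }
    return $links;
}

libxml_use_internal_errors(true); // keep warnings about sloppy HTML quiet
$doc = new DOMDocument();
$doc->loadHTML($sample);

$bfsLinks = collectLinks($doc->documentElement, true);
$dfsLinks = collectLinks($doc->documentElement, false);
print_r($bfsLinks); // c, a, b: the shallow link is found first
print_r($dfsLinks); // a, b, c: the div is explored fully first
```

With simple_html_dom the same loop should carry over by walking each node's children instead of childNodes, but I have not tested that variant.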

As for storage: if you are looking for one specific link you would need something different than if you were collecting all links, but the general search for <a> elements would be the same.
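
One possible way to keep the search identical and vary only the storage is to let the traversal hand out links one at a time, so the caller decides whether to keep them all or stop at the first match. This is a sketch under my own assumptions (built-in DOMDocument rather than simple_html_dom, made-up markup, a hypothetical walkLinks generator), not the one right way to do it.

```php
<?php
// Hypothetical sample markup standing in for the fetched page.
$sample = '<html><body>'
    . '<div><a href="http://example.com/a">A</a>'
    . '<p><a href="http://example.com/b">B</a></p></div>'
    . '<a href="http://example.com/c">C</a>'
    . '</body></html>';

// Depth-first walk that yields each href lazily instead of storing it.
function walkLinks(DOMElement $node): Generator {
    if ($node->tagName === 'a' && $node->hasAttribute('href')) {
        yield $node->getAttribute('href');
    }
    foreach ($node->childNodes as $child) {
        if ($child instanceof DOMElement) {
            yield from walkLinks($child);
        }
    }
}

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($sample);

// Case 1: all links. Store everything the walk produces.
// The second argument (false) discards generator keys, which would
// otherwise collide across the nested "yield from" calls.
$all = iterator_to_array(walkLinks($doc->documentElement), false);

// Case 2: one specific link. Stop walking as soon as it turns up.
$wanted = null;
foreach (walkLinks($doc->documentElement) as $href) {
    if (strpos($href, '/b') !== false) {
        $wanted = $href;
        break; // the rest of the tree is never visited
    }
}

print_r($all);
var_dump($wanted);
```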

Traevel

