How to crawl web pages using BFS/DFS?

Question

apanimesh061 0 Junior Poster

12 Years Ago

<?php
include_once('simple_html_dom.php');

// Fetch the body of $url with cURL; returns false on failure.
function get_url_contents($url) {
    $crl = curl_init();
    $timeout = 5; // seconds to wait while connecting
    curl_setopt($crl, CURLOPT_URL, $url);
    curl_setopt($crl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($crl, CURLOPT_CONNECTTIMEOUT, $timeout);
    $ret = curl_exec($crl);
    curl_close($crl);
    return $ret;
}

$url = 'http://books.rediff.com/categories';
$outhtml = get_url_contents($url);
if ($outhtml === false) {
    die("Could not fetch $url");
}
$html = str_get_html($outhtml);
if (!$html) {
    die("Could not parse $url");
}

// Collect every absolute link (http or https) on the page.
$urlarray = array();
foreach ($html->find('a') as $link) {
    if (preg_match('#^https?://#i', $link->href)) {
        $urlarray[] = $link->href;
    }
}
print_r($urlarray);
?>

This is a simple web crawler in which I have extracted all the URLs on the page. I cannot understand how to apply BFS/DFS in this crawler.
Please help!

php

Traevel 216 Light Poster · Answer 1 · 2012-08-09T12:05:14+00:00

Hi,

I take it this is some sort of assignment, in which case my bet would be that they want you to find the URLs using BFS and DFS rather than simply dumping all link elements into an array. You could traverse the DOM in search of links with BFS or DFS, but a flat list of links needs no traversal. The page with all the books is your data, I presume, in which case you could search (all) the elements of the page for link elements in a structured traversal. Page elements can be seen as nodes in a (DOM) tree. You have siblings and children, so with breadth first you would visit siblings first, whereas with depth first you would go through the child nodes first.

Take a look at the structure of a web page and draw a simple one out on paper (note how it can be made to look like a tree). Then figure out the steps you would have to take to traverse that structure in search of links with both methods. Once you find a link you could put it in a list, as your foreach loop does.

breadth first > http://upload.wikimedia.org/wikipedia/commons/3/33/Breadth-first-tree.svg
depth first > http://upload.wikimedia.org/wikipedia/commons/1/1f/Depth-first-tree.svg
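
To make the two traversal orders concrete, here is a minimal sketch of what they could look like in PHP. It is an illustration under a few assumptions, not the only way to do it: it uses PHP's bundled DOM extension (DOMDocument) rather than simple_html_dom, the function names collect_links_bfs and collect_links_dfs are made up for this example, and the sample HTML is invented.

```php
<?php
// Sketch only: built on PHP's bundled DOM extension, not simple_html_dom.

// True when $node is an <a> element that carries an href attribute.
function is_link(DOMNode $node): bool {
    return $node instanceof DOMElement
        && strtolower($node->nodeName) === 'a'
        && $node->hasAttribute('href');
}

// Breadth first: a FIFO queue makes us finish one level before the next.
function collect_links_bfs(DOMNode $root): array {
    $links = [];
    $queue = new SplQueue();
    $queue->enqueue($root);
    while (!$queue->isEmpty()) {
        $node = $queue->dequeue();
        if (is_link($node)) {
            $links[] = $node->getAttribute('href');
        }
        // Children wait at the back of the queue, behind the node's siblings.
        for ($child = $node->firstChild; $child !== null; $child = $child->nextSibling) {
            $queue->enqueue($child);
        }
    }
    return $links;
}

// Depth first: recursion follows each branch to its end before the next sibling.
function collect_links_dfs(DOMNode $node, array &$links = []): array {
    if (is_link($node)) {
        $links[] = $node->getAttribute('href');
    }
    for ($child = $node->firstChild; $child !== null; $child = $child->nextSibling) {
        collect_links_dfs($child, $links);
    }
    return $links;
}

// Invented sample page: one link nested deep, one directly under <body>.
$doc = new DOMDocument();
$doc->loadHTML('<html><body>'
    . '<div><p><a href="/deep">deep</a></p></div>'
    . '<a href="/shallow">shallow</a>'
    . '</body></html>');

print_r(collect_links_bfs($doc)); // the shallow link is reached first
print_r(collect_links_dfs($doc)); // the deep link is reached first
```

Both functions find the same links; only the order differs, which is easy to check against the tree you drew on paper.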

Storage-wise, if you are looking for a specific link you would need something different than if you were looking for all links, but the general search for <a> elements would be the same.
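
As a rough illustration of that difference: when you only need one specific link, the traversal can return a single value and stop at the first match instead of filling a list. The sketch below assumes PHP's bundled DOM extension (DOMDocument) rather than simple_html_dom; find_first_link and the sample HTML are invented for this example.

```php
<?php
// Sketch only: depth-first search that stops at the first matching link.
// $matches is a caller-supplied test that receives the href string.
function find_first_link(DOMNode $node, callable $matches): ?string {
    if ($node instanceof DOMElement
        && strtolower($node->nodeName) === 'a'
        && $node->hasAttribute('href')
        && $matches($node->getAttribute('href'))) {
        return $node->getAttribute('href');
    }
    for ($child = $node->firstChild; $child !== null; $child = $child->nextSibling) {
        $found = find_first_link($child, $matches);
        if ($found !== null) {
            return $found; // no need to visit the rest of the tree
        }
    }
    return null; // nothing in this subtree matched
}

// Invented sample page with three links.
$doc = new DOMDocument();
$doc->loadHTML('<html><body>'
    . '<a href="/home">home</a>'
    . '<div><a href="/books/fiction">fiction</a></div>'
    . '<a href="/books/science">science</a>'
    . '</body></html>');

$first = find_first_link($doc, function ($href) {
    return strpos($href, '/books/') === 0;
});
var_dump($first);
```

The element test is the same as when collecting every link; only the return value and the early exit change.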

Traevel