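<?php
include_once('simple_html_dom.php');

// Fetch a page with cURL; returns the body as a string, or false on failure.
function get_url_contents($url){
    $crl = curl_init();
    $timeout = 5;
    curl_setopt($crl, CURLOPT_URL, $url);
    curl_setopt($crl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($crl, CURLOPT_CONNECTTIMEOUT, $timeout);
    $ret = curl_exec($crl);
    curl_close($crl);
    return $ret;
}

$url = 'http://books.rediff.com/categories/fiction-genres/2180204';
$outhtml = get_url_contents($url);
// str_get_html() returns false when it is given nothing it can parse.
$html = ($outhtml !== false) ? str_get_html($outhtml) : false;
if (!$html) {
    die('Could not fetch or parse ' . $url);
}

// Print every anchor on the page as a clickable link (href is now quoted).
foreach($html->find('a') as $link) {
    echo '<a href="' . $link->href . '">' . $link->href . '</a><br>';
}
?>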

This prints every link present on the given URL.
I want to remove all the duplicate entries, and also the JavaScript links that show up after crawling, such as "javascript:doSearch('MT');" and "javascript:window,history.go(-1);".
Please help!
Thanks.

Last Post by apanimesh061

Store your hrefs in an array and use [array_unique()](http://php.net/manual/en/function.array-unique.php).
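To make that suggestion concrete, here is a minimal sketch that also drops the `javascript:` links the question mentions. The helper name `clean_links` and the sample URLs are hypothetical and only there for illustration; the thread itself only names `array_unique`.

```php
<?php
// Hypothetical helper (not from the thread): drops empty hrefs and
// javascript: pseudo-links, then removes duplicates.
function clean_links(array $hrefs): array
{
    $kept = [];
    foreach ($hrefs as $href) {
        // Cast first: a missing href attribute may come back as false.
        $href = trim((string) $href);
        // Skip empty values and anything starting with "javascript:" (any case).
        if ($href === '' || stripos($href, 'javascript:') === 0) {
            continue;
        }
        $kept[] = $href;
    }
    // array_unique keeps the first occurrence; array_values reindexes from 0.
    return array_values(array_unique($kept));
}

// Made-up sample input, standing in for the hrefs collected by the crawler.
$sample = [
    'http://example.com/a',
    "javascript:doSearch('MT');",
    'http://example.com/a',
    'http://example.com/b',
];
print_r(clean_links($sample));
```

In the crawler above, one way to use it would be to collect each `$link->href` into an array inside the `foreach`, pass that array through `clean_links()`, and only then echo the result.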


After I extract all the URLs from the web page, how do I traverse them using BFS or DFS? Do I have to store them in a database and then traverse them from there?


Whether you store them in an array or in a database makes no difference to the choice between BFS and DFS: those are traversal algorithms for trees and graphs, not properties of where the data is stored.


Okay!
But if I do not store the URLs in a database, how will I traverse them with BFS/DFS?


I have been able to remove the duplicates from the crawled URLs.
But I cannot work out how to implement BFS/DFS traversal in this crawler.
At the moment I store all the crawled URLs in an array.
Do I have to store the URLs directly in a tree instead of an array?


Now that I can get all the URLs of a page, I want to traverse those URLs using BFS and DFS.
Please tell me how to do that in PHP. I have a feeling I am asking this question the wrong way, or maybe I am missing something!
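One possible way to approach this, sketched under assumptions rather than taken from any reply in the thread: treat each page as a node and each link as an edge, keep a plain array of URLs still to visit plus a set of URLs already seen, and take the next URL from the front of that array for BFS or from the end for a stack-based DFS. The function name `traverse_links`, its parameters, and the in-memory `$pages` map are all hypothetical; the map stands in for real pages so the sketch runs without any network access.

```php
<?php
// Hypothetical sketch: $get_links is any callable that takes a URL and
// returns an array of the URLs found on that page.
function traverse_links(string $start, callable $get_links, string $mode = 'bfs', int $max_pages = 50): array
{
    $frontier = [$start];          // URLs discovered but not yet visited
    $seen     = [$start => true];  // every URL ever added to the frontier
    $order    = [];                // URLs in the order they were visited

    while ($frontier && count($order) < $max_pages) {
        // BFS takes the oldest entry (queue); DFS takes the newest (stack).
        $url = ($mode === 'dfs') ? array_pop($frontier) : array_shift($frontier);
        $order[] = $url;

        foreach ($get_links($url) as $next) {
            // The seen-set stops duplicates and link cycles from looping forever.
            if (!isset($seen[$next])) {
                $seen[$next] = true;
                $frontier[] = $next;
            }
        }
    }
    return $order;
}

// Stand-in for real pages: a fixed map from URL to the links on that page.
$pages = [
    'A' => ['B', 'C'],
    'B' => ['D'],
    'C' => ['D', 'A'],
    'D' => [],
];
$lookup = function (string $url) use ($pages): array {
    return $pages[$url] ?? [];
};

echo implode(' ', traverse_links('A', $lookup, 'bfs')), "\n"; // A B C D
echo implode(' ', traverse_links('A', $lookup, 'dfs')), "\n"; // A C D B
```

In a real crawler, `$get_links` would presumably wrap `get_url_contents()` and `str_get_html()` from the first post and return the cleaned hrefs. Relative hrefs would need to be resolved to absolute URLs before being added to the frontier, and the `$max_pages` cap is there so the crawl stops even on a very large site.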
