<?php
include_once('simple_html_dom.php');

function get_url_contents($url){
    $crl = curl_init();
    $timeout = 5;
    curl_setopt ($crl, CURLOPT_URL,$url);
    curl_setopt ($crl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt ($crl, CURLOPT_CONNECTTIMEOUT, $timeout);
    $ret = curl_exec($crl);
    curl_close($crl);
    return $ret;
}

$url = 'http://books.rediff.com/categories/fiction-genres/2180204';
$outhtml = get_url_contents($url);
$html= str_get_html($outhtml);

foreach($html->find('a') as $link) {
    echo "<a href =".$link->href.">".$link->href."</a><br>";
}
?>

This gives all the links present on the given URL.
I wish to remove all the duplicate entries as well as those Javascript links that I get after crawling like "javascript:doSearch('MT'); javascript:window,history.go(-1);" ...
Please help!
Thanks ...

Recommended Answers

All 11 Replies

@priteas
Oh! Thanks ....
How should I remove entries like that are not valid urls or do not start with "http://..." ?

Bear in mind that URLs can be both absolute and relative. E.g.
http://example.com/my/page.html
/my/page.html
my/page.html

You may also need to account for mailto:, tel: and other link types too.

After I extract all the urls from the web page ... how do I traverse the urls using bfs or dfs ? Do I have to store them in a database and then traverse through them ??

Whether you store them in an array or in the database, there is no issue between BFS or DFS, because those two apply to binary trees or graphs.

Okkay!
Well, if I do not store the URLs in the database then how will I traverse them by BFS/DFS ?

I have been able to remove the duplicates in the urls crawled.
But I cannot understand that how should I implement BFS/DFS traversal in this crawler .... ?
Like I stored all the crawled arrays in an array ...
Do I have to store all the URLs directly into a tree instread of an array ???

For what reason do you keep referring to BFS/DFS ? It is not applicable in your current situation.

Now that I have at least got all the urls of a page, I just wish to traverse urls using BFS and DFS ....
Please tell me how do we do that in php ?? I have a feeling I am asking this question the wrong way or may be something is missing !!

Be a part of the DaniWeb community

We're a friendly, industry-focused community of developers, IT pros, digital marketers, and technology enthusiasts meeting, networking, learning, and sharing knowledge.