How to remove duplicate URLs from a crawled website?

Question

apanimesh061 0 Junior Poster

12 Years Ago

<?php
include_once('simple_html_dom.php');

function get_url_contents($url){
    $crl = curl_init();
    $timeout = 5;
    curl_setopt ($crl, CURLOPT_URL,$url);
    curl_setopt ($crl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt ($crl, CURLOPT_CONNECTTIMEOUT, $timeout);
    $ret = curl_exec($crl);
    curl_close($crl);
    return $ret;
}

$url = 'http://books.rediff.com/categories/fiction-genres/2180204';
$outhtml = get_url_contents($url);
$html= str_get_html($outhtml);

foreach($html->find('a') as $link) {
    // Quote the href attribute so URLs containing spaces or ">" do not break the markup.
    echo '<a href="'.$link->href.'">'.$link->href.'</a><br>';
}
?>

This prints all the links present on the given URL.
I want to remove all the duplicate entries, as well as the JavaScript links that the crawl returns, such as "javascript:doSearch('MT');" and "javascript:window,history.go(-1);".
Please help!
Thanks.

php


All 11 Replies

pritaeas 2,211 ¯\_(ツ)_/¯

12 Years Ago

Store your hrefs in an array and use this function.
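
The function linked in this reply is not named in the post. PHP's built-in array_unique is one function that fits the description, so the sketch below assumes it. The helper name unique_links and the sample data are hypothetical; in the crawler from the question, the input array would be filled with $link->href inside the foreach loop.

```php
<?php
// Hypothetical helper: collapse duplicate hrefs while keeping first-seen order.
function unique_links(array $hrefs): array
{
    // Trim whitespace so " /b" and "/b" count as the same entry.
    $trimmed = array_map('trim', $hrefs);
    // array_unique keeps the first occurrence of each value;
    // array_values reindexes the result from 0.
    return array_values(array_unique($trimmed));
}

// Sample data standing in for the hrefs collected from the page.
$sample = ['/a', '/b', '/a', ' /b', 'http://example.com/'];
print_r(unique_links($sample));
```

Note that this treats URLs as plain strings, so "http://example.com" and "http://example.com/" would still count as two different entries.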

pritaeas 2,211 ¯\_(ツ)_/¯

12 Years Ago

http://php.net/manual/en/function.strpos.php
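
A minimal sketch of how strpos could be used to drop the "javascript:" links. It uses stripos, the case-insensitive variant, on the assumption that the prefix may appear in mixed case; the helper name is_javascript_link and the sample hrefs are hypothetical.

```php
<?php
// Hypothetical helper: true when the href is a javascript: pseudo-link.
function is_javascript_link(string $href): bool
{
    // stripos returns the offset of the match, or false when there is none.
    // Comparing with === 0 requires the match to be at the very start and
    // avoids confusing offset 0 with false.
    return stripos(trim($href), 'javascript:') === 0;
}

// Sample data standing in for the hrefs collected from the page.
$hrefs = ["javascript:doSearch('MT');", 'http://example.com/', 'JavaScript:void(0)', '/my/page.html'];
$kept = [];
foreach ($hrefs as $href) {
    if (!is_javascript_link($href)) {
        $kept[] = $href;
    }
}
print_r($kept);
```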

pritaeas 2,211 ¯\_(ツ)_/¯

12 Years Ago

Whether you store them in an array or in a database, the choice between BFS and DFS does not come up, because those two traversals apply to trees or graphs.


apanimesh061 0 Junior Poster · Answer 1 · 2012-08-07T13:51:16+00:00

@pritaeas
Oh! Thanks.
How should I remove entries that are not valid URLs or do not start with "http://..."?

blocblue 238 Posting Pro in Training Featured Poster · Answer 2 · 2012-08-07T14:34:13+00:00

Bear in mind that URLs can be both absolute and relative. E.g.
http://example.com/my/page.html
/my/page.html
my/page.html

You may also need to account for mailto:, tel: and other link types too.
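
The cases listed above could be handled along these lines. This is a rough sketch, not a full URL resolver: the helper name normalize_href is hypothetical, it assumes the base is an absolute http(s) URL, and it does not handle "../" segments or query-only references.

```php
<?php
// Hypothetical helper: turn an href into an absolute http(s) URL, or return
// null for links to skip (mailto:, tel:, javascript:, "#fragment", empty).
function normalize_href(string $href, string $base): ?string
{
    $href = trim($href);
    if ($href === '' || $href[0] === '#') {
        return null;
    }
    // A leading "scheme:" marks an absolute URL or a non-web link type.
    if (preg_match('/^([a-z][a-z0-9+.-]*):/i', $href, $m)) {
        $scheme = strtolower($m[1]);
        return ($scheme === 'http' || $scheme === 'https') ? $href : null;
    }
    $parts = parse_url($base);
    $root = $parts['scheme'] . '://' . $parts['host'];
    if (isset($parts['port'])) {
        $root .= ':' . $parts['port'];
    }
    if (strpos($href, '//') === 0) {
        // Protocol-relative link: reuse the scheme of the base URL.
        return $parts['scheme'] . ':' . $href;
    }
    if ($href[0] === '/') {
        // Root-relative link such as /my/page.html.
        return $root . $href;
    }
    // Path-relative link such as my/page.html: keep the directory part of
    // the base path and append the href to it.
    $path = $parts['path'] ?? '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $root . $dir . $href;
}

$base = 'http://example.com/dir/index.html';
foreach (['http://example.com/my/page.html', '/my/page.html', 'my/page.html', 'mailto:someone@example.com'] as $href) {
    var_dump(normalize_href($href, $base));
}
```

Normalizing before removing duplicates means that an absolute and a relative link to the same page end up as the same string.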

apanimesh061 0 Junior Poster · Answer 3 · 2012-08-08T07:50:13+00:00

After I extract all the URLs from the web page, how do I traverse them using BFS or DFS? Do I have to store them in a database and then traverse through them?

apanimesh061 0 Junior Poster · Answer 4 · 2012-08-08T09:27:51+00:00

Okay!
Well, if I do not store the URLs in the database, then how will I traverse them by BFS/DFS?

pritaeas 2,211 ¯\_(ツ)_/¯ Moderator Featured Poster · Answer 5 · 2012-08-08T09:34:43+00:00

http://www.programmerinterview.com/index.php/data-structures/dfs-vs-bfs/

So if you want to use one of those methods, you will have to build a tree first. However, I do not see any relation between this and your original question. Care to explain?

apanimesh061 0 Junior Poster · Answer 6 · 2012-08-08T11:49:12+00:00

I have been able to remove the duplicates from the crawled URLs.
But I cannot understand how I should implement BFS/DFS traversal in this crawler.
So far I have stored all the crawled URLs in an array.
Do I have to store all the URLs directly in a tree instead of an array?

pritaeas 2,211 ¯\_(ツ)_/¯ Moderator Featured Poster · Answer 7 · 2012-08-08T12:01:57+00:00

For what reason do you keep referring to BFS/DFS? It is not applicable in your current situation.

apanimesh061 0 Junior Poster · Answer 8 · 2012-08-08T12:25:54+00:00

Now that I have at least got all the URLs of a page, I just wish to traverse those URLs using BFS and DFS.
Please tell me how to do that in PHP. I have a feeling I am asking this question the wrong way, or maybe something is missing!
