
How to remove duplicate URLs from a crawled website?

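<?php
include_once('simple_html_dom.php');

// Fetch the raw HTML of a page with cURL; returns false on failure.
function get_url_contents($url) {
    $crl = curl_init();
    $timeout = 5; // seconds to wait while connecting
    curl_setopt($crl, CURLOPT_URL, $url);
    curl_setopt($crl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($crl, CURLOPT_CONNECTTIMEOUT, $timeout);
    $ret = curl_exec($crl);
    curl_close($crl);
    return $ret;
}

$url = 'http://books.rediff.com/categories/fiction-genres/2180204';
$outhtml = get_url_contents($url);

// Stop early if the download or the parse failed.
if ($outhtml === false || !($html = str_get_html($outhtml))) {
    die('Could not fetch or parse ' . $url);
}

// Print every link found on the page, escaped and with a quoted href.
foreach ($html->find('a') as $link) {
    $href = htmlspecialchars($link->href);
    echo '<a href="' . $href . '">' . $href . '</a><br>';
}
?>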

This gives all the links present on the given URL.
I want to remove all the duplicate entries, as well as the JavaScript links that I get after crawling, such as "javascript:doSearch('MT');" and "javascript:window,history.go(-1);".
Please help!
Thanks.

apanimesh061

Store your hrefs in an array and use this function.
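
The reply does not name the function here. One built-in that fits is PHP's array_unique(); the following is a minimal sketch assuming that is acceptable, with a made-up $hrefs list standing in for the links collected in the crawl loop.

```php
<?php
// Hypothetical sample of hrefs, as they might be collected in the foreach loop.
$hrefs = [
    'http://example.com/a',
    'http://example.com/b',
    'http://example.com/a',
    "javascript:doSearch('MT');",
    'http://example.com/b',
];

// array_unique() keeps the first occurrence of each value but preserves the
// original keys, so array_values() is used to reindex the result from 0.
$unique = array_values(array_unique($hrefs));

print_r($unique);
```

Note that this only removes exact string duplicates; two different spellings of the same address (for example with and without a trailing slash) would both be kept.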

pritaeas

@pritaeas
Oh! Thanks.
How should I remove entries that are not valid URLs or do not start with "http://"?

apanimesh061
pritaeas

Bear in mind that URLs can be either absolute or relative, e.g.:
http://example.com/my/page.html
/my/page.html
my/page.html

You may also need to account for mailto:, tel: and other link types.
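
A minimal sketch of such a filter, assuming you want to keep absolute http/https URLs and relative paths, and drop everything with another scheme. The helper name is_crawlable_href() and the sample links are made up for illustration; relative URLs that pass the filter would still need to be resolved against the page URL before fetching.

```php
<?php
/**
 * Hypothetical helper: decide whether an href is worth crawling.
 * Keeps absolute http/https URLs and relative paths; drops javascript:,
 * mailto:, tel:, other schemes, fragment-only links and empty strings.
 */
function is_crawlable_href(string $href): bool
{
    $href = trim($href);
    if ($href === '' || $href[0] === '#') {
        return false;
    }
    // A scheme is a letter followed by letters, digits, "+", "." or "-",
    // then a colon. If one is present, only http and https are accepted.
    if (preg_match('/^([a-z][a-z0-9+.\-]*):/i', $href, $m)) {
        return in_array(strtolower($m[1]), ['http', 'https'], true);
    }
    // No scheme: treat the href as a relative URL.
    return true;
}

// Sample input mixing the link types mentioned in this thread.
$links = [
    'http://example.com/my/page.html',
    '/my/page.html',
    'my/page.html',
    "javascript:doSearch('MT');",
    'mailto:someone@example.com',
    'tel:+10000000000',
    '#top',
];

// array_filter() keeps the entries for which the callback returns true.
$crawlable = array_values(array_filter($links, 'is_crawlable_href'));

print_r($crawlable);
```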

blocblue

After I extract all the URLs from the web page, how do I traverse them using BFS or DFS? Do I have to store them in a database and then traverse them?

apanimesh061

Whether you store them in an array or in a database makes no difference to BFS or DFS, because those two apply to trees or graphs.

pritaeas

Okay!
Well, if I do not store the URLs in the database, how will I traverse them by BFS/DFS?

apanimesh061

http://www.programmerinterview.com/index.php/data-structures/dfs-vs-bfs/

So if you want to use one of those methods, you will have to build a tree first. However, I do not see any relation between this and your original question. Care to explain?
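
To make "build a tree first" concrete, here is a rough sketch under the assumption that the link structure is kept as a plain PHP array mapping each page to the pages it links to. The page names A to E and the function depth_first() are invented for the example; nothing here comes from the linked article.

```php
<?php
// Hypothetical link graph: each key is a page, each value the pages it links to.
$graph = [
    'A' => ['B', 'C'],
    'B' => ['D'],
    'C' => ['D', 'E'],
    'D' => [],
    'E' => ['A'], // a link back to the start, so this is a graph, not a tree
];

/**
 * Recursive depth-first walk: follow the first unvisited link as far as it
 * goes before coming back for the next one. $seen is shared by reference
 * across the recursion so that no page is visited twice.
 */
function depth_first(array $graph, string $page, array &$seen = []): array
{
    $seen[$page] = true;
    $order = [$page];
    foreach ($graph[$page] ?? [] as $next) {
        if (!isset($seen[$next])) {
            $order = array_merge($order, depth_first($graph, $next, $seen));
        }
    }
    return $order;
}

$order = depth_first($graph, 'A');
echo implode(' ', $order), "\n"; // A B D C E
```

Because pages on a real site link back to each other, the $seen check is what keeps the walk from looping forever.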

pritaeas

I have been able to remove the duplicates from the crawled URLs.
But I cannot understand how I should implement BFS/DFS traversal in this crawler.
I stored all the crawled URLs in an array.
Do I have to store the URLs directly in a tree instead of an array?

apanimesh061

Why do you keep referring to BFS/DFS? It is not applicable in your current situation.

pritaeas

Now that I have at least got all the URLs of a page, I just want to traverse them using BFS and DFS.
Please tell me how to do that in PHP. I have a feeling I am asking this question the wrong way, or maybe something is missing!

apanimesh061
Question Answered as of 9 Months Ago by pritaeas and blocblue
