Hi
How can I parse HTML files, for example daniweb.com, to:

  1. Extract all the links the page points to (the content of the href attribute)?

  2. List all of them after removing the duplicates (if two links point to the same page, only one should be displayed)?

I would be very thankful for your help.

All 2 Replies

Hi tony75,

To parse HTML in Perl, you will need to know how to use a few Perl modules. HTML, being what it is, is best not parsed with regular expressions. That said, there are a number of modules on CPAN (searchable through MetaCPAN) that can parse any HTML file.

I use HTML::TreeBuilder to parse HTML files, and LWP::UserAgent to fetch the HTML from the net.
Of course, if you already have the HTML file locally, you don't need LWP::UserAgent.
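
For the local case, here is a minimal sketch that parses HTML already held in a string, with no network access. The sample markup and the helper name extract_links are made up for illustration; new_from_content, find_by_tag_name, attr and as_text are the real HTML::TreeBuilder / HTML::Element methods.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical HTML standing in for the contents of a saved page.
my $html = <<'HTML';
<html><body>
<a href="/forums">Forums</a>
<a name="top">No href here</a>
<a href="/about">About</a>
</body></html>
HTML

# Return [text, href] pairs for every <a> element that has an href.
# (extract_links is an illustrative name, not part of any module.)
sub extract_links {
    my ($content) = @_;
    my $tree  = HTML::TreeBuilder->new_from_content($content);
    my @links = map  { [ $_->as_text, $_->attr('href') ] }
                grep { defined $_->attr('href') }
                $tree->find_by_tag_name('a');
    $tree->delete;    # free the tree explicitly
    return @links;
}

print "$_->[0] => $_->[1]\n" for extract_links($html);
```

For a file on disk, HTML::TreeBuilder->new_from_file takes a file name or filehandle in place of the string.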

With both modules, fetching and parsing a page looks like this:

#!/usr/bin/perl
use warnings;
use strict;
use utf8;
use LWP::UserAgent;
use HTML::TreeBuilder;

# Print each argument on its own line (a stand-in for say).
sub p {
    print "$_\n" for @_;
}

# The page may contain non-ASCII text, so encode output as UTF-8.
binmode STDOUT, ':encoding(UTF-8)';

my $website_to_check = 'http://www.daniweb.com';

my $ua   = LWP::UserAgent->new;            # the user agent (client)
my $resp = $ua->get($website_to_check);    # the HTTP response

if ( $resp->is_success ) {
    my $tree = HTML::TreeBuilder->new;
    $tree->parse( $resp->decoded_content );
    $tree->eof;
    _file_parser($tree);    # print every link found in the tree
    $tree->delete;
}
else {
    die $resp->status_line, "\n";
}

# Print "link text => href" for every <a> element that has an href.
sub _file_parser {
    my ($t) = @_;

    for my $link ( $t->find_by_tag_name('a') ) {
        my $href = $link->attr('href') or next;
        p join ' => ', $link->as_text, $href;
    }
}

NOTE:

  1. I customized printing to behave like the newer say feature by writing my own subroutine, sub p{...}; instead of print or say, I just use p.

  2. With LWP::UserAgent, you fetch the HTML page you want to parse, then test whether the request succeeded. If it did, load the content into HTML::TreeBuilder to parse it, as shown above. You may also need to understand HTML::Element, of which HTML::TreeBuilder is a subclass.

  3. Since the web page might contain Unicode text, it has to be handled properly. decoded_content turns the response body into Perl characters, and binmode ... makes sure they are written out in the right encoding (UTF-8). Note that use utf8 only tells Perl that the script's own source code is UTF-8; it does not affect the downloaded page.

  4. Unless otherwise stated, the other methods used come from HTML::TreeBuilder, a subclass of HTML::Element.

All that said, the above script can be shortened by using new_from_url, a constructor from HTML::TreeBuilder. It uses LWP::UserAgent to fetch the page, so LWP::UserAgent must already be installed on your system; HTML::TreeBuilder DOES NOT install it for you.
So the script becomes:

#!/usr/bin/perl
use warnings;
use strict;
use HTML::TreeBuilder 5 -weak;

my $base_url = 'http://www.daniweb.com';
my $tree = HTML::TreeBuilder->new_from_url($base_url);

for ( $tree->find_by_tag_name('a') ) {
    if ( my $str = $_->attr('href') ) {
        print $_->as_text,' => ',$str,$/;
    }
}

The second script does the same thing as the first, with fewer lines of code.
To get more out of this, read the documentation for these modules.

IMPORTANT: The two scripts in this post do what you want, except for making sure that the output is not repeated. You can do that by using each link as a key in a hash: since a hash can hold a given key only once, you can be sure no link will appear twice. That I believe you can do.
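
As a hint, the hash idea could be sketched as below. This is only one way to do it, not part of the scripts above: the helper name unique_links and the sample hrefs are made up for illustration, and the links are compared as plain strings, so /forums and http://www.daniweb.com/forums would still count as two different links.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the given links with duplicates removed, keeping first-seen order.
# (unique_links is an illustrative name, not part of any module.)
sub unique_links {
    my @links = @_;
    my %seen;    # keys are the links already passed through
    return grep { !$seen{$_}++ } @links;
}

# Hypothetical sample data standing in for the href values from the page.
my @hrefs = (
    '/forums',
    '/about',
    '/forums',
    'http://www.daniweb.com/',
    '/about',
);

print "$_\n" for unique_links(@hrefs);
```

In either script, you would collect the href values into an array instead of printing them inside the loop, then print the result of unique_links on that array.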

Hope this solves your problem.

Wonderful and thanks again for your help.
