Parse HTML?
I'm trying to convert HTML to plain text (remove all HTML tags).
I don't want to use a regex, so I tried the module HTML::Parser
and called its parse() method, but I got this error:
Undefined subroutine &main::start called at getwords.pl line 27.
In fact, I checked the module's source, HTML/Parser.pm, and there is no parse function.
Then I downloaded WWW::Mechanize and got the same error, since WWW::Mechanize uses HTML::Parser's parse function.
Really weird.
I already downloaded the source from CPAN ( http://search.cpan.org/dist/HTML-Parser/Parser.pm ) and the parse function is missing.
What can I do? And why is the function missing?
terabyte
Junior Poster in Training
68 posts since Oct 2010
Reputation Points: 10
Solved Threads: 4
Undefined subroutine &main::start called at getwords.pl line 27.
The error message says Perl can't find your start function. It doesn't mention a function named 'parse'. You need to define a subroutine named 'start' in your main package, because that is what the parser is trying to call.
For example, if you have an HTML file named 'VerySimpleFile.html' in your current working directory, the following script should run OK:
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser ();

# Create parser object; each handler refers to a sub defined below
my $p = HTML::Parser->new(
    api_version     => 3,
    start_h         => [\&start, "tagname, text"],
    text_h          => [\&text,  "text"],
    end_h           => [\&end,   "tagname, text"],
    marked_sections => 1,
);

# Parse directly from file (parse_file returns undef if it can't open it)
$p->parse_file('VerySimpleFile.html')
    or die "Can't open VerySimpleFile.html: $!";

# Called for every start tag
sub start {
    my ($tagname, $text) = @_;
    print "<!-- $tagname starts here................-->\n";
    print $text;
}

# Called for the text between tags
sub text {
    my $text = shift;
    print $text;
}

# Called for every end tag
sub end {
    my ($tagname, $text) = @_;
    print "\n<!-- $tagname ends here................-->\n";
    print $text;
}
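Since the original goal was plain text rather than annotated tags, here is a minimal sketch of that variation. It registers only a text handler, so tags are never printed. The helper name html_to_text is made up for this example and is not part of HTML::Parser; the 'dtext' argspec (entity-decoded text) and ignore_elements() come from the module's documentation. Treat it as a starting point, not a complete HTML-to-text converter (it does nothing about whitespace or block layout).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser ();

# html_to_text is a hypothetical helper name, not part of HTML::Parser.
sub html_to_text {
    my ($html) = @_;
    my $out = '';

    my $p = HTML::Parser->new(
        api_version => 3,
        # Only text events get a handler, so tags are silently skipped.
        # 'dtext' passes the text with entities such as &amp; decoded.
        text_h => [ sub { $out .= shift }, 'dtext' ],
    );

    # Drop <script> and <style> elements together with their contents.
    $p->ignore_elements(qw(script style));

    $p->parse($html);
    $p->eof;    # flush any text still buffered in the parser
    return $out;
}

# prints: Hello, world & friends
print html_to_text('<p>Hello, <b>world</b> &amp; friends</p>'), "\n";
```

To run it on a file instead of a string, you could read the file into a scalar first and pass that to the helper.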
d5e5
Practically a Posting Shark
810 posts since Sep 2009
Reputation Points: 159
Solved Threads: 159
The previous script uses the parse_file() method to automatically open and parse the entire file. If you want to parse a file that is already open (such as STDIN), or you want to parse only some records in a large file, you can use the parse() method instead. (I can't find it in the Parser.pm source file either, but that doesn't matter: parse() is implemented in the module's compiled XS code rather than in Perl, so it is there even though the .pm file has no 'sub parse'.) The following script loops through the input, either STDIN or any files named on the command line, parses it one line at a time, and prints some output.
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser ();

# Create parser object; each handler refers to a sub defined below
my $p = HTML::Parser->new(
    api_version     => 3,
    start_h         => [\&start, "tagname, text"],
    text_h          => [\&text,  "text"],
    end_h           => [\&end,   "tagname, text"],
    marked_sections => 1,
);

# Loop through STDIN (or files named on the command line) and parse each line.
while (<>) {
    $p->parse($_);
}
$p->eof;    # Tell Parser object we're finished; flushes any pending text
print "\n";

# Called for every start tag
sub start {
    my ($tagname, $text) = @_;
    print "<!-- $tagname starts here................-->\n";
    print $text;
}

# Called for the text between tags
sub text {
    my $text = shift;
    print $text;
}

# Called for every end tag
sub end {
    my ($tagname, $text) = @_;
    print "\n<!-- $tagname ends here................-->\n";
    print $text;
}
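If you want to convince yourself that parse() really exists even though Parser.pm has no 'sub parse', a small check along these lines may help. It is only a sketch: it asks Perl at run time whether the class can do each method, then feeds a document to parse() in arbitrary chunks (the chunk boundaries and the @seen array are made up for illustration) to show that a tag split across calls is still recognised.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser ();

# can() looks a method up at run time, wherever it was defined
# (plain Perl in the .pm file or compiled code loaded by the module).
for my $method (qw(parse parse_file eof)) {
    printf "%-10s %s\n", $method,
        HTML::Parser->can($method) ? 'available' : 'missing';
}

# Collect the names of the start tags the parser reports.
my @seen;
my $p = HTML::Parser->new(
    api_version => 3,
    start_h     => [ sub { push @seen, shift }, 'tagname' ],
);

# Feed the document in pieces; tags are deliberately split across calls.
$p->parse($_) for '<ht', 'ml><bo', 'dy><p>text</p></body></html>';
$p->eof;    # tell the parser there is no more input

# prints: start tags: html body p
print "start tags: @seen\n";
```

With a normal HTML::Parser install, all three methods should be reported as available.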
d5e5