Extract table from html, the none module way

Question

arishy 0 Newbie Poster

12 Years Ago

Here is part of the html

<body>
  <table>
    <tr>
      <td id="ctl00_ContentPlaceHolder1_TD_Unlimited_Usage_PerDay" class="form_label"
      colspan="2">
        <table width='100%' cellspacing='2' cellpadding='0' border='0'>
          <tbody>
            <tr>
              <td align='center' bgcolor='#E1E1E1'>Date</td>

              <td align='center' bgcolor='#E1E1E1'>Download Usage</td>

              <td align='center' bgcolor='#E1E1E1'>Upload Usage</td>

              <td align='center' bgcolor='#E1E1E1'>Total Usage</td>
            </tr>

            <tr>
              <td class='table_item' align='center' bgcolor='#EBEBEB'>2012-11-02</td>

              <td class='table_item' align='center' bgcolor='#EBEBEB'>1.269 GB</td>

              <td class='table_item' align='center' bgcolor='#EBEBEB'>0.286 GB</td>

              <td class='table_item' align='center' bgcolor='#EBEBEB'>1.554 GB</td>
            </tr>

            <tr>
              <td class='table_item' align='center' bgcolor='#EBEBEB'>2012-11-03</td>

              <td class='table_item' align='center' bgcolor='#EBEBEB'>2.602 GB</td>

              <td class='table_item' align='center' bgcolor='#EBEBEB'>0.152 GB</td>

              <td class='table_item' align='center' bgcolor='#EBEBEB'>2.754 GB</td>
            </tr>

Here is part of the html
Now I need to create hash with date as key and 3 values (download,upload,Total)
This hash will be used for furher processing.
Then I need to create csv from the hash to insert the data into a database (date,download,upload)
The problem I am facing is the html extracted from the web page is not formated in a logical way, by that I mean lines inside the file do not follow "structured" form. Also, I do not want to use any cpan module, because I feel more comfortable "driving" manual ...I hate automatic !!!! just kidding
So far, I have used many approaches one is to use regex !!! and I realized that I am creating problems rather than solving them. I can list some of these approaches, but may be a proven approach from you guys and gals can direct me to solve this problem.

html-css mysql perl

2 Contributors
8 Replies
2K Views
1 Day Discussion Span
Latest Post 12 Years Ago Latest Post by 2teez

All 8 Replies

2teez 43 Posting Whiz

12 Years Ago

Hello arishy,

To start with NEVER use regex to parse an html file(s). It might work for a line or two, but not a file.

Modern Perl Mastering encourages the use of tested and used modules in the cpan or other cpan mirrors. Why will you want to re-create the wheel, expect you find out that the wheel is not useful for you.
Checking through the code you presented above, I think the problem is that you don't know how to arrange your info into an hash.
I will show two ways:
a) Modifying your script, which I think is alot better in the sense that you don't have to parse the html data seperately. Which was what I did.
b) Parse the html file using any of the HTML Parser you know ( I used HTML::Parser ). Then, the rest is just text re-arrangement.

Lastly, to write your CSV use the module Text::CSV or Text::CSV_XS. This is left for the to figure out (it's Simple to get though).

Below are the two scripts.
Modified Script

use strict;
use warnings;
use HTML::TableExtract;

my $content;
open my $fh, '<:utf8', "foo.html" or die "can't open file:$!";
{
    local $/ = undef;    # slurp mode
    $content = <$fh>;    #the above line is important otherwise you get nothing
}
close $fh or die "can't close file: $!";
my $te = HTML::TableExtract->new();
$te->parse($content);

my %hash_to_use;        ## hash to use

foreach my $ts ( $te->tables() ) {
    foreach my $row ( $ts->rows() ) {
        my $key = shift(@$row);
        push @{ $hash_to_use{$key} }, @$row unless $key =~ m/Date|\s+/;
    }
}

# print out your result OR do other things
{
    local $" = "\t";
    print "Date\t\tDownload Usage\tUpload Usage\tTotal Usage\n";
    print $_, "\t", "@{$hash_to_use{$_}}", $/ for sort keys %hash_to_use;
}

The 2nd Solution

use warnings;
use strict;
use HTML::Parser;

my @array;

# use the HTML::Parser
my $html = HTML::Parser->new(
    api_version => 3,
    text_h      => [
        sub {
            $_[0] =~ s/^\s+|\s+$//;
            push @array, $_[0] if $_[0] ne "";
        },
        'dtext'
    ],
);

# read your html file and parse
open my $fh, '<:utf8', 'foo.html' or die $!;
$html->parse_file($fh);
close $fh or die $!;

my %arrang_data;    # hash to use
my $key;
for (@array) {
    if ( /\d{4}-/ or /Date/ ) {
        $key = $_;
    }
    else {
        push @{ $arrang_data{$key} }, $_;
    }
}

delete $arrang_data{'Date'};    # remove the date key/value from the hash

# print out your result OR do other things
{
    local $" = "\t";
    print "Date\t\tDownload Usage\tUpload Usage\tTotal Usage\n";
    print $_, "\t", "@{$arrang_data{$_}}", $/ for sort keys %arrang_data;
}

output:

Date        Download Usage  Upload Usage    Total Usage
2012-11-02  1.269 GB        0.286 GB        1.554 GB
2012-11-03  2.602 GB        0.152 GB        2.754 GB

Hope this helps

2teez 43 Posting Whiz

12 Years Ago

I will study your approaches and hopefully come back with "Modern" solution...

That is the spirit.....

Reply to this topic

Be a part of the DaniWeb community

We're a friendly, industry-focused community of developers, IT pros, digital marketers, and technology enthusiasts meeting, networking, learning, and sharing knowledge.

arishy 0 Newbie Poster · Answer 1 · 2012-11-13T15:58:28+00:00

Date    Download Usage  Upload Usage    Total Usage
2012-11-02  1.269 GB    0.286 GB    1.554 GB
2012-11-03  2.602 GB    0.152 GB    2.754 GB
2012-11-04  5.446 GB    0.51 GB 5.956 GB
2012-11-05  3.285 GB    0.21 GB 3.495 GB
2012-11-06  1.289 GB    0.323 GB    1.612 GB
2012-11-07  1.493 GB    0.085 GB    1.579 GB
2012-11-08  1.298 GB    0.077 GB    1.375 GB
2012-11-09  0.34 GB 0.029 GB    0.369 GB
2012-11-10  0.873 GB    0.206 GB    1.079 GB
2012-11-11  2.171 GB    0.081 GB    2.253 GB
2012-11-12  5.329 GB    0.229 GB    5.558 GB
2012-11-13  0.356 GB    0.014 GB    0.369 GB

Here is one attempt that will show you why I want to go the "hard way"

use strict;
use warnings;

use HTML::TableExtract;


my $content;
open (FILE, "<dany99.html");
open (REPORT, ">dany99.txt");
binmode FILE;

    local $/ = undef; # slurp mode
    $content = <FILE>;  ###the above line is important otherwise you get nothing


my $te = HTML::TableExtract->new();
$te->parse( $content );

foreach my $ts ( $te->tables() )
{
    foreach my $row ( $ts->rows() )
    {
        print REPORT join ( "\t", @$row ), "\n";
    }
}

The result from running this script is shown at the top (sorry about that)

May be the reason has nothing to do with the module used ( bad html coding ) , so I decided to go the manual way.

To recap, my request for help is stated in the initial post.

arishy 0 Newbie Poster · Answer 2 · 2012-11-14T04:04:36+00:00

Hello 2teez,
It is lovely to hear from you ...
I TOTALY agree with you( someone righty said "if you have a problem and thinking of using regex ...now you have two" This applies fully when you deal with html.
Point 1 ....The problem; in my case, I did not "trust" the html coding not the module. And I proved it with the above post.
Point 2 .... Yes I am learning, and the funny thing is; I used one of YOUR scripts here in this forum to get a handle on it and it worked. So, double thanks for that....

I will study your approaches and hopefully come back with "Modern" solution...

arishy 0 Newbie Poster · Answer 3 · 2012-11-14T11:20:25+00:00

Both approaches worked (of course !!).
As for the csv , I borrowed your one liner master piece, but again I need you to fine tune it

open  OUT , '>' , "tedatausagex.csv";
print OUT $_,"\t","@{$hash_to_use{$_}}", $/ for sort keys %hash_to_use;
close OUT;

What I am hoping for mysql to accept it, is:
"2012-10-02","1.111","0.222","1.333"

It will take me sometime to understand the csv module and hashes, so I will do my home work and come back with more questions to be answered.

When I have the courage I will research letting perl hash talk directly to mysql and cut the middle process. What do you think?

arishy 0 Newbie Poster · Answer 4 · 2012-11-15T05:01:09+00:00

I managed to get the csv partially right

for my $key ( sort keys %hash_to_use ) {
    print join( q{,}, "\"$key\"", @{ $hash_to_use{$key} } ), "\n"; 
}
### output
# "2012-11-02",1.269 GB,0.286 GB,1.554 GB
# "2012-11-03",2.602 GB,0.152 GB,2.754 GB
# "2012-11-04",5.446 GB,0.51 GB,5.956 GB

2teez 43 Posting Whiz · Answer 5 · 2012-11-15T06:11:01+00:00

A quick fix though ....

use warnings;
use strict;
use HTML::Parser;

my @array;

# use the HTML::Parser
my $html = HTML::Parser->new(
    api_version => 3,
    text_h      => [
        sub {
            $_[0] =~ s/^\s+|\s+$//;
            push @array, $_[0] if $_[0] ne "";
        },
        'dtext'
    ],
);

# read your html file and parse
open my $fh, '<:utf8', 'foo.html' or die $!;
$html->parse_file($fh);
close $fh or die $!;

my %arrang_data;    # hash to use
my $key;
for (@array) {
    if ( /\d{4}-/ or /Date/ ) {
        $key = $_;
    }
    else {
        if (/(.+?)\s+/) {
            push @{ $arrang_data{$key} }, $1;
        }
    }
}

delete $arrang_data{'Date'};    # remove the date key/value from the hash

 my $array_ref=[];
#A quick fix
for ( sort keys %arrang_data ) {
    print qq{\"$_\",}, map { qq{\"$_\",} } @{ $arrang_data{$_} };
    push @$array_ref,[$_, map { $_ } @{ $arrang_data{$_} }];
    print $/;
}

output ...

"2012-11-02","1.269","0.286","1.554",
"2012-11-03","2.602","0.152","2.754",

Check the for loops for changes.
Really, I think you need to check Text::CSV module when you have the time.
NOTE
I did change the workings of the first for loop, because you showed you didn't want the "GB's".

Head-Up
All you wanted as stated in your OP, can be achieved in a single Perl Script.

2teez 43 Posting Whiz · Answer 6 · 2012-11-15T08:17:41+00:00

Hi arishy,
Please remove line 39 and 43 in my last post before this. I only used them for a test. They are not needed in other for the script to work.
Thanks.

Extract table from html, the none module way

Recommended Answers Collapse Answers

All 8 Replies

Recommended Answers