Here is part of the html

<body>
  <table>
    <tr>
      <td id="ctl00_ContentPlaceHolder1_TD_Unlimited_Usage_PerDay" class="form_label"
      colspan="2">
        <table width='100%' cellspacing='2' cellpadding='0' border='0'>
          <tbody>
            <tr>
              <td align='center' bgcolor='#E1E1E1'>Date</td>

              <td align='center' bgcolor='#E1E1E1'>Download Usage</td>

              <td align='center' bgcolor='#E1E1E1'>Upload Usage</td>

              <td align='center' bgcolor='#E1E1E1'>Total Usage</td>
            </tr>

            <tr>
              <td class='table_item' align='center' bgcolor='#EBEBEB'>2012-11-02</td>

              <td class='table_item' align='center' bgcolor='#EBEBEB'>1.269 GB</td>

              <td class='table_item' align='center' bgcolor='#EBEBEB'>0.286 GB</td>

              <td class='table_item' align='center' bgcolor='#EBEBEB'>1.554 GB</td>
            </tr>

            <tr>
              <td class='table_item' align='center' bgcolor='#EBEBEB'>2012-11-03</td>

              <td class='table_item' align='center' bgcolor='#EBEBEB'>2.602 GB</td>

              <td class='table_item' align='center' bgcolor='#EBEBEB'>0.152 GB</td>

              <td class='table_item' align='center' bgcolor='#EBEBEB'>2.754 GB</td>
            </tr>

Here is part of the html
Now I need to create hash with date as key and 3 values (download,upload,Total)
This hash will be used for furher processing.
Then I need to create csv from the hash to insert the data into a database (date,download,upload)
The problem I am facing is the html extracted from the web page is not formated in a logical way, by that I mean lines inside the file do not follow "structured" form. Also, I do not want to use any cpan module, because I feel more comfortable "driving" manual ...I hate automatic !!!! just kidding
So far, I have used many approaches one is to use regex !!! and I realized that I am creating problems rather than solving them. I can list some of these approaches, but may be a proven approach from you guys and gals can direct me to solve this problem.

Date    Download Usage  Upload Usage    Total Usage
2012-11-02  1.269 GB    0.286 GB    1.554 GB
2012-11-03  2.602 GB    0.152 GB    2.754 GB
2012-11-04  5.446 GB    0.51 GB 5.956 GB
2012-11-05  3.285 GB    0.21 GB 3.495 GB
2012-11-06  1.289 GB    0.323 GB    1.612 GB
2012-11-07  1.493 GB    0.085 GB    1.579 GB
2012-11-08  1.298 GB    0.077 GB    1.375 GB
2012-11-09  0.34 GB 0.029 GB    0.369 GB
2012-11-10  0.873 GB    0.206 GB    1.079 GB
2012-11-11  2.171 GB    0.081 GB    2.253 GB
2012-11-12  5.329 GB    0.229 GB    5.558 GB
2012-11-13  0.356 GB    0.014 GB    0.369 GB

Here is one attempt that will show you why I want to go the "hard way"

use strict;
use warnings;

use HTML::TableExtract;


my $content;
open (FILE, "<dany99.html");
open (REPORT, ">dany99.txt");
binmode FILE;

    local $/ = undef; # slurp mode
    $content = <FILE>;  ###the above line is important otherwise you get nothing


my $te = HTML::TableExtract->new();
$te->parse( $content );

foreach my $ts ( $te->tables() )
{
    foreach my $row ( $ts->rows() )
    {
        print REPORT join ( "\t", @$row ), "\n";
    }
}

The result from running this script is shown at the top (sorry about that)

May be the reason has nothing to do with the module used ( bad html coding ) , so I decided to go the manual way.

To recap, my request for help is stated in the initial post.

Hello arishy,

To start with NEVER use regex to parse an html file(s). It might work for a line or two, but not a file.

  1. Modern Perl Mastering encourages the use of tested and used modules in the cpan or other cpan mirrors. Why will you want to re-create the wheel, expect you find out that the wheel is not useful for you.

  2. Checking through the code you presented above, I think the problem is that you don't know how to arrange your info into an hash.
    I will show two ways:
    a) Modifying your script, which I think is alot better in the sense that you don't have to parse the html data seperately. Which was what I did.
    b) Parse the html file using any of the HTML Parser you know ( I used HTML::Parser ). Then, the rest is just text re-arrangement.

Lastly, to write your CSV use the module Text::CSV or Text::CSV_XS. This is left for the to figure out (it's Simple to get though).

Below are the two scripts.
Modified Script

use strict;
use warnings;
use HTML::TableExtract;

my $content;
open my $fh, '<:utf8', "foo.html" or die "can't open file:$!";
{
    local $/ = undef;    # slurp mode
    $content = <$fh>;    #the above line is important otherwise you get nothing
}
close $fh or die "can't close file: $!";
my $te = HTML::TableExtract->new();
$te->parse($content);

my %hash_to_use;        ## hash to use

foreach my $ts ( $te->tables() ) {
    foreach my $row ( $ts->rows() ) {
        my $key = shift(@$row);
        push @{ $hash_to_use{$key} }, @$row unless $key =~ m/Date|\s+/;
    }
}

# print out your result OR do other things
{
    local $" = "\t";
    print "Date\t\tDownload Usage\tUpload Usage\tTotal Usage\n";
    print $_, "\t", "@{$hash_to_use{$_}}", $/ for sort keys %hash_to_use;
}

The 2nd Solution

use warnings;
use strict;
use HTML::Parser;

my @array;

# use the HTML::Parser
my $html = HTML::Parser->new(
    api_version => 3,
    text_h      => [
        sub {
            $_[0] =~ s/^\s+|\s+$//;
            push @array, $_[0] if $_[0] ne "";
        },
        'dtext'
    ],
);

# read your html file and parse
open my $fh, '<:utf8', 'foo.html' or die $!;
$html->parse_file($fh);
close $fh or die $!;

my %arrang_data;    # hash to use
my $key;
for (@array) {
    if ( /\d{4}-/ or /Date/ ) {
        $key = $_;
    }
    else {
        push @{ $arrang_data{$key} }, $_;
    }
}

delete $arrang_data{'Date'};    # remove the date key/value from the hash

# print out your result OR do other things
{
    local $" = "\t";
    print "Date\t\tDownload Usage\tUpload Usage\tTotal Usage\n";
    print $_, "\t", "@{$arrang_data{$_}}", $/ for sort keys %arrang_data;
}

output:

Date        Download Usage  Upload Usage    Total Usage
2012-11-02  1.269 GB        0.286 GB        1.554 GB
2012-11-03  2.602 GB        0.152 GB        2.754 GB

Hope this helps

Hello 2teez,
It is lovely to hear from you ...
I TOTALY agree with you( someone righty said "if you have a problem and thinking of using regex ...now you have two" This applies fully when you deal with html.
Point 1 ....The problem; in my case, I did not "trust" the html coding not the module. And I proved it with the above post.
Point 2 .... Yes I am learning, and the funny thing is; I used one of YOUR scripts here in this forum to get a handle on it and it worked. So, double thanks for that....

I will study your approaches and hopefully come back with "Modern" solution...

I will study your approaches and hopefully come back with "Modern" solution...

That is the spirit.....

Both approaches worked (of course !!).
As for the csv , I borrowed your one liner master piece, but again I need you to fine tune it

open  OUT , '>' , "tedatausagex.csv";
print OUT $_,"\t","@{$hash_to_use{$_}}", $/ for sort keys %hash_to_use;
close OUT;

What I am hoping for mysql to accept it, is:
"2012-10-02","1.111","0.222","1.333"

It will take me sometime to understand the csv module and hashes, so I will do my home work and come back with more questions to be answered.

When I have the courage I will research letting perl hash talk directly to mysql and cut the middle process. What do you think?

Edited 4 Years Ago by arishy

I managed to get the csv partially right

for my $key ( sort keys %hash_to_use ) {
    print join( q{,}, "\"$key\"", @{ $hash_to_use{$key} } ), "\n"; 
}
### output
# "2012-11-02",1.269 GB,0.286 GB,1.554 GB
# "2012-11-03",2.602 GB,0.152 GB,2.754 GB
# "2012-11-04",5.446 GB,0.51 GB,5.956 GB

A quick fix though ....

use warnings;
use strict;
use HTML::Parser;

my @array;

# use the HTML::Parser
my $html = HTML::Parser->new(
    api_version => 3,
    text_h      => [
        sub {
            $_[0] =~ s/^\s+|\s+$//;
            push @array, $_[0] if $_[0] ne "";
        },
        'dtext'
    ],
);

# read your html file and parse
open my $fh, '<:utf8', 'foo.html' or die $!;
$html->parse_file($fh);
close $fh or die $!;

my %arrang_data;    # hash to use
my $key;
for (@array) {
    if ( /\d{4}-/ or /Date/ ) {
        $key = $_;
    }
    else {
        if (/(.+?)\s+/) {
            push @{ $arrang_data{$key} }, $1;
        }
    }
}

delete $arrang_data{'Date'};    # remove the date key/value from the hash

 my $array_ref=[];
#A quick fix
for ( sort keys %arrang_data ) {
    print qq{\"$_\",}, map { qq{\"$_\",} } @{ $arrang_data{$_} };
    push @$array_ref,[$_, map { $_ } @{ $arrang_data{$_} }];
    print $/;
}

output ...

"2012-11-02","1.269","0.286","1.554",
"2012-11-03","2.602","0.152","2.754",

Check the for loops for changes.
Really, I think you need to check Text::CSV module when you have the time.
NOTE
I did change the workings of the first for loop, because you showed you didn't want the "GB's".

Head-Up
All you wanted as stated in your OP, can be achieved in a single Perl Script.

Hi arishy,
Please remove line 39 and 43 in my last post before this. I only used them for a test. They are not needed in other for the script to work.
Thanks.

Edited 4 Years Ago by 2teez

This question has already been answered. Start a new discussion instead.