I am trying to parse the HTML tags listed below to get the desired output shown at the end. So far I have tried `sed` and `grep`, but I was only able to extract the content between the opening and closing tags. I want to be more specific, as described below. Any help would be really appreciated. Thanks!

1. Filter:
   `<a href="http://www.gradle.org">Gradle 2.4</a> at Aug 8, 2015 6:38:46 PM</p>`
   --> I want to fetch the date and time out of this. My idea was to extract the text between `at` and `</p>`, but I am not sure if that is correct.

2. Filter `<div class="percent">50%</div>` to get that 50%
   --> My approach: if class="percent", extract the number from the tag

3. Filter `<a href="packages/com.pratik.testing.html">com.pratik.testing</a>` to get the package name, i.e. com.pratik.testing

4. Filter `<a href="classes/com.pratik.testing.UserTest.html">UserTest</a>`

   I want to fetch the test class that was run, i.e. UserTest (the content between the tags)

5. Filter time taken out of `<td>5.308s</td>`

6. Filter failure % out of `<td class="failures">50%</td>`

7. Filter the test that has failed, here "failingTest",
   out of `<h3 class="failures">failingTest</h3>`

I want to come up with a bash script that does all of the above filtering on index.html, where index.html contains the tags above plus additional tags that I do not want to filter or worry about.

Input:

    <!DOCTYPE html> <html> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/> <meta http-equiv="x-ua-compatible" content="IE=edge"/> <title>Test results - Test Summary</title> <link href="css/base-style.css" rel="stylesheet" type="text/css"/> <link href="css/style.css" rel="stylesheet" type="text/css"/> <script src="js/report.js" type="text/javascript"></script> </head> <body> <div id="content"> <h1>Test Summary</h1> <div id="summary"> <table> <tr> <td> <div class="summaryGroup"> <table> <tr> <td> <div class="infoBox" id="tests"> <div class="counter">2</div> <p>tests</p> </div> </td> <td> <div class="infoBox" id="failures"> <div class="counter">1</div> <p>failures</p> </div> </td> <td> <div class="infoBox" id="ignored"> <div class="counter">0</div> <p>ignored</p> </div> </td> <td> <div class="infoBox" id="duration"> <div class="counter">5.308s</div> <p>duration</p> </div> </td> </tr> </table> </div> </td> <td> <div class="infoBox failures" id="successRate"> <div class="percent">50%</div> <p>successful</p> </div> </td> </tr> </table> </div> <div id="tabs"> <ul class="tabLinks"> <li> <a href="#tab0">Failed tests</a> </li> <li> <a href="#tab1">Packages</a> </li> <li> <a href="#tab2">Classes</a> </li> </ul> <div id="tab0" class="tab"> <h2>Failed tests</h2> <ul class="linkList"> <li> <a href="classes/com.pratik.testing.UserTest.html">UserTest</a>.
    <a href="classes/com.pratik.testing.UserTest.html#failingTest">failingTest</a> </li> </ul> </div> <div id="tab1" class="tab"> <h2>Packages</h2> <table> <thead> <tr> <th>Package</th> <th>Tests</th> <th>Failures</th> <th>Ignored</th> <th>Duration</th> <th>Success rate</th> </tr> </thead> <tbody> <tr> <td class="failures"> <a href="packages/com.pratik.testing.html">com.pratik.testing</a> </td> <td>2</td> <td>1</td> <td>0</td> <td>5.308s</td> <td class="failures">50%</td> </tr> </tbody> </table> </div> <div id="tab2" class="tab"> <h2>Classes</h2> <table> <thead> <tr> <th>Class</th> <th>Tests</th> <th>Failures</th> <th>Ignored</th> <th>Duration</th> <th>Success rate</th> </tr> </thead> <tbody> <tr> <td class="failures"/> <a href="classes/com.pratik.testing.UserTest.html">com.pratik.testing.UserTest</a> <td>2</td> <td>1</td> <td>0</td> <td>5.308s</td> <td class="failures">50%</td> </tr> </tbody> </table> </div> </div> <div id="footer"> <p> <div> <label class="hidden" id="label-for-line-wrapping-toggle" for="line-wrapping-toggle">Wrap lines
    <input id="line-wrapping-toggle" type="checkbox" autocomplete="off"/> </label> </div>Generated by 
    <a href="http://www.gradle.org">Gradle 2.4</a> at Aug 8, 2015 6:38:46 PM</p> </div> </div> </body> </html>



Desired output (JSON):

    Aug 8 2015 6:38:46 PM, 50%, com.pratik.testing, UserTest, 5.308s, failingTest,..

Hello,

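You could do this in bash by pulling out the lines you want with grep and then parsing them with awk instead of sed. For example, assuming the Gradle link starts its own line as in your input, this prints the date and time (the `sub` call strips the trailing `</p>` from the last field):

    grep "http://www.gradle.org" index.html | awk '{ sub(/<\/p>.*/, "", $9); print $5, $6, $7, $8, $9 }'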
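Counting space-separated fields such as `$5` to `$9` breaks as soon as the number of words before the date changes. A possible alternative, sketched below, is to tell awk to split on the surrounding markup instead. The `footer` variable is a made-up one-line sample standing in for the footer line of index.html, and the sketch assumes that no other `</p>` appears earlier on that line.

```shell
#!/usr/bin/env bash
# Hypothetical sample standing in for the footer line of index.html.
footer='<a href="http://www.gradle.org">Gradle 2.4</a> at Aug 8, 2015 6:38:46 PM</p> </div>'

# Use "</a> at " or "</p>" as the field separator. The timestamp is then
# field 2, however many words precede it on the line.
timestamp=$(printf '%s\n' "$footer" | awk -F'</a> at |</p>' '/gradle\.org/ { print $2 }')
echo "$timestamp"
```

To run it against the real file, replace the `printf` with `awk ... index.html`.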

But you would be better off using something like Perl to pull out the data you want. You can execute Perl scripts just like bash scripts by putting the shebang line

    #!/usr/bin/perl

at the top of the script and then making the file executable:

    chmod 755 myscript.pl

Below is the beginnings of a Perl script that searches each line of index.html for specific strings and, when it finds one, captures the value you want with a regular expression, so it still works when many tags share one line, as in your input. I commented out (##) a couple of print statements that I used to help in debugging the script.

    #!/usr/bin/perl

    use 5.006;
    use strict;
    use warnings;

    my $filter_one = '';    # date and time of the report
    my $filter_two = '';    # overall success percentage

    open(my $testfile, '<', 'index.html') or die "Could not open index.html: $!";

    while (my $record = <$testfile>) {
        chomp($record);

        # Date and time: the text between "</a> at " and "</p>" after the Gradle link.
        if ($record =~ m{www\.gradle\.org">[^<]*</a>\s+at\s+([^<]+?)\s*</p>}) {
            ##print "Matched one \n\n";
            $filter_one = $1;
        }

        # Success rate: the content of <div class="percent">...</div>.
        if ($record =~ m{class="percent">\s*([^<]+?)\s*</div>}) {
            ##print "Matched two \n\n";
            $filter_two = $1;
        }
        ##print "$record \n";
    }
    close($testfile);

    print "$filter_one, $filter_two\n";
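
Since the question asks for a bash script, here is a rough sketch of how the same idea might look with `grep -o` and `sed`. The function name `extract_report`, the helper `strip_tags`, and the trimmed sample file are all made up for illustration. It assumes a grep that supports `-o` (GNU and BSD grep do) and uses only the first match of each pattern, so treat it as a starting point and not as a real HTML parser.

```shell
#!/usr/bin/env bash
# Sketch only: regex-based extraction from a Gradle test report.

# Remove every <...> tag from stdin, leaving only the text content.
strip_tags() { sed 's/<[^>]*>//g'; }

# Print "date, success rate, package, class, duration, failed test" for one file.
# Only the first match of each pattern is used.
extract_report() {
    local file=$1 when percent package class duration failed

    # Text between "</a> at " and "</p>" after the Gradle link; commas are
    # dropped so the date does not clash with the field separator.
    when=$(sed -n 's|.*gradle\.org">[^<]*</a> at \([^<]*\)</p>.*|\1|p' "$file" | head -n 1 | tr -d ',')
    percent=$(grep -o '<div class="percent">[^<]*</div>' "$file" | head -n 1 | strip_tags)
    package=$(grep -o '<a href="packages/[^"]*">[^<]*</a>' "$file" | head -n 1 | strip_tags)
    # Links into classes/ without a "#" are class links; keep the simple class name.
    class=$(grep -o '<a href="classes/[^"#]*">[^<]*</a>' "$file" | head -n 1 | strip_tags | sed 's/.*\.//')
    duration=$(grep -o '<td>[0-9.][0-9.]*s</td>' "$file" | head -n 1 | strip_tags)
    # Links into classes/ with a "#method" fragment point at failed tests.
    failed=$(grep -o '<a href="classes/[^"]*#[^"]*">[^<]*</a>' "$file" | head -n 1 | strip_tags)

    printf '%s, %s, %s, %s, %s, %s\n' "$when" "$percent" "$package" "$class" "$duration" "$failed"
}

# Hypothetical trimmed-down sample standing in for index.html.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<div class="infoBox failures" id="successRate"> <div class="percent">50%</div> <p>successful</p> </div>
<li> <a href="classes/com.pratik.testing.UserTest.html">UserTest</a>.
<a href="classes/com.pratik.testing.UserTest.html#failingTest">failingTest</a> </li>
<tr> <td class="failures"> <a href="packages/com.pratik.testing.html">com.pratik.testing</a> </td> <td>2</td> <td>1</td> <td>0</td> <td>5.308s</td> <td class="failures">50%</td> </tr>
<a href="http://www.gradle.org">Gradle 2.4</a> at Aug 8, 2015 6:38:46 PM</p> </div>
EOF

result=$(extract_report "$sample")
echo "$result"
rm -f "$sample"
```

For the real report you would call `extract_report index.html`. Note that index.html marks the failed test with a `#failingTest` link, not with the `<h3 class="failures">` tag, which is why the sketch matches on the link.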