I am trying to filter/parse the HTML tags listed below to get the desired output shown at the end. So far I have tried `sed` and `grep`, but I was only able to extract the content between the opening and closing tags. I want to be more specific, as described below. Any help would be really appreciated. Thanks!

1. Filter:
   `<a href="http://www.gradle.org">Gradle 2.4</a> at Aug 8, 2015 6:38:46 PM</p>`
   --> I want to fetch the date and time out of this. (My idea was to extract the text between `at` and `</p>`, but I am not sure that is correct.)

2. Filter `<div class="percent">50%</div>` to get that 50%
   --> My approach: if the tag has `class="percent"`, extract the number from it.

3. Filter `<a href="packages/com.pratik.testing.html">com.pratik.testing</a>` to get the package name, i.e. `com.pratik.testing`

4. Filter `<a href="classes/com.pratik.testing.UserTest.html">UserTest</a>` to fetch the test class that was run, i.e. `UserTest` (the content between the tags)

5. Filter time taken out of `<td>5.308s</td>`

6. Filter failure % out of `<td class="failures">50%</td>`

7. Filter the test that has failed (here it is "failingTest")
   out of `<h3 class="failures">failingTest</h3>`

I want to come up with a bash script that does all of the above filtering on index.html. The file contains the tags above plus additional tags that I do not want to filter or worry about:


    <!DOCTYPE html> <html> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/> <meta http-equiv="x-ua-compatible" content="IE=edge"/> <title>Test results - Test Summary</title> <link href="css/base-style.css" rel="stylesheet" type="text/css"/> <link href="css/style.css" rel="stylesheet" type="text/css"/> <script src="js/report.js" type="text/javascript"></script> </head> <body> <div id="content"> <h1>Test Summary</h1> <div id="summary"> <table> <tr> <td> <div class="summaryGroup"> <table> <tr> <td> <div class="infoBox" id="tests"> <div class="counter">2</div> <p>tests</p> </div> </td> <td> <div class="infoBox" id="failures"> <div class="counter">1</div> <p>failures</p> </div> </td> <td> <div class="infoBox" id="ignored"> <div class="counter">0</div> <p>ignored</p> </div> </td> <td> <div class="infoBox" id="duration"> <div class="counter">5.308s</div> <p>duration</p> </div> </td> </tr> </table> </div> </td> <td> <div class="infoBox failures" id="successRate"> <div class="percent">50%</div> <p>successful</p> </div> </td> </tr> </table> </div> <div id="tabs"> <ul class="tabLinks"> <li> <a href="#tab0">Failed tests</a> </li> <li> <a href="#tab1">Packages</a> </li> <li> <a href="#tab2">Classes</a> </li> </ul> <div id="tab0" class="tab"> <h2>Failed tests</h2> <ul class="linkList"> <li> <a href="classes/com.pratik.testing.UserTest.html">UserTest</a>.
    <a href="classes/com.pratik.testing.UserTest.html#failingTest">failingTest</a> </li> </ul> </div> <div id="tab1" class="tab"> <h2>Packages</h2> <table> <thead> <tr> <th>Package</th> <th>Tests</th> <th>Failures</th> <th>Ignored</th> <th>Duration</th> <th>Success rate</th> </tr> </thead> <tbody> <tr> <td class="failures"> <a href="packages/com.pratik.testing.html">com.pratik.testing</a> </td> <td>2</td> <td>1</td> <td>0</td> <td>5.308s</td> <td class="failures">50%</td> </tr> </tbody> </table> </div> <div id="tab2" class="tab"> <h2>Classes</h2> <table> <thead> <tr> <th>Class</th> <th>Tests</th> <th>Failures</th> <th>Ignored</th> <th>Duration</th> <th>Success rate</th> </tr> </thead> <tbody> <tr> <td class="failures"/> <a href="classes/com.pratik.testing.UserTest.html">com.pratik.testing.UserTest</a> <td>2</td> <td>1</td> <td>0</td> <td>5.308s</td> <td class="failures">50%</td> </tr> </tbody> </table> </div> </div> <div id="footer"> <p> <div> <label class="hidden" id="label-for-line-wrapping-toggle" for="line-wrapping-toggle">Wrap lines
    <input id="line-wrapping-toggle" type="checkbox" autocomplete="off"/> </label> </div>Generated by 
    <a href="http://www.gradle.org">Gradle 2.4</a> at Aug 8, 2015 6:38:46 PM</p> </div> </div> </body> </html>

Desired output:

    Aug 8 2015 6:38:46 PM, 50%, com.pratik.testing, UserTest, 5.308s, failingTest,..


You could do this with bash by pulling the lines you want with grep and then parsing them with awk instead of sed. For example:

    grep "http://www.gradle.org" index.html | awk '{ print $5 " " $6 " " $7 " " $8 " " $9 }'
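Note that the last awk field still carries the trailing `</p>`. A minimal sketch that trims it with sed instead, assuming the footer keeps the `</a> at ... </p>` shape shown in the question (the function name `extract_date` is only illustrative):

```shell
#!/bin/sh
# Hypothetical helper: print the text that follows "</a> at " and precedes "</p>".
# Reads the files given as arguments, or stdin when there are none.
extract_date() {
    sed -n 's|.*</a> at \([^<]*\)</p>.*|\1|p' "$@"
}

# Example with the footer line from the question.
sample='<a href="http://www.gradle.org">Gradle 2.4</a> at Aug 8, 2015 6:38:46 PM</p> </div>'
printf '%s\n' "$sample" | extract_date
```

On the real report you would call `extract_date index.html`. Piping the result through `tr -d ','` drops the comma if you want the date exactly as in the desired output.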

But you would be better off using something like Perl to strip out the data you want. You can execute Perl scripts just like bash scripts by putting a shebang line (typically `#!/usr/bin/perl`) at the top of the script and then making the file executable:

    chmod 755 myscript.pl

Below is the beginning of a Perl script that searches each line of index.html for specific strings. When a line matches, the script splits it into fields using the `>` symbol as the delimiter and then cuts the value you want out of the relevant field. I commented out (`##`) a couple of print statements that I used to help debug the script.


    #!/usr/bin/perl
    use 5.006;
    use strict;
    use warnings;

    my $filter_one = '';
    my $filter_two = '';
    open(my $fh, '<', 'index.html') or die "could not open the file: $!";

    # Note: the field indexes below assume each matched tag starts its own line.
    while (my $record = <$fh>) {
        # Footer line: <a href="http://www.gradle.org">Gradle 2.4</a> at <date></p>
        if ($record =~ m/www\.gradle\.org/) {
            ##print "Matched one \n\n";
            my @line = split(/>/, $record);
            # Skip the leading " at " and drop the trailing "</p"
            $filter_one = substr($line[2], 4, length($line[2]) - 7);
            ##print "$filter_one \n";
        }
        # Success rate: <div class="percent">50%</div>
        if ($record =~ m/class="percent/) {
            ##print "Matched two \n\n";
            my @line = split(/>/, $record);
            # Drop the trailing "</div"
            $filter_two = substr($line[1], 0, length($line[1]) - 5);
            ##print "$filter_two \n";
        }
    }
    close($fh);

    print "$filter_one, $filter_two, \n";
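
For the filters that script does not handle yet (package, class, duration, failed test), here is a rough shell sketch using `grep -o`. This works even when several tags share one line. It is not a real HTML parser and depends on the exact markup shown in the question. The function names `first_text` and `remaining_fields` are made up for illustration, and `grep -o` is an assumption about your grep (GNU and BSD grep have it, but it is not required by POSIX). Since `<h3 class="failures">` does not appear in the sample index.html, the failed test name is taken from the `#failingTest` link instead.

```shell
#!/bin/sh
# Sketch only: regex scraping tied to the exact markup shown in the question.

# Hypothetical helper: read HTML on stdin, find the first match of the
# extended regex in $1 (which must begin with an opening tag) plus the text
# after it up to the next "<", and print that match without the opening tag.
first_text() {
    grep -oE "${1}[^<]*" | sed -n '1s/^<[^>]*>//p'
}

# Hypothetical helper: print "package, class, duration, failed test" for the
# report file named in $1. The body runs in a subshell so the variables stay local.
remaining_fields() (
    file=$1
    package=$(first_text '<a href="packages/[^"]*">' < "$file")
    class=$(first_text '<a href="classes/[^"#]*">' < "$file")
    duration=$(first_text '<td>[0-9.]+s' < "$file")
    failed=$(first_text '<a href="classes/[^"#]*#[^"]*">' < "$file")
    printf '%s, %s, %s, %s\n' "$package" "$class" "$duration" "$failed"
)

# Usage: remaining_fields index.html
```

Each value is the first match in the file, so a report with several packages or failed tests would need a loop over all matches rather than just the first one.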
