I'm writing a program that searches a list of medium-sized text files for a particular keyword string. The string may appear more than once in a file. The files are generally formatted for readability, so they have hard returns at the 80-column mark within paragraphs. Right now, the program slurps the entire file and then runs a global search for my keyword string, like so:

$brcount++ while $data =~ /(file for bankrup)|(file for chapter)/gi;

I then print out this count, along with a bunch of header information. What I would like to do is print, say, 1-3 lines before and after each occurrence, so that I have an output file I can inspect manually.

I've tried to hack in the lines to the regex string, but since it's a global search, it of course matches each occurrence 5 or 6 times:

while ($data =~ /(?:[^\n]*\n){1,3}(file for bankrup)(?:[^\n]*\n){1,3}|(?:[^\n]*\n){1,3}(file for chapter)(?:[^\n]*\n){1,3}/gi) {
    print $data;
}

Is there an elegant way to do this? Alternatively, is there another way of going about this reading the file in line-by-line? I've thought about appending two lines together, performing the phrase match, storing the line numbers, and then going back in and pulling the additional lines with the stored line numbers, but this seemed like a huge hack. It would probably run faster that way, I would imagine.

All 4 Replies

$brcount++ while $data =~ /(file for bankrup)|(file for chapter)/gi;

This assumes that those two phrases will never span across two lines. Is that correct? If so, then you don't need to consider appending two lines together to do each match.

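Oops. You're completely right. There should be a \s+ instead of ' ' in that one, so the phrase can match across a line break. (I first thought it needed the /s flag as well, but /s only changes what . matches, and the pattern has no dot.)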
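A minimal sketch of the whitespace-tolerant count, assuming the file has already been slurped into a string. The sub name count_phrase and the sample text are invented for illustration. It uses \s+ rather than \s* so that a run-together "filefor" does not count as a match.

```perl
use strict;
use warnings;

# Hypothetical helper: count occurrences of either phrase in slurped text.
# \s+ matches any run of whitespace, including the newline left by a hard wrap.
sub count_phrase {
    my ($data) = @_;
    my $count = 0;
    $count++ while $data =~ /file\s+for\s+(?:bankrup|chapter)/gi;
    return $count;
}

# Invented sample: the second phrase is split across a line break.
my $sample = "They may file for bankruptcy soon.\nOthers will file\nfor Chapter 11.\n";
print count_phrase($sample), "\n";
```

Both phrases are counted here, including the one broken by the hard return.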

After benchmarking the slurp script against a line-by-line alternative, I think it's clear that I should do this line by line. I think I could do something like this (where $data is a concatenation of the last two lines read from <DATA_FILE>):

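# $data holds the last two lines read from DATA_FILE, joined together
if ($data =~ /file\s+for\s+(?:bankrup|chapter)/i) {
    my $line1 = $data;
    my $line2 = <DATA_FILE>;    # the next two lines after the match
    my $line3 = <DATA_FILE>;    # (undef if the file ends first)
    print OUTPUT "$line1$line2$line3\n";
}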

Somehow this doesn't seem like the most efficient thing to be doing.

Wait, I think I can use a different approach via the range operator:

while (<DATA>) {
    # true from a matching line through the next blank line, inclusive
    print if /\bfile for bankrup/i .. /^\s*$/;
}

This prints everything from the matching line to the end of the paragraph.

However, if anyone knows the best way to print out just the next 2-3 lines and, more importantly, the previous 2-3 lines, I'd appreciate the advice.

EDIT: After consideration, I'm guessing it would be easiest to keep the last 2-3 lines in an array, pushing the new line on and shifting the oldest one off on each pass through the while loop.
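A rough sketch of that push/shift idea, under a few assumptions: the names context_blocks and $ends_here are invented for illustration, the phrase is assumed to wrap across at most two lines, and the "before" context is counted from the line on which the phrase ends. The previous line is joined to the current one before matching, so a phrase broken by a hard return is still found.

```perl
use strict;
use warnings;

# Hypothetical helper: read $fh line by line and return a list of blocks
# (array refs of lines), each holding up to $n lines before and after a hit.
sub context_blocks {
    my ($fh, $n) = @_;
    # A hit is a phrase that ENDS on the current line: after the phrase only
    # non-newline characters and an optional final newline may follow.
    my $ends_here = qr/file\s+for\s+(?:bankrup|chapter)[^\n]*\n?\z/i;
    my (@before, @blocks);
    my $after = 0;     # lines still owed to the block being built
    my $prev  = '';
    while (my $line = <$fh>) {
        if ("$prev$line" =~ $ends_here) {
            if ($after > 0) { push @{ $blocks[-1] }, $line }     # extend open block
            else            { push @blocks, [ @before, $line ] } # start a new one
            @before = ();      # these lines are already part of a block
            $after  = $n;
        }
        elsif ($after > 0) {
            push @{ $blocks[-1] }, $line;   # trailing context
            $after--;
        }
        else {
            push @before, $line;            # rolling buffer of recent lines
            shift @before if @before > $n;  # keep at most $n of them
        }
        $prev = $line;
    }
    return @blocks;
}

# Invented sample, read through an in-memory filehandle.
my $text = "one\ntwo\nWe may file\nfor bankruptcy soon.\nfive\nsix\nseven\n";
open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";
print '-' x 75, "\n", @$_ for context_blocks($fh, 1);
close $fh;
```

Because lines already placed in a block are never put back in the buffer, hits that fall close together extend one block instead of printing the same lines twice.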
Jonathan Leffler posted a Perl script he calls 'sgrep' on Stack Overflow. It takes a similar approach to the one you are thinking of, except that it doesn't look at two lines together, so it may not work for you as is. Still, you may want to look at it for inspiration.

I would try to do something like the following instead:

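#!/usr/bin/perl
# PrintMatchContext01.pl
use 5.006;
use strict;
use warnings;
use Tie::File;
use Fcntl qw(O_RDONLY);
use List::Util qw(max min);

my $file_in = 'data.txt';
my @lines;
# Open read-only so a missing file is an error instead of being created.
tie @lines, 'Tie::File', $file_in, mode => O_RDONLY
    or die "Unable to tie $file_in: $!";
my $n_recs = @lines;    # how many records are in the file?

# Record the index of every line on which the phrase starts.
my @lnbrs;
for my $i (0 .. $n_recs - 1) {
    # Tie::File strips line endings, so rejoin with "\n" to let \s+ span the wrap.
    my $next     = $i + 1 < $n_recs ? $lines[$i + 1] : '';
    my $twolines = "$lines[$i]\n$next";
    # ^.* (no /s or /m) pins the start of the phrase to line $i,
    # so an occurrence is not recorded again from the following window.
    push @lnbrs, $i if $twolines =~ m/^.*file\s+for\s+(?:bankrup|chapter)/i;
}

foreach my $ln (@lnbrs) {
    my $start = max($ln - 3, 0);
    my $end   = min($ln + 3, $n_recs - 1);    # last valid index
    print '-' x 75, "\n";
    print "$lines[$_]\n" for $start .. $end;
}
untie @lines;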