Here is a basic script. The output format is not very good, but you can change that to suit your needs:
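use strict;
use warnings;

# Section titles are expected on lines whose first word starts with "=".
open(my $in,  '<', 'readme.txt') or die "ERROR: cannot open readme.txt: $!";
open(my $out, '>', 'seeme.txt')  or die "ERROR: cannot open seeme.txt: $!";

my (%cnter, $title, @order);
while (<$in>) {
    next if /^\s*$/;                # skip blank lines
    chomp;
    my @line = split ' ';           # split on whitespace, ignoring leading whitespace
    if ($line[0] =~ /^=/) {
        $line[0] =~ tr/=//d;        # remove the "=" marker from the first word of the title
        $title = "@line";
        push @order, $title;
    }
    else {
        next unless defined $title; # ignore text that appears before the first section title
        tr/,.?!//d for @line;       # remove some punctuation
        tr/A-Z/a-z/ for @line;      # convert to lower case so 'Word' and 'word' are the same
        # count each word per section, skipping tokens that were only punctuation
        $cnter{$_}{$title}++ for grep { length } @line;
    }
}

# header row: the section titles in the order they appeared
print $out join("\t", @order), "\n";

# one row per word: its count in each section, 0 if it does not occur there
foreach my $word (sort keys %cnter) {
    print $out "$word : ";
    my @t = ();
    foreach my $title (@order) {
        push @t, (exists $cnter{$word}{$title}) ? $cnter{$word}{$title} : 0;
    }
    print $out join("\t", @t), "\n";
}

close($in);
close($out) or die "ERROR: cannot close seeme.txt: $!";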
By itself this does not allow for much data analysis: it simply lists each word with its count per section. If you wanted to sort by highest word frequency per section (for example), you would need to build a more robust data structure, or open the file this script creates and parse it with another script. You can treat the output of the above script as your basic statistics, from which you can perform further analysis of the data.
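As a rough sketch of the "more robust data structure" idea, the counts can be inverted so they are keyed by section first and then sorted by frequency. The sample data, the %by_section hash and the top_words() helper below are made up for illustration; only the shape of %cnter ($cnter{word}{title} = count) comes from the script above.

```perl
use strict;
use warnings;

# Hypothetical sample data in the same shape as %cnter in the script above:
# $cnter{word}{section title} = count
my %cnter = (
    perl   => { Intro => 3, Usage => 1 },
    script => { Intro => 1, Usage => 4 },
    file   => { Usage => 2 },
);
my @order = ('Intro', 'Usage');

# Invert the structure to $by_section{title}{word} = count
my %by_section;
for my $word (keys %cnter) {
    for my $title (keys %{ $cnter{$word} }) {
        $by_section{$title}{$word} = $cnter{$word}{$title};
    }
}

# Return the words of one section, most frequent first
# (ties are broken alphabetically so the order is stable)
sub top_words {
    my ($title) = @_;
    my $counts = $by_section{$title} || {};
    return sort { $counts->{$b} <=> $counts->{$a} or $a cmp $b } keys %$counts;
}

# Print each section followed by its words and counts
for my $title (@order) {
    print "$title\n";
    print "\t$_\t$by_section{$title}{$_}\n" for top_words($title);
}
```

In the real script you would build %by_section inside the while loop (or from %cnter after it) instead of using hard-coded data, and print to the output file rather than STDOUT.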
Last edited by KevinADC : Apr 11th, 2008 at 3:36 am.