count characters in a string

Question

mank 0 Light Poster

16 Years Ago

This code counts the number of characters in a single line.
How can I make it count (give the total X Y Z) across multiple lines separated by ">"?

while(<IN>) {
  chomp;
  my @line = split(//, $_);
  if ($line[0] eq ">") { print @line; }
  elsif (($line[0] eq "X"||"Y"||"Z")||($line[0] ne ">"))
  {
    my %cnter;
    foreach my $gc (@line)
    {
      $cnter{$gc}++;
    }

    print join(" ", @cnter{qw(X Y Z)}), "\n";
  }
}

close(IN);
exit;

perl

katharnakh 7 Posting Whiz in Training

16 Years Ago

open(IN, "readme.txt") || die "ERROR: $!"; 
my %cnter;

while(<IN>) {
  chomp;
  my @line = split(//, $_); 
  if($line[0] eq ">"){print @line, "\n";}
  elsif (($line[0] eq "X"||"Y"||"Z")||($line[0] ne ">"))
  {
      foreach my $gc (@line) 
      {
      	$cnter{$gc} = $cnter{$gc} ? $cnter{$gc} += 1 : 1;
      }
  }
}
foreach $k(keys %cnter){
	print "$k: $cnter{$k} | ";
}
close(IN);

# readme.txt
> this is first line
X this is second line
Y this is third line
Z this is forth line

# Output
> this is first line
 : 12 | c: 1 | d: 2 | e: 4 | f: 1 | h: 5 | i: 10 | l: 3 | n: 4 | o: 2 | r: 2 | s: 7 | t: 5 | X: 1 | Y: 1 | Z: 1 |

katharnakh.
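A side note on the elsif test used in the code above: because eq binds more tightly than ||, the expression $line[0] eq "X"||"Y"||"Z" is evaluated as ($line[0] eq "X") || "Y" || "Z", and "Y" is a non-empty string, so the test is always true. The sketch below only illustrates that; the sub names loose_test and strict_test are made up for the example and are not part of anyone's posted code.

```perl
use strict;
use warnings;

# What the thread's test effectively evaluates: ($c eq "X") || "Y" || "Z".
# "Y" is a non-empty string, so the whole expression is always true.
sub loose_test {
    my ($c) = @_;
    return ($c eq "X" || "Y" || "Z") ? 1 : 0;
}

# What was presumably intended: compare $c against each letter in turn.
sub strict_test {
    my ($c) = @_;
    return ($c eq "X" || $c eq "Y" || $c eq "Z") ? 1 : 0;
}

for my $c ("X", "Q", ">") {
    printf "%s loose=%d strict=%d\n", $c, loose_test($c), strict_test($c);
}
```

In the posted scripts this does no harm, since every line that is not a ">" line is meant to be counted anyway, so a plain else would behave the same.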

katharnakh 7 Posting Whiz in Training

16 Years Ago

Hi,
Since X, Y and Z repeat after every line that starts with '>', use the '>' line as the first-level key in the hash, then add the keys (X, Y, Z) and their values under it.

open(IN, "readme.txt") || die "ERROR: $!";
my (%cnter, $marker);
while(<IN>) {
		chomp;
		my @line = split(//, $_);
		if($line[0] eq ">")
		{
				$marker = $_;
				print @line, "\n";
		}
		elsif (($line[0] eq "X"||"Y"||"Z")||($line[0] ne ">"))
		{
		  	foreach my $gc (@line)
				{
		 				$cnter{$marker}{$gc} = $cnter{$marker}{$gc} ? $cnter{$marker}{$gc} += 1 : 1;
		 		}
		}
}

foreach $k(keys %cnter){
		print "\n$k: ";
		foreach $kk(keys %{$cnter{$k}}){
 				print "\t$kk: $cnter{$k}{$kk}";
		}
}
close(IN);

Note: Please don't send your queries by private message; the forum is there for exactly that purpose, getting help.

katharnakh

KevinADC 192 Practically a Posting Shark

16 Years Ago

A basic script; the output format is not very good, but that will be up to you to change to suit your needs:

use strict;
use warnings;
open(IN, "readme.txt") or die "ERROR: $!";
open(OUT, ">seeme.txt") or die "ERROR: $!";
my (%cnter, $title, @order);
while(<IN>) {
      next if (/^\s*$/);
      chomp;
      my @line = split(/\s+/);
      if($line[0] =~ /^=/) {
            $line[0] =~ tr/=//d; # remove all the "=" from the section title
            $title = "@line";
            push @order, $title;
      }
      else {
            tr/,.?!//d for @line; #remove some punctuation
            tr/A-Z/a-z/ for @line; #convert all text to lower case so 'Word' and 'word' are the same
            $cnter{$_}{$title}++ for @line; 
      }
}

print OUT join("\t",@order),"\n";
foreach my $word (sort keys %cnter){
      print OUT "$word : ";
      my @t = ();
      foreach my $title (@order) {
            push @t, (exists $cnter{$word}{$title}) ?  $cnter{$word}{$title} : 0;
      }
      print OUT join("\t", @t),"\n";
}
close(IN);
close(OUT);

This does not allow for a lot of data analysis in and of itself. It simply lists the data by word and its count per section. If you wanted to sort by highest word frequency per section (for example) you would need to build a more robust data structure or open the file this script creates and parse that file with another script. You could look at the output of the above script as your basic statistics from which you could perform more analysis of the data.

KevinADC 192 Practically a Posting Shark

16 Years Ago

Still learning on my end.
I am reviewing the code and am trying to understand this:
$cnter{$_}{$title}++ for @line;
I see this as a hash, %cnter, being populated. $_ is the word and $title is the section. I see that the keys are $_ in this loop, which are all the words. The value would then be the $title, and the ++ counts each individual word appearing in the section. Is the 'for @line' portion used for reading each line as it comes through?
I think this means this code already has a hash of all the words in each section: keys %cnter. I am trying to figure out how to identify the hash for each section.
Thanks-

%cnter is a two-dimensional hash (a hash of hashes). $_ (the words) and $title (the section title) are both hash keys. The value of a hash key can be another hash (and more things besides). ++ is the count of each word per section.

'for @line' just loops through the @line array and assigns each element (a word, not a whole line) to $_, which is then used to build up the hash. It's a short way of writing:

for (@line) {
    $cnter{$_}{$title}++;
}

It does mean there is one hash holding the count of every word in every section.

%cnter = (
    word1 => {
        title1 => count,
        title2 => count,
    },
    word2 => {
        title1 => count,
        title2 => count,
    },
    # etc.
);

If a word was not found in a section it would not be in the word hash. This is why my code checks later all the section titles and applies a value of 0 (zero) if a word was not found in a particular section.
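To make the zero-filling step concrete, here is a minimal sketch. The section titles and words are made up for the example, and row_for is a hypothetical helper name, not something from the posted script.

```perl
use strict;
use warnings;

my @order = ('Part I', 'Part III');      # section titles in file order (made up)
my %words_in = (                         # made-up words per section
    'Part I'   => [qw(plan plan career)],
    'Part III' => [qw(plan course)],
);

# Build word => { title => count }, the same shape as %cnter in the script.
my %cnter;
for my $title (@order) {
    $cnter{$_}{$title}++ for @{ $words_in{$title} };
}

# One row per word, using 0 for sections where the word never occurred.
sub row_for {
    my ($word) = @_;
    return map { exists $cnter{$word}{$_} ? $cnter{$word}{$_} : 0 } @order;
}

print join("\t", $_, row_for($_)), "\n" for sort keys %cnter;
```

Without the exists check, a word missing from a section would give undef, which prints as an empty string and raises a warning under use warnings.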

mank 0 Light Poster · Answer 1 · 2008-03-18T11:41:52+00:00

This is what I am looking for 
input file

>abc
XXYYZZ
XXYYZZ
>bcd
XXZZ
XXYY
>cde
ZZYYXX
>def
YYZZ
YYXX

output->

>abc 4 4 4
>bcd 4 2 2
>cde 2 2 2
>def 2 4 2
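One possible way to get exactly this output is sketched below. It is not taken from any reply in the thread; the sub name count_xyz is made up, and the input is passed as a list of lines rather than read from a file so the example stays self-contained.

```perl
use strict;
use warnings;

# Count X, Y and Z per ">" section; returns one output line per section.
sub count_xyz {
    my @lines = @_;
    my (@order, %count, $header);
    for my $line (@lines) {
        chomp $line;
        next if $line =~ /^\s*$/;                 # skip blank lines
        if ($line =~ /^>/) {                      # a new section starts here
            $header = $line;
            push @order, $header;
            $count{$header} = { X => 0, Y => 0, Z => 0 };
        }
        elsif (defined $header) {                 # data line inside a section
            $count{$header}{$_}++ for grep { /^[XYZ]$/ } split //, $line;
        }
    }
    return map { join ' ', $_, @{ $count{$_} }{qw(X Y Z)} } @order;
}

my @input = (">abc", "XXYYZZ", "XXYYZZ", ">bcd", "XXZZ", "XXYY",
             ">cde", "ZZYYXX", ">def", "YYZZ", "YYXX");
print "$_\n" for count_xyz(@input);
```

The @order array keeps the sections in the order they appear, since hash keys come back in no particular order.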

mank 0 Light Poster · Answer 2 · 2008-03-19T05:36:05+00:00

Thanks :)

godevars 0 Newbie Poster · Answer 3 · 2008-04-08T23:48:51+00:00

I have been playing with this. I have tried updating it to count words. When I update the split to /\s+/ I receive an error for an uninitialized hash value. How would I correct it? I tried changing the elsif line to not include the X, Y and Z argument.

KevinADC 192 Practically a Posting Shark · Answer 4 · 2008-04-09T00:46:43+00:00

post your code and some sample data.

godevars 0 Newbie Poster · Answer 5 · 2008-04-09T02:11:23+00:00

Below is the code I am working with:

use strict;

open(FILEIN,'Sample.txt') || die "Can't open OUT file: $!\n";
open (FILEOUT, '>SampleOut2.txt') ||die "Can't open OUT file: $!\n";

my (%cnter, $marker);
while(<FILEIN>) {
    chomp;
    my @line = split(/\s+/, $_);
    if($line[0] eq "=")
    {
        $marker = $_;
        print FILEOUT @line, "\n";
    }
    elsif ($line[0] ne "=")
    {
        foreach my $gc (@line)
        {
            $cnter{$marker}{$gc} = $cnter{$marker}{$gc} ? $cnter{$marker}{$gc} += 1 : 1;
        }
    }
}

foreach my $k(keys %cnter){
    print FILEOUT "\n$k: ";
    foreach my $kk(keys %{$cnter{$k}}){
        print FILEOUT "\t$kk: $cnter{$k}{$kk}\n";
    }
}
close FILEIN;
close FILEOUT;

Text for the sample is:

======Retirement Plan Fundamentals Part I

The objective of the Retirement Plan Fundamentals, Parts I and II, is to give an individual
beginning a career as a retirement plan professional a general background in qualified plans as a first step toward meeting the challenges of the profession.

=====Retirement Plan Fundamentals Part III

Retirement Plan Fundamentals Part III (RPF-3) covers plan administration, including census collection, benefit allocations and coverage and nondiscrimination testing. This course emphasizes daily valuation recordkeeping but includes discussions of balance-forward plans and conversions.

I am looking to count the words in each section separately but report them in one output.

Thanks

katharnakh 7 Posting Whiz in Training · Answer 6 · 2008-04-09T11:07:40+00:00

open(IN, "readme.txt") || die "ERROR: $!";
open(OUT, ">seeme.txt") || die "ERROR: $!";

my (%cnter, $marker);
while(<IN>) {
		chomp;
		my @line = split(/\s+/, $_);
		if($line[0] =~ /=/)
		{
				$marker = $_;
				print @line, "\n";
				#print OUT @line, "\n";
		}
		elsif ($line[0] =! /=/)
		{
		  	foreach my $gc (@line)
				{
		 				$cnter{$marker}{$gc} = $cnter{$marker}{$gc} ? $cnter{$marker}{$gc} += 1 : 1;
		 		}
		}
}

foreach $k(keys %cnter){
		print OUT "\n$k: ";
		foreach $kk(keys %{$cnter{$k}}){
				print OUT "\n$kk: $cnter{$k}{$kk}";
 				#print "\t$kk: $cnter{$k}{$kk}";
		}
}
close(IN);
close(OUT);

You should use a regular expression to check whether the line contains the character '='; everything else is the same. You also need to add one more line to the inner foreach loop to write the output to a file.

sample output:
======Retirement Plan Fundamentals Part I:
plans: 1
background: 1
first: 1
individual: 1
...
=====Retirement Plan Fundamentals Part III:
conversions.: 1
plans: 1
course: 1
allocations: 1
...

katharnakh.

KevinADC 192 Practically a Posting Shark · Answer 7 · 2008-04-09T13:12:08+00:00

this might return false matches:

if($line[0] =~ /=/)

it would be better (I think judging by the sample data) to anchor it to the beginning of the string:

if($line[0] =~ /^=/)

At least that way you know it is not somewhere else in the string. It might be better to just use the index() function, though, to avoid the unnecessary overhead of a regexp.

This is also not correct syntax:

elsif ($line[0] =! /=/)

should be:

elsif ($line[0] !~ /=/)
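
A small sketch of the three checks discussed above (unanchored match, anchored match, and index()) may help show the difference. The sub names are made up for the example, and the test strings are only illustrations.

```perl
use strict;
use warnings;

# Unanchored: true if "=" appears anywhere in the string.
sub has_eq_anywhere      { return $_[0] =~ /=/  ? 1 : 0 }

# Anchored: true only if the string starts with "=".
sub starts_with_eq       { return $_[0] =~ /^=/ ? 1 : 0 }

# No regex: index() returns 0 when "=" is the very first character.
sub starts_with_eq_index { return index($_[0], '=') == 0 ? 1 : 0 }

for my $s ('======Retirement', 'a=b', 'plain') {
    printf "%-18s anywhere=%d anchored=%d index=%d\n",
        $s, has_eq_anywhere($s), starts_with_eq($s), starts_with_eq_index($s);
}
```

Only the unanchored version treats 'a=b' as a section title, which is the false match being warned about.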

katharnakh 7 Posting Whiz in Training · Answer 8 · 2008-04-09T13:34:09+00:00

...
This is also not correct syntax:
elsif ($line[0] =! /=/)
should be:
elsif ($line[0] !~ /=/)

Thanks Kevin...

KevinADC 192 Practically a Posting Shark · Answer 9 · 2008-04-09T13:42:28+00:00

Borrowing from katharnakh's code I corrected a couple of things and expanded on the processing.

use strict;
use warnings;
open(IN, "readme.txt") || die "ERROR: $!";
open(OUT, ">seeme.txt") || die "ERROR: $!";

my (%cnter, $marker);
while(<IN>) {
		chomp;
		my @line = split(/\s+/);
		if($line[0] =~ /^=/) {
            $line[0] =~ tr/=//d; # remove all the "=" from the section title
            $marker = join(' ',@line); # rejoin the section title into a string
		}
		else {
            tr/,.?!//d for @line; #remove some punctuation
            tr/A-Z/a-z/ for @line; #convert all text to lower case so 'Word' and 'word' are the same
			   $cnter{$marker}{$_}++ for @line; 
 		}
}

# sort words by count in descending order
foreach my $section (keys %cnter){
		print OUT "$section\n";
		foreach my $word (sort { $cnter{$section}{$b} <=> $cnter{$section}{$a} } keys %{$cnter{$section}}){
				print OUT "$word: $cnter{$section}{$word}\n";
		}
}
close(IN);
close(OUT);

godevars 0 Newbie Poster · Answer 10 · 2008-04-09T23:44:56+00:00

Thank you. I updated my script and the output looks good. I did start using the 'warnings' and 'strict' pragmas as suggested to me in another thread. When I run the updated script I receive an 'uninitialized value in pattern match' error for this line:

if($line[0] =~ /^=/) {

My output is correct, just not sure why I continue to receive this error. I tried to correct by updating it to read.

if(my $line[0] =~ /^=/) {

Didn't work. Also tried to add @line (as well as $line) to the line where 'my' variables are defined but no luck. Is this an error I would receive but could ignore? Thanks again for helping me understand.

KevinADC 192 Practically a Posting Shark · Answer 11 · 2008-04-10T04:21:30+00:00

It's probably being caused by a blank line in the file. It is also a warning, not an error. Errors terminate scripts; warnings alert you to possible problems but the script keeps running.

You may want to try and skip blank lines in the file:

while(<IN>) {
      next if (/^\s*$/);
		chomp;
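
As a rough illustration of why a blank line triggers the warning: splitting an empty string returns an empty list, so $line[0] is undef when the pattern match runs. The helper name first_field is made up for this sketch.

```perl
use strict;
use warnings;

# Return the first whitespace-separated field of a (chomped) line,
# or undef when the line has no fields at all.
sub first_field {
    my ($text) = @_;
    my @line = split /\s+/, $text;
    return $line[0];
}

for my $text ('', '=====Title here') {
    my $first = first_field($text);
    print defined $first
        ? "first field: $first\n"
        : "no fields: \$line[0] is undef\n";
}
```

Skipping blank lines before the split, as suggested above, avoids ever testing an undefined $line[0].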

godevars 0 Newbie Poster · Answer 12 · 2008-04-11T03:52:48+00:00

I didn't realize the blank lines would create warnings. No more warnings now. In reviewing the script, it looks like a hash of hashes was built. I added the following (after the 'while' section and before the '# sort...' comment):

my ($key, $value);
while (($key, $value) = each %cnter) {
      print FILEOUT "$key => $value\n";
}

I wanted to see what the values in each hash were (wanted understand the logic behind each portion). My output looked something like this:

Retirement Plan Fundamentals Part I => HASH(0x18f0554)
Retirement Plan Fundamentals Part III => HASH(0x1876a74)
.....
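
The HASH(0x...) values are hash references: each section title points at its own inner hash of words. To see the contents, the reference has to be dereferenced. A minimal sketch with made-up counts (the sub name describe is hypothetical):

```perl
use strict;
use warnings;

my %cnter = (   # made-up counts, same shape as section => { word => count }
    'Part I'   => { plans => 1, first  => 1 },
    'Part III' => { plans => 1, course => 2 },
);

# Turn one section's inner hash into a readable "word=count" list.
sub describe {
    my ($section) = @_;
    my $words = $cnter{$section};   # a hash reference; printing it gives HASH(0x...)
    return join ', ', map { sprintf '%s=%d', $_, $words->{$_} } sort keys %$words;
}

print "$_ => ", describe($_), "\n" for sort keys %cnter;
```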

I am looking to compare the words from each section in a way that is easier to review. For example, the output would look more like:

word/section Plan Part I : Plan Part II : Plan Part III

plans: 1 : 0 : 1
course: 0 : 1 : 1
.....

I know (keys %cnter) is the title part.
(keys %{$cnter{$section}}) is each word in the title hash. Not sure where to start. If I do this:

(keys %{$cnter{$section}}) eq (keys %{$cnter{$section}})

I am only comparing the words in the section. I believe I need to assign each section a unique hash name so that it would look something like this:

(keys %{$cnter1{$section}}) eq (keys %{$cnter2{$section}})

I would use one of the sections as the base list so that when I compare other sections, I would just add a word if it's not there already.

Also, please let me know if I should start another thread. I want to make sure I follow proper form here.

Thanks-

katharnakh 7 Posting Whiz in Training · Answer 13 · 2008-04-11T10:24:38+00:00

...
I know (keys %cnter) is the title part.
(keys %{$cnter{$section}}) is each word in the title hash. Not sure where to start. If I do this:
(keys %{$cnter{$section}}) eq (keys %{$cnter{$section}})
I am only comparing the words in the section. I believe I need to assign each section a unique hash name so that it would look something like this:
(keys %{$cnter1{$section}}) eq (keys %{$cnter2{$section}})
...

Hi,
A separate hash per section is a reasonable idea only if the input has two or three sections and no more. If the input file has more than three sections (maybe 5 or 10), a unique hash for each section is not a good idea. Instead, we can change the way the data is stored in the hash a little, so that it is easier to handle.

Currently the data is stored in the hash like this:

cnter{
	section1 => {
				word1 => count1,
				word2 => count2
			},
	section2 => {
				word1 => count1,
				word2 => count2
			},
        ...
}

We will alter the hash above so that there is a single hash covering all sections:

cnter{
		word1 => [sec1_count, sec2_count, sec3_count....],
		word2 => [sec1_count, sec2_count, sec3_count....],
		...
}

Now only the word is stored as the KEY, and the VALUE is an anonymous array holding the word's count in each section.

Note: Each index in the array is dedicated to a particular section, so the word count must be incremented at the matching index.

katharnakh.

KevinADC 192 Practically a Posting Shark · Answer 14 · 2008-04-11T12:56:02+00:00

It seems that the array approach would require you know all the sections beforehand.

It may require more than one process to get the final output. First build the hash that counts the words per section, then have another routine that builds the hash of arrays then prints the final output.

KevinADC 192 Practically a Posting Shark · Answer 15 · 2008-04-11T13:04:06+00:00

I think the approach for the main data structure will have to be:

word => {
      section_title => count,
      section_title => count,
}
word => {
      section_title => count,
      section_title => count,
}

plus have a separate array that holds the names of each section in the order it was found in the file.

katharnakh 7 Posting Whiz in Training · Answer 16 · 2008-04-11T17:17:17+00:00

It seems that the array approach would require you know all the sections beforehand.
It may require more than one process to get the final output. First build the hash that counts the words per section, then have another routine that builds the hash of arrays then prints the final output.

Not really. The only thing we need to consider is the extra overhead (as I see it) of finding the section's index in the array, so that the count of a word belonging to that section is incremented correctly.

Here is the implementation using the data structure I mentioned earlier. The code below modifies Kevin's earlier posted code, just to show how the implementation goes. I have commented out some lines so that the creation and manipulation of the two data structures can be compared.

use strict;
use warnings;
open(IN, "readme.txt") or die "ERROR: $!";
open(OUT, ">seeme.txt") or die "ERROR: $!";
my (%cnter, $title, @order);
while(<IN>) {
      next if (/^\s*$/);
      chomp;
      my @line = split(/\s+/);
      if($line[0] =~ /^=/) {
            $line[0] =~ tr/=//d; # remove all the "=" from the section title
            $title = "@line";
            push @order, $title;
      }
      else {
            tr/,.?!//d for @line; #remove some punctuation
            tr/A-Z/a-z/ for @line; #convert all text to lower case so 'Word' and 'word' are the same
            
            # a local hash to store { section => its index in the array
            my %index;
            @index{@order} = (0..$#order);
            
            #$cnter{$_}{$title}++ for @line; 
            ${$cnter{$_}}[$index{$title}]++ for @line;
      }
}

print OUT join("\t",@order),"\n";
foreach my $word (sort keys %cnter){
      print OUT "$word :";
      my @t = ();
      my $index=0;
      foreach my $title (@order) {
            #push @t, (exists $cnter{$word}{$title}) ?  $cnter{$word}{$title} : 0;
            push @t, (exists ${$cnter{$word}}[$index]) ? ${$cnter{$word}}[$index] : 0;
      }
      print OUT join("\t", @t),"\n";
}
close(IN);
close(OUT);

I appreciate Kevin's approach, which is simpler and cleaner, and the code is more readable.

katharnakh.

KevinADC 192 Practically a Posting Shark · Answer 17 · 2008-04-12T05:17:37+00:00

Your code does not seem to work properly, katharnakh. I did not try to determine why. I don't think the "exists" function works on arrays:

exists ${$cnter{$word}}[$index]

like it does on hash keys:

exists $cnter{$word}{$title}

so that might be a problem.

godevars 0 Newbie Poster · Answer 18 · 2008-04-13T22:45:43+00:00

I tried it as well, and the output shows the same numbers for each word in every section. I think it does have something to do with [$index].

godevars 0 Newbie Poster · Answer 19 · 2008-04-14T01:53:17+00:00

I added some spacing and then decided to add all the words that were in each section. I declared a new variable $count and wanted to add the count being pushed to the end of the @t array. My thought was this would add all words in the section and then I could print it out as the last line. The second foreach loop is for one section, correct?

foreach my $word (sort keys %cnter){
      print OUT "$word : ";
      my @t = ();
      my $count = 0;
      foreach my $title (@order) {
            push @t, (exists $cnter{$word}{$title}) ?  $cnter{$word}{$title} : 0;
            my $count += $t[-1] # this would add the number just pushed to end of array
      }
      print OUT join("\t", @t),"\n";
      print OUT "$count"; #prints total then goes to next title
}

I may be interpreting this incorrectly.

Thanks-

KevinADC 192 Practically a Posting Shark · Answer 20 · 2008-04-14T04:28:00+00:00

Did you run the code? The first obvious problem is declaring the variable twice with "my". The code you posted looks like it will always display zero because of that. But even if you properly declare the variable it will not count the total of words per section. It looks like it will be the total of only each instance of a word for all the sections it is found in. Say the word were "foo" and it was found 3 times in section I and 2 times in section III the code will print 5, I think. I did not run the code to see but I can tell it is not totalling the words per section. I would do that while the data is being read in from the file, not after, while the data is being printed to the OUT file, although that is more than likely possible.
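
A rough sketch of totalling the words per section while the data is being read might look like this. It is only an illustration of the idea: the sub name section_totals and the sample lines are made up, and the input is passed as a list instead of being read from a file.

```perl
use strict;
use warnings;

# Return [title, total word count] pairs, in the order the sections appear.
sub section_totals {
    my @lines = @_;
    my (%total, @order, $title);
    for my $line (@lines) {
        next if $line =~ /^\s*$/;            # skip blank lines
        my @words = split ' ', $line;
        if ($words[0] =~ /^=/) {             # section title line
            $words[0] =~ tr/=//d;
            $title = "@words";
            push @order, $title;
            $total{$title} = 0;
        }
        elsif (defined $title) {
            $total{$title} += @words;        # array in numeric context = word count
        }
    }
    return map { [ $_, $total{$_} ] } @order;
}

my @sample = ("=====Part I", "one two three", "", "====Part III", "four five");
printf "%s: %d words\n", @$_ for section_totals(@sample);
```

Note that the total is declared once, outside the loop that updates it. Writing my again inside the loop creates a fresh variable each time, which is why the earlier attempt always printed zero.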

godevars 0 Newbie Poster · Answer 21 · 2008-04-14T10:08:03+00:00

I did run it and you are right. I was getting zeros. I thought it was because of how I set it up. I corrected it and I am getting the count for the word in the sections. I am looking at now creating a hash for each word where the key is the section and the values are the counts. Then in the outside of the loops I could add all the same key/values to get a total number of words. Sections is $title and the counts are what is being pushed in the second 'foreach'.

%section_hash = ($title => $t[-1])

I tried this and this replaced my values until I hit the last section. I reviewed this and thought it may be easier to make an array of the numbers for each section instead. Is there a way to change the name of an array label within the loop? In this example, I want to end up with 2 section arrays.

my @section_array = ();
push @section_array, @t;

I'd like the @section_array to change based on $title. I'll try again tomorrow.

Thanks-
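
On the question of changing an array's name inside the loop: a common alternative is a hash of arrays keyed by $title, so each section gets its own array without needing a separate variable name. The sketch below assumes the word => { section => count } structure from the earlier script; the data, %counts_for and section_total are made up for the example.

```perl
use strict;
use warnings;
use List::Util qw(sum);

my @order = ('Part I', 'Part III');
my %cnter = (   # made-up word => { section => count } data
    plans  => { 'Part I' => 1, 'Part III' => 1 },
    course => { 'Part III' => 2 },
);

# section title => [ counts, one entry per word in sorted word order ]
my %counts_for;
for my $word (sort keys %cnter) {
    for my $title (@order) {
        my $n = exists $cnter{$word}{$title} ? $cnter{$word}{$title} : 0;
        push @{ $counts_for{$title} }, $n;
    }
}

# Total number of words counted in one section.
sub section_total {
    my ($title) = @_;
    return sum(@{ $counts_for{$title} });
}

print "$_: @{ $counts_for{$_} } (total ", section_total($_), ")\n" for @order;
```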

katharnakh 7 Posting Whiz in Training · Answer 22 · 2008-04-14T10:22:05+00:00

Hi Kevin,

Your code does not seem to work properly katharnakh. I did not try and determine why. I don't think the "exists" function works on arrays:
exists ${$cnter{$word}}[$index]
like it does on hash keys:
exists $cnter{$word}{$title}
so that might be a problem.

The program was executing but the output was wrong. I realised later that it is important to maintain the indexes of the titles encountered. I have made the changes: %index is now declared globally instead of locally in the else block.

use strict;
use warnings;
open(IN, "readme.txt") or die "ERROR: $!";
open(OUT, ">seeme.txt") or die "ERROR: $!";
my (%cnter, $title, @order, %index);

while(<IN>) {
      next if (/^\s*$/);
      chomp;
      my @line = split(/\s+/);
      if($line[0] =~ /^=/) {
            $line[0] =~ tr/=//d; # remove all the "=" from the section title
            $title = "@line";
            push @order, $title;
      }
      else {
            tr/,.?!//d for @line; #remove some punctuation
            tr/A-Z/a-z/ for @line; #convert all text to lower case so 'Word' and 'word' are the same
            
            # map each section title to its index in @order
            # (%index is now declared at the top, outside the else block)
            @index{@order} = (0..$#order);
            
            #$cnter{$_}{$title}++ for @line; 
            ${$cnter{$_}}[$index{$title}]++ for @line;
      }
}

print OUT join("\t",@order),"\n";
foreach my $word (sort keys %cnter){
      print OUT "$word :";
      my @t = ();
      foreach my $title (@order) {
            #push @t, (exists $cnter{$word}{$title}) ?  $cnter{$word}{$title} : 0;
            push @t, (defined ${$cnter{$word}}[$index{$title}]) ? ${$cnter{$word}}[$index{$title}] : 0;
      }
      print OUT join("\t", @t),"\n";
}
close(IN);
close(OUT);

Thanks Kevin...
katharnakh.

katharnakh 7 Posting Whiz in Training · Answer 23 · 2008-04-14T13:58:03+00:00

Your code does not seem to work properly katharnakh. I did not try and determine why. I don't think the "exists" function works on arrays:
exists ${$cnter{$word}}[$index]
like it does on hash keys:
exists $cnter{$word}{$title}
so that might be a problem.

Hi Kevin,
the code does work with the push @t, (exists ${$cnter{$word}}[$index{$title}]) ? ${$cnter{$word}}[$index{$title}] : 0; line. I am not sure why; maybe you can tell, if you have an idea, because the code does include use strict; use warnings; and it does not seem to throw any warnings or errors... I forgot to mention that point in the earlier post.

katharnakh.

godevars 0 Newbie Poster · Answer 24 · 2008-04-14T22:30:29+00:00

Still learning on my end.
I am reviewing the code and am trying to understand this:

$cnter{$_}{$title}++ for @line;

I see this as a hash, %cnter, being populated. $_ is the word and $title is the section. I see that the keys are $_ in this loop, which are all the words. The value would then be the $title, and the ++ counts each individual word appearing in the section. Is the 'for @line' portion used for reading each line as it comes through?

I think this means this code already has a hash of all the words in each section: keys %cnter. I am trying to figure out how to identify the hash for each section.

Thanks-

KevinADC 192 Practically a Posting Shark · Answer 25 · 2008-04-15T04:13:09+00:00

Hi Kevin,
the code does work with the push @t, (exists ${$cnter{$word}}[$index{$title}]) ? ${$cnter{$word}}[$index{$title}] : 0; line. I am not sure why; maybe you can tell, if you have an idea, because the code does include use strict; use warnings; and it does not seem to throw any warnings or errors... I forgot to mention that point in the earlier post.
katharnakh.

Then I guess the exists function does work for arrays as well as hashes. I will try some simple tests later and see what happens.
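
A simple test along those lines could look like the sketch below. It is only an illustration; the array and the helper slot_state are made up. As far as I know, exists does accept array elements in Perl 5 (it reports whether the slot was ever assigned), although the perlfunc documentation discourages relying on it, so defined is usually the safer check for a count array like this one.

```perl
use strict;
use warnings;

my @counts;
$counts[2] = 5;        # slots 0 and 1 are never assigned
$counts[3] = undef;    # slot 3 is assigned, but to undef

# Report what exists and defined say about one array slot.
sub slot_state {
    my ($i) = @_;
    return sprintf 'exists=%d defined=%d',
        (exists $counts[$i] ? 1 : 0), (defined $counts[$i] ? 1 : 0);
}

print "$_: ", slot_state($_), "\n" for 0 .. 3;
```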