Wordcount of a text file (Python)

vegaseat

A simple program to count the words, lines and sentences contained in a text file. It assumes that words are separated by whitespace and that sentences end with a period, question mark or exclamation mark.

# count lines, sentences, and words of a text file

# set all the counters to zero
lines, blanklines, sentences, words = 0, 0, 0, 0

print '-' * 50

# use a text file you have, or google for this one ...
filename = 'GettysburgAddress.txt'
try:
  textf = open(filename, 'r')
except IOError:
  print 'Cannot open file %s for reading' % filename
  import sys
  sys.exit(1)   # non-zero exit status signals the failure

# reads one line at a time
for line in textf:
  print line,   # test
  lines += 1
  
  if line.startswith('\n'):
    blanklines += 1
  else:
    # assume that each sentence ends with . or ! or ?
    # so simply count these characters
    sentences += line.count('.') + line.count('!') + line.count('?')
    
    # create a list of words
    # use None to split at any whitespace regardless of length
    # so for instance double space counts as one space
    tempwords = line.split(None)
    print tempwords  # test
    
    # word total count
    words += len(tempwords)

    
textf.close()

print '-' * 50
print "Lines      : ", lines
print "Blank lines: ", blanklines
print "Sentences  : ", sentences
print "Words      : ", words

# optional console wait for keypress
from msvcrt import getch
getch()

vegaseat 1,735 DaniWeb's Hypocrite Team Colleague

This code is most likely more portable:
# optional console wait for keypress
raw_input('Press Enter...')
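On the subject of portability: the snippet above is Python 2 (print statement, raw_input). In Python 3, print is a function and raw_input was renamed to input. As a rough sketch only, the same counting rules could be wrapped in a function like the one below. The name count_text is made up for this example, and unlike the original it treats a line holding only spaces as blank too.

```python
# Python 3 sketch of the counting rules used in the snippet above.
# count_text is a hypothetical helper name, not part of the original post.

def count_text(text):
    """Return (lines, blanklines, sentences, words) for a string."""
    lines = blanklines = sentences = words = 0
    for line in text.splitlines():
        lines += 1
        if not line.strip():
            # assumption: a whitespace-only line also counts as blank
            blanklines += 1
        else:
            # assume that each sentence ends with . or ! or ?
            sentences += sum(line.count(ch) for ch in ".!?")
            # split() without arguments splits at any run of whitespace
            words += len(line.split())
    return lines, blanklines, sentences, words


sample = "Just a simple text.\nWe can count the sentences!\n\nWhy do sentences have to end?"
print(count_text(sample))
```

To use it on a file, read the file into a string first, for instance with open(filename).read(), and pass that string in.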

eclark53 0 Newbie Poster

I need a program to count the words in each sentence and highlight any sentence that has 30 or more words. I need to be able to load an article into the program and then have the program highlight the sentences that have more words than a selected limit, e.g. 20 or 30 words.
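One possible way to approach this request, offered as a minimal Python 3 sketch rather than a finished tool: split the text into sentences with a naive rule (a sentence ends at . ! or ? followed by whitespace), count the words in each, and report the ones at or above the limit. The function name long_sentences, the limit parameter and the >>> <<< markers are all assumptions made for illustration. Abbreviations such as "e.g." will fool the sentence split.

```python
import re

def long_sentences(text, limit=30):
    """Return (word_count, sentence) pairs for sentences with at least `limit` words."""
    # naive rule: a sentence ends at . ! or ? followed by whitespace
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    result = []
    for sentence in sentences:
        count = len(sentence.split())
        if count >= limit:
            result.append((count, sentence))
    return result


article = "Short one. This sentence is a bit longer than the first one! Tiny?"
# use a small limit here so the short demo text produces a hit
for count, sentence in long_sentences(article, limit=5):
    print("%d words: >>> %s <<<" % (count, sentence))
```

For a real article, read the file into a string and call long_sentences(text, 30) or long_sentences(text, 20).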

pelupelu 0 Newbie Poster

I am using this code to compute some lexical statistics for a text. However, it does not recognize the ends of the sentences (. ? ! etc.) and reports only 1 sentence. I think that line.count is not working. The line count is correct. Finally, the word count only covers the last sentence and not the whole text. Can someone help me with this issue?

vegaseat 1,735 DaniWeb's Hypocrite Team Colleague

I modified the program to use this test text, and it works correctly ...

# count lines, sentences, and words of a text file

# set all the counters to zero
lines, blanklines, sentences, words = 0, 0, 0, 0

# test text ...
text = """\
Just a simple text.
We can count the sentences!
Why do sentences have to end?

Every now and then a blank line.
Perhaps it will snow!

Wow, another blank line for the count.
That should do it for the test!"""

# write the test file
fname = "MyText1.txt"
fout = open(fname, "w")
fout.write(text)
fout.close()

# read the file back in
textf = open(fname, "r")

# reads one line at a time
for line in textf:
    #print line,   # test
    lines += 1

    if line.startswith('\n'):
        blanklines += 1
    else:
        # assume that each sentence ends with . or ! or ?
        # so simply count these characters
        sentences += line.count('.') + line.count('!') + line.count('?')

        # create a list of words
        # use None to split at any whitespace regardless of length
        # so for instance double space counts as one space
        tempwords = line.split(None)
        #print tempwords  # test

        # word total count
        words += len(tempwords)

textf.close()

print '-' * 50
print "Lines      : ", lines
print "Blank lines: ", blanklines
print "Sentences  : ", sentences
print "Words      : ", words

"""my result -->
Lines      :  9
Blank lines:  2
Sentences  :  7
Words      :  40
"""

pelupelu 0 Newbie Poster

Thanks. Will this program work for any type of encoding (ASCII or UTF-8, for example)? I found the following instruction on the web, which is supposed to allow Python to work with UTF-8:
# -*- coding: utf-8 -*-
The problem is that when I debug the program it does not seem to recognize the instruction (probably because it starts with #), but if I erase it the program does not work either. Do you know this command?

vegaseat 1,735 DaniWeb's Hypocrite Team Colleague

To define a source code encoding, a magic comment must
be placed into the source files either as first or second
line in the file:
# -*- coding: <encoding name> -*-

see: http://www.python.org/peps/pep-0263.html
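
Note that the magic comment only declares the encoding of the source file itself. How the text file being counted is decoded is decided where that file is opened. A minimal sketch, assuming the data is UTF-8 and using io.open (available in Python 2.6+ and Python 3); the file name utf8_sample.txt is made up for this example:

```python
import io

# hypothetical file name, used only for this sketch
fname = "utf8_sample.txt"

# write a small UTF-8 encoded test file
with io.open(fname, "w", encoding="utf-8") as fout:
    fout.write(u"Caf\u00e9 au lait.\nD\u00e9j\u00e0 vu!\n")

# read it back; io.open decodes the UTF-8 bytes into unicode text
words = 0
with io.open(fname, "r", encoding="utf-8") as textf:
    for line in textf:
        # split() on unicode text works the same as on plain strings
        words += len(line.split())

print(words)
```

If the file is in another encoding, pass that name instead of "utf-8"; opening it with the wrong encoding may raise UnicodeDecodeError or give garbled words.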

pelupelu 0 Newbie Poster

Thanks again for the info. Is there a way to tell letters from numbers, for example in the string "the wine is 7 years old"? Do you need a function for that? I tried to use line.count but it did not work (I guess I need a generic term for numbers).

vegaseat 1,735 DaniWeb's Hypocrite Team Colleague

Try this code sample ...

s = "the wine is 7 years old"
for c in s:
    if c.isdigit():
        print( "%s is numeric" % c )

drumkill 0 Newbie Poster

# Count number of lines, words, and characters

filename = raw_input("Enter filename: ")
try:
    infile = open(filename, "r")
except IOError:
    print "file not found!"
else:
    word, lines, char = 0, 0, 0

    for ligne in infile:
        # split the line at any whitespace
        split_word = ligne.split()
        print split_word # test

        lines += 1
        word += len(split_word)

        # add up the characters of each word (whitespace is not counted)
        for i in split_word:
            char += len(i)

    infile.close()

    print """
words = %d
lines = %d
characters = %d""" % (word, lines, char)

#Roshan S. University Of Mauritius, dept. of Computer Science (student).

pythopian 10 Junior Poster in Training

If I may propose a semantically equivalent but much shorter alternative...

import re

def analyzeText(text):
    sentences = re.findall(r'\s*(.+?)[.!?]\s*', text)
    wordsets = map(str.split, sentences)
    wordcounts = map(len, wordsets)
    charcounts = [ sum(len(word) for word in words) for words in wordsets ]
    return zip(sentences, wordcounts, charcounts)

Test:

text = """\
Just a simple text.
We can count the sentences!
Why do sentences have to end?

Every now and then a blank line.
Perhaps it will snow!

Wow, another blank line for the count.
That should do it for the test!"""

for sentence, wordCount, charCount in analyzeText(text):
    print 'There are %d words and %d chars in "%s".' % (wordCount, charCount, sentence)

Output:

There are 4 words and 15 chars in "Just a simple text".
There are 5 words and 22 chars in "We can count the sentences".
There are 6 words and 23 chars in "Why do sentences have to end".
There are 7 words and 25 chars in "Every now and then a blank line".
There are 4 words and 17 chars in "Perhaps it will snow".
There are 7 words and 31 chars in "Wow, another blank line for the count".
There are 7 words and 24 chars in "That should do it for the test".

vegaseat 1,735 DaniWeb's Hypocrite Team Colleague

What a difference 4 1/2 years make! I am surprised that Python has made it that long.

halophyte 0 Newbie Poster

With this code:

from itertools import groupby
import doctest

print '-' * 50
 
#try:
  # use a text file you have, or google for this one ...
#  filename = 'text.txt' #'GettysburgAddress.txt'
#  text = open(filename, 'r')
#except IOError:
#  print 'Cannot open file %s for reading' % filename
#  import sys
#  sys.exit(0)

# test text ...
text = """\
Just a simple text.
We can count the sentences!
Why do sentences have to end?
 
Every now and then a blank line.
Perhaps it will snow!
 
Wow, another blank line for the count.
That should do it for the test!"""
 
# write the test file
fname = "MyText1.txt"
fout = open(fname, "w")
fout.write(text)
fout.close()
 
# read the file back in
try:
    text = open(fname, "r")
except IOError:
    print 'Cannot open file %s for reading' % fname
    import sys
    sys.exit(0)

print text

def printWordFrequencies(text):
    #"""
    #>>> printWordFrequencies("Ob la di ob la da")
    #1 da
    #1 di
    #2 la
    #2 ob"""
    for w, g in groupby(sorted(text.lower().split())):
        print "%s %s" % (len(list(g)), w)

doctest.testmod(verbose=True)

I get this error:

<open file 'MyText1.txt', mode 'r' at 0x15a69b0>

Can someone explain? I'm on a Mac.

snippsat 661 Master Poster

<open file 'MyText1.txt', mode 'r' at 0x15a69b0>
Can someone explain? I'm on a Mac.

You don't do anything with the file object; open() only returns the object, it does not read the file.
text = open(fname, "r").read()
Now the file is read into memory as a string and you can print it out.

So if you call "printWordFrequencies" like this it will work.
printWordFrequencies(text)
doctest.testmod(verbose=True)
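
To make the difference between the file object and its contents concrete, here is a small Python 3 sketch. It swaps in collections.Counter instead of the groupby approach from the post above, and word_frequencies is a made-up name for this example.

```python
from collections import Counter

fname = "MyText1.txt"  # same test file name as in the post above
with open(fname, "w") as fout:
    fout.write("Ob la di ob la da")

textf = open(fname, "r")
print(type(textf))      # a file object, not the text itself
text = textf.read()     # read() returns the file contents as a string
textf.close()

def word_frequencies(text):
    """Return (word, count) pairs sorted by word, ignoring case."""
    return sorted(Counter(text.lower().split()).items())

for word, count in word_frequencies(text):
    print("%s %s" % (count, word))
```

Printing the file object itself shows only its description, which is what the line starting with "<open file" was.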

Don't ask questions in a Code Snippet thread; make a new post next time.
