Write a Python script/program that reads an arbitrary file and builds a concordance of the words in the file. The concordance should contain the line number on which each word first occurs as well as the number of times the word occurs in the file. The program needs to print the concordance in alphabetical order. For example:

a 2 3
book 1 2
spam 1 7
the 2 4

Start by getting the file name, opening the file, and reading it one record at a time (http://diveintopython.org/file_handling/file_objects.html). Split each line into words and use a dictionary (http://diveintopython.org/native_data_types/index.html#odbchelper.dict) to store each unique word. Once that is done, add a routine that uses this dictionary to count the number of times each word appears. Don't worry about which line a word appears on for now. Post the code for the above first, and then ask for help with the remainder.
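
As a rough illustration of the counting step described above, here is a minimal sketch. It is written for Python 3 (the other snippets in this thread use Python 2 syntax), and the function name `count_words` and the sample lines are made up for the example. It treats a word as a run of `\w` characters, which is only one possible definition of a word.

```python
import re

def count_words(lines):
    """Return a dict mapping each lower-cased word to how often it occurs.

    `lines` can be any iterable of strings, including an open file object.
    """
    counts = {}
    for line in lines:
        # \w+ matches runs of letters, digits and underscores
        for word in re.findall(r"\w+", line.lower()):
            # dict.get returns 0 the first time a word is seen
            counts[word] = counts.get(word, 0) + 1
    return counts

# Hypothetical sample input standing in for the lines of a file
sample = ["The book and the spam", "a book"]
counts = count_words(sample)
for word in sorted(counts):
    print(word, counts[word])
```

Because an open file yields its lines when iterated, the same function should also work as `with open(fname) as f: counts = count_words(f)`.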

Edited 5 Years Ago by woooee: n/a

I got the count of the words that occur. I still need to work out how to find the line number on which each word first occurs.

import sys
import string
import re

# initialize the counters
linecount = 0
charcount = 0
wordcount = 0
# this is where I will store the total count of each word
words     = { }

# test text ...
text = """\
Just a simple text.
We can count the sentences!
Why do sentences have to end?

Every now and then a blank line.
Perhaps it will snow!

Wow, another blank line for the count.
That should do it for the test!"""

# write the test file
fname = "MyText1.txt"
fout = open(fname, "w")
fout.write(text)
fout.close()

# read the file back in
textf = open(fname, "r")

# iterate over each line on MyText1
for line in textf:
    linecount += 1
    charcount += len( line )

    # remove leading and trailing whitespace
    line = string.strip( line ) 

    # split the string into a list of words
    # a word is delimited by whitespace or punctuation
            #"[.,:;?! \t\n]+" , # this is the regex used in my perl version
    for word in re.split(
            "[" + string.whitespace + string.punctuation + "]+" ,
            line ) :

        # make the word lower case
        word = string.lower( word )

        # check to make sure the string is considered a word
        if re.match( "^[" + string.lowercase + "]+$" , word ) :
            wordcount += 1

            # if the word has been found before, increment its count
            # otherwise initialize its count to 1
            if words.has_key( word ) :
                words[ word ] += 1
            else :
                words[ word ] = 1


# Now print out the results of the count:
print
print "Number of lines:" , linecount
print "Total word count:" , wordcount
print "Total character count:" , charcount
print

# print each word and its count in sorted order
sorted_word_list = words.keys()
sorted_word_list.sort()

for word in sorted_word_list :
    print word , ":" , words[ word ]


Output:

Number of lines: 8
Total word count: 38
Total character count: 201

a : 1
akshay : 1
all : 1
artificial : 1
boss : 1
by : 1
can : 2
clas : 1
count : 1
first : 1
hi : 3
in : 1
intelligence : 1
is : 2
just : 1
program : 2
python : 2
says : 1
shell : 1
simple : 1
solve : 1
text : 1
the : 3
this : 2
to : 1
we : 2
words : 1
writing : 1

Edited 3 Years Ago by Reverend Jim: Fixed formatting

Your count is wrong: 'a', for example, occurs twice in the test text, but your output lists it only once. Here is a version that is not line-oriented, so that it does not hand you the answer directly:

import re

fname = "MyText1.txt"
# this is where I will store the total count of each word
words  = {}

# test text ...
open(fname, "w").write( """\
Just a simple text.
We can count the sentences!
Why do sentences have to end?

Every now and then a blank line.
Perhaps it will snow!

Wow, another blank line for the count.
That should do it for the test!""")

# read the file back in
with open(fname, "r") as textf:
    t = textf.read().lower()
    for word in re.findall( r"\w+" , t ) :
        words[word] = words.get(word,0) + 1

print t
# Now print out the results of the count:
print
print "Number of lines:" , t.count('\n')+1
print "Total word count:" , sum(words.values())
print "Total character count:" , len(t)
print

# print each word and its count in sorted order
for word in sorted(words) :
    print word , ":" , words[ word ]
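
For the remaining part of the assignment (the line number on which each word first occurs), one possible line-oriented approach is sketched below. It is written for Python 3, unlike the Python 2 code above, and the name `build_concordance` is invented for this example. It assumes a word is a run of ASCII letters and that line numbers start at 1; adjust both if your assignment defines them differently.

```python
import re

def build_concordance(lines):
    """Map each word to [first_line_number, occurrence_count]."""
    concordance = {}
    # enumerate(..., start=1) numbers the lines from 1
    for lineno, line in enumerate(lines, start=1):
        for word in re.findall(r"[a-z]+", line.lower()):
            if word in concordance:
                # seen before: keep the first line number, bump the count
                concordance[word][1] += 1
            else:
                # first occurrence: remember this line number
                concordance[word] = [lineno, 1]
    return concordance

# The test text used earlier in this thread
text = """\
Just a simple text.
We can count the sentences!
Why do sentences have to end?

Every now and then a blank line.
Perhaps it will snow!

Wow, another blank line for the count.
That should do it for the test!"""

concordance = build_concordance(text.splitlines())

# print word, first line number, and count in alphabetical order
for word in sorted(concordance):
    first_line, count = concordance[word]
    print(word, first_line, count)
```

To run it on a real file, pass the open file object instead of `text.splitlines()`, since iterating over a file yields its lines in order.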

Edited 5 Years Ago by pyTony: n/a
