
Word Frequency using Python

 
 

This program uses Python's re module to split a text file into words and to remove some common punctuation marks. The word:frequency dictionary is then built using try/except. In honor of the 4th of July, the text analyzed is the national anthem of the USA (found via Google).

# another word frequency program, uses re
# tested with Python2.4.3   HAB

import re

# this one is in honor of the 4th of July, or pick any text file you have
filename = 'NationalAnthemUSA.txt'

# create list of lower case words, \s+ --> match any whitespace(s)
# you can replace file(filename).read() with a given string
# strip() keeps leading/trailing whitespace from adding empty "words"
word_list = re.split(r'\s+', file(filename).read().lower().strip())
print 'Words in text:', len(word_list)

# create dictionary of word:frequency pairs
freq_dic = {}
# punctuation marks to be removed
punctuation = re.compile(r'[.?!,":;]') 
for word in word_list:
    # remove punctuation marks
    word = punctuation.sub("", word)
    # form dictionary
    try:
        freq_dic[word] += 1
    except KeyError:
        # first time this word is seen
        freq_dic[word] = 1
    

print 'Unique words:', len(freq_dic)

# create list of (key, val) tuple pairs
freq_list = freq_dic.items()
# sort by key or word
freq_list.sort()
# display result
for word, freq in freq_list:
    print word, freq
 
 

How would you take this and list the words that appear in descending order rather than alphabetically?

 
 

Do you mean highest frequency first?

Simply add this to the end of the code:

print '-'*30

print "sorted by highest frequency first:"
# create list of (val, key) tuple pairs
freq_list2 = [(val, key) for key, val in freq_dic.items()]
# sort by val or frequency
freq_list2.sort(reverse=True)
# display result
for freq, word in freq_list2:
    print word, freq
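
A small variation, as a sketch: reversing the (val, key) tuples also reverses the alphabetical order among words that share a count. If ties should stay alphabetical, a sort key can handle it. The helper name by_frequency and the sample dictionary are made up for illustration, and the print() call form assumes Python 3 rather than the Python 2.4 the snippet was tested with.

```python
def by_frequency(freq_dic):
    """Return (word, count) pairs, highest count first, ties in alphabetical order."""
    # negating the count lets one ascending sort give descending frequency
    return sorted(freq_dic.items(), key=lambda pair: (-pair[1], pair[0]))

# illustrative data, not counts taken from the anthem
sample = {'the': 3, 'land': 2, 'brave': 2, 'free': 1}
for word, freq in by_frequency(sample):
    print(word, freq)
```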
 
 

this code is useful...

 
 

nice piece of code :) useful indeed

 
 

The try/except block inside the loop:

    # form dictionary
    try:
        freq_dic[word] += 1
    except:
        freq_dic[word] = 1

Could be replaced by:

    freq_dic[word] = freq_dic.get(word, 0) + 1

This gets rid of the try/except and makes things a little neater.
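
To see the get() version in context, here is a minimal sketch of the counting loop wrapped in a function. The function name count_words and the sample sentence are only for illustration, the empty-string check is an extra that the original loop does not have, and the code assumes Python 3.

```python
import re

# same punctuation marks as in the snippet
punctuation = re.compile(r'[.?!,":;]')

def count_words(text):
    """Return a word:frequency dictionary for the given string."""
    freq_dic = {}
    for word in text.lower().split():
        word = punctuation.sub("", word)
        if word:  # skip tokens that were nothing but punctuation
            # get() returns 0 for a word not seen before, so no try/except is needed
            freq_dic[word] = freq_dic.get(word, 0) + 1
    return freq_dic

print(count_words("The land of the free, and the home of the brave."))
```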

 
 

What can I add to this code to remove words listed in another file before doing the frequency listing?
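
One possible approach, as a sketch only: read the unwanted words into a set and filter the word list before counting. The file name stopwords.txt and the helper names load_stop_words and remove_stop_words are hypothetical, and the code assumes Python 3. The filter should run on words that have already had their punctuation removed, otherwise 'the,' will not match 'the'.

```python
def load_stop_words(path):
    """Return the whitespace-separated words in a file as a lower-case set."""
    with open(path) as handle:
        return set(handle.read().lower().split())

def remove_stop_words(word_list, stop_words):
    """Return word_list without the words found in stop_words."""
    return [word for word in word_list if word not in stop_words]

# in-memory demo; with a real file this would be
# stop_words = load_stop_words('stopwords.txt')   (hypothetical file name)
stop_words = set(['the', 'of', 'and'])
words = ['the', 'land', 'of', 'the', 'free', 'and', 'the', 'home', 'of', 'the', 'brave']
print(remove_stop_words(words, stop_words))
```

The filtered list can then go through the same counting loop as before.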

 
 

This doesn't seem to remove any punctuation marks from the text file, and reads 'it' separately from 'it,'.

What might the problem be?
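
Hard to say without seeing the text file, since the snippet's pattern does include the comma. One guess (an assumption, not a diagnosis) is that the text contains characters outside the class [.?!,":;], such as parentheses. A quick check, sketched below for Python 3, compares the original pattern with stripping all ASCII punctuation from both ends of each word. The helper name clean is made up for illustration, and string.punctuation covers ASCII only, so curly quotes would still need separate handling.

```python
import re
import string

# the snippet's original character class, for comparison
punctuation = re.compile(r'[.?!,":;]')

def clean(word):
    """Strip ASCII punctuation from both ends of a word, keeping inner apostrophes."""
    return word.strip(string.punctuation)

for raw in ['it,', '(it)', "there's", '"brave."']:
    print(repr(raw), '->', repr(punctuation.sub("", raw)), 'vs', repr(clean(raw)))
```

With either approach, "there" and "there's" remain two different words, because the apostrophe inside the word is kept.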

 
 

But does it work for "There" and "There's"?
