Word Frequency using Python

bumsfeld

This program uses Python's re module to split a text file into words and remove some common punctuation marks. The word:frequency dictionary is then built using try/except. In honor of the 4th of July, the text analyzed is the National Anthem of the USA (found via Google).


# another word frequency program, uses re
# tested with Python2.4.3   HAB

import re

# this one is in honor of the 4th of July; or pick any text file you have
filename = 'NationalAnthemUSA.txt'

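# create list of lower case words, \s+ --> match one or more whitespace characters
# strip() avoids empty strings from leading/trailing whitespace in the file
# you can replace file(filename).read() with a given string
word_list = re.split(r'\s+', file(filename).read().lower().strip())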
print 'Words in text:', len(word_list)

# create dictionary of word:frequency pairs
freq_dic = {}
# punctuation marks to be removed
punctuation = re.compile(r'[.?!,":;]') 
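for word in word_list:
    # remove punctuation marks
    word = punctuation.sub("", word)
    # skip tokens that were nothing but punctuation
    if not word:
        continue
    # form dictionary
    try:
        freq_dic[word] += 1
    except KeyError:
        # first time this word is seen
        freq_dic[word] = 1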

print 'Unique words:', len(freq_dic)

# create list of (key, val) tuple pairs
freq_list = freq_dic.items()
# sort by key, i.e. alphabetically by word
freq_list.sort()
# display result
for word, freq in freq_list:
    print word, freq

kenmeck03 0 Newbie Poster

How would you take this and organize the words that appear in descending order rather than alphabetical order?

bumsfeld 413 Nearly a Posting Virtuoso

Do you mean highest frequency first?

Simply add this to the end of the code:

print '-'*30

print "sorted by highest frequency first:"
# create list of (val, key) tuple pairs
freq_list2 = [(val, key) for key, val in freq_dic.items()]
# sort by val (frequency), highest first
freq_list2.sort(reverse=True)
# display result
for freq, word in freq_list2:
    print word, freq

bipratikgoswami 0 Newbie Poster

this code is useful...

masterofpuppets 19 Posting Whiz in Training

nice piece of code :) useful indeed

mattp23 0 Newbie Poster

Line 21 onwards (the try/except block that forms the dictionary):

    # form dictionary
    try:
        freq_dic[word] += 1
    except:
        freq_dic[word] = 1

Could be replaced by:

    freq_dic[word] = freq_dic.get(word, 0) + 1

This gets rid of the try/except and makes things a little neater.
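
For comparison, here is a minimal sketch of the dict.get() idiom next to collections.Counter, which does the same bookkeeping in one call. It is written for Python 3 rather than the Python 2.4.3 used above (Counter only arrived in Python 2.7 and 3.1), and the names count_with_get, count_with_counter and sample_words are made up for illustration.

```python
from collections import Counter


def count_with_get(words):
    """Count words using dict.get() with a default of 0."""
    freq = {}
    for word in words:
        # get() returns 0 the first time a word is seen, so no try/except is needed
        freq[word] = freq.get(word, 0) + 1
    return freq


def count_with_counter(words):
    """Count words by letting Counter do the bookkeeping."""
    return dict(Counter(words))


# hypothetical sample input, already split into words
sample_words = "o say can you see o say".split()
```

Both functions should return the same dictionary for the same input, so which one to use is mostly a matter of taste and of the Python version available.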

nawaf_ali 0 Newbie Poster

What can I add to this code to remove words listed in another file before doing the frequency listing?
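
One possible approach, sketched in Python 3 rather than the Python 2 of the original snippet: read the unwanted words into a set and skip them while counting. The helper names load_stop_words and word_frequencies are hypothetical, and the sketch assumes the other file simply holds words separated by whitespace.

```python
import re

# same punctuation marks as in the original snippet
PUNCTUATION = re.compile(r'[.?!,":;]')


def load_stop_words(path):
    """Read whitespace-separated words from a file into a lower-case set."""
    with open(path) as handle:
        return set(handle.read().lower().split())


def word_frequencies(text, stop_words=()):
    """Return a word:frequency dictionary, ignoring any word in stop_words."""
    freq = {}
    for word in text.lower().split():
        word = PUNCTUATION.sub("", word)
        # skip empty tokens and anything on the stop list
        if not word or word in stop_words:
            continue
        freq[word] = freq.get(word, 0) + 1
    return freq
```

Usage would look something like word_frequencies(text, load_stop_words('stopwords.txt')), where stopwords.txt is a placeholder file name. A set is used because membership tests on it stay fast even for a long stop list.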

luisbeta04 0 Newbie Poster

This doesn't seem to remove any punctuation marks from the text file, and it counts 'it' separately from 'it,'.

What might the problem be?
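
Without seeing the modified code it is hard to say. The character class in the snippet does include the comma, so 'it,' would normally be reduced to 'it' before it becomes a dictionary key. Note, though, that word_list itself is never changed, so printing word_list would still show the punctuation. A small check along these lines may help narrow it down; it is a Python 3 sketch with made-up names (normalize, count_words) that strips ASCII punctuation from both ends of each token using string.punctuation.

```python
import string


def normalize(token):
    """Lower-case a token and strip ASCII punctuation from both ends."""
    return token.lower().strip(string.punctuation)


def count_words(text):
    """Return a word:frequency dictionary built from normalized tokens."""
    freq = {}
    for token in text.split():
        word = normalize(token)
        # tokens made only of punctuation normalize to '' and are skipped
        if word:
            freq[word] = freq.get(word, 0) + 1
    return freq
```

string.punctuation only covers ASCII characters, so typographic marks such as curly quotes or long dashes copied from a web page would survive and could make two spellings of the same word look different.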

sujit.shakya.3 0 Newbie Poster

But does it work for "There" and "There's"?
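
The apostrophe is not in the snippet's punctuation class, so "there" and "there's" end up as two separate keys. Whether that is wanted depends on the analysis. The Python 3 sketch below shows one way to make the choice explicit; the names WORD and tokenize and the fold_contractions flag are hypothetical, and folding by cutting at the apostrophe is deliberately crude (it would also turn "don't" into "don").

```python
import re

# a run of letters, optionally followed by apostrophe + letters (e.g. there's)
WORD = re.compile(r"[a-z]+(?:'[a-z]+)*")


def tokenize(text, fold_contractions=False):
    """Split text into lower-case words, optionally dropping the part after an apostrophe."""
    words = WORD.findall(text.lower())
    if fold_contractions:
        # keep only the part before the first apostrophe: there's -> there
        words = [word.split("'")[0] for word in words]
    return words
```

The resulting list can be fed into the same counting loop as before in place of word_list.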
