Word Frequency using Python

This program uses Python's re module to split a text file into words and remove some common punctuation marks. The word:frequency dictionary is then built using try/except. In honor of the 4th of July, the text analyzed is the national anthem of the USA (found via Google).

# another word frequency program, uses re
# tested with Python2.4.3   HAB

import re

# this one in honor of 4th July, or pick text file you have!!!!!!!
filename = 'NationalAnthemUSA.txt'

# create list of lower case words, \s+ --> match any whitespace(s)
# you can replace file(filename).read() with given string
word_list = re.split(r'\s+', file(filename).read().lower())
print 'Words in text:', len(word_list)

# create dictionary of word:frequency pairs
freq_dic = {}
# punctuation marks to be removed
punctuation = re.compile(r'[.?!,":;]') 
for word in word_list:
    # remove punctuation marks
    word = punctuation.sub("", word)
    # form dictionary
    try:
        freq_dic[word] += 1
    except KeyError:
        # first time this word is seen
        freq_dic[word] = 1
    

print 'Unique words:', len(freq_dic)

# create list of (key, val) tuple pairs
freq_list = freq_dic.items()
# sort by key or word
freq_list.sort()
# display result
for word, freq in freq_list:
    print word, freq
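
The listing above is Python 2 (print statement, file(), list-returning items()). For readers on Python 3, here is a rough port of the same idea, offered as a sketch rather than the author's code. It works on a string instead of a file, and the function name word_frequencies and the sample text are made up for illustration. It also skips the empty strings that re.split returns when the text starts or ends with whitespace.

```python
import re

# same punctuation class as the original program
PUNCTUATION = re.compile(r'[.?!,":;]')

def word_frequencies(text):
    """Return a dict mapping each lower-cased word to its count."""
    freq_dic = {}
    for word in re.split(r'\s+', text.lower()):
        # remove punctuation marks
        word = PUNCTUATION.sub('', word)
        if not word:
            # skip empty strings from leading/trailing whitespace
            continue
        try:
            freq_dic[word] += 1
        except KeyError:
            freq_dic[word] = 1
    return freq_dic

# illustrative sample text, not the full anthem
sample = "O say can you see, by the dawn's early light, what so proudly we hailed"
for word, freq in sorted(word_frequencies(sample).items()):
    print(word, freq)
```

To analyze a file instead, something like word_frequencies(open(filename).read()) should work, assuming the file's encoding matches the platform default.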
kenmeck03

How would you take this and sort the words by how often they appear, in descending order rather than alphabetically?

bumsfeld

Do you mean highest frequency first?

Simply add this to the end of the code:

print '-'*30

print "sorted by highest frequency first:"
# create list of (val, key) tuple pairs
freq_list2 = [(val, key) for key, val in freq_dic.items()]
# sort by val or frequency
freq_list2.sort(reverse=True)
# display result
for freq, word in freq_list2:
    print word, freq
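
One side effect of sorting (val, key) tuples with reverse=True is that words with the same frequency come out in reverse alphabetical order. If that matters, a possible alternative is to sort on a key that negates the count, so ties stay alphabetical. This is a Python 3 sketch; the helper name by_frequency and the sample counts are assumptions for illustration.

```python
def by_frequency(freq_dic):
    """Return (word, count) pairs, highest count first, ties alphabetical."""
    # -count sorts large counts first; the word breaks ties in ascending order
    return sorted(freq_dic.items(), key=lambda item: (-item[1], item[0]))

# made-up counts for illustration
freq_dic = {'the': 4, 'of': 2, 'brave': 2, 'land': 1}
for word, freq in by_frequency(freq_dic):
    print(word, freq)
# prints:
# the 4
# brave 2
# of 2
# land 1
```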
bipratikgoswami

this code is useful...

masterofpuppets

nice piece of code :) useful indeed

mattp23

line 21 onwards:

    # form dictionary
    try:
        freq_dic[word] += 1
    except:
        freq_dic[word] = 1

Could be replaced by:

freq_dic[word] = freq_dic.get(word,0) + 1

This gets rid of the try/except and makes things a little neater.
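
For comparison, here is a small Python 3 sketch that counts the same words with dict.get and with collections.Counter from the standard library. Counter was added in Python 2.7 and 3.1, so it would not have been available in the Python 2.4 used for the original program. The word list is made up for illustration.

```python
from collections import Counter

# made-up word list for illustration
words = ['the', 'land', 'of', 'the', 'free']

# counting with dict.get, as suggested above
freq_get = {}
for word in words:
    freq_get[word] = freq_get.get(word, 0) + 1

# counting with Counter, which does the same bookkeeping internally
freq_counter = Counter(words)

print(freq_get == dict(freq_counter))   # prints True
print(freq_counter.most_common(1))      # prints [('the', 2)]
```

Counter also provides most_common(), which covers the "highest frequency first" listing asked about earlier in the thread.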

nawaf_ali

What can I add to this code to remove words listed in another file before doing the frequency listing?
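
One possible approach, sketched in Python 3: read the unwanted words into a set, then drop any word found in that set before counting. The sketch assumes the other file holds one word per line; the helper names load_stop_words and remove_stop_words and the filename stopwords.txt are hypothetical.

```python
def load_stop_words(lines):
    """Build a set of lower-cased words from an iterable of lines.

    An open file object works here, since iterating it yields lines.
    """
    return set(line.strip().lower() for line in lines if line.strip())

def remove_stop_words(words, stop_words):
    """Return the words that are not in the stop-word set."""
    return [word for word in words if word not in stop_words]

# With a real file you might write (the filename is hypothetical):
#     with open('stopwords.txt') as stop_file:
#         stop_words = load_stop_words(stop_file)
stop_words = load_stop_words(['the\n', 'of\n', '\n', 'And\n'])

word_list = ['the', 'land', 'of', 'the', 'free', 'and', 'the', 'home']
print(remove_stop_words(word_list, stop_words))
# prints ['land', 'free', 'home']
```

In the original program the check would need to come after the punctuation is stripped, for example by skipping the word inside the loop when it is in stop_words, so that 'the,' is matched as 'the'.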

luisbeta04

This doesn't seem to remove any punctuation marks from the text file, and reads 'it' separately from 'it,'.

What might the problem be?
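
As posted, the pattern [.?!,":;] should turn 'it,' into 'it', which the first line of the sketch below checks. So the cause is probably elsewhere. One guess is that the text contains characters outside that class, such as parentheses, dashes, or typographic quotes. A possible alternative, sketched here in Python 3, is to strip every ASCII punctuation character from both ends of each word with str.strip and string.punctuation. The helper name clean_word is made up for illustration.

```python
import re
import string

# the pattern from the original program
PUNCTUATION = re.compile(r'[.?!,":;]')
print(PUNCTUATION.sub('', 'it,'))   # prints it

def clean_word(word):
    """Strip ASCII punctuation from both ends, keeping inner apostrophes."""
    return word.strip(string.punctuation)

print(clean_word('it,'))      # prints it
print(clean_word('(free)'))   # prints free
print(clean_word("dawn's"))   # prints dawn's
```

Note that string.punctuation covers ASCII only, so curly quotes or long dashes copied from a web page would still need to be handled separately.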

sujit.shakya.3

But does it work for "There" and "There's"?
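
With the original program, "There" is lower-cased to "there", but the apostrophe is not in the punctuation class, so "There's" becomes "there's" and is counted as a separate word. Whether the two should be merged is a policy choice. The Python 3 sketch below shows the default behavior and one optional way to fold a trailing 's into the base word. The function name normalize and the fold_apostrophe_s flag are hypothetical.

```python
import re

# the pattern from the original program
PUNCTUATION = re.compile(r'[.?!,":;]')

def normalize(word, fold_apostrophe_s=False):
    """Lower-case a word and strip punctuation, optionally dropping a final 's."""
    word = PUNCTUATION.sub('', word.lower())
    if fold_apostrophe_s and word.endswith("'s"):
        # treat "there's" as "there"; this also merges possessives like "dawn's"
        word = word[:-2]
    return word

print(normalize('There'), normalize("There's"))        # prints there there's
print(normalize("There's", fold_apostrophe_s=True))    # prints there
```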
