Word Frequency using Python

bumsfeld | Jul 4th, 2006, 1:57 pm
This program uses Python's re module to split a text file into words and to strip some common punctuation marks. The word:frequency dictionary is then built using try/except. In honor of the 4th of July, the text analyzed is the national anthem of the USA (found via Google).
Python Syntax
  1. # another word frequency program, uses re
  2. # tested with Python2.4.3 HAB
  3.  
  4. import re
  5.  
  6. # this one in honor of 4th July, or pick text file you have!!!!!!!
  7. filename = 'NationalAnthemUSA.txt'
  8.  
  9. # create list of lower case words, \s+ --> match any whitespace(s)
  10. # you can replace file(filename).read() with given string
  11. word_list = re.split('\s+', file(filename).read().lower())
  12. print 'Words in text:', len(word_list)
  13.  
  14. # create dictionary of word:frequency pairs
  15. freq_dic = {}
  16. # punctuation marks to be removed
  17. punctuation = re.compile(r'[.?!,":;]')
  18. for word in word_list:
  19.     # remove punctuation marks
  20.     word = punctuation.sub("", word)
  21.     # form dictionary
  22.     try:
  23.         freq_dic[word] += 1
  24.     except:
  25.         freq_dic[word] = 1
  26.  
  27.  
  28. print 'Unique words:', len(freq_dic)
  29.  
  30. # create list of (key, val) tuple pairs
  31. freq_list = freq_dic.items()
  32. # sort by key or word
  33. freq_list.sort()
  34. # display result
  35. for word, freq in freq_list:
  36.     print word, freq
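
For anyone running this on Python 3, where print is a function and the file() built-in is gone, the same idea can be sketched roughly as below. This is only a sketch under a few assumptions, not the original author's code: the function name word_frequencies and the sample string are made up for illustration, and it takes a string instead of a file name so it runs as-is. Unlike the original, it also skips tokens that are empty once the punctuation is stripped.

```python
import re

# Same punctuation set as the original post.
PUNCTUATION = re.compile(r'[.?!,":;]')


def word_frequencies(text):
    """Return a dict mapping each lower-cased word in text to its count.

    Hypothetical helper for illustration; not part of the original program.
    """
    freq_dic = {}
    # str.split() with no arguments splits on runs of whitespace and,
    # unlike re.split(r'\s+', ...), never yields empty strings at the ends.
    for word in text.lower().split():
        word = PUNCTUATION.sub("", word)
        if not word:
            # the token was punctuation only, e.g. a lone "!"
            continue
        try:
            freq_dic[word] += 1
        except KeyError:
            freq_dic[word] = 1
    return freq_dic


if __name__ == "__main__":
    # Made-up sample text; use open(filename).read() to analyze a file.
    sample = 'The cat saw the dog. "The dog," said the cat, "ran!"'
    freq = word_frequencies(sample)
    print('Unique words:', len(freq))
    # sorted() on the items gives alphabetical order by word
    for word, count in sorted(freq.items()):
        print(word, count)
```

Catching KeyError instead of using a bare except keeps unrelated errors from being silently counted as a first occurrence.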

kenmeck03 | Oct 12th, 2009
How would you take this and organize the words that appear in descending order, not alphabetically?
 
bumsfeld | Oct 12th, 2009
Do you mean highest frequency first?

Simply add this to the end of the code:
print '-'*30

print "sorted by highest frequency first:"
# create list of (val, key) tuple pairs
freq_list2 = [(val, key) for key, val in freq_dic.items()]
# sort by val or frequency
freq_list2.sort(reverse=True)
# display result
for freq, word in freq_list2:
    print word, freq
Last edited by bumsfeld; Oct 12th, 2009 at 4:06 pm.
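
One detail of the (val, key) approach: sorting the tuples with reverse=True also reverses the alphabetical order among words that have the same count. If ties should stay alphabetical, a sort key can handle both directions at once. The following is a minimal Python 3 sketch of that idea; the helper name by_frequency and the sample counts are made up for illustration.

```python
def by_frequency(freq_dic):
    """Return (word, count) pairs, highest count first, ties alphabetical.

    Hypothetical helper for illustration.
    """
    # Negating the count sorts it descending while the word stays ascending.
    return sorted(freq_dic.items(), key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    # Made-up counts for the demo.
    freq_dic = {'the': 4, 'dog': 2, 'ran': 1, 'cat': 2}
    for word, freq in by_frequency(freq_dic):
        print(word, freq)
```

Here 'cat' is listed before 'dog' even though both occur twice, whereas the reversed tuple sort would put 'dog' first.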
 
bipratikgoswami | Oct 19th, 2009
This code is useful.
 
masterofpuppets | Oct 19th, 2009
Nice piece of code, useful indeed.
 
mattp23 | 27 Days Ago
line 21 onwards:
# form dictionary
try:
    freq_dic[word] += 1
except:
    freq_dic[word] = 1
Could be replaced by:
freq_dic[word] = freq_dic.get(word, 0) + 1
This gets rid of the try/except and makes things a little neater.
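
For comparison, the standard library offers two more ways to do the same counting: collections.defaultdict and collections.Counter. Both are newer than the Python 2.4 that the original program was tested with, so treat this as a Python 3 sketch; the word list is made up for the demo.

```python
from collections import Counter, defaultdict

# Made-up word list for the demo.
words = ['the', 'cat', 'the', 'dog', 'the', 'cat']

# 1. dict.get with a default of 0, as in the suggestion above
freq_get = {}
for word in words:
    freq_get[word] = freq_get.get(word, 0) + 1

# 2. defaultdict(int): a missing key starts at int(), which is 0
freq_default = defaultdict(int)
for word in words:
    freq_default[word] += 1

# 3. Counter runs the counting loop itself
freq_counter = Counter(words)

# All three end up with the same counts.
print(freq_get == dict(freq_default) == dict(freq_counter))
# most_common(n) returns the n highest counts as (word, count) pairs
print(freq_counter.most_common(1))
```

Counter also covers the highest-frequency-first question from earlier in the thread, since most_common() with no argument returns every pair ordered from highest count to lowest.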
 
 
