Story Statistics (Python)

by vegaseat on Oct 6th, 2009
This Python code collects selected statistics from a story text. It counts lines, blank lines, sentences, and words, lists words and characters by frequency, and gives the average word length. It should be easy to add more statistics using the dictionaries it creates.
Last edited by vegaseat; Oct 6th, 2009 at 6:06 pm.
Python Code Snippet

# get some statistics of a story text
# count lines, sentences, words, frequent words ...
# tested with Python 2.5.4 and Python 3.1.1
# vegaseat 06oct2009

# test text (9 lines total, 2 blank lines, 8 sentences) ...
text = """\
Just a simple text we can use to count the sentences.
Looks like fun! Why do sentences have to end so soon?

Every now and then let's put in a blank line, so we
track those too. Perhaps something with a multitude of
characters.

Ah, another blank line for the count. Time for lunch!
That should do it for this longwinded test!"""

# write the test file
fname = "MyText1.txt"
fout = open(fname, "w")
fout.write(text)
fout.close()

# read the test file back in
# or change the filename to a text you have
textf = open(fname, "r")

# set all the counters to zero
lines = 0
blanklines = 0
# start with empty word list and character frequency dictionary
word_list = []
cf_dict = {}
# reads one line at a time
for line in textf:
    # count lines and blanklines
    lines += 1
    if line.startswith('\n'):
        blanklines += 1
    # create a list of words
    # split at any whitespace regardless of length
    word_list.extend(line.split())
    # create a character:frequency dictionary
    # all letters adjusted to lower case
    for char in line.lower():
        cf_dict[char] = cf_dict.get(char, 0) + 1

textf.close()

# create a word frequency dictionary
# all words in lower case
word_dict = {}
# a list of punctuation marks (could use string.punctuation)
punctuations = [",", ".", "!", "?", ";", ":"]
for word in word_list:
    # get last character of each word
    lastchar = word[-1]
    # remove any trailing punctuation marks from the word
    if lastchar in punctuations:
        word = word.rstrip(lastchar)
    # convert to all lower case letters
    word = word.lower()
    word_dict[word] = word_dict.get(word, 0) + 1

# assume that each sentence ends with '.' or '!' or '?'
sentences = 0
for key in cf_dict.keys():
    if key in '.!?':
        sentences += cf_dict[key]

number_words = len(word_list)

#print word_list # test
#print cf_dict # test
#print word_dict # test

# formatted prints will work with Python2 and Python3
print( "Total lines: %d" % lines )
print( "Blank lines: %d" % blanklines )
print( "Sentences : %d" % sentences )
print( "Words : %d" % number_words )

print('-' * 30)
# optional things ...
# average word length
num = float(number_words)
avg_wordsize = len(''.join([k*v for k, v in word_dict.items()]))/num

# most common words
mcw = sorted([(v, k) for k, v in word_dict.items()], reverse=True)

# most common characters
mcc = sorted([(v, k) for k, v in cf_dict.items()], reverse=True)

print( "Average word length : %0.2f" % avg_wordsize )
print( "3 most common words : %s" % mcw[:3] )
print( "3 most common characters: %s" % mcc[:3] )

"""my result -->
Total lines: 9
Blank lines: 2
Sentences : 8
Words : 62
------------------------------
Average word length : 4.08
3 most common words : [(3, 'for'), (3, 'a'), (2, 'we')]
3 most common characters: [(57, ' '), (31, 'e'), (30, 't')]
"""
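The description says more statistics can be added from the dictionaries the snippet builds. Here is a minimal sketch of that idea, not part of the original snippet. The names word_frequencies, extra_stats and the sample sentence are made up for illustration. It also tries the string.punctuation idea mentioned in the code comment, which strips punctuation from both ends of a word and so behaves a little differently from the original rstrip approach.

```python
import string

def word_frequencies(text):
    """Build a lowercase word:frequency dictionary from a text string."""
    freq = {}
    for word in text.split():
        # assumption: strip punctuation from both ends of the word,
        # the original snippet only strips one trailing mark
        word = word.strip(string.punctuation).lower()
        if word:
            freq[word] = freq.get(word, 0) + 1
    return freq

def extra_stats(word_dict):
    """Derive a few more statistics from a word:frequency dictionary."""
    unique_words = len(word_dict)
    longest_word = max(word_dict, key=len)
    # words that occur exactly once, in alphabetical order
    singles = sorted(w for w, n in word_dict.items() if n == 1)
    return unique_words, longest_word, singles

# hypothetical sample text, not the test text from the snippet
sample = "Time for lunch! Time for a long, long sandwich break."
wd = word_frequencies(sample)
unique_words, longest_word, singles = extra_stats(wd)

print("Unique words : %d" % unique_words)
print("Longest word : %s" % longest_word)
print("Used once    : %s" % singles)
```

The same extra_stats function should work on the word_dict from the snippet above, since it only assumes a dictionary that maps words to counts.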
Comments on this Code Snippet
Oct 19th, 2009
-1

Re: Story Statistics (Python)

Hi, how do I calculate the Fibonacci series in just 2 lines of Python? Please help!
Newbie Poster
bipratikgoswami
2 posts
since Aug 2009
Sep 29th, 2011
0

Re: Story Statistics (Python)

Can someone please explain to me the difference between these 2 lines:

    if lastchar in punctuations:

and

    if key in '.!?':

Why can we check against a list of string values in one case and against a single string in the other?
Light Poster
Skrell
29 posts
since Aug 2011
Sep 29th, 2011
0

Re: Story Statistics (Python)

You can use the 'in' test with any iterable (sequences, generators, and also sets):
>>> t = (1, 2, 3)
>>> 1 in t
True
>>> 5 in t
False
>>> t = '1231231231'
>>> 1 in t

Traceback (most recent call last):
  File "<pyshell#4>", line 1, in <module>
    1 in t
TypeError: 'in <string>' requires string as left operand, not int
>>> '1' in t
True
>>> '' in t # this is important to remember!
True
>>> 1 in (int(n) for n in t)
True
>>> t = set(t)
>>> t
set(['1', '3', '2'])
>>> '1' in t
True
>>> 1 in (int(n) for n in t)
True
Last edited by pyTony; Sep 29th, 2011 at 4:44 pm.
Industrious Poster
pyTony
4,198 posts
since Apr 2010
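To connect the session above back to the two lines asked about: with a list, 'in' compares the left operand to each element for equality, while with a string, 'in' is a substring test. A small sketch of the difference, assuming Python 3 for the print calls (the variable names are made up for illustration):

```python
punctuations = [",", ".", "!", "?", ";", ":"]

# list: 'in' checks whether some element equals the left operand
in_list_single = "!" in punctuations     # "!" is one of the elements
in_list_double = "!?" in punctuations    # no element equals "!?"

# string: 'in' checks whether the left operand is a substring
in_str_single = "!" in ".!?"
in_str_double = "!?" in ".!?"            # "!?" occurs inside ".!?"
in_str_empty = "" in ".!?"               # the empty string is in every string

print(in_list_single, in_list_double)    # True False
print(in_str_single, in_str_double, in_str_empty)  # True True True
```

In the snippet both operands are always single characters, so either form gives the same answer there. The list form is the stricter one if longer strings could ever turn up.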
Sep 29th, 2011
0

Re: Story Statistics (Python)

That doesn't really address my confusion. I guess I'm having a hard time understanding when something is a "list" and when it is a "set".
Light Poster
Skrell
29 posts
since Aug 2011
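On the list versus set question, a minimal sketch of the practical differences, assuming Python 3 for the print calls (the sample data is made up): a list is ordered, keeps duplicates, and can be indexed, while a set keeps each item only once, has no order, and cannot be indexed. Both support the 'in' test.

```python
words = ["a", "for", "a", "we", "for", "a"]

as_list = list(words)   # ordered, duplicates kept, indexable
as_set = set(words)     # unordered, each item stored once

list_len = len(as_list)
set_len = len(as_set)
first_item = as_list[0]            # indexing works on a list
in_both = ("we" in as_list, "we" in as_set)

# a set built from a string holds its distinct characters,
# which is what set(t) did in the interactive session above
chars = sorted(set("1231231231"))

print(list_len, set_len)   # 6 3
print(first_item)          # a
print(in_both)             # (True, True)
print(chars)               # ['1', '2', '3']
```

Trying as_set[0] would raise a TypeError, because a set has no positions to index.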