I just want to know if this is done right, and I would like some suggestions on the commented line.

msg = ...

import re
msg = re.sub("[^\w ]", " ", msg)

names = []

for x in msg.split():
    if (x[0].isupper()) and not (x[1].isupper()):  # I think this should be done in a different way: if the word is PS or something like that, it must not count
        names.append(x)

from collections import defaultdict

wordsCount = defaultdict(int)
for word in imena:
    wordsCount[word] += 1

for word, num in wordsCount.items():
    print word, num
  • You can precompile the regular expression: nonWordRE = re.compile(r'[^\w ]'). (I don't get why you have the space after the \w, and I think you need a trailing '+' so it matches runs of non-word characters...) (The ellipsis is a tiny self-referential joke.)
  • You can use the regular expression directly to split the sentence (http://docs.python.org/library/re.html#re.RegexObject.split), without doing the first substitution at line 4.
  • You can add the capitalized words directly to wordsCount; there is no need for the intermediate names list. (What is imena in your line 15?)
  • Instead of using split(), you can use a regular expression that matches only capitalized words and call findall() on the sentence (documented just after split(), mentioned above). This solves your issue at line 9, too. Note the r'\b' word-boundary special sequence.
  • I would prefer line 18 to be: for word, num in sorted(wordsCount.items()):
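
The suggestions above could be combined roughly as follows. This is only a sketch, not the one right way to do it: the pattern, the name CAPITALIZED_RE, and the function count_capitalized are my own assumptions for illustration. The pattern assumes a "name" is one ASCII uppercase letter followed by at least one lowercase letter, so "PS" is skipped, and so are single-letter words such as "I". It is written with print() as a function so it runs on Python 3, unlike the print statements elsewhere in this thread.

```python
import re
from collections import defaultdict

# Assumed pattern (not from the original post): one uppercase letter followed
# by one or more lowercase letters, bounded by \b on both sides, so all-caps
# tokens such as "PS" never match.
CAPITALIZED_RE = re.compile(r'\b[A-Z][a-z]+\b')


def count_capitalized(text):
    """Return a dict mapping each capitalized word in text to its count."""
    words_count = defaultdict(int)
    # findall() replaces both the substitution step and the split() loop.
    for word in CAPITALIZED_RE.findall(text):
        words_count[word] += 1
    return dict(words_count)


if __name__ == "__main__":
    sample = "Alice met Bob. PS: Alice left, and bob stayed."
    # sorted() gives alphabetical output instead of arbitrary dict order.
    for word, num in sorted(count_capitalized(sample).items()):
        print(word, num)
```

If you also want to count names with non-ASCII letters or inner capitals (such as "McDonald"), the pattern would need to be widened; this one deliberately skips them.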

Edited 6 Years Ago by griswolf: n/a

Here is one non-re way as well:

from itertools import groupby
import string
text = """Just a simple text.
We can count the words!
Why do words have to end?
Every now and then a blank line.
Perhaps it will snow!
Wow, another blank line for the count.
That should do it for the test!"""

# generator to get Title cased words
test = (word.strip(string.punctuation) for word in text.split() if word.istitle())

for word, thewords in groupby(sorted(test)):
    print "%s %s" % (len(list(thewords)), word)

Edited 6 Years Ago by pyTony: n/a
