Hey all,

I have a text file and want to find the 40 most frequently used words in it. I've managed to do that. However, I also have a second text file containing hundreds of "stop words," and my program needs to ignore those while it loops through the first file and counts. I seem to be missing something, because I just can't figure this out.

Thanks in advance for the help!

Here is the code I have thus far:

from string import punctuation

#opens empty list, reads stopWords.txt
#adds all words in stopWords.txt to open list
stopWordsList = ['']
stopWordsText = open("stopWords.txt", 'r')

for words in stopWordsText:
    #strip the newline first, otherwise trailing punctuation is never removed
    words = words.strip().strip(punctuation).lower()
    stopWordsList.append(words)

#finds the top 40 words in debate.txt
#prints out the word and the frequency of the word
def sort_items(x, y):
    """Sort by value first, and by key (reverted) second."""
    return cmp(x[1], y[1]) or cmp(y[0], x[0])

N = 40
words = {}

words_gen = (word.strip(punctuation).lower() for line in open("debate.txt")
                                             for word in line.split())
                                             
for word in words_gen:
    words[word] = words.get(word, 0) + 1


top_words = sorted(words.iteritems(), cmp=sort_items, reverse=True)[:N]
      
for word, frequency in top_words:
    print "%s: %d" % (word, frequency)

All 3 Replies

Try replacing line 26 of your code (the words[word] = words.get(word, 0) + 1 line inside the for loop) with:

if word not in stopWordsList: words.get(word, 0) + 1

You mean:

if word not in stopWordsList: words[word] = words.get(word, 0) + 1
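
To illustrate why the assignment matters, here is a minimal sketch. The token list, the stop words, and both function names are made up for the demonstration and are not from the original program. Without `words[word] = ...`, the expression `words.get(word, 0) + 1` is evaluated and then discarded, so the dictionary never changes.

```python
# Hypothetical sample data, standing in for the real files.
stop_words = {"the", "a"}
tokens = ["the", "cat", "a", "cat"]


def count_without_assignment(tokens, stop_words):
    counts = {}
    for word in tokens:
        if word not in stop_words:
            counts.get(word, 0) + 1  # value is computed, then thrown away
    return counts


def count_with_assignment(tokens, stop_words):
    counts = {}
    for word in tokens:
        if word not in stop_words:
            counts[word] = counts.get(word, 0) + 1  # value is stored
    return counts


print(count_without_assignment(tokens, stop_words))  # {}
print(count_with_assignment(tokens, stop_words))     # {'cat': 2}
```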

Oops, extremely sorry for the error!

if word not in stopWordsList: words[word] = words.get(word, 0) + 1
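
For anyone finding this thread later, here is one possible way to put the whole thing together. It is only a sketch, not the original poster's final program. It is written in Python 3 style (the thread's code is Python 2), the helper names `load_stop_words` and `top_words` are invented for this example, and the sample lines stand in for stopWords.txt and debate.txt. It keeps the stop words in a set, so each `not in` check does not scan a whole list, and it uses `collections.Counter` for the counting.

```python
from collections import Counter
from string import punctuation


def load_stop_words(lines):
    """Build a set of normalized stop words from an iterable of lines."""
    stop_words = set()
    for line in lines:
        # Strip whitespace first so trailing punctuation can be removed too.
        word = line.strip().strip(punctuation).lower()
        if word:
            stop_words.add(word)
    return stop_words


def top_words(lines, stop_words, n=40):
    """Return the n most common (word, count) pairs, skipping stop words."""
    counts = Counter()
    for line in lines:
        for token in line.split():
            word = token.strip(punctuation).lower()
            # Skip empty tokens (pure punctuation) and stop words.
            if word and word not in stop_words:
                counts[word] += 1
    return counts.most_common(n)


# Hypothetical sample data standing in for stopWords.txt and debate.txt.
stop_lines = ["the\n", "a\n", "and\n"]
debate_lines = ["The economy, and the economy!\n", "A plan for the economy.\n"]

stop_words = load_stop_words(stop_lines)
for word, frequency in top_words(debate_lines, stop_words):
    print("%s: %d" % (word, frequency))
```

Both helpers accept any iterable of lines, so open file objects can be passed in place of the sample lists. One difference from the original: `most_common` does not break ties alphabetically the way `sort_items` does.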