I have been trying to find the frequency distribution of nouns in a given sentence. If I do this:

text = "This ball is blue, small and extraordinary. Like no other ball."
token_text= nltk.word_tokenize(text)
tagged_sent = nltk.pos_tag(token_text)
nouns= []
for word,pos in tagged_sent:
    if pos in ['NN',"NNP"]:
        nouns.append(word)
freq_nouns=nltk.FreqDist(nouns)
print freq_nouns

It treats "ball" and "ball." as separate words. So I went ahead and split the text into sentences before tokenizing the words:

text = "This ball is blue, small and extraordinary. Like no other ball."
sentences = nltk.sent_tokenize(text)                        
words = [nltk.word_tokenize(sent)for sent in sentences]    
tagged_sent = [nltk.pos_tag(sent)for sent in words]
nouns= []
for word,pos in tagged_sent:
    if pos in ['NN',"NNP"]:
        nouns.append(word)
freq_nouns=nltk.FreqDist(nouns)
print freq_nouns

It gives the following error:

Traceback (most recent call last):
File "C:\beautifulsoup4-4.3.2\Trial.py", line 19, in <module>
for word,pos in tagged_sent:
ValueError: too many values to unpack

What am I doing wrong? Please help.

You may also have problems with capitalized vs. all-lowercase words. Since I don't know what "nltk.pos_tag(token_text)" does and don't have the time to look it up, the following code simply counts words. It lowercases everything and drops every character other than letters, apostrophes and backticks, so that words like "don't" stay intact. This is not a complete solution, just the general idea.

from collections import defaultdict

dict_words = defaultdict(int)
text = "This ball is blue, this small and extraordinary. Like don't other ball."
for word in text.split():
    # keep only letters, apostrophes and backticks, lowercased
    this_word = []
    for ch in word:  # 'ch' avoids shadowing the built-in chr()
        if ("a" <= ch.lower() <= "z") or (ch in ["'", "`"]):
            this_word.append(ch.lower())
    # skip tokens that were pure punctuation
    if this_word:
        dict_words["".join(this_word)] += 1
print(dict_words)

Edited 3 Years Ago by woooee
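
The same counting idea can probably be written more compactly with a regular expression and collections.Counter. This is only a sketch of a variation, not the code above: it pulls out runs of letters, apostrophes and backticks instead of filtering each whitespace-separated word, so something like "a-b" would become two words here rather than one. The names words and word_counts are made up for the example.

```python
import re
from collections import Counter

text = "This ball is blue, this small and extraordinary. Like don't other ball."

# Lowercase the text, then extract runs of letters, apostrophes and backticks,
# so "ball." and "ball" both count as "ball" while "don't" stays intact.
words = re.findall(r"[a-z'`]+", text.lower())

word_counts = Counter(words)
print(word_counts.most_common(3))
```

Counter behaves like the defaultdict(int) above but adds helpers such as most_common().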

A few test prints should help ...

''' nltk_tokenize102.py
use module nltk to find the frequency distribution of nouns
in a given text

downloaded and installed:
nltk-2.0.4.win32.exe
from:
https://pypi.python.org/pypi/nltk

tested with Python27
'''

import nltk
import pprint

text = "This ball is blue, small and extraordinary. Like no other ball."
token_text= nltk.word_tokenize(text)

pprint.pprint(token_text)
print('-'*20)

tagged_sent = nltk.pos_tag(token_text)

pprint.pprint(tagged_sent)
print('-'*20)

nouns= []
for word, pos in tagged_sent:
    if pos in ['NN',"NNP"]:
        nouns.append(word)

pprint.pprint(nouns)
print('-'*20)

freq_nouns = nltk.FreqDist(nouns)
print(freq_nouns)

''' result ...
['This',
 'ball',
 'is',
 'blue',
 ',',
 'small',
 'and',
 'extraordinary.',
 'Like',
 'no',
 'other',
 'ball',
 '.']
--------------------
[('This', 'DT'),
 ('ball', 'NN'),
 ('is', 'VBZ'),
 ('blue', 'JJ'),
 (',', ','),
 ('small', 'JJ'),
 ('and', 'CC'),
 ('extraordinary.', 'NNP'),
 ('Like', 'NNP'),
 ('no', 'DT'),
 ('other', 'JJ'),
 ('ball', 'NN'),
 ('.', '.')]
--------------------
['ball', 'extraordinary.', 'Like', 'ball']
--------------------
<FreqDist: 'ball': 2, 'Like': 1, 'extraordinary.': 1>
'''

Edited 3 Years Ago by vegaseat

Thanks vegaseat!
When I tokenized the words, it somehow took "ball." as one word instead of tokenizing 'ball' and '.' separately.
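
As for the ValueError in the second snippet: tagging each sentence separately gives a list of lists, one inner list of (word, tag) pairs per sentence. The loop "for word,pos in tagged_sent" then tries to unpack a whole sentence into two names, which fails with "too many values to unpack". A nested loop should fix it. The sketch below uses hand-written tags as a stand-in for what nltk.pos_tag might return (the real tags may differ) and collections.Counter in place of nltk.FreqDist, so it runs without NLTK installed.

```python
from collections import Counter

# Hand-written stand-in (an assumption, not real tagger output) for the
# nested result of tagging each sentence separately:
# one inner list of (word, tag) pairs per sentence.
tagged_sents = [
    [("This", "DT"), ("ball", "NN"), ("is", "VBZ"), ("blue", "JJ"), (",", ","),
     ("small", "JJ"), ("and", "CC"), ("extraordinary", "JJ"), (".", ".")],
    [("Like", "IN"), ("no", "DT"), ("other", "JJ"), ("ball", "NN"), (".", ".")],
]

nouns = []
for sent in tagged_sents:       # outer loop: one tagged sentence at a time
    for word, pos in sent:      # inner loop: unpack each (word, tag) pair
        if pos in ("NN", "NNP"):
            nouns.append(word)

# Counter stands in for nltk.FreqDist here
noun_counts = Counter(nouns)
print(noun_counts)
```

With the real library, the same nested loop would go over the tagged_sent list built in the second snippet, and the nouns list could then be passed to nltk.FreqDist as before.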
