Hello,

I am trying to generate word frequencies using n-grams. I took the Brown corpus from NLTK and adapted it for n-gram calculations by adding <s> and </s> at the beginning and end of each sentence (the </s> takes the place of the final period). I need to calculate the frequencies from this data but am unsure how to go about it. My end goal is to generate random n-grams based on bigrams, trigrams and quadgrams.

How should I go about the calculations? Thank you.

from nltk.corpus import brown

def alter_list(row):
    # Wrap one sentence in <s> ... </s>; a final period is replaced by </s>.
    row = list(row)  # copy, so the corpus sentence is not modified in place
    if row and row[-1] == '.':
        row[-1] = '</s>'
    else:
        row.append('</s>')
    return ['<s>'] + row

# Sentences (lists of tokens) from the 'editorial' category of the Brown corpus.
news = brown.sents(categories='editorial')
print(len(news), '\n')

for row in news:
    print(alter_list(row))
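
One way the frequencies could be counted from the padded sentences, as a minimal sketch only. It uses just the standard library, and the names ngrams, ngram_counts and padded are made up for illustration; the two toy sentences stand in for the output of alter_list.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all n-grams (as tuples) found in a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_counts(sentences, n):
    """Count every n-gram over an iterable of token lists."""
    counts = Counter()
    for sent in sentences:
        counts.update(ngrams(sent, n))
    return counts

# Toy stand-in for the padded Brown sentences (hypothetical data).
padded = [
    ['<s>', 'the', 'cat', 'sat', '</s>'],
    ['<s>', 'the', 'cat', 'ran', '</s>'],
]

bigram_counts = ngram_counts(padded, 2)
print(bigram_counts[('the', 'cat')])  # prints 2
```

Calling ngram_counts(padded, 3) or ngram_counts(padded, 4) would give trigram and quadgram counts in the same way. N-grams are counted per sentence here, so none of them span a sentence boundary.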

Edited by pyTony: Unindented the body text


mark_sentance is undefined, and alter_list is never called. Why do you take a slice of only 5 items from news, and how long are they? You call them row.

After some more looking around, I think this equation will do just fine; I just need a little help implementing it. What would be the best way to go about this?

Equation image here: http://cl.ly/image/2R0G3B2q1v0S
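
The equation in the linked image is not reproduced in this thread, so the following is only a sketch under an assumption: that it is the usual relative-frequency estimate, P(word | context) = count(context, word) / count(context). The names build_model, probability and generate are hypothetical, the toy sentences stand in for the padded Brown data, and n is assumed to be at least 2.

```python
import random
from collections import Counter, defaultdict

def build_model(sentences, n):
    """Map each (n-1)-token context to a Counter of the tokens that follow it."""
    model = defaultdict(Counter)
    for sent in sentences:
        for i in range(len(sent) - n + 1):
            context = tuple(sent[i:i + n - 1])
            model[context][sent[i + n - 1]] += 1
    return model

def probability(model, context, word):
    """Assumed estimate: count(context, word) / count(context)."""
    followers = model.get(tuple(context))
    if not followers:
        return 0.0
    return followers[word] / sum(followers.values())

def generate(model, n, max_len=20, rng=random):
    """Sample tokens until </s> or max_len is reached (requires n >= 2)."""
    # Start from a context that begins a sentence, weighted by how often it occurs.
    starts = [c for c in model if c[0] == '<s>']
    weights = [sum(model[c].values()) for c in starts]
    tokens = list(rng.choices(starts, weights=weights)[0])
    while tokens[-1] != '</s>' and len(tokens) < max_len:
        followers = model.get(tuple(tokens[-(n - 1):]))
        if not followers:
            break  # unseen context, nothing to sample from
        words = list(followers)
        tokens.append(rng.choices(words, weights=list(followers.values()))[0])
    return tokens

# Toy stand-in for the padded Brown sentences (hypothetical data).
padded = [
    ['<s>', 'the', 'cat', 'sat', '</s>'],
    ['<s>', 'the', 'cat', 'ran', '</s>'],
]

bigram_model = build_model(padded, 2)
print(probability(bigram_model, ('cat',), 'sat'))  # prints 0.5
print(generate(bigram_model, 2))
```

The same functions should work for trigrams and quadgrams by passing n=3 or n=4 to both build_model and generate. No smoothing is applied, so any unseen n-gram gets probability 0.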
