Hello,

I am trying to generate word frequencies using n-grams. I have taken the Brown corpus from NLTK and adapted it for n-gram calculations by adding <s> and </s> at the beginning and end of each sentence (</s> takes the place of the final period). I need to calculate the frequencies from this corpus but am unsure how to go about it. My end goal is to generate random n-grams based on bigrams, trigrams and quadgrams.

How can I go about the calculations? Thank you.

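from nltk.corpus import brown

def alter_list(row):
    # Work on a copy so the corpus sentence itself is not modified.
    row = list(row)
    # Replace a final period with the end marker, otherwise append the marker.
    if row and row[-1] == '.':
        row[-1] = '</s>'
    else:
        row.append('</s>')
    # Prepend the start marker.
    return ['<s>'] + row

# Sentences from the 'editorial' category, each one a list of tokens.
news = brown.sents(categories='editorial')
print(len(news), '\n')

# Print every sentence with its boundary markers added.
for row in news:
    print(alter_list(row))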
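
One possible way to get the frequencies is to slide a window of size n over each marked-up sentence and tally the resulting tuples with collections.Counter. This is only a minimal sketch: the helper names ngrams and count_ngrams are made up for illustration, and the two-sentence sample is a hypothetical stand-in for the output of alter_list so the snippet runs without the corpus.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_ngrams(sentences, n):
    """Count n-grams sentence by sentence, so none span a sentence boundary."""
    counts = Counter()
    for sent in sentences:
        counts.update(ngrams(sent, n))
    return counts

# Hypothetical toy data standing in for the marked-up Brown sentences.
sample = [
    ['<s>', 'the', 'dog', 'barks', '</s>'],
    ['<s>', 'the', 'dog', 'sleeps', '</s>'],
]

bigram_counts = count_ngrams(sample, 2)
trigram_counts = count_ngrams(sample, 3)
print(bigram_counts.most_common(3))
```

With the real corpus, the sample list would be replaced by something like [alter_list(row) for row in news], and n = 4 would give the quadgram counts.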

All 3 Replies

mark_sentance is undefined, and alter_list is never called. Why take a slice of only 5 news sentences, and how long are they? You call them row.

Sorry, that should be fixed now, and the code shows the whole corpus!

After some more looking around, I think this equation will do just fine; I just need a little help implementing it. What would be the best way to go about doing this?

Equation image here: http://cl.ly/image/2R0G3B2q1v0S
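
The linked image is not reproduced in this thread, so the following is only a sketch under an assumption: that the equation is the usual relative-frequency estimate, P(word | context) = count(context followed by word) / count(context). The names build_model, probability and generate are hypothetical, the sample sentences are toy data, and n is assumed to be at least 2.

```python
import random
from collections import Counter

def build_model(sentences, n):
    """Map each (n-1)-token context to a Counter of the tokens that follow it."""
    model = {}
    for sent in sentences:
        # Each sentence already starts with one '<s>'; add more so the
        # first real word has a full context of start markers.
        padded = ['<s>'] * (n - 2) + sent
        for i in range(len(padded) - n + 1):
            context = tuple(padded[i:i + n - 1])
            model.setdefault(context, Counter())[padded[i + n - 1]] += 1
    return model

def probability(model, context, word):
    """Assumed estimate: count(context, word) / count(context)."""
    followers = model.get(tuple(context), Counter())
    total = sum(followers.values())
    return followers[word] / total if total else 0.0

def generate(model, n, rng=random, max_len=30):
    """Sample one sentence, drawing each next token in proportion to its count."""
    context = ('<s>',) * (n - 1)
    words = []
    while len(words) < max_len:
        followers = model.get(context)
        if not followers:
            break
        word = rng.choices(list(followers), weights=list(followers.values()))[0]
        if word == '</s>':
            break
        words.append(word)
        # Slide the context window forward by one token.
        context = (context + (word,))[1:]
    return words

# Hypothetical toy data standing in for the marked-up Brown sentences.
sample = [
    ['<s>', 'the', 'dog', 'barks', '</s>'],
    ['<s>', 'the', 'dog', 'sleeps', '</s>'],
]

bigram_model = build_model(sample, 2)
print(probability(bigram_model, ['dog'], 'barks'))
print(' '.join(generate(bigram_model, 2)))
```

Calling build_model with n = 3 or n = 4 gives the trigram and quadgram versions. Unseen contexts get probability 0 here; smoothing would be a separate step if the equation includes it.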
