I am trying to compute word frequencies using n-grams. I have taken the Brown corpus from NLTK and adapted it for n-gram calculations by adding <s> at the beginning of each sentence and </s> at the end (in place of the period). I need to calculate the n-gram frequencies from this modified corpus, but I am unsure how to go about it. My end goal is to generate random n-grams based on bigrams, trigrams and quadgrams.
How should I go about the calculations? Thank you.
import nltk
from nltk.corpus import brown

def alter_list(row):
    # Work on a copy so the sentence returned by the corpus reader is not modified.
    row = list(row)
    # Replace a sentence-final period with </s>; otherwise append </s>.
    if row[-1] == '.':
        row[-1] = '</s>'
    else:
        row.append('</s>')
    return ['<s>'] + row

# Sentences from the 'editorial' category, each one a list of tokens.
news = brown.sents(categories='editorial')
print(len(news), '\n')

for row in news:
    print(alter_list(row))
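For reference, here is a rough sketch of the kind of counting and sampling I have in mind, written in plain Python rather than with any NLTK helper. The function names (`ngrams`, `count_ngrams`, `build_model`, `generate`) and the toy sentences are made up for illustration. Padding with extra `<s>` tokens for trigrams and above is an assumption on my part, so that the first real word has a full context; it only handles n >= 2.

```python
import random
from collections import Counter, defaultdict

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return zip(*(tokens[i:] for i in range(n)))

def count_ngrams(sentences, n):
    """Count n-grams (n >= 2) over sentences that already start with <s> and end with </s>."""
    counts = Counter()
    for sent in sentences:
        # Assumption: add n-2 extra <s> tokens so every word has n-1 tokens of context.
        padded = ['<s>'] * (n - 2) + list(sent)
        counts.update(ngrams(padded, n))
    return counts

def build_model(counts):
    """Map each (n-1)-token context to a Counter of the tokens that follow it."""
    model = defaultdict(Counter)
    for gram, freq in counts.items():
        model[gram[:-1]][gram[-1]] += freq
    return model

def generate(model, n, rng=random, max_len=30):
    """Sample one sentence, choosing each next token in proportion to its count."""
    out = ['<s>'] * (n - 1)
    while out[-1] != '</s>' and len(out) < max_len + n:
        followers = model.get(tuple(out[-(n - 1):]))
        if not followers:
            break  # unseen context: stop rather than guess
        tokens, weights = zip(*followers.items())
        out.append(rng.choices(tokens, weights=weights)[0])
    return out[n - 2:]  # keep a single leading <s>

# Toy data standing in for the output of alter_list().
sents = [['<s>', 'the', 'cat', 'sat', '</s>'],
         ['<s>', 'the', 'dog', 'sat', '</s>']]

bigram_counts = count_ngrams(sents, 2)
print(bigram_counts[('<s>', 'the')])  # 2
print(generate(build_model(bigram_counts), 2))
```

Dividing a count by the total for its context would give a relative frequency, e.g. `model[('the',)]['cat'] / sum(model[('the',)].values())`. The same functions should work for trigrams and quadgrams by passing `n=3` or `n=4`.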