Generating N-grams from a word

Question

leftyb 0 Newbie Poster

18 Years Ago

Hi,

I am using this Python code to get the n-grams for a word:

# N-gram size
N = 6

# read the whole input file
f_in = open("test.txt", 'r')
ln = f_in.read()

# print every N-character slice, letters separated by spaces
wlen = len(ln)
i = 0
while i < wlen - N + 1:
    print(" ".join(ln[i:i+N]))
    i = i + 1

# close file
f_in.close()

The file "test.txt" contains the word "text".

The result I get for N = 2 is (te, ex, xt),
but the correct result is (=t, te, ex, xt, t=), where = stands for a space.
Also, the biggest N I can use is N = 4, the number of letters, but I want to use bigger values,
e.g. N = 5, where the result would be (=text, text=, ext==, xt===, t====).

Any ideas on how to solve this would be very helpful.
Thanks

Edit: Put code tags around script vegaseat

python

IamRasheed 0 Newbie Poster · Answer 1 · 2006-02-03T19:35:12+00:00

Here is an example that works. The trick is to add N - 1 spaces to both the front and the back of the input string, so that it is padded in advance. Then it is simply a matter of taking the appropriate slices and collecting them in the list that the function returns.

#!/usr/bin/env python

# File: n-gram.py

def N_Gram(N, text):
    NList = []                      # start with an empty list
    if N > 1:
        space = " " * (N - 1)       # build N - 1 spaces
        text = space + text + space # add them both in front and back
    # append the slices [i:i+N] to NList
    for i in range(len(text) - (N - 1)):
        NList.append(text[i:i+N])
    return NList                    # return the list

# test code
for i in range(5):
    print(N_Gram(i + 1, "text"))

# more test code
nList = N_Gram(7, "Here is a lot of text to print")
for ngram in nList:
    print('"' + ngram + '"')

The function N_Gram outputs exactly what you seem to want.
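
To check the padding approach against the examples in the question, here is a minimal sketch in Python 3 syntax. The name `padded_ngrams` and the `pad` parameter are assumptions added for illustration and do not appear in the posts above; passing `pad="="` makes the padding visible the way the question writes it.

```python
def padded_ngrams(n, text, pad=" "):
    """Return the character n-grams of text after padding both ends.

    Hypothetical helper: n - 1 copies of pad are added in front and
    at the back, then every slice of length n is collected.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    padding = pad * (n - 1)
    padded = padding + text + padding
    # A string of length L has L - n + 1 slices of length n.
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


print(padded_ngrams(2, "text", pad="="))
# ['=t', 'te', 'ex', 'xt', 't=']

print(len(padded_ngrams(5, "text", pad="=")))
# 8
```

For N = 2 this matches the list in the question. For N = 5 it yields eight n-grams, from `====t` through `t====`; the five listed in the question are the last five of those, so slice the result if only those are wanted.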

Good luck and happy coding.
_____
René

leftyb 0 Newbie Poster · Answer 2 · 2006-02-06T01:07:40+00:00

That is working perfectly, thank you very much. Really, I am very happy. Thank you. Lefteris

adi_vkool 0 Newbie Poster · Answer 3 · 2009-03-17T01:51:06+00:00

Hey, what should I do to find unigrams? Any help would be appreciated.
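
Unigrams are simply the N = 1 case, so no padding is involved and each n-gram is a single item. A minimal sketch in Python 3 syntax follows; the function names are hypothetical, and since the post does not say whether character or word unigrams are meant, both are shown, with whitespace splitting as an assumed (and deliberately simple) way to find words.

```python
def char_unigrams(text):
    """Character unigrams: every slice of length 1."""
    return [text[i:i + 1] for i in range(len(text))]


def word_unigrams(text):
    """Word unigrams, assuming words are separated by whitespace."""
    return text.split()


print(char_unigrams("text"))
# ['t', 'e', 'x', 't']

print(word_unigrams("Here is a lot of text"))
# ['Here', 'is', 'a', 'lot', 'of', 'text']
```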

KNatali 0 Newbie Poster · Answer 4 · 2010-03-03T21:47:44+00:00

How serious is the sparse data problem? Investigate the performance of n-gram taggers as n increases from 1 to 6. Tabulate the accuracy score. Estimate the training data required for these taggers, assuming a vocabulary size of 10^5 and a tagset size of 10^2. Please help me to solve this exercise!
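
One rough way to approach the "training data required" part is to count how many distinct contexts an n-gram tagger could meet. The sketch below is only an estimate under stated assumptions: it takes the context to be the current word plus the previous n - 1 tags, so the count is V * T^(n-1), with V = 10^5 and T = 10^2 as in the exercise. The names are hypothetical, and the sketch does not measure accuracy, which needs a tagged corpus.

```python
VOCAB_SIZE = 10 ** 5   # vocabulary size given in the exercise
TAGSET_SIZE = 10 ** 2  # tagset size given in the exercise


def possible_contexts(n, vocab_size=VOCAB_SIZE, tagset_size=TAGSET_SIZE):
    """Count (previous n - 1 tags, current word) combinations.

    Assumption: an n-gram tagger conditions on the current word and
    the n - 1 preceding tags, giving vocab_size * tagset_size ** (n - 1).
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return vocab_size * tagset_size ** (n - 1)


# Tabulate the number of possible contexts for n = 1..6.
for n in range(1, 7):
    print(n, possible_contexts(n))
```

The count grows by a factor of 100 with each step in n, from 10^5 for a unigram tagger to 10^15 for n = 6, which is why the training data needed to see most contexts quickly becomes unrealistic.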