I have a string as below

"i am genie. who are you? we worked together in a hotel called xsis."

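Normally a string like the above is cleaned. Cleaning involves removing whitespace (and, judging by the kgrams below, punctuation). Thus the cleaned version of the above string would be

"iamgeniewhoareyouweworkedtogetherinahotelcalledxsis"
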
The cleaned string is then divided into kgrams of a fixed size. Assuming a kgram size of five, the kgrams for the above string would be

'iamge', 'amgen', 'mgeni', 'genie', ... , 'isiam', 'siamg'

I want to display the actual text from the original string for whichever kgram is chosen. For instance, if the kgram 'iamge' is chosen, the output would be 'i am ge'. Please suggest an approach, if possible with a Python implementation. Thanks in advance.

I'm not sure I understand how you are getting your string to look like that. Could you tell me how that occurs?

This is a simple dictionary with 'iamge' pointing to 'i am ge'. Most of the online tutorials cover dictionaries.
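A rough sketch of that dictionary idea, in case it helps. The function name kgram_lookup is made up for illustration, and the sketch assumes that only whitespace is stripped and that kgrams do not wrap around the end of the text:

```python
def kgram_lookup(text, k=5):
    """Map each whitespace-free kgram to the slice of text it came from."""
    # positions of the non-whitespace characters in the original text
    positions = [i for i, ch in enumerate(text) if not ch.isspace()]
    lookup = {}
    for n in range(len(positions) - k + 1):
        # slice from the first to the last character of the kgram
        start, end = positions[n], positions[n + k - 1] + 1
        kgram = ''.join(text[i] for i in positions[n:n + k])
        # assumption: keep the first occurrence if a kgram appears more than once
        lookup.setdefault(kgram, text[start:end])
    return lookup

table = kgram_lookup("i am genie. who are you? we worked together in a hotel called xsis.")
print(repr(table['iamge']))
```

The dictionary costs extra memory for every kgram, so it is mostly useful when the same lookups are repeated many times.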

And how do you intend to work with this data format?

Yes, richieking, I see that these n-grams start from every letter of the document, so my suggestion would be, instead of wooee's dict, to record something like a file position counter for the n-gram starting positions. That would add quite a lot to the space requirements, though. To speed things up you would keep the original text in memory in addition to the n-gram data.

Or you could store a tuple (n-gram, list of space indexes in the n-gram). This list of indexes would be easy to get as a by-product of the n-gramming process, by splitting the data into an n-gram stream and an index stream of spaces (a whole two to five of them).
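Something like the following could produce those tuples; treat it as a sketch only. The names kgrams_with_spaces and restore are invented here, and the sketch assumes that every piece of whitespace can be restored as a single plain space:

```python
def kgrams_with_spaces(text, k=5):
    """Yield (kgram, offsets) pairs; offsets mark where whitespace sat inside the kgram."""
    positions = [i for i, ch in enumerate(text) if not ch.isspace()]
    for n in range(len(positions) - k + 1):
        window = positions[n:n + k]
        kgram = ''.join(text[i] for i in window)
        # offset j is recorded once per whitespace character that preceded
        # the j-th character of the kgram in the original text
        offsets = [j for j in range(1, k)
                   for _ in range(window[j] - window[j - 1] - 1)]
        yield kgram, offsets

def restore(kgram, offsets):
    """Re-insert one space per recorded offset (tabs and newlines become spaces)."""
    out = []
    for j, ch in enumerate(kgram):
        out.append(' ' * offsets.count(j))
        out.append(ch)
    return ''.join(out)

t = "i am genie. who are you? we worked together in a hotel called xsis."
first = next(kgrams_with_spaces(t))
print(first, repr(restore(*first)))
```

With this layout the original text does not have to stay in memory, at the price of one small list per n-gram.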

Not sure exactly what you are trying to do. You could have a dictionary or list (whatever you want to use) count the letters up to each whitespace, then later re-insert the whitespace based on that dictionary or list. Of course it wouldn't be a permanent fix; each sentence or paragraph would have its own dictionary or list. Just an idea.
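To make the idea concrete, here is one possible sketch using a list. The helper names strip_spaces and reinsert_spaces are hypothetical, and the sketch records the original position of each whitespace character rather than a letter count:

```python
def strip_spaces(text):
    """Return the text without whitespace, plus where the whitespace came from."""
    cleaned, gaps = [], []
    for i, ch in enumerate(text):
        if ch.isspace():
            gaps.append((i, ch))   # remember where it was and what it was
        else:
            cleaned.append(ch)
    return ''.join(cleaned), gaps

def reinsert_spaces(cleaned, gaps):
    """Rebuild the original text from the cleaned text and the recorded gaps."""
    chars = list(cleaned)
    for i, ch in gaps:             # gaps are in ascending order of original position
        chars.insert(i, ch)
    return ''.join(chars)

text = "i am genie. who are you?"
cleaned, gaps = strip_spaces(text)
print(cleaned)
print(reinsert_spaces(cleaned, gaps))
```

Because the gaps are replayed in ascending order, each recorded position is already correct by the time it is inserted.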

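t = "i am genie. who are you? we worked together in a hotel called xsis."
k = 5  # kgram size
n = 6  # which kgram to show, counted in non-whitespace characters from the start
# positions in t of every non-whitespace character
indexes = [i for i, letter in enumerate(t) if not letter.isspace()]
# slice from the first to the last character of the kgram, keeping the spaces in between
print('kgram %i with spaces: %r' % (n, t[indexes[n]:indexes[n + k - 1] + 1]))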

t = "i am genie. who are you? we worked together in a hotel called xsis."
doubled = t + t  # doubling the text lets kgrams wrap around the end
# positions in doubled of every letter
indexes = [index for index, letter in enumerate(doubled) if letter.isalpha()]
count = len(indexes) // 2  # number of letters, and so of wrapped kgrams, in t
for k in (2, 3, 5):
    print('\n' + t)
    print('\n%igram with non-letters:' % k)
    # slice from the first to the last letter of each kgram, keeping what lies between
    print(', '.join(repr(doubled[indexes[n]:indexes[n + k - 1] + 1]) for n in range(count)))
    print('\n%igram only letters:' % k)
    print(','.join(repr(''.join(doubled[index] for index in indexes[n:n + k])) for n in range(count)))

Is this marked solved?