How can I get every token (word) and its previous token (previous word) from a text file?
For example, if the text file content is
"Every man has a price. Every woman has a price."

The first token (word) is "Every"; its previous token (previous word) is none (there is no previous word).
The second token (word) is "man"; its previous token is "Every".
The third token (word) is "has"; its previous token is "man".
And so on.

How do I go over the text file sequentially to get every pair of adjacent words from the file?

In the example, I want to get "Every man", then "man has", then "has a", then "a price",
and so on.

Please, I need help.

Unusual, but here is one way to do this ...

data = "Every man has a price. Every woman has a price?"
# optional remove punctuation marks
data = "".join(c for c in data if c not in '.,!?')

my_list = data.split(None)

pair_list = []
for ix, s in enumerate(my_list):
    if ix < len(my_list)-1:
        s += ' ' + my_list[ix+1]
    else:
        # finish with 'none'
        s += ' ' + 'none'
    pair_list.append(s)

print pair_list
"""
my output -->
['Every man', 'man has', 'has a', 'a price', 'price Every',
'Every woman', 'woman has', 'has a', 'a price', 'price none']
"""

It gives me an error when I try to integrate it with my code.
I modified your code, but it still gives me an error.

What I want is:
How can I get every token (word) and its previous token (previous word) from multiple files, along with the frequency of each two-word pair?

My code tries to get every single word and every double word (each token and its previous token) from multiple files and count the frequency of both. It works for single words, but the double words give this error:

line 50, in most_frequant_word
word1+= ' ' + word_list[ix+1]
IndexError: list index out of range

import __future__
import Tkinter as tk
import os, glob
import sys
import string
import re
import tkFileDialog

def most_frequant_word():
    browser = tkFileDialog.askdirectory()
    word_freq = {}
    word_freq1 = {}
    count11 = 0
    for root, dirs, files in os.walk(browser):
        text1.insert(tk.INSERT, 'Found %d dirs and %d files' % (len(dirs), len(files)))
        text1.insert(tk.INSERT, "\n")
        for idx, file in enumerate(files):
            ff = open(os.path.join(root, file), "r")
            text = ff.read()
            ff.close()
            word_list = text.split()
            my_list = text.split()
            count11 = len(word_list) + count11
            text1.insert(tk.INSERT, "total number of tokens %s" % pair_list)
            text1.insert(tk.INSERT, "\n")
            for ix, word in enumerate(word_list):
                word = word.lower()
                word = word.rstrip('.,/"\ -_;\[](){} ')
                # build the dictionary
                word1 = word
                word1+= ' ' + word_list[ix+1]
                count = word_freq.get(word, 0)
                word_freq[word] = count + 1
                count1 = word_freq1.get(word1, 0)
                word_freq1[word1] = count1 + 1
                # create a list of (freq, word) tuples
                freq_list = [(word, freq) for freq, word in word_freq.items()]
                freq_list1 = [(word1, freq1) for freq1, word1 in word_freq.items()]
                # sort the list by the first element in each tuple (default)
                freq_list.sort(reverse=True)
                freq_list1.sort(reverse=True)
            for n, tup in enumerate(freq_list1):
                text1.insert(tk.INSERT, "%s times: %s" % tup)
                text1.insert(tk.INSERT, "\n")

root = tk.Tk(className = " most_frequant_word")
# text entry field, width=width chars, height=lines text
v1 = tk.StringVar()
text1 = tk.Text(root, width=50, height=50, bg='green')
text1.pack()
# function listed in command will be executed on button click
button1 = tk.Button(root, text='Brows', command=most_frequant_word)
button1.pack(pady=5)
text1.focus()
root.mainloop()

The code is supposed to do the following.
For example, if the text file content is
"Every man has a price. Every woman has a price."

The first token (word) is "Every"; its previous token (previous word) is none (there is no previous word).
The second token (word) is "man"; its previous token is "Every".
The third token (word) is "has"; its previous token is "man".
The fourth token (word) is "a"; its previous token is "has".
The fifth token (word) is "price"; its previous token is "a".

The sixth token (word) is "Every"; its previous token is none (there is no previous word).
The seventh token (word) is "woman"; its previous token is "Every".
The eighth token (word) is "has"; its previous token is "woman".
The ninth token (word) is "a"; its previous token is "has".
The tenth token (word) is "price"; its previous token is "a".


The frequency of "has a" is 2 (it occurs once in each sentence).
The frequency of "a price" is 2 (it occurs once in each sentence).
The frequency of "Every man" is 1 (it occurs only once).
The frequency of "man has" is 1 (it occurs only once).
The frequency of "Every woman" is 1 (it occurs only once).
The frequency of "woman has" is 1 (it occurs only once).

Please, I need help.
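The IndexError comes from word_list[ix+1]: on the last word of a file there is no next word to read. Below is a minimal sketch of one way to get both frequency tables without indexing past the end. It is not a drop-in replacement for the Tkinter program: the function name count_words_and_pairs is made up for illustration, the strings in texts stand in for the contents of the files, and it assumes a sentence ends at '.', '!' or '?' so that the first word of each sentence has no previous word. It relies on collections.Counter, which needs Python 2.7 or later.

```python
import re
from collections import Counter

def count_words_and_pairs(texts):
    """Count single words and adjacent word pairs over several texts."""
    word_counts = Counter()
    pair_counts = Counter()
    for text in texts:
        # Assumption: a sentence ends at '.', '!' or '?'
        for sentence in re.split(r"[.!?]+", text):
            # strip leftover punctuation from both ends of every word
            words = [w.strip(",;:\"'()[]{}-_/") for w in sentence.split()]
            words = [w for w in words if w]
            word_counts.update(words)
            # zip stops at the shorter list, so the last word is never
            # paired with a missing next word
            pair_counts.update(zip(words, words[1:]))
    return word_counts, pair_counts

# each string stands in for the contents of one file
texts = ["Every man has a price. Every woman has a price."]
word_counts, pair_counts = count_words_and_pairs(texts)

# print the pairs, most frequent first
for (previous, token), freq in sorted(pair_counts.items(), key=lambda item: -item[1]):
    print("%d times: %s %s" % (freq, previous, token))
```

To use it with real files, you could read each file inside the os.walk loop and pass the contents in; the two counters keep accumulating across all of them.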

Here is a way I like to get at the words so you can count them; it might help you.

wordlist = string.split(" ")
for word in wordlist:
    print word

You could then index into the same list if you want two adjacent words together.

print wordlist[0], wordlist[1]

You should use len(wordlist) to find out the total number of words, then make a loop that goes through the list by index from the first word to the last.

Mr.Shadow14

It just prints the first two words.
When I make a loop, it gives the same error:
print word_list, word_list[i+1]
IndexError: list index out of range

With your way, how can I get the frequency of each two-word pair?

Is this what you mean?

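text = "Hi I hope this helps!"

wordlist = text.split(" ")

# the first word has no previous word
print "None", wordlist[0], "- 0"
for x in range(len(wordlist)):
    try:
        jstring = wordlist[x] + " " + wordlist[x+1]
        # str.count counts substring matches, so the number is only approximate
        print jstring + " - " + repr(text.count(jstring))

    except IndexError:
        # the last word has no next word
        print wordlist[x], "None", "- 0"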
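One caveat with counting through str.count: it counts substring matches, so a pair can also be found inside longer words. Here is a small sketch of the difference, using a made-up sentence and a plain dictionary (pair_freq is just an illustrative name). Stopping the loop at len(wordlist) - 1 also means x + 1 is always a valid index, so no try/except is needed.

```python
text = "she is he is"
wordlist = text.split()

# count exact word pairs in a dictionary
pair_freq = {}
for x in range(len(wordlist) - 1):
    pair = wordlist[x] + " " + wordlist[x + 1]
    pair_freq[pair] = pair_freq.get(pair, 0) + 1

# "he is" occurs once as a word pair, but str.count also finds it inside "she is"
print("exact pairs: %d" % pair_freq["he is"])
print("str.count:   %d" % text.count("he is"))
```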

It does not work with Arabic character files like (حالد جاء الى البيت).

It gives me garbage output like


I have put the test data in the attachment.

Please help me today, as soon as possible.

Your best bet would be to ask the person on python-forum.org who wrote this program. They would understand it better than anyone else. No one wants to spend their time answering a question that has already been answered on another forum.

It works perfectly; you just need to read about Unicode (UTF-8) and other data encodings. By default you are using ASCII, which Arabic characters do not fall under.
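As a rough illustration of the encoding point, here is a sketch that reads a file as UTF-8 text before splitting it into words. It assumes the files really are saved as UTF-8 (if they use another encoding, such as cp1256, that name would go in the encoding argument instead), and read_words is a made-up helper name. io.open is used because it accepts an encoding argument on both Python 2.6+ and Python 3; on Python 2 the script itself would also need a coding declaration at the top because of the Arabic literal.

```python
import io
import os
import tempfile

def read_words(path, encoding="utf-8"):
    """Decode the file with the given encoding and split it into words."""
    with io.open(path, "r", encoding=encoding) as handle:
        return handle.read().split()

# write the sample phrase from this thread to a temporary UTF-8 file
sample = u"حالد جاء الى البيت"
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with io.open(path, "w", encoding="utf-8") as handle:
    handle.write(sample)

words = read_words(path)
# adjacent word pairs work the same way once the text is decoded
pairs = list(zip(words, words[1:]))
os.remove(path)

print("%d words, %d pairs" % (len(words), len(pairs)))
```

Whether the words then display correctly still depends on the console or widget you print them to; the sketch only covers reading and splitting.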

Chris
