I wrote code to get the most frequent words in a set of files.
I want to implement bigram probability by modifying the code to do the following:
How can I get every token (word), its previous token (previous word), the frequency, and the probability
from a text file and put each one in a cell of a table?

For example if the text file content is
"Every man has a price. Every woman has a price."

First Token(word) is "Every" PreviousToken(Previous word) is none (no previous)
Second Token(word) is "man" PreviousToken(Previous word) is "Every"
Third Token(word) is "has" PreviousToken(Previous word) is "man"
Fourth Token(word) is "a" PreviousToken(Previous word) is "has"
Fifth Token(word) is "price" PreviousToken(Previous word) is "a"

Sixth Token(word) is "Every" PreviousToken(Previous word) is none (no previous)
Seventh Token(word) is "woman" PreviousToken(Previous word) is "Every"
Eighth Token(word) is "has" PreviousToken(Previous word) is "woman"
Ninth Token(word) is "a" PreviousToken(Previous word) is "has"
Tenth Token(word) is "price" PreviousToken(Previous word) is "a"


Frequency of "has a" is 2 (occurs twice, in the first and second sentences)
Frequency of "a price" is 2 (occurs twice, in the first and second sentences)
Frequency of "Every man" is 1 (occurs once)
Frequency of "man has" is 1 (occurs once)
Frequency of "Every woman" is 1 (occurs once)
Frequency of "woman has" is 1 (occurs once)

Probability of "has a" is 2/10 (frequency of "has a" divided by the total number of words)
Probability of "a price" is 2/10 (frequency of "a price" divided by the total number of words)
Probability of "Every man" is 1/10 (frequency of "Every man" divided by the total number of words)
Probability of "man has" is 1/10 (frequency of "man has" divided by the total number of words)
Probability of "Every woman" is 1/10 (frequency of "Every woman" divided by the total number of words)
Probability of "woman has" is 1/10 (frequency of "woman has" divided by the total number of words)

# a look at the Tkinter Text widget
# use ctrl+c to copy, ctrl+x to cut selected text,
# ctrl+v to paste, and ctrl+/ to select all

# count words in a text and show the first five items
# by decreasing frequency

import Tkinter as tk
import os
import tkFileDialog

def most_frequant_word():
    browser = tkFileDialog.askdirectory()

    word_freq = {}
    for root, dirs, files in os.walk(browser):
        text1.insert(tk.INSERT, 'Found %d dirs and %d files' % (len(dirs), len(files)))
        text1.insert(tk.INSERT, "\n")
        for idx, file in enumerate(files):
            print 'File #%d: %s' % (idx + 1, file)
            ff = open(os.path.join(root, file), "r")
            text = ff.read()
            ff.close()

            # build the dictionary of word counts
            for word in text.split():
                word = word.lower()
                word = word.rstrip('.,/"\\ -_;[](){}')
                word_freq[word] = word_freq.get(word, 0) + 1

            # create a list of (freq, word) tuples
            freq_list = [(freq, word) for word, freq in word_freq.items()]

            # sort the list by the first element in each tuple (default)
            freq_list.sort(reverse=True)

            # print the first five items once the fourth file has been counted
            for n, tup in enumerate(freq_list):
                if n < 5 and idx == 3:
                    print "%s times: %s" % tup
                    text1.insert(tk.INSERT, "%s times: %s" % tup)
                    text1.insert(tk.INSERT, "\n")

root = tk.Tk(className=" most_frequant_word")
# text entry field, width=width chars, height=lines text
text1 = tk.Text(root, width=50, height=50, bg='green')
text1.pack()
# function listed in command will be executed on button click
button1 = tk.Button(root, text='Browse', command=most_frequant_word)
button1.pack(pady=5)
text1.focus()
root.mainloop()

All 3 Replies

I suggest representing every bigram as a tuple (first, second). You would convert "Every man has a price. Every woman has a price." to lower case, drop the punctuation, ignore the sentence-boundary cases, and end up with:

("every","man")
("man","has")
("has","a")
("a","price")
("price","every")
("every","woman")
("woman","has")
("has","a")
("a","price")

Then use each bigram as a dictionary index where the value of the entry is the count of appearances, viz:

>>> num = {}  # Dictionary
>>> def addbig(bigram):
...     try:
...             num[bigram] += 1
...     except KeyError:
...             num[bigram] = 1
...
>>> # I'm just manually putting these here so you get the general idea
>>> addbig(("every","man"))
>>> addbig(("man","has"))
>>> addbig(("has","a"))
>>> addbig(("a","price"))
>>> addbig(("price","every"))
>>> addbig(("every","woman"))
>>> addbig(("woman","has"))
>>> addbig(("has","a"))
>>> addbig(("a","price"))
>>> print num
{('has', 'a'): 2, ('every', 'woman'): 1, ('every', 'man'): 1, ('man', 'has'): 1, ('a', 'price'): 2, ('woman', 'has'): 1, ('price', 'every'): 1}

... and you should be able to take it from there.
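
As one possible way to "take it from there", here is a minimal sketch that turns the counts into the table the question asks for (previous token, token, frequency, probability). The function names count_bigrams and bigram_table are made up for illustration. The probability follows the question's own definition (bigram frequency divided by the total number of words, 10 in the example) and is not the conditional probability used in a bigram language model. Because this list ignores sentence boundaries, it also contains ("price", "every"), which the question would not count.

```python
# Bigrams from the example, written out by hand as in the session above.
bigrams = [
    ("every", "man"), ("man", "has"), ("has", "a"), ("a", "price"),
    ("price", "every"), ("every", "woman"), ("woman", "has"),
    ("has", "a"), ("a", "price"),
]

def count_bigrams(pairs):
    """Return a dict mapping each bigram tuple to its number of appearances."""
    counts = {}
    for pair in pairs:
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def bigram_table(counts, total_words):
    """Return (previous, token, frequency, probability) rows, most frequent first."""
    rows = []
    for (previous, token), freq in counts.items():
        rows.append((previous, token, freq, float(freq) / total_words))
    # Sort by decreasing frequency, then alphabetically for a stable order.
    rows.sort(key=lambda row: (-row[2], row[0], row[1]))
    return rows

counts = count_bigrams(bigrams)
table = bigram_table(counts, total_words=10)  # the example text has 10 words
for row in table:
    print("%-6s %-6s %d  %.1f" % row)
```

Each row of table could then go into one row of whatever table widget or text output you use.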

There has to be better code to copy. Unless you have to modify that specific program, I would suggest finding a program without the Tkinter stuff and copying that. You can add a GUI later, after the rest of it is working. Also, when you copy a program and are trying to understand it, add print statements so you can see what is in the containers and what is happening in the loops. The first thing, though, would be to break this up into functions so you have some chance of understanding it, and then test each function individually.
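
A rough sketch of what that split into functions might look like, with no Tkinter involved. All function names here are hypothetical, and clean_word strips punctuation from both ends of a word, which is an assumption (the posted program only strips the right-hand side). The print calls at the bottom show what is in each container.

```python
import os

def clean_word(word):
    """Lower-case a word and strip surrounding punctuation."""
    return word.lower().strip('.,/"\\ -_;[](){}')

def words_in_text(text):
    """Split text on whitespace and return the cleaned, non-empty words."""
    cleaned = [clean_word(w) for w in text.split()]
    return [w for w in cleaned if w]

def words_in_directory(directory):
    """Collect the words of every file found under a directory."""
    words = []
    for root, dirs, files in os.walk(directory):
        for name in files:
            with open(os.path.join(root, name), "r") as handle:
                words.extend(words_in_text(handle.read()))
    return words

def count_words(words):
    """Return a dict mapping each word to its number of appearances."""
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts

def most_frequent(counts, n=5):
    """Return the n most frequent (word, count) pairs, ties in alphabetical order."""
    pairs = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return pairs[:n]

# Test each function on its own with a small, known input.
sample = "Every man has a price. Every woman has a price."
words = words_in_text(sample)
print(words)
counts = count_words(words)
print(most_frequent(counts, 3))
```

Once each piece gives the output you expect on a small string, words_in_directory can be pointed at the real folder and a GUI added on top.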

OK, but how can I retrieve every two words in sequence from the file,
as you mention?

("man","has")
("has","a")
("a","price")

How can I make every two words my index?
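
One common way to get every two consecutive words is to zip the word list with itself shifted by one position. Each resulting tuple can be used directly as a dictionary key. The sketch below is only an illustration: the function names sentences and bigrams are made up, and splitting sentences on ".", "!" and "?" is an assumption made so that the first word of a sentence has no previous token, as in the question.

```python
import re

def sentences(text):
    """Split text into lists of lower-case words, one list per sentence."""
    result = []
    for chunk in re.split(r"[.!?]+", text):
        words = re.findall(r"[a-z']+", chunk.lower())
        if words:
            result.append(words)
    return result

def bigrams(words):
    """Pair each word with the one before it, as (previous, token) tuples."""
    return list(zip(words, words[1:]))

# In a real program the text would come from open(path).read().
text = "Every man has a price. Every woman has a price."

counts = {}
total_words = 0
for words in sentences(text):
    total_words += len(words)
    for pair in bigrams(words):
        # The tuple itself is the dictionary index.
        counts[pair] = counts.get(pair, 0) + 1

for (previous, token), freq in sorted(counts.items()):
    print("%s %s: frequency %d, probability %d/%d"
          % (previous, token, freq, freq, total_words))
```

If sentence boundaries should be ignored instead, calling bigrams on the whole word list at once would also produce ("price", "every").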
