Hi!!!

I am making a program that is supposed to use a Naive Bayes classifier to classify text into a few categories. This is the best I can do to explain it; here is what I have done so far:

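# Read both files up front; the last line of each appears to hold a total count.
with open('D.txt', 'r') as d:
    total = d.readlines()
with open('kat1.txt', 'r') as di:
    posleden = di.readlines()

Di = float(posleden[-1])
D = float(total[-1])

P = Di / D

# Open the output once; reopening it with 'w' inside the loop would overwrite it.
fajl = open('prob.txt', 'w')
fajl.write('P1=' + str(P) + '\n')

# Loop over the lines already read: the file object is exhausted after readlines().
for rec in posleden:
    split_rec = rec.split('\t')
    if len(split_rec) > 1:
        print("word=%s, freq=%s" % (split_rec[0], split_rec[1].strip()))
        # Convert to float before adding 1 (str + int raises a TypeError).
        # The denominator is kept as in the original attempt.
        Prob = (float(split_rec[1]) + 1) / (Di + D)
        fajl.write('Prob1 =' + str(Prob) + '\n')

fajl.close()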

:(


Please use code tags so that your indentation is not lost, and so that we may better read your posts.

Code tags go like this:
[code=python] # MY code goes between these tags!

[/code]


Mh, great, man!
That was a great trick, and one that always tricked me.
Thanks, Gribouillis, for asking. When one person asks, one person may answer, but many others gain knowledge.
Bravo!


Sorry about that...

Here are the formulas I need to implement. The thing is that I don't understand what is what...

Text Naive Bayes Algorithm

Let V be the vocabulary of all words in the documents in D
For each category ci ∈ C
    Let Di be the subset of documents in D in category ci
    P(ci) = |Di| / |D|
    Let Ti be the concatenation of all the documents in Di
    Let ni be the total number of word occurrences in Ti
    For each word wj ∈ V
        Let nij be the number of occurrences of wj in Ti
        Let P(wj | ci) = (nij + 1) / (ni + |V|)
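
To make the last formula concrete, here is a minimal sketch of the add-one smoothed estimate as a function. The name `smoothed_prob` and the counts are made up for illustration; they do not come from the files used in this thread.

```python
def smoothed_prob(nij, ni, vocab_size):
    """Return (nij + 1) / (ni + |V|), the smoothed estimate of P(wj | ci)."""
    return (nij + 1.0) / (ni + vocab_size)

# Hypothetical counts: the word occurs 3 times among 20 word occurrences
# in the category, and the vocabulary holds 10 distinct words.
p_seen = smoothed_prob(3, 20, 10)    # (3 + 1) / (20 + 10)

# A word never seen in the category still gets a small non-zero probability.
p_unseen = smoothed_prob(0, 20, 10)  # (0 + 1) / (20 + 10)
```

The "+ 1" in the numerator is what keeps an unseen word from driving the estimate to zero; the "+ |V|" in the denominator keeps the estimates over all vocabulary words summing to 1.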

And here is the code so far...

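# V.txt: the last line appears to hold the vocabulary size |V|.
with open('V.txt', 'r') as v:
    totalv = v.readlines()
V = float(totalv[-1])

# freq1.txt: word<TAB>count lines; the last line appears to hold the total ni.
with open('freq1.txt', 'r') as freq:
    totalf = freq.readlines()
Freq = float(totalf[-1])

# Open the output once; reopening it with 'w' inside the loop would overwrite it.
fajl = open('bayes.txt', 'w')

# Loop over the lines already read: the file object is exhausted after readlines().
for rec in totalf:
    split_rec = rec.split('\t')
    if len(split_rec) > 1:
        # "zbor" means "word"
        print("zbor=%s, freq=%s" % (split_rec[0], split_rec[1].strip()))
        # P(wj | ci) = (nij + 1) / (ni + |V|)
        P = (float(split_rec[1]) + 1) / (Freq + V)
        fajl.write('P1 =' + str(P) + '\n')

fajl.close()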

I would use a dictionary with the category as the key and a list of all the words in that category's subset as the value.

Let V be the vocabulary of all words in the documents in D
For each category ci
----->ci=dictionary key
Let Di be the subset of documents in D in category ci
----->Di = value associated with each key = list of words in this category
P(ci) = |Di| / |D|
Let Ti be the concatenation of all the documents in Di
----->Already have this: the list is a concatenation in the sense that I think is meant here
Let ni be the total number of word occurrences in Ti
----->(not unique occurrences but all occurrences??)
----->ni = len(dictionary[ci]) i.e. the length of the list
For each word wj
Let nij be the number of occurrences of wj in Ti
----->You can loop through each key's list or use a_list.count(wj)
Let P(wj | ci) = (nij + 1) / (ni + |V|)
----->Not sure what all of this means, but its values should come from the calculations above
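
The mapping above could be sketched roughly as follows. This is only an illustration under some assumptions: the function name `train` and the input layout (each category maps to a list of documents, each document a list of words) are hypothetical, chosen so that |Di| can be counted in documents while Ti is still built by concatenating them. The sample data is invented.

```python
from collections import Counter

def train(docs_by_category):
    """Estimate P(ci) and P(wj | ci) from {category: [document, ...]},
    where each document is a list of words (hypothetical input layout)."""
    total_docs = sum(len(docs) for docs in docs_by_category.values())  # |D|
    # V: every distinct word that appears in any document
    vocab = set(w for docs in docs_by_category.values() for doc in docs for w in doc)

    priors = {}
    cond = {}
    for ci, docs in docs_by_category.items():
        priors[ci] = len(docs) / float(total_docs)  # P(ci) = |Di| / |D|
        ti = [w for doc in docs for w in doc]       # Ti: concatenation of Di
        ni = len(ti)                                # all occurrences, not unique ones
        counts = Counter(ti)                        # nij for each word wj
        cond[ci] = dict(
            (wj, (counts[wj] + 1.0) / (ni + len(vocab))) for wj in vocab
        )
    return priors, cond

# Invented sample data, for illustration only.
example = {
    'sport': [['ball', 'goal', 'ball'], ['goal', 'team']],
    'tech': [['code', 'bug', 'code']],
}
priors, cond = train(example)
```

Using `Counter` instead of calling `a_list.count(wj)` for every word avoids rescanning the list once per vocabulary entry, and a word missing from a category simply counts as 0.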
