Hi!!!

I am making a program that is supposed to use a Naive Bayes classifier to classify text into a few categories. This is the best I can do to explain; here is what I have done so far:

d = open('D.txt', 'r')
di = open('kat1.txt', 'r')

posleden = di.readlines()
total = d.readlines()
Di = posleden[len(posleden) - 1]
D = total[len(total) - 1]
d.close()

P = float(Di) / float(D)

fajl = open('prob.txt', 'w')
fajl.write('P1=' + str(P))
fajl.close()

# readlines() has already consumed di, so loop over the saved
# lines; open the output in append mode so each record is
# added instead of overwriting the file on every iteration
fajl = open('prob.txt', 'a')
for rec in posleden:
    split_rec = rec.split('\t')
    if len(split_rec) > 1:
        print "word=%s, freq=%s" % \
            (split_rec[0], split_rec[1])

        # convert to float first, then add 1 -- adding 1 to the
        # string raised a TypeError in the original
        Prob = (float(split_rec[1]) + 1) / (float(Di) + float(D))
        fajl.write('Prob1 =' + str(Prob))
fajl.close()
di.close()

:(


Please use code tags so that your indentation is not lost, and so that we may better read your posts.

Code tags go like this:
[code=python] # MY code goes between these tags!

[/code]

jlm699, how can you put the tag without triggering its action?

Wrap the code tags in [noparse][/noparse] tags.

Great, man!
That was a great trick that always tripped me up.
Thanks, Gribouillis, for asking. When one person asks, many others gain the knowledge.
Bravo!

Sorry about that...

Here are the formulas I need to implement; the thing is, I don't understand what is what...

Text Naive Bayes Algorithm

Let V be the vocabulary of all words in the documents in D
For each category ci ∈ C
    Let Di be the subset of documents in D in category ci
    P(ci) = |Di| / |D|
    Let Ti be the concatenation of all the documents in Di
    Let ni be the total number of word occurrences in Ti
    For each word wj ∈ V
        Let nij be the number of occurrences of wj in Ti
        Let P(wj | ci) = (nij + 1) / (ni + |V|)
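To make the last formula concrete, here is a tiny worked example; all the numbers are made up: suppose category ci has ni = 8 total word occurrences, the word wj appears nij = 3 times in Ti, and the vocabulary has |V| = 10 words.

```python
ni = 8    # total word occurrences in Ti (made-up number)
nij = 3   # occurrences of wj in Ti (made-up number)
V = 10    # vocabulary size |V| (made-up number)

# Laplace ("add one") smoothing: the +1 in the numerator and the
# +|V| in the denominator keep unseen words from getting
# probability exactly zero
p_wj_given_ci = (nij + 1.0) / (ni + V)
print(p_wj_given_ci)  # (3 + 1) / (8 + 10) = 4/18
```

Note that summing this value over all |V| words in the vocabulary gives exactly 1, which is what makes it a proper probability distribution.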

And here is the code so far...

v = open('V.txt', 'r')
totalv = v.readlines()
V = totalv[len(totalv) - 1]
v.close()

freq = open('freq1.txt', 'r')
totalf = freq.readlines()
Freq = totalf[len(totalf) - 1]
freq.close()

# loop over the saved lines (freq itself is already exhausted
# by readlines), and open the output file once before the loop
# so earlier results are not overwritten
fajl = open('bayes.txt', 'w')
for rec in totalf:
    split_rec = rec.split('\t')
    if len(split_rec) > 1:
        print "word=%s, freq=%s" % \
            (split_rec[0], split_rec[1])

        # P(wj | ci) = (nij + 1) / (ni + |V|)
        P = (float(split_rec[1]) + 1) / (float(Freq) + float(V))
        fajl.write('P1 =' + str(P))
fajl.close()

I would use a dictionary with the category as key, and a list of all words in the subset as the value

Let V be the vocabulary of all words in the documents in D
For each category ci
----->ci=dictionary key
Let Di be the subset of documents in D in category ci
----->Di = value associated with each key = list of words in this category
P(ci) = |Di| / |D|
Let Ti be the concatenation of all the documents in Di
----->Already have this: a list is a concatenation in the sense that I think it is being used here
Let ni be the total number of word occurrences in Ti
----->(not unique occurrences but all occurrences??)
----->ni = len(dictionary[ci]) i.e. the length of the list
For each word wj
Let nij be the number of occurrences of wj in Ti
----->You can loop through each key's list or use a_list.count(wj)
Let P(wj | ci) = (nij + 1) / (ni + |V|)
----->Not sure what all of this means, but its values should be found in the above calcs
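Putting the suggestions above together, here is a minimal sketch of the whole training step using a dictionary keyed by category; the sample documents and category names are invented for illustration:

```python
# toy training data: category -> list of documents (invented)
docs = {
    'sports': ['ball game win', 'great game'],
    'politics': ['vote election win'],
}

total_docs = sum(len(d) for d in docs.values())  # |D|

# V: vocabulary of all words in all documents
vocab = set()
for doc_list in docs.values():
    for doc in doc_list:
        vocab.update(doc.split())

priors = {}       # P(ci)
cond_probs = {}   # P(wj | ci)
for ci, doc_list in docs.items():
    priors[ci] = float(len(doc_list)) / total_docs  # |Di| / |D|
    Ti = ' '.join(doc_list).split()                 # concatenation of Di
    ni = len(Ti)                                    # word occurrences in Ti
    cond_probs[ci] = {}
    for wj in vocab:
        nij = Ti.count(wj)                          # occurrences of wj in Ti
        cond_probs[ci][wj] = (nij + 1.0) / (ni + len(vocab))

print(cond_probs['sports']['game'])  # (2 + 1) / (5 + 6) = 3/11
```

To classify a new document you would then score each category by multiplying its prior by P(wj | ci) for every word in the document (usually done with sums of logarithms to avoid underflow) and pick the category with the highest score.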

Thank you, it helped a lot...

Hey do you still have the code for your program?
