Hi everyone.

We have an assignment that asks us to count the number of occurrences of each character in a text file piped in as raw input and print the counts, then calculate the text's phi statistic, the expected phi for English, and the expected phi for random text, and print them all out. At the end of our program we also need to print "english" if the text's phi statistic is closer to the expected English phi, and "random" if it is closer to the expected random phi.

I was able to count the character occurrences with a dictionary and to calculate the expected phi for English and for random text. I'm just having trouble writing the code that calculates the phi statistic of the text. I can do it by hand but have no idea how to do it in Python. Here's the formula I use by hand: phi = f1 (f1 - 1) + f2 (f2 - 1) + ... + fn (fn - 1)

Do I need to make a list, or is there another way?

thanks a lot.

I have the same assignment... but please, how did you count the character occurrences using a dictionary? I've been working my butt off to make my code work, but no luck.

phi = f1 (f1 - 1) + f2 (f2 - 1) + ... + fn (fn - 1)

Someone is going to have to explain where f1, f2,... come from or what they mean.

character i occurs fi times and there are n distinct characters.

I have to be misunderstanding the concept.

import random
import string

##--- populate a test dictionary with random quantities
letters_dict = {}
for ltr in string.ascii_uppercase:
    letters_dict[ltr] = random.randint(1, 100)

## phi = f1 (f1 - 1) + f2 (f2 - 1) + ... + fn (fn - 1)
phi = 0
for ltr in letters_dict:
    qty = letters_dict[ltr]
    phi += qty * (qty-1)
print(phi)

It looks like a way to measure an entropy. If you have n values x1,...xn and you take a convex function F (a function with positive 2nd derivative), you have an inequality

F( (x1 + ... + xn)/n ) <= ( F(x1) + ... + F(xn) ) / n

and equality occurs only if all the x's have the same value. The function F here is F(y) = y(y - 1), which is convex, and I think the fi's are the letter frequencies in the text, so fi = Ni/N where N is the number of letters in the text and Ni the number of occurrences of the i-th letter. If the frequencies were all the same, you would have fi = 1/26, because there are n = 26 letters in the alphabet, and the sum would be 26 F(1/26). If the frequencies differ, you get F(f1) + ... + F(fn) > 26 F(1/26), and this difference could be used to distinguish an English text from a random text...
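A quick numerical check of that inequality, as a rough sketch only. It assumes the fi are relative frequencies that sum to 1 (as guessed in this post), and the skewed distribution below is made up purely for illustration:

```python
def F(y):
    # The convex function from the post: F(y) = y(y - 1)
    return y * (y - 1)

def total(freqs):
    # F(f1) + ... + F(fn)
    return sum(F(f) for f in freqs)

n = 26
# All 26 letters equally likely.
uniform = [1 / n] * n
# Hypothetical skewed case: one letter takes half of the text,
# the other 25 letters share the remaining half equally.
skewed = [0.5] + [0.5 / (n - 1)] * (n - 1)

print(total(uniform))  # about -0.96, i.e. 26 * F(1/26)
print(total(skewed))   # about -0.74, larger than the uniform value
```

The uniform distribution gives the smallest possible sum, and any unevenness pushes it up, which is the property the post suggests could separate English from random text.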

I was thinking of something similar, as phi is generally associated with distributions, AFAIK, but note that the equation is
phi = f1 (f1 - 1) + f2 (f2 - 1) + ... + fn (fn - 1)
which looks like some sort of convoluted Fibonacci sequence. Can't do much more without further info.

I thought about entropy because y(y-1) is well known in this context in the range [0, 1]. It's a convex parabola whose roots are 0 and 1. Another classical function is y log(y). Perhaps jjrrmm can tell us more about this.

The phi-statistic is a measure of how closely the frequency distribution of the individual characters in a text matches that of English. In English text, the individual characters tend to occur with fairly consistent frequencies. For example, the letter "E" is most common, followed by "T", "A", etc. A collection of random characters will have frequencies that are all roughly the same.

Computing the phi-statistic of a text requires the following steps:

1. Compute the number of occurrences of each character in the text.
2. Suppose that character i occurs fi times and that there are n distinct characters. Then
phi = f1 (f1 - 1) + f2 (f2 - 1) + ... + fn (fn - 1)

For example, the text
HKWZA RRPVQ BIVYS MPDMQ MBUDC
has phi-statistic equal to 18 (there are 6 characters that occur twice, one that occurs 3 times, and the rest occur once).
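A minimal sketch of the two steps above, assuming Python 3 and that only the letters A-Z count as characters. The function name phi_statistic is made up for illustration; it is not part of the assignment:

```python
from collections import Counter

def phi_statistic(text):
    """Return f1(f1-1) + ... + fn(fn-1) over the letter counts in text."""
    # Step 1: count occurrences, ignoring anything that is not a letter A-Z
    # (assumption: spaces and punctuation are not counted).
    counts = Counter(ch for ch in text.upper() if 'A' <= ch <= 'Z')
    # Step 2: sum f * (f - 1) over the count f of each distinct letter.
    return sum(f * (f - 1) for f in counts.values())

print(phi_statistic("HKWZA RRPVQ BIVYS MPDMQ MBUDC"))  # 18
```

Letters that occur only once contribute 1 * 0 = 0, so the 18 comes from the six doubled letters (6 * 2) plus the one tripled letter (3 * 2).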

So, we are trying to make this more difficult than it is.

##--- populate a test dictionary
letters = 'HKWZA RRPVQ BIVYS MPDMQ MBUDC'
letters_dict = {}
for ltr in letters:
    if 'A' <= ltr <= 'Z':
        if ltr not in letters_dict:
            letters_dict[ltr] = 0
        letters_dict[ltr] += 1

print(letters_dict)

## phi = f1 (f1 - 1) + f2 (f2 - 1) + ... + fn (fn - 1)
phi = 0
for ltr in letters_dict:
    qty = letters_dict[ltr]
    phi += qty * (qty-1)
print("Phi = %d" % phi)

## print each letter's count as a percentage of phi, in alphabetical order
for ltr in sorted(letters_dict):
    print("%s %7.2f" % (ltr, float(letters_dict[ltr])/phi*100))
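For the last part of the assignment (printing english or random), here is a rough sketch. The expected-phi formulas are not given anywhere in this thread, so the constants below are an assumption: the usual textbook values of about 0.0385 (1/26) for random letters and 0.0667 for English, each multiplied by N(N - 1) where N is the number of letters. The name classify is also made up; check both against your assignment sheet.

```python
from collections import Counter

# ASSUMED constants, not taken from this thread.
KAPPA_RANDOM = 0.0385   # about 1/26, uniformly random letters
KAPPA_ENGLISH = 0.0667  # commonly quoted value for English text

def classify(text):
    """Return 'english' or 'random', whichever expected phi is closer."""
    counts = Counter(ch for ch in text.upper() if 'A' <= ch <= 'Z')
    n = sum(counts.values())                          # total letters, N
    phi = sum(f * (f - 1) for f in counts.values())   # observed phi
    expected_random = KAPPA_RANDOM * n * (n - 1)
    expected_english = KAPPA_ENGLISH * n * (n - 1)
    # A tie falls through to 'random'; that choice is arbitrary.
    if abs(phi - expected_english) < abs(phi - expected_random):
        return "english"
    return "random"

print(classify("HKWZA RRPVQ BIVYS MPDMQ MBUDC"))  # random
```

With N = 25 the expected values come out near 23.1 (random) and 40.0 (English), so the observed phi of 18 lands closer to random.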