I have written a script that downloads RSS feeds, compares the latest download with the previous one using MD5, and sends an email with the updated headlines if they differ.
However, I have noticed that the email update is sent even for a change as minor as the insertion of a comma.
Is there a way to compare files that shows their degree of similarity?
From my searches so far, it seems the vector space model would do the job, but it looks pretty tough to use for an absolute beginner to Python, so I would love to hear of any alternatives.


You can try a formula like this one:

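from zlib import compress

def score(stringA, stringB):
    # Inputs must be bytes; in Python 3, call .encode("utf-8") on str values first.
    a = len(compress(stringA))  # compressed size of A alone
    b = len(compress(stringB))  # compressed size of B alone
    c = len(compress(stringA + stringB))  # compressed size of A followed by B
    # a + b - c estimates how much information A and B share
    return 1.0 - (0.0 + a + b - c) / max(a, b)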

It should return a number close to 0.0 if the 2 strings are similar and close to 1.0 if they're completely different. For a reference on this kind of formula, see http://paginas.fe.up.pt/~%20ssn/2008/Cilibrasi05.pdf. It may or may not work for your problem.
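Here is a rough sketch of how the score above might be plugged into the feed script. It uses the same formula, but encodes the text to bytes first so it runs on Python 3. The names compression_distance and feed_changed and the 0.2 threshold are made up for illustration; the threshold in particular is an assumption that would need tuning against real feeds.

```python
import zlib

def compression_distance(text_a, text_b):
    """Same formula as score(), with str inputs encoded to bytes for Python 3."""
    data_a = text_a.encode("utf-8")
    data_b = text_b.encode("utf-8")
    a = len(zlib.compress(data_a))
    b = len(zlib.compress(data_b))
    c = len(zlib.compress(data_a + data_b))
    return 1.0 - (a + b - c) / max(a, b)

def feed_changed(old_text, new_text, threshold=0.2):
    """Hypothetical helper: True if the two downloads differ by more than threshold.

    threshold=0.2 is an arbitrary starting point, not a recommended value.
    """
    return compression_distance(old_text, new_text) > threshold
```

The script would then call feed_changed(previous_download, latest_download) in place of the md5 comparison and only send the email when it returns True.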

Thanks Gribouillis,

Will give that a go once I've had a bit of sleep... too much work is starting to catch up with me :yawn:


Were you using the formula from page 8 of the pdf?

Shouldn't it be more like this?

return (0.0 + c - min(a, b)) / max(a, b)

I could be wrong, it wouldn't be the first time (grin).

I found the first formula in another paper (in French). In fact it's the same formula, because a + b = max(a, b) + min(a, b) :)
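For anyone who wants to convince themselves, the two forms can also be compared numerically. This is just a quick check, not a proof; the function names first_form, second_form and forms_agree are made up for the example, and a, b, c stand for the three compressed lengths from the score function.

```python
import math

def first_form(a, b, c):
    # Formula from the original score() function
    return 1.0 - (a + b - c) / max(a, b)

def second_form(a, b, c):
    # Formula as suggested from the paper
    return (c - min(a, b)) / max(a, b)

def forms_agree(a, b, c):
    # Since a + b = max(a, b) + min(a, b):
    #   1 - (a + b - c) / max = (max - max - min + c) / max = (c - min) / max
    return math.isclose(first_form(a, b, c), second_form(a, b, c), abs_tol=1e-12)
```

Calling forms_agree with any positive lengths, for example forms_agree(10, 20, 25), should return True.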

I obtained a very interesting image about this compression formula. The idea was to compare the similarity number given by the compression formula with the Levenshtein distance between 2 strings, which measures distance as the number of steps needed to transform one string into the other.
So I took 300 random strings of length 100. For each of these strings, I built a second string by modifying a random number of characters, and for each of these pairs of strings, I computed the compression distance and the Levenshtein distance.
The image shows a strong correlation between the 2 distances, which means that the compression distance is really a good indication of the similarity between strings, especially when it's not too close to 1.0.
I'll try to make other tests, with strings of different sizes. A goal could be to write an approximate formula which gives the Levenshtein distance as a function of the lengths of the strings and the compression distance.
Note: the image was drawn by matplotlib.
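A rough sketch of how such an experiment could be reproduced is given below. It is not the script used for the image. Several details are assumptions, since the post does not state them: the alphabet (lowercase letters), the way strings are modified (random substitutions only), the fixed seed, and the use of a Pearson correlation coefficient in place of a matplotlib plot. All function names are made up for the example.

```python
import random
import string
import zlib

ALPHABET = string.ascii_lowercase  # assumption: the original alphabet is not stated

def compression_distance(text_a, text_b):
    """Compression formula from this thread, applied to str inputs."""
    data_a, data_b = text_a.encode("utf-8"), text_b.encode("utf-8")
    a = len(zlib.compress(data_a))
    b = len(zlib.compress(data_b))
    c = len(zlib.compress(data_a + data_b))
    return 1.0 - (a + b - c) / max(a, b)

def levenshtein(s, t):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (cs != ct)))  # substitution or match
        prev = cur
    return prev[-1]

def mutate(s, n_changes, rng):
    """Overwrite n_changes random positions (assumed modification scheme).

    A position may receive the same character again, so the true number of
    differing characters can be lower than n_changes.
    """
    chars = list(s)
    for pos in rng.sample(range(len(chars)), n_changes):
        chars[pos] = rng.choice(ALPHABET)
    return "".join(chars)

def run_experiment(n_pairs=300, length=100, seed=0):
    """Return a list of (levenshtein, compression_distance) pairs."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_pairs):
        original = "".join(rng.choice(ALPHABET) for _ in range(length))
        modified = mutate(original, rng.randint(0, length), rng)
        results.append((levenshtein(original, modified),
                        compression_distance(original, modified)))
    return results

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

if __name__ == "__main__":
    pairs = run_experiment()
    lev, comp = zip(*pairs)
    print("Pearson correlation:", pearson(lev, comp))
```

The (levenshtein, compression_distance) pairs returned by run_experiment could equally be passed to a matplotlib scatter plot to get a picture comparable to the attached one.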

[Attachment: image.png]