comparing degree of similarity between files

blair.mayston 0 Newbie Poster

16 Years Ago

Hi,

I have written a script that downloads RSS feeds, compares the latest download with the previous one using md5, and sends an email with the updated headlines if they differ.
However, I have noticed that the email update is sent even for a change as minor as the insertion of a comma.
Is there a method to compare files that shows the degree of similarity?
From searches so far, it seems the vector space model would do the job, but it looks pretty tough to use for someone who is an absolute beginner to Python, so I would love to hear of any alternatives.

Blair

python

Gribouillis 1,391 Programming Explorer

16 Years Ago

You can try a formula like this one

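from zlib import compress

def score(stringA, stringB):
    # zlib.compress() needs bytes: in Python 3, encode str inputs first,
    # e.g. text.encode("utf-8")
    a = len(compress(stringA))                # compressed size of A alone
    b = len(compress(stringB))                # compressed size of B alone
    c = len(compress(stringA + stringB))      # compressed size of A followed by B
    return 1.0 - float(a + b - c) / max(a, b)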

It should return a number close to 0.0 if the 2 strings are similar and close to 1.0 if they're completely different. For a reference on this kind of formula, see http://paginas.fe.up.pt/~%20ssn/2008/Cilibrasi05.pdf. It may or may not work for your problem.
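As a rough illustration of how the score could replace the md5 comparison, here is a minimal Python 3 sketch. The helper name `should_notify` and the threshold of 0.3 are assumptions made up for this example, not part of the formula; a sensible threshold would have to be tuned on real feeds.

```python
from zlib import compress


def score(data_a: bytes, data_b: bytes) -> float:
    """Compression-based distance: near 0.0 for similar inputs, near 1.0 for unrelated ones."""
    a = len(compress(data_a))
    b = len(compress(data_b))
    c = len(compress(data_a + data_b))
    return 1.0 - (a + b - c) / max(a, b)


def should_notify(old_text: str, new_text: str, threshold: float = 0.3) -> bool:
    """Hypothetical helper: report a change only when the feeds differ by more than threshold.

    The default of 0.3 is an arbitrary placeholder, not a recommended value.
    """
    # zlib works on bytes, so the downloaded text is encoded first.
    return score(old_text.encode("utf-8"), new_text.encode("utf-8")) > threshold
```

Note that very short inputs are dominated by zlib's fixed overhead, so the score is probably more meaningful on whole feed files than on single headlines.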

Murtan 317 Practically a Master Poster

16 Years Ago

Were you using the formula from page 8 of the pdf?

Shouldn't it be more like this?

return (0.0 + c - min(a, b)) / max(a, b)

I could be wrong, it wouldn't be the first time (grin).

blair.mayston 0 Newbie Poster · Answer 1 · 2009-03-16T14:36:16+00:00

Thanks Gribouillis,

Will give that a go once I've had a bit of sleep... too much work is starting to catch up with me :yawn:

Blair

Gribouillis 1,391 Programming Explorer Team Colleague · Answer 2 · 2009-03-16T20:38:50+00:00

I found the first formula in another paper (in French). In fact it's the same formula, because a + b = max(a, b) + min(a, b), so 1.0 - (a + b - c) / max(a, b) = (c - min(a, b)) / max(a, b) :)
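The equivalence of the two formulas can also be checked numerically. This is only a small sketch; the function names `score_first` and `score_second` are labels invented here for the two versions posted in this thread.

```python
import zlib


def compressed_sizes(x: bytes, y: bytes):
    """Return (a, b, c): compressed sizes of x, of y, and of x followed by y."""
    return len(zlib.compress(x)), len(zlib.compress(y)), len(zlib.compress(x + y))


def score_first(a: int, b: int, c: int) -> float:
    # First version posted: 1 - (a + b - c) / max(a, b)
    return 1.0 - (a + b - c) / max(a, b)


def score_second(a: int, b: int, c: int) -> float:
    # Second version posted: (c - min(a, b)) / max(a, b)
    return (c - min(a, b)) / max(a, b)


if __name__ == "__main__":
    a, b, c = compressed_sizes(b"the quick brown fox " * 20, b"jumps over the lazy dog " * 20)
    # The two values should agree up to floating-point rounding.
    print(score_first(a, b, c), score_second(a, b, c))
```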

Gribouillis 1,391 Programming Explorer Team Colleague · Answer 3 · 2009-03-17T15:50:06+00:00

I obtained a very interesting image about this compression formula. The idea was to compare the similarity number obtained with the compression formula to the Levenshtein distance between 2 strings, which measures the distance as the number of steps needed to transform one of the strings into the other.
So I took 300 random strings of length 100. For each of these strings, I built a second string by modifying a random number of characters, and for each of these pairs of strings, I computed the compression distance and the Levenshtein distance.
The image shows a strong correlation between the 2 distances, which means that the compression distance is really a good indication of the similarity between strings, especially when it's not too close to 1.0.
I'll try to make other tests, with strings of different sizes. A goal could be to write an approximate formula which gives the Levenshtein distance as a function of the lengths of the strings and the compression distance.
Note: the image was drawn by matplotlib.
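For anyone who wants to try a similar experiment, here is a rough sketch of how it might be set up. It is not the original script: the lowercase alphabet, the way characters are modified, the seed, and all function names are assumptions, and the plotting step is left out.

```python
import random
import string
import zlib


def compression_distance(x: bytes, y: bytes) -> float:
    """Compression-based distance between two byte strings."""
    a, b = len(zlib.compress(x)), len(zlib.compress(y))
    c = len(zlib.compress(x + y))
    return (c - min(a, b)) / max(a, b)


def levenshtein(s: str, t: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        cur = [i]
        for j, ct in enumerate(t, start=1):
            cur.append(min(prev[j] + 1,                # delete cs
                           cur[j - 1] + 1,             # insert ct
                           prev[j - 1] + (cs != ct)))  # substitute or match
        prev = cur
    return prev[-1]


def make_pairs(count: int = 300, length: int = 100, seed: int = 0):
    """Build (original, modified) pairs; the modification scheme is an assumption."""
    rng = random.Random(seed)
    alphabet = string.ascii_lowercase
    pairs = []
    for _ in range(count):
        original = [rng.choice(alphabet) for _ in range(length)]
        modified = original[:]
        # Overwrite a random number of randomly chosen positions.
        for pos in rng.sample(range(length), rng.randint(0, length)):
            modified[pos] = rng.choice(alphabet)
        pairs.append(("".join(original), "".join(modified)))
    return pairs


def measure(pairs):
    """Return a list of (Levenshtein distance, compression distance) per pair."""
    return [(levenshtein(s, t), compression_distance(s.encode(), t.encode()))
            for s, t in pairs]
```

The list returned by `measure(make_pairs())` could then be passed to a scatter plot to look for the correlation described above.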
