Hello!
I wonder if anyone has used google-diff-match-patch for fuzzy string comparisons, i.e., for finding the closest/most similar string? Can anyone explain and give some examples?
Also, in general, what would be the best matching algorithm in Python? I tried Levenshtein, but the matches are not so good, and I tried difflib.get_close_matches, which is better but very slow.

Thanks,
PM

Last Post by Neil Fraser

I don't know about google-diff-match-patch, but I once used an algorithm called compression distance. It returns a number in [0.0, 1.0]; a small value means the strings are close (or equal).

``````# python 2.6
from zlib import compress

def distance(sx, sy):
    # compressed sizes of the concatenation and of each string alone
    ab = len(compress(sx + sy))
    a, b = len(compress(sx)), len(compress(sy))
    # normalized: 0.0 means the strings compress as well together as apart
    return 1.0 - (0.0 + a + b - ab) / max(a, b)

def main():
    while True:
        sx = raw_input("enter string sx: ").strip()
        sy = raw_input("enter string sy: ").strip()
        print("Compression distance: %.2f" % distance(sx, sy))

if __name__ == "__main__":
    main()``````

There are references for this algorithm here http://paginas.fe.up.pt/~%20ssn/2008/Cilibrasi05.pdf and there is a C library at http://www.complearn.org/. I think the results of this algorithm are close to those of the Levenshtein distance, but it's faster.
You can also use gzip to compress the strings (it probably gives better results). You can compress a string with
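To apply the compression distance to the original question (finding the closest string in a list), you can simply take the minimum over the candidates. A minimal sketch in Python 3, where zlib.compress needs bytes, so the strings are encoded first; the word list here is made up for illustration:

```python
import zlib

def distance(sx, sy):
    # normalized compression distance over zlib-compressed byte strings
    bx, by = sx.encode(), sy.encode()
    ab = len(zlib.compress(bx + by))
    a, b = len(zlib.compress(bx)), len(zlib.compress(by))
    return 1.0 - float(a + b - ab) / max(a, b)

def closest(target, candidates):
    # smallest distance = most similar string
    return min(candidates, key=lambda s: distance(target, s))

words = ["apple", "apply", "maple", "orange"]
print(closest("appel", words))
```

Note that for very short strings the compressor's fixed overhead dominates, so results are more reliable on longer texts.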

``````# python 2.6
import gzip
from cStringIO import StringIO

def gcompress(s):
    # gzip-compress a string in memory and return the compressed data
    sio = StringIO()
    f = gzip.GzipFile(fileobj=sio, mode="w")
    f.write(s)
    f.close()
    return sio.getvalue()``````
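The same idea also works in Python 3, where cStringIO no longer exists; a sketch using io.BytesIO and bytes input:

```python
import gzip
import io

def gcompress(s):
    # gzip-compress a byte string in memory and return the compressed bytes
    sio = io.BytesIO()
    with gzip.GzipFile(fileobj=sio, mode="wb") as f:
        f.write(s)
    return sio.getvalue()

data = b"the quick brown fox " * 10
packed = gcompress(data)
print(len(data), len(packed))  # repeated text compresses well
```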

Edited by Gribouillis: n/a


Levenshtein can be messy if the diffs contain lots of coincidental matches. In the example of "plants" vs "stanly", the Levenshtein distance of a normal diff is only 4, whereas one would want 6. The google-diff-match-patch library allows one to do a semantic cleanup that gets rid of much of this noise.
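For comparison, a plain Levenshtein distance (a textbook dynamic-programming sketch, Python 3, no diff cleanup) does report 4 for this pair, matching the number above:

```python
def levenshtein(a, b):
    # classic dynamic-programming edit distance, one row at a time
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

print(levenshtein("plants", "stanly"))  # 4: coincidental matches hide the real difference
```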

Here's an example:

``````#!/usr/bin/python2.4
import diff_match_patch as dmp_module

dmp = dmp_module.diff_match_patch()
dmp.Diff_Timeout = 0.1  # Don't spend more than 0.1 seconds on a diff.

def distance(sx, sy):
    diffs = dmp.diff_main(sx, sy)
    dmp.diff_cleanupSemantic(diffs)  # merge coincidental matches into real edits
    return dmp.diff_levenshtein(diffs)``````