
Hello!
I wonder if anyone has used google-diff-match-patch for fuzzy string comparisons, i.e., for finding the closest/most similar string? Can anyone explain it and give some examples?
Also, in general, what would be the best matching algorithm in Python? I tried Levenshtein, but the matches are not so good, and I tried difflib.get_close_matches, but that one, although better, is very slow.

Thanks,
PM


I don't know about google-diff-match-patch, but I once used an algorithm called the normalized compression distance. It returns a number in [0.0, 1.0], where a small value means that the strings are close (or equal).

# python 2.6
from zlib import compress

def distance(sx, sy):
    # Compressed size of the concatenation vs. the individual strings:
    # the more sx and sy have in common, the better sx + sy compresses.
    ab = len(compress(sx + sy))
    a, b = len(compress(sx)), len(compress(sy))
    return 1.0 - (0.0 + a + b - ab) / max(a, b)

def main():
    while True:
        sx = raw_input("enter string sx: ").strip()
        sy = raw_input("enter string sy: ").strip()
        print("Compression distance: %.2f" % distance(sx, sy))

if __name__ == "__main__":
    main()
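Since the original question was about finding the closest/most similar string, you can rank candidates by this distance and keep the smallest. A minimal sketch (the closest helper and the sample word list are just for illustration):

# python 2.6
def closest(query, candidates):
    # Return the candidate with the smallest compression distance to query.
    return min(candidates, key=lambda c: distance(query, c))

words = ["plants", "stanly", "planet", "plan"]
print(closest("plant", words))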

There are references for this algorithm here http://paginas.fe.up.pt/~ssn/2008/Cilibrasi05.pdf and there is a C library at http://www.complearn.org/. I think the results of this algorithm are close to those of the Levenshtein distance, but it's faster.
You can also use gzip to compress the strings (it probably gives better results). You can compress a string with

# python 2.6
import gzip
from cStringIO import StringIO

def gcompress(s):
    # Gzip-compress the string s through an in-memory file object.
    sio = StringIO()
    f = gzip.GzipFile(fileobj=sio, mode="w")
    f.write(s)
    f.close()
    return sio.getvalue()
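To use it, swap gcompress in for the zlib compress calls in the distance function above. A sketch (the gdistance name is just illustrative):

# python 2.6
def gdistance(sx, sy):
    # Same formula as distance(), but with gzip as the compressor.
    ab = len(gcompress(sx + sy))
    a, b = len(gcompress(sx)), len(gcompress(sy))
    return 1.0 - (0.0 + a + b - ab) / max(a, b)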


Levenshtein can be messy if the diffs have lots of coincidental matches. In the example of "plants" vs. "stanly", the Levenshtein distance of a normal diff is only 4, whereas one would want 6. The google-diff-match-patch library allows one to do a semantic cleanup that gets rid of much of this noise.

Here's an example:

#!/usr/bin/python2.4
import diff_match_patch as dmp_module

dmp = dmp_module.diff_match_patch()
dmp.Diff_Timeout = 0.1  # Don't spend more than 0.1 seconds on a diff.

def distance(sx, sy):
    # Compute the diff, clean up coincidental matches, then measure it.
    diffs = dmp.diff_main(sx, sy)
    dmp.diff_cleanupSemantic(diffs)
    return dmp.diff_levenshtein(diffs)
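For example (assuming the diff_match_patch module is importable as above), the "plants" vs. "stanly" case should now report 6 rather than 4:

print(distance("plants", "stanly"))  # semantic cleanup removes the coincidental matches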