Hello!
I wonder if anyone has used google-diff-match-patch for fuzzy string comparisons, i.e., for finding the closest/most similar string? Can anyone explain and give some examples?
Also, in general, what would be the best matching algorithm in Python? I tried Levenshtein, but the matches are not very good, and I tried difflib.get_close_matches, which, although better, is very slow.

Thanks,
PM

I don't know about google-diff-match-patch, but I once used an algorithm called normalized compression distance. It returns a number in [0.0, 1.0]; a small value means that the strings are close (or equal).

# python 2.6
from zlib import compress
  
def distance(sx, sy):
    # length of the compressed concatenation vs. the separate compressions
    ab = len(compress(sx + sy))
    a, b = len(compress(sx)), len(compress(sy))
    return 1.0 - float(a + b - ab) / max(a, b)

def main():
    while True:
        sx = raw_input("enter string sx: ").strip()
        sy = raw_input("enter string sy: ").strip()
        print("Compression distance: %.2f" % distance(sx, sy))

if __name__ == "__main__":
    main()

There are references for this algorithm here: http://paginas.fe.up.pt/~ssn/2008/Cilibrasi05.pdf, and there is a C library at http://www.complearn.org/. I think the results of this algorithm are close to those of Levenshtein distance, but it's faster.
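For the original question of finding the closest string in a list, the compression distance above can be used directly as a ranking key. Here is a minimal sketch (Python 3, where zlib.compress needs bytes, so the strings are encoded first; the sample phrases are made up, and note that the method is most reliable on longer strings, where compression header overhead matters less):

```python
from zlib import compress

def distance(sx, sy):
    # normalized compression distance: smaller means more similar
    ab = len(compress(sx + sy))
    a, b = len(compress(sx)), len(compress(sy))
    return 1.0 - float(a + b - ab) / max(a, b)

def closest(target, candidates):
    # rank candidates by their compression distance to the target
    tb = target.encode("utf-8")
    return min(candidates, key=lambda c: distance(tb, c.encode("utf-8")))

phrases = ["the quick brown fox jumps over the lazy dog " * 3,
           "lorem ipsum dolor sit amet consectetur elit " * 3]
print(closest("the quick brown fox jumped over a lazy dog " * 3, phrases))
```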
You can also use gzip to compress the strings (it probably gives better results). You can compress a string with

# python 2.6
import gzip
from cStringIO import StringIO

def gcompress(s):
    sio = StringIO()
    f = gzip.GzipFile(fileobj=sio, mode="w")
    f.write(s)
    f.close()
    return sio.getvalue()
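To actually swap gzip in, one option is to make the compressor a parameter of the distance function. A small sketch (Python 3, using io.BytesIO in place of cStringIO and bytes input; the pluggable-compressor parameter is my addition, not part of the original post):

```python
import gzip
import zlib
from io import BytesIO

def gcompress(s):
    # gzip-compress a bytes object entirely in memory
    sio = BytesIO()
    with gzip.GzipFile(fileobj=sio, mode="w") as f:
        f.write(s)
    return sio.getvalue()

def distance(sx, sy, compress=zlib.compress):
    # same compression distance as above, with a pluggable compressor
    ab = len(compress(sx + sy))
    a, b = len(compress(sx)), len(compress(sy))
    return 1.0 - float(a + b - ab) / max(a, b)

d = distance(b"hello world " * 20, b"hello world!" * 20, compress=gcompress)
```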



Levenshtein can be messy if the diffs have lots of coincidental matches. In the example of "plants" vs "stanly", the Levenshtein score of a normal diff is only 4, whereas one would want 6. The google-diff-match-patch library allows you to do semantic cleanup, which gets rid of much of this noise.
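For reference, the low score is easy to reproduce with a plain dynamic-programming Levenshtein (stdlib only; this is just a baseline I wrote to illustrate the 4, not diff-match-patch code):

```python
def levenshtein(s, t):
    # classic two-row dynamic-programming edit distance
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (cs != ct)))    # substitution
        prev = cur
    return prev[-1]

print(levenshtein("plants", "stanly"))  # -> 4
```

Even though almost every character differs, the coincidental matches keep the score at 4, which is why the semantic cleanup below helps.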

Here's an example:

#!/usr/bin/python2.4
import diff_match_patch as dmp_module

dmp = dmp_module.diff_match_patch()
dmp.Diff_Timeout = 0.1  # Don't spend more than 0.1 seconds on a diff.

def distance(sx, sy):
    diffs = dmp.diff_main(sx, sy)
    dmp.diff_cleanupSemantic(diffs)  # strip the coincidental matches
    return dmp.diff_levenshtein(diffs)