I wrote this function to compare two strings and find their similarity. It returns the percent of their letters in common in place. ie "hello" and "yellow" share the "ello" (8 of 11 characters) and thus are 72.72 % similar. I feel like there is a better way to write this but I've yet to come up with it.

``````def RunComparison(str1, str2, param):
comp = Compare(str1, str2, param)
return comp.percent

class Compare:
def __init__(self, str1, str2, param):
self.str1 = str1.lower()
self.str2 = str2.lower()
self.param = param
self.most = 0

self.percent = self.LIC()

def LIC(self):                                              # Letters in common only
totlen = float( len(self.str1)+len(self.str2) )         # Will run LICIP if more
backup = self.str2                                      # than if greater than
if totlen > 0:                                          # required match is found
for x in self.str1:
if x in backup:
backup = backup.replace(x, "", 1)

InComn = len(self.str2) - len(backup)
percent = (InComn/totlen)*200

if percent >= self.param:
percent = self.LICIP()
if percent >= self.param:
return percent

def LICIP(self):
SearchResult = [ self.FindAll(self.str2, char) for char in self.str1 ]

Sequence = self.ElicitSequentialMatches( SearchResult )

Total = float( len(Sequence) )
percent = Total*200.0 / ( len(self.str1)+len(self.str2) )

return round(percent, 2)

def ElicitSequentialMatches( self, indices ):
# finds a good place to start prevents
# starting on an empty search result.
indices = [ x for x in indices if x ]
if indices:
return max(self.WalkPaths(indices, [indices[0][0]]), key=len)
else:
return []

def WalkPaths(self, path, directions ):

while path:
if len(directions)+len(path) <= self.most:
directions = []
break
try:
last = directions[-1]
except:
last = -1

path = path[1:]

if path:

option = self.NextGreatest(last, path[0])
if option:

directions += [ option ]

# consider the no-option option. Leave
# everything unchanged and continue down the path.
if not self.IsNextBest(last, option, path):
for x in self.WalkPaths(path, directions[:-1]):
yield x

dir_len = len(directions)
if dir_len > self.most:
# possibly just setting
# the return value instead of
# adding these directions
# to the list of possibilities
self.most = dir_len
yield directions

def IsNextBest(self, prev, option, path):
if not option:
return True
if prev + 1 == option:
return True
while path:
for x in path[0]:
if x > prev:
if x <= option:
return False
return True
path = path[1:]
return True

def NextGreatest(self, x, List):
for y in List:
if y > x: return y
return []

def FindAll(self, string, subj):
return [n for n, x in enumerate(string) if x == subj]``````

"Similarity" is kind of vague, and I'm not willing to think my way through all your code to understand what you mean by it. So:
* Are you looking for the greatest common substring (and then some arithmetic on its length versus the length of (which?) primary string?
* Are you looking for the number of characters in common regardless of adjacency?
-- And if so, do you care about order?
* Or something else?
Wikipedia has a decent article on greatest common substring, and the distinction between that and greatest common subsequence (my 'regardless of adjacency' above, where order matters).

If you don't care about order or adjacency, then you can use a 'bag' structure (dictionary with letter:count values) to hold the data from each word, then take the sum over all letters of (the minimum of the count of the letter in each word)

I don't care about 'adjacency,' rather just the order in which they appear. ie

'hello' and 'hzello' should find 'hello' as the common letters. The methods I wrote do that, but are heavy on processing.

Maybe you could use this function I wrote adapted from my scrambled word solver solution:

``````def insequence(k,s):
""" goes through the letters of first word (k) and returns the letters
which are in same order in both words
"""
i = 0
result = ''

for c in k:
i = s.find(c,i)+1
if i > 0 : result+=c ## add found letter to result
return result

for x,y in [('swite','white'), ('hello','hzello'),('hzello','hello'),('hit','time')]:
print x,y,'->',insequence(x,y)

"""output:
swite white -> wite
hello hzello -> hello
hzello hello -> hello
hit time -> i
"""``````

For the general case of an arbitrary number of input sequences, the problem is NP-hard. When the number of sequences is constant, the problem is solvable in polynomial time by dynamic programming ... For the case of two sequences of n and m elements, the running time of the dynamic programming approach is `O(n × m)` .

The same Wikipedia article gives this solution:

``````function  LCSLength(X[1..m], Y[1..n])
C = array(0..m, 0..n)
for i := 0..m
C[i,0] = 0
for j := 0..n
C[0,j] = 0
for i := 1..m
for j := 1..n
if X[i] = Y[j]
C[i,j] := C[i-1,j-1] + 1
else:
C[i,j] := max(C[i,j-1], C[i-1,j])
return C[m,n]``````

Common subsequence normally refers only one run (maybe the longest) occuring in both strings not sum total of all those, which my very simple (and quite pythonic) function collects together, and I think the poster needs.

My code did not consider going over the end of other word before finishing letters from first word. Here is version which breaks out of loop and does not accept match from beginning after first word finishes.

``````def insequence(k,s):
""" goes through the letters of first word (k) and returns the letters
which are in same order in both words
"""
i = 0
ls = len(s)
result = ''

for c in k:
i = s.find(c,i)+1
if i > 0 : result+=c ## add found letter to result
if i >= ls : break
return result

for x,y in [('yellow','ownyel'),
('ownyel','yellow'),
('hello','yellow'),
('hello','hzello'),
('hzello','hello'),
('hit','time')]:
print x,y,'->',insequence(x,y)

"""output:
yellow ownyel -> yel
ownyel yellow -> ow
hello yellow -> ello
hello hzello -> hello
hzello hello -> hello
hit time -> i
"""``````

What about hello elohl and lehlo and elloh?
I think you should check sequence too and compare not only number of letters but their sequence. This complicates it but makes it more accurate, that iselloh is closest to hello ;)

If you are going out of order checking (which is relatively questionable, are emit time so similar even they are anagrams?) you should check my anagram check function from scramble thread.
http://www.daniweb.com/forums/post1206864.html#post1206864

That function works well, the only problem I see is when comparing something like:

``````>>> insequence("blueyellow", "yellowblue")
'blue'``````

The function should return the largest similar value, which would be 'yellow' in this case

This seems to correct the above problem:

``````def insequence(k, s):
# saves several results and
# their accompanying index
result = [['', 0, 0]]
#       chrs, index, len
# a master list of all
# characters found in both strings
master = ''
lk = len(k)
most = 0
for c in k:
wasFound = False
for n, (string, index, Len) in enumerate(result):
found = s.find(c, index)
if found >= 0:
wasFound = True
result[n] = [string+c, found+1, Len+1]
if c not in master:
master += c
if Len == most:
most += 1
# a larger result is possible
if Len+(lk-n) > most:
# chosing no-option option
result.append([string, index+1, Len])
else:
if not wasFound:
# if this character was
# already found, then it is
# pointless to add it to the list
# because it could not possibly return
# a larger result
if c not in master:
found = s.find(c)
if found >= 0:
result.append([c, found+1, 0])
master += c
# return max([z for x, y, z in result])
return max([x for x, y, z in result], key=len)``````

Update:

My previous post was argument position dependant,
this one seems to correct(and complicate) the former.

``````def RunComparison(str1, str2, param):
totlen = float( len(str1)+len(str2) )         # Will run insequence if more
str1, str2 = str1.lower(), str2.lower()       # than if greater than
backup = str2                                 # required match is found
if totlen > 0:
for x in str1:
if x in backup:
backup = backup.replace(x, "", 1)

InComn = len(str2) - len(backup)
percent = (InComn/totlen)*200

if percent >= param:
InComn = insequence(str1, str2)
percent = round((InComn/totlen)*200)
if percent >= param:
return percent

def insequence(k, s):
# saves several results and
# their accompanying index
result = [['', 0, 0]]
#       chrs, index, len
# a master list of all characters
# found in both given strings
master = ''
lk = len(k)
most = 0
for index, c in enumerate(k):
wasFound = False
next = []
remaining_letters = lk-index-1
searched_indices = []
indices_to_delete = []
for n, (string, index, Len) in enumerate(result):
for ind, res in searched_indices:
if ind == index:
found = res
break
else:
found = s.find(c, index)
searched_indices.append([index, found])
if c not in master:
master += c

if found >= 0:
wasFound = True
result[n] = [string+c, found+1, Len+1]
if Len == most:
most += 1
# a larger result is possible
if Len+remaining_letters > most:
# chosing no-option option
next.append([string, index, Len])
elif Len+remaining_letters < most:
# reverse index order, to keep from shifting
# during deletion process
indices_to_delete.insert(0, n)
else:
for n in indices_to_delete:
del result[n]
if not wasFound:
# if this character was
# already found, then it is
# pointless to add it to the list
# because it could not possibly return
# a larger result
if c not in master:
found = s.find(c)
if found >= 0:
if remaining_letters+1 > most:
next.append([c, found+1, 1])
master += c
result += next
return max([z for x, y, z in result])
##    return max([x for x, y, z in result], key=len)``````

Looks bit complicated compared to simple:

``````def runcomparision(str1, str2, param):
ins1, ins2 = insequence(str1, str2) , insequence(str2, str1)
comlen = max(ins1,ins2)
percent = round(200.0 * comlen / len(str1+str2))
if percent >= param: return percent

def insequence(k,s):
""" goes through the letters of first word (k) and returns the number of
letters which are in same order in both words
"""
lcommon = i = 0
ls = len(s)

for c in k:
i = s.find(c,i)+1
if i > 0 : lcommon += 1 ## add length of common part
if i >= ls : break
return lcommon

for x,y in [('yellow','ownyel'),
('hello','yellow'),
('hello','hzello'),
('hit','time'),
('hit','hi'),
('hit','hot')]:
print x,y,'->',runcomparision(x,y,0)``````

Hope your code is more fast than this. You could test anyway and compare: sometimes simpler is better than code trying to be clever and efficient (especially if you use psyco module).

looks better, but it doesnt work when comparing the strings
"joshua wun" and "josh wun"

incorrectly matches
"joshu wun"

Should have not directly assigned find result to i:

``````def runcomparision(str1, str2, param):
ins1, ins2 = insequence(str1, str2) , insequence(str2, str1)
comlen = max(ins1,ins2)
percent = round(200.0 * comlen / len(str1+str2))
if percent >= param: return percent

def insequence(k,s):
""" goes through the letters of first word (k) and returns the number of
letters which are in same order in both words
"""
lcommon = i = 0
ls = len(s)

for c in k:
if i >= ls :
break
j = s.find(c,i)+1
if j > 0 :
lcommon += 1 ## add length of common part
i=j

return lcommon

for x,y in [('yellow','ownyel'),
('hello','yellow'),
('hello','hzello'),
('hit','time'),
('hit','hi'),
('hit','hot'),
("joshua wun", "josh wun")]:
print x,y,'->',runcomparision(x,y,0)

"""Output:
yellow ownyel -> 50.0
hello yellow -> 73.0
hello hzello -> 91.0
hit time -> 29.0
hit hi -> 80.0
hit hot -> 67.0
joshua wun josh wun -> 89.0
"""``````

I tested your code(modified to print matching string) vs mine.

here's what I got (mine first)

``````>>> insequence2("adsasdasd", "dsasasdadsad")
'dsasdasd'

I do not get so, I get 7 length, dsasasd

``````def runcomparision(str1, str2, param):
ins1, ins2 = insequence(str1, str2) , insequence(str2, str1)
comlen = max(ins1,ins2)
print
print 'comlen',comlen
percent = round(200.0 * comlen / len(str1+str2))
if percent >= param: return percent

def insequence(k,s):
""" goes through the letters of first word (k) and returns the number of
letters which are in same order in both words
"""
lcommon = i = 0
ls = len(s)
print

for c in k:
if i >= ls :
break
j = s.find(c,i)+1
if j > 0 :
lcommon += 1 ## add length of common part
print c,
i=j

return lcommon

for x,y in [("joshua wun", "josh wun"),
print x,y,'->',runcomparision(x,y,0),'%'

"""Output:
>>>
joshua wun josh wun ->
j o s h u n
j o s h   w u n
comlen 8
89.0 %
a d s a d
d s a s a s d
comlen 7
67.0 %
"""``````

hmm you're right, I needed to reverse the arguments... However, mine returns one extra letter. I'd prefer your function merely from its simplicity, but it seems to miss one character.

``````insequence2("dsasasdadsad", "adsasdasd")
(8, 'asasdasd')``````

It is when there is difference between maximum match and actual match. Fix would be to find the difference letters and prove to add those letters in place of one letter in match. Comes too complicated, I think. Use your version, lets try to find best formulation for it.

``````def runcomparision(str1, str2, param):
ins1, ins2 = insequence(str1, str2) , insequence(str2, str1)
comlen,comstr = max(ins1,ins2)
print
print 'comlen',comlen,'comstr',comstr
percent = round(200.0 * comlen / len(str1+str2))
if percent >= param: return percent

def insequence(k,s):
""" goes through the letters of first word (k) and returns the number of
letters which are in same order in both words
"""
lcommon = i = 0
ls = len(s)
res=''
for c in k:
j = s.find(c,i)
##        print c,j,s[i:]
if j >= 0 :
res+=c
lcommon += 1 ## add length of common part
i=j+1
if i > ls : break

return (lcommon,res)

for x,y in [("joshua wun", "josh wun"),
print x,y,'->',runcomparision(x,y,0),'%'
print x,y,'->',runcomparision(''.join(sorted(x)),''.join(sorted(y)),0),'%'``````

Have you checked difflib.SequenceMatcher.get_matching_blocks()?

UPDATE: did not get 8 but 7 match:

``````from difflib import SequenceMatcher
def runcomparision(str1, str2, param):
ins1, ins2 = insequence(str1, str2) , insequence(str2, str1)
comlen,comstr = max(ins1,ins2)
print
print 'comlen',comlen,'comstr',comstr
percent = round(200.0 * comlen / len(str1+str2),3)
if percent >= param: return percent

def insequence(k,s):
""" goes through the letters of first word (k) and returns the number of
letters which are in same order in both words
"""
lcommon = i = 0
ls = len(s)
res=''
for c in k:
j = s.find(c,i)
##        print c,j,s[i:]
if j >= 0 :
res+=c
lcommon += 1 ## add length of common part
i=j+1
if i > ls : break

return (lcommon,res)

for x,y in [('yellow','ownyel'),
('hello','yellow'),
('hello','hzello'),
('hit','time'),
('hit','hi'),
('hit','hot'),
("joshua wun", "josh wun"),
print x,y,'->',runcomparision(x,y,0),'%'
print x,y,'->',runcomparision(''.join(sorted(x)),''.join(sorted(y)),0),'%'
s = SequenceMatcher(None,x,y)
seq=''
tot = 0
for (start1,start2,le) in s.get_matching_blocks():
seq+=x[start1:start1+le]
tot+=le
print seq,'length',tot
## gives same numbers as I got above in float format
print 100*s.ratio(),'%..',100*s.quick_ratio(), '%'
print '-'*40``````

So I'm guessing the basic algorithm is similar to yours...

I'd like to see which condition causes the function to miss a match and then see if we could exaggerate it

No, it should use real Ratcliff and Obershelp algorithm recursively finding longest substring from string and then two left over pieces.

Timing looks also much more demanding than my algorith: quadratic.

Timing: The basic Ratcliff-Obershelp algorithm is cubic time in the worst case and quadratic time in the expected case. SequenceMatcher is quadratic time for the worst case and has expected-case behavior dependent in a complicated way on how many elements the sequences have in common; best case time is linear.

Here seems to be discussion about its optimizations:
http://mail.python.org/pipermail/python-dev/2001-January/011619.html

Would be interesting to run for example 20 misspelled words against big word list with these three algorithms and see how far they are in speed and results.

Especially I have feeling that if both calculated instances of insequence (my algorith) are same or very near each other, or max of them is near sorted insequence comparison (max possible similarity by my logic, I do not know how they do this their function, did not read the source) my algorithm is good enough. Then if is easy to see the situation where my algorithm is not good enough, it could fall back to heavier SequenceMatcher or your code which seems most accurate (deleting and permuting?).

I have not measured my code's performance, though. It looks linear time on the surface, but maybe it is not in reality. Do you have good example of variously misspelled words, I do have some English word lists for my anagram program to find correction suggestions. difflib has already one ready definition for such use

difflib.get_close_matches(word, possibilities[, n][, cutoff])

The condition of missing is that in sequence from other word comes letter which should match second occurance instead of first one to do best match. If you take out the first letter of the second word from the test case, the 8 long sequence is found. But removing one by one letters maybe does not make sense so much.

If you know from my function that there is 7 letters match, is there simple way to utilize your code to find if there exist longer than that sequences.

Basically there should be possible to remove one letter from match and replace it by matching two from other sequence after the letter position one before removed one. So there must be enough big jump in other string for that to be possible and common letters there in right position in sequence.

I did one implementation by direct transform of the pseudocode of Longest Common Subsequence, this can be much optimized, especially eliminating the index calculations by keeping the right number in simple variables instead of matrix.

Had to adapt to Pythons 0 based string indexes.

``````##import pretty
def runcomparision(a,b,limit):
n1=len(a)
n2=len(b)
lcs=[(n1+1)*[0]]+([[0]+[None]*n1]*n2)
jm1=0 ## j-1
for j in range(1,n2+1):
im1=0 ## i-1
for i in range(1,n1+1):
if a[im1] == b[jm1]:
##                print i-1,j-1,a[i-1],'\t',
lcs[j][i] = 1 + lcs[jm1][im1]
else:
lcs[j][i] = max(lcs[jm1][i],lcs[j][im1])
im1 = i
jm1 = j
##        print
##    print ; pretty.ppr(lcs)
res = lcs[n2][n1]
return (res, 200.0*res/(n1+n2))

for x,y in [('yellow','owyel'),
('hello','hzello'),
('hit','time'),
('hit','hi'),
('hit','hot'),
("joshua wun", "josh wun"),
("computer","houseboat")]:
print x,y,'-> ',
print "%i letters, %.2f %%" % runcomparision(x,y,0)
##    print '-'*40

"""Output without debug prints:
yellow owyel ->  4 letters, 72.73 %
hello hzello ->  5 letters, 90.91 %
hit time ->  1 letters, 28.57 %
hit hi ->  2 letters, 80.00 %
hit hot ->  2 letters, 66.67 %
joshua wun josh wun ->  8 letters, 88.89 %
computer houseboat ->  3 letters, 35.29 %
"""``````

Code has problem with letter repeats in first compared to second parameter:

``````import pretty
def runcomparision(a,b,limit):
n1=len(a)
n2=len(b)
lcs=[(n1+1)*[0]]+([[0]+[None]*n1]*n2)
jm1=0 ## j-1
for j in range(1,n2+1):
print
im1=0 ## i-1
for i in range(1,n1+1):
if a[im1] == b[jm1]:
lcs[j][i] = 1 + lcs[jm1][im1]
print im1,jm1,a[im1],lcs[j][i],'\t',
else:
lcs[j][i] = max(lcs[jm1][i],lcs[j][im1])
im1 = i
jm1 = j
print ; pretty.ppr(lcs)
res = lcs[n2][n1]
return (res, 200.0*res/(n1+n2))

## should produce same answer, but repeating letter matches multiple times,
## when first parameter, resulting answer longer than shorter sequence
for x,y in [('aaaa','aa'),
('aa','aaaa'),
]:
print x,y,'-> ',
print "%i letters, %.2f %%" % runcomparision(x,y,0)``````
``````aaaa aa ->
0 0 a 1 	1 0 a 1 	2 0 a 1 	3 0 a 1
0 1 a 1 	1 1 a 2 	2 1 a 3 	3 1 a 4

[
[0, 0, 0, 0, 0],
[0, 1, 2, 3, 4],
[0, 1, 2, 3, 4]]
4 letters, 133.33 %
aa aaaa ->
0 0 a 1 	1 0 a 1
0 1 a 1 	1 1 a 2
0 2 a 1 	1 2 a 2
0 3 a 1 	1 3 a 2

[
[0, 0, 0],
[0, 1, 2],
[0, 1, 2],
[0, 1, 2],
[0, 1, 2]]
2 letters, 66.67 %
>>>``````

Dict works better:

``````import pretty
def runcomparision(a,b,limit):
n1=len(a)
n2=len(b)
lcs=dict()
jm1=0 ## j-1

for j in range(1,n2+1):
lcs[jm1,0] = lcs[j,0] = 0
im1=0 ## i-1
for i in range(1,n1+1):
lcs[0,im1] = lcs[0,i] = 0
if a[im1] == b[jm1]:
lcs[j,i] = 1 + lcs[jm1,im1]
else:
lcs[j,i] = max(lcs[jm1,i],lcs[j,im1])
im1 = i
jm1 = j
res = lcs[n2,n1]
return (res, 200.0*res/(n1+n2))

## should produce same answer, but repeating letter matches multiple times,
## when first parameter, resulting answer longer than shorter sequence
for x,y in [('yellow','owyel'),
('hello','hzello'),
('hit','time'),
('hit','hi'),
('hit','hot'),
("joshua wun", "josh wun"),
("computer","houseboat"),
['aa','aaaa']]:

print x,y,'-> ',
print "%i letters, %.2f %%" % runcomparision(x,y,0)
""" Output:
yellow owyel ->  3 letters, 54.55 %
hello hzello ->  5 letters, 90.91 %
hit time ->  1 letters, 28.57 %
hit hi ->  2 letters, 80.00 %
hit hot ->  2 letters, 66.67 %
joshua wun josh wun ->  8 letters, 88.89 %
computer houseboat ->  3 letters, 35.29 %
aa aaaa ->  2 letters, 66.67 %
"""``````

I made several adjustments to my original code, and it runs much faster with minimal recursions, however its displaying odd behavior that I'm having difficulty understanding. I've run the same two strings through it 3 times and gotten 3 different results. I only get the correct result when I restart idle. I thought that behavior would be unique to idle, so I ran it through my program. Nope.. It's like it's using variable values from instances past. I tried explicitly deleting the instances of the compare class after each run and still it returns bizarre results.. Here's the code:

``````# this is the code from the main
# program
def RunComparison(str1, str2, param):
from Compare import Compare
instance = Compare(str1, str2, param)
return instance.percent``````
``````class Compare:
recursions = 0
largest_len = -1
Sequence = []
percent = 0

def __init__(self, str1, str2, param):
self.str1 = str1.lower()
self.str2 = str2.lower()
self.param = param

self.CommonCharacters()

def CommonCharacters(self):                                 # Letters in common only
self.totlen = float( len(self.str1)+len(self.str2) )    # Will check sequence if more
backup = self.str2                                      # than if greater than
if self.totlen:                                         # required match is found
for x in self.str1:
if x in backup:
backup = backup.replace(x, "", 1)

InComn = len(self.str2) - len(backup)
percent = (InComn/self.totlen)*200

if percent >= self.param:
# if percent of total characters
# meets or exceeds param requirement
# check for sequential matches
percent = self.CommonSequentialCharacters()
if percent >= self.param:
self.percent = percent

def CommonSequentialCharacters(self):
SearchResult = []
for char in self.str1:
indices = self.FindAll(self.str2, char)
if indices:
SearchResult.append(indices)

self.WalkPaths(SearchResult)

common = float( len(self.Sequence) )
percent = common*200.0 / self.totlen

return round(percent, 2)

def WalkPaths(self, path, directions=[] ):
alternate_options = []
self.recursions += 1
len_dirs = len(directions)
if len_dirs:
last = directions[-1]
else:
last = -1

clearance = (len_dirs+len(path)) - self.largest_len
while path and clearance > -1:
# greater match no
# longer possible

option = self.NextGreatest(last, path[0])
if option != None:
# consider the no-option option. Leave
# everything unchanged and continue down the path.
if not self.IsNextBest(last, option, path):
if clearance > 1:
alternate_options.append([ path[1:], directions[:] ])

directions += [ option ]
last = option
len_dirs += 1
else:
clearance -= 1

path = path[1:]

if len_dirs > self.largest_len:
# new match leader
self.largest_len = len_dirs
self.Sequence = directions[:]

# consider other options after
# parent function has yielded results.
# largest_match will be at its max thus preventing
# unnecessary recursions with 'dead end'
# possibilities.
[ self.WalkPaths(alt_path, alt_dir) for alt_path, alt_dir in alternate_options ]

def IsNextBest(self, prev, option, path):
# can reduce recursions
# by more than 93%
if prev + 1 == option:
return True
while path:
for x in path[0]:
if x > prev:
if x <= option:
return False
return True
path = path[1:]
return True

def NextGreatest(self, x, List):
for y in List:
if y > x: return y

def FindAll(self, string, subj):
return [n for n, x in enumerate(string) if x == subj]

def Return(self):
return ("percent", self.percent), ("length", self.largest_len),\
("sequence", self.Sequence), ("total recursions", self.recursions)``````

here is a simple example that I am not understanding. Running the code in idle.

``````>>> x = Compare("kyle", "elyk", 5)
>>> x.CommonCharacters()
25.0
>>> x = Compare("john", "johnny", 5)
>>> x.CommonCharacters()
40.0
>>> x.Return()
(('percent', 40.0), ('length', 2), ('sequence', [3, 4]), ('total recursions', 1))
>>>``````

... after idle restart

``````>>>
>>> x = Compare("john", "johnny", 5)
>>> x.CommonCharacters()
80.0
>>> x.Return()
(('percent', 80.0), ('length', 4), ('sequence', [0, 1, 2, 3]), ('total recursions', 1))
>>>``````

Here is the standard function with second version for collecting strings of results in addition to length.

``````def runcomparision(a,b,limit):
n1=len(a)
n2=len(b)
lcs=dict()

for j in range(n2+1):
lcs[j,0]=0
for i in range(n1+1):
lcs[0,i]=0

jm1=0 ## j-1

for j in range(1,n2+1):
im1=0 ## i-1
for i in range(1,n1+1):
if a[im1] == b[jm1]:
lcs[j,i] = 1 + lcs[jm1,im1]
else:
lcs[j,i] = max(lcs[jm1,i],lcs[j,im1])
im1 = i
jm1 = j
res = lcs[n2,n1]
return (res, 200.0*res/(n1+n2))

def getruns(a,b):
n1=len(a)
n2=len(b)
lcs=dict()

for j in range(n2+1):
lcs[j,0]=[0,'']
for i in range(n1+1):
lcs[0,i]=[0,'']

jm1=0 ## j-1

for j in range(1,n2+1):
im1=0 ## i-1
for i in range(1,n1+1):
if a[im1] == b[jm1]:
lcs[j,i] = (1 + lcs[jm1,im1][0]),lcs[jm1,im1][1]+a[im1]
else:
lcs[j,i] = max(lcs[jm1,i],lcs[j,im1])
im1 = i
jm1 = j
res,seq = lcs[n2,n1]
return (res, 200.0*res/(n1+n2),seq)

## should produce same answer, but repeating letter matches multiple times,
## when first parameter, resulting answer longer than shorter sequence
for x,y in [('yellow','owyel'),
('hello','hzello'),
('hit','time'),
('hit','hi'),
('hit','hot'),
("joshua wun", "josh wun"),
("computer","houseboat"),
['aa','aaaa']]:
print x,y,'-> ',
print "%i letters, %.2f %%, seq %s" % getruns(x,y)

""" Output:
yellow owyel ->  3 letters, 54.55 %, seq yel
hello hzello ->  5 letters, 90.91 %, seq hello
hit time ->  1 letters, 28.57 %, seq t
hit hi ->  2 letters, 80.00 %, seq hi
hit hot ->  2 letters, 66.67 %, seq ht
joshua wun josh wun ->  8 letters, 88.89 %, seq josh wun