improve this method

Question

ihatehippies 11 Junior Poster

15 Years Ago

I wrote this function to compare two strings and find their similarity. It returns the percent of their letters in common in place. ie "hello" and "yellow" share the "ello" (8 of 11 characters) and thus are 72.72 % similar. I feel like there is a better way to write this but I've yet to come up with it.

def RunComparison(str1, str2, param):
    comp = Compare(str1, str2, param)
    return comp.percent
    

class Compare:
    def __init__(self, str1, str2, param):
        self.str1 = str1.lower()
        self.str2 = str2.lower()
        self.param = param
        self.most = 0

        self.percent = self.LIC()


    def LIC(self):                                              # Letters in common only
        totlen = float( len(self.str1)+len(self.str2) )         # Will run LICIP if more                          
        backup = self.str2                                      # than if greater than
        if totlen > 0:                                          # required match is found
            for x in self.str1:  
                if x in backup:
                    backup = backup.replace(x, "", 1)

            InComn = len(self.str2) - len(backup)
            percent = (InComn/totlen)*200

            if percent >= self.param:
                percent = self.LICIP()
                if percent >= self.param:
                    return percent



    def LICIP(self):
        SearchResult = [ self.FindAll(self.str2, char) for char in self.str1 ]

        Sequence = self.ElicitSequentialMatches( SearchResult )
        
        Total = float( len(Sequence) )
        percent = Total*200.0 / ( len(self.str1)+len(self.str2) )        

        return round(percent, 2)
        

    def ElicitSequentialMatches( self, indices ):
        # finds a good place to start prevents
        # starting on an empty search result.
        indices = [ x for x in indices if x ]
        if indices:
            return max(self.WalkPaths(indices, [indices[0][0]]), key=len)
        else:
            return []


    def WalkPaths(self, path, directions ):

        while path:
            if len(directions)+len(path) <= self.most:
                directions = []
                break
            try:
                last = directions[-1]
            except:
                last = -1
        
            path = path[1:]

            if path:

                option = self.NextGreatest(last, path[0])
                if option:

                    directions += [ option ]

                    # consider the no-option option. Leave
                    # everything unchanged and continue down the path.
                    if not self.IsNextBest(last, option, path):
                        for x in self.WalkPaths(path, directions[:-1]):
                            yield x
                    
        dir_len = len(directions)
        if dir_len > self.most:
            # possibly just setting
            # the return value instead of
            # adding these directions
            # to the list of possibilities
            self.most = dir_len
        yield directions
                   
    def IsNextBest(self, prev, option, path):
        if not option:
            return True
        if prev + 1 == option:
            return True
        while path:
            for x in path[0]:
                if x > prev:
                    if x <= option:
                        return False
                    return True
            path = path[1:]
        return True

    def NextGreatest(self, x, List):
        for y in List:
            if y > x: return y
        return []


    def FindAll(self, string, subj):
        return [n for n, x in enumerate(string) if x == subj]

python

5 Contributors
52 Replies
2K Views
3 Weeks Discussion Span
Latest Post 15 Years Ago Latest Post by TrustyTony

All 52 Replies

Reply to this topic

Be a part of the DaniWeb community

We're a friendly, industry-focused community of developers, IT pros, digital marketers, and technology enthusiasts meeting, networking, learning, and sharing knowledge.

griswolf 304 Veteran Poster · Answer 1 · 2010-05-20T06:52:00+00:00

"Similarity" is kind of vague, and I'm not willing to think my way through all your code to understand what you mean by it. So:
* Are you looking for the greatest common substring (and then some arithmetic on its length versus the length of (which?) primary string?
* Are you looking for the number of characters in common regardless of adjacency?
-- And if so, do you care about order?
* Or something else?
Wikipedia has a decent article on greatest common substring, and the distinction between that and greatest common subsequence (my 'regardless of adjacency' above, where order matters).

If you don't care about order or adjacency, then you can use a 'bag' structure (dictionary with letter:count values) to hold the data from each word, then take the sum over all letters of (the minimum of the count of the letter in each word)

ihatehippies 11 Junior Poster · Answer 2 · 2010-05-20T07:34:22+00:00

I don't care about 'adjacency,' rather just the order in which they appear. ie

'hello' and 'hzello' should find 'hello' as the common letters. The methods I wrote do that, but are heavy on processing.

TrustyTony 888 ex-Moderator Team Colleague Featured Poster · Answer 3 · 2010-05-20T10:19:17+00:00

Maybe you could use this function I wrote adapted from my scrambled word solver solution:

def insequence(k,s):
    """ goes through the letters of first word (k) and returns the letters
        which are in same order in both words
        """
    i = 0
    result = ''

    for c in k:
        i = s.find(c,i)+1 
        if i > 0 : result+=c ## add found letter to result
    return result

for x,y in [('swite','white'), ('hello','hzello'),('hzello','hello'),('hit','time')]:
    print x,y,'->',insequence(x,y)

"""output:
swite white -> wite
hello hzello -> hello
hzello hello -> hello
hit time -> i
"""

griswolf 304 Veteran Poster · Answer 4 · 2010-05-20T10:25:06+00:00

http://en.wikipedia.org/wiki/Longest_common_subsequence_problem :

For the general case of an arbitrary number of input sequences, the problem is NP-hard. When the number of sequences is constant, the problem is solvable in polynomial time by dynamic programming ... For the case of two sequences of n and m elements, the running time of the dynamic programming approach is O(n × m) .

The same Wikipedia article gives this solution:

function  LCSLength(X[1..m], Y[1..n])
    C = array(0..m, 0..n)
    for i := 0..m
       C[i,0] = 0
    for j := 0..n
       C[0,j] = 0
    for i := 1..m
        for j := 1..n
            if X[i] = Y[j]
                C[i,j] := C[i-1,j-1] + 1
            else:
                C[i,j] := max(C[i,j-1], C[i-1,j])
    return C[m,n]

TrustyTony 888 ex-Moderator Team Colleague Featured Poster · Answer 5 · 2010-05-20T10:33:45+00:00

Common subsequence normally refers only one run (maybe the longest) occuring in both strings not sum total of all those, which my very simple (and quite pythonic) function collects together, and I think the poster needs.

TrustyTony 888 ex-Moderator Team Colleague Featured Poster · Answer 6 · 2010-05-20T20:24:28+00:00

My code did not consider going over the end of other word before finishing letters from first word. Here is version which breaks out of loop and does not accept match from beginning after first word finishes.

def insequence(k,s):
    """ goes through the letters of first word (k) and returns the letters
        which are in same order in both words
        """
    i = 0
    ls = len(s)
    result = ''

    for c in k:
        i = s.find(c,i)+1
        if i > 0 : result+=c ## add found letter to result
        if i >= ls : break
    return result

for x,y in [('yellow','ownyel'),
            ('ownyel','yellow'),
            ('hello','yellow'),
            ('hello','hzello'),
            ('hzello','hello'),
            ('hit','time')]:
    print x,y,'->',insequence(x,y)

"""output:
yellow ownyel -> yel
ownyel yellow -> ow
hello yellow -> ello
hello hzello -> hello
hzello hello -> hello
hit time -> i
"""

Stefano Mtangoo 455 Senior Poster · Answer 7 · 2010-05-20T21:20:17+00:00

What about hello elohl and lehlo and elloh?
I think you should check sequence too and compare not only number of letters but their sequence. This complicates it but makes it more accurate, that iselloh is closest to hello ;)

TrustyTony 888 ex-Moderator Team Colleague Featured Poster · Answer 8 · 2010-05-20T21:27:05+00:00

If you are going out of order checking (which is relatively questionable, are emit time so similar even they are anagrams?) you should check my anagram check function from scramble thread.
http://www.daniweb.com/forums/post1206864.html#post1206864

ihatehippies 11 Junior Poster · Answer 9 · 2010-05-21T04:40:45+00:00

That function works well, the only problem I see is when comparing something like:

>>> insequence("blueyellow", "yellowblue")
'blue'

The function should return the largest similar value, which would be 'yellow' in this case

ihatehippies 11 Junior Poster · Answer 10 · 2010-05-21T10:40:00+00:00

This seems to correct the above problem:

def insequence(k, s):
    # saves several results and
    # their accompanying index
    result = [['', 0, 0]]
    #       chrs, index, len
    # a master list of all
    # characters found in both strings
    master = ''
    lk = len(k)
    most = 0
    for c in k:
        wasFound = False
        for n, (string, index, Len) in enumerate(result):
            found = s.find(c, index)
            if found >= 0:
                wasFound = True
                result[n] = [string+c, found+1, Len+1]
                if c not in master:
                        master += c
                if Len == most:
                    most += 1
                # a larger result is possible
                if Len+(lk-n) > most:
                    # chosing no-option option
                    result.append([string, index+1, Len])
            else:
                if not wasFound:
                    # if this character was
                    # already found, then it is
                    # pointless to add it to the list
                    # because it could not possibly return
                    # a larger result
                        if c not in master:
                            found = s.find(c)
                            if found >= 0:
                                result.append([c, found+1, 0])
                                master += c
    # return max([z for x, y, z in result])
    return max([x for x, y, z in result], key=len)

ihatehippies 11 Junior Poster · Answer 11 · 2010-05-22T02:20:20+00:00

Update:

My previous post was argument position dependant,
this one seems to correct(and complicate) the former.

def RunComparison(str1, str2, param):
    totlen = float( len(str1)+len(str2) )         # Will run insequence if more
    str1, str2 = str1.lower(), str2.lower()       # than if greater than
    backup = str2                                 # required match is found
    if totlen > 0:                          
        for x in str1:  
            if x in backup:
                backup = backup.replace(x, "", 1)

        InComn = len(str2) - len(backup)
        percent = (InComn/totlen)*200

        if percent >= param:
            InComn = insequence(str1, str2)
            percent = round((InComn/totlen)*200)
            if percent >= param:
                return percent


def insequence(k, s):
    # saves several results and
    # their accompanying index
    result = [['', 0, 0]]
    #       chrs, index, len
    # a master list of all characters
    # found in both given strings
    master = ''
    lk = len(k)
    most = 0
    for index, c in enumerate(k):
        wasFound = False
        next = []
        remaining_letters = lk-index-1
        searched_indices = []
        indices_to_delete = []
        for n, (string, index, Len) in enumerate(result):
            for ind, res in searched_indices:
                if ind == index:
                    found = res
                    break
            else:
                found = s.find(c, index)
                searched_indices.append([index, found])
                if c not in master:
                    master += c
                    
            if found >= 0:
                wasFound = True
                result[n] = [string+c, found+1, Len+1]
                if Len == most:
                    most += 1
                # a larger result is possible
                if Len+remaining_letters > most:
                    # chosing no-option option
                    next.append([string, index, Len])
            elif Len+remaining_letters < most:
                # reverse index order, to keep from shifting
                # during deletion process
                indices_to_delete.insert(0, n)
        else:
            for n in indices_to_delete:
                del result[n]
            if not wasFound:
                # if this character was
                # already found, then it is
                # pointless to add it to the list
                # because it could not possibly return
                # a larger result
                    if c not in master:
                        found = s.find(c)
                        if found >= 0:
                            if remaining_letters+1 > most:
                                next.append([c, found+1, 1])
                                master += c
            result += next
    return max([z for x, y, z in result])
##    return max([x for x, y, z in result], key=len)

TrustyTony 888 ex-Moderator Team Colleague Featured Poster · Answer 12 · 2010-05-22T03:00:50+00:00

Looks bit complicated compared to simple:

def runcomparision(str1, str2, param):
    ins1, ins2 = insequence(str1, str2) , insequence(str2, str1)
    comlen = max(ins1,ins2)
    percent = round(200.0 * comlen / len(str1+str2))
    if percent >= param: return percent

def insequence(k,s):
    """ goes through the letters of first word (k) and returns the number of
        letters which are in same order in both words
        """
    lcommon = i = 0 
    ls = len(s)
    
    for c in k:
        i = s.find(c,i)+1
        if i > 0 : lcommon += 1 ## add length of common part
        if i >= ls : break
    return lcommon

for x,y in [('yellow','ownyel'),
            ('hello','yellow'),
            ('hello','hzello'),
            ('hit','time'),
            ('hit','hi'),
            ('hit','hot')]:
    print x,y,'->',runcomparision(x,y,0)

Hope your code is more fast than this. You could test anyway and compare: sometimes simpler is better than code trying to be clever and efficient (especially if you use psyco module).

ihatehippies 11 Junior Poster · Answer 13 · 2010-05-22T03:17:29+00:00

looks better, but it doesnt work when comparing the strings
"joshua wun" and "josh wun"

incorrectly matches
"joshu wun"

TrustyTony 888 ex-Moderator Team Colleague Featured Poster · Answer 14 · 2010-05-22T03:37:49+00:00

Should have not directly assigned find result to i:

def runcomparision(str1, str2, param):
    ins1, ins2 = insequence(str1, str2) , insequence(str2, str1)
    comlen = max(ins1,ins2)
    percent = round(200.0 * comlen / len(str1+str2))
    if percent >= param: return percent

def insequence(k,s):
    """ goes through the letters of first word (k) and returns the number of
        letters which are in same order in both words
        """
    lcommon = i = 0 
    ls = len(s)
    
    for c in k:
        if i >= ls :
            break
        j = s.find(c,i)+1
        if j > 0 :
            lcommon += 1 ## add length of common part
            i=j

    return lcommon

for x,y in [('yellow','ownyel'),
            ('hello','yellow'),
            ('hello','hzello'),
            ('hit','time'),
            ('hit','hi'),
            ('hit','hot'),
            ("joshua wun", "josh wun")]:
    print x,y,'->',runcomparision(x,y,0)

"""Output:
yellow ownyel -> 50.0
hello yellow -> 73.0
hello hzello -> 91.0
hit time -> 29.0
hit hi -> 80.0
hit hot -> 67.0
joshua wun josh wun -> 89.0
"""

ihatehippies 11 Junior Poster · Answer 15 · 2010-05-22T05:46:07+00:00

I tested your code(modified to print matching string) vs mine.

here's what I got (mine first)

>>> insequence2("adsasdasd", "dsasasdadsad")
'dsasdasd'
>>> insequence("adsasdasd", "dsasasdadsad")
(5, 'adsad')

TrustyTony 888 ex-Moderator Team Colleague Featured Poster · Answer 16 · 2010-05-22T06:26:47+00:00

I do not get so, I get 7 length, dsasasd

def runcomparision(str1, str2, param):
    ins1, ins2 = insequence(str1, str2) , insequence(str2, str1)
    comlen = max(ins1,ins2)
    print
    print 'comlen',comlen
    percent = round(200.0 * comlen / len(str1+str2))
    if percent >= param: return percent

def insequence(k,s):
    """ goes through the letters of first word (k) and returns the number of
        letters which are in same order in both words
        """
    lcommon = i = 0 
    ls = len(s)
    print
    
    for c in k:
        if i >= ls :
            break
        j = s.find(c,i)+1
        if j > 0 :
            lcommon += 1 ## add length of common part
            print c,
            i=j

    return lcommon

for x,y in [("joshua wun", "josh wun"),
            ("adsasdasd", "dsasasdadsad")]:
    print x,y,'->',runcomparision(x,y,0),'%'

"""Output:
>>> 
joshua wun josh wun ->
j o s h u n
j o s h   w u n
comlen 8
89.0 %
adsasdasd dsasasdadsad ->
a d s a d
d s a s a s d
comlen 7
67.0 %
"""

ihatehippies 11 Junior Poster · Answer 17 · 2010-05-22T06:41:55+00:00

hmm you're right, I needed to reverse the arguments... However, mine returns one extra letter. I'd prefer your function merely from its simplicity, but it seems to miss one character.

insequence2("dsasasdadsad", "adsasdasd")
(8, 'asasdasd')

TrustyTony 888 ex-Moderator Team Colleague Featured Poster · Answer 18 · 2010-05-22T14:48:55+00:00

It is when there is difference between maximum match and actual match. Fix would be to find the difference letters and prove to add those letters in place of one letter in match. Comes too complicated, I think. Use your version, lets try to find best formulation for it.

def runcomparision(str1, str2, param):
    ins1, ins2 = insequence(str1, str2) , insequence(str2, str1)
    comlen,comstr = max(ins1,ins2)
    print
    print 'comlen',comlen,'comstr',comstr
    percent = round(200.0 * comlen / len(str1+str2))
    if percent >= param: return percent

def insequence(k,s):
    """ goes through the letters of first word (k) and returns the number of
        letters which are in same order in both words
        """
    lcommon = i = 0 
    ls = len(s)
    res=''
    for c in k:
        j = s.find(c,i)
##        print c,j,s[i:]
        if j >= 0 :
            res+=c
            lcommon += 1 ## add length of common part
            i=j+1
            if i > ls : break
  
    return (lcommon,res)

for x,y in [("joshua wun", "josh wun"),
            ("dsasasdadsad", "adsasdasd")]:
    print x,y,'->',runcomparision(x,y,0),'%'
    print x,y,'->',runcomparision(''.join(sorted(x)),''.join(sorted(y)),0),'%'

TrustyTony 888 ex-Moderator Team Colleague Featured Poster · Answer 19 · 2010-05-26T16:53:32+00:00

Have you checked difflib.SequenceMatcher.get_matching_blocks()?

UPDATE: did not get 8 but 7 match:

from difflib import SequenceMatcher
def runcomparision(str1, str2, param):
    ins1, ins2 = insequence(str1, str2) , insequence(str2, str1)
    comlen,comstr = max(ins1,ins2)
    print
    print 'comlen',comlen,'comstr',comstr
    percent = round(200.0 * comlen / len(str1+str2),3)
    if percent >= param: return percent

def insequence(k,s):
    """ goes through the letters of first word (k) and returns the number of
        letters which are in same order in both words
        """
    lcommon = i = 0 
    ls = len(s)
    res=''
    for c in k:
        j = s.find(c,i)
##        print c,j,s[i:]
        if j >= 0 :
            res+=c
            lcommon += 1 ## add length of common part
            i=j+1
            if i > ls : break
  
    return (lcommon,res)

for x,y in [('yellow','ownyel'),
            ('hello','yellow'),
            ('hello','hzello'),
            ('hit','time'),
            ('hit','hi'),
            ('hit','hot'),
            ("joshua wun", "josh wun"),
            ("dsasasdadsad", "adsasdasd")]:
    print x,y,'->',runcomparision(x,y,0),'%'
    print x,y,'->',runcomparision(''.join(sorted(x)),''.join(sorted(y)),0),'%'
    s = SequenceMatcher(None,x,y)
    seq=''
    tot = 0
    for (start1,start2,le) in s.get_matching_blocks():
        seq+=x[start1:start1+le]
        tot+=le
    print seq,'length',tot
    ## gives same numbers as I got above in float format
    print 100*s.ratio(),'%..',100*s.quick_ratio(), '%'
    print '-'*40

ihatehippies 11 Junior Poster · Answer 20 · 2010-05-26T21:11:06+00:00

So I'm guessing the basic algorithm is similar to yours...

I'd like to see which condition causes the function to miss a match and then see if we could exaggerate it

TrustyTony 888 ex-Moderator Team Colleague Featured Poster · Answer 21 · 2010-05-26T21:33:00+00:00

No, it should use real Ratcliff and Obershelp algorithm recursively finding longest substring from string and then two left over pieces.

Timing looks also much more demanding than my algorith: quadratic.

Timing: The basic Ratcliff-Obershelp algorithm is cubic time in the worst case and quadratic time in the expected case. SequenceMatcher is quadratic time for the worst case and has expected-case behavior dependent in a complicated way on how many elements the sequences have in common; best case time is linear.

Here seems to be discussion about its optimizations:
http://mail.python.org/pipermail/python-dev/2001-January/011619.html

Would be interesting to run for example 20 misspelled words against big word list with these three algorithms and see how far they are in speed and results.

Especially I have feeling that if both calculated instances of insequence (my algorith) are same or very near each other, or max of them is near sorted insequence comparison (max possible similarity by my logic, I do not know how they do this their function, did not read the source) my algorithm is good enough. Then if is easy to see the situation where my algorithm is not good enough, it could fall back to heavier SequenceMatcher or your code which seems most accurate (deleting and permuting?).

I have not measured my code's performance, though. It looks linear time on the surface, but maybe it is not in reality. Do you have good example of variously misspelled words, I do have some English word lists for my anagram program to find correction suggestions. difflib has already one ready definition for such use

difflib.get_close_matches(word, possibilities[, n][, cutoff])

TrustyTony 888 ex-Moderator Team Colleague Featured Poster · Answer 22 · 2010-05-27T02:05:02+00:00

The condition of missing is that in sequence from other word comes letter which should match second occurance instead of first one to do best match. If you take out the first letter of the second word from the test case, the 8 long sequence is found. But removing one by one letters maybe does not make sense so much.

If you know from my function that there is 7 letters match, is there simple way to utilize your code to find if there exist longer than that sequences.

Basically there should be possible to remove one letter from match and replace it by matching two from other sequence after the letter position one before removed one. So there must be enough big jump in other string for that to be possible and common letters there in right position in sequence.

TrustyTony 888 ex-Moderator Team Colleague Featured Poster · Answer 23 · 2010-05-27T15:09:43+00:00

I did one implementation by direct transform of the pseudocode of Longest Common Subsequence, this can be much optimized, especially eliminating the index calculations by keeping the right number in simple variables instead of matrix.

Had to adapt to Pythons 0 based string indexes.

##import pretty
def runcomparision(a,b,limit):
    n1=len(a)
    n2=len(b)
    lcs=[(n1+1)*[0]]+([[0]+[None]*n1]*n2)
    jm1=0 ## j-1
    for j in range(1,n2+1):
        im1=0 ## i-1
        for i in range(1,n1+1):
            if a[im1] == b[jm1]:
##                print i-1,j-1,a[i-1],'\t',
                lcs[j][i] = 1 + lcs[jm1][im1]
            else:
                lcs[j][i] = max(lcs[jm1][i],lcs[j][im1])
            im1 = i
        jm1 = j
##        print
##    print ; pretty.ppr(lcs)
    res = lcs[n2][n1]
    return (res, 200.0*res/(n1+n2))

for x,y in [('yellow','owyel'),
            ('hello','hzello'),
            ('hit','time'),
            ('hit','hi'),
            ('hit','hot'),
            ("joshua wun", "josh wun"),
            ("adsasdasd", "dsasasdadsad"),
            ("computer","houseboat")]:
    print x,y,'-> ',
    print "%i letters, %.2f %%" % runcomparision(x,y,0)
##    print '-'*40

"""Output without debug prints:
yellow owyel ->  4 letters, 72.73 %
hello hzello ->  5 letters, 90.91 %
hit time ->  1 letters, 28.57 %
hit hi ->  2 letters, 80.00 %
hit hot ->  2 letters, 66.67 %
joshua wun josh wun ->  8 letters, 88.89 %
adsasdasd dsasasdadsad ->  8 letters, 76.19 %
computer houseboat ->  3 letters, 35.29 %
"""

TrustyTony 888 ex-Moderator Team Colleague Featured Poster · Answer 24 · 2010-05-27T16:49:00+00:00

Code has problem with letter repeats in first compared to second parameter:

import pretty
def runcomparision(a,b,limit):
    n1=len(a)
    n2=len(b)
    lcs=[(n1+1)*[0]]+([[0]+[None]*n1]*n2)
    jm1=0 ## j-1
    for j in range(1,n2+1):
        print
        im1=0 ## i-1
        for i in range(1,n1+1):
            if a[im1] == b[jm1]:
                lcs[j][i] = 1 + lcs[jm1][im1]
                print im1,jm1,a[im1],lcs[j][i],'\t',
            else:
                lcs[j][i] = max(lcs[jm1][i],lcs[j][im1])
            im1 = i
        jm1 = j
    print ; pretty.ppr(lcs)
    res = lcs[n2][n1]
    return (res, 200.0*res/(n1+n2))

## should produce same answer, but repeating letter matches multiple times,
## when first parameter, resulting answer longer than shorter sequence
for x,y in [('aaaa','aa'),
            ('aa','aaaa'),
            ]:
    print x,y,'-> ',
    print "%i letters, %.2f %%" % runcomparision(x,y,0)

aaaa aa ->
0 0 a 1 	1 0 a 1 	2 0 a 1 	3 0 a 1
0 1 a 1 	1 1 a 2 	2 1 a 3 	3 1 a 4

[
  [0, 0, 0, 0, 0],
  [0, 1, 2, 3, 4],
  [0, 1, 2, 3, 4]]
4 letters, 133.33 %
aa aaaa ->
0 0 a 1 	1 0 a 1
0 1 a 1 	1 1 a 2
0 2 a 1 	1 2 a 2
0 3 a 1 	1 3 a 2

[
  [0, 0, 0],
  [0, 1, 2],
  [0, 1, 2],
  [0, 1, 2],
  [0, 1, 2]]
2 letters, 66.67 %
>>>

TrustyTony 888 ex-Moderator Team Colleague Featured Poster · Answer 25 · 2010-05-27T22:24:13+00:00

Dict works better:

import pretty
def runcomparision(a,b,limit):
    n1=len(a)
    n2=len(b)
    lcs=dict()
    jm1=0 ## j-1
    
    for j in range(1,n2+1):
        lcs[jm1,0] = lcs[j,0] = 0
        im1=0 ## i-1
        for i in range(1,n1+1):
            lcs[0,im1] = lcs[0,i] = 0
            if a[im1] == b[jm1]:
                lcs[j,i] = 1 + lcs[jm1,im1]
            else:
                lcs[j,i] = max(lcs[jm1,i],lcs[j,im1])
            im1 = i
        jm1 = j
    res = lcs[n2,n1]
    return (res, 200.0*res/(n1+n2))

## should produce same answer, but repeating letter matches multiple times,
## when first parameter, resulting answer longer than shorter sequence
for x,y in [('yellow','owyel'),
            ('hello','hzello'),
            ('hit','time'),
            ('hit','hi'),
            ('hit','hot'),
            ("joshua wun", "josh wun"),
            ("adsasdasd", "dsasasdadsad"),
            ("computer","houseboat"),
            ['aa','aaaa']]:

    print x,y,'-> ',
    print "%i letters, %.2f %%" % runcomparision(x,y,0)
""" Output:
yellow owyel ->  3 letters, 54.55 %
hello hzello ->  5 letters, 90.91 %
hit time ->  1 letters, 28.57 %
hit hi ->  2 letters, 80.00 %
hit hot ->  2 letters, 66.67 %
joshua wun josh wun ->  8 letters, 88.89 %
adsasdasd dsasasdadsad ->  8 letters, 76.19 %
computer houseboat ->  3 letters, 35.29 %
aa aaaa ->  2 letters, 66.67 %
"""

ihatehippies 11 Junior Poster · Answer 26 · 2010-05-28T00:09:22+00:00

I made several adjustments to my original code, and it runs much faster with minimal recursions, however its displaying odd behavior that I'm having difficulty understanding. I've run the same two strings through it 3 times and gotten 3 different results. I only get the correct result when I restart idle. I thought that behavior would be unique to idle, so I ran it through my program. Nope.. It's like it's using variable values from instances past. I tried explicitly deleting the instances of the compare class after each run and still it returns bizarre results.. Here's the code:

# this is the code from the main
# program
def RunComparison(str1, str2, param):
    from Compare import Compare
    instance = Compare(str1, str2, param)
    return instance.percent

class Compare:
    recursions = 0
    largest_len = -1
    Sequence = []
    percent = 0

    def __init__(self, str1, str2, param):
        self.str1 = str1.lower()
        self.str2 = str2.lower()
        self.param = param

        self.CommonCharacters()




    def CommonCharacters(self):                                 # Letters in common only
        self.totlen = float( len(self.str1)+len(self.str2) )    # Will check sequence if more                          
        backup = self.str2                                      # than if greater than
        if self.totlen:                                         # required match is found
            for x in self.str1:  
                if x in backup:
                    backup = backup.replace(x, "", 1)

            InComn = len(self.str2) - len(backup)
            percent = (InComn/self.totlen)*200

            if percent >= self.param:
                # if percent of total characters
                # meets or exceeds param requirement
                # check for sequential matches
                percent = self.CommonSequentialCharacters()
                if percent >= self.param:
                    self.percent = percent



    def CommonSequentialCharacters(self):
        SearchResult = []
        for char in self.str1:
            indices = self.FindAll(self.str2, char)
            if indices:
                SearchResult.append(indices)

        self.WalkPaths(SearchResult)

        
        common = float( len(self.Sequence) )
        percent = common*200.0 / self.totlen        

        return round(percent, 2)



    def WalkPaths(self, path, directions=[] ):
        alternate_options = []
        self.recursions += 1
        len_dirs = len(directions)
        if len_dirs:
            last = directions[-1]
        else:
            last = -1

        clearance = (len_dirs+len(path)) - self.largest_len
        while path and clearance > -1:
                       # greater match no
                       # longer possible

            option = self.NextGreatest(last, path[0])
            if option != None:
                # consider the no-option option. Leave
                # everything unchanged and continue down the path.
                if not self.IsNextBest(last, option, path):
                    if clearance > 1:
                        alternate_options.append([ path[1:], directions[:] ])

                directions += [ option ]
                last = option
                len_dirs += 1
            else:
                clearance -= 1

            path = path[1:]
                    
        if len_dirs > self.largest_len:
            # new match leader
            self.largest_len = len_dirs
            self.Sequence = directions[:]

        # consider other options after
        # parent function has yielded results.
        # largest_match will be at its max thus preventing
        # unnecessary recursions with 'dead end'
        # possibilities.
        [ self.WalkPaths(alt_path, alt_dir) for alt_path, alt_dir in alternate_options ]

                   
    def IsNextBest(self, prev, option, path):
        # can reduce recursions
        # by more than 93%
        if prev + 1 == option:
            return True
        while path:
            for x in path[0]:
                if x > prev:
                    if x <= option:
                        return False
                    return True
            path = path[1:]
        return True


    def NextGreatest(self, x, List):
        for y in List:
            if y > x: return y


    def FindAll(self, string, subj):
        return [n for n, x in enumerate(string) if x == subj]

    def Return(self):
        return ("percent", self.percent), ("length", self.largest_len),\
               ("sequence", self.Sequence), ("total recursions", self.recursions)

ihatehippies 11 Junior Poster · Answer 27 · 2010-05-28T00:50:08+00:00

here is a simple example that I am not understanding. Running the code in idle.

>>> x = Compare("kyle", "elyk", 5)
>>> x.CommonCharacters()
25.0
>>> x = Compare("john", "johnny", 5)
>>> x.CommonCharacters()
40.0
>>> x.Return()
(('percent', 40.0), ('length', 2), ('sequence', [3, 4]), ('total recursions', 1))
>>>

... after idle restart

>>> 
>>> x = Compare("john", "johnny", 5)
>>> x.CommonCharacters()
80.0
>>> x.Return()
(('percent', 80.0), ('length', 4), ('sequence', [0, 1, 2, 3]), ('total recursions', 1))
>>>

TrustyTony 888 ex-Moderator Team Colleague Featured Poster · Answer 28 · 2010-05-28T00:57:56+00:00

Here is the standard function with second version for collecting strings of results in addition to length.

def runcomparision(a,b,limit):
    n1=len(a)
    n2=len(b)
    lcs=dict()

    for j in range(n2+1):
        lcs[j,0]=0
    for i in range(n1+1):
        lcs[0,i]=0

    jm1=0 ## j-1
    
    for j in range(1,n2+1):
        im1=0 ## i-1
        for i in range(1,n1+1):
            if a[im1] == b[jm1]:
                lcs[j,i] = 1 + lcs[jm1,im1]
            else:
                lcs[j,i] = max(lcs[jm1,i],lcs[j,im1])
            im1 = i
        jm1 = j
    res = lcs[n2,n1]
    return (res, 200.0*res/(n1+n2))

def getruns(a,b):
    n1=len(a)
    n2=len(b)
    lcs=dict()

    for j in range(n2+1):
        lcs[j,0]=[0,'']
    for i in range(n1+1):
        lcs[0,i]=[0,'']

    jm1=0 ## j-1
    
    for j in range(1,n2+1):
        im1=0 ## i-1
        for i in range(1,n1+1):
            if a[im1] == b[jm1]:
                lcs[j,i] = (1 + lcs[jm1,im1][0]),lcs[jm1,im1][1]+a[im1]
            else:
                lcs[j,i] = max(lcs[jm1,i],lcs[j,im1])
            im1 = i
        jm1 = j
    res,seq = lcs[n2,n1]
    return (res, 200.0*res/(n1+n2),seq)
    

## should produce same answer, but repeating letter matches multiple times,
## when first parameter, resulting answer longer than shorter sequence
for x,y in [('yellow','owyel'),
            ('hello','hzello'),
            ('hit','time'),
            ('hit','hi'),
            ('hit','hot'),
            ("joshua wun", "josh wun"),
            ("dsasasdadsad","adsasdasd"),
            ("computer","houseboat"),
            ['aa','aaaa']]:
    print x,y,'-> ',
    print "%i letters, %.2f %%, seq %s" % getruns(x,y)

""" Output:
yellow owyel ->  3 letters, 54.55 %, seq yel
hello hzello ->  5 letters, 90.91 %, seq hello
hit time ->  1 letters, 28.57 %, seq t
hit hi ->  2 letters, 80.00 %, seq hi
hit hot ->  2 letters, 66.67 %, seq ht
joshua wun josh wun ->  8 letters, 88.89 %, seq josh wun
dsasasdadsad adsasdasd ->  8 letters, 76.19 %, seq dsasdasd
computer houseboat ->  3 letters, 35.29 %, seq out
aa aaaa ->  2 letters, 66.67 %, seq aa
"""

My bizarre behaviour for code came from the fact that the lines became associated with same list from initialization, even it had no variable references except length counts.

TrustyTony 888 ex-Moderator Team Colleague Featured Poster · Answer 29 · 2010-05-28T10:16:28+00:00

TrustyTony 888 ex-Moderator

15 Years Ago

For some reason, looks like you must not make directions default parameter, but include for first call as [].

For the code which seemed to work for me see attachment, my lcs still not optimized.

This attachment is potentially unsafe to open. It may be an executable that is capable of making changes to your file system, or it may require specific software to open. Use caution and only open this attachment if you are comfortable working with zip files.

sequences.zip (2.23 KB)

Edited 15 Years Ago by TrustyTony because: n/a

improve this method

Recommended Answers Collapse Answers

All 52 Replies

Recommended Answers