Thanks for fixing mine, I was going insane. Do you have any idea why declaring the default value messes it up?

***looking at your code now

Could you walk me through your code so I can try to understand it?

Just ran a benchmark test on it... Looks like yours is about 4x as fast:

#
# mine
#
>>> bench("ajkfdhsajkdhas", "jskahdksjhdewasda")
(('percent', 51.609999999999999), ('length', 8), ('sequence', [0, 2, 4, 7, 8, 10, 13, 14]), ('total recursions', 64))
0.00399994850159 seconds
#
# yours
#
>>> bench2("ajkfdhsajkdhas", "jskahdksjhdewasda")
(8, 51.609999999999999, 'jkhsjhas')
0.000999927520752 seconds

A correction to the speed test: your processing time appears to grow gently (roughly with the product of the word lengths), whereas mine grows exponentially.

>>> bench("ahfdjkdewhkjashdkjashfjakjahdajksdhkajewaksdh", "akjsdhaskjfdhajkfhakjhdasjkdhsakjdhasks")
(('percent', 59.520000000000003), ('length', 25), ('sequence', [0, 1, 4, 5, 6, 7, 8, 9, 10, 14, 18, 19, 20, 21, 22, 23, 25, 26, 27, 28, 30, 32, 35, 37, 38]), ('total recursions', 275577))
8.3069999218 seconds
>>> bench2("ahfdjkdewhkjashdkjashfjakjahdajksdhkajewaksdh", "akjsdhaskjfdhajkfhakjhdasjkdhsakjdhasks")
(25, 59.520000000000003, 'akdhaskjhjakjhdajkdhkjaks')
0.007000207901 seconds

..same results, in 1/1186.7 of the time

It is an implementation of the standard Longest Common Subsequence dynamic programming algorithm from the pseudocode. The idea is to propagate the best sequence to the line under it, or take the sequence to the left of it if that one is longer, except when the letter pair matches, in which case the letter is added to the sequence from the diagonal. The code does not check for hopeless cases in advance.
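For reference, a minimal sketch of that recurrence in plain table form (not the posted code, just the textbook shape of it):

def lcs_length(a, b):
    # table[i][j] = length of the LCS of a[:i] and b[:j]
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ch1 in enumerate(a, 1):
        for j, ch2 in enumerate(b, 1):
            if ch1 == ch2:
                # letter pair matches: extend the sequence from the diagonal
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                # otherwise spread the better of the cell above and the cell left
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]   # the last cell already holds the answer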

I am curious to see the same run on the same computer for my max-insequence parameter swap and for the difflib function. It would also be nice to know whether my hack of remembering the m1 variables from the previous loop gained anything compared to a normal -1 index. Of course, if you do not need the sequence itself, the number-only original of my two functions is faster; the speed difference between it and the string-collecting version would also be of interest.

Also, if you could, test the performance of all the functions with quite similar words, i.e. spelling mistakes.

The algorithm was kindly given in Griswolf's post: http://www.daniweb.com/forums/post1225592.html#post1225592

I based mine on a version from a university article, but the code was practically the same. Only the return value I had to figure out myself (you don't need the max of the whole table; the last element is enough).

Here is a quick benchmark test:

>>> from time import time
>>> from db import database
>>> def bench(word):
    start = time()
    perc = 0
    match = 0
    # database is the attached english dictionary
    for aword in database:
        # getMatch is the longest common sequence function
        new = getMatch(word, aword)
        if new > perc:
            perc = new
            match = aword
    else:
        elapsed = time() - start
        print "found", match, "(", perc, "% )"
        print elapsed, "seconds"

		
>>> bench("bubbulous")
found unbibulous ( 84.0 % )
31.7599999905 seconds
>>>

I got:

found unbibulous ( (8, 84.21052631578948, 'ubbulous') % )
41.5940001011 seconds

When I did:

from time import time
from db import database
from lcs import getruns as getMatch
def bench(word):
    start = time()
    perc = 0
    match = 0
    # database is the attached english dictionary
    for aword in database:
        # getMatch is the longest common sequence function
        new = getMatch(word, aword)
        if new > perc:
            perc = new
            match = aword
    else:
        elapsed = time() - start
        print "found", match, "(", perc, "% )"
        print elapsed, "seconds"

bench("bubbulous")

And if I change the matching to my simple function (runcomparision with the third parameter dropped, so it always returns the percent):

from insequence import runcomparision as getMatch

I got:

found unbibulous ( 84.0 % )
5.79699993134 seconds

Top ten matches look quite close with both functions:

from time import time
from db import database
from lcs import getruns as getMatch
##from insequence import runcomparision as getMatch

def mybench(word):
    start = time()
    database.sort(key=lambda x: getMatch(word,x), reverse=True)
    elapsed = time() - start
    print "found", [(data, getMatch(word,data)) for data in database[:10]]
    print elapsed, "seconds"

mybench("bubbulous")
raw_input('Press Enter to quit')

insequence based:

found [('unbibulous', 84.0), ('bibulous', 82.0), ('tubulous', 82.0),
('bibulus', 75.0), ('bulbous', 75.0), ('bullous', 75.0), ('tubulus', 75.0),
('bibulously', 74.0), ('puberulous', 74.0), ('tubulously', 74.0)]
5.20299983025 seconds
Press Enter to quit

lcs based:

found [('unbibulous', 84.21052631578948), ('bibulous', 82.352941176470594),
('tubulous', 82.352941176470594), ('bibulus', 75.0), ('bulbous', 75.0),
('bullous', 75.0), ('tubulus', 75.0), ('bibulously', 73.684210526315795),
('blubberous', 73.684210526315795), ('puberulous', 73.684210526315795)]
38.4060001373 seconds
Press Enter to quit

Huge improvement when you only run lcs on words that meet a certain threshold:

>>> from dictionary.db import database
>>> from time import time
>>> def bench(word):
    start = time()
    perc = 0
    match = 0
    # database is the attached english dictionary
    for aword in database:
        # getMatch is the longest common sequence function
        if not LIC(word, aword):
            continue
        new = getMatch(word, aword)
        if new > perc:
            perc = new
            match = aword
    else:
        elapsed = time() - start
        print "found", match, "(", perc, "% )"
        print elapsed, "seconds"

		
>>> def LIC(one, two):
    len_two = len(two)
    total = len(one) + len_two
    for x in one:
        if x in two:
            two = two.replace(x, '', 1)
    return (len_two - len(two)) * 200.0 / total > 70

>>> bench("bubbulous")
found unbibulous ( 84.0 % )
0.897000074387 seconds
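The reason the filter is safe: the letters-in-common count is an upper bound on the LCS length (every letter of a common subsequence is a common letter), so a word that cannot pass 70 on the LIC score can never pass 70 on the LCS score either. A quick sanity check:

>>> LIC("bubbulous", "unbibulous")   # 8 common letters, total 19 -> 84.2 > 70
True
>>> LIC("bubbulous", "zzz")          # no common letters -> 0 > 70
False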

Timing on my computer with psyco:
lcs

bubbulous  best 10 matches found:
('unbibulous', 84.21052631578948)
('bibulous', 82.352941176470594)
('tubulous', 82.352941176470594)
('bibulus', 75.0)
('bulbous', 75.0)
('bullous', 75.0)
('tubulus', 75.0)
('bibulously', 73.684210526315795)
('blubberous', 73.684210526315795)
('puberulous', 73.684210526315795)
22.2809998989 seconds
Press Enter to quit

Insequence

bubbulous  best 10 matches found:
('unbibulous', 84.0)
('bibulous', 82.0)
('tubulous', 82.0)
('bibulus', 75.0)
('bulbous', 75.0)
('bullous', 75.0)
('tubulus', 75.0)
('bibulously', 74.0)
('puberulous', 74.0)
('tubulously', 74.0)
2.46900010109 seconds
Press Enter to quit

(Re: "Huge improvement when you only run lcs on words that meet a certain threshold ... 0.897000074387 seconds")

Yes, it looks like that also removes much of the time difference from my insequence version; the gain from psyco is then also not so big:

bubbulous  best 10 matches found:
('unbibulous', 84.21052631578948)
('bibulous', 82.352941176470594)
('tubulous', 82.352941176470594)
('bibulus', 75.0)
('bulbous', 75.0)
('bullous', 75.0)
('tubulus', 75.0)
('bibulously', 73.684210526315795)
('blubberous', 73.684210526315795)
('puberulous', 73.684210526315795)
0.75 seconds
Press Enter to quit

Psyco:

bubbulous  best 10 matches found:
('unbibulous', 84.21052631578948)
('bibulous', 82.352941176470594)
('tubulous', 82.352941176470594)
('bibulus', 75.0)
('bulbous', 75.0)
('bullous', 75.0)
('tubulus', 75.0)
('bibulously', 73.684210526315795)
('blubberous', 73.684210526315795)
('puberulous', 73.684210526315795)
0.671999931335 seconds
Press Enter to quit
Insequence:

bubbulous  best 10 matches found:
('unbibulous', 84.0)
('bibulous', 82.0)
('tubulous', 82.0)
('bibulus', 75.0)
('bulbous', 75.0)
('bullous', 75.0)
('tubulus', 75.0)
('bibulously', 74.0)
('puberulous', 74.0)
('tubulously', 74.0)
0.625 seconds
Press Enter to quit

Psyco:

bubbulous  best 10 matches found:
('unbibulous', 84.0)
('bibulous', 82.0)
('tubulous', 82.0)
('bibulus', 75.0)
('bulbous', 75.0)
('bullous', 75.0)
('tubulus', 75.0)
('bibulously', 74.0)
('puberulous', 74.0)
('tubulously', 74.0)
0.656000137329 seconds
Press Enter to quit
Here is the benchmark script (with insequence selected):

from time import time
from db import database
##from lcs import getruns as getMatch
from insequence import runcomparision as getMatch

def lic(one, two):
    len_two = len(two)
    total = len(one)+len_two
    for x in one:
        if x in two:
            two = two.replace(x, '', 1)
    return (len_two - len(two))*200.0/total

def mybench(word):
    start = time()
    dbase=[d for d in database if lic(word,d)>70]
    dbase.sort(key=lambda x: getMatch(word,x), reverse=True)
    elapsed = time() - start
    print word," best 10 matches found: "
    for i in [(data, getMatch(word,data)) for data in dbase[:10]]:
        print i
    print elapsed, "seconds"


mybench("bubbulous")

raw_input('Press Enter to quit')

I organized the dictionary database by the first two letters of every word, i.e. there is one list for words that start with 'aa', another for words that start with 'ab', and so on and so forth, making a direct lookup much faster. Do you think there is a way to organize the database so that LCS would run much faster? How does Google complete your search with spelling corrections in 0.3 seconds? Supercomputers?

def lookup(word):
    # 97 == ord('a'): index straight into the bucket for the word's first two letters
    index1 = ord(word[0]) - 97
    index2 = ord(word[1]) - 97
    return word in database2[index1][index2]
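In case it is useful, here is a sketch of how such buckets could be built; database2 here is my guess at the structure, assuming database is a flat list of lowercase ASCII words of length >= 2:

database2 = [[[] for _ in range(26)] for _ in range(26)]
for w in database:
    database2[ord(w[0]) - 97][ord(w[1]) - 97].append(w)   # 97 == ord('a')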

Thanks for your trick of removing letters one by one with in and replace(x, '', 1). I changed my full anagram program to use it instead of a letter-by-letter scan and find, and got good news:

Test for anagrams of 'Tony Veijalainen' with the Finnish dictionary (my control case):

before: 353 words loaded in 78 ms, 1564 anagrams found in 641 ms
after: 353 words loaded in 71 ms, 1564 anagrams found in 543 ms
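For the curious, a minimal sketch of the trick as the anagram program now uses it (a hypothetical helper, not the actual program):

def spellable_from(word, letters):
    # can `word` be spelled from the multiset of `letters`?
    for ch in word:
        if ch in letters:
            letters = letters.replace(ch, '', 1)   # consume one occurrence
        else:
            return False
    return True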

Why is Google so fast?

It uses extremely good data management,
for example, it has its own Google File System,
uses a column-oriented DBMS,
it has some very ingenious matching algorithms,
and not to mention its highly trained batch of pigeons.

Looks like we are at least beating difflib:

from time import time
from db import database
from lcs import getruns as getMatch
from difflib import get_close_matches
##from insequence import runcomparision as getMatch

def mybench(word):
    start = time()

    print get_close_matches(word,database,n=10)

    print time()-start, "seconds"
    print '-'*40
    
mybench("bubbulous")

raw_input('Press Enter to quit')
"""Output:
['unbibulous', 'tubulous', 'bibulous', 'tubulus', 'bullous',
'bulbous', 'bibulus', 'unnebulous', 'unfabulous', 'tubulously']
1.73399996758 seconds
----------------------------------------
Press Enter to quit
"""

Without psyco the timing is actually quite close (psyco in the imported module is what matters).

lcs without psyco:

bubbulous  best 10 matches found:
('unbibulous', (8, 84.21052631578948))
('bibulous', (7, 82.352941176470594))
('tubulous', (7, 82.352941176470594))
('bibulously', (7, 73.684210526315795))
('blubberous', (7, 73.684210526315795))
('puberulous', (7, 73.684210526315795))
('tubulously', (7, 73.684210526315795))
('unfabulous', (7, 73.684210526315795))
('unnebulous', (7, 73.684210526315795))
('pseudobulbous', (7, 63.636363636363633))
1.57800006866 seconds
Press Enter to quit

Any progress on speeding up lcs?

I made a few minor tweaks to speed up the dictionary lookup. I also factored the letters in common into the lcs calculation, which seems to predict correct spellings more often.

def bench(word):
    start = time()
    word_len = len(word)

    minus1 = word_len - 1

    result = []

    best = 70
    for otherWord in database:
        two, match = otherWord, 0
        total = len(otherWord)+word_len
        req = total*best*.005
        for n, x in enumerate(word):
            if x in two:
                two = two.replace(x, '', 1)
                match += 1
            elif match + (minus1-n) < req:
                break
        else:
            match = (match*200.0/total+longest_common_sequence(word, otherWord))/2
            result.append( (match, otherWord) )
            if match > best:
                best = match

    elapsed = time() - start

    result.sort(reverse=1)

    if len(result) > 10:
        result = result[:10]

    print result
    print elapsed, "seconds"
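For the record, the early exit works because match*200.0/total > best is the same condition as match > total*best/200, which is exactly req. A worked check, using the names from bench() above:

word_len = len("bubbulous")    # 9
minus1 = word_len - 1          # 8
total = word_len + 10          # against a 10-letter word: 19
req = total * 70 * .005        # 6.65 matched letters needed to beat 70%
# after position n, at most (minus1 - n) further letters of `word` can still
# match, so once match + (minus1 - n) < req the word can never reach the
# threshold and the inner loop breaks without calling the lcs function at all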

Yes, you cut quite a few percent from the running time, at least in this test case of yours. I updated my Python to the latest version, so some of the improvement must come from that as well.

[(84.21052631578948, 'unbibulous'), (82.352941176470594, 'tubulous'), (82.352941176470594, 'bibulous'), (81.25, 'bulbous'), (78.94736842105263, 'blubberous'), (59.259259259259267, 'tuberculatogibbous')]
0.453000068665 seconds
Press Enter to quit
from __future__ import print_function
from time import time
from db import database
try:
    import psyco                        # python specialising compiler
    psyco.full()
except ImportError:
    print('Install psyco for faster execution')

def longest_common_sequence(a,b):
    n1=len(a)
    n2=len(b)
    lcs=dict()

    for j in range(n2+1):
        lcs[j,0]=0
    for i in range(n1+1):
        lcs[0,i]=0

    jm1=0 ## j-1
    
    for j in range(1,n2+1):
        im1=0 ## i-1
        for i in range(1,n1+1):
            if a[im1] == b[jm1]:
                value = 1 + lcs[jm1,im1]
            else:
                value = lcs[jm1,i]
                if lcs[j,im1]>value:
                    value = lcs[j,im1]
            lcs[j,i] = value           
            im1 = i
        jm1 = j
    res = lcs[n2,n1]
    return 200.0*res/(n1+n2)

def bench(word):
    start = time()
    word_len = len(word)

    minus1 = word_len - 1

    result = []

    best = 70
    for otherWord in database:
        two, match = otherWord, 0
        total = len(otherWord)+word_len
        req = total*best*.005
        for n, x in enumerate(word):
            if x in two:
                two = two.replace(x, '', 1)
                match += 1
            elif match + (minus1-n) < req:
                break
        else:
            match = (match*200.0/total+longest_common_sequence(word, otherWord))/2
            result.append( (match, otherWord) )
            if match > best:
                best = match

    elapsed = time() - start

    result.sort(reverse=1)

    if len(result) > 10:
        result = result[:10]

    print(result)
    print(elapsed, "seconds")

bench("bubbulous")

raw_input('Press Enter to quit')

Yesterday I did some experiments to replace the dictionary with a one-line buffer list and a three-variable sliding window, but I have not finished the code yet. I don't know if it reduces time as well, or only the space requirement (which matters when the two things compared are files, for example). It seems possible to read both compared words with a plain for loop over the elements, without any need for indexing.
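The way I picture the window (names as in the code I post below):

#           b[j-1]    b[j]
# a[i-1]    corner    over
# a[i]      left      this
#
# only these three neighbours feed each new cell, so one previous row
# (the one-line buffer) plus three scalar variables is all the memory needed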

Maybe I should use the standard library's heapq module. I looked a little into the difflib module and it uses heapq in some places. It would also be good to see how difflib stores the data of its second argument for repeated checks against it.
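One thing I do know about difflib: SequenceMatcher caches detailed information about its second sequence, so when checking many words against one target it pays to set seq2 once and only swap seq1. A sketch, not from this thread's code:

from difflib import SequenceMatcher
from db import database

sm = SequenceMatcher()
sm.set_seq2("bubbulous")       # analysed once and cached across calls
for aword in database:
    sm.set_seq1(aword)         # cheap to change on every iteration
    score = sm.ratio() * 100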

There are many spell-check programs under the GPL; there could be ideas in how they present the correct choices for misspelled words.

I got the code to run with only three variables plus one line of memory as a list.

Unfortunately, as is, it is slower than the array version; I should probably switch to another data type for the queue (heapq).

from __future__ import print_function
from time import time
from db import database
try:
    import psyco                        # python specialising compiler
    psyco.full()
except ImportError:
    print('Install psyco for faster execution')

def longest_common_sequence(a,b):
    
    n1=len(a)
    n2=len(b)
       
    previous=[]
    for i in range(n2):
        previous.append(0)

    over = 0
    for ch1 in a:
        left = corner = 0
        for ch2 in b:
            over = previous.pop(0)   # list.pop(0) shifts the whole list: O(n) per call
            if ch1 == ch2:
                this = corner + 1
            else:
                this = over if over >= left else left
            previous.append(this)
            left, corner = this, over
    return 200.0*previous.pop()/(n1+n2)

def bench(word):
    start = time()
    word_len = len(word)

    minus1 = word_len - 1

    result = []

    best = 70
    for otherWord in database:
        two, match = otherWord, 0
        total = len(otherWord)+word_len
        req = total*best*.005
        for n, x in enumerate(word):
            if x in two:
                two = two.replace(x, '', 1)
                match += 1
            elif match + (minus1-n) < req:
                break
        else:
            match = (match*200.0/total+longest_common_sequence(word, otherWord))/2
            result.append( (match, otherWord) )
            if match > best:
                best = match

    elapsed = time() - start

    result.sort(reverse=1)

    if len(result) > 10:
        result = result[:10]

    print(result)
    print(elapsed, "seconds")

    
if __name__ == '__main__':

    bench("bubbulous")
    raw_input('Press Enter to quit')

heapq is for collecting items in order; collections.deque is for a queue or a stack.
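A tiny illustration of the difference:

from heapq import heappush, nlargest
from collections import deque

heap = []
for score in (3, 1, 2):
    heappush(heap, score)      # a heap keeps the smallest item first
print(nlargest(2, heap))       # [3, 2] -- ordered retrieval

queue = deque([1, 2, 3])
queue.append(4)                # O(1) at the right end
queue.popleft()                # O(1) at the left end; list.pop(0) is O(n)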

I tried them, but it had no effect on the speed of this one test case.

from __future__ import print_function
from time import time
from db import database
try:
    import psyco                        # python specialising compiler
    psyco.full()
except ImportError:
    print('Install psyco for faster execution')
from heapq import heappush, heappop, nlargest
from collections import deque


def longest_common_sequence(a,b):
    
    n1=len(a)
    n2=len(b)
       
    previous=deque()
    for i in range(n2):
        previous.append(0)

    over = 0
    for ch1 in a:
        left = corner = 0
        for ch2 in b:
            over = previous.popleft()   # deque.popleft() is O(1)
            if ch1 == ch2:
                this = corner + 1
            else:
                this = over if over >= left else left
            previous.append(this)
            left, corner = this, over
    return 200.0*previous[-1]/(n1+n2)   # last cell of the final row; safe even if a is empty

def bench(word):
    start = time()
    word_len = len(word)

    minus1 = word_len - 1

    result = []

    best = 70
    for otherWord in database:
        two, match = otherWord, 0
        total = len(otherWord)+word_len
        req = total*best*.005
        for n, x in enumerate(word):
            if x in two:
                two = two.replace(x, '', 1)
                match += 1
            elif match + (minus1-n) < req:
                break
        else:
            match = (match*200.0/total+longest_common_sequence(word, otherWord))/2
            heappush(result,(match, otherWord))
                    
            if match > best:
                best = match

    elapsed = time() - start

    print(nlargest(10,result))
    print("%i milliseconds" % (1000*elapsed) )
   
if __name__ == '__main__':
    bench("bubbulous")
    raw_input('Press Enter to quit')

Just finished reading http://www.python.org/doc/essays/list2str.html looking for a way to speed up longest common sequence. Any luck on your end?

.... Any insights from the veterans?

Trying to speed up this code:

def getruns(a,b):
    n1=len(a)
    n2=len(b)
    lcs=dict()
 
    for j in range(n2+1):
        lcs[j,0] = (0, '')   # tuples, not lists, so max() below compares cells consistently
    for i in range(n1+1):
        lcs[0,i] = (0, '')
 
    jm1=0 ## j-1
 
    for j in range(1,n2+1):
        im1=0 ## i-1
        for i in range(1,n1+1):
            if a[im1] == b[jm1]:
                lcs[j,i] = (1 + lcs[jm1,im1][0]),lcs[jm1,im1][1]+a[im1]
            else:
                lcs[j,i] = max(lcs[jm1,i],lcs[j,im1])
            im1 = i
        jm1 = j
    res,seq = lcs[n2,n1]
    return (res, 200.0*res/(n1+n2),seq)

Or the renamed version:

def longest_common_sequence(one, two):
   len_one, len_two = len(one), len(two)
   longestSequence = {}

   for x in xrange(len_two+1):
      longestSequence[x, 0] = (0, '')
   for x in xrange(len_one+1):
      longestSequence[0, x] = (0, '')

   prev_two=0 ## j-1

   for two_index in xrange(1, len_two+1):
      prev_one = 0 ## i-1
      for one_index in xrange(1, len_one+1):
         if one[prev_one] == two[prev_two]:
            longestSequence[two_index, one_index] = (1 + longestSequence[prev_two, prev_one][0]), longestSequence[prev_two, prev_one][1]+one[prev_one]
         else:
            longestSequence[two_index, one_index] = max(longestSequence[prev_two, one_index], longestSequence[two_index, prev_one])
         prev_one = one_index
      prev_two = two_index
   seq_len, seq = longestSequence[len_two, len_one]
   return round(200.0*seq_len/(len_one+len_two))
##    return (seq_len, round(200.0*seq_len/(len_one+len_two), 2), seq)

I only eliminated that dict-based table and reduced the space requirement to one line plus three variables, as I posted above. It did not make any difference in speed, though it did not slow things down either (with psyco; I did not check without).

There is maybe a possibility to optimize a little by keeping a record of the maximum value in the line and the remaining length of the shorter string: if one already-started sequence is guaranteed to stay the longest, the last loops could cover only the end part of each line, one cell shorter per line (the longest possible match shrinks by one each iteration, and the starting maximum is the number of common letters). I do not know if it is worth the bookkeeping cost, though. I also do not know if removing the non-common letters before the call would help the speed, since you already do that check for one of the words (or at least the common letters could be collected, in the order of the original word, as they are checked).

If you use the version which does not update the actual sequence, it speeds things up a little.

Here is a variant of my one-line-queue version which also collects and returns the actual string.

from collections import deque
def lcs_tuple(a,b):
    
    n1=len(a)
    n2=len(b)

    previous=deque()
    for i in range(n2):
        previous.append((0,''))

    over = (0,'')
    for i in range(n1):
        left = corner = (0,'')
        for j in range(n2):
            over = previous.popleft()
            if a[i] == b[j]:
                this = corner[0] + 1, corner[1]+a[i]
            else:
                this = max(over, left)   # note: on ties this also compares the collected strings
            previous.append(this)
            left, corner = this, over
    return 200.0*previous[-1][0]/(n1+n2), previous[-1][1]   # final cell: percent and string
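A quick check of the tuple version:

>>> lcs_tuple("abcde", "ace")
(75.0, 'ace')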