I have another problem. Same dataset as before: a huge collection of floats stored as a list of lists, with about 37 floats per inner list and over 60,000 lists. So something like: [[16166.00, 4123.15, ...], [761.1364, ...]].

I'm trying to find the Euclidean distance between one list and each of the lists in another set. What I have is below. I guess it works, but it's too bloody slow: it takes a couple of seconds to go through the nested loop once, and since it has to do that over 60,000 times, we're talking over a day to complete. Plus, I need to find the 3 shortest for each, which will take just as long to process.

(Though they're labeled *matrix*, they're standard lists of lists.)

for i in datamatrix:
    distances = []
    for j in testmatrix:
        temp = (array(i, float)-array(j, float))**2
        sum = 0.
        [sum + n for n in temp]
        distances.append(sqrt(sum))
    alldistances.append(distances)

Is there some library that will compute this quickly or some better means of writing this in python?

There might be a better algorithm. Why do you need to precompute all of those distances? Could you use lazy evaluation? Or, could you compute distance *squared* for your purpose, which eliminates the expensive sqrt call?
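A tiny illustration of the distance-squared point, with made-up numbers: since sqrt is monotonically increasing, ranking points by squared distance picks out the same nearest neighbour as ranking by true distance.

```python
import math

def dist_sq(p, q):
    # sum of squared componentwise differences; no sqrt needed for comparison
    return sum((x - y) ** 2 for x, y in zip(p, q))

a = [1.0, 2.0, 3.0]
points = [[4.0, 0.0, 1.0], [1.5, 2.5, 2.0], [9.0, 9.0, 9.0]]

# sqrt is monotonic, so ranking by d**2 matches ranking by d
nearest_by_sq = min(points, key=lambda p: dist_sq(a, p))
nearest_by_d = min(points, key=lambda p: math.sqrt(dist_sq(a, p)))
assert nearest_by_sq == nearest_by_d == [1.5, 2.5, 2.0]
```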

Also, I may be being dense, but what does this line do:

[sum + n for n in temp]

?

Jeff

Ah, I was messing around with the code for quite a while to see if I could get it to run any faster, and that one line is meaningless, I think; it's just how things were left when I got frustrated with it.

What I've been attempting to do is work on a k-nearest neighbor algorithm. Basically, I have a huge chunk of data that I know the classes (or labels - whatever you want to call it) for. Then I have another huge chunk of data that I don't know the classes for. The idea being that you can assign the unknown classes by finding the k nearest points from the set you do know the classes for.
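For reference, the idea just described fits in a few lines of plain Python. A minimal sketch with made-up points and labels (the function names are mine, not from any library):

```python
from collections import Counter

def dist_sq(p, q):
    # squared Euclidean distance between two equal-length vectors
    return sum((x - y) ** 2 for x, y in zip(p, q))

def knn_classify(unknown, known, labels, k=3):
    # sort the labelled points by distance to the unknown point,
    # then take a majority vote among the k nearest labels
    order = sorted(range(len(known)), key=lambda i: dist_sq(unknown, known[i]))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

known = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]]
labels = ["a", "a", "b", "b"]
guess = knn_classify([0.05, 0.1], known, labels, k=3)  # two "a"s outvote one "b"
```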

I've ditched my original code. It was slow and not terribly neat. I patched together something that fully works from some 'biopython' libraries and Numeric (I know, I know, Numeric is old and outdated, but this is what it called for and I didn't want to spend even more time seeing if I could get numpy or scipy or whatever doing the same thing) hoping it would clear up the speed issues and, sadly, it didn't. It still takes a couple seconds to determine each vector's class (and I have thousands for it to go through).

Well, several things jump out.

First, just so I understand, you are taking a list of 37-dimensional vectors in the datamatrix and finding the three that are closest to something in the testmatrix?

(1) ditch the sqrt() call. If you need the Euclidean metric, then distance squared works just as well as distance for comparison purposes.

(2) The line

temp = (array(i, float)-array(j, float))**2

is confusing. I read it as converting the vectors i and j into arrays, subtracting them, and squaring the result elementwise. Is that right? If so, that gives an array of squared componentwise differences, not the distance itself; you still need to sum them.

What you would need is something like

tmp = 0
for x, y in zip(i, j):
    tmp += (x - y)**2

(3) If there's some way to order the vectors in the test matrix, you might be able to use lazy evaluation to quit the loop once the distance gets too big.
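That early-exit idea can even be sketched without any ordering of the vectors: accumulate the squared distance term by term and bail out as soon as the running total exceeds the best distance seen so far. A rough sketch (names are hypothetical):

```python
def dist_sq_capped(p, q, cap):
    # accumulate the squared distance, bailing out once it exceeds cap
    total = 0.0
    for x, y in zip(p, q):
        total += (x - y) ** 2
        if total > cap:
            return None  # already worse than the best seen so far
    return total

def findclosest_early(a, X):
    # linear scan that skips the tail of each distance computation
    # whenever a candidate is already beaten
    best, best_d = None, float("inf")
    for item in X:
        d = dist_sq_capped(a, item, best_d)
        if d is not None:
            best, best_d = item, d
    return best
```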

Hope it helps. I have a couple more thoughts, but they need developing first.

Jeff

I can't edit the post above, so here's a new post.

If you can download numpy, then this appears to be useful. The time for finding a single match among a data set of 60,000 came down to about a second:

import numpy
import time

def findclosest(a, X):
    """Finds the vector in X closest to a under the squared Euclidean metric.
    This is the computationally expensive area, so optimize here."""

    if not len(X):
        return None
    current = X[0]
    smallest = numpy.inner(a-current, a-current)
    for item in X[1:]:
        d = numpy.inner(a-item, a-item)
        if d < smallest:
            current = item
            smallest = d
    return current

def findclosesttaxicab(a, X):
    """Finds the vector in X closest to a under the taxicab (L1) metric."""

    if not len(X):
        return None
    current = X[0]
    smallest = numpy.abs(a-current).sum()
    for item in X[1:]:
        d = numpy.abs(a-item).sum()
        if d < smallest:
            current = item
            smallest = d
    return current

def find_first_n(data, X, n=3):
    """
    find_first_n(data, X, n=3) --> [(fit, vector, match), ...]

    Finds up to the first n vectors along with their matches, sorted by
    order of fit (smallest squared distance first)."""

    # A little ugly, but we can't use dictionaries, since vectors may not
    # be unique AND because dictionaries don't have ordering.

    # Create a list of up to n winners: [(fit, vector, match), ...] ordered
    # by fit from smallest to largest.
    winners = []
    for i in range(min(n, len(data))):
        match = findclosest(data[i], X)
        diff = data[i] - match
        fit = numpy.inner(diff, diff)   # squared distance, not a raw inner product
        for x in range(len(winners)):
            if fit < winners[x][0]:
                winners = winners[:x] + [(fit, data[i], match)] + winners[x:]
                break
        else:
            winners.append((fit, data[i], match))

    return winners

def find_best_n(data, X, n=3):
    """
    find_best_n(data, X, n=3) --> [(fit, vector, match), ...]

    Finds the best n vectors along with their matches, sorted by
    order of fit."""

    # Prime the pump with the first n...
    winners = find_first_n(data, X, n)

    # ...then see whether any of the remaining vectors beat them.
    for i in range(n, len(data)):
        match = findclosest(data[i], X)
        diff = data[i] - match
        fit = numpy.inner(diff, diff)   # squared distance again
        for x in range(len(winners)):
            if fit < winners[x][0]:
                winners = (winners[:x] + [(fit, data[i], match)] + winners[x:])[:n]
                break
    return winners

X = numpy.random.rand(60000,37)
data = numpy.random.rand(200,37)

# optimization metrics
start = time.time()
for i in data:
    findclosest(i,X)
stop = time.time()
print "findclosest: Dt = %0.2f for %d matches in array of size %d" %\
      (stop - start,len(data),len(X))

start = time.time()
for i in data:
    findclosesttaxicab(i,X)
stop = time.time()
print "findclosesttaxicab: Dt = %0.2f for %d matches in array of size %d" %\
      (stop - start,len(data),len(X))
# END optimization metrics

#best = find_best_n(data, X, 3)
#print best, len(best)

>>> 
findclosest: Dt = 198.63 for 200 matches in array of size 60000
findclosesttaxicab: Dt = 275.77 for 200 matches in array of size 60000  # surprise!  I thought it would be faster
>>>
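One more thought: the per-row loop inside findclosest can be pushed into numpy entirely via broadcasting, which usually cuts the per-match time dramatically. A sketch (function names are mine):

```python
import numpy

def findclosest_vectorized(a, X):
    # squared Euclidean distance from a to every row of X in one shot
    diff = X - a                        # broadcasting: (N, 37) - (37,) -> (N, 37)
    d2 = (diff * diff).sum(axis=1)      # one squared distance per row
    return X[d2.argmin()]

def find_k_nearest(a, X, k=3):
    # indices and squared distances of the k rows of X closest to a
    diff = X - a
    d2 = (diff * diff).sum(axis=1)
    idx = d2.argsort()[:k]              # k smallest squared distances
    return idx, d2[idx]
```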

HtH,
Jeff

Oh, wow. So this takes one second to determine the class for each row in the test set? Then it would take me around 18 hours instead of a day and a half to get my results, which is definitely an improvement. I ran what I had on my test sets (I have three) last night and the two smaller ones have finished. The big one is still cranking away.

The bigger issue for me is the data normalisation process; the data is a mess, and with it taking so long to process, it's hard to tell if my data cleaning is proving useful and matching to the correct class. If the results of the sets are off by much, I'll do some tweaks and try your code out.

Thanks for all the help.
