| | |
Array Distances
![]() |
•
•
Join Date: Sep 2007
Posts: 5
Reputation:
Solved Threads: 0
I have another problem. Same dataset as before. I have a huge collection of floats. I have a multi-dimensional list. About 37 per list, and over 60,000 lists. So something like: [[16166.00, 4123.15, ...], [761.1364, ...]].
I'm trying to find the Euclidean distance between one list with each of the lists in another set. What I have is below. I guess it works. But it's too bloody slow. Takes a couple seconds to go through the nested loop once, and since it has to go through it over 60,000 times, we're talking over a day to complete. Plus, I need to find the 3 shortest for each, which will take just as much to process.
(though they're labeled *matrix.. it's a standard list of lists.
Is there some library that will compute this quickly or some better means of writing this in python?
I'm trying to find the Euclidean distance between one list with each of the lists in another set. What I have is below. I guess it works. But it's too bloody slow. Takes a couple seconds to go through the nested loop once, and since it has to go through it over 60,000 times, we're talking over a day to complete. Plus, I need to find the 3 shortest for each, which will take just as much to process.
(though they're labeled *matrix.. it's a standard list of lists.
python Syntax (Toggle Plain Text)
for i in datamatrix: for j in testmatrix: temp = (array(i, float)-array(j, float))**2 sum = 0. [sum + n for n in temp] distances.append(sqrt(sum)) alldistances.append(distances)
Is there some library that will compute this quickly or some better means of writing this in python?
•
•
Join Date: Jul 2006
Posts: 608
Reputation:
Solved Threads: 150
There might be a better algorithm. Why do you need to precompute all of those distances? Could you use lazy evaluation? Or, could you compute distance *squared* for your purpose, which eliminates the expensive sqrt call?
Also, I may be being dense, but what does this line do:
[sum + n for n in temp]
?
Jeff
Also, I may be being dense, but what does this line do:
[sum + n for n in temp]
?
Jeff
•
•
Join Date: Sep 2007
Posts: 5
Reputation:
Solved Threads: 0
Ah, I was messing around with the code for quite a bit to see if I could get it to run any faster, and that one line is meaningless I think. Just how it was left when I got frustrated with it.
What I've been attempting to do is work on a k-nearest neighbor algorithm. Basically, I have a huge chunk of data that I know the classes (or labels - whatever you want to call it) for. Then I have another huge chunk of data that I don't know the classes for. The idea being that you can assign the unknown classes by finding the k nearest points from the set you do know the classes for.
I've ditched my original code. It was slow and not terribly neat. I patched together something that fully works from some 'biopython' libraries and Numeric (I know, I know, Numeric is old and outdated, but this is what it called for and I didn't want to spend even more time seeing if I could get numpy or scipy or whatever doing the same thing) hoping it would clear up the speed issues and, sadly, it didn't. It still takes a couple seconds to determine each vector's class (and I have thousands for it to go through).
What I've been attempting to do is work on a k-nearest neighbor algorithm. Basically, I have a huge chunk of data that I know the classes (or labels - whatever you want to call it) for. Then I have another huge chunk of data that I don't know the classes for. The idea being that you can assign the unknown classes by finding the k nearest points from the set you do know the classes for.
I've ditched my original code. It was slow and not terribly neat. I patched together something that fully works from some 'biopython' libraries and Numeric (I know, I know, Numeric is old and outdated, but this is what it called for and I didn't want to spend even more time seeing if I could get numpy or scipy or whatever doing the same thing) hoping it would clear up the speed issues and, sadly, it didn't. It still takes a couple seconds to determine each vector's class (and I have thousands for it to go through).
•
•
Join Date: Jul 2006
Posts: 608
Reputation:
Solved Threads: 150
Well, several things jump out.
First, just so I understand, you are taking a list of 37-dimensional vectors in the datamatrix and finding the three that are closest to something in the testmatrix?
(1) ditch the sqrt() call. If you need the Euclidean metric, then distance squared works just as well as distance for comparison purposes.
(2) The line
is confusing. I read it as making an array of the vectors i and j, subtracting those arrays, and then squaring the resulting array. Is that right? If so, then it's not computing the Euclidean distance.
What you would need is something like
(3) If there's some way to order the vectors in the test matrix, you might be able to use lazy evaluation to quit the loop once the distance gets too big.
Hope it helps. I have a couple more thoughts, but they need developing first.
Jeff
First, just so I understand, you are taking a list of 37-dimensional vectors in the datamatrix and finding the three that are closest to something in the testmatrix?
(1) ditch the sqrt() call. If you need the Euclidean metric, then distance squared works just as well as distance for comparison purposes.
(2) The line
Python Syntax (Toggle Plain Text)
temp = (array(i, float)-array(j, float))**2
is confusing. I read it as making an array of the vectors i and j, subtracting those arrays, and then squaring the resulting array. Is that right? If so, then it's not computing the Euclidean distance.
What you would need is something like
Python Syntax (Toggle Plain Text)
tmp = 0 for x in i: for y in j: tmp += (x-y)**2
(3) If there's some way to order the vectors in the test matrix, you might be able to use lazy evaluation to quit the loop once the distance gets too big.
Hope it helps. I have a couple more thoughts, but they need developing first.
Jeff
•
•
Join Date: Jul 2006
Posts: 608
Reputation:
Solved Threads: 150
I can't edit the post above, so here's a new post.
If you can download numpy, then this appears to be useful. The time for finding a single match among a data set of 60000 came down to 1 second:
HtH,
Jeff
If you can download numpy, then this appears to be useful. The time for finding a single match among a data set of 60000 came down to 1 second:
Python Syntax (Toggle Plain Text)
import numpy import time def findclosest(a,X): """finds the vector in X most nearly parallel to a. This is the computationally expensive area, so optimize here.""" if not len(X): return None current = X[0] smallest = numpy.inner(a-current, a-current) for item in X: d = numpy.inner(a-item, a-item) if d < smallest: current = item smallest = d return current def findclosesttaxicab(a,X): """finds the vector in X most nearly parallel to a using taxicab metric.""" if not len(X): return None current = X[0] smallest = numpy.abs(a-current).sum() for item in X: d = numpy.abs(a-item).sum() if d < smallest: current = item smallest = d return current def find_first_n(data,X, n=3): """ find_first_n(data, X, n=3) --> [(fit,vector, match), ...] Finds up to first n vectors along with their matches, sorted by order of fit.""" if len(data) <= n: return data else: # A little ugly, but we can't use dictionaries, since vectors may not # be unique AND because dictionaries don't have ordering. # create a list of three winners: [(fit,vector,match),...] ordered # by fit from smallest to largest. winners = [] for i in range(n): match = findclosest(data[i],X) fit = numpy.inner(data[i],match) for x in range(len(winners)): if fit < winners[x][0]: winners = (winners[:x] + [(fit,data[i],match)] + winners[x:]) break else: winners.append((fit,data[i],match)) return winners def find_best_n(data, X, n=3): """ find_best_three(data, X, n=3) --> [(fit,vector, match), ...] Finds best n vectors along with their matches, sorted by order of fit.""" if len(data) <= n: return data # Prime the pump winners = find_first_n(data,X,n) # We've already examined the first n... for i in range(n,len(data)): match = findclosest(data[i],X) fit = numpy.inner(data[i],match) for x in range(n): if fit < winners[x][0]: winners = (winners[:x] + [(fit,data[i],match)] + winners[x:])[:n] break return winners X = numpy.random.rand(60000,37) data = numpy.random.rand(200,37) # optimization metrics start = time.time() for i in data: findclosest(i,X) stop = time.time() print "findclosest: Dt = %0.2f for %d matches in array of size %d" %\ (stop - start,len(data),len(X)) start = time.time() for i in data: findclosesttaxicab(i,X) stop = time.time() print "findclosesttaxicab: Dt = %0.2f for %d matches in array of size %d" %\ (stop - start,len(data),len(X)) # END optimization metrics # #best= find_best_n(data,X,3) #print best, len(best) >>> findclosest: Dt = 198.63 for 200 matches in array of size 60000 findclosesttaxicab: Dt = 275.77 for 200 matches in array of size 60000 # surprise! I thought it would be faster >>>
HtH,
Jeff
•
•
Join Date: Sep 2007
Posts: 5
Reputation:
Solved Threads: 0
Oh, wow. So this takes one second to determine the class for each row in the test set? Then it would take me around 18 hours instead of a day and a half to get my results, which is definitely an improvement. I ran what I had on my test sets (I have three) last night and the two smaller ones have finished. The big one is still cranking away.
The bigger issue for me is the data normalisation process; the data is a mess, and with it taking so long to process, it's hard to tell if my data cleaning is proving useful and matching to the correct class. If the results of the sets are off by much, I'll do some tweaks and try your code out.
Thanks for all the help.
The bigger issue for me is the data normalisation process; the data is a mess, and with it taking so long to process, it's hard to tell if my data cleaning is proving useful and matching to the correct class. If the results of the sets are off by much, I'll do some tweaks and try your code out.
Thanks for all the help.
![]() |
Similar Threads
- Programming student find distances btwn cities using array of structs (C++)
- Dijkstra Algorithm (Computer Science)
- URGENT - Reading from txt file into a 2 dimension array (Java)
- Array limit (C)
- struct dynamic 2d array alloc (C)
- string to integer array transformation (C)
- Array (Visual Basic 4 / 5 / 6)
Other Threads in the Python Forum
- Previous Thread: newb: Print ids, you get 1L, what is L?
- Next Thread: Why is this not working?
| Thread Tools | Search this Thread |
accessdenied advanced apache application argv array beginner book change command converter countpasswordentry csv curved dan08 def dictionary dynamic edit enter event examples file float format function google gui homework import inches input jaunty java keyboard lapse library line lines linux list lists loop microphone mouse movingimageswithpygame mysqlquery newb number numbers numeric obexftp output parameters parsing path phonebook plugin port prime programming projects py2exe pygame pygtk pyopengl python random recursion redirect remote return reverse scrolledtext session simple skinning software sprite statictext string strings syntax terminal text threading time tlapse trick tuple tutorial ubuntu unicode unit urllib urllib2 variable voip wordgame wxpython





