| | |
Array Distances
Please support our Python advertiser: Programming Forums - DaniWeb Sister Site
![]() |
•
•
Join Date: Sep 2007
Posts: 5
Reputation:
Solved Threads: 0
I have another problem. Same dataset as before. I have a huge collection of floats. I have a multi-dimensional list. About 37 per list, and over 60,000 lists. So something like: [[16166.00, 4123.15, ...], [761.1364, ...]].
I'm trying to find the Euclidean distance between one list with each of the lists in another set. What I have is below. I guess it works. But it's too bloody slow. Takes a couple seconds to go through the nested loop once, and since it has to go through it over 60,000 times, we're talking over a day to complete. Plus, I need to find the 3 shortest for each, which will take just as much to process.
(though they're labeled *matrix.. it's a standard list of lists.
Is there some library that will compute this quickly or some better means of writing this in python?
I'm trying to find the Euclidean distance between one list with each of the lists in another set. What I have is below. I guess it works. But it's too bloody slow. Takes a couple seconds to go through the nested loop once, and since it has to go through it over 60,000 times, we're talking over a day to complete. Plus, I need to find the 3 shortest for each, which will take just as much to process.
(though they're labeled *matrix.. it's a standard list of lists.
python Syntax (Toggle Plain Text)
for i in datamatrix: for j in testmatrix: temp = (array(i, float)-array(j, float))**2 sum = 0. [sum + n for n in temp] distances.append(sqrt(sum)) alldistances.append(distances)
Is there some library that will compute this quickly or some better means of writing this in python?
•
•
Join Date: Jul 2006
Posts: 608
Reputation:
Solved Threads: 150
There might be a better algorithm. Why do you need to precompute all of those distances? Could you use lazy evaluation? Or, could you compute distance *squared* for your purpose, which eliminates the expensive sqrt call?
Also, I may be being dense, but what does this line do:
[sum + n for n in temp]
?
Jeff
Also, I may be being dense, but what does this line do:
[sum + n for n in temp]
?
Jeff
•
•
Join Date: Sep 2007
Posts: 5
Reputation:
Solved Threads: 0
Ah, I was messing around with the code for quite a bit to see if I could get it to run any faster, and that one line is meaningless I think. Just how it was left when I got frustrated with it.
What I've been attempting to do is work on a k-nearest neighbor algorithm. Basically, I have a huge chunk of data that I know the classes (or labels - whatever you want to call it) for. Then I have another huge chunk of data that I don't know the classes for. The idea being that you can assign the unknown classes by finding the k nearest points from the set you do know the classes for.
I've ditched my original code. It was slow and not terribly neat. I patched together something that fully works from some 'biopython' libraries and Numeric (I know, I know, Numeric is old and outdated, but this is what it called for and I didn't want to spend even more time seeing if I could get numpy or scipy or whatever doing the same thing) hoping it would clear up the speed issues and, sadly, it didn't. It still takes a couple seconds to determine each vector's class (and I have thousands for it to go through).
What I've been attempting to do is work on a k-nearest neighbor algorithm. Basically, I have a huge chunk of data that I know the classes (or labels - whatever you want to call it) for. Then I have another huge chunk of data that I don't know the classes for. The idea being that you can assign the unknown classes by finding the k nearest points from the set you do know the classes for.
I've ditched my original code. It was slow and not terribly neat. I patched together something that fully works from some 'biopython' libraries and Numeric (I know, I know, Numeric is old and outdated, but this is what it called for and I didn't want to spend even more time seeing if I could get numpy or scipy or whatever doing the same thing) hoping it would clear up the speed issues and, sadly, it didn't. It still takes a couple seconds to determine each vector's class (and I have thousands for it to go through).
•
•
Join Date: Jul 2006
Posts: 608
Reputation:
Solved Threads: 150
Well, several things jump out.
First, just so I understand, you are taking a list of 37-dimensional vectors in the datamatrix and finding the three that are closest to something in the testmatrix?
(1) ditch the sqrt() call. If you need the Euclidean metric, then distance squared works just as well as distance for comparison purposes.
(2) The line
is confusing. I read it as making an array of the vectors i and j, subtracting those arrays, and then squaring the resulting array. Is that right? If so, then it's not computing the Euclidean distance.
What you would need is something like
(3) If there's some way to order the vectors in the test matrix, you might be able to use lazy evaluation to quit the loop once the distance gets too big.
Hope it helps. I have a couple more thoughts, but they need developing first.
Jeff
First, just so I understand, you are taking a list of 37-dimensional vectors in the datamatrix and finding the three that are closest to something in the testmatrix?
(1) ditch the sqrt() call. If you need the Euclidean metric, then distance squared works just as well as distance for comparison purposes.
(2) The line
Python Syntax (Toggle Plain Text)
temp = (array(i, float)-array(j, float))**2
is confusing. I read it as making an array of the vectors i and j, subtracting those arrays, and then squaring the resulting array. Is that right? If so, then it's not computing the Euclidean distance.
What you would need is something like
Python Syntax (Toggle Plain Text)
tmp = 0 for x in i: for y in j: tmp += (x-y)**2
(3) If there's some way to order the vectors in the test matrix, you might be able to use lazy evaluation to quit the loop once the distance gets too big.
Hope it helps. I have a couple more thoughts, but they need developing first.
Jeff
•
•
Join Date: Jul 2006
Posts: 608
Reputation:
Solved Threads: 150
I can't edit the post above, so here's a new post.
If you can download numpy, then this appears to be useful. The time for finding a single match among a data set of 60000 came down to 1 second:
HtH,
Jeff
If you can download numpy, then this appears to be useful. The time for finding a single match among a data set of 60000 came down to 1 second:
Python Syntax (Toggle Plain Text)
import numpy import time def findclosest(a,X): """finds the vector in X most nearly parallel to a. This is the computationally expensive area, so optimize here.""" if not len(X): return None current = X[0] smallest = numpy.inner(a-current, a-current) for item in X: d = numpy.inner(a-item, a-item) if d < smallest: current = item smallest = d return current def findclosesttaxicab(a,X): """finds the vector in X most nearly parallel to a using taxicab metric.""" if not len(X): return None current = X[0] smallest = numpy.abs(a-current).sum() for item in X: d = numpy.abs(a-item).sum() if d < smallest: current = item smallest = d return current def find_first_n(data,X, n=3): """ find_first_n(data, X, n=3) --> [(fit,vector, match), ...] Finds up to first n vectors along with their matches, sorted by order of fit.""" if len(data) <= n: return data else: # A little ugly, but we can't use dictionaries, since vectors may not # be unique AND because dictionaries don't have ordering. # create a list of three winners: [(fit,vector,match),...] ordered # by fit from smallest to largest. winners = [] for i in range(n): match = findclosest(data[i],X) fit = numpy.inner(data[i],match) for x in range(len(winners)): if fit < winners[x][0]: winners = (winners[:x] + [(fit,data[i],match)] + winners[x:]) break else: winners.append((fit,data[i],match)) return winners def find_best_n(data, X, n=3): """ find_best_three(data, X, n=3) --> [(fit,vector, match), ...] Finds best n vectors along with their matches, sorted by order of fit.""" if len(data) <= n: return data # Prime the pump winners = find_first_n(data,X,n) # We've already examined the first n... for i in range(n,len(data)): match = findclosest(data[i],X) fit = numpy.inner(data[i],match) for x in range(n): if fit < winners[x][0]: winners = (winners[:x] + [(fit,data[i],match)] + winners[x:])[:n] break return winners X = numpy.random.rand(60000,37) data = numpy.random.rand(200,37) # optimization metrics start = time.time() for i in data: findclosest(i,X) stop = time.time() print "findclosest: Dt = %0.2f for %d matches in array of size %d" %\ (stop - start,len(data),len(X)) start = time.time() for i in data: findclosesttaxicab(i,X) stop = time.time() print "findclosesttaxicab: Dt = %0.2f for %d matches in array of size %d" %\ (stop - start,len(data),len(X)) # END optimization metrics # #best= find_best_n(data,X,3) #print best, len(best) >>> findclosest: Dt = 198.63 for 200 matches in array of size 60000 findclosesttaxicab: Dt = 275.77 for 200 matches in array of size 60000 # surprise! I thought it would be faster >>>
HtH,
Jeff
•
•
Join Date: Sep 2007
Posts: 5
Reputation:
Solved Threads: 0
Oh, wow. So this takes one second to determine the class for each row in the test set? Then it would take me around 18 hours instead of a day and a half to get my results, which is definitely an improvement. I ran what I had on my test sets (I have three) last night and the two smaller ones have finished. The big one is still cranking away.
The bigger issue for me is the data normalisation process; the data is a mess, and with it taking so long to process, it's hard to tell if my data cleaning is proving useful and matching to the correct class. If the results of the sets are off by much, I'll do some tweaks and try your code out.
Thanks for all the help.
The bigger issue for me is the data normalisation process; the data is a mess, and with it taking so long to process, it's hard to tell if my data cleaning is proving useful and matching to the correct class. If the results of the sets are off by much, I'll do some tweaks and try your code out.
Thanks for all the help.
![]() |
Similar Threads
- Programming student find distances btwn cities using array of structs (C++)
- Dijkstra Algorithm (Computer Science)
- URGENT - Reading from txt file into a 2 dimension array (Java)
- Array limit (C)
- struct dynamic 2d array alloc (C)
- string to integer array transformation (C)
- Array (Visual Basic 4 / 5 / 6)
Other Threads in the Python Forum
- Previous Thread: newb: Print ids, you get 1L, what is L?
- Next Thread: Why is this not working?
| Thread Tools | Search this Thread |
Tag cloud for Python
abrupt ansi anti approximation assignment avogadro backend basic beginner binary bluetooth calculator character code customdialog decimals dictionaries dictionary drive dynamic examples excel exe file float format ftp function gnu graphics gui heads homework http ideas import input java launcher leftmouse line linux list lists loop module mouse number numbers output parsing path pointer port prime program programming progressbar projects py2exe pygame pyqt python random recursion recursive refresh schedule scrolledtext sqlite ssh statistics stdout string strings sudokusolver sum table terminal text thread threading time tkinter tlapse tricks tuple tutorial twoup ubuntu unicode update urllib urllib2 variable wikipedia windows write wxpython xlib





