944,048 Members | Top Members by Rank

Ad:
  • Python Discussion Thread
  • Unsolved
  • Views: 2352
  • Python RSS
Sep 12th, 2007
0

Array Distances

Expand Post »
I have another problem. Same dataset as before. I have a huge collection of floats. I have a multi-dimensional list. About 37 per list, and over 60,000 lists. So something like: [[16166.00, 4123.15, ...], [761.1364, ...]].

I'm trying to find the Euclidean distance between one list with each of the lists in another set. What I have is below. I guess it works. But it's too bloody slow. Takes a couple seconds to go through the nested loop once, and since it has to go through it over 60,000 times, we're talking over a day to complete. Plus, I need to find the 3 shortest for each, which will take just as much to process.

(though they're labeled *matrix.. it's a standard list of lists.
python Syntax (Toggle Plain Text)
  1. for i in datamatrix:
  2. for j in testmatrix:
  3. temp = (array(i, float)-array(j, float))**2
  4. sum = 0.
  5. [sum + n for n in temp]
  6. distances.append(sqrt(sum))
  7. alldistances.append(distances)

Is there some library that will compute this quickly or some better means of writing this in python?
Similar Threads
Reputation Points: 10
Solved Threads: 0
Newbie Poster
grahhh is offline Offline
5 posts
since Sep 2007
Sep 13th, 2007
0

Re: Array Distances

There might be a better algorithm. Why do you need to precompute all of those distances? Could you use lazy evaluation? Or, could you compute distance *squared* for your purpose, which eliminates the expensive sqrt call?

Also, I may be being dense, but what does this line do:

[sum + n for n in temp]

?

Jeff
Reputation Points: 92
Solved Threads: 156
Practically a Master Poster
jrcagle is offline Offline
608 posts
since Jul 2006
Sep 14th, 2007
0

Re: Array Distances

Ah, I was messing around with the code for quite a bit to see if I could get it to run any faster, and that one line is meaningless I think. Just how it was left when I got frustrated with it.

What I've been attempting to do is work on a k-nearest neighbor algorithm. Basically, I have a huge chunk of data that I know the classes (or labels - whatever you want to call it) for. Then I have another huge chunk of data that I don't know the classes for. The idea being that you can assign the unknown classes by finding the k nearest points from the set you do know the classes for.

I've ditched my original code. It was slow and not terribly neat. I patched together something that fully works from some 'biopython' libraries and Numeric (I know, I know, Numeric is old and outdated, but this is what it called for and I didn't want to spend even more time seeing if I could get numpy or scipy or whatever doing the same thing) hoping it would clear up the speed issues and, sadly, it didn't. It still takes a couple seconds to determine each vector's class (and I have thousands for it to go through).
Reputation Points: 10
Solved Threads: 0
Newbie Poster
grahhh is offline Offline
5 posts
since Sep 2007
Sep 14th, 2007
0

Re: Array Distances

Well, several things jump out.

First, just so I understand, you are taking a list of 37-dimensional vectors in the datamatrix and finding the three that are closest to something in the testmatrix?

(1) ditch the sqrt() call. If you need the Euclidean metric, then distance squared works just as well as distance for comparison purposes.

(2) The line

Python Syntax (Toggle Plain Text)
  1. temp = (array(i, float)-array(j, float))**2

is confusing. I read it as making an array of the vectors i and j, subtracting those arrays, and then squaring the resulting array. Is that right? If so, then it's not computing the Euclidean distance.

What you would need is something like

Python Syntax (Toggle Plain Text)
  1. tmp = 0
  2. for x in i:
  3. for y in j:
  4. tmp += (x-y)**2

(3) If there's some way to order the vectors in the test matrix, you might be able to use lazy evaluation to quit the loop once the distance gets too big.

Hope it helps. I have a couple more thoughts, but they need developing first.

Jeff
Reputation Points: 92
Solved Threads: 156
Practically a Master Poster
jrcagle is offline Offline
608 posts
since Jul 2006
Sep 15th, 2007
0

Re: Array Distances

I can't edit the post above, so here's a new post.

If you can download numpy, then this appears to be useful. The time for finding a single match among a data set of 60000 came down to 1 second:

Python Syntax (Toggle Plain Text)
  1. import numpy
  2. import time
  3.  
  4. def findclosest(a,X):
  5. """finds the vector in X most nearly parallel to a.
  6. This is the computationally expensive area, so optimize here."""
  7.  
  8. if not len(X):
  9. return None
  10. current = X[0]
  11. smallest = numpy.inner(a-current, a-current)
  12. for item in X:
  13. d = numpy.inner(a-item, a-item)
  14. if d < smallest:
  15. current = item
  16. smallest = d
  17. return current
  18.  
  19. def findclosesttaxicab(a,X):
  20. """finds the vector in X most nearly parallel to a using taxicab metric."""
  21.  
  22. if not len(X):
  23. return None
  24. current = X[0]
  25. smallest = numpy.abs(a-current).sum()
  26. for item in X:
  27. d = numpy.abs(a-item).sum()
  28. if d < smallest:
  29. current = item
  30. smallest = d
  31. return current
  32.  
  33. def find_first_n(data,X, n=3):
  34. """
  35. find_first_n(data, X, n=3) --> [(fit,vector, match), ...]
  36.  
  37. Finds up to first n vectors along with their matches, sorted by
  38. order of fit."""
  39.  
  40. if len(data) <= n:
  41. return data
  42. else:
  43. # A little ugly, but we can't use dictionaries, since vectors may not
  44. # be unique AND because dictionaries don't have ordering.
  45.  
  46. # create a list of three winners: [(fit,vector,match),...] ordered
  47. # by fit from smallest to largest.
  48. winners = []
  49. for i in range(n):
  50. match = findclosest(data[i],X)
  51. fit = numpy.inner(data[i],match)
  52. for x in range(len(winners)):
  53. if fit < winners[x][0]:
  54. winners = (winners[:x] + [(fit,data[i],match)] + winners[x:])
  55. break
  56. else:
  57. winners.append((fit,data[i],match))
  58.  
  59. return winners
  60.  
  61. def find_best_n(data, X, n=3):
  62. """
  63. find_best_three(data, X, n=3) --> [(fit,vector, match), ...]
  64.  
  65. Finds best n vectors along with their matches, sorted by
  66. order of fit."""
  67.  
  68. if len(data) <= n:
  69. return data
  70. # Prime the pump
  71. winners = find_first_n(data,X,n)
  72.  
  73. # We've already examined the first n...
  74. for i in range(n,len(data)):
  75. match = findclosest(data[i],X)
  76. fit = numpy.inner(data[i],match)
  77. for x in range(n):
  78. if fit < winners[x][0]:
  79. winners = (winners[:x] + [(fit,data[i],match)] + winners[x:])[:n]
  80. break
  81. return winners
  82.  
  83. X = numpy.random.rand(60000,37)
  84. data = numpy.random.rand(200,37)
  85.  
  86. # optimization metrics
  87. start = time.time()
  88. for i in data:
  89. findclosest(i,X)
  90. stop = time.time()
  91. print "findclosest: Dt = %0.2f for %d matches in array of size %d" %\
  92. (stop - start,len(data),len(X))
  93.  
  94. start = time.time()
  95. for i in data:
  96. findclosesttaxicab(i,X)
  97. stop = time.time()
  98. print "findclosesttaxicab: Dt = %0.2f for %d matches in array of size %d" %\
  99. (stop - start,len(data),len(X))
  100. # END optimization metrics
  101.  
  102. #
  103. #best= find_best_n(data,X,3)
  104. #print best, len(best)
  105.  
  106. >>>
  107. findclosest: Dt = 198.63 for 200 matches in array of size 60000
  108. findclosesttaxicab: Dt = 275.77 for 200 matches in array of size 60000 # surprise! I thought it would be faster
  109. >>>

HtH,
Jeff
Reputation Points: 92
Solved Threads: 156
Practically a Master Poster
jrcagle is offline Offline
608 posts
since Jul 2006
Sep 15th, 2007
0

Re: Array Distances

Oh, wow. So this takes one second to determine the class for each row in the test set? Then it would take me around 18 hours instead of a day and a half to get my results, which is definitely an improvement. I ran what I had on my test sets (I have three) last night and the two smaller ones have finished. The big one is still cranking away.

The bigger issue for me is the data normalisation process; the data is a mess, and with it taking so long to process, it's hard to tell if my data cleaning is proving useful and matching to the correct class. If the results of the sets are off by much, I'll do some tweaks and try your code out.

Thanks for all the help.
Reputation Points: 10
Solved Threads: 0
Newbie Poster
grahhh is offline Offline
5 posts
since Sep 2007

This thread is more than three months old

No one has posted to this discussion for at least three months. Please let old threads die and do not reply to them unless you feel you have something new and valuable to contribute that absolutely must be added to make the discussion complete. Otherwise, please start a new thread in this forum instead.
Message:
Previous Thread in Python Forum Timeline: Generating HTML in Python
Next Thread in Python Forum Timeline: Why is this not working?





About Us | Contact Us | Advertise | Acceptable Use Policy
Forum Index | Build Custom RSS Feed


Follow us on Twitter


© 2011 DaniWeb® LLC