The question is what do you mean by 'well distributed' ? Consider a collection of 1000 real numbers. When will you say that they are well distributed on the real line ?
Gribouillis
Posting Maven
2,786 posts since Jul 2008
Reputation Points: 1,044
Solved Threads: 691
I have an idea: suppose that you have N points and that you have already chosen k points P0, P1, ... P(k-1). For each of the remaining points, say Q, you define
f_k(Q) = min(dist(Q, Pi) for 0<= i < k)
Then you choose Pk to be the Q with the maximum f_k(Q). For other Qs, you can then compute by induction
f_(k+1)(Q) = min(f_k(Q), dist(Q, Pk))
This algorithm means that you always choose the point which minimal distance to points already chosen is maximal. You can choose the first point randomly, and the algorithm stops when the desired number of points have been chosen.
Gribouillis
Posting Maven
2,786 posts since Jul 2008
Reputation Points: 1,044
Solved Threads: 691
Unfortunately I'm not a math person so I'm sorry if my replay will sound stupid.
You state that I already have a k number of known point... but how can i get this?
The problem is exactly how to select point according to their distribution. If I can select some of them it's not a huge problem to find the ones which are closer to them. For instance, a kind of buffer around their xy value so that if the selected point is 430x,320y select every point with x>420 and < 440 and y>310 and <330. Isn't it?
But the problem is exactly how to find the right points to use for the filtering.
The idea is that you choose the first point randomly and the algorithm tells you how to find the second point (it's the point farthest to the first one), then the algorithm tells you how to select the third point, etc, it's an algorithm by induction.
Gribouillis
Posting Maven
2,786 posts since Jul 2008
Reputation Points: 1,044
Solved Threads: 691
Well, here is not python code but pseudo-code, which you could translate to python
inf = float("infinity")
f = dict()
for x, y in read_points_from_file():
f[(x,y)] = inf
P = list() # initial list of chosen points
P.append(f.pop())
while len(P) < K: # K = number of desired points
for (x, y), v in f.items():
f[(x,y)] = min(v, dist(P[-1], (x, y)))
v, (x, y) = max((v, k) for k, v in f.items())
P.append((x, y))
del f[(x,y)]
Gribouillis
Posting Maven
2,786 posts since Jul 2008
Reputation Points: 1,044
Solved Threads: 691
Yes, because the 1,1 would be close to 0,0 and not be chosen.
pyTony
pyMod
5,359 posts since Apr 2010
Reputation Points: 782
Solved Threads: 852
I would change the code to
OutputPointsAG="test-filter1.txt"
TmpCopyFile="temp/xylist-copy.txt"
K=50
temp = open(TmpCopyFile)
# Create the OUTPUT file
file3=open(OutputPointsAG,'w')
inf = float("infinity")
f = dict()
#for x, y in read_points_from_file():
for line in temp:
x, y = (float(z) for z in line.split()[:2])
f[(x,y)] = inf
P = list() # initial list of chosen points
k, v = f.popitem()
P.append(k)
def dist2((a, b), (c, d)):
return (a - c)**2 + (b - d)**2
while len(P) < K: # K = number of desired points
for (x, y), v in f.items():
f[(x,y)] = min(v, dist2(P[-1], (x, y)))
v, (x, y) = max((v, k) for k, v in f.items())
P.append((x, y))
del f[(x,y)]
for x, y in P:
file3.write("{0:.5E} {1:.5E}\n".format(x, y))
It would be nice if you send your datafile as an attached file, so that we can test the code.
Gribouillis
Posting Maven
2,786 posts since Jul 2008
Reputation Points: 1,044
Solved Threads: 691
But now I get the error
.5E empty field name
Yes I edited the previous post to solve this error.
To attach a file, edit your post with the button 'use advanced editor' at the bottom of this page, then there is an icon in the advanced editor to attach a file. Then you only need to browse to the desired file and click upload in the popup window which appears. It's better to zip the file (compressed text files are much shorter than the original version).
Gribouillis
Posting Maven
2,786 posts since Jul 2008
Reputation Points: 1,044
Solved Threads: 691
No errors now but the output numbers are
5.07661E+02 3.88728E+02
1.54457E+03 9.47144E+02
1.28122E+03 8.04227E+01
9.08728E+02 7.78651E+02
9.37612E+02 3.40131E+02
1.66442E+03 2.12045E+02
I attach the data file
Aren't there 50 output points ? I chose to output them in the exponential notation. If you don't like it, use f instead of E in the format statement.
Gribouillis
Posting Maven
2,786 posts since Jul 2008
Reputation Points: 1,044
Solved Threads: 691
Oh great!! Really thank you!
If I want to keep also the other 2 values in the raw, should I have to edit line 19?
x, y = (float(z) for z in line.split()[:2])
Last question (and I will stop to profit of your kindness, I promise). Isn't there any kind of random in the script you wrote, right?
I mean, if I run the code several time I will get the same result, isn't it?
I will take some time to read (and understand) your code but it's already very easy because of your comments!
Thanks a lot,
G.
There is something random when i chose the first point with popitem(). You could replace this by choosing the first (x, y) from the first line of the file if you want something deterministic (well, I didn't consider the unlikely possibility of equal values when I select the maximum. This could also introduce some randomness).
If you want to keep the whole line, you could build a second dictionary to keep the lines
lines = dict()
f = dict()
for line in temp:
x, y = (float(z) for z in line.split()[:2])
f[(x,y)] = inf
lines[(x,y)] = line
Then at the end you would print
for x, y in P:
file3.write(lines[(x,y)])
Last thing, it would a good idea to draw scatter plots of the input and output data to see if the algorithm does something close to what you are expecting.
Gribouillis
Posting Maven
2,786 posts since Jul 2008
Reputation Points: 1,044
Solved Threads: 691