Hi,
I've got two tables in a database - one a list of schools (around 50 entries), and one a list of pupils (around 17,000). Both are spatially-attributed using PostGIS. Basically I need to allocate each pupil to their nearest school. Currently I'm looping through the schools repeatedly, using the ST_DISTANCE commands and 3(!) different SQL commands to find which pupils are nearest the school at the time. Once a pupil is allocated a school, they aren't selected in the next loop.
As a nasty hack this works fine but it takes a long time (> 10 mins) to run the code through enough repetitions. For my project specification I really need this to be much, much faster (< 5 mins hopefully), which might involve multithreading.
Regardless, this code doesn't appear very efficient. Can anybody suggest some improvements or better SQL queries?
Thanks very much! Nick Campbell

Table structures
SCHOOLS (id, name, urn [unique identifier], the_geom, capacity, cap_remain [remaining capacity after each loop])
RANDOM_CHILDREN (id, pref_school)

try:
   conn = psycopg2.connect(......);
except:
   print("Unable to connect");
   sys.exit()

cur = conn.cursor()
cur2 = conn.cursor()
cur3 = conn.cursor()

scount = cur.execute("""SELECT COUNT(gid) FROM schools""")
scount = cur.fetchone()[0] ## Number of schools
pcount = cur.execute("""SELECT COUNT(id) FROM random_children""")
pcount = cur.fetchone()[0] ##Number of pupils
print scount,"schools and",pcount,"pupils."

i=0
while (i < 294): #Arbitrary number
   schools = cur.execute("""SELECT name, the_geom, urn, capacity, cap_remain FROM schools WHERE cap_remain > 0 ORDER BY capacity ASC""")
   rows = cur.fetchall()

   print "Working school by school, allocating one at a time.."
   for item in rows:
      print "  School name: ",item[0]," places:",item[4],"/",item[3]
      loc = item[1]
      urny = item[2]
      nearest_pupil = cur2.execute("""SELECT id, ST_DISTANCE(the_geom, '%s') AS dist FROM random_children WHERE pref_school IS NULL ORDER BY dist ASC LIMIT 1""" % loc)
      kids = cur2.fetchall()
      bestkid = kids[0][0]
      # Put the URN of the nearest school into the individual pupil
      # Then....take one off the capacity of the school
      # After one iteration, each school will have lost a "place"
      # Keep going until all school places are filled
      query = {'urn':urny, 'id':bestkid }
      qry = "UPDATE random_children SET pref_school=%(urn)s WHERE id=%(id)s" % query # Gives child a school
      #print "Query", qry
      cur3.execute(qry)
      qry = "UPDATE schools SET cap_remain = cap_remain-1 WHERE urn=%(urn)s" % query # Takes a place from school
      #print "Query 2", qry
      cur3.execute(qry)
      conn.commit() # Must commit otherwise it doesn't run the command....
      print "  ....success"   
   i=i+1

> As a nasty hack this works fine but it takes a long time (> 10 mins) to run the code through enough repetitions.
> For my project specification I really need this to be much, much faster (< 5 mins hopefully)
Why does a program which runs only once per year need optimising from 10 minutes to 5 minutes?

It's not like you're going to be mass reassigning schools on a weekly basis or anything.

> ORDER BY capacity ASC
Does this actually do anything useful?
I see why you select schools with capacity, but why then go to the trouble of ordering it as well?

> kids = cur2.fetchall()
> bestkid = kids[0][0]
I'm guessing the LIMIT 1 in the previous line means fetchall only fetches 1 row, and not potentially up to 17000 rows.
Since you're only ever interested in the first row anyway, what about fetchone() ?
Likewise, isn't all this redundant anyway? What is in the nearest_pupil result on the previous line anyway (which goes unused at present).

You could also consider loading all the schools and children in memory and modify the database only when you know in which school each child goes (this can be done if you're able to compute the distance between a child and a school without the ST_DISTANCE function).
Also, try to use the cProfile module to tell you where time is spent.

You could also consider loading all the schools and children in memory and modify the database only when you know in which school each child goes (this can be done if you're able to compute the distance between a child and a school without the ST_DISTANCE function).
Also, try to use the cProfile module to tell you where time is spent.

In my opinion, the only correct way to do this is to load the records in memory and process them rather than this UPDATE and SELECT through each iteration of your two loops!

@ Salem:
Thanks for your feedback. Those sections of code were a little sloppy!

Why does a program which runs only once per year need optimising from 10 minutes to 5 minutes?
It's not like you're going to be mass reassigning schools on a weekly basis or anything.

The reason I might be running this more regularly is because the pupil data is actually random data from elsewhere. I'm currently loading the different data sets on-the-fly into the db, then some random data for the area, then running this function. It's complicated but essentially my project spec needs this to be possible regularly.

> ORDER BY capacity ASC
Does this actually do anything useful?
I see why you select schools with capacity, but why then go to the trouble of ordering it as well?

By working this way the smallest schools get "filled" first - one of my working assumptions is that the bigger capacity schools will draw pupils from further afield.

@Gribouillis, @DdoubleD:
From load into memory can I assume you mean two giant multidimensional arrays? Thinking about it....I wouldn't have to keep reloading the data and only commit when I've worked it out. Thanks!

@Gribouillis, @DdoubleD:
From load into memory can I assume you mean two giant multidimensional arrays? Thinking about it....I wouldn't have to keep reloading the data and only commit when I've worked it out. Thanks!

Wish I knew Python because I'd love to help you out with this. I definitely think that working with the record sets in memory is the way to go though if you can pull it off; then do your final update(s) to DB. Best of luck!

Doing this in memory is definitely the way to go. It could be 1000 times faster. Putting 17,000 records into memory is not a big deal in today's world. If each of the records is 100 bytes long, then you have 1,700,000 bytes. If your computer has 2 gig of memory (to make the calcs easy) then you are using 1/10 of one percent.

Some ideas:
1. Set a minimum distance that is an automatic for going to that school. If a child lives a block away then they will obviously go to that school and you don't have to test the rest of the schools.

2. Assign a small, simple grid system. If a child lives north of Main St. and west of First Ave. then you can eliminate half of the schools as being too far away, and half of the processing time. As long as the number of schools eliminated is greater than the number of distances calculated to get the right grid, you come out ahead. If you can eliminate one school per child, you have eliminated 17,000 calculations, or 2% of 50 schools, whichever way you want to look at it.

This question has already been answered. Start a new discussion instead.