I'm new to parallel programming and was hoping someone could help me parallelize a bit of code. The code opens large h5 files (using h5py) and pulls some data out. The data is then reformatted and printed to another file. I would like to split the for loop over the data across several processors. The files are very large, so this should help with both memory use and speed. Again, I'm very new at this, so if anyone has ideas for a completely different approach, I'm all ears. Here is the code as it stands serially:
    import h5py
    import glob
    import sys
    import os
    import numpy

    ### This takes the simulation name as the first argument and the day to extract as the second
    ### All h5 files must be in the same folder, with no other h5 files (including surface)
    sim_name = sys.argv[1]
    day_to_extract = sys.argv[2]
    total_nodes = int(sys.argv[3])
    total_elements = sys.argv[4]

    files = glob.glob("*.h5")

    # One row per node: column 0 is pressure head, column 1 is temperature
    a = numpy.empty(total_nodes)
    b = numpy.empty(total_nodes)
    hot_list = numpy.column_stack((a, b))

    for file in files:
        # Get all of the data from h5 file
        f1 = h5py.File(file, 'r')
        data = f1['Data']
        temp = data['Temperature']
        # Getting temperature data for the day needed
        temp_day = temp[day_to_extract]
        press_head = data['Pressure_Head']
        # Getting pressure head data for the day needed
        press_day = press_head[day_to_extract]
        geo = f1['Geometry']
        # Obtaining the original node numbers
        # This keeps temp and press_head indexed correctly
        nodes = geo['orig_node_number']
        orig_nodes = nodes['0']

        # Put data into list
        for i in range(len(orig_nodes)):
            orig_node_number = int(orig_nodes[i])
            hot_list[orig_node_number] = [press_day[i], temp_day[i]]

        sim_name, piece = file.split('p')
        part, ext = piece.split('.')
        print("h5 file number " + part + " read")
        f1.close()
The last part, where the data is put into the list, is the part that is RAM and CPU intensive. I was hoping to break the large temperature, pressure head, and original node number arrays into smaller chunks and add them to the hot_list array simultaneously. I'm just not sure how best to go about this. I figure it would use scatter, but I can't work out how to scatter the contents of a file as evenly sized lists or numpy arrays across the processors and then gather them back. I know this is asking a lot, but if anyone could at least point me in the right direction, I would be happy to do the heavy lifting. Thanks for your time!
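For reference, here is a minimal sketch of the chunked fill described above, run serially on made-up arrays rather than real h5 data. The function name `fill_hot_list` and the `n_chunks` parameter are hypothetical, and the sketch assumes the node numbers, pressure head, and temperature for one day are 1-D arrays of equal length. Each chunk is roughly what one process would receive from a scatter. The per-node Python loop is replaced by one NumPy indexed assignment per chunk.

```python
import numpy as np


def fill_hot_list(hot_list, orig_nodes, press_day, temp_day, n_chunks=4):
    """Fill hot_list in place, one chunk of nodes at a time (hypothetical helper)."""
    # Convert once to integer indices; mirrors int(orig_nodes[i]) in the loop.
    node_idx = np.asarray(orig_nodes).astype(int)
    press = np.asarray(press_day)
    temp = np.asarray(temp_day)

    # array_split tolerates lengths that do not divide evenly by n_chunks.
    positions = np.array_split(np.arange(len(node_idx)), n_chunks)
    for pos in positions:
        # One vectorized assignment per chunk instead of one per node.
        hot_list[node_idx[pos], 0] = press[pos]
        hot_list[node_idx[pos], 1] = temp[pos]
    return hot_list


if __name__ == "__main__":
    # Synthetic stand-in for one h5 file: 5 nodes stored out of order.
    orig_nodes = np.array([3.0, 0.0, 4.0, 1.0, 2.0])
    press_day = np.array([30.0, 0.0, 40.0, 10.0, 20.0])
    temp_day = np.array([3.5, 0.5, 4.5, 1.5, 2.5])
    hot_list = np.empty((5, 2))
    fill_hot_list(hot_list, orig_nodes, press_day, temp_day, n_chunks=2)
    print(hot_list)
```

Since every chunk writes to different rows of hot_list, the chunks are independent of each other, which is what would make handing them to separate processes possible. Whether that pays off compared with the vectorized assignment alone would need to be measured on the real files.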