Mpi4py

MPI for Python (mpi4py) provides bindings of the Message Passing Interface (MPI) standard for the Python programming language, allowing any Python program to exploit multiple processors.

This package is constructed on top of the MPI-1/2 specifications and provides an object-oriented interface which closely follows the MPI-2 C++ bindings. It supports point-to-point (sends, receives) and collective (broadcasts, scatters, gathers) communication of any picklable Python object, as well as optimized communication of Python objects exposing the single-segment buffer interface (NumPy arrays, built-in bytes/string/array objects).
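
As a minimal illustration of these two communication styles, the lowercase methods (send/recv) transparently pickle arbitrary Python objects, while the uppercase methods (Send/Recv) work directly on buffer-like objects such as NumPy arrays. The script name pingpong.py below is just a placeholder for this sketch; run it with at least 2 processes (e.g. mpirun -n 2 python pingpong.py):

pingpong.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

if comm.rank == 0:
    # Pickle-based send of a generic Python object
    comm.send({'step': 1, 'label': 'hello'}, dest=1, tag=11)
    # Buffer-based send of a NumPy array (no pickling overhead)
    data = np.arange(4, dtype=np.float64)
    comm.Send([data, MPI.DOUBLE], dest=1, tag=22)
elif comm.rank == 1:
    obj = comm.recv(source=0, tag=11)
    buf = np.empty(4, dtype=np.float64)
    comm.Recv([buf, MPI.DOUBLE], source=0, tag=22)
    print(obj, buf)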

Adapted from the documentation provided by NASA Modeling Guru, consider the following mpi4py script, which implements a scatter-gather procedure:

scatter-gather.py
#--------------
# Loaded Modules
#--------------
import numpy as np
from mpi4py import MPI
 
#-------------
# Communicator
#-------------
comm = MPI.COMM_WORLD
 
my_N = 4
N = my_N * comm.size
 
if comm.rank == 0:
    A = np.arange(N, dtype=np.float64)
else:
    # If I am not the root processor, A is an uninitialized array;
    # it is only needed as the receive buffer for the Allgather below
    A = np.empty(N, dtype=np.float64)
 
my_A = np.empty(my_N, dtype=np.float64)
 
#-------------------------
# Scatter data into my_A arrays
#-------------------------
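# Root (rank 0 by default) splits A into comm.size chunks of my_N elements each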
comm.Scatter( [A, MPI.DOUBLE], [my_A, MPI.DOUBLE] )
 
if comm.rank == 0:
    print("After Scatter:")

for r in range(comm.size):
    if comm.rank == r:
        print("[%d] %s" % (comm.rank, my_A))
    comm.Barrier()
 
#-------------------------
# Everybody is multiplying by 2
#-------------------------
my_A *= 2
 
#-----------------------
# Allgather data into A again
#-----------------------
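# Every rank contributes its my_A and receives the complete updated A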
comm.Allgather( [my_A, MPI.DOUBLE], [A, MPI.DOUBLE] )
 
if comm.rank == 0:
    print("After Allgather:")

for r in range(comm.size):
    if comm.rank == r:
        print("[%d] %s" % (comm.rank, A))
    comm.Barrier()
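
For a quick sanity check outside the scheduler (for example, on a workstation or in an interactive session where mpi4py is available), the script could be launched directly with, say, 4 processes:

mpirun -n 4 python scatter-gather.py

With 4 processes, each rank should print its own 4-element slice of A = [0, 1, ..., 15] after the scatter, and every rank should end up holding the doubled array [0, 2, ..., 30] after the Allgather. On the cluster itself, however, the job should be submitted through the scheduler as described below.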

Any MPI job requires you to use mpirun to initiate it, and this should be done through the Grid Engine job scheduler to best utilize the resources on the cluster. Also, if you want to run on more than 1 node (more than 24 or 48 cores, depending on the node specifications), then you must initiate a batch job from the head node. Remember, if you only have 1 node in your workgroup, then you will need to take advantage of the standby queue to run a job utilizing multiple nodes.

The best results on Mills have been found by using the openmpi-psm.qs template for Open MPI jobs. For example, copy the template and call it mympi4py.qs for the job script using

cp /opt/templates/gridengine/openmpi/openmpi-psm.qs mympi4py.qs

and modify it for your application. Change NPROC to the number of cores you want. If your task is floating-point intensive, then you will get the best performance by specifying twice the number of cores and actually using only half of them, since the FPU (Floating-Point Unit) is shared by each core pair. For example, if you need 24 cores for your job, then specify 48 cores for NPROC and uncomment the options WANT_CPU_AFFINITY=YES and WANT_HALF_CORES_ONLY=YES; the job will then use only 24 cores, loading them evenly across the nodes assigned based on the PSM resources available. In this example, your job could be spread over 3 nodes using 8 cores each, due to other jobs running on those nodes. However, if you request exclusive access with -l exclusive=1, then no other jobs can run on those nodes, and your job would evenly load 2 nodes, using 12 cores on each. It is important to specify a multiple of 24 when using core pairs or exclusive access.

Make sure you specify the correct VALET environment for your job. For this example, replace vpkg_require openmpi/1.4.4-gcc with

vpkg_require mpi4py
vpkg_require numpy

Lastly, modify MY_EXE for your mpi4py script. In this example, it would be

MY_EXE="python scatter-gather.py"

All the options for mpirun will automatically be determined based on the options you selected above in your job script. Now, to run this job from the head node (Mills), simply use

qsub mympi4py.qs

or

qsub -l exclusive=1 mympi4py.qs

for exclusive access of the nodes needed for your job.

Remember, if you want to specify more cores for NPROC than are available in your workgroup, then you need to request the standby queue with -l standby=1; note that it has limited run times based on the total number of cores you are using for all of your jobs. In this example, since the job requests only 48 cores, it can run for up to 8 hours using

qsub -l exclusive=1 -l standby=1 mympi4py.qs