Lustre MPI-IO in Open MPI

On a Lustre filesystem, a file's striping can only be set when the file is created. Attempting to modify the striping of an existing file will fail. There are two approaches to using striped Lustre files with ROMIO, the MPI-IO implementation included in Open MPI:

  1. Create the file from the shell before your program executes
  2. Use MPI_Info parameters passed to the MPI-IO MPI_File_open() function inside your program

Pre-Create the File

An empty, striped file can be created using the lfs setstripe command from the shell:

$ lfs setstripe --size 4M --count 2 scratch.txt
$ ls -l scratch.txt 
-rw-r--r-- 1 traine it_css 0 Sep 11 12:33 scratch.txt

Because the file is already present when the program executes, your code must not pass the MPI_MODE_EXCL mode to MPI_File_open(); an exclusive open of an existing file fails.

MPI_Info Parameters

The Open MPI Lustre ADIO module recognizes two info keys related to striping:

Key              Value
striping_factor  The number of OSTs across which the file is striped.
striping_unit    The file is broken into chunks of this many bytes.
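MPI_Info values are strings, and striping_unit is expressed in bytes, whereas the lfs setstripe example above used a suffixed size (4M). The sketch below shows one way to convert such a size into the byte-count string the info key expects. The helper name stripe_size_to_hint is hypothetical, and it assumes binary suffixes (K = 1024, M = 1024*1024, G = 1024*1024*1024).

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: convert a size such as "4M" or "65536" into a
 * decimal byte-count string suitable for the striping_unit info key.
 * Returns 0 on success, -1 on a malformed size or too-small buffer. */
int
stripe_size_to_hint(
  const char*     size,
  char*           out,
  size_t          outLen
)
{
  char                *end = NULL;
  unsigned long long  bytes;
  int                 n;

  /* Require a leading digit (rejects signs, whitespace, empty input) */
  if ( size[0] < '0' || size[0] > '9' ) return -1;
  bytes = strtoull(size, &end, 10);

  /* Optional binary suffix (an assumption of this sketch) */
  switch ( *end ) {
    case 'k': case 'K': bytes <<= 10; end++; break;
    case 'm': case 'M': bytes <<= 20; end++; break;
    case 'g': case 'G': bytes <<= 30; end++; break;
    default: break;
  }
  if ( *end != '\0' ) return -1;   /* trailing characters */

  n = snprintf(out, outLen, "%llu", bytes);
  return ( n > 0 && (size_t)n < outLen ) ? 0 : -1;
}
```

With this helper, "4M" becomes "4194304", which could then be handed to MPI_Info_set() as the striping_unit value.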

Construct an MPI_Info entity containing the striping properties for the file:

MPI_Info      scrfinfo;
 
MPI_Info_create(&scrfinfo);
 
/* How many OSTs to stripe across? */
MPI_Info_set(scrfinfo, "striping_factor", "4");
 
/* How many bytes per stripe? */
MPI_Info_set(scrfinfo, "striping_unit", "65536");

This MPI_Info is then passed to the MPI_File_open() function. Beware: you cannot use the MPI_MODE_EXCL flag in your call to MPI_File_open(), even though the file is not yet present on the filesystem!

The Lustre ADIO module exploits the fact that its "set info" callback is executed before its "open file" callback. If the MPI_Info contains any striping properties, the "set info" callback creates the new file (using a special Lustre flag to the open() system call) and sets the striping properties using ioctl(), which is what actually commits the new file to the Lustre MDS. If successful, "set info" closes the file descriptor, which allows the "open file" callback to call open() itself to prepare the file for I/O. The "set info" callback will respect the MPI_MODE_EXCL flag, but the "open file" callback will subsequently also require that the file not be present; because "set info" has just created it, that second open fails.

This behavior is obviously incorrect; the Lustre ADIO module should be modified such that the ADIOI_LUSTRE_SetInfo() function removes the MPI_MODE_EXCL requirement if it succeeds in creating the striped file. A bug will be filed with the Open MPI folks, so this behavior may be remedied in the future.
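The failure mode can be reproduced with plain POSIX calls. The sketch below is not the ROMIO code: it omits the Lustre-specific open flag and the ioctl(), and the function name open_twice_exclusive is hypothetical. It only mimics the sequence of two exclusive-create opens on the same path, the first standing in for "set info" and the second for "open file".

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical illustration: open the same path twice with
 * O_CREAT | O_EXCL, as the two ADIO callbacks effectively do when
 * MPI_MODE_EXCL is requested.
 * Returns 0 if both opens succeed, otherwise the errno of the failing
 * open; *failedStep is set to 0, 1 or 2 accordingly. */
int
open_twice_exclusive(
  const char*     path,
  int             *failedStep
)
{
  int             flags = O_RDWR | O_CREAT | O_EXCL;
  int             fd;

  /* Step 1: stands in for "set info", which creates the file */
  fd = open(path, flags, 0644);
  if ( fd < 0 ) { *failedStep = 1; return errno; }
  close(fd);

  /* Step 2: stands in for "open file", which again demands that the
   * file not exist */
  fd = open(path, flags, 0644);
  if ( fd < 0 ) { *failedStep = 2; return errno; }
  close(fd);

  *failedStep = 0;
  return 0;
}
```

On a path that does not yet exist, the first open creates the file and the second one fails with EEXIST, which is the same situation the "open file" callback runs into.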

Here is an example function for creating a striped Lustre file that will use MPI-IO:

NSSCreateLustreFile.c
#include <stdio.h>      /* snprintf() */
#include <sys/stat.h>
#include <errno.h>
#include "mpi.h"
 
MPI_File
NSSCreateLustreFile(
  MPI_Comm        comm,
  const char*     path,
  int             stripeCount,
  size_t          stripeSize,
  int             *errorCode
)
{
  struct stat     fInfo;
 
  if ( stat(path, &fInfo) != 0 ) {
    int           rc = MPI_SUCCESS;
    MPI_File      fh = MPI_FILE_NULL;
    MPI_Info      finfo = MPI_INFO_NULL;
 
    if ( (rc = MPI_Info_create(&finfo)) == MPI_SUCCESS ) {
      char        strForm[32];
 
      if ( stripeCount < 0 ) stripeCount = -1;
      snprintf(strForm, sizeof(strForm), "%d", stripeCount);
      if ( (rc = MPI_Info_set(finfo, "striping_factor", strForm)) == MPI_SUCCESS ) {
        /* stripeSize is unsigned (size_t), so it needs no negative check; print it with %zu */
        snprintf(strForm, sizeof(strForm), "%zu", stripeSize);
        if ( (rc = MPI_Info_set(finfo, "striping_unit", strForm)) == MPI_SUCCESS ) {
          rc = MPI_File_open(
                    comm,
                    (char*)path,
                    MPI_MODE_RDWR | MPI_MODE_CREATE,
                    finfo,
                    &fh
                  );
        }
      }
      MPI_Info_free(&finfo);
    }
    if ( rc != MPI_SUCCESS ) {
      if ( errorCode ) *errorCode = rc;
      return MPI_FILE_NULL;
    }
    return fh;
  } else {
    /* Note: this is an errno value, not an MPI error code */
    if ( errorCode ) *errorCode = EEXIST;
  }
  return MPI_FILE_NULL;
}

Were this function called (successfully) with the following arguments

    :
scratchFile = NSSCreateLustreFile(MPI_COMM_WORLD, "mpibounce.scr", 4, 65536, NULL);
    :

the success of the striping is evident via lfs getstripe:

$ lfs getstripe mpibounce.scr
mpibounce.scr
lmm_stripe_count:   4
lmm_stripe_size:    65536
lmm_stripe_offset:  17
	obdidx		 objid		objid		 group
	    17	      20020417	    0x1317cc1	             0
	    23	      19758316	    0x12d7cec	             0
	     0	      19895589	    0x12f9525	             0
	     6	      19804152	    0x12e2ff8	             0