Lustre MPI-IO in Open MPI
In the Lustre filesystem, striping can only be set on a file at the time it is created. Attempting to modify the striping of an existing file will fail. There are two approaches to using striped Lustre files with the ROMIO MPI-IO implementation bundled with Open MPI:
- Create the file from the shell before your program executes
- Use MPI_Info parameters passed to the MPI-IO MPI_File_open() function inside your program
Pre-Create the File
An empty, striped file can be created using the lfs setstripe command from the shell:
$ lfs setstripe --size 4M --count 2 scratch.txt
$ ls -l scratch.txt
-rw-r--r-- 1 traine it_css 0 Sep 11 12:33 scratch.txt
Your code cannot require the MPI_MODE_EXCL mode in MPI_File_open(): the file will already be present when the program executes.
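The shell command above gives the stripe size with a unit suffix (4M), whereas a program usually needs the same quantity as a plain byte count. The following is a minimal sketch of such a conversion. The function name size_to_bytes is hypothetical, and the sketch assumes binary suffixes (K = 1024, M = 1024^2, G = 1024^3) in either case.

```c
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper: convert a size such as "4M" or "65536" to bytes.
 * Assumes binary suffixes: K = 1024, M = 1024^2, G = 1024^3.
 * Returns 0 on success, -1 on malformed input or overflow. */
int size_to_bytes(const char *text, unsigned long long *bytes)
{
    char *end = NULL;
    unsigned long long value, scale = 1;

    /* Require a leading digit so that signs and whitespace are rejected */
    if (text == NULL || bytes == NULL || !isdigit((unsigned char)text[0]))
        return -1;

    errno = 0;
    value = strtoull(text, &end, 10);
    if (errno != 0)
        return -1;

    switch (toupper((unsigned char)*end)) {
    case '\0': break;                            /* plain byte count */
    case 'K':  scale = 1ULL << 10; end++; break;
    case 'M':  scale = 1ULL << 20; end++; break;
    case 'G':  scale = 1ULL << 30; end++; break;
    default:   return -1;                        /* unknown suffix */
    }
    if (*end != '\0')                            /* junk after the suffix */
        return -1;
    if (value > ULLONG_MAX / scale)              /* product would overflow */
        return -1;

    *bytes = value * scale;
    return 0;
}
```

Under these assumptions, "4M" converts to 4194304 bytes, and an input with trailing characters such as "4MB" is rejected rather than guessed at.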
MPI_Info Parameters
The Open MPI Lustre ADIO module recognizes two info keys related to striping:
Key | Value |
---|---|
striping_factor | The number of OSTs across which the file is striped. |
striping_unit | The file is broken into chunks of this many bytes. |
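To make the two keys concrete, the sketch below computes where a given byte offset of the file lands. It assumes plain round-robin striping, in which consecutive chunks of striping_unit bytes are dealt out in turn to striping_factor objects. The function names are hypothetical, and the index returned is a position in the file's list of objects, not an OST number.

```c
#include <stddef.h>

/* Index (0 .. striping_factor - 1) of the stripe object that holds the
 * byte at `offset`, assuming plain round-robin striping.
 * Returns -1 if the striping parameters are invalid. */
long stripe_index_for_offset(unsigned long long offset,
                             unsigned long long striping_unit,
                             int striping_factor)
{
    if (striping_unit == 0 || striping_factor <= 0)
        return -1;
    return (long)((offset / striping_unit) %
                  (unsigned long long)striping_factor);
}

/* Byte position of `offset` inside that stripe object, under the same
 * assumption. Returns 0 if the striping parameters are invalid. */
unsigned long long offset_within_object(unsigned long long offset,
                                        unsigned long long striping_unit,
                                        int striping_factor)
{
    unsigned long long chunk;

    if (striping_unit == 0 || striping_factor <= 0)
        return 0;
    chunk = offset / striping_unit;          /* global chunk number */
    return (chunk / (unsigned long long)striping_factor) * striping_unit
           + offset % striping_unit;
}
```

With striping_unit = 65536 and striping_factor = 4, bytes 0 through 65535 fall in object 0, the next 65536 bytes in object 1, and byte 262144 wraps back around to object 0.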
Construct an MPI_Info entity containing the striping properties for the file:
MPI_Info scrfinfo;
MPI_Info_create(&scrfinfo);
/* How many OSTs to stripe across? */
MPI_Info_set(scrfinfo, "striping_factor", "4");
/* How many bytes per stripe? */
MPI_Info_set(scrfinfo, "striping_unit", "65536");
This MPI_Info is then passed to the MPI_File_open() function. Beware: you cannot use the MPI_MODE_EXCL flag in your call to MPI_File_open() even though the file is not present on the filesystem! The Lustre ADIO module exploits the fact that its "set info" callback is executed before its "open file" callback: if the MPI_Info contains any striping properties, the "set info" callback declares the new file (using a special Lustre flag to the open() system function) and sets the striping properties using ioctl() (which then actually commits the new file to the Lustre MDS). If successful, "set info" closes the file descriptor, which allows the "open file" callback to itself use the open() system function to prepare the file for I/O. The "set info" callback will respect the MPI_MODE_EXCL flag, but the "open file" callback will subsequently also require that the file not be present, and will fail.
Ideally, the ADIOI_LUSTRE_SetInfo() function would remove the MPI_MODE_EXCL requirement if it succeeds in creating the striped file. A bug will be filed with the Open MPI folks, so this behavior may be remedied in the future.
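The double-open failure described above can be reproduced with ordinary POSIX calls. The sketch below is only an analogue, not ROMIO code: the first exclusive create stands in for the "set info" callback, the second for the "open file" callback. The function name exclusive_create_twice is hypothetical.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Plain POSIX analogue of the two-step sequence; not ROMIO code.
 * Step 1 plays the role of "set info": exclusively create, then close.
 * Step 2 plays the role of "open file": it repeats the exclusive create.
 * Returns 0 if step 1 succeeded, storing the errno of step 2 in
 * *second_errno (0 if step 2 unexpectedly succeeded).
 * Returns -1 if step 1 itself failed. */
int exclusive_create_twice(const char *path, int *second_errno)
{
    int fd = open(path, O_CREAT | O_EXCL | O_RDWR, 0644);
    if (fd < 0)
        return -1;
    close(fd);

    /* The file now exists, so a second exclusive create cannot succeed */
    fd = open(path, O_CREAT | O_EXCL | O_RDWR, 0644);
    if (fd < 0) {
        *second_errno = errno;
    } else {
        *second_errno = 0;
        close(fd);
    }
    return 0;
}
```

On a path that does not yet exist, the first step succeeds and the second reports EEXIST, which mirrors why MPI_MODE_EXCL must be left out of the access mode when striping hints are supplied.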
Here is an example function for creating a striped Lustre file that will use MPI-IO:
- NSSCreateLustreFile.c
#include <sys/stat.h>
#include <errno.h>
#include <stdio.h>   /* snprintf() */
#include "mpi.h"

MPI_File
NSSCreateLustreFile(
  MPI_Comm    comm,
  const char  *path,
  int         stripeCount,
  size_t      stripeSize,
  int         *errorCode
)
{
  struct stat fInfo;

  /* Only proceed if the file does not exist yet */
  if ( stat(path, &fInfo) != 0 ) {
    int       rc;
    MPI_File  fh = MPI_FILE_NULL;
    MPI_Info  finfo = MPI_INFO_NULL;

    if ( (rc = MPI_Info_create(&finfo)) == MPI_SUCCESS ) {
      char    strForm[32];

      /* Any negative stripe count is normalized to -1 */
      if ( stripeCount < 0 ) stripeCount = -1;
      snprintf(strForm, sizeof(strForm), "%d", stripeCount);
      if ( (rc = MPI_Info_set(finfo, "striping_factor", strForm)) == MPI_SUCCESS ) {
        /* size_t is unsigned, so it cannot be negative; print it with %zu */
        snprintf(strForm, sizeof(strForm), "%zu", stripeSize);
        if ( (rc = MPI_Info_set(finfo, "striping_unit", strForm)) == MPI_SUCCESS ) {
          /* MPI_MODE_EXCL is deliberately omitted (see above) */
          rc = MPI_File_open(
                  comm,
                  (char*)path,
                  MPI_MODE_RDWR | MPI_MODE_CREATE,
                  finfo,
                  &fh
                );
        }
      }
      MPI_Info_free(&finfo);
    }
    if ( rc != MPI_SUCCESS ) {
      if ( errorCode ) *errorCode = rc;
      return MPI_FILE_NULL;
    }
    return fh;
  }
  /* The file already exists; note that EEXIST is an errno value, not an MPI error code */
  if ( errorCode ) *errorCode = EEXIST;
  return MPI_FILE_NULL;
}
Were this function called (successfully) with the following arguments:
scratchFile = NSSCreateLustreFile(MPI_COMM_WORLD, "mpibounce.scr", 4, 65536, NULL);
the success of the striping is evident via lfs getstripe:
$ lfs getstripe mpibounce.scr
mpibounce.scr
lmm_stripe_count:   4
lmm_stripe_size:    65536
lmm_stripe_offset:  17
    obdidx       objid       objid    group
        17    20020417   0x1317cc1        0
        23    19758316   0x12d7cec        0
         0    19895589   0x12f9525        0
         6    19804152   0x12e2ff8        0