
Caviness 2021 Lustre Expansion

Throughout 2020 and into early 2021, usage of the Lustre file system on the Caviness cluster has hovered around 80% of total capacity, a level at which the performance of the file system begins to suffer. Each time the 80% threshold has been exceeded, an email campaign directed at all cluster users has been necessary, asking that they remove unneeded files. Though users have cleaned up each time, usage has always steadily increased again afterward until the threshold is exceeded once more. As of early 2021, these occurrences have become more frequent.

The capacity of a Lustre file system embodies two separate metrics (storage classes):

  • The total metadata entries (inodes) provided by metadata target (MDT) devices
  • The total object storage (e.g. bytes or blocks) provided by object storage target (OST) devices

Having extremely large OST capacity combined with insufficient MDT capacity leads to an inability to create additional files despite there being many bytes of object storage available. The converse also holds: surplus MDT capacity is of no use once object storage capacity is exhausted. Thus, a critical element in provisioning Lustre file systems is balancing the two types of storage so that their usage grows at about the same rate.

On Caviness, the existing MDT and OST capacity are being consumed at nearly the same rate. As of February 23, 2021:

  • OST usage at 83%
  • MDT usage at 77%

This is actually good news: it implies a fair balance between the two storage classes under the usage profile of all Caviness users. Planning for addition of capacity can be guided by the existing sizing.
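Usage figures of this kind are what lfs df (object storage) and lfs df -i (inodes) report per target. As a minimal sketch, and assuming the usual lfs df column layout in which the fifth column is Use%, a small filter can flag targets at or above a threshold. The function name targets_over is hypothetical, not part of Lustre.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read `lfs df`-style output on stdin and print every
# target whose Use% is at or above the given threshold (default 80).
# Assumes target lines start with a name ending in _UUID and that the
# fifth whitespace-separated column is the Use% value.
targets_over() {
    local threshold="${1:-80}"
    awk -v t="$threshold" '
        $1 ~ /_UUID$/ {
            pct = $5
            sub(/%$/, "", pct)            # strip the trailing percent sign
            if (pct + 0 >= t) print $1, pct "%"
        }'
}
```

On a client it might be used as lfs df /lustre/scratch | targets_over 80 for OSTs, and lfs df -i /lustre/scratch | targets_over 80 for MDT inodes.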

Part of the Generation 2 addition to the Caviness cluster was:

  • (2) OST pools, 12 x 12 TB HDDs
  • (1) MDT pool, 12 x 1.6 TB SSDs

The previous components of the Lustre file system were:

  • (4) OSTs, each 65 TB in size
  • (1) MDT, 4 TB in size

Bringing the new capacity online will require downtime, primarily because the existing MDT and OST usage levels are so high. Every directory currently present on the Lustre file system uses only the existing MDT (mdt0). Adding a single 16 TB mdt1 to the file system does not change where metadata is stored: metadata striping only takes effect on directories that are explicitly changed to use both mdt0 and mdt1. Even within a striped directory, each file and subdirectory is mapped to exactly one of the MDTs based on a hash of its name (see footnote 1).

The filename hashing presents a major issue when growing a Lustre file system's metadata capacity. With a directory striped across two MDTs, nominally 50% of new files will map to mdt0. Because mdt0 starts out nearly full while mdt1 starts out empty, mdt0 will reach capacity well ahead of mdt1, yet 50% of filenames will continue to map to it. These files cannot be created on mdt1 instead: doing so would require metadata recording where the metadata was stored, defeating the purpose of hashing in the first place. As a consequence, once an MDT that is part of a metadata stripe reaches 100% capacity, a fraction of new files will fail to be created.
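The name-to-MDT mapping can be illustrated with a short sketch. It uses cksum (a CRC) purely as a stand-in because it is available on any system; it is not the hash Lustre uses (that is the one named in footnote 1), so the indices it prints will not match a real file system. The function names are hypothetical. The point is only that the index depends on nothing but the name and the MDT count, so a given name can never be redirected to a different MDT.

```bash
#!/usr/bin/env bash
# Stand-in for Lustre's filename hash. cksum is an assumption made for
# illustration only; Lustre uses a different hash function.
mdt_index_for() {
    local name=$1 mdt_count=$2 h
    h=$(printf '%s' "$name" | cksum | cut -d' ' -f1)
    echo $(( h % mdt_count ))     # hash modulo the number of MDTs
}

# Tally how many of the given names land on each MDT index.
tally_names() {
    local mdt_count=$1 name
    shift
    for name in "$@"; do
        mdt_index_for "$name" "$mdt_count"
    done | sort -n | uniq -c
}
```

For example, tally_names 2 a.dat b.dat c.dat d.dat prints a count per index. Whatever share of names maps to index 0 will fail to be created once mdt0 is full, regardless of how much room mdt1 has.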

Given this information, design tenets for effective use of multiple Lustre MDTs are as follows:

  • Each MDT should be of approximately the same capacity to promote balanced growth
  • The filename hash function must be well-designed (to provide a balanced distribution of hash values)

The second requirement is outside our ability to control (hopefully the Lustre developers did a good job). The first requirement is by definition not met on Caviness since mdt0 is close to full and the new MDT(s) will be empty.

The Generation 1 Lustre metadata was configured with a single MDT serviced by a pair of MDS nodes:

[Figure: Caviness, Generation 1 Lustre metadata]

The three disks in light blue hold parity data (for redundancy) and the one disk in light orange is a hot spare. All 12 disks are 400 GB SSDs; with three parity disks and one spare, the remaining eight hold data, for a total of 8 x 400 GB = 3200 GB raw capacity. The green connecting line leads to the primary server for the MDT, and the red line connects to the failover server.

With the hardware added to Generation 2 and the balanced design tenets outlined above, the 16 TB of new metadata storage will be organized as:

[Figure: Caviness, Generation 2 Lustre metadata]

The 12 disks are again SSDs, this time with quadruple the capacity of those in Generation 1. The disks are split into three pools, with each pool being a mirror of two disks: 2 x 1600 GB = 3200 GB raw capacity, in line with the single pool in Generation 1. The MDTs are divided equally between the two servers: r02mds0 is the primary for mdt0 and mdt2, and r02mds1 is the primary for mdt1 and mdt3.

Just as with the new MDTs versus the old, the two new OST pools use higher-capacity storage media than Generation 1. Though metadata entries are fixed-size entities, objects are of arbitrary size (anywhere up to the full capacity of an OST), so an arbitrary number fit on each OST. When a new file is created, the Lustre metadata subsystem chooses an OST (or several OSTs if the file is striped) on which the file will be placed. The file's metadata (in the MDT) indicates on which OST(s) the object(s) reside and in what pattern.

Since the metadata subsystem can allocate around OSTs that have reached full capacity, it is not quite as critical for the OSTs to be balanced in their usage. It is also beneficial to leave each new OST as a single pool rather than splitting it into multiple smaller pools sized to match Generation 1 (as was done with the Generation 2 MDTs), because the maximum size of a single object on that OST is much larger as a result.

Thus, the two new OSTs in Generation 2 are set up as two pools, each comprising nine data and two parity HDDs (RAIDZ2); an SSD read cache (L2ARC); and a single hot spare HDD.
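A rough sense of the resulting sizes follows from the disk counts already given. The arithmetic below is a back-of-the-envelope sketch only: it ignores ZFS overhead and any TB versus TiB difference, so real usable figures will be lower. The variable names are illustrative.

```bash
#!/usr/bin/env bash
# Back-of-the-envelope OST sizing from the figures in the text.
# Ignores file system overhead; actual usable capacity will be smaller.
new_ost_tb=$(( 9 * 12 ))                  # nine 12 TB data disks per new pool
new_total_tb=$(( 2 * new_ost_tb ))        # two new OSTs
old_total_tb=$(( 4 * 65 ))                # four Generation 1 OSTs of 65 TB
all_tb=$(( old_total_tb + new_total_tb ))
new_share_pct=$(( 100 * new_total_tb / all_tb ))   # integer division

echo "new OST: ${new_ost_tb} TB, new total: ${new_total_tb} TB"
echo "old total: ${old_total_tb} TB, combined: ${all_tb} TB"
echo "share of capacity on new OSTs: about ${new_share_pct}%"
```

Under these assumptions each new OST is noticeably larger than a Generation 1 OST, which is consistent with leaving it as one pool rather than matching the older sizes.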

Once the new MDTs and OSTs are brought online as part of the existing Lustre filesystem, an imbalance will exist: mdt0, ost0, ost1, ost2, and ost3 will contain all metadata and objects, while mdt1, mdt2, mdt3, ost4, and ost5 will be empty.

Striping of metadata does not happen automatically: a directory must be explicitly configured to do so, and existing files do not migrate if they are modified — only if they are copied. Likewise, existing files' OST layout is fixed at the time of creation, so copying is again necessary to redistribute them.

A rebalancing of the filesystem will be effected by the following once the new MDTs and OSTs are online:

  1. A new directory with metadata striped across all MDTs will be created (/lustre/scratch/altroot).
  2. Existing directories on /lustre/scratch will be copied to /lustre/scratch/altroot; the source files/directories will be removed as the copy progresses.
  3. Once all content has been transferred, the root directory (/lustre/scratch) will be modified to stripe metadata across all MDTs.
  4. Finally, all directories under /lustre/scratch/altroot will be moved back to being under /lustre/scratch as before.

With the metadata of the new copies being striped across all MDTs, and the Lustre metadata subsystem spreading the copies across the new and old OSTs, the net effect will be to rebalance MDT and OST usage across all devices.
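The four steps above can be sketched as a dry-run script. The specific commands are assumptions and not necessarily what was run on Caviness: lfs setdirstripe -c to create the striped directory, rsync --remove-source-files for the copy-and-remove, and lfs setdirstripe -D to set the default striping on the root. The run and rebalance helpers are hypothetical. By default the script only prints what it would do.

```bash
#!/usr/bin/env bash
# Dry-run sketch of the rebalancing steps. Commands are printed, not
# executed, unless DRY_RUN=0. The lfs/rsync choices are assumptions.
ROOT=${ROOT:-/lustre/scratch}
ALT="$ROOT/altroot"
MDT_COUNT=${MDT_COUNT:-4}        # mdt0 through mdt3
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "$*"       # show the command instead of running it
    else
        "$@"
    fi
}

# Usage: rebalance DIR [DIR ...]   (top-level directory names under ROOT)
rebalance() {
    local d
    # Step 1: new directory with metadata striped across all MDTs.
    run lfs setdirstripe -c "$MDT_COUNT" "$ALT"

    # Step 2: copy each directory, removing source files as they transfer,
    # then prune the emptied source directory tree.
    for d in "$@"; do
        run rsync -a --remove-source-files "$ROOT/$d/" "$ALT/$d/"
        run find "$ROOT/$d" -depth -type d -empty -delete
    done

    # Step 3: default metadata striping on the root for new directories.
    run lfs setdirstripe -D -c "$MDT_COUNT" "$ROOT"

    # Step 4: move the copies back to their original locations.
    for d in "$@"; do
        run mv "$ALT/$d" "$ROOT/$d"
    done
    run rmdir "$ALT"
}
```

Calling rebalance projA projB prints the full command sequence for review. A real run would also need to confirm that each source tree was fully emptied before step 4, since mv would otherwise place the copy inside a leftover directory.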

All aspects of this workflow were tested using VirtualBox on a Mac laptop. A CentOS 7 VM (the same version as is in use on Caviness) was provisioned with the Lustre 2.10.3 patchless server kernel modules installed. This VM was diff-cloned to create three additional VMs, giving four servers in total: mds0, mds1, oss0, and oss1.

The four VMs each had a virtual NIC configured in a named internal network (lustre-net) and IP addresses were assigned manually in the OS. Connectivity between the four VMs via that network was confirmed. LNET was configured manually after boot on each node:

[mds0 ~]$ modprobe lnet
[mds0 ~]$ lnetctl lnet configure --all

The following VDIs were created:

  • 50 GB - mgt
  • 250 GB - mdt0, mdt1
  • 1000 GB - ost0, ost1

The mgt and mdt0 VDIs were attached to mds0 and formatted:

[mds0 ~]$ mkfs.lustre --mgs --reformat \
    --servicenode=mds0@tcp --servicenode=mds1@tcp \
    --backfstype=ldiskfs \
    /dev/sdb
[mds0 ~]$ mkfs.lustre --mdt --reformat \
    --mgsnode=mds0@tcp --mgsnode=mds1@tcp \
    --servicenode=mds0@tcp --servicenode=mds1@tcp \
    --backfstype=ldiskfs --fsname=demo \
    /dev/sdc

The ost0 VDI was attached to oss0 and formatted:

[oss0 ~]$ mkfs.lustre --ost --reformat --index=0 \
    --mgsnode=mds0@tcp --mgsnode=mds1@tcp \
    --servicenode=oss0@tcp --servicenode=oss1@tcp \
    --backfstype=ldiskfs --fsname=demo \
    /dev/sdb

The mgt and mdt0 were brought online:

[mds0 ~]$ mkdir -p /lustre/mgt /lustre/mdt{0,1}
[mds0 ~]$ mount -t lustre /dev/sdb /lustre/mgt
[mds0 ~]$ mount -t lustre /dev/sdc /lustre/mdt0

Finally, ost0 was brought online:

[oss0 ~]$ mkdir -p /lustre/ost{0,1}
[oss0 ~]$ mount -t lustre /dev/sdb /lustre/ost0

Another VM was created with the same version of CentOS 7 and the Lustre 2.10.3 client modules. The VM also had a virtual NIC created as part of the named internal network (lustre-net) and an IP address assigned manually within the OS. Connectivity to the four Lustre VMs was confirmed and LNET configured manually as above.

The "demo" Lustre file system was mounted on the client:

[client ~]$ mkdir /demo
[client ~]$ mount -t lustre mds0@tcp:mds1@tcp:/demo /demo

At this point, some tests were performed in order to fill the metadata to approximately 70% of capacity.
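The source does not say how the metadata was filled. One simple approach, assumed here, is to create large numbers of empty files, since each file and directory consumes an MDT inode while using essentially no object storage. The function name and its batching parameter are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: create COUNT empty files under DIR, spread over
# subdirectories of PER_DIR files each so no single directory grows huge.
# Usage: fill_metadata DIR COUNT [PER_DIR]
fill_metadata() {
    local dir=$1 count=$2 per_dir=${3:-1000} i sub
    for (( i = 0; i < count; i++ )); do
        sub="$dir/d$(( i / per_dir ))"
        if (( i % per_dir == 0 )); then
            mkdir -p "$sub" || return 1    # start a new batch directory
        fi
        : > "$sub/f$i" || return 1         # zero-length file
    done
}
```

On the test client this might be invoked as fill_metadata /demo/fill 1000000 while watching inode usage with lfs df -i /demo, stopping once the MDT reaches the desired level.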

The new MDT was formatted and brought online:

[mds1 ~]$ mkfs.lustre --mdt --reformat --index=1 \
    --mgsnode=mds0@tcp --mgsnode=mds1@tcp \
    --servicenode=mds1@tcp --servicenode=mds0@tcp \
    --backfstype=ldiskfs --fsname=demo \
    /dev/sdb
[mds1 ~]$ mkdir -p /lustre/mgt /lustre/mdt{0,1}
[mds1 ~]$ mount -t lustre /dev/sdb /lustre/mdt1

After a few moments, the client VM received the updated file system configuration and mounted the new MDT. MDT usage and capacity changed accordingly. This indicated that an online addition of MDTs to a running Lustre file system is possible.

Further testing was performed to confirm that

  • by default all metadata additions were against mdt0
  • creating a new directory with metadata striping over mdt0 and mdt1 initially allowed a balanced creation of new files across both MDTs
  • once mdt0 was filled to capacity, creation of new files whose name hashed and mapped to mdt0 failed; names that hashed and mapped to mdt1 succeeded

The new OST was formatted and brought online:

[oss1 ~]$ mkfs.lustre --ost --reformat --index=1 \
    --mgsnode=mds0@tcp --mgsnode=mds1@tcp \
    --servicenode=oss1@tcp --servicenode=oss0@tcp \
    --backfstype=ldiskfs --fsname=demo \
    /dev/sdb
[oss1 ~]$ mkdir -p /lustre/ost{0,1}
[oss1 ~]$ mount -t lustre /dev/sdb /lustre/ost1

After a few moments, the client VM received the updated file system configuration and had mounted the new OST. OST usage and capacity changed accordingly. This indicated that an online addition of OSTs to a running Lustre file system is possible.


1) The filename is hashed using a 64-bit FNV-1a function, and the hash modulo the number of MDTs (2 in this example) provides the MDT index.
  • technical/generic/caviness-lustre-rebalance.txt
  • Last modified: 2021-02-23 17:13
  • by frey