Caviness 2021 Lustre Expansion

Throughout 2020 and into early 2021, usage of the Lustre file system on the Caviness cluster has repeatedly hovered around 80% of total capacity. At that level of usage the performance of the file system begins to suffer, and each occurrence has necessitated an email campaign asking all cluster users to remove unneeded files. Users have cleaned up each time, but usage has always climbed steadily back past the 80% threshold. As of early 2021, these occurrences have become more frequent.

The capacity of a Lustre file system embodies two separate metrics (storage classes):

  • The total metadata entries (inodes) provided by metadata target (MDT) devices
  • The total object storage (e.g. bytes or blocks) provided by object storage target (OST) devices

Having extremely large OST capacity combined with insufficient MDT capacity leads to an inability to create additional files despite there being many bytes of object storage available. The converse also holds: excess MDT capacity goes to waste when object storage capacity runs out first. Thus, a critical element in provisioning Lustre file systems is balancing the two types of storage so that usage of each grows at about the same rate.
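To make the balancing idea concrete, the Python sketch below estimates how much MDT capacity pairs with a given amount of OST capacity. It is purely illustrative: the average file size and per-inode metadata footprint are assumed values, not Caviness measurements, and the real ratio depends on the users' actual file-size distribution.

  # Illustrative sketch only: estimate the MDT capacity that balances a given
  # OST capacity.  AVG_FILE_SIZE and BYTES_PER_INODE are assumed values, not
  # measured Caviness figures.
  TB  = 1000**4                     # the document quotes device sizes in TB
  MiB = 1024**2
  KiB = 1024

  AVG_FILE_SIZE   = 1 * MiB         # assumed average file size
  BYTES_PER_INODE = 4 * KiB         # assumed MDT space consumed per file

  def mdt_capacity_for(ost_bytes: int) -> int:
      """MDT bytes needed so both storage classes fill at about the same rate."""
      n_files = ost_bytes // AVG_FILE_SIZE      # files that fit in object storage
      return n_files * BYTES_PER_INODE          # metadata those files consume

  if __name__ == "__main__":
      ost_total = 260 * TB                      # e.g. the original 4 x 65 TB OSTs
      print(f"balanced MDT size: {mdt_capacity_for(ost_total) / TB:.2f} TB")

Smaller average file sizes push the required MDT share up quickly, which is why the real ratio has to be derived from the observed usage profile rather than assumptions like these.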

On Caviness, the existing MDT and OST capacity are being consumed at nearly the same rate. As of February 23, 2021:

  • OST usage at 83%
  • MDT usage at 77%

This is actually good news: it implies a fair balance between the two storage classes under the usage profile of all Caviness users. Planning for addition of capacity can be guided by the existing sizing.

Part of the Generation 2 addition to the Caviness cluster was:

  • (2) OSTs, each 120 TB in size
  • (1) MDT, 16 TB in size

The previous components of the Lustre file system were:

  • (4) OSTs, each 65 TB in size
  • (1) MDT, 4 TB in size

Thus, the additions will nearly double the object storage capacity of the Lustre file system and increase its metadata capacity fivefold.
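A quick back-of-the-envelope check of those figures, using the sizes listed above:

  # Totals before and after the Generation 2 addition, per the sizes above.
  existing_ost = 4 * 65      # 260 TB of object storage today
  new_ost      = 2 * 120     # 240 TB being added
  existing_mdt = 1 * 4       #   4 TB of metadata storage today
  new_mdt      = 1 * 16      #  16 TB being added

  print(f"OST: {existing_ost} TB -> {existing_ost + new_ost} TB "
        f"({(existing_ost + new_ost) / existing_ost:.2f}x)")
  print(f"MDT: {existing_mdt} TB -> {existing_mdt + new_mdt} TB "
        f"({(existing_mdt + new_mdt) / existing_mdt:.2f}x)")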

Bringing the new capacity online will require downtime, primarily because the existing MDT and OST usage levels are so high. Every directory currently present on the Lustre file system makes use of only the existing MDT (MDT0000). Adding the 16 TB MDT0001 to the file system does not by itself change where metadata is stored: metadata striping only takes effect on Lustre directories that are explicitly changed to use both MDT0000 and MDT0001. Even within a striped directory, every file and directory is mapped to exactly one of the MDTs based on its name1).

This filename hashing presents a major issue when growing a Lustre file system's metadata capacity: with a directory striped across both MDTs, nominally 50% of new files will map to MDT0000. Because MDT0000 is already nearly full, it will reach capacity well ahead of MDT0001, yet 50% of new filenames will continue to map to it. Those files will not be created on MDT0001 instead: doing so would require metadata recording where the metadata was stored, obviating the hashing in the first place! As a consequence, once an MDT that is part of a metadata stripe reaches 100% capacity, a fraction of new file creations will fail.
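As a rough sketch of the mapping just described (and in the footnote below), the Python below hashes a filename with a 64-bit FNV-1 function and takes the result modulo the number of MDTs in the directory stripe. It mirrors the idea only; Lustre's internal details (seeding, exactly which bytes of the name are hashed) are not reproduced here.

  # Minimal sketch of the filename -> MDT mapping: a 64-bit FNV-1 hash of the
  # name, taken modulo the number of MDTs in the directory stripe.
  FNV1_64_OFFSET = 0xcbf29ce484222325
  FNV1_64_PRIME  = 0x100000001b3

  def fnv1_64(data: bytes) -> int:
      """64-bit FNV-1: multiply by the prime, then XOR in each byte."""
      h = FNV1_64_OFFSET
      for b in data:
          h = (h * FNV1_64_PRIME) & 0xFFFFFFFFFFFFFFFF
          h ^= b
      return h

  def mdt_index(filename: str, mdt_count: int = 2) -> int:
      """MDT a new file of this name would land on in a striped directory."""
      return fnv1_64(filename.encode()) % mdt_count

  if __name__ == "__main__":
      # With 2 MDTs, roughly half of arbitrary names map to each index --
      # including names that map to an MDT that is already full.
      names = [f"output_{i:04d}.dat" for i in range(10000)]
      on_mdt0 = sum(1 for n in names if mdt_index(n) == 0)
      print(f"{on_mdt0} of {len(names)} names map to MDT0000")

Because the index depends only on the name, the half of new names that hash to index 0 will keep targeting MDT0000 no matter how full it is.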

Thus, one design tenet for multiple Lustre MDTs:

  • Each MDT should be of approximately the same capacity to promote balanced growth
  • The filename hash function must be well-designed (e.g. to avoid mapping like-named files to the same MDT)

The second requirement is outside our ability to control (hopefully the Lustre developers did a good job). The first requirement is by definition not met on Caviness since MDT0000 is close to full and the new MDT(s) will be empty.

The new MDT storage is 4 times the size of the original. To restore balance, the 16 TB of pooled storage will be split into four MDT devices of 4 TB each, matching the capacity of the existing MDT0000.
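The arithmetic behind that split, sketched in Python (the MDT names beyond MDT0000 are simply the obvious continuation and are only illustrative):

  # Carving the new 16 TB metadata pool into devices the same size as the
  # existing 4 TB MDT0000, so an even hash distribution also means even
  # capacity consumption across all MDTs.
  EXISTING_MDT_TB = 4
  NEW_POOL_TB     = 16

  new_mdt_count = NEW_POOL_TB // EXISTING_MDT_TB     # -> 4 new MDTs
  mdts = [EXISTING_MDT_TB] * (1 + new_mdt_count)     # MDT0000 through MDT0004

  print(f"{len(mdts)} MDTs of {EXISTING_MDT_TB} TB each, "
        f"{sum(mdts)} TB metadata capacity in total")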


1)
The filename is hashed using a 64-bit FNV-1 function, and the hash modulo the number of MDTs (2 in this case) gives the MDT index.