====== Caviness: Rebuild of First-Generation OSS and NFS Nodes ======

The in-rack storage servers (Lustre OSS and NFS) require a higher number of PCIe expansion lanes than the compute nodes. Each storage server requires two SAS HBA ports (PCIe x8) to connect to the external disk array(s). The OSS servers need one 100 Gbps OPA HFI (PCIe x16), and the NFS servers need one 10 Gbps Ethernet port (PCIe x8). In the first-generation OSS and NFS nodes, getting the requisite number of PCIe lanes required a less-common node design from the vendor, with a few trade-offs (2 OU chassis, no integrated console video port, a single multiplexed LAN port carrying both IPMI and data). Thankfully, the second-generation OSS and NFS nodes with Cascade Lake processors have enough PCIe expansion in 1 OU of vertical rack space to accommodate the necessary network and SAS adapters.

One major support issue with the first-generation OSS and NFS nodes is that system logs are written to RAM, not to a persistent storage medium. When an OSS or NFS node crashes, there is no way to analyze what was happening on the node prior to the crash. The first-generation compute nodes all have a 960 GB SSD that is split into swap, ''/var'', and ''/tmp'' partitions, so system logs/state survive a reboot.

Each first-generation OSS and NFS node was designed with the following complement of storage devices, originally slated for specific uses:

^Physical location^Qty^Device^Purpose^
|internal|1|240 GB SSD|L2ARC|
|external JBOD|2|400 GB SSD|ZIL (mirror)|
|external JBOD|10|8000 GB HDD|RAIDZ2 + hot spare (OSS); RAIDZ3 + hot spare (NFS)|

===== Issues =====

There are two issues with the original intent for the storage devices present in the first-generation storage servers.

==== L2ARC ====

Having the L2ARC device inside the storage node itself is not an issue for the NFS servers. However, for the OSS servers operating as high-availability partners, the L2ARC device for a partner's pool is obviously not available (it's internal to the partner node). Thus, when attempting to import the partner's pool, the L2ARC device is not found and the pool cannot be used (which precludes the entire failover behavior).

In operation, the NFS server's L2ARC sees a fair amount of usage, so having an L2ARC //is// useful. But it requires the device to be visible to both nodes of the H.A. pair.

==== ZIL ====

ZFS leverages the intent log (ZIL) to accelerate the client side of synchronous transactions. By default, ZFS creates datasets (file systems) with fully-synchronous behavior: for a client's write operation to complete, the data //must// be committed to persistent storage. The ZIL is usually built from storage media that are far faster than the bulk devices comprising the rest of the pool. ZFS queues synchronous writes in the ZIL, which allows control to return to the client more quickly, and then commits the data to the slower media of the pool as a background task. The ZIL must be resilient, so a two-device mirror is typically used.

In an HPC environment, the fully-synchronous behavior -- even with a ZIL -- tends to get overwhelmed, and writes begin to lag. For that reason, fully-synchronous behavior is typically disabled on our ZFS file systems: ZFS then queues write operations in RAM instead of on a persistent device. If the node loses power, any queued writes are lost and the file(s) in question may be corrupted. Historically, this has been the case on NFS servers used in HPC environments, so the risk is understood and acceptable.
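Disabling the fully-synchronous behavior is a per-dataset property change. A minimal sketch, using a hypothetical dataset name (the actual dataset names differ per server):

<code>
# Hypothetical dataset name, for illustration only.
zfs set sync=disabled r00nfs0/export

# Confirm the property took effect.
zfs get sync r00nfs0/export
</code>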
With fully-synchronous behavior disabled, the ZIL serves no purpose:

<code>
[root@r00nfs0 ~]# zpool iostat -v
                                            capacity     operations    bandwidth
pool                                      alloc   free   read  write   read  write
----------------------------------------  -----  -----  -----  -----  -----  -----
r00nfs0                                   12.4T  52.6T    104    334  2.90M  11.7M
  raidz3                                  12.4T  52.6T    104    332  2.90M  11.7M
    35000c500950389c7                         -      -     11     37   330K  1.30M
    35000c50095039373                         -      -     11     37   330K  1.30M
    35000c50095208afb                         -      -     11     35   330K  1.30M
    35000c500950394cb                         -      -     11     37   330K  1.30M
    35000c5009503950f                         -      -     11     37   330K  1.30M
    35000c50095039557                         -      -     11     37   330K  1.30M
    35000c5009503964b                         -      -     11     37   330K  1.30M
    35000c5009520775f                         -      -     11     35   330K  1.30M
    35000c5009520835f                         -      -     11     35   330K  1.30M
logs                                          -      -      -      -      -      -
  mirror                                   640K   372G      0      2      0  34.2K
    35000cca0950134c0                         -      -      0      1      0  17.1K
    35000cca095015e6c                         -      -      0      1      0  17.1K
cache                                         -      -      -      -      -      -
  Micron_5100_MTFDDAK240TCC_172619295CF8  78.4G   145G     17     16  2.16M  1.94M
----------------------------------------  -----  -----  -----  -----  -----  -----
</code>

and on an OSS server with Lustre on top of the pool:

<code>
[root@r00oss0 ~]# zpool iostat -v
                         capacity     operations    bandwidth
pool                   alloc   free   read  write   read  write
---------------------  -----  -----  -----  -----  -----  -----
ost0pool               27.8T  37.2T     63    805  3.58M  40.9M
  raidz2               27.8T  37.2T     63    800  3.58M  40.9M
    35000c500950395b3      -      -      7     89   411K  4.54M
    35000c5009515c8a3      -      -      6     87   405K  4.55M
    35000c50095039577      -      -      7     89   407K  4.54M
    35000c50095038d97      -      -      7     89   411K  4.54M
    35000c5009520053f      -      -      6     87   405K  4.55M
    35000c50095092ad7      -      -      7     87   407K  4.54M
    35000c5009503960b      -      -      7     89   411K  4.54M
    35000c5009515fa93      -      -      6     87   405K  4.55M
    35000c500950396cb      -      -      7     89   407K  4.54M
logs                       -      -      -      -      -      -
  mirror                   0   372G      0      5      0  22.0K
    350011731014cb4ec      -      -      0      2      0  11.0K
    350011731014cbdac      -      -      0      2      0  11.0K
---------------------  -----  -----  -----  -----  -----  -----
</code>

In short, none of the first-generation OSS and NFS servers are using the 400 GB SSDs in the external JBOD.

===== Design of Second-Generation =====

With the ZIL unused and the L2ARC device unused on the OSS servers, the second-generation storage servers were designed accordingly:

^Physical location^Qty^Device^Purpose^
|internal|1|960 GB SSD|swap, ''/tmp'', ''/var''|
|external JBOD|1|480 GB SSD|L2ARC|
|external JBOD|11|12000 GB HDD|RAIDZ2 + hot spare (OSS); RAIDZ3 + hot spare (NFS)|

The L2ARC has been moved to the JBOD so that each OSS server's partner can also see the device. The ZIL has been removed altogether, which leaves room for one additional HDD over the first-generation design; the extra HDD, together with the larger per-disk HDD capacity, makes each second-generation pool 66% larger than the previous generation. Additionally, having ''/var'' on persistent storage media makes post-crash analysis possible.

The second-generation node design will require changes to the Warewulf VNFS and provisioning configuration (to partition and format the internal SSD, etc.). The same changes could be leveraged to rebuild the first-generation nodes to match.

===== First-Generation Rebuild =====

To match the second-generation design, the first-generation nodes' SSDs could be repurposed for the OSSs and OSTs:

^Physical location^Qty^Device^Purpose^
|internal|1|240 GB SSD|**swap, ''/var'', ''/tmp''**|
|external JBOD|2|400 GB SSD|**L2ARC + spare**|
|external JBOD|10|8000 GB HDD|RAIDZ2 + hot spare|

The 400 GB ZIL mirror can be destroyed and removed from the ZFS pool, and one of the devices added back as an L2ARC, without taking the OSS node offline (see the command sketch below). A reboot onto a new VNFS and provisioning configuration would be necessary to repurpose the internal 240 GB SSD as swap and stateful OS storage (a partitioning sketch also follows).
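A minimal sketch of the online pool change, assuming the log mirror's vdev name is ''mirror-1'' (the actual name should be taken from ''zpool status'') and reusing one of the 400 GB SSD identifiers reported above:

<code>
# Remove the unused ZIL mirror from the pool; the vdev name "mirror-1"
# is an assumption -- confirm it with `zpool status ost0pool`.
zpool remove ost0pool mirror-1

# Add one of the freed 400 GB SSDs back as an L2ARC (cache) device,
# addressed however the JBOD devices appear on this node.
zpool add ost0pool cache 350011731014cb4ec
</code>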
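The provisioning change for the internal SSD could look roughly like the following; this is a sketch only, with an assumed device name (''/dev/sda'') and illustrative partition sizes rather than the final layout:

<code>
# Assumed internal SSD device name and illustrative sizes.
parted -s /dev/sda mklabel gpt \
  mkpart swap linux-swap 1MiB 65GiB \
  mkpart var  xfs        65GiB 160GiB \
  mkpart tmp  xfs        160GiB 100%

mkswap /dev/sda1
mkfs.xfs /dev/sda2    # becomes /var
mkfs.xfs /dev/sda3    # becomes /tmp
</code>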
That reboot does not have to mean an outage, however: Lustre's H.A. capabilities allow an OST to be gracefully picked up by the OSS's partner server without major service disruption, so a staged reboot of the four first-generation OSTs should be possible without taking the Lustre file system offline (a command-level sketch of the failover appears at the end of this page).

===== Proposed Timeline =====

Integrating the second-generation hardware into Caviness will itself require the modified VNFS and provisioning profile. In that regard, the successful deployment of the second-generation OSS servers serves as a proof of concept for the rebuild of the first-generation OSTs and OSSs.

The second-generation hardware is currently expected to be delivered and in operation by September 2019. Initial integration of a single second-generation OST into the existing Lustre file system for testing seems advisable; during the months of September and October, the new OST design will be tested in production. Integration of the second-generation MDT into the Lustre file system may require some system downtime, so scheduling the integration of the remaining second-generation OSTs and the rebuild of the first-generation OSTs coincident with the MDT integration minimizes possible outages/downtime.
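For reference, a minimal sketch of the graceful OST migration assumed above, using the ''ost0pool'' pool seen earlier; the mount point and dataset name are illustrative assumptions, and the exact procedure on Caviness may differ:

<code>
# On the first-generation OSS about to be rebooted (mount point is illustrative):
umount /mnt/ost0            # stop serving the OST
zpool export ost0pool       # release the pool so the partner can import it

# On the H.A. partner OSS:
zpool import ost0pool
mount -t lustre ost0pool/ost0 /mnt/ost0    # dataset name "ost0" is an assumption
</code>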