====== All About Lustre ======

Buying a better (and more expensive) disk is one way to improve i/o performance, but once the fastest, most expensive disk has been purchased this path leaves no room for further improvement.  The demands of an HPC cluster with several hundred (or even thousands of) compute nodes quickly outpace the speed at which a single disk can shuttle bytes back and forth.  Parallelism saves the day:  store the filesystem blocks on more than one disk and the i/o performance of each will sum (to a degree).  For example, consider a computer that can move data to its hard disks in //1 cycle// with a hard disk that requires //4 cycles// to write a block.  Storing four blocks to just one hard disk would require 20 cycles: 1 cycle to move each block to the disk and 4 cycles to write it, with each block waiting on the completion of the previous:
  
{{ :abstract:caviness:filesystems:serial-vs-parallel.png?300 |Writing 4 blocks to (a) one disk and (b) four disks in parallel.}}
  
With four disks used in parallel (example (b) above), the block writes overlap and take just 8 cycles to complete.
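
The arithmetic can be checked with a couple of lines of shell; this is a throwaway sketch using only the cycle counts assumed above:

<code bash>
# Cycle counts from the example: 1 cycle to hand a block to a disk,
# 4 cycles for the disk to write it, 4 blocks in total.
blocks=4; xfer=1; write=4
echo "one disk:   $(( blocks * (xfer + write) )) cycles"   # each block waits on the previous: 20
echo "four disks: $(( blocks * xfer + write )) cycles"     # the writes overlap: 8
</code>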
  
Parallel use of multiple disks is the key behind many higher-performance disk technologies.  Caviness makes extensive use of //ZFS storage pools//, which allow multiple physical disks to be used as a single unit with parallel i/o benefits.  //Parity data// is constructed when data is written to the pool such that the loss of a hard disk can be sustained: when a disk fails, a new disk is substituted and the parity data yields the missing data for that disk.  ZFS offers different levels of data redundancy, from simple mirroring of data on two disks to triple-parity that can tolerate the failure of three disks.  ZFS double-parity (raidz2) pools form the basis for the Lustre filesystem in Caviness.
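
As a rough illustration, the commands below sketch the creation of a double-parity pool from nine disks plus a hot spare and an SSD read cache; the pool name and device names are hypothetical, not the actual Caviness configuration:

<code bash>
# Hypothetical raidz2 pool: nine data/parity disks, one hot spare, one SSD read cache.
# Any two of the nine disks can fail without losing data.
zpool create ost0pool raidz2 sda sdb sdc sdd sde sdf sdg sdh sdi \
      spare sdj \
      cache nvme0n1
zpool status ost0pool    # shows the raidz2 vdev, the spare, and the cache device
</code>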
  
===== A Storage Node =====
  
The Caviness cluster contains multiple //Object Storage Targets// (OSTs) in each rack, and each OST contains many hard disks.  For example, ''ost0'' contains 10 SATA hard disks (8 TB each, 1 hot spare) managed as a ZFS storage pool, with an SSD acting as a read cache for improved performance:

{{ :abstract:caviness:filesystems:caviness-lustre-oss_ost.png?400 |Example image of Caviness Lustre OSS/OST. }}

Each OST can tolerate the concurrent failure of one or two hard disks at the expense of storage space:  the raw capacity of ''ost0'' is 72 TB, but the data resilience afforded by raidz2 costs around 22% of that capacity (leaving 56 TB usable).  The OST is managed by an //Object Storage Server// (OSS) visible to all login, management, and compute nodes on the cluster network.  Nodes funnel their i/o requests to the OSS, which in turn issues i/o requests to the OST in question.  Each OSS in Caviness has primary responsibility for one OST, but can also handle i/o for its partner OSS's OST.  The figure above shows that ''r00oss0'' is primary for ''ost0'' and, should ''r00oss1'' fail, can also handle i/o for ''ost1''.
  
<note warning>Anytime an OST is unavailable, i/o operations destined for it will "hang" while they wait to be completed.  Users will perceive this hang in everything from file listings to reading from or writing to files residing on that OST.  You can check OST availability using the ''lfs check servers'' command.</note>
===== The Lustre Filesystem =====
  
Each OST increases i/o performance by simultaneously moving data blocks to the hard disks of a raidz2 pool.  Each OSS services its OST, accepting and interleaving many i/o workloads.  Having multiple OSSs and OSTs adds yet another level of parallelism to the scheme.  The agglomeration of multiple OSS nodes, each servicing an OST, is the basis of a Lustre filesystem((The other half of the filesystem is the MDSs (Meta-Data Servers) and MDTs (Meta-Data Targets), which are constructed similarly to an OSS/OST but hold only directory structure and file attributes like name, ownership, and permissions.)).  The figure above shows that each rack in Caviness contains OSSs and OSTs:  Caviness is designed to grow its Lustre filesystem with each additional rack, increasing both the capacity and the performance of this resource.
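
The aggregate nature of the filesystem is easy to see from a client node; a minimal sketch (the mount point is again an assumption):

<code bash>
# Per-target usage for a Lustre filesystem: one line per MDT and OST, with the
# filesystem total -- the sum of the OSTs -- reported at the bottom.
lfs df -h /lustre
</code>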
  
The benefits of a Lustre filesystem should be readily apparent from the discussion above:
  
  * parallelism is leveraged at multiple levels to increase i/o performance
  * raidz2 pools provide resilience
  * filesystem capacity and performance are not limited by hard disk size
<note important>Creating extremely large filesystems has one drawback:  traversing the filesystem takes so much time that it becomes impossible to create off-site backups for further data resilience.  For this reason Lustre filesystems are most often treated as volatile/scratch storage.</note>
  
For large files or files that are internally organized as "records((A //record// consists of a fixed-size sequence of bytes; the //i//-th record exists at an easily calculated offset within the file.))", i/o performance can be further improved by //striping// the file across multiple OSTs.  Striping divides a file into a set of sequential, fixed-size chunks.  The stripes are distributed round-robin to //N// unique Lustre objects -- and thus to //N// unique OSTs.  For example, consider a 13 MiB file:
  
{{ :abstract:caviness:filesystems:caviness-lustre-striping.png?500 |Lustre striping.}}
  
Without striping, all 13 MiB of the file resides in a single object on ''ost0'' (see (a) above).  All i/o with respect to this file is handled by ''oss0''; appending 5 MiB to the file will grow the object to 18 MiB.
  
With a stripe count of three and a stripe size of 4 MiB, the Lustre filesystem pre-allocates three objects on unique OSTs on behalf of the file (see (b) above).  The file is split into sequential 4 MiB segments -- stripes -- and the stripes are written round-robin to the objects allocated to the file.  In this case, appending 5 MiB to the file will see stripe 3 extended to a full 4 MiB and a new stripe of 2 MiB added to the object on ''ost1''.  For large files and record-style files, striping introduces another level of parallelism that can dramatically increase the performance of programs that access them.
  
<note tip>File striping is established when the file is created.  Use the ''lfs setstripe'' command to pre-allocate the objects for a striped file:  ''lfs setstripe -c 4 -s 8m my_new_file.nc'' would create the file ''my_new_file.nc'' containing zero bytes with a stripe size (-s) of 8 MiB and striped across four objects (-c).</note>
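
To confirm how an existing file is striped, ''lfs getstripe'' reports the layout that was chosen when the file was created; a quick sketch using the file from the tip above:

<code bash>
# Show the striping of an existing file: stripe count, stripe size (in bytes),
# and one line per OST object that holds the file's stripes.
lfs getstripe my_new_file.nc
</code>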