abstract:farber:filesystems:lustre

With four disks being used in parallel (example (b) above), the block writing overlaps and takes just 8 cycles to complete.
  
Parallel use of multiple disks is the key idea behind many higher-performance disk technologies.  RAID (Redundant Array of Independent Disks) level 6 uses three or more disks to improve i/o performance while retaining //parity// copies of data((The two parity copies in RAID-6 imply that given //N// 2 TB disks, only //N-2// actually store data.  E.g., a three-disk RAID-6 volume has a capacity of 2 TB.)).  Should one or two of the constituent disks fail, the missing data can be reconstructed using the parity copies.  It is RAID-6 that forms the basic building block of the Lustre filesystem on our cluster.
  
===== A Storage Node =====
  
The Farber cluster contains five //storage appliances// that each contain many hard disks.  For example, ''storage1'' contains 36 SATA hard disks (2 TB each) arranged as six 8 TB RAID-6 units:
  
{{ osts.png |The Farber storage1 appliance.}}
  
Each of the six OST (Object Storage Target) units can survive the concurrent failure of one or two hard disks at the expense of storage space:  the raw capacity of ''storage1'' is 72 TB, but the data resilience afforded by RAID-6 costs a full third of that capacity (leaving 48 TB).
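A quick bit of shell arithmetic confirms those numbers (each RAID-6 unit here comprises six of the 2 TB disks, two of which hold the parity copies):

<code bash>
$ echo "$(( (6 - 2) * 2 )) TB usable per OST, $(( 6 * (6 - 2) * 2 )) TB of $(( 36 * 2 )) TB raw in total"
8 TB usable per OST, 48 TB of 72 TB raw in total
</code>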
{{ oss-osts.png |Cluster nodes send i/o requests to an OSS, which services a set of OSTs.}}
  
Each OSS is primarily responsible for one storage appliance's OSTs.  As illustrated above, ''OST0000'' through ''OST0005'' are serviced primarily by ''OSS1''.  If ''OSS1'' were to fail, compute nodes would no longer be able to interact with those OSTs.  This risk is mitigated by having each OSS act as a //failover// OSS for a secondary set of OSTs.  If ''OSS1'' fails, then ''OSS2'' will take control of ''OST0000'' through ''OST0005'' in addition to its own ''OST0006'' through ''OST000B''.  When ''OSS1'' is repaired, it can retake control of its OSTs from its partner.
  
<note important>Failover is not immediate:  usually anywhere from 5 to 10 minutes will pass before the partner OSS has fully assumed control over the OSTs.</note>
  * File system capacity is not limited by hard disk size
  
The capacity of a Lustre filesystem is the sum of the capacities of its constituent OSTs, so a Lustre filesystem can be grown by adding OSTs (and possibly OSSs to serve them).  For example, should the 172 TB Lustre filesystem begin to approach its capacity, additional capacity could be added with zero downtime by purchasing and installing another OSS pair.
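On a client, this per-OST structure is visible with the standard Lustre ''lfs df'' command, which reports usage for the MDT and for each OST individually; the filesystem total is the sum over the OSTs.  A minimal example (the path shown is illustrative):

<code bash>
[traine@n012 ~]$ lfs df -h /lustre/scratch    # one line per MDT/OST plus a filesystem summary
</code>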
  
<note important>Creating extremely large filesystems has one drawback:  traversing the filesystem takes so much time that it becomes impossible to create off-site backups for further data resilience.  For this reason Lustre filesystems are most often treated as volatile/scratch storage.</note>
  
<note tip>Once a file has been created, its striping cannot be changed.  However, creating a new file with ''lfs setstripe'' and copying the contents of the old file into it effectively changes the data's striping pattern.</note>
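As a concrete illustration of that recipe (the file names and the stripe count of 4 are hypothetical):

<code bash>
[traine@n012 ~]$ lfs getstripe bigfile.dat               # inspect the current striping
[traine@n012 ~]$ lfs setstripe -c 4 bigfile.restriped    # create an empty file striped across 4 OSTs
[traine@n012 ~]$ cp bigfile.dat bigfile.restriped        # copy the data into the new layout
[traine@n012 ~]$ mv bigfile.restriped bigfile.dat        # optionally replace the original file
</code>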

===== Lustre utilities =====

Two custom utilities, ''lrm'' and ''ldu'', are available for removing files and checking disk usage on Lustre.  Both make use of specially-written code that rate-limits calls to the ''lstat()'', ''unlink()'' and ''rmdir()'' C functions to minimize the stress placed on Lustre.  Both utilities should be run only on a compute node (reached via ''qlogin'').

<note tip>If you will be processing relatively large directories with ''lrm'' or ''ldu'', it is a good idea to first start a ''screen'' session and then log in to a compute node using ''qlogin''.</note>
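A minimal sketch of that workflow follows; the session name ''lustre-cleanup'' and the assigned compute node are illustrative:

<code bash>
[traine@farber ~]$ screen -S lustre-cleanup    # start a named screen session on the head node
[traine@farber ~]$ qlogin                      # from inside screen, request an interactive compute node
...
[traine@n012 ~]$ ldu --human-readable --stat-limit 100 ./
</code>

Detach from the session with ''Ctrl-a d'' and reattach later with ''screen -r lustre-cleanup'' to check on the run.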

==== Delete or remove (lrm) ====

''lrm'' is a custom utility for removing files on Lustre.  It accepts all of the native ''rm'' utility's options except ''%%--%%force'' and reproduces the runtime behavior of ''rm'' as closely as possible.  An additional option tracks the total size of the removed items and reports it when the removal completes.  ''lrm'' should be run only on a compute node.

<code>
[traine@farber ~]$ lrm
usage:

  lrm {options} <path> {<path> ..}

 options:

  -h/--help                This information
  -V/--version             Version information
  -q/--quiet               Minimal output, please
  -v/--verbose             Increase the level of output to stderr as the program

  --interactive{=WHEN}     Prompt the user for removal of items.  Values for WHEN
                           are never, once (-I), or always (-i).  If WHEN is not
                           specified, defaults to always
  -i                       Shortcut for --interactive=always
  -I                       Shortcut for --interactive=once; user is prompted one time
                           only if a directory is being removed recursively or if more
                           than three items are being removed
  -r/--recursive           Remove directories and their contents recursively

  -s/--summary             Display a summary of how much space was freed...
    -k/--kilobytes         ...in kilobytes
    -H/--human-readable    ...in a size-appropriate unit

  -S/--stat-limit #.#      Rate limit on calls to stat(); floating-point value in
                           units of calls / second
  -U/--unlink-limit #.#    Rate limit on calls to unlink() and rmdir(); floating-
                           point value in units of calls / second
  -R/--rate-report         Always show a final report of i/o rates

 $Id: lrm.c 470 2013-08-22 17:40:01Z frey $
</code>

The example below shows user ''traine'' in workgroup ''it_nss'' on compute node ''n012'' removing the ''/lustre/scratch/traine/projects/namd'' directory and all of its files and subdirectories using the ''%%--%%recursive'' option.  The additional ''%%--%%summary'' option is also used to report how much space was freed, in bytes.  Note that ''traine'' was already in ''/lustre/scratch/traine/projects'' before using ''qlogin'' to log in to the compute node ''n012''.

<code>
[(it_nss:traine)@farber projects]$ qlogin
Your job 369292 ("QLOGIN") has been submitted
waiting for interactive job to be scheduled ...
Your interactive job 369292 has been successfully scheduled.
Establishing /opt/shared/OpenGridScheduler/local/qlogin_ssh session to host n012 ...
Last login: Thu Aug 22 14:32:16 2013 from login000

[traine@n012 projects]$ pwd
/lustre/scratch/traine/projects

[traine@n012 projects]$ lrm --summary --recursive --stat-limit 100 --unlink-limit 100 ./namd
lrm: removed 354826645 bytes
</code>

<note important>Please set both limits (''%%--%%stat-limit'' and ''%%--%%unlink-limit'') to 100 in order to remove files at a slow pace on Lustre.</note>

==== Disk usage (ldu) ====

''ldu'' is a custom utility for summarizing disk usage on Lustre.  For the options listed below, it reproduces the runtime behavior of the native ''du'' utility as closely as possible.  ''ldu'' should be run only on a compute node.

<code>
[traine@farber ~]$ ldu
usage:

  ldu {options} <path> {<path> ..}

 options:

  -h/--help                This information
  -V/--version             Version information
  -q/--quiet               Minimal output, please
  -v/--verbose             Increase the level of output to stderr as the program

  -k/--kilobytes           Display usage sums in kilobytes
  -H/--human-readable      Display usage sums in a size-appropriate unit

  -S/--stat-limit #.#      Rate limit on calls to stat(); floating-point value in
                           units of calls / second
  -R/--rate-report         Always show a final report of i/o rates

 $Id: ldu.c 470 2013-08-22 17:40:01Z frey $
</code>

The example below shows user ''traine'' in workgroup ''it_nss'' on compute node ''n012'' summarizing disk usage for the ''/lustre/scratch/traine/projects'' directory.  Note that ''traine'' was already in ''/lustre/scratch/traine/projects'' before using ''qlogin'' to log in to compute node ''n012''.

<code>
[(it_nss:traine)@farber projects]$ qlogin
Your job 369292 ("QLOGIN") has been submitted
waiting for interactive job to be scheduled ...
Your interactive job 369292 has been successfully scheduled.
Establishing /opt/shared/OpenGridScheduler/local/qlogin_ssh session to host n012 ...
Last login: Thu Aug 22 14:32:16 2013 from login000

[traine@n012 projects]$ pwd
/lustre/scratch/traine/projects

[traine@n012 projects]$ ldu --human-readable --stat-limit 100 ./
[2013-08-22 13:48:00-0400] leon_stat:  25765 calls over 103.396 seconds (249 calls/sec)
[2013-08-22 13:48:37-0400] leon_stat:  25793 calls over 141.063 seconds (183 calls/sec)
[2013-08-22 13:49:19-0400] leon_stat:  25838 calls over 183.257 seconds (141 calls/sec)
[2013-08-22 13:50:43-0400] leon_stat:  26778 calls over 266.790 seconds (100 calls/sec)
821.07 GiB    /lustre/scratch/traine/projects
</code>

<note important>Please set the stat limit (''%%--%%stat-limit'') to 100 in order to summarize the disk usage on Lustre at a slow pace.

You may have noticed that the first rate shown above is NOT 100 as requested.  It takes 1 second for the rate-limiting logic to gather initial ''lstat()'', ''rmdir()'' and ''unlink()'' profiles.  After that, instantaneously meeting the desired rate would have the utility calling ''lstat()'' in bursts with long periods of inactivity between those bursts.  This is not the desired behavior.  Instead, the utility uses much shorter periods of inactivity to //eventually// meet the requested rate.</note>
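To make that pacing idea concrete, the following is a rough, hypothetical shell sketch (not the actual ''lrm''/''ldu'' code; the file list and ''target'' value are illustrative).  It inserts short sleeps so that the cumulative ''stat()'' rate settles toward the requested target rather than alternating bursts with long pauses:

<code bash>
target=100                        # requested calls per second (cf. --stat-limit)
start=$(date +%s.%N)
calls=0
for f in *; do
    stat "$f" > /dev/null         # the rate-limited operation
    calls=$((calls + 1))
    now=$(date +%s.%N)
    # The minimum elapsed time needed to stay at or below the target is calls/target
    # seconds; sleep only for whatever portion of that has not already elapsed.
    pause=$(awk -v c="$calls" -v t="$target" -v s="$start" -v n="$now" \
                'BEGIN { d = c / t - (n - s); if (d < 0) d = 0; printf "%.3f", d }')
    sleep "$pause"
done
</code>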

<note tip>For long-running ''lrm'' or ''ldu'' runs, the ''USR1'' signal will cause the utility to display the current i/o rate(s), as seen in the example above.  The ''USR1'' signal can be delivered to the utility by finding its process id (using ''ps'') and then issuing the ''kill -USR1 <pid>'' command on the compute node on which ''lrm'' or ''ldu'' is running:<code bash>
[traine@n012 ~]$ ps -u traine
  PID TTY          TIME CMD
 7059 ?        00:00:00 sshd
 7060 pts/12   00:00:00 bash
 8834 pts/12   00:00:00 ps
 6000 ?        00:00:00 sshd
 6001 pts/12   00:00:00 bash
 6008 pts/12   00:00:00 ldu

[traine@n012 ~]$ kill -USR1 6008
</code></note>
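If only a single ''ldu'' (or ''lrm'') process is running under your account on that node, ''pkill'' can deliver the same signal by process name rather than by process id:

<code bash>
[traine@n012 ~]$ pkill -USR1 -u traine ldu
</code>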