Caviness Filesystems

The 65 TB permanent filesystem uses 3 TB enterprise-class SATA drives in a triple-parity RAID configuration for high reliability and availability. The filesystem is accessible to the head node via 10 Gbit/s Ethernet and to the compute nodes via 1 Gbit/s Ethernet.

Each user has 20 GB of disk storage reserved for personal use on the home filesystem. Users' home directories are in /home (e.g., /home/1005), and the directory name is stored in the environment variable $HOME at login. The permanent filesystem is configured to allow nearly instantaneous, consistent snapshots. The snapshot contains the original version of the filesystem, and the live filesystem contains any changes made since the snapshot was taken. In addition, all your files are regularly replicated at UD's off-campus disaster recovery site. You can use read-only snapshots to revert to a previous version, or request to have your files restored from the disaster recovery site.

You can check the size and usage of your home directory with the command

df -h $HOME

Each research group has at least 1000 GB of shared group (workgroup) storage in the /work directory. It is identified by the «investing_entity» (e.g., /work/it_css) and is referred to as your workgroup directory. Use it for input files, supporting data files, work files, output files, source code, and executables that need to be shared with your research group. As with your home directory, read-only snapshots of the workgroup's files are kept from several points during the past week. In addition, the filesystem is replicated at UD's off-campus disaster recovery site. Snapshots are user-accessible, and older files may be retrieved by special request.

You can check the size and usage of your workgroup directory by using the workgroup command to spawn a new workgroup shell, which sets the environment variable $WORKDIR, and then running

df -h $WORKDIR

Auto-mounted ZFS dataset: The workgroup storage is auto-mounted and thus invisible until you use it. When you list the /work directory, you will see only the most recently mounted directories. If you do not see your directory, you can trigger the auto-mount by referring to it in any command, such as df.

User storage is available on a high-performance Lustre-based filesystem with 403 TB of usable space. It is used for temporary input files, supporting data files, work files, and output files associated with computational tasks run on the cluster. The filesystem is accessible to all of the processor cores via Omni-Path Infiniband. The default stripe count is set to 1, and the default striping is a single stripe distributed across all available OSTs on Lustre. See Lustre Best Practices from NASA.

Source code and executables must be stored in and executed from Home ($HOME) or Workgroup ($WORKDIR) storage. No executables are permitted to run from the Lustre filesystem. There are technical reasons why this is not permissible.

However, a script (written in Bash or Python) is not always an executable. When you embed a hash-bang (#!) line in the script and set chmod +x on the file, the script is executed directly and counts as an executable. Running python or bash and asking it to read the script from a file is not the same thing, as the examples below show.

[traine@login00 traine]$ pwd
/lustre/scratch/traine
[traine@login00 traine]$ cat test.py
#!/usr/bin/env python
 
import sys
 
print('This is running with arguments ' + str(sys.argv))
 
[traine@login00 traine]$ ./test.py arg1 arg2
-bash: ./test.py: Permission denied
 
[traine@login00 traine]$ /usr/bin/env python test.py arg1 arg2
This is running with arguments ['test.py', 'arg1', 'arg2']

Executing the Python script directly does not work. Running python and telling it to read test.py as the script to run works fine. The same goes for shell scripts:

[traine@login00 traine]$ pwd
/lustre/scratch/traine
[traine@login00 traine]$ cat test.sh
#!/bin/bash
 
echo "This is running with arguments $@"
 
[traine@login00 traine]$ ./test.sh arg1 arg2
-bash: ./test.sh: Permission denied
[traine@login00 traine]$ bash test.sh arg1 arg2
This is running with arguments arg1 arg2
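
The distinction shown in the two transcripts also matters when one program launches another. The following is a minimal Python 3 sketch, not part of the cluster's tooling: the helper names interpreter_command and run_script and the suffix-to-interpreter table are hypothetical, chosen only for illustration. It builds a command line that hands the script to an interpreter, so the script file itself never needs execute permission.

```python
import subprocess
import sys
from pathlib import Path

# Hypothetical mapping from file suffix to the interpreter that should read it.
INTERPRETERS = {".py": [sys.executable], ".sh": ["bash"]}


def interpreter_command(script, *args):
    """Return an argv list that runs `script` through its interpreter."""
    suffix = Path(script).suffix
    if suffix not in INTERPRETERS:
        raise ValueError("no interpreter known for " + repr(suffix))
    return INTERPRETERS[suffix] + [str(script)] + list(args)


def run_script(script, *args):
    """Run the script via its interpreter and return its standard output."""
    result = subprocess.run(
        interpreter_command(script, *args),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the script fails
    )
    return result.stdout
```

Calling run_script("test.sh", "arg1", "arg2") from /lustre/scratch/traine corresponds to typing bash test.sh arg1 arg2 in the second transcript.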

The Lustre filesystem is not backed up, and there are no snapshots from which to recover deleted files. However, it is a robust RAID-6 system, so the filesystem can survive the concurrent failure of two independent hard drives and still rebuild its contents automatically.

The /lustre filesystem is partitioned as shown below:

Directory  Description
---------  ----------------------------------
scratch    Public scratch space for all users

All users have access to the public scratch directory (/lustre/scratch). IT staff may run cleanup procedures as needed to purge aged files or directories in /lustre/scratch if old files are degrading system performance.

Remember that the Lustre filesystem is temporary disk storage. Your workflow should start by copying the needed data files to the high-performance Lustre filesystem, /lustre/scratch, and finish by copying results back to your private $HOME or shared $WORKDIR directory. Please clean up (delete) any remaining files in /lustre/scratch that are no longer needed. If users do not clean up properly, IT staff will ask all users to clean up /lustre/scratch and may need to enable an automatic cleanup procedure to avoid critical situations in the future.
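
The stage-in, compute, stage-out, clean-up workflow described above can be sketched in Python 3 as follows. This is an illustration under stated assumptions, not a supported tool: the function name run_staged and the compute callback are hypothetical, and the sketch deliberately deletes the scratch copy even when the computation fails, which may not suit jobs whose partial output is worth keeping.

```python
import shutil
import tempfile
from pathlib import Path


def run_staged(input_files, scratch_root, results_dir, compute):
    """Copy inputs to scratch, run compute(), copy results back, clean up.

    compute(run_dir) is a caller-supplied function (hypothetical) that does
    the real work inside run_dir and returns the output files it wrote.
    """
    results_dir = Path(results_dir)
    results_dir.mkdir(parents=True, exist_ok=True)
    # A private, uniquely named run directory under the scratch root.
    run_dir = Path(tempfile.mkdtemp(prefix="run_", dir=scratch_root))
    try:
        for src in input_files:
            shutil.copy2(src, run_dir)       # stage in
        for out in compute(run_dir):
            shutil.copy2(out, results_dir)   # stage out
    finally:
        # Always remove the scratch copy so it does not linger on Lustre.
        shutil.rmtree(run_dir, ignore_errors=True)
    return results_dir
```

On Caviness, scratch_root would be a directory you created under /lustre/scratch, and results_dir a directory under $HOME or $WORKDIR.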

Note: A full filesystem inhibits use for everyone, preventing jobs from running.

Each compute node has its own 900 GB local hard drive (32 TB on enhanced local scratch nodes), which is needed for time-critical tasks such as managing virtual memory. System usage of the local disk is kept as small as possible to leave some local disk for your applications running on the node.

To help users maintain awareness of quotas and their usage on the /home filesystem, the command my_quotas is now available to display a list of the quota-controlled filesystems on which the user has storage space.

For example,

$ my_quotas
Type  Path                       In-use / kiB Available / kiB  Pct
----- -------------------------- ------------ ------------ ----
user  /home/1201                      1691648     20971520   8%
group /work/it_css                   39649280   1048576000   4%

IMPORTANT: All users are encouraged to delete files that are no longer needed once results are gathered and collated. Email notifications are sent to each user when $HOME is close to or has exceeded its quota. Principal stakeholders are notified via email when $WORKDIR is close to or has exceeded its quota. Remember that quota issues will likely result in job failures, and a full $WORKDIR in particular will likely cause jobs to fail for everyone in the workgroup. Cleaning up Lustre is just as important. IT will periodically email all users or principal stakeholders asking them to clean up in order to keep /lustre/scratch below 80%. If /lustre/scratch fills up, this will likely cause ALL jobs to fail for everyone on Caviness.

Please take the time to periodically clean up your files in $HOME, $WORKDIR and /lustre/scratch, and do so from a compute node. We recommend using the devel partition for this purpose. Specify your workgroup (e.g., workgroup -g it_css) and use salloc --partition=devel to get a compute node with the default resources (1 core, 1 GB memory, and 30 minutes) on which to delete unnecessary files. If you think you will need additional resources (such as more time), see Caviness partitions for complete details on the maximum resources that can be requested on the devel partition.
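
The my_quotas output shown earlier reports sizes in kiB, which are awkward to read at a glance. The Python 3 sketch below converts them to GiB and recomputes the percentage. It is an illustration only: the function name parse_my_quotas is hypothetical, and it assumes the second number on each row is the quota size, which is what the Pct column in the sample suggests.

```python
SAMPLE = """\
Type  Path                       In-use / kiB Available / kiB  Pct
----- -------------------------- ------------ ------------ ----
user  /home/1201                      1691648     20971520   8%
group /work/it_css                   39649280   1048576000   4%
"""

KIB_PER_GIB = 1024 ** 2


def parse_my_quotas(text):
    """Parse my_quotas-style text into a list of dicts with sizes in GiB."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows look like: <type> <path> <in-use kiB> <quota kiB> <pct>%
        if len(fields) == 5 and fields[0] in ("user", "group"):
            used, limit = int(fields[2]), int(fields[3])
            rows.append({
                "type": fields[0],
                "path": fields[1],
                "used_gib": used / KIB_PER_GIB,
                "limit_gib": limit / KIB_PER_GIB,
                "pct": 100 * used / limit,
            })
    return rows


if __name__ == "__main__":
    for row in parse_my_quotas(SAMPLE):
        print("{path}: {used_gib:.1f} of {limit_gib:.0f} GiB ({pct:.0f}%)".format(**row))
```

For the sample, 20971520 kiB is exactly 20 GiB and 1048576000 kiB is exactly 1000 GiB, matching the home and workgroup sizes quoted elsewhere on this page.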

Each user's home directory has a hard quota limit of 20 GB. To check usage, use

    df -h $HOME

The example below displays the usage for the home directory (/home/1201) for the account traine as 24 MB out of 20 GB.

Filesystem                          Size  Used Avail Use% Mounted on
r01nfs0-10Gb:/fs/r01nfs0/home/1201   20G   24M   20G   1% /home/1201

Each group's work directory has a quota designed to give your group 1 TB of disk space or more depending on the number of nodes in your workgroup. Use the workgroup -g command to define the $WORKDIR environment variable, then use the df -h command to check usage.

    df -h $WORKDIR

The example below shows 0 GB used from the 1.0 TB total size for the it_css workgroup.

[traine@login00 ~]$ workgroup -g it_css
[(it_css:traine)@login00 ~]$ df -h $WORKDIR
Filesystem                            Size  Used Avail Use% Mounted on
r01nfs0-10Gb:/fs/r01nfs0/work/it_css  1.0T     0  1.0T   0% /work/it_css

All of Lustre is considered scratch storage and is subject to removal if necessary for Lustre performance reasons. All users can create their own directories under /lustre/scratch and should manage them with an understanding of how Lustre works. To check Lustre usage, use df -h /lustre/scratch.

The example below is based on user traine in workgroup it_css showing 225 TB used from a total filesystem size of 367 TB available on Lustre.

[(it_css:traine)@login01 ~]$  df -h /lustre/scratch
Filesystem                                  Size  Used Avail Use% Mounted on
10.65.32.18@o2ib:10.65.32.19@o2ib:/scratch  367T  225T  142T  62% /lustre/scratch

The df -h /lustre/scratch command shows the use of Lustre by all users, not only your own.

The node scratch is mounted on /tmp for each of your nodes. There is no quota, and if you exceed the physical size of the disk you will get disk failure messages. To check the usage of your disk, use the df -h command on the compute node where your job is running.

We strongly recommend that you refer to the node scratch by using the environment variable $TMPDIR, which is defined by Slurm when using salloc, srun, or sbatch.

For example, the command

   ssh r00n36 df -h /tmp

shows size, used and available space in M, G or T units.

Filesystem      Size  Used Avail Use% Mounted on
/dev/sda3       889G   33M  889G   1% /tmp

This node r00n36 has a 900 GB disk, with only 33 MB used, so 889 GB is available for your job.

Each node has a physical disk that is used for time-critical tasks, such as swapping memory. The compute nodes are configured with a 900 GB disk; however, the /tmp filesystem will never have the whole disk available. Large-memory nodes use more of the disk for swap space.
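
The recommendation above to refer to node scratch through $TMPDIR rather than a hard-coded /tmp can be followed from Python 3 as in this minimal sketch. The function names are hypothetical; os.environ and shutil.disk_usage are standard-library calls. Falling back to /tmp when $TMPDIR is unset is an assumption made so the code also runs outside a job.

```python
import os
import shutil


def node_scratch_dir(default="/tmp"):
    """Return the job's scratch directory from $TMPDIR, or `default` if unset."""
    return os.environ.get("TMPDIR", default)


def free_gib(path):
    """Return the free space on the filesystem holding `path`, in GiB."""
    return shutil.disk_usage(path).free / 1024 ** 3


if __name__ == "__main__":
    scratch = node_scratch_dir()
    print("scratch directory:", scratch)
    print("free space: {:.1f} GiB".format(free_gib(scratch)))
```

Checking free_gib before writing large temporary files gives the same information as df -h, and since node scratch has no quota it helps avoid running the disk out of space.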

Snapshots are read-only images of the filesystem at the time the snapshot is taken. They are available under the .zfs/snapshot directory at the base of the filesystem (e.g., $WORKDIR/.zfs/snapshot/ or $HOME/.zfs/snapshot/). The .zfs directory is hidden and does not show up in a directory listing, even with ls -a, but you can still enter it with the cd command. Inside you will find directories named yyyymmdd-HHMM, where yyyy is the 4-digit year, mm is the 2-digit month, dd is the 2-digit day, HH is the hour (in 24-hour format), and MM is the minute within that hour when the snapshot was taken. This naming scheme allows any number of snapshots and makes it easy to identify when each snapshot was taken. Multiple snapshots per day are kept for 3 days, then daily snapshots going back a month, followed by weekly and finally monthly snapshots. This allows file “backups” to be retrieved from well into the past.

When a snapshot is first taken, it uses no space because it is a read-only reference to the current filesystem image. However, as the filesystem changes, data blocks are copied on write, and the snapshots begin to use space. The blocks held by snapshots do not count against the 1 TB limit that the group's filesystem can reference, but they do count toward a 4 TB limit per research group (workgroup). As directories begin to reach these limits, the number of snapshots is automatically reduced to keep the workgroup and home directories from filling up.

Some example uses of snapshots for users are:

  • If a file is deleted or modified during the afternoon of November 26th, you can go to the 20141126-1215 snapshot and retrieve the file as it existed at that time.
  • If a file was deleted on November 26th and you do not realize it until Monday, you can use the 20141125-2215 snapshot to retrieve the file.
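
Because snapshot directories follow the yyyymmdd-HHMM naming scheme, choosing the right one for cases like those above can be automated. The Python 3 sketch below picks the newest snapshot taken at or before a given time. The function names are hypothetical and the list of names in the usage note is illustrative.

```python
from datetime import datetime

SNAPSHOT_FORMAT = "%Y%m%d-%H%M"  # yyyymmdd-HHMM


def snapshot_time(name):
    """Convert a snapshot directory name into a datetime."""
    return datetime.strptime(name, SNAPSHOT_FORMAT)


def latest_snapshot_before(names, when):
    """Return the newest snapshot name taken at or before `when`, or None."""
    candidates = []
    for name in names:
        try:
            taken = snapshot_time(name)
        except ValueError:
            continue  # skip entries that do not follow the naming scheme
        if taken <= when:
            candidates.append((taken, name))
    return max(candidates)[1] if candidates else None
```

With names ["20141125-2215", "20141126-1215"], asking for datetime(2014, 11, 26, 14, 0) selects 20141126-1215, and asking for datetime(2014, 11, 26, 9, 0) selects 20141125-2215. In practice the names could come from os.listdir on your $HOME/.zfs/snapshot or $WORKDIR/.zfs/snapshot directory.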

Example recovering .ssh directory from snapshot

By default, your .ssh directory is set up as part of your account on the clusters with the proper SSH keys to allow you to use salloc to connect to compute nodes. Sometimes users report that they can no longer use salloc to connect to a compute node because the SSH keys have been changed, usually by accident. No worries: you can restore your .ssh directory from a snapshot taken when you know it was still working. For example, say you could use salloc on December 1st, 2017, but you realized on December 4th, 2017 that it had stopped working. The example below shows how to go to the snapshots in your home directory, find the snapshot directory for December 1st, 2017 (20171201-1315 for the 1:15 pm snapshot on that day), and copy the files from this snapshot to replace the ones that no longer work. Just remember that if you made other changes to these files after December 1st, you will lose those changes.

This example shows how to restore your entire .ssh directory.

    $ cd ~/.zfs/snapshot/20171201-1315
    $ rsync -arv .ssh/ ~/.ssh/

The trailing slashes are important; be sure to include them.

This example shows how to restore individual files from your .ssh directory.

$ cd ~/.zfs/snapshot

$ ls -l
   :

$ cd 20171201-1315/.ssh

$ ls -l
total 42
-rw-------  1 traine everyone 493 May 11 18:46 authorized_keys
-rw-------  1 traine everyone 365 May 11 15:57 id_ecdsa
-rw-r--r--  1 traine everyone 272 May 11 15:57 id_ecdsa.pub
-rw-r--r--  1 traine everyone 757 Aug  1 10:14 known_hosts


$ cp -a * ~/.ssh
cp: overwrite `/home/1201/.ssh/authorized_keys'? y
cp: overwrite `/home/1201/.ssh/id_ecdsa'? y
cp: overwrite `/home/1201/.ssh/id_ecdsa.pub'? y
cp: overwrite `/home/1201/.ssh/known_hosts'? y

$

Home directory: Use your home directory to store private files. Application software you use will often store its configuration, history and cache files in your home directory. Generally, keep this directory uncluttered and use it for files needed to configure your environment. For example, add symbolic links in your home directory to point to files in any of the other directories. The /home filesystem is backed up with snapshots.

Workgroup directory: Use the workgroup directory (/work/«investing_entity») to build applications for you or your group to use, and to store important data, modified source, or any other files that need to be shared by your research group. See the Application development section for information on building applications. You should create a VALET package for your fellow researchers to access applications you want to share. A typical workflow is to copy the files needed from /work to /lustre/scratch for the actual run. The /work filesystem is backed up with snapshots.

Public scratch directory: Use the public Lustre scratch directory (/lustre/scratch) for files where high performance is required. Store files produced as intermediate work files, and remove them when your current project is done. That will free up the public scratch workspace that others also need. This is also a good place for sharing files and data with all users. Files in this directory are not backed up and are subject to removal. Use Lustre utilities from a compute node to check disk usage and remove files that are no longer needed.

Node scratch directory: Use the node scratch directory (/tmp) for temporary files. The job scheduler software (Slurm) creates a temporary directory in /tmp specifically for each job's temporary files and makes it available as $TMPDIR. This is done on each node assigned to the job. When the job is complete, the directory and its contents are deleted. This process automatically frees up the local scratch storage that others may need. Files in node scratch directories are not available to the head node or to other compute nodes.

  • Last modified: 2023-01-13 14:04
  • by anita