Caviness Filesystems
Permanent filesystem
The 65 TB permanent filesystem uses 3 TB enterprise-class SATA drives in a triple-parity RAID configuration for high reliability and availability. The filesystem is accessible to the head node via 10 Gbit/s Ethernet and to the compute nodes via 1 Gbit/s Ethernet.
Home storage
Each user has 20 GB of disk storage reserved for personal use on the home filesystem. Users' home directories are in /home (e.g., /home/1005), and the directory name is stored in the environment variable $HOME at login.
The permanent file system is configured to allow nearly instantaneous, consistent snapshots. The snapshot contains the original version of the file system, and the live file system contains any changes made since the snapshot was taken. In addition,
all your files are regularly replicated at UD's off-campus disaster recovery site. You can use read-only snapshots to revert to a previous version of a file, or request to have your files restored from the disaster recovery site.
You can check to see the size and usage of your home directory with the command
df -h $HOME
Workgroup storage
Each research group has at least 1000 GB of shared group (workgroup) storage in the /work directory, identified by the «investing_entity» (e.g., /work/it_css); this is referred to as your workgroup directory. Use it for input files, supporting data files, work files, output files, and any source code and executables that need to be shared with your research group.
Just as with your home directory, read-only snapshots of your workgroup's files are made several times throughout the past week. In addition, the filesystem is replicated at UD's off-campus disaster recovery site. Snapshots are user-accessible, and older files may be retrieved by special request.
You can check the size and usage of your workgroup directory by using the workgroup command to spawn a new workgroup shell, which sets the environment variable $WORKDIR, and then running
df -h $WORKDIR
Note: because /work is auto-mounted, listing it shows only the directories most recently mounted. If you do not see your directory, you can trigger the auto-mount by referencing it in any command, such as the df command above.
High-performance filesystem
Lustre storage
User storage is available on a high-performance Lustre-based filesystem with 403 TB of usable space. This is used for temporary input files, supporting data files, work files, and output files associated with computational tasks run on the cluster. The filesystem is accessible to all of the processor cores via Omni-Path. The default stripe count is set to 1, with single-stripe files distributed across all available OSTs on Lustre. See Lustre Best Practices from NASA.
Source code and executables should be kept in and run from Home ($HOME) or Workgroup ($WORKDIR) storage. No executables are permitted to run from the Lustre filesystem; there are technical reasons why this is not permissible.
However, a script (written in Bash or Python) is not always an executable. When you embed a hash-bang (#!) line in the script and run chmod +x on the file, it is executed directly. But running python or bash and asking it to read the script from a file is not the same, as can be seen in the examples below.
[traine@login00 traine]$ pwd
/lustre/scratch/traine
[traine@login00 traine]$ cat test.py
#!/usr/bin/env python

import sys
print 'This is running with arguments ' + str(sys.argv)
[traine@login00 traine]$ ./test.py arg1 arg2
-bash: ./test.py: Permission denied
[traine@login00 traine]$ /usr/bin/env python test.py arg1 arg2
This is running with arguments ['test.py', 'arg1', 'arg2']
Executing the Python script directly does not work, but running python and telling it to read test.py as the script to run works fine. The same goes for shell scripts:
[traine@login00 traine]$ pwd
/lustre/scratch/traine
[traine@login00 traine]$ cat test.sh
#!/bin/bash

echo "This is running with arguments $@"
[traine@login00 traine]$ ./test.sh arg1 arg2
-bash: ./test.sh: Permission denied
[traine@login00 traine]$ bash test.sh arg1 arg2
This is running with arguments arg1 arg2
The Lustre filesystem is not backed up, nor are there snapshots to recover deleted files. However, it is a robust RAID-6 system, so it can survive the concurrent failure of two independent hard drives and still rebuild its contents automatically.
The /lustre filesystem is partitioned as shown below:
Directory | Description
---|---
scratch | Public scratch space for all users
All users have access to the public scratch directory (/lustre/scratch). IT staff may run cleanup procedures as needed to purge aged files or directories in /lustre/scratch if old files are degrading system performance.
A typical workflow is to stage your job's files in /lustre/scratch, run your computation there, and finish by copying the results back to your private $HOME or shared $WORKDIR directory. Please clean up (delete) all remaining files in /lustre/scratch that are no longer needed. If users do not clean up properly, IT staff will request that all users clean up /lustre/scratch, but may need to enable an automatic cleanup procedure to avoid critical situations in the future.
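The cleanup step can be scripted. Below is a minimal sketch; the helper name, the example directory, and the 30-day threshold are all illustrative, not site policy:

```shell
# list_stale_files: print files under a directory that have not been
# modified in more than the given number of days. Review the list, then
# delete only once you are sure nothing is still needed.
list_stale_files() {
  local dir="$1" days="$2"
  find "$dir" -type f -mtime +"$days" -print
}

# Dry run first, then pipe to rm only when satisfied, e.g.:
#   list_stale_files /lustre/scratch/traine 30
#   list_stale_files /lustre/scratch/traine 30 | xargs -r rm --
```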
Note: A full filesystem prevents jobs from running and inhibits use for everyone.
Local filesystem
Node scratch
Each compute node has its own 900 GB local hard drive (or 32 TB on enhanced local scratch nodes), which is needed for time-critical tasks such as managing virtual memory. System usage of the local disk is kept as small as possible to leave local disk space for your applications running on the node.
Quotas and usage
To help users maintain awareness of quotas and their usage on the /home
filesystem, the command my_quotas
is now available to display a list of the quota-controlled filesystems on which the user has storage space.
For example,
$ my_quotas
Type  Path                       In-use / kiB  Available / kiB  Pct
----- -------------------------- ------------  ---------------  ----
user  /home/1201                      1691648         20971520   8%
group /work/it_css                   39649280       1048576000   4%
Users are notified via email when $HOME is close to or has exceeded its quota. Principal stakeholders are notified via email when $WORKDIR is close to or has exceeded its quota. Remember, quota issues will likely result in job failures; in particular, a full $WORKDIR will likely cause jobs to fail for everyone in the workgroup. Cleaning up Lustre is also extremely important. IT will periodically email all users or principal stakeholders asking them to clean up in order to keep /lustre/scratch below 80%. If /lustre/scratch fills up, it will likely cause ALL jobs to fail for everyone on Caviness.
You should clean up $HOME, $WORKDIR and /lustre/scratch by doing so from a compute node. We recommend using the devel partition for this purpose. Specify your workgroup (e.g., workgroup -g it_css) and use salloc --partition=devel to be placed on a compute node with the default resources (1 core, 1 GB memory, and 30 minutes) to delete unnecessary files. If you think you will need additional resources (such as more time), see Caviness partitions for complete details on the maximum resources that may be requested on the devel partition.
Home
Each user's home directory has a hard quota limit of 20 GB. To check usage, use
df -h $HOME
The example below displays the usage for the home directory (/home/1201
) for the account traine
as 24 MB out of 20 GB.
Filesystem                          Size  Used Avail Use% Mounted on
r01nfs0-10Gb:/fs/r01nfs0/home/1201   20G   24M   20G   1% /home/1201
Workgroup
Each group's work directory has a quota designed to give your group 1 TB of disk space or more depending on the number of nodes in your workgroup. Use the workgroup -g
command to define the $WORKDIR
environment variable, then use the df -h
command to check usage.
df -h $WORKDIR
The example below shows 0 GB used from the 1.0 TB total size for the it_css
workgroup.
[traine@login00 ~]$ workgroup -g it_css
[(it_css:traine)@login00 ~]$ df -h $WORKDIR
Filesystem                            Size  Used Avail Use% Mounted on
r01nfs0-10Gb:/fs/r01nfs0/work/it_css  1.0T     0  1.0T   0% /work/it_css
Lustre
All of Lustre is considered scratch storage and subject to removal if necessary for Lustre-performance reasons. All users can create their own directories under the /lustre/scratch
directory and manage them based on understanding the concepts of Lustre. To check Lustre usage, use df -h /lustre/scratch
.
The example below is based on user traine
in workgroup it_css
showing 225 TB used from a total filesystem size of 367 TB available on Lustre.
[(it_css:traine)@login01 ~]$ df -h /lustre/scratch
Filesystem                                  Size  Used Avail Use% Mounted on
10.65.32.18@o2ib:10.65.32.19@o2ib:/scratch  367T  225T  142T  62% /lustre/scratch
The df -h /lustre command shows the use of Lustre for all users.
Node scratch
The node scratch is mounted on /tmp on each compute node. There is no quota; if you exceed the physical size of the disk, you will get disk-failure messages. To check the usage of your disk, use the df -h command on the compute node where your job is running.
We strongly recommend that you refer to the node scratch by using the environment variable, $TMPDIR
, which is defined by SLURM when using salloc
or srun
or sbatch
.
For example, the command
ssh r00n36 df -h /tmp
shows size, used and available space in M, G or T units.
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda3       889G   33M  889G   1% /tmp
This node r00n36
has a 900 GB disk, with only 33 MB used, so 889 GB is available for your job.
Note: the /tmp filesystem will never have the node's total disk capacity available, because large-memory nodes use more of the disk for swap space.
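A common pattern is to stage data into node scratch, run there, and copy results back before the job ends. Below is a minimal sketch; the helper name and example paths are illustrative, and $TMPDIR is only set by Slurm inside a job, so the sketch falls back to /tmp otherwise:

```shell
# stage_and_run: copy an input file into node-local scratch, run the
# application against the local copy, and copy the result back out.
# Falls back to /tmp when $TMPDIR is not set (i.e., outside a Slurm job).
stage_and_run() {
  local input="$1" result_dir="$2"
  local scratch="${TMPDIR:-/tmp}"
  cp "$input" "$scratch/"
  # ... run your application here on "$scratch/$(basename "$input")" ...
  mkdir -p "$result_dir"
  cp -p "$scratch/$(basename "$input")" "$result_dir/"
}

# Inside a batch script you might call, e.g.:
#   stage_and_run "$WORKDIR/project/input.dat" "$WORKDIR/project/results"
```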
Recovering files
Home and Workgroup snapshots
Snapshots are read-only images of the filesystem at the time the snapshot is taken. They are available under the .zfs/snapshot
directory from the base of the filesystem (e.g., $WORKDIR/.zfs/snapshot/
or $HOME/.zfs/snapshot/
). The .zfs
directory does not show up in a directory listing using ls -a
as it is hidden, but you can "go to" the directory with the cd command. In there you will find directories with the name of yyyymmdd-HHMM
, where the yyyy
is the 4 digit year, mm
is the 2 digit month, dd
is a 2 digit day, HH
is the hour (in 24-hour format) of the day, and MM
is the minute within that hour when the snapshot was taken. They are named this way to allow any number of snapshots and to easily identify when each snapshot was taken. Multiple snapshots are kept per day for 3 days, then daily snapshots going back a month; after this, weekly and finally monthly retention policies are in place. This allows retrieving file "backups" from the system well into the past.
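Because the fixed-width yyyymmdd-HHMM names sort chronologically, finding the right snapshot can be scripted. A minimal sketch, assuming the $HOME/.zfs/snapshot layout described above (the helper name is our own):

```shell
# latest_snapshot_before: among snapshot names of the form yyyymmdd-HHMM,
# print the most recent one taken at or before the given cutoff.
# Fixed-width names make plain string comparison chronologically correct.
latest_snapshot_before() {
  local snapdir="$1" cutoff="$2"    # cutoff also in yyyymmdd-HHMM form
  ls "$snapdir" | awk -v c="$cutoff" '$0 <= c' | sort | tail -n 1
}

# e.g., the last snapshot taken on or before Dec 1st, 2017:
#   latest_snapshot_before "$HOME/.zfs/snapshot" 20171201-2359
```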
When an initial snapshot is taken, no space is used, as it is a read-only reference for the current filesystem image. However, as the filesystem changes, copy-on-write of data blocks causes snapshots to use space. These new blocks used by snapshots do not count against the 1 TB limit that the group's filesystem can reference, but they do count toward a 4 TB limit per research group (workgroup). As directories begin to reach these limits, the number of snapshots will automatically be reduced to keep the workgroup and home directories from filling up.
Some example uses of snapshots for users are:
- If a file is deleted or modified during the afternoon of November 26th, you can go to the 20141126-1215 snapshot and retrieve the file as it existed at that time.
- If a file was deleted on November 26th and you do not realize it until the following Monday, you can use the 20141125-2215 snapshot to retrieve the file.
Example recovering .ssh directory from snapshot
By default, your .ssh
directory is set up for you as part of your account on the clusters to have the proper SSH keys to allow you to use salloc
to connect to compute nodes. Sometimes clients report they are no longer able to use salloc
to connect to a compute node because the SSH keys have been changed, usually by accident. No worries, you can restore your .ssh
directory from a snapshot when you know it was last working. For example, say you could use salloc
on December 1st, 2017, but you realized on December 4th, 2017 that it stopped working. The example below shows how to go to the snapshot in your home directory, find the corresponding snapshot directory for December 1st, 2017 which is 20171201-1315
for the afternoon (1:15 pm) snapshot on that day, and then copy the files from this snapshot to replace the ones that no longer work. Just remember: if you made other changes to these files after December 1st, you will lose those changes.
This example shows how to restore your entire .ssh directory.
$ cd ~/.zfs/snapshot/20171201-1315
$ rsync -arv .ssh/ ~/.ssh/
This example shows how to restore individual files from your .ssh directory.
$ cd ~/.zfs/snapshot
$ ls -l
:
$ cd 20171201-1315/.ssh
$ ls -l
total 42
-rw------- 1 traine everyone 493 May 11 18:46 authorized_keys
-rw------- 1 traine everyone 365 May 11 15:57 id_ecdsa
-rw-r--r-- 1 traine everyone 272 May 11 15:57 id_ecdsa.pub
-rw-r--r-- 1 traine everyone 757 Aug  1 10:14 known_hosts
$ cp -a * ~/.ssh
cp: overwrite `/home/1201/.ssh/authorized_keys'? y
cp: overwrite `/home/1201/.ssh/id_ecdsa'? y
cp: overwrite `/home/1201/.ssh/id_ecdsa.pub'? y
cp: overwrite `/home/1201/.ssh/known_hosts'? y
$
Usage Recommendations
Home directory: Use your home directory to store private files. Application software you use will often store its configuration, history and cache files in your home directory. Generally, keep this directory free and use it for files needed to configure your environment. For example, add symbolic links in your home directory that point to files in any of the other directories. The /home filesystem is backed up with snapshots.
Workgroup directory: Use the workgroup directory (/work/«investing_entity») to build applications for you or your group to use, and to store important data, modified source, or any other files that need to be shared by your research group. See the Application development section for information on building applications. You should create a VALET package for your fellow researchers to access applications you want to share. A typical workflow is to copy the files needed from /work to /lustre/scratch for the actual run. The /work filesystem is backed up with snapshots.
Public scratch directory: Use the public Lustre scratch directory (/lustre/scratch) for files where high performance is required. Store intermediate work files there, and remove them when your current project is done; that frees up the public scratch workspace others also need. This is also a good place for sharing files and data with all users. Files in this directory are not backed up and are subject to removal. Use Lustre utilities from a compute node to check disk usage and remove files that are no longer needed.
Node scratch directory: Use the node scratch directory ($TMPDIR) for temporary files. The job scheduler software (Slurm) creates a subdirectory under /tmp specifically for each job's temporary files, on each node assigned to the job. When the job is complete, the subdirectory and its contents are deleted, automatically freeing up the local scratch storage that others may need. Files in node scratch directories are not available to the head node or to other compute nodes.