====== Caviness Filesystems ======

===== Permanent filesystem =====

The 65 TB permanent filesystem uses 3 TB enterprise-class SATA drives in a triple-parity RAID configuration for high reliability and availability. The filesystem is accessible to the head node via 10 Gbit/s Ethernet and to the compute nodes via 1 Gbit/s Ethernet.

==== Home storage ====

Each user has 20 GB of disk storage reserved for personal use on the home filesystem. Users' home directories are in ''/home'' (e.g., ''/home/1005''), and the directory name is stored in the environment variable ''$HOME'' at login.

The permanent filesystem is configured to allow nearly instantaneous, consistent snapshots. A snapshot contains the original version of the filesystem, while the live filesystem contains any changes made since the snapshot was taken. In addition, all of your files are regularly replicated at UD's off-campus disaster recovery site. You can use read-only [[filesystems#home-and-workgroup-snapshots|snapshots]] to revert to a previous version of a file, or request to have your files restored from the disaster recovery site.

You can check the size and usage of your home directory with the command

<code>
df -h $HOME
</code>

==== Workgroup storage ====

Each research group has at least 1000 GB of shared group ([[abstract:caviness:app_dev:compute_env#using-workgroup-and-directories|workgroup]]) storage in the ''/work'' directory, identified by your workgroup name (e.g., ''/work/it_css''), and referred to as your workgroup directory. It is used for input files, supporting data files, work files, output files, and source code and executables that need to be shared with your research group.

As with your home directory, read-only snapshots of your workgroup's files are taken several times per day. In addition, the filesystem is replicated at UD's off-campus disaster recovery site. [[filesystems#home-and-workgroup-snapshots|Snapshots]] are user-accessible, and older files may be retrieved by special request.

You can check the size and usage of your workgroup directory by using the ''workgroup'' command to spawn a new workgroup shell, which sets the environment variable ''$WORKDIR'', and then running

<code>
df -h $WORKDIR
</code>

**Auto-mounted ZFS dataset:** Workgroup storage is auto-mounted and thus invisible until you use it. When you list the ''/work'' directory you will only see the directories that have been mounted most recently. If you do not see your directory, you can trigger the auto-mount by referencing it in any command, such as ''df''.

===== High-performance filesystem =====

==== Lustre storage ====

User storage is available on a [[abstract:caviness:filesystems:lustre|high-performance Lustre-based filesystem]] with 403 TB of usable space. It is used for temporary input files, supporting data files, work files, and output files associated with computational tasks run on the cluster. The filesystem is accessible to all of the processor cores via the Omni-Path interconnect.

The default stripe count is set to 1, so by default each file is written as a single stripe placed on one of the available OSTs, with files distributed across all OSTs. See [[https://www.nas.nasa.gov/hecc/support/kb/lustre-best-practices_226.html|Lustre Best Practices]] from NASA.

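If you need to check or change how a directory is striped, the standard Lustre ''lfs'' utility can be used. The sketch below is only an example; the directory name is illustrative, and you should consult the [[abstract:caviness:filesystems:lustre|Lustre]] page before changing stripe settings.

<code bash>
# Show the current striping of a directory (or file)
lfs getstripe /lustre/scratch/traine/myproject

# Stripe new files written into this directory across 4 OSTs,
# which can help I/O performance for very large files
lfs setstripe -c 4 /lustre/scratch/traine/myproject
</code>
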
Source code and executables must be stored in and executed from Home (''$HOME'') or Workgroup (''$WORKDIR'') storage. No executables are permitted to run from the Lustre filesystem; there are technical reasons why this is not permissible. However, a script (written in Bash or Python) is not //always// an executable. When you embed a hash-bang (''#!'') line in the script and ''chmod +x'' the file, it is executed directly. But running ''python'' or ''bash'' and asking it to read a script from a file is not the same, as can be seen in the examples below.

<code>
[traine@login00 traine]$ pwd
/lustre/scratch/traine
[traine@login00 traine]$ cat test.py
#!/usr/bin/env python
import sys
print 'This is running with arguments ' + str(sys.argv)
[traine@login00 traine]$ ./test.py arg1 arg2
-bash: ./test.py: Permission denied
[traine@login00 traine]$ /usr/bin/env python test.py arg1 arg2
This is running with arguments ['test.py', 'arg1', 'arg2']
</code>

Executing the Python script directly does not work, but running ''python'' and telling it to read ''test.py'' as the script works fine. The same goes for shell scripts:

<code>
[traine@login00 traine]$ pwd
/lustre/scratch/traine
[traine@login00 traine]$ cat test.sh
#!/bin/bash
echo "This is running with arguments $@"
[traine@login00 traine]$ ./test.sh arg1 arg2
-bash: ./test.sh: Permission denied
[traine@login00 traine]$ bash test.sh arg1 arg2
This is running with arguments arg1 arg2
</code>

The Lustre filesystem is not backed up, nor are there snapshots to recover deleted files. However, it is a robust RAID-6 system, so the filesystem can survive the concurrent failure of two independent hard drives and still rebuild its contents automatically.

The ''/lustre'' filesystem is partitioned as shown below:

^ Directory ^ Description ^
| scratch | Public scratch space for all users |

All users have access to the public scratch directory (''/lustre/scratch''). IT staff may run cleanup procedures as needed to purge aged files or directories in ''/lustre/scratch'' if old files are degrading system performance.

Remember that the Lustre filesystem is temporary disk storage. Your workflow should start by copying needed data files to the high-performance Lustre filesystem, ''/lustre/scratch'', and finish by copying results back to your private ''$HOME'' or shared ''$WORKDIR'' directory. Please clean up (delete) any remaining files in ''/lustre/scratch'' that are no longer needed. If users do not clean up properly, IT staff will ask all users to clean up ''/lustre/scratch'', and may need to enable an automatic cleanup procedure to avoid critical situations in the future. **Note**: A full filesystem inhibits use for everyone, preventing jobs from running.

===== Local filesystem =====

==== Node scratch ====

Each compute node has its own 900 GB local hard drive (32 TB on enhanced local scratch nodes), which is needed for time-critical tasks such as managing virtual memory. The system's use of the local disk is kept as small as possible to leave local disk space for your applications running on the node.

===== Quotas and usage =====

To help users maintain awareness of quotas and their usage on the ''/home'' filesystem, the command ''my_quotas'' is available to display a list of the quota-controlled filesystems on which the user has storage space. For example,

<code>
$ my_quotas
Type  Path                        In-use / kiB  Available / kiB  Pct
----- --------------------------  ------------  ---------------  ----
user  /home/1201                       1691648         20971520    8%
group /work/it_css                    39649280       1048576000    4%
</code>

**IMPORTANT**: All users are encouraged to delete files that are no longer needed once results are gathered and collated. Email notifications are sent to each user when ''$HOME'' is close to or has exceeded its quota. Principal stakeholders are notified via email when ''$WORKDIR'' is close to or has exceeded its quota.

Remember that any quota issue is likely to result in job failures; in particular, an over-quota ''$WORKDIR'' will likely cause jobs to fail for everyone in the workgroup. Cleaning up Lustre is especially important: IT will periodically email all users or principal stakeholders asking them to clean up so that ''/lustre/scratch'' stays below 80% usage, because if ''/lustre/scratch'' fills up, it will likely cause ALL jobs to fail for everyone on Caviness.

Please take the time to periodically clean up your files in ''$HOME'', ''$WORKDIR'' and ''/lustre/scratch'', and do so from a compute node. We recommend using the ''devel'' partition for this purpose: specify your workgroup (e.g., ''workgroup -g it_css'') and use ''salloc --partition=devel'' to get a compute node with the default resources (1 core, 1 GB memory, and 30 minutes) for deleting unnecessary files, as shown in the example below. If you think you will need additional resources (such as more time), see [[abstract:caviness:runjobs:queues#the-devel-partition|Caviness partitions]] for complete details on the maximum resources that may be requested on the ''devel'' partition.

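A cleanup session might look like the following sketch; the workgroup and the directory being removed are only illustrative, so substitute your own.

<code bash>
workgroup -g it_css                        # enter your workgroup shell
salloc --partition=devel                   # get a shell on a devel compute node
rm -r /lustre/scratch/traine/old_project   # delete files you no longer need
exit                                       # release the compute node when finished
</code>
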
==== Home ====

Each user's home directory has a hard quota limit of 20 GB. To check usage, use

<code>
df -h $HOME
</code>

The example below displays the usage for the home directory (''/home/1201'') of the account ''traine'' as 24 MB out of 20 GB.

<code>
Filesystem                          Size  Used Avail Use% Mounted on
r01nfs0-10Gb:/fs/r01nfs0/home/1201   20G   24M   20G   1% /home/1201
</code>

==== Workgroup ====

Each group's work directory has a quota designed to give your group 1 TB of disk space or more, depending on the number of nodes in your workgroup. Use the ''workgroup -g'' command to define the ''$WORKDIR'' environment variable, then use the ''df -h'' command to check usage.

<code>
df -h $WORKDIR
</code>

The example below shows 0 GB used out of the 1.0 TB total size for the ''it_css'' workgroup.

<code>
[traine@login00 ~]$ workgroup -g it_css
[(it_css:traine)@login00 ~]$ df -h $WORKDIR
Filesystem                            Size  Used Avail Use% Mounted on
r01nfs0-10Gb:/fs/r01nfs0/work/it_css  1.0T     0  1.0T   0% /work/it_css
</code>

==== Lustre ====

All of Lustre is considered scratch storage and is subject to removal if necessary for Lustre-performance reasons. All users can create their own directories under ''/lustre/scratch'' and manage them based on an understanding of the concepts of [[abstract:caviness:filesystems:lustre|Lustre]].

To check Lustre usage, use ''df -h /lustre/scratch''. The example below is based on user ''traine'' in workgroup ''it_css'' and shows 225 TB used out of the total filesystem size of 367 TB available on Lustre.

<code>
[(it_css:traine)@login01 ~]$ df -h /lustre/scratch
Filesystem                                  Size  Used Avail Use% Mounted on
10.65.32.18@o2ib:10.65.32.19@o2ib:/scratch  367T  225T  142T  62% /lustre/scratch
</code>

Note that the ''df'' command shows the use of Lustre by all users combined, not just your own.

==== Node scratch ====

The node scratch is mounted on ''/tmp'' on each of your nodes. There is no quota, and if you exceed the physical size of the disk you will get disk-failure messages. To check the usage of the disk, use the ''df -h'' command **on the compute node** where your job is running. We strongly recommend that you refer to the node scratch using the environment variable ''$TMPDIR'', which is defined by Slurm when using ''salloc'', ''srun'' or ''sbatch''. For example, the command

<code>
ssh r00n36 df -h /tmp
</code>

shows size, used and available space in M, G or T units.

<code>
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda3       889G   33M  889G   1% /tmp
</code>

This node, ''r00n36'', has a 900 GB disk with only 33 MB used, so 889 GB is available for your job.

A physical disk is installed on each node and is used for time-critical tasks such as swapping memory. The compute nodes are configured with a 900 GB disk; however, the ''/tmp'' filesystem will never have the full disk available. Large-memory nodes use more of the disk for swap space.

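The sketch below shows one way to take advantage of node scratch from a batch job. The Slurm options, directory and file names, and program are hypothetical placeholders rather than a prescribed template, and it assumes the job was submitted from a workgroup shell so that ''$WORKDIR'' is defined.

<code bash>
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=01:00:00

# $TMPDIR points at this job's directory on the node's local /tmp scratch disk.
# Stage input onto the fast local disk, run, then copy results back before the
# job ends (the per-job scratch directory is removed when the job completes).
cp "$WORKDIR/myproject/input.dat" "$TMPDIR/"
cd "$TMPDIR"
"$WORKDIR/myproject/bin/myprogram" input.dat > output.dat
cp output.dat "$WORKDIR/myproject/"
</code>
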
===== Recovering files =====

==== Home and Workgroup snapshots ====

Snapshots are read-only images of the filesystem at the time the snapshot is taken. They are available under the ''.zfs/snapshot'' directory at the base of the filesystem (e.g., ''$WORKDIR/.zfs/snapshot/'' or ''$HOME/.zfs/snapshot/''). The ''.zfs'' directory does not show up in a directory listing, even with ''ls -a'', because it is hidden, but you can still "go to" the directory with the ''cd'' command.

Inside you will find directories named ''yyyymmdd-HHMM'', where ''yyyy'' is the 4-digit year, ''mm'' is the 2-digit month, ''dd'' is the 2-digit day, ''HH'' is the hour (in 24-hour format), and ''MM'' is the minute within that hour when the snapshot was taken. They are named this way to allow any number of snapshots and to make it easy to identify when each snapshot was taken. Multiple snapshots per day are kept for 3 days, then daily snapshots going back a month; after that, weekly and finally monthly retention policies are in place. This allows file "backups" to be retrieved from well into the past.

When an initial snapshot is taken, no space is used, as it is a read-only reference to the current filesystem image. However, as the filesystem changes, copy-on-write of data blocks causes snapshots to use space. These new blocks used by snapshots do not count against the 1 TB limit that the group's filesystem can reference, but they do count toward a 4 TB limit per research group (workgroup). As directories begin to reach these limits, the number of snapshots is automatically reduced to keep the workgroup and home directories from filling up.

Some example uses of snapshots are:

  * If a file is deleted or modified during the afternoon of November 26th, you can go to the ''20141126-1215'' snapshot and retrieve the file as it existed at that time.
  * If a file was deleted on November 26th and you do not realize it until Monday, you can use the ''20141125-2215'' snapshot to retrieve the file.

=== Example recovering .ssh directory from snapshot ===

By default, your ''.ssh'' directory is set up as part of your account on the clusters with the proper SSH keys to allow you to use ''salloc'' to connect to compute nodes. Sometimes clients report they are no longer able to use ''salloc'' to connect to a compute node because the SSH keys have been changed, usually by accident. No worries: you can restore your ''.ssh'' directory from a snapshot taken when you know it was last working.

For example, say you could use ''salloc'' on December 1st, 2017, but you realized on December 4th, 2017 that it had stopped working. The example below shows how to go to the snapshots in your home directory, find the snapshot directory for December 1st, 2017, which is ''20171201-1315'' for the afternoon (1:15pm) snapshot on that day, and then copy the files from that snapshot to replace the ones no longer working. Just remember that if you made other changes to these files after December 1st, you will lose those changes.

This example shows how to restore your entire ''.ssh'' directory.

<code>
$ cd ~/.zfs/snapshot/20171201-1315
$ rsync -arv .ssh/ ~/.ssh/
</code>

The trailing slashes are important; be sure to include them.

This example shows how to restore individual files from your ''.ssh'' directory.

<code>
$ cd ~/.zfs/snapshot
$ ls -l
:
$ cd 20171201-1315/.ssh
$ ls -l
total 42
-rw------- 1 traine everyone 493 May 11 18:46 authorized_keys
-rw------- 1 traine everyone 365 May 11 15:57 id_ecdsa
-rw-r--r-- 1 traine everyone 272 May 11 15:57 id_ecdsa.pub
-rw-r--r-- 1 traine everyone 757 Aug  1 10:14 known_hosts
$ cp -a * ~/.ssh
cp: overwrite `/home/1201/.ssh/authorized_keys'? y
cp: overwrite `/home/1201/.ssh/id_ecdsa'? y
cp: overwrite `/home/1201/.ssh/id_ecdsa.pub'? y
cp: overwrite `/home/1201/.ssh/known_hosts'? y
$
</code>

==== Usage Recommendations ====

**Home directory**: Use your [[#home|home]] directory to store private files. Application software you use will often store its configuration, history and cache files in your home directory. In general, keep this directory uncluttered and use it for files needed to configure your environment. For example, add [[http://en.wikipedia.org/wiki/Symbolic_link#POSIX_and_Unix-like_operating_systems|symbolic links]] in your home directory that point to files in any of the other directories. The ''/home'' filesystem is backed up with [[#home-and-workgroup-snapshots|snapshots]].

**Workgroup directory**: Use the [[#workgroup|workgroup]] directory (''/work/'' followed by your workgroup name, e.g., ''/work/it_css'') to build applications for you or your group to use, as well as to store important data, modified source, or any other files that need to be shared by your research group. See the [[abstract:caviness:app_dev:app_dev|Application development]] section for information on building applications. You should create a VALET package so that your fellow researchers can access applications you want to share. A typical workflow is to copy the files needed from ''/work'' to ''/lustre/scratch'' for the actual run. The ''/work'' filesystem is backed up with [[#home-and-workgroup-snapshots|snapshots]].

**Public scratch directory**: Use the public [[#lustre|Lustre]] scratch directory (''/lustre/scratch'') for files where high performance is required. Store intermediate work files there, and remove them when your current project is done; that frees up the public scratch workspace others also need. This is also a good place for sharing files and data with all users. Files in this directory are not backed up and are subject to removal. Use [[#lustre-utilities|Lustre utilities]] from a compute node to check disk usage and remove files no longer needed.

**Node scratch directory**: Use the [[#node-scratch|node scratch]] directory (''/tmp'') for temporary files. The job scheduler software (Slurm) creates a subdirectory in ''/tmp'' specifically for each job's temporary files (''$TMPDIR'') on each node assigned to the job. When the job is complete, the subdirectory and its contents are deleted, which automatically frees up the local scratch storage that others may need. Files in node scratch directories are not available to the head node or to other compute nodes.

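To make the recommendations above concrete, the sketch below walks through the typical workflow of staging files from workgroup storage to Lustre and copying results back. The workgroup, directory and file names are illustrative, and ''run.qs'' stands in for whatever job script you use.

<code bash>
workgroup -g it_css                                   # enter your workgroup shell
mkdir -p /lustre/scratch/traine/myproject             # create your own scratch area
cp $WORKDIR/myproject/input.dat /lustre/scratch/traine/myproject/
cd /lustre/scratch/traine/myproject
sbatch $WORKDIR/myproject/run.qs                      # job script and executable stay in $WORKDIR

# ...after the job finishes...
cp output.dat $WORKDIR/myproject/                     # copy results back to workgroup storage
rm -r /lustre/scratch/traine/myproject                # clean up scratch when done
</code>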