
Farber Storage

The 65 TB permanent filesystem uses 3 TB enterprise-class SATA drives in a triple-parity RAID configuration for high reliability and availability. The filesystem is accessible to the head node via 10 Gbit/s Ethernet and to the compute nodes via 1 Gbit/s Ethernet.

Each user has 20 GB of disk storage reserved for personal use on the home filesystem. Users' home directories are in /home (e.g., /home/1005), and the directory name is stored in the environment variable $HOME at login. The permanent filesystem is configured to allow nearly instantaneous, consistent snapshots. A snapshot contains the original version of the filesystem, and the live filesystem contains any changes made since the snapshot was taken. In addition, all your files are regularly replicated at UD's off-campus disaster recovery site. You can use the read-only snapshots to revert to a previous version of a file, or request to have your files restored from the disaster recovery site.

You can check the size and usage of your home directory with the command

df -h $HOME

Each research group has at least 1000 GB of shared group (workgroup) storage in the /home/work directory, identified by the «investing_entity» (e.g., /home/work/it_css); this is referred to as your workgroup directory. Use it for input files, supporting data files, work files, output files, and source code and executables that need to be shared with your research group. As with your home directory, read-only snapshots of the workgroup's files are taken several times per day (see the snapshot retention details below). In addition, the filesystem is replicated at UD's off-campus disaster recovery site. Snapshots are user-accessible, and older files may be retrieved by special request.

You can check the size and usage of your workgroup directory by using the workgroup command to spawn a new workgroup shell, which sets the environment variable $WORKDIR

df -h $WORKDIR
Auto-mounted ZFS dataset: The workgroup storage is auto-mounted and thus invisible until you use it. When you list the /home/work directory, you will see only the most recently mounted directories. If you do not see your directory, you can trigger the auto-mount by referencing it in any command, such as df.
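
For example (using the it_css workgroup from the examples on this page), simply referencing the path triggers the mount:

$ ls /home/work              # your workgroup directory may not be listed yet
$ df -h /home/work/it_css    # referencing the path auto-mounts it
$ ls /home/work              # it_css now appears in the listing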

User storage is available on a high-performance Lustre-based filesystem with 257 TB of usable space. It is used for temporary input files, supporting data files, work files, and output files associated with computational tasks run on the cluster. The filesystem is accessible to all of the processor cores via 56 Gbps (FDR) InfiniBand. The default stripe count is 1, so each file is written to a single OST, with files distributed across all available OSTs on Lustre. See Lustre Best Practices from NASA.
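
If a particular workload would benefit from a different layout (for example, a few very large files read in parallel), the standard Lustre lfs utility can display and, where permitted, change striping. The paths and stripe count below are only illustrative:

$ lfs getstripe -c /lustre/scratch/myproject/bigfile.dat    # show the stripe count of an existing file
$ lfs setstripe -c 4 /lustre/scratch/myproject              # new files in this directory use 4 OSTs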

Source code and executables must be stored in and executed from Home ($HOME) or Workgroup ($WORKDIR) storage. No executables are permitted to run from the Lustre filesystem.

The Lustre filesystem is not backed up. However, it is a robust RAID-6 system, so it can survive the concurrent failure of two independent hard drives and still rebuild its contents automatically.

The /lustre filesystem is partitioned as shown below:

Directory   Description
work        Private work directories for individual investor-groups
scratch     Public scratch space for all users

All users will use the public scratch directory (/lustre/scratch). IT staff will run regular cleanup procedures to purge aged files or directories in /lustre/scratch to avoid degrading system performance.

An investing-entity may purchase private Lustre storage, mounted in /lustre/work, that is not subject to the same regular cleanup procedures as /lustre/scratch. Each investing-entity's principal stakeholder is responsible for maintenance of the group's private Lustre directory. The default group ownership for a file created in a private work directory is the workgroup's name, and its default file permissions are 644.
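
Because the default permissions are 644 (group members can read but not write), you may need to loosen permissions on files your collaborators should be able to edit. The file path below is only an example:

$ chmod 664 /lustre/work/it_css/shared_input.dat    # allow the group to write to an existing file
$ umask 002                                         # files created later in this shell default to 664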

Remember that all of the Lustre filesystem is temporary disk storage. Your workflow should start by copying needed data files to the high-performance Lustre filesystem, /lustre/scratch, and finish by copying results back to your private /home or shared /home/work directory. Please clean up (delete) any remaining files in /lustre/scratch that are no longer needed, using the custom Lustre utilities. If you do not clean up properly, your files will be purged from /lustre/scratch by the regular cleanup procedures.

Note: A full filesystem inhibits use for everyone.
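
A minimal sketch of the staging workflow described above is shown below; the directory and program names are placeholders, and the custom Lustre utilities should be used for the final cleanup where they apply:

$ mkdir -p /lustre/scratch/$USER/myrun                      # personal staging area on Lustre
$ cp $WORKDIR/myproject/input.dat /lustre/scratch/$USER/myrun/
$ cd /lustre/scratch/$USER/myrun
$ $WORKDIR/myproject/bin/my_app input.dat > results.out     # executable stored in workgroup storage, not on Lustre
$ cp results.out $WORKDIR/myproject/results/
$ rm -r /lustre/scratch/$USER/myrun                         # clean up once the results are safely copied back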

Each compute node has its own 500 GB local hard drive, which is needed for time-critical tasks such as managing virtual memory. The system's usage of the local disk is kept as small as possible to leave room for your applications running on the node.

A mount point is the location where a filesystem is attached within the cluster's hierarchical file system. All the filesystems appear as one unified directory tree. Here is a list of mount points on Farber:

Mount point  Backed up  Description
/home        yes        Permanent storage, available to all nodes. The location for each user's home directory. A per-user quota system constrains each user.
/lustre      no         High-performance storage available to all nodes. May be used for group and for individual use. Appropriate space allocation is controlled by agreed-upon published policies.
/scratch     no         A local storage disk only accessible by a single node. Typically used by applications running on that node and having extensive I/O. It is your responsibility to remove files left on that filesystem at the end of your job if the job scheduler hasn't removed them automatically.

To help users maintain awareness of quotas and their usage on the /home filesystem, the command my_quotas is now available to display a list of the quota-controlled filesystems on which the user has storage space.

For example,

$ my_quotas
Type  Path                       In-use / kiB Available / kiB  Pct
----- -------------------------- ------------ ------------ ----
user  /home/1201                      1691648     20971520   8%
group /home/work/it_css              39649280   1048576000   4%

Each user's home directory has a hard quota limit of 20 GB. To check usage, use

    df -h $HOME

The example below displays the usage for the home directory (/home/1201) for the account traine as 3.0 MB out of 20 GB.

Filesystem            Size  Used Avail Use% Mounted on
storage-nfs1:/export/home/1201
                       20G  3.0M   20G   1% /home/1201

Each group's work directory has a quota designed to give your group 1 TB of disk space or more depending on the number of nodes in your workgroup. Use the workgroup -g command to define the $WORKDIR environment variable, then use the df -h command to check usage.

    df -h $WORKDIR

The example below shows 0 GB used from the 1000 GB total size for the it_css workgroup.

[traine@farber ~]$ workgroup -g it_css
[(it_css:traine)@farber ~]$ df -h $WORKDIR
Filesystem            Size  Used Avail Use% Mounted on
storage-nfs1:/export/work/it_css
                     1000G     0 1000G   0% /home/work/it_css

All of Lustre is considered scratch storage and subject to removal if necessary for Lustre-performance reasons. All users can create their own directories under the /lustre/scratch directory and manage them using custom Lustre utilities. To check Lustre usage, use df -h /lustre.

The example below is based on user traine in workgroup it_css showing 5.9 GB used from a total filesystem size of 257 TB available on Lustre.

[(it_css:traine)@farber ~]$ df -h /lustre
Filesystem            Size  Used Avail Use% Mounted on
ddn-mds2-ib@o2ib1:ddn-mds1-ib@o2ib1:/farber
                      257T  5.9G  244T   1% /lustre
The df -h /lustre command shows Lustre usage for all users.
Please use the custom Lustre utilities to remove files and check disk usage on the Lustre filesystem.

The node scratch is mounted on /scratch on each of your nodes. There is no quota, and if you exceed the physical size of the disk, your writes will fail with error messages. To check the usage of the disk, use the df -h command on the compute node.

For example, the command

   ssh n036 df -h /scratch

shows size, used and available space in M, G or T units.

Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2       457G  198M  434G   1% /scratch

This node n036 has a 500 GB disk, with 457 GB available for your applications.

There is a physical disk installed on each node that is used for time-critical tasks, such as swapping memory. The compute nodes are configured with a 500 GB disk; however, the /scratch filesystem will never have the disk's full capacity available, since large-memory nodes use more of the disk for swap space.

We strongly recommend that you refer to the node scratch directory by using the environment variable $TMPDIR, which is defined by Grid Engine when using qsub or qlogin.
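
A minimal Grid Engine batch script using $TMPDIR might look like the sketch below; the project paths and application name are placeholders:

   #!/bin/bash
   #$ -N tmpdir_example
   #$ -cwd
   #
   # Stage input into the node-local scratch directory Grid Engine created for this job.
   cp $WORKDIR/myproject/input.dat $TMPDIR/
   cd $TMPDIR
   $WORKDIR/myproject/bin/my_app input.dat > output.dat
   # Copy results back before the job finishes; $TMPDIR is deleted when the job ends.
   cp output.dat $WORKDIR/myproject/results/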

Snapshots are read-only images of the filesystem at the time the snapshot is taken. They are available under the .zfs/snapshot directory at the base of the filesystem (e.g., $WORKDIR/.zfs/snapshot/ or $HOME/.zfs/snapshot/). The .zfs directory is hidden and does not show up in a directory listing, even with ls -a, but you can "go to" the directory with the cd command. In it you will find directories named yyyymmdd-HHMM, where yyyy is the 4-digit year, mm is the 2-digit month, dd is the 2-digit day, HH is the hour (in 24-hour format), and MM is the minute within that hour when the snapshot was taken. They are named this way to allow any number of snapshots and to make it easy to identify when each snapshot was taken. Multiple snapshots are kept per day for 3 days, then daily snapshots going back a month; after that, weekly and finally monthly retention policies are in place. This allows retrieving file “backups” from well into the past.

When an initial snapshot is taken, no space is used, as it is a read-only reference to the current filesystem image. However, as the filesystem changes, copy-on-write of data blocks causes the snapshots to use space. These new blocks used by snapshots do not count against the 1 TB limit that the group's filesystem can reference, but they do count toward a 4 TB limit per research group (workgroup). As directories approach these limits, the number of snapshots is automatically reduced to keep the workgroup and home directories from filling up.

Some example uses of snapshots for users are:

  • If a file is deleted or modified during the afternoon on November 26th, you can go to the 20141126-1215 snapshot and retrieve the file as it existed at that time.
  • If a file was deleted on November 26th and you do not realize it until Monday, you can use the 20141125-2215 snapshot to retrieve the file.

Example: recovering the .ssh directory from a snapshot

By default, your .ssh directory is set up as part of your account on the clusters with the proper SSH keys to allow you to use qlogin to connect to compute nodes. Sometimes clients report they are no longer able to use qlogin to connect to a compute node because the SSH keys have been changed, usually by accident. No worries: you can restore your .ssh directory from a snapshot taken when you know it was last working. For example, say you could use qlogin on December 1st, 2017, but you realized on December 4th, 2017 that it had stopped working. The example below shows how to go to the snapshots in your home directory, find the snapshot directory corresponding to December 1st, 2017, which is 20171201-1315 (the afternoon, 1:15 p.m., snapshot on that day), and then copy the files from this snapshot to replace the ones no longer working. Just remember that if you made other changes to these files after December 1st, you will lose those changes.

$ cd ~/.zfs/snapshot

$ ls -l
   :

$ cd 20171201-1315/.ssh

$ ls -l
total 11
-rw-r--r-- 1 traine everyone  221 Jan 27  2017 authorized_keys
-rw------- 1 traine everyone 1679 Sep  3  2014 id_rsa
-rw-r--r-- 1 traine everyone  406 Sep  3  2014 id_rsa.pub
-rw-r--r-- 1 traine it_css   1221 Oct 12  2016 known_hosts


$ cp -a * ~/.ssh
cp: overwrite `/home/1201/.ssh/authorized_keys'? y
cp: overwrite `/home/1201/.ssh/id_rsa'? y
cp: overwrite `/home/1201/.ssh/id_rsa.pub'? y
cp: overwrite `/home/1201/.ssh/known_hosts'? y

$

Home directory: Use your home directory to store private files. Application software you use will often store its configuration, history, and cache files in your home directory. Generally, keep this directory uncluttered and use it for files needed to configure your environment. For example, add symbolic links in your home directory that point to files in any of the other directories. The /home filesystem is backed up with snapshots.
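
A typical link might look like the following; the target path is illustrative:

$ ln -s $WORKDIR/projects/alpha ~/alpha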

Workgroup directory: Use the workgroup directory (/home/work/«investing_entity») to build applications for you or your group to use, and to store important data, modified source, or any other files that need to be shared with your research group. See the Application development section for information on building applications. You should create a VALET package so your fellow researchers can access applications you want to share. A typical workflow is to copy the files needed from /home/work to /lustre/scratch for the actual run. The /home/work filesystem is backed up with snapshots.

Public scratch directory: Use the public Lustre scratch directory (/lustre/scratch) for files where high performance is required. Store intermediate work files here, and remove them when your current project is done; that frees up the public scratch space others also need. This is also a good place for sharing files and data with all users. Files in this directory are not backed up and are subject to removal. Use the Lustre utilities from a compute node to check disk usage and remove files no longer needed.

Node scratch directory: Use the node scratch directory (/scratch) for temporary files. The job scheduler software (Grid Engine) creates a subdirectory in /scratch specifically for each job's temporary files. This is done on each node assigned to the job. When the job is complete, the subdirectory and its contents are deleted. This process automatically frees up the local scratch storage that others may need. Files in node scratch directories are not available to the head node, or other compute nodes.

Lustre workgroup directory: Use the Lustre work directory (/lustre/work/«investing_entity»), if purchased by your research group, for files where high performance is required. Keep only the files needed for your job, such as scripts and large data sets used for input or created as output. Remember that the disk is not backed up, and files are subject to removal if needed, so be prepared to rebuild them if necessary. With batch jobs, the queue script is a record of what you did; for interactive work, you need to take notes as a record of your work. Use the Lustre utilities from a compute node to check disk usage and remove files no longer needed.
