
Mills Storage

Each user has 2 GB of disk space reserved for personal use on the home file system. Users' home directories are in /home (e.g., /home/1005). These are regularly archived at UD's off-campus disaster recovery site. Use the recover command to restore older versions or deleted files in your home directory.

An 8 TB permanent filesystem is provided on the login node (head node), mills.hpc.udel.edu. The filesystem is configured as RAID-6 (double parity) and is accessible to the compute nodes via 1 Gb/s Ethernet and to the campus network via 10 Gb/s Ethernet. Two terabytes are allocated for users' home directories in /home. The remaining 6 TB are reserved for the system software, libraries and applications in /opt.

Each research group has 1 TB of shared group storage on the archive filesystem (/archive). The directory is identified by the research-group identifier «investing_entity» (e.g., /archive/it_css). A read-only snapshot of users' files is made several times per day on the disk. In addition, the filesystem is replicated on UD's off-campus disaster recovery site. Daily snapshots are user-accessible. Older files may be restored by special request.

The 60 TB permanent archive filesystem uses 3 TB enterprise-class SATA drives in a triple-parity RAID configuration for high reliability and availability. The filesystem is accessible to the head node via 10 Gb/s Ethernet and to the compute nodes via 1 Gb/s Ethernet.

User storage is available on a high-performance Lustre-based filesystem having 172 TB of usable space. It holds the input files, supporting data files, work files, output files, source code and executables associated with computational tasks run on the cluster. The filesystem is accessible to all of the processor cores via QDR InfiniBand.

The Lustre filesystem is not backed up. However, it is a robust RAID-6 system. Thus, the filesystem can survive a concurrent disk failure of two independent hard drives and still rebuild its contents automatically.

The /lustre filesystem is partitioned as shown below:

Directory   Description
---------   -----------------------------------------------------------
work        Private work directories for individual investing-entities
scratch     Public scratch space for all users
sysadmin    System administration use

Each investing-entity has a private work directory (/lustre/work/«investing_entity») that is group-writable. This is where you should create and store most of your files. Each investing-entity's principal stakeholder is responsible for maintenance of the group's directory. IT does not automatically delete files from these directories. The default group-ownership for a file created in a private work directory is the investing-entity's group name. Its default file permissions are 644.
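
For example, creating a file in the group's work directory and listing it should show the group name and mode 644 (the workgroup it_css and the file name are only illustrative):

  touch /lustre/work/it_css/example.dat
  ls -l /lustre/work/it_css/example.dat   # expect something like: -rw-r--r-- 1 traine it_css ... example.dat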

Anyone may use the public scratch directory (/lustre/scratch). IT staff may run cleanup procedures as needed to purge aged files or directories in /lustre/scratch if old files are degrading system performance.

Note: A full filesystem inhibits use for everyone.

Each compute node has its own 1-2 TB local hard drive, which is needed for time-critical tasks such as managing virtual memory. System usage of the local disk is kept as small as possible to leave local disk space for your applications running on the node. For this purpose, a /scratch filesystem is mounted on each node.

To help users maintain awareness of quotas and their usage on /lustre/work, /home and /archive filesystems, the command my_quotas is now available to display a list of the quota-controlled filesystems (Lustre, NFS, XFS) on which the user has storage space.

For example,

$ my_quotas
Type  Path                         In-use / kiB   Available / kiB  Pct
----- --------------------------- ------------- ----------------- ----
user  /home/1001                     1713689184 72057594037927936   0%
group /archive/it_nss                 167143424        1073741824  16%
group /lustre/work/it_nss               1161212        1219541792   0%

Each user's home directory has a hard quota limit of 2 GB.

Each group's work directory has a quota designed to give your group 1 TB of disk space.

Each investing-entity originally had an informal quota for its private work directory in /lustre/work based on 1 TB plus an extra 10 GB per processor-core owned by the investing-entity; most groups therefore have a quota of approximately 1.25 TB. With the separation of /lustre/scratch, IT has enabled quotas on /lustre/work for each research group based on its current usage, plus additional overhead proportional to the core count purchased on Mills. IT will continue to run cleanup procedures as needed to purge aged files and directories in /lustre/scratch, since that filesystem has no quotas. With quotas in place on /lustre/work, no single research group can fill up /lustre/work and impact other research groups.

To determine usage for user traine in workgroup it_css, use the command

[traine@mills ~]$ my_quotas
Type  Path                In-use / kiB Available / kiB  Pct
----- ------------------- ----------- ------------ ----
user  /home/1201               314016     10485760   3%
group /archive/it_css         7496704   1073741824   1%
group /lustre/work/it_css   188761744    914656344  21%

To determine all usage on /lustre/scratch, use the command

[traine@mills ~]$ df -H /lustre/scratch
Filesystem                          Size  Used Avail Use% Mounted on
mds1-ib@o2ib:mds2-ib@o2ib:/scratch  160T  2.8T  150T   2% /lustre-scratch
[traine@mills ~]$
Files are automatically cleaned up in /lustre/scratch to prevent it from reaching 100% of capacity.
Please use the custom Lustre utilities to remove files on the Lustre filesystems /lustre/work and /lustre/scratch, or to check disk usage on /lustre/scratch.
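
The site's custom Lustre utilities are not shown here; as a rough sketch, the standard Lustre client tool lfs provides similar checks (the username and the 30-day threshold below are only illustrative):

  lfs df -h /lustre/scratch                            # per-target usage for the scratch filesystem
  lfs find /lustre/scratch/traine -type f -mtime +30   # list your files older than 30 days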

The node scratch filesystem is mounted on /scratch on each of your nodes. There is no quota, and if you exceed the physical size of the disk you will get disk-failure messages. To check the usage of the disk, use the df -h command on the compute node.

For example, the command

   ssh n017 df -h /scratch

shows 197 MB used from the total filesystem size of 793 GB.

Filesystem            Size  Used Avail Use% Mounted on
/dev/sda2             793G  197M  753G   1% /scratch

This node, n017, has a 1 TB disk and 64 GB of memory, which requires 126 GB of swap space on the disk.

There is a physical disk installed on each node that is used for time-critical tasks such as swapping memory. The compute nodes are configured with either a 1 TB or a 2 TB disk; however, the /scratch filesystem never gets the disk's full capacity, because part of it is reserved for swap, and large-memory nodes need more swap space.

We strongly recommend that you refer to the node scratch by using the environment variable, $TMPDIR, which is defined by Grid Engine when using qsub or qlogin.
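
For example, a minimal Grid Engine script sketch that stages data into $TMPDIR and copies the results back before the job ends (the application name, input file and workgroup paths are hypothetical):

  #!/bin/bash
  #$ -N tmpdir_example
  # Grid Engine creates $TMPDIR on the node's local /scratch for this job
  cp /lustre/work/it_css/users/traine/input.dat $TMPDIR/
  cd $TMPDIR
  /lustre/work/it_css/users/traine/my_app input.dat > results.out
  # $TMPDIR is deleted when the job finishes, so copy the results back to Lustre first
  cp results.out /lustre/work/it_css/users/traine/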

Files in your home directory and all sub-directories are backed up using the campus backup system. The recover command is for browsing the index of saved files and recovering selected files from the backup system. To recover a file, «filename», to its original location

  • Go to the original directory using the cd command.
  • Start an interactive recover session using the recover command.
  • Mark the file for recovery with the command: add «filename»
  • Schedule the file recovery with the command: recover

Here is a sample session where the file sourceme-gcc is removed and then recovered into its original location.

[traine@mills ex0]$ rm sourceme-gcc
[traine@mills ex0]$ recover
Current working directory is /home/1201/ex0/
recover> add sourceme-gcc
/home/1201/ex0
1 file(s) marked for recovery
recover> recover
Recovering 1 file into its original location
Volumes needed (all on-line):
        d08.RO at /xanadu/xanadu_8/_AF_readonly
Total estimated disk space needed for recover is 4 KB.
Requesting 1 file(s), this may take a while...
Requesting 1 recover session(s) from server.
./sourceme-gcc
Received 1 file(s) from NSR server `owell-3.nss.udel.edu'
Recover completion time: Mon 20 Aug 2012 02:54:59 PM EDT
recover> quit
[traine@mills ex0]$ head -1 sourceme-gcc
example='dgels'

Snapshots are read-only images of the filesystem at the time the snapshot is taken. They are available under the .zfs/snapshot directory at the base of the filesystem (e.g., /archive/it_css/.zfs/snapshot/). The .zfs directory is hidden and does not show up in a directory listing with ls -a, but you can still "go to" it with the cd command. Inside you will find directories named 12, 18, and one for each day of the week (Mon through Sun). The 12 snapshot is taken during the noon hour and the 18 snapshot during the 6pm hour; they are named this way so that hourly snapshots could be added in the future. The daily snapshots are taken during the 11pm hour.

Each day during the noon hour the snapshot from the previous day is destroyed and a new one is taken, and the same is done during the 6pm hour. During the 11pm hour the snapshot from one week ago is destroyed and a new one is created. This allows file "backups" from the past week to be retrieved. You will also notice snapshots named now, prev and prev-1; these are the snapshots used to replicate the filesystem to UD's off-campus disaster recovery site.
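
For example, to pull a copy of a file back from the noon snapshot (the workgroup and file path are hypothetical):

  cd /archive/it_css/.zfs/snapshot/12
  cp projects/fuelcell/params.in /archive/it_css/projects/fuelcell/params.in.restored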

When an initial snapshot is taken, no space is used, since it is only a read-only reference to the current filesystem image. However, as the filesystem changes, data blocks are copied on write, which causes snapshots to use space. These new blocks used by snapshots do not count against the 1 TB limit that the group's filesystem can reference, but they do count toward a 4 TB limit per research group (workgroup).

Some example uses of snapshots for users are:

  • If a file is deleted or modified during the afternoon you can go to the 12 snapshot taken during the noon hour and retrieve the file as it existed at that time.
  • If a file was deleted on Friday and you do not realize until Monday you can use the Thu snapshot to retrieve the file.

Generally, the /lustre filesystem provides better overall performance than the /home and /archive filesystems. This is especially true for input and output files needed or generated by jobs. The /lustre filesystem is accessible to all processor cores via (40 Gb/s) QDR InfiniBand; in comparison, the compute nodes access the /home and /archive filesystems over 1 Gb/s Ethernet.

The /archive filesystem has less space available for your group, but it has both regular snapshots and off-site replication for recovering files. The filesystem is especially useful for building applications for your group to use. The main compilers and build tools are available on the head node, and the head node can access the /archive filesystem over 10 Gb/s Ethernet. If your group installs many packages, you could exceed your storage limit, so clean the install directory after you build and test each package and remove any files you can download again.

The /home filesystem is limited in storage, and it is used by many applications to store user preference files and caches. So even if you never put files in your home directory, you should regularly check your usage with the quota -s command.

Private work directories

All members of an investing-entity share their group-writable, private lustre work directory, /lustre/work/«investing_entity», and archive directory, /archive/«investing_entity». All users in your group have full access to add, move (rename) or remove directories and files in these group-writable directories. Be careful not to move or remove any files or directories you do not own (you own directories you created). Your fellow researchers will appreciate your good "cluster citizenship."

You should create a personal subdirectory within any group-writable directory for your own group-related files. That will reduce the chance of others accidentally modifying or deleting your files. You will own this new personal directory, with full access for you and read-only access for your group. Your fellow researchers can copy your files, but not modify them. Researchers outside your group can never see or copy your work, because the investing-entity work directory is open only to your group.
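
For example (the workgroup it_css and username traine are illustrative):

  cd /lustre/work/it_css
  mkdir traine    # owned by you; the default permissions give your group read-only access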

This describes the way ownership and access are set using the default shell environment. You can make some changes with standard UNIX commands such as chmod, if you must. However, you cannot give users outside your group access to your files. Use the public scratch directories for sharing files.

Public scratch directories

All members of the cluster community share a world-writable, public lustre scratch directory, /lustre/scratch/. Access to this world-writable directory is controlled by the sticky bit set on the directory. When set, only the file owner or the directory's owner can rename (mv) or delete (rm) the files in the directory. You should create a personal directory that you own, with full access for you and read-only access for every other user. This is where you store files you are willing to share. This is also where you store files that require a large amount of disk space for a short period. Make sure you clean up as soon as the disk space is no longer needed.
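
For example (the username traine is illustrative):

  ls -ld /lustre/scratch          # the trailing "t" in the mode (drwxrwxrwt) is the sticky bit
  mkdir /lustre/scratch/traine    # your personal directory: full access for you, read-only for others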

Work directory structure

Your group should initially consider how to organize the group's private work directory structure to match the group's workflow patterns. There is no single, best solution. For some, the work may be more project-oriented; for others, it may be more independent and user-oriented. Or the work may be some combination of these. One possibility for the lustre work might look like this:

/lustre/work/it_css/
    projects/
       fuelcell/
       turbulence/
    users/
       boltzman/
       jjkim/
       ksridhar/

where the projects and users directories are owned by the stakeholder and are group-writable with the sticky bit set. In these directories it is safe for any user to create new project and/or personal user directories. The user creating these new directories will own them and control their names. The stakeholder will be able to rename the subdirectories if restructuring is required.

If your research group chooses to follow this structure, we suggest the following procedure: the stakeholder should first create the projects and users directories as group-writable with the sticky bit set (mode 1770).

  cd /lustre/work/it_css
  mkdir -m 1770 projects users

Recommendations:

Private work directory: Use the Lustre work directory (/lustre/work/«investing_entity») for files where high performance is required. Keep just the files needed for your job such as your applications, scripts and large data sets used for input or created for output. Remember the disk is not backed up, so be prepared to rebuild the files if necessary. With batch jobs, the queue script is a record of what you did, but for interactive work, you need to take notes as a record of your work. Use Lustre utilities from a compute node to check disk usage and remove files no longer needed.

Private archive directory: Use the archive directory (/archive/«investing_entity») to build applications for you or your group to use, as well as to store important data, source code or any other files you want backed up. This directory and all the make tools are available from the head node, which has faster access to /archive than the compute nodes do. See the Application development section for information on building applications. You should make a "setup" script that your fellow researchers can use to access applications you want to share. A typical workflow is to copy the files needed from /archive to /lustre for the actual run. The /archive filesystem is backed up with snapshots.
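
For example, a typical staging step before a run might look like this (the application and data file names are hypothetical):

  cp -r /archive/it_css/myapp /lustre/work/it_css/users/traine/
  cp /archive/it_css/data/run01.in /lustre/work/it_css/users/traine/myapp/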

Public scratch directory: Use the public Lustre scratch directory (/lustre/scratch or /lustre-scratch) for files where high performance is required. Store intermediate work files there, and remove them when your current project is done; that will free up the public scratch workspace others also need. This is also a good place for sharing files with all users on Mills. Files in this directory are not backed up, and they are the first subject to removal if the filesystem gets full. Use Lustre utilities from a compute node to check disk usage and remove files no longer needed.

Home directory: Use your home directory to store small private files. Application software you use will often store its configuration, history and cache files in your home directory. Generally, keep this directory free of clutter and use it only for files needed to configure your environment. For example, add symbolic links in your home directory that point to files in any of the other directories.
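
A minimal illustration (the target path is hypothetical):

  ln -s /lustre/work/it_css/users/traine ~/work    # "cd ~/work" now lands in your Lustre work area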

Node scratch directory: Use the node scratch directory (/scratch) for temporary files. The job scheduler software (Grid Engine) creates a subdirectory in /scratch specifically for each job's temporary files. This is done on each node assigned to the job. When the job is complete, the subdirectory and its contents are deleted. This process automatically frees up the local scratch storage that others may need. Files in node scratch directories are not available to the head node, or other compute nodes.
