Technical documentation

This area collects technical notes that UD IT produces as it builds and supports the University's research computing systems.

Articles in this section discuss generic system administration tasks and observations.

The following articles discuss some of the implementation details involved in tailoring the Slurm job scheduler to the Caviness HPC system.

Additional articles documenting the configuration and use of Slurm on Caviness:

The following articles discuss some of the implementation details involved in tailoring the Slurm job scheduler to the DARWIN HPC system.

Additional articles documenting the configuration and use of Slurm on DARWIN:

Downloadable documents:

HADM: Integration of Resource Allocations in Slurm

The following articles discuss some of the implementation details involved in tailoring the Grid Engine job scheduler to the UD HPC systems.

Using subordination to auto-suspend jobs
Enhanced qlogin for X11 and group propagation
Exclusive allocation of compute nodes
Automated orphaned qlogin cleanup on cluster head nodes
Adding cgroup integration to Grid Engine using prolog/epilog scripts
Fully supporting Linux cgroups using UD's Grid Engine Cgroup Orchestrator (GECO) software

Compute nodes are provisioned and managed using the open-source PERCEUS toolkit.

Adding per-node init.d services without creating multiple VNFS images

Lustre is a high-performance parallel filesystem.

Recovering a failed OST

The following articles discuss technical aspects of software development on UD HPC systems.

Using file striping with the Lustre MPI-IO interfaces in Open MPI
Learn how your HPC workgroup can organize its own software installs and make use of VALET to streamline software maintenance
WGSS: WorkGroup-Sponsored Software on the clusters

According to Wikipedia, a white paper is an authoritative report or guide that helps readers understand an issue, solve a problem, or make a decision. Wikipedia notes that white papers are used primarily in marketing and government, but the definition applies equally well to computing.

Occasionally, performance benchmarking studies are carried out on the clusters or on new hardware being considered for them. Any significant findings that can be made public will be published in the white paper area of the site.
