====== Technical documentation ======

{{:technical:bp_512.png?128 |}}What you will find in this area are technical notes produced as UD IT builds and supports the University's various research computing systems.

===== General Information =====

Articles in this section discuss generic system administration tasks/observations.

  * [[technical:generic:farber-upgrade-201606|Summary of the June 2016 upgrade to Farber]]
  * [[technical:generic:lustre-issues-201605|Summary of the issues experienced with Lustre the latter half of May, 2016]]
  * [[technical:cc3-build:start|Public documentation of the build process of Community Cluster 3]]
  * [[technical:generic:farber-microcode-201904|Summary of 2019 job stall issues on Farber and mitigation]]
  * [[technical:generic:caviness-expansion-201906|Summary of the June 2019 expansion to Caviness]]
  * [[technical:generic:caviness-cuda-update-201906|Summary of the June 2019 update to the CUDA driver on Caviness GPU nodes]]
  * [[technical:generic:mills-ongoing-file-access|Specifications for ongoing access to Mills file systems (2019-07-12 and on)]]
  * [[technical:generic:caviness-gen1-ossnfs-rebuild|Planned rebuild of Caviness first-generation OSS and NFS nodes]]
  * [[technical:globus:mills|Accessing Mills files via Globus GridFTP]]
  * [[technical:generic:workgroup-cmd|Alterations to behavior of the workgroup command]]
  * [[technical:generic:gaussian-linda-integration|Integration of TCPLinda+Gaussian on Caviness]]
  * [[technical:generic:caviness-login-cpu-limit|Per-process CPU time limits on Caviness login nodes]]
  * [[technical:generic:caviness-lustre-rebalance|Caviness 2021 Lustre expansion]]
  * [[technical:xsede:access-sso-hub|Regarding the transition to ACCESS and the XSEDE SSO HUB]]
  * [[technical:generic:intel-oneapi-caveats|Use of the Intel oneAPI compilers on Caviness and DARWIN]]
  * [[technical:generic:openmpi-4-ucx-issue|Use of UCX PML in Open MPI 4.x]]
  * [[technical:generic:gaussian-16-on-ampere-gpus|Compilation of Gaussian '16 to target NVIDIA Ampere GPUs]]
  * [[technical:generic:mpi-and-ucx-mr-cache|UCX MR caching and large-scale MPI collectives]]
  * [[technical:generic:account-portal|2026 password policy change and self-service web portal]]

===== Recipes =====

  * [[technical:recipes:pyqt5-in-virtualenv|Building PyQt5 in a Python virtualenv]]
  * [[technical:recipes:keras-in-virtualenv|Building a Keras Python virtualenv]]
  * [[technical:recipes:r-in-rlibs|Adding your own library of R modules in R_LIBS]]
  * [[technical:recipes:emcee-in-virtualenv|Building an Emcee and PyKlip Python virtualenv]]
  * [[technical:recipes:mpi4py-in-virtualenv|Building a Python virtualenv with a properly-integrated mpi4py module]]
  * [[technical:recipes:tensorflow-in-virtualenv|Building a TensorFlow Python virtualenv]]
  * [[technical:recipes:jupyter-notebook|Building a Jupyter Notebook Python virtualenv]]
  * [[technical:recipes:openmm|Building an OpenMM Python virtualenv]]
  * [[technical:recipes:telemac|Building and using TELEMAC-MASCARET with VALET integration]]
  * [[technical:recipes:software-managment|Basic software building and management]]
  * [[technical:recipes:vasp-6-darwin|Building VASP 6 on Caviness/DARWIN]]
  * [[technical:recipes:git-cmake-valet-package|Managing multiple versions of revision-controlled repositories]]
  * [[technical:recipes:visit-remote-host|Using Caviness or DARWIN as a remote VisIt 3.2.x host]]
  * [[technical:recipes:vnc-usage|Using a VNC for X11 applications]]
  * [[technical:recipes:tensorflow-rocm|Using TensorFlow on AMD GPUs (on DARWIN)]]
  * [[technical:recipes:mcfost|Building MCFOST on Caviness]]
  * [[technical:recipes:gcc-openacc|Building a GCC 12.2 toolchain with NVIDIA/AMD offload for OpenACC]]
  * [[technical:recipes:ls-dyna|Installing and using LS-DYNA on Caviness/DARWIN]]
  * [[technical:recipes:gnnunlock|Installing GNNUnlock on Caviness]]
  * [[technical:recipes:lammps|Installing and managing LAMMPS on Caviness/DARWIN]]
===== SLURM =====

==== Caviness ====

The following articles discuss some of the implementation details involved in tailoring the Slurm job scheduler to the Caviness HPC system.
 
  * [[technical:slurm:caviness:partitions|Revisions to Slurm Configuration v1.0.0 on Caviness]]
  * [[technical:slurm:caviness:arraysize-and-nodecounts|Revisions to Slurm Configuration v1.1.2 on Caviness]]
  * [[technical:slurm:caviness:node-memory-sizes|Revisions to Slurm Configuration v1.1.3 on Caviness]]
  * [[technical:slurm:caviness:reboot-and-helper-scripts|Revisions to Slurm Configuration v1.1.4 on Caviness]]
  * [[technical:slurm:caviness:gen2-additions|Revisions to Slurm Configuration v1.1.5 on Caviness]]
  * [[technical:slurm:caviness:scheduler-params|Revisions to Slurm Configuration v2.0.0 on Caviness]]
  * [[technical:slurm:caviness:gen2_1-additions|Revisions to Slurm Configuration v2.1.0 on Caviness]]
  * [[technical:slurm:caviness:swap-control-implementation|Revisions to Slurm Configuration v2.2.1 on Caviness]]
  * [[technical:slurm:caviness:salloc-default-cmd-fixup|Revisions to Slurm Configuration v2.3.1 on Caviness]]
  * [[technical:slurm:caviness:gen3-additions|Revisions to Slurm Configuration v2.3.2 on Caviness]]
  * [[technical:slurm:caviness:workgroup-reorg|Revisions to Slurm Configuration v2.4.0 on Caviness]]

Additional articles documenting the configuration and use of Slurm on Caviness:

  * [[technical:slurm:caviness:templates:start|UD job script templating for Slurm on Caviness]]
  * [[technical:slurm:caviness:auto_tmpdir|UD plugin for automated creation of per-job temporary directories]]
  * [[technical:slurm:caviness:mandatory_gpu_type|Revision to Slurm job submission to require GPU types]]
  * [[technical:slurm:caviness:synth_features|Revision to Slurm node features]]

==== DARWIN ====

The following articles discuss some of the implementation details involved in tailoring the Slurm job scheduler to the DARWIN HPC system.
 
  * [[technical:slurm:darwin:swap-control-implementation|Revisions to Slurm Configuration v1.0.7 on DARWIN]]
  * [[technical:slurm:darwin:salloc-default-cmd-fixup|Revision to Slurm Configuration v1.0.8 on DARWIN]]

Additional articles documenting the configuration and use of Slurm on DARWIN:

  * [[technical:slurm:darwin:templates:start|UD job script templating for Slurm on DARWIN]]
  * [[technical:slurm:darwin:auto_tmpdir|UD plugin for automated creation of per-job temporary directories]]
  * [[technical:slurm:darwin:swap-control|Controlling Swap Usage]]
  * [[technical:slurm:darwin:synth_features|Revision to Slurm node features]]

Downloadable documents:

  * {{:technical:slurm:hadm_write-up.pdf|HADM: Integration of Resource Allocations in Slurm}}
===== Grid Engine =====

The following articles discuss some of the implementation details involved in tailoring the Grid Engine job scheduler to the UD HPC systems.

  * Using [[technical:gridengine:subordination|subordination]] to auto-suspend jobs
  * Enhanced [[technical:gridengine:enhanced-qlogin|qlogin]] for X11 and group propagation
  * [[technical:gridengine:exclusive-alloc|Exclusive allocation]] of compute nodes
  * Automated [[technical:gridengine:qlogin-orphan|orphaned qlogin cleanup]] on cluster head nodes
  * Adding [[technical:gridengine:cgroup-integration|cgroup integration]] to Grid Engine using prolog/epilog scripts
  * Fully supporting Linux cgroups using UD's [[technical:gridengine:geco:start|Grid Engine Cgroup Orchestrator]] (GECO) software
===== PERCEUS =====

Compute nodes are provisioned and managed using the open source [[http://www.perceus.org|PERCEUS]] toolkit.

  * Adding [[technical:perceus:per-node-initd|per-node init.d services]] without creating multiple VNFS images 


===== Lustre Support =====

Lustre is a high-performance parallel filesystem.

  * [[technical:lustre:recover-ost|Recovering a failed OST]]

===== Development =====

The following articles discuss technical aspects of software development on UD HPC systems.

  * Using file striping with the [[technical:developer:open-mpi:lustre-mpi-io|Lustre MPI-IO interfaces in Open MPI]]
  * Learn how your HPC workgroup can organize its own software installs and make use of VALET to [[technical:developer:workgroup-sw|streamline software maintenance]]
  * [[technical:developer:wgss|WGSS:  WorkGroup-Sponsored Software]] on the clusters

===== White papers =====

[[wp>White_paper|According to Wikipedia]], a white paper is //an authoritative report or guide that helps readers understand an issue, solve a problem, or make a decision//.  Wikipedia claims that white papers are primarily used in marketing and government, but the definition applies equally well to the computing world.

Occasionally there may be performance benchmarking studies performed on the clusters or on new hardware being considered for use in the clusters.  Any significant findings that can be made public will be published in the [[technical:whitepaper:start|white paper]] area of the site.