====== Caviness: Per-process CPU Limits ======

Resource limits are critical on cluster login nodes.  Without effective limits in place, a single user could monopolize a login node and leave the cluster inaccessible to others.  This document summarizes the current resource limits on the Caviness cluster login nodes, explains why additional limits are needed, and describes how they are implemented.

====== Existing Limits ======

Linux cgroups (control groups) are used to limit cluster users' ssh sessions on the login nodes to:

  * 28 of the 36 CPU cores
  * an equal share of cycles on those 28 CPU cores
  * a hard maximum memory usage of 8 GiB

A single user running many high-CPU processes can still occupy all 28 cores, but once all 28 cores are in use the cycles are balanced equally across all users.  Reserving 8 CPU cores for the system (and IT staff) prevents users from locking out all access; IT staff can always log in and kill runaway processes, for example.

The memory limit does not apply to //each process// the user executes; it applies to the //aggregate usage// across //all processes// the user executes on that login node.  If the memory limit is exceeded, the system will kill processes until the user's aggregate memory usage has dropped below 8 GiB.

====== CPU Time as a Resource ======

The cgroup-based limits do not address a situation that often occurs on login nodes:  long-running, CPU-intensive tasks.  In an HPC cluster such tasks are meant to run on compute nodes, not on the login nodes.  Quite often they also involve higher memory usage and I/O levels, which can further degrade the performance of a login node for all users.

By contrast, normal login processes, like the bash shell, spend the majority of their elapsed real time (or wall time, in computing parlance) //sleeping//, that is, not consuming CPU cycles.

  * A single CPU-intensive process will consume 1 hour of CPU time over an elapsed 1 hour real time, whereas the bash shell used to execute that program will accrue a few seconds of CPU time.
  * A multithreaded program running 4 concurrent CPU-intensive threads will consume CPU time at a rate approaching 4x that of real time:  4 hours CPU time for every 1 hour elapsed real time.
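On Linux, the CPU time a process has accrued so far can be read from ''/proc/PID/stat'' and compared against its elapsed real time.  The following is a minimal sketch, not a tool provided on Caviness:  the function name ''cpu_seconds'' is hypothetical, and it relies on fields 14 and 15 of that file (user and system time) being counted in clock ticks, which it converts to seconds with ''getconf CLK_TCK''.

```shell
# Print the CPU time (user + system) consumed so far by one process, in seconds.
# Usage: cpu_seconds PID   (hypothetical helper, for illustration only)
cpu_seconds() (
    # /proc/PID/stat is a single line; field 2 (the command name) is wrapped in
    # parentheses and may contain spaces, so drop everything through the last ") ".
    read -r stat < "/proc/$1/stat" || exit 1
    rest=${stat##*) }
    # $rest now starts at field 3, so fields 14 (utime) and 15 (stime)
    # are its 12th and 13th words.
    set -- $rest
    ticks=$(( ${12} + ${13} ))
    # Both counters are in clock ticks; assume 100 ticks per second
    # if getconf cannot report the real rate.
    hz=$(getconf CLK_TCK 2>/dev/null || echo 100)
    awk -v t="$ticks" -v hz="$hz" 'BEGIN { printf "%.2f\n", t / hz }'
)

# Example: CPU seconds used by the current shell.
cpu_seconds $$
```

For an idle shell the value stays near zero no matter how long the session has been open, while for a CPU-intensive process it grows at roughly the rate of real time per busy thread.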

IT-RCI staff have traditionally monitored the cluster login nodes manually, killing long-running, CPU-intensive tasks and notifying the users who launched them.  Outside of normal business hours (overnight, vacations, unpaid leave) staff may not be available to perform this task.  For the sake of service continuity for all cluster users, automated policing of long-running, CPU-intensive tasks is important.

====== Implementation ======

The standard Unix/Linux [[https://man7.org/linux/man-pages/man2/getrlimit.2.html|resource limits]] include a CPU time limit.  This limit is per-process:  it is judged against a process's own consumption of CPU time, not that of its children.  This critical distinction can be observed in the following session:

<code bash>
[traine@login01 ~]$ ulimit -t 30
[traine@login01 ~]$ time ./mem_throttle
Killed

real	0m30.614s
user	0m25.585s
sys	0m5.024s
[traine@login01 ~]$ time ./mem_throttle
Killed

real	0m30.615s
user	0m25.589s
sys	0m5.022s
[traine@login01 ~]$ cat /proc/$$/stat | awk '{printf("%d seconds\n", $14 + $15);}'
7 seconds
</code>

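The user self-imposed a 30-second CPU time limit on the shell; this limit is inherited by processes executed in that shell.  The long-running, CPU- and memory-intensive program ''mem_throttle'' is run twice, and each time it exceeds the 30-second CPU limit (user + sys times) and is killed.  By the end of those commands, however, the shell itself has accrued very little CPU time and has not been killed.  (Strictly, fields 14 and 15 of ''/proc/PID/stat'' count clock ticks rather than seconds, typically 100 per second, so the 7 printed above overstates the shell's usage if read as seconds.)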
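The same inheritance can be used to try a limit on a single command without lowering it for the whole login shell.  The sketch below is illustrative only:  ''run_with_cpu_limit'' is a hypothetical helper, and the busy loop merely stands in for a CPU-intensive program.

```shell
# Run a command under its own CPU-time limit without touching the calling shell.
# Usage: run_with_cpu_limit SECONDS COMMAND [ARGS...]   (hypothetical helper)
run_with_cpu_limit() (
    # The parentheses make the function body a subshell, so the limit set here
    # is inherited by COMMAND but never applies to the caller.
    ulimit -t "$1" || exit 1
    shift
    exec "$@"
)

# A busy loop is terminated once it has used about 1 second of CPU time;
# an exit status above 128 means the command was ended by a signal.
run_with_cpu_limit 1 sh -c 'while :; do :; done' || echo "busy loop exit status: $?"

# The calling shell is unaffected and still reports its original limit.
echo "caller's CPU limit: $(ulimit -t)"
```

Because ''ulimit -t'' sets the hard limit as well as the soft one, an unprivileged user cannot raise it again within the same shell; confining it to a subshell avoids that.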

This is the desired behavior:  login shells, software builds, and other processes that accrue less than some threshold of CPU time in their complete execution are not affected, but CPU-intensive programs will reach that limit and be automatically killed by the system.  The chosen threshold is 30 minutes of CPU time.  Consider an actual bash shell that a user launched on Caviness over 1 month ago:

<code bash>
[root@login01 ~]# ls -ld /proc/27073
dr-xr-xr-x 9 traine everyone 0 Dec  1 01:23 /proc/27073
[root@login01 ~]# cat /proc/27073/stat | awk '{printf("%d seconds\n", $14 + $15);}'
34 seconds
</code>

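Even after more than a month, that shell's accrued CPU time remains far below the 30-minute threshold.  Accordingly, a new resource limit has been enacted on the two Caviness login nodes:  a hard maximum of 30 minutes of CPU time on each individual process executed by users.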
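Assuming the limit is exposed through the shell's standard ''ulimit'' builtin, which reports CPU time in seconds, a user could confirm the hard limit in effect for a session with a small helper like the one sketched here.  The name ''cpu_limit_report'' is hypothetical; a 30-minute limit would appear as 1800 seconds.

```shell
# Report the hard CPU-time limit of the current shell in seconds and minutes.
# (hypothetical helper, for illustration only)
cpu_limit_report() (
    # "ulimit -H -t" prints the hard limit in seconds, or "unlimited".
    limit=$(ulimit -H -t)
    if [ "$limit" = "unlimited" ]; then
        echo "no hard CPU time limit"
    else
        echo "$limit seconds ($((limit / 60)) minutes)"
    fi
)

cpu_limit_report
```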

====== Timeline ======

^Date^Time^Goal/Description^
|2021-01-06|11:30|CPU ulimit for users in group ''everyone'' effected|
|2021-01-06|12:00|This document published|