abstract:darwin:runjobs:schedule_jobs

  
**Moral of the story:** Request only the resources needed for your job.  Over- or under-requesting resources wastes allocation credits for everyone in your project/workgroup.</note>

<note important>**Interactive jobs:** An interactive job is billed the SUs associated with the full wall time of its execution, not just the CPU time accrued during its duration.  For example, if you leave an interactive job running for 2 hours and execute code for only 2 minutes, your allocation will be billed for 2 hours of time, not 2 minutes.  Please review [[abstract:darwin:runjobs:accounting|job accounting]] to determine the SUs associated with each type of resource requested (compute, GPU) and the SUs billed per hour.</note>
  
===== Interactive jobs (salloc) =====
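For example, a minimal ''salloc'' request might look like the following sketch. The partition, resource counts, and time limit are placeholders only, and your site may require additional options (such as selecting a workgroup/account) before the request is accepted; keeping ''%%-%%-time'' as short as practical also limits the wall time that is billed.

<code bash>
# Hypothetical request: 1 node, 4 cores, 8 GiB of memory, 1 hour in the
# standard partition -- adjust every value to what your job actually needs.
salloc --partition=standard --nodes=1 --ntasks=4 --mem=8G --time=01:00:00

# Exiting the shell started by salloc ends the interactive job and stops
# the wall-time billing.
exit
</code>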
|Extra-Large Memory/2 TiB   |%%--%%partition=xlarge-mem                   |     2031616|      1984|
|nVidia-T4/512 GiB          |%%--%%partition=gpu-t4                       |      499712|       488|
|nVidia-V100/768 GiB        |%%--%%partition=gpu-v100                     |      737280|       720|
|amd-MI50/512 GiB           |%%--%%partition=gpu-mi50                     |      499712|       488|
|Extended Memory/3.73 TiB   |%%--%%partition=extended-mem %%--exclusive%% |      999424|       976|
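As a hedged illustration of how the partition flags and memory limits above are used together, the batch-script fragment below requests the ''gpu-v100'' partition with a memory request at its per-node maximum of 737280 MiB (720 GiB); the values are examples only, and any GPU-count options are omitted here.

<code bash>
#!/bin/bash
# Illustrative fragment (not a site-provided template): select the
# nVidia-V100 partition and request at most its per-node memory limit.
#SBATCH --partition=gpu-v100
#SBATCH --nodes=1
#SBATCH --mem=720G
#SBATCH --time=04:00:00
</code>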
  
The name provided with the ''%%-%%-job-name'' command-line option will be assigned to the interactive session/job that the user started instead of the default name ''interact''. See [[abstract/darwin/runjobs/job_status|Managing Jobs]] on the <html><span style="color:#ffffff;background-color:#2fa4e7;padding:3px 7px !important;border-radius:4px;">sidebar</span></html> for general information about commands in Slurm to manage all your jobs on DARWIN.
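For instance, here is a hedged example of naming an interactive job and then checking it by name; the job name, partition, and time limit below are arbitrary placeholders.

<code bash>
# Start an interactive job with a custom name instead of the default "interact"
salloc --partition=standard --job-name=myanalysis --time=00:30:00

# From another shell on the login node, list the job by name to confirm it
squeue -u $USER --name=myanalysis
</code>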

===== Launching GUI Applications (VNC for X11 Applications) =====

Please review [[technical:recipes/vnc-usage|using VNC for X11 Applications]] as an alternative to X11 Forwarding.
  
===== Launching GUI Applications (X11 Forwarding) =====
</code>
  
This will launch an interactive job on one of the compute nodes in the ''standard'' partition, in this case ''r1n02'', with the default options of one CPU (core), 1 GB of memory, and 30 minutes of wall time.
  
The compute node and environment will now be ready to launch any program that has a GUI (Graphical User Interface) and have it displayed on your local computer.
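As a quick, hedged check that the forwarding works (assuming a simple X11 client such as ''xclock'' is installed on the compute node), you can run:

<code bash>
# DISPLAY should have been propagated to the compute node by X11 forwarding
echo $DISPLAY

# Launch a trivial X11 client; a clock window should appear on your local screen
xclock &
</code>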
  
<note important>
==== Handling System Signals aka Checkpointing ====
  
Generally, there are two possible cases when jobs are killed: (1) preemption and (2) the walltime configured within the job script has elapsed. Checkpointing can be used to intercept and handle the system signals in each of these cases to write out a restart file, perform cleanup or backup operations, or complete any other tasks before the job gets killed. Of course, this depends on whether or not the application or software you are using is checkpoint enabled.
  
<note important>Please review the comments provided in the Slurm job script templates available in ''/opt/shared/templates'' that demonstrate the ways to trap these signals.</note>
 "TERM" is the most common system signal that is triggered in both the above cases. However, there is a working logic behind the preemption of job which works as below. "TERM" is the most common system signal that is triggered in both the above cases. However, there is a working logic behind the preemption of job which works as below.
  
When a job is submitted to a workgroup-specific partition and its resources are tied up by jobs in the ''idle'' partition, the jobs in the ''idle'' partition will be preempted to make way.  Slurm sends a preemption signal to the job (SIGCONT followed by SIGTERM), then waits for a grace period (5 minutes) before signaling again (SIGCONT followed by SIGTERM) and finally killing it (SIGKILL).  However, if the job can simply be re-run as-is, the user can submit with ''%%-%%-requeue'' to indicate that an ''idle'' job that was preempted should be rerun on the ''idle'' partition (possibly restarting immediately on different nodes; otherwise it will need to wait for resources to become available).
  
For example, using the logic provided in one of the Slurm job script templates, one can catch these signals during preemption and handle them by performing cleanup or backing up the job results as follows.
  
<code bash>
#SBATCH --partition=idle
#SBATCH --job-name="atest"
#SBATCH --nodes=1
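#SBATCH --requeue
#SBATCH --time=08:00:00

# NOTE: the remainder of this script is an illustrative sketch, not the
# verbatim logic from the templates in /opt/shared/templates -- review those
# templates for the supported version.  The idea is to trap SIGTERM (sent at
# preemption or when the walltime expires) so the job can back up its results
# before it is killed.
cleanup_before_exit() {
    echo "Caught SIGTERM -- backing up partial results before the job is killed"
    cp -r scratch_output/ "$SLURM_SUBMIT_DIR"/ || true   # hypothetical paths
    exit 0
}
trap cleanup_before_exit TERM

# Run the application in the background and wait, so the shell receives and
# handles the signal while the application is running.
./my_checkpoint_enabled_app &      # hypothetical checkpoint-enabled executable
wait
</code>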