====== Introduction ======
  
The Slurm workload manager (job scheduling system) is used to manage and control the resources available to computational tasks.  The job scheduler considers each job's resource requests (memory, disk space, processor cores) and executes it as those resources become available.  As a cluster workload manager, Slurm has three key functions: (1) It allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. (2) It provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. (3) It arbitrates contention for resources by managing a queue of pending work.
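
Those resource requests are usually expressed as ''#SBATCH'' directives at the top of a batch script.  A minimal sketch is shown below; the partition name, resource amounts, and program name are placeholders rather than Caviness-specific values.

<code bash>
#!/bin/bash
#SBATCH --job-name=example          # name shown in the queue
#SBATCH --partition=standard        # placeholder partition name
#SBATCH --nodes=1                   # number of compute nodes
#SBATCH --ntasks=1                  # number of tasks (processes)
#SBATCH --cpus-per-task=4           # processor cores per task
#SBATCH --mem=8G                    # memory per node
#SBATCH --time=01:00:00             # wall-clock time limit

# Commands below run only after Slurm has allocated the requested resources.
./my_program
</code>

A script like this is handed to the scheduler with ''sbatch'', e.g. ''sbatch myjob.qs''.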
  
Without a job scheduler, a cluster user would need to manually search for the resources required by his or her job, perhaps by randomly logging in to nodes and checking for other users' programs already executing thereon.  The user would have to "sign out" the nodes he or she wishes to use in order to notify the other cluster users of resource availability((Historically, this is actually how some clusters were managed!)).  A computer will perform this kind of chore more quickly and efficiently than a human can, and with far greater sophistication.
  
Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters.  Documentation for the current version of Slurm is provided by SchedMD: [[https://slurm.schedmd.com/documentation.html|SchedMD Slurm Documentation]].
  
When migrating from one scheduler to another, such as from GridEngine to Slurm, you may find it helpful to refer to SchedMD's [[https://slurm.schedmd.com/rosetta.pdf|rosetta]] showing equivalent commands across various schedulers, as well as their [[https://slurm.schedmd.com/pdfs/summary.pdf|command/option summary (two pages)]].
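
As a quick illustration, a few common GridEngine commands and their Slurm equivalents (the job script name here is only an example; see the rosetta for the full mapping):

<code bash>
# Submit a batch job              (GridEngine: qsub myjob.qs)
sbatch myjob.qs

# List your queued/running jobs   (GridEngine: qstat)
squeue -u $USER

# Cancel a job by its job ID      (GridEngine: qdel <job_id>)
scancel <job_id>
</code>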
  
<note tip>It is a good idea to periodically check in ''/opt/shared/templates/slurm/'' for updated or new [[technical:slurm:caviness:templates:start|templates]] to use as job scripts to run generic or specific applications designed to provide the best performance on Caviness.</note>
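
For example, you might browse the templates and copy one as a starting point for your own job script.  The specific file name below is only illustrative; the actual contents of the directory may differ.

<code bash>
# List the job script templates currently provided
ls /opt/shared/templates/slurm/

# Copy one into your working directory and edit it for your job
# (the template path below is an example, not a guaranteed file name)
cp /opt/shared/templates/slurm/generic/serial.qs ~/myjob.qs
</code>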
  
Need help? See [[http://www.hpc.udel.edu/presentations/intro_to_slurm/|Introduction to Slurm]] in UD's HPC community cluster environment.
===== What is a Job? =====
  
</note>
  
It is important for jobs to run on compute nodes and not login nodes.  Without effective limits in place, a single user could monopolize a login node and leave the cluster inaccessible to others.  Please review [[technical:generic:caviness-login-cpu-limit|Per-process CPU time limits on Caviness login nodes]], which summarizes the current resource limits and the need for and implementation of additional limits on the Caviness cluster login nodes.
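
In practice, this means anything computationally intensive should be submitted through Slurm rather than run directly on a login node.  A sketch of the two usual approaches follows; the option values are placeholders, and Caviness-specific wrappers or required options may differ.

<code bash>
# Batch: the work runs on a compute node once resources are allocated
sbatch myjob.qs

# Interactive: request a shell on a compute node
# (generic Slurm form; resource values are illustrative)
srun --ntasks=1 --cpus-per-task=1 --mem=1G --time=30:00 --pty /bin/bash
</code>
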
===== Queues =====
  