===== Issues =====

Every compute node registered in Slurm has a list of zero or more //features// -- strings that identify a functionality, identity, or other attribute associated with the node. On DARWIN, all compute nodes have always been statically-configured with a set of features reflecting the node type, any GPU hardware present, and the installed memory capacity. Take, for example, these DARWIN nodes:
<code bash>
[user@login01.darwin ~]$ scontrol show node r1n00 | grep Features
AvailableFeatures=standard,512GiB
ActiveFeatures=standard,512GiB

[user@login01.darwin ~]$ scontrol show node r1t00 | grep Features
AvailableFeatures=nvidia-gpu,nvidia-t4,t4,512GiB
ActiveFeatures=nvidia-gpu,nvidia-t4,t4,512GiB

[user@login01.darwin ~]$ scontrol show node r2v00 | grep Features
AvailableFeatures=nvidia-gpu,nvidia-v100,v100,768GiB
ActiveFeatures=nvidia-gpu,nvidia-v100,v100,768GiB
</code>
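
The feature lists of all nodes can also be surveyed at once with ''sinfo''. For example, the format string below simply pairs each node name (''%N'') with its available features (''%f''):

<code bash>
[user@login01.darwin ~]$ sinfo --Node --format='%N %f'
</code>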
A user can limit which nodes are permissible for a submitted job:

<code bash>
[user@login01.darwin ~]$ sbatch --constraint=512GiB …
</code>
would mean ''r1n00'' or ''r1t00'' could be used to execute the job but ''r2v00'' would not.

While these existing features can be useful, they do not directly assist in choosing nodes based on //hardware capabilities//. Some software may demand a CPU with AVX512 ISA extensions, but Slurm does not inherently know whether a node's CPU has that capability, nor do our existing features directly indicate it.

A list of all ISA extensions supported by a CPU is present in a Linux system's ''/proc/cpuinfo'' file. It would be helpful if the statically-configured features that have always existed were augmented by additional features added dynamically by the Slurm software running on each compute node, for example:

<code bash>
[user@login01.darwin ~]$ scontrol show node r1n00 | grep Features
AvailableFeatures=VENDOR::AuthenticAMD,MODEL::EPYC_7502,CACHE::512KB,ISA::sse,ISA::sse2,ISA::ssse3,ISA::sse4_1,ISA::sse4_2,ISA::avx,ISA::avx2,standard,512GiB
ActiveFeatures=VENDOR::AuthenticAMD,MODEL::EPYC_7502,CACHE::512KB,ISA::sse,ISA::sse2,ISA::ssse3,ISA::sse4_1,ISA::sse4_2,ISA::avx,ISA::avx2,standard,512GiB

[user@login01.darwin ~]$ scontrol show node r1t00 | grep Features
AvailableFeatures=VENDOR::AuthenticAMD,MODEL::EPYC_7502,CACHE::512KB,ISA::sse,ISA::sse2,ISA::ssse3,ISA::sse4_1,ISA::sse4_2,ISA::avx,ISA::avx2,PCI::GPU::T4,nvidia-gpu,nvidia-t4,t4,512GiB
ActiveFeatures=VENDOR::AuthenticAMD,MODEL::EPYC_7502,CACHE::512KB,ISA::sse,ISA::sse2,ISA::ssse3,ISA::sse4_1,ISA::sse4_2,ISA::avx,ISA::avx2,PCI::GPU::T4,nvidia-gpu,nvidia-t4,t4,512GiB

[user@login01.darwin ~]$ scontrol show node r2v00 | grep Features
AvailableFeatures=VENDOR::GenuineIntel,MODEL::8260,CACHE::36608KB,ISA::sse,ISA::sse2,ISA::ssse3,ISA::sse4_1,ISA::sse4_2,ISA::avx,ISA::avx2,ISA::avx512f,ISA::avx512dq,ISA::avx512cd,ISA::avx512bw,ISA::avx512vl,ISA::avx512_vnni,PCI::GPU::V100,nvidia-gpu,nvidia-v100,v100,768GiB
ActiveFeatures=VENDOR::GenuineIntel,MODEL::8260,CACHE::36608KB,ISA::sse,ISA::sse2,ISA::ssse3,ISA::sse4_1,ISA::sse4_2,ISA::avx,ISA::avx2,ISA::avx512f,ISA::avx512dq,ISA::avx512cd,ISA::avx512bw,ISA::avx512vl,ISA::avx512_vnni,PCI::GPU::V100,nvidia-gpu,nvidia-v100,v100,768GiB
</code>
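
The ''ISA::'' feature names above mirror the CPU flag strings the Linux kernel reports in ''/proc/cpuinfo''. For reference, those flags can be inspected directly from a shell on a compute node; the node name below is only illustrative, and this one-liner lists whatever AVX512 extensions the kernel advertises:

<code bash>
# Node name is illustrative; run from a shell on any compute node.
[user@r2v00 ~]$ grep -m1 '^flags' /proc/cpuinfo | tr ' ' '\n' | grep '^avx512'
</code>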

For a user to submit a job that requires the AVX512 Byte-Word and AVX512 Foundational ISA extensions, the command would resemble this:

<code bash>
[user@login01.darwin ~]$ sbatch … --constraint='ISA::avx512f&ISA::avx512bw' …
</code>
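
The same constraint can also be given as an ''#SBATCH'' directive inside a batch script rather than on the command line. A minimal sketch follows; the job name, task count, and program name are placeholders, not anything prescribed by this change:

<code bash>
#!/bin/bash
#
# Placeholder job settings; only the --constraint line is specific to
# the AVX512 requirement discussed above.
#SBATCH --job-name=avx512_example
#SBATCH --ntasks=1
#SBATCH --constraint='ISA::avx512f&ISA::avx512bw'

./my_avx512_program
</code>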

===== Impact =====

The Slurm scheduler will be restarted to load the new plugin. Job submission and query (via ''sbatch'', ''sacct'', and ''squeue'', for example) will hang for a period anticipated to be less than one minute.

===== Timeline =====

^Date^Time^Goal/Description^
|2026-01-06| |Authoring of this document|
|2026-01-13|10:00|Implementation|