With each release of the CUDA toolkit, NVIDIA provides updated kernel drivers that the accompanying CUDA libraries require for compatibility. The CUDA 10.0 libraries will not work on a node running the 9.2 kernel drivers; however, the 9.2 libraries will work on a node running the 10.0 kernel drivers. Keeping our GPU nodes as CUDA-compatible as possible is therefore a matter of keeping the kernel drivers up-to-date.
The NVIDIA CUDA toolkit 10.1.168 was released May 7, 2019, and includes the 418.67 kernel drivers. The latest official TensorFlow containers, for example, require the CUDA toolkit 10.x libraries, so this upgrade is necessary to support that software.
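One quick way to confirm which kernel driver a node is currently running is to query `nvidia-smi`; a minimal sketch (the node name is illustrative, not a specific cluster host):

```shell
# Report the installed NVIDIA kernel driver version on a GPU node.
# The hostname here is hypothetical; any GPU node with nvidia-smi works.
ssh r00g00 nvidia-smi --query-gpu=driver_version --format=csv,noheader
# A node upgraded for CUDA 10.1 should report 418.67 (or newer).
```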
The development GPU node (`r00g00`, accessible via the `devel` partition) has been running the 418.67 kernel drivers for one week now without issue.
Slurm allows a node to be marked for automatic reboot. Such a node is not considered for scheduling new jobs until it has rebooted, but any jobs that were running when the node was marked will continue running until complete. Once the node has rebooted and is back online, it automatically begins servicing jobs again. Staged reboots like this have minimal impact on job throughput.
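The staged-reboot mechanism described above corresponds to Slurm's `scontrol reboot` subcommand; a sketch of how the reboots might be issued (the node list and reason string are illustrative):

```shell
# Mark GPU nodes for reboot once their running jobs finish.  "ASAP"
# drains each node so no new jobs are scheduled on it in the meantime;
# nextstate=RESUME returns the node to service automatically afterward.
# The node list is hypothetical.
scontrol reboot ASAP nextstate=RESUME reason="418.67 driver upgrade" r00g[00-07]

# Monitor node states (drain/reboot/idle) while the staged reboots proceed.
sinfo -N -l
```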
No downtime is expected.
| Date | Time | Goal/Description |
|---|---|---|
| 2019-06-03 | 17:00 | VNFS image with 418.67 kernel drivers prepared |
| 2019-06-05 | 17:00 | Staged reboots of GPU nodes commence |
| 2019-06-10 | 12:30 | Staged reboots of GPU nodes complete |