Caviness, June 2019 CUDA driver update
With each release of the CUDA toolkit, nVidia provides updated kernel drivers that the accompanying CUDA libraries require. The 10.0 libraries will not work on a node running the 9.2 kernel drivers, but the 9.2 libraries will work on a node running the 10.0 kernel drivers. Keeping our GPU nodes compatible with as many CUDA releases as possible is therefore a matter of keeping the kernel drivers up to date.
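The compatibility rule above can be expressed as a simple version comparison. The following is a minimal sketch that encodes only the rule as stated here (libraries work when their CUDA release is no newer than the release the node's drivers shipped with); it is not nVidia's official compatibility matrix, and the function names are hypothetical.

```python
def parse_version(text):
    """Reduce a CUDA version string such as '10.1.168' to (major, minor)."""
    major, minor = text.split(".")[:2]
    return int(major), int(minor)


def libraries_supported(driver_cuda, library_cuda):
    """Return True if libraries from CUDA release `library_cuda` should work
    on a node whose kernel drivers shipped with CUDA release `driver_cuda`.

    Assumption: drivers support their own release and all older ones.
    """
    return parse_version(library_cuda) <= parse_version(driver_cuda)


# The two cases described in the text:
print(libraries_supported("9.2", "10.0"))   # 10.0 libraries on 9.2 drivers
print(libraries_supported("10.0", "9.2"))   # 9.2 libraries on 10.0 drivers
```

The first call prints False and the second prints True, matching the behavior described above.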
Available Update
The nVidia CUDA toolkit 10.1.168 was released May 7, 2019, and includes the 418.67 kernel drivers. The latest official TensorFlow containers, for example, require the CUDA toolkit 10.x libraries, so this upgrade is necessary to support that software.
The development GPU node (r00g00, accessible via the devel partition) has been running the 418.67 kernel drivers for one week now without issue.
Implementation
- The GPU nodes' Virtual Node File System (VNFS) image will be cloned. The existing CUDA kernel drivers will be removed and the 418.67 drivers installed in their place.
- The GPU nodes will be rebooted on the new VNFS image once all running jobs complete.
Slurm allows a node to be marked for automatic reboot. A marked node is not considered for scheduling new jobs until it has rebooted, and any jobs that were running when it was marked continue until they complete. Once the node has rebooted and is back online, it automatically begins servicing jobs again. A staged reboot like this has minimal impact on job throughput.
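As an illustration of the staged reboot, the sketch below builds the kind of `scontrol reboot` command that marks nodes this way. The `ASAP` option drains the nodes so no new jobs start on them, and `nextstate=RESUME` returns them to service after the reboot. The exact command used on Caviness is not stated here, so treat this as an assumption; the node names and the helper function are hypothetical, and the script only prints the command rather than running it.

```python
def build_reboot_command(nodes):
    """Build an scontrol command that marks `nodes` for a staged reboot.

    Assumptions: Slurm's `scontrol reboot` with the ASAP option and
    nextstate=RESUME is available, and a RebootProgram is configured
    in slurm.conf on the cluster.
    """
    if not nodes:
        raise ValueError("at least one node name is required")
    return ["scontrol", "reboot", "ASAP", "nextstate=RESUME", ",".join(nodes)]


# Hypothetical GPU node names, for illustration only.
gpu_nodes = ["r01g00", "r01g01"]
print(" ".join(build_reboot_command(gpu_nodes)))
```

Printing the command first, rather than executing it, lets an administrator review the node list before marking anything for reboot.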
Impact
No downtime is expected.
Timeline
Date | Time | Goal/Description |
---|---|---|
2019-06-03 | 17:00 | VNFS image with 418.67 kernel drivers prepared |
2019-06-05 | 17:00 | Staged reboots of GPU nodes commence |
2019-06-10 | 12:30 | Staged reboots of GPU nodes completed |