Caviness, June 2019 CUDA driver update

With each release of the CUDA toolkit, nVidia provides updated kernel drivers that the accompanying CUDA libraries require for compatibility. For example, the 10.0 libraries will not work on a node running the 9.2 kernel drivers, but the 9.2 libraries will work on a node running the 10.0 kernel drivers. Keeping our GPU nodes as CUDA-compatible as possible is therefore a matter of keeping the kernel drivers up-to-date.
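
As an illustration of that rule (a sketch for reference, not part of the update itself), the Python snippet below compares the CUDA version supported by the installed kernel driver against the version of the toolkit runtime library. It assumes libcuda.so.1 and a runtime library named libcudart.so are on the loader path; the exact runtime soname depends on which toolkit is loaded.

  import ctypes

  def _cuda_version(lib_name, symbol):
      """Load a CUDA library and return the integer version it reports."""
      lib = ctypes.CDLL(lib_name)
      version = ctypes.c_int(0)
      status = getattr(lib, symbol)(ctypes.byref(version))
      if status != 0:
          raise RuntimeError("%s returned status %d" % (symbol, status))
      # Versions are encoded as 1000*major + 10*minor, e.g. 10010 means 10.1.
      return version.value

  driver = _cuda_version("libcuda.so.1", "cuDriverGetVersion")      # newest CUDA the driver supports
  runtime = _cuda_version("libcudart.so", "cudaRuntimeGetVersion")  # CUDA version of the loaded libraries

  print("driver supports CUDA %d.%d" % (driver // 1000, (driver % 1000) // 10))
  print("toolkit libraries are CUDA %d.%d" % (runtime // 1000, (runtime % 1000) // 10))

  # 10.0 libraries on a 9.2 driver: runtime > driver, so they will not work.
  # 9.2 libraries on a 10.0 driver: runtime <= driver, so they will.
  if runtime > driver:
      print("Incompatible: the kernel driver is older than the toolkit libraries.")
  else:
      print("Compatible.")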

Available Update

The nVidia CUDA toolkit 10.1.168 was released May 7, 2019, and includes the 418.67 kernel drivers. The latest official TensorFlow containers, for example, require the CUDA toolkit 10.x libraries, so this upgrade is necessary to support that software.

The development GPU node (r00g00, accessible via the devel partition) has been running the 418.67 kernel drivers for one week now without issue.
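
Users who want to confirm which kernel driver a node is running can use a quick check like the sketch below from a job on that node. It assumes the pynvml bindings (the nvidia-ml-py package) are available; the same information also appears in /proc/driver/nvidia/version.

  import pynvml

  pynvml.nvmlInit()
  version = pynvml.nvmlSystemGetDriverVersion()  # e.g. "418.67" after the update
  pynvml.nvmlShutdown()

  # Older pynvml releases return bytes rather than str.
  if isinstance(version, bytes):
      version = version.decode()
  print("NVIDIA kernel driver:", version)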

Caviness users who feel that an upgrade to the nVidia kernel drivers will adversely impact their software should open an IT trouble ticket prior to close-of-business on Wednesday, June 5, 2019.

Implementation

  1. The GPU nodes' Virtual Node File System (VNFS) image will be cloned. The existing CUDA kernel drivers will be removed from the clone and the 418.67 drivers installed in their place (see the sketch after this list).
  2. The GPU nodes will be rebooted on the new VNFS image once all running jobs complete.
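
The sketch below is a rough illustration of step 1 only: clone an image tree, then swap the driver packages inside it. The chroot paths and package names are hypothetical placeholders, and the actual VNFS image management tooling on Caviness is not shown.

  import subprocess

  # Hypothetical paths and package names -- placeholders for illustration only.
  OLD_CHROOT = "/var/chroots/gpu-node"
  NEW_CHROOT = "/var/chroots/gpu-node-418.67"
  OLD_DRIVER_PKGS = ["nvidia-kmod", "nvidia-driver"]
  NEW_DRIVER_RPMS = ["nvidia-kmod-418.67.rpm", "nvidia-driver-418.67.rpm"]

  def run(cmd):
      print("+ " + " ".join(cmd))
      subprocess.run(cmd, check=True)

  # 1. Clone the existing image tree.
  run(["rsync", "-a", OLD_CHROOT + "/", NEW_CHROOT + "/"])

  # 2. Remove the old CUDA kernel drivers inside the clone.
  run(["yum", "--installroot=" + NEW_CHROOT, "-y", "remove"] + OLD_DRIVER_PKGS)

  # 3. Install the 418.67 drivers in their place.
  run(["yum", "--installroot=" + NEW_CHROOT, "-y", "install"] + NEW_DRIVER_RPMS)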

Slurm allows a node to be marked for automatic reboot. Once marked, the node is not considered for scheduling new jobs, but any jobs already running on it continue until they complete. After the node has rebooted and is back online, it automatically begins servicing jobs again. A staged reboot like this has minimal impact on job throughput.
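
For reference, marking a node this way uses Slurm's scontrol reboot subcommand. The sketch below is illustrative only: the node list is hypothetical, and the ASAP and nextstate keywords assume a sufficiently recent Slurm release.

  import subprocess

  # Hypothetical GPU node range -- for illustration only.
  gpu_nodes = "r00g[00-03]"

  # ASAP: stop scheduling new jobs and reboot as soon as running jobs finish.
  # nextstate=RESUME: return the node to service automatically after the reboot.
  subprocess.run(
      ["scontrol", "reboot", "ASAP",
       "nextstate=RESUME",
       "reason=CUDA_418.67_driver_update",
       gpu_nodes],
      check=True,
  )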

Impact

No cluster downtime is expected; each GPU node will be rebooted individually once its running jobs complete.

Timeline

Date         Time    Goal/Description
2019-06-03   17:00   VNFS image with 418.67 kernel drivers prepared
2019-06-05   17:00   Staged reboots of GPU nodes commence
2019-06-10   12:30   Staged reboots of GPU nodes completed