====== TensorFlow using AMD GPUs ======

Complementing the NVIDIA T4 and V100 GPU nodes in the DARWIN cluster are nodes with an AMD Mi50 and an AMD Mi100 GPU coprocessor, respectively.  The AMD GPU devices and software stack are far newer than NVIDIA's CUDA stack, and support for them is still evolving.

One popular AI/ML framework, TensorFlow, has an official DockerHub container release.  DARWIN includes versions of the Singularity toolset which can make use of Docker container images.  With only a single node of each type present in the cluster, multi-node parallelism is not likely to be necessary:  an official container should satisfy most users' needs.

Building the container, however, is a challenge.  Several versions were tried (2.9, 2.8, and an "ancient" 2.3 TensorFlow) with mixed results:

  * The 2.3 image worked as-is (see the direct pull sketched below)
  * The 2.8 and 2.9 images lacked a number of Python modules that prevented tensorflow from loading (numpy, google, protobuf, etc.)
  * The blobs that comprise the container images are quite large; I hit my home directory quota thanks to the blobs Singularity cached in ''~/.singularity''

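For example, the 2.3 image could be pulled straight from DockerHub into a read-only image file.  (The tag name below is a guess based on the naming pattern of the newer images; check DockerHub for the exact tag.)

<code bash>
$ singularity pull tensorflow-2.3.sif docker://rocm/tensorflow:rocm4.2-tf2.3-dev
</code>
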
<WRAP center round important 60%>
Singularity by default maps the user's home directory into the container.  This means that Python modules present under ''~/.local'' will be visible inside the container and may satisfy some of the missing module dependencies.  It's possible those maintaining the official containers on DockerHub have the same scenario interfering with dependency resolution while building the container images.
</WRAP>

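If desired, this mapping can be disabled while testing, so that packages under ''~/.local'' cannot mask genuinely-missing in-container dependencies.  A minimal sketch, using Singularity's ''--no-home'' flag:

<code bash>
$ singularity shell --no-home docker://rocm/tensorflow:rocm5.2.0-tf2.9-dev
</code>
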
The solution with the 2.8 and 2.9 images was to initially build them as read+write //sandbox// images, then run them as user root.  Root has privileges to write to ''/usr'' therein, so ''pip'' can be used to iteratively install missing dependencies.  When all dependencies are satisfied, a read-only image file is generated from the sandbox image.

====== Producing the 2.9 image ======

After adding Singularity to the runtime environment, a directory for the sandbox and read-only images is created, along with a temporary Singularity cache directory:

<code bash>
$ vpkg_require singularity/default
Adding dependency `squashfs-tools/4.5.1` to your environment
Adding package `singularity/3.10.0` to your environment

$ mkdir -p /opt/shared/singularity/images/tensorflow/2.9-rocm

$ export SINGULARITY_CACHEDIR="$(mktemp -d)"
</code>

For the duration of container builds in this shell, the temp directory will hold all cached blobs (rather than putting them in my NFS home directory).  Before exiting this shell it is important that the cache directory be removed (don't forget!).

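When the builds are complete, that cleanup amounts to something like:

<code bash>
$ rm -rf "$SINGULARITY_CACHEDIR"
$ unset SINGULARITY_CACHEDIR
</code>
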
The sandbox image is built as follows:

<code bash>
$ singularity build --sandbox /opt/shared/singularity/images/tensorflow/2.9-rocm/tensorflow-sb.sif \
    docker://rocm/tensorflow:rocm5.2.0-tf2.9-dev
INFO:    Starting build...
Getting image source signatures
Copying blob 8751bf8569be done
   :
2022/07/07 10:12:48  info unpack layer: sha256:4da118ab357bd39e3ce7f4cf1924ae3e4c4421c1a96460bdf0ad33cea2abc496
2022/07/07 10:12:51  info unpack layer: sha256:e9d0198e6dd5d7f70a7a516c3b7fdcdd43844dbb3efec79d68a30f6aa00d3cd8
2022/07/07 10:12:54  info unpack layer: sha256:2206655d3afe836157d63f1a7dc5951244a2058a457abdf6ac0715d482f45636
INFO:    Creating sandbox directory...
INFO:    Build complete: /opt/shared/singularity/images/tensorflow/2.9-rocm/tensorflow-sb.sif
</code>

Next, a shell is started in the sandbox container and used to iteratively add missing Python packages.  Since Singularity runs processes as the initiating user inside the container, it's necessary to be root at this point (so DARWIN users are out of luck here, unless they replicate this work on a machine on which they have root).

<code bash>
$ sudo -s
[root]$ singularity shell --pid --ipc --writable /opt/shared/singularity/images/tensorflow/2.9-rocm/tensorflow-sb.sif

Singularity> python3 -c "import tensorflow as tf ; tf.config.list_physical_devices('GPU')"
   :
ModuleNotFoundError: No module named 'absl'

Singularity> pip3 install absl-py
Collecting absl-py
  Downloading absl_py-1.1.0-py3-none-any.whl (123 kB)
     |████████████████████████████████| 123 kB 13.2 MB/s 
   :
Installing collected packages: absl-py
Successfully installed absl-py-1.1.0

Singularity> python3 -c "import tensorflow as tf ; tf.config.list_physical_devices('GPU')"
   :
ModuleNotFoundError: No module named 'numpy'

Singularity> pip3 install numpy
Collecting numpy
  Using cached numpy-1.23.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.1 MB)
   :

Singularity> python3 -c "import tensorflow as tf ; tf.config.list_physical_devices('GPU')"
2022-07-07 11:33:23.743348: E tensorflow/stream_executor/rocm/rocm_driver.cc:305] failed call to hipInit: HIP_ERROR_InvalidDevice
2022-07-07 11:33:23.743406: I tensorflow/stream_executor/rocm/rocm_diagnostics.cc:112] retrieving ROCM diagnostic information for host: r0login0.localdomain.hpc.udel.edu
2022-07-07 11:33:23.743421: I tensorflow/stream_executor/rocm/rocm_diagnostics.cc:119] hostname: r0login0.localdomain.hpc.udel.edu
2022-07-07 11:33:23.743466: I tensorflow/stream_executor/rocm/rocm_diagnostics.cc:142] librocm reported version is: NOT_FOUND: was unable to find librocm.so DSO loaded into this program
2022-07-07 11:33:23.743522: I tensorflow/stream_executor/rocm/rocm_diagnostics.cc:146] kernel reported version is: UNIMPLEMENTED: kernel reported driver version not implemented
</code>

Since the node on which I was building this container does not have an AMD GPU, no GPU is found by the now-working tensorflow module.  At this point the read-only container image can be created:

<code bash>
[root]$ exit
$ singularity build /opt/shared/singularity/images/tensorflow/2.9-rocm/tensorflow.sif \
    /opt/shared/singularity/images/tensorflow/2.9-rocm/tensorflow-sb.sif
INFO:    Starting build...
INFO:    Creating SIF file...
INFO:    Build complete: /opt/shared/singularity/images/tensorflow/2.9-rocm/tensorflow.sif
</code>

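Since the SIF file is self-contained, the sandbox directory can optionally be removed at this point to reclaim disk space, e.g.:

<code bash>
$ rm -rf /opt/shared/singularity/images/tensorflow/2.9-rocm/tensorflow-sb.sif
</code>
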
====== Test the 2.9 image ======

At this point it's necessary to check that the read-only image sees the AMD GPU device.  This requires a remote shell on a GPU node.  Note that for testing purposes the ''idle'' partition will be used; production usage of the GPU nodes should make use of the ''gpu-mi50'' and ''gpu-mi100'' partitions on DARWIN.

<code bash>
$ salloc --partition=idle --gpus=amd_mi100:1
salloc: Pending job allocation ######
salloc: job 918608 queued and waiting for resources
salloc: job 918608 has been allocated resources
salloc: Granted job allocation ######
salloc: Waiting for resource configuration
salloc: Nodes r0m01 are ready for job
</code>

Add Singularity to the runtime environment and reference the 2.9 container image:

<code bash>
$ vpkg_require singularity/default
Adding dependency `squashfs-tools/4.5.1` to your environment
Adding package `singularity/3.10.0` to your environment

$ export SINGULARITY_IMAGE=/opt/shared/singularity/images/tensorflow/2.9-rocm/tensorflow.sif

$ Sshell
Singularity> rocm-smi

======================= ROCm System Management Interface =======================
================================= Concise Info =================================
GPU  Temp   AvgPwr  SCLK    MCLK     Fan  Perf  PwrCap  VRAM%  GPU%
0    27.0c  34.0W   300Mhz  1200Mhz  0%   auto  290.0W    0%   0%
================================================================================
============================= End of ROCm SMI Log ==============================
</code>

Excellent!  Inside the container, the ''rocm-smi'' utility is able to see the AMD GPU and read stats from it.  Finally, will tensorflow be able to see it?

<code bash>
Singularity> python3
Python 3.9.13 (main, May 23 2022, 22:01:06) 
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> tf.config.list_physical_devices('GPU')
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
>>> 
</code>

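As an additional (hypothetical) smoke test, a small computation can be pinned to the device and its placement checked; the device string shown is the form TensorFlow typically reports:

<code bash>
>>> with tf.device('/GPU:0'):
...     c = tf.matmul(tf.random.normal((1024, 1024)), tf.random.normal((1024, 1024)))
... 
>>> c.device
'/job:localhost/replica:0/task:0/device:GPU:0'
</code>
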
====== VALET configuration ======

To make the containers as easy as possible for end users to use, VALET is leveraged:

<code bash>
$ cat /opt/shared/valet/etc/tensorflow.vpkg_yaml
tensorflow:
    prefix: /opt/shared/singularity/images/tensorflow
    description: official TensorFlow containers
    url: "https://hub.docker.com/r/rocm/tensorflow"

    actions:
        - variable: SINGULARITY_IMAGE
          action: set
          value: ${VALET_PATH_PREFIX}/tensorflow.sif

    dependencies:
        - singularity/default

    versions:
        "2.9:rocm":
            description: TF 2.9 with ROCM 5.2.0 AMD GPU support

        "2.8:rocm":
            description: TF 2.8 with ROCM 5.2.0 AMD GPU support

        "2.3:rocm":
            description: TF 2.3 with ROCM 4.2 AMD GPU support

$ vpkg_versions tensorflow

Available versions in package (* = default version):

[/opt/shared/valet/2.1/etc/tensorflow.vpkg_yaml]
tensorflow  official TensorFlow containers
  2.3:rocm  TF 2.3 with ROCM 4.2 AMD GPU support
  2.8:rocm  TF 2.8 with ROCM 5.2.0 AMD GPU support
* 2.9:rocm  TF 2.9 with ROCM 5.2.0 AMD GPU support
</code>

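With the package file in place, an interactive session can (for example) load a version and drop straight into the container; the ''vpkg_require'' output here mirrors what the batch job below reports:

<code bash>
$ vpkg_require tensorflow/2.9:rocm
Adding dependency `squashfs-tools/4.5.1` to your environment
Adding dependency `singularity/3.10.0` to your environment
Adding package `tensorflow/2.9:rocm` to your environment

$ Sshell
Singularity> 
</code>
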
====== End-user usage of 2.9 image ======

To test a Python script running in the tensorflow container, we first write the Python script:

<file python tf-test.py>
#!/usr/bin/env python

import tensorflow as tf

print('There are {:d} GPU devices visible to this TF'.format(
    len(tf.config.list_physical_devices('GPU')))
  )

</file>

The job script requests all of the node's memory (''--mem=0''), all 128 CPU cores, and the Mi100 GPU device.  Like ''Sshell'' above, the ''Srun'' helper runs the given command inside the container image selected by ''SINGULARITY_IMAGE'':

<file bash tf-test.qs>
#!/bin/bash
#
#SBATCH --partition=gpu-mi100
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=128
#SBATCH --mem=0
#SBATCH --gpus=amd_mi100:1
#

vpkg_require tensorflow/2.9:rocm

Srun python3 tf-test.py

</file>

These two files were copied to a working directory named ''tf-2.9-test'' (in my home directory) and the batch job submitted:

<code bash>
$ cd ~/tf-2.9-test

$ ls -l
total 19
-rw-r--r-- 1 frey it_nss 164 Jul  7 12:55 tf-test.py
-rw-r--r-- 1 frey it_nss 214 Jul  7 12:54 tf-test.qs

$ sbatch tf-test.qs
Submitted batch job 918657
</code>

After the job has run:

<code bash>
$ ls -l
total 29
-rw-r--r-- 1 frey it_nss 220 Jul  7 12:55 slurm-918657.out
-rw-r--r-- 1 frey it_nss 164 Jul  7 12:55 tf-test.py
-rw-r--r-- 1 frey it_nss 214 Jul  7 12:54 tf-test.qs

$ cat slurm-918657.out 
Adding dependency `squashfs-tools/4.5.1` to your environment
Adding dependency `singularity/3.10.0` to your environment
Adding package `tensorflow/2.9:rocm` to your environment
There are 1 GPU devices visible to this TF
</code>

Hooray!