====== TensorFlow using AMD GPUs ======
Complementing the NVIDIA T4 and V100 GPU nodes in the DARWIN cluster are nodes with an AMD Mi50 and an AMD Mi100 GPU coprocessor.

One popular AI/ML framework, TensorFlow, has an official DockerHub container release.

Building the container, however, is a challenge:

  * The 2.3 image worked as-is
  * The 2.8 and 2.9 images lacked a number of Python modules that prevented tensorflow from loading (numpy, google, protobuf, etc.)
  * The blobs that comprise the container images are quite large: I hit my home directory quota thanks to the blobs Singularity cached in ''~/.singularity/cache'' by default

<WRAP center round important 60%>
Singularity by default maps the user's home directory to the container.
</WRAP>
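
If the home-directory mapping itself is a problem (quota pressure, stray ''~/.local'' packages leaking into the container), Singularity can be told to skip or substitute it. A minimal sketch — the image and directory paths here are illustrative, not the ones used in this build:

<code bash>
# skip the automatic $HOME bind mount entirely
$ singularity shell --no-home /path/to/image.sif

# ...or bind a scratch directory as the container-side home instead
$ singularity shell --home /tmp/scratch-home /path/to/image.sif
</code>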

The solution with the 2.8 and 2.9 images was to initially build them as read+write //sandbox// images, then run them as user root. Root has privileges to write to the sandbox's filesystem, so the missing Python modules could be ''pip''-installed in-place.
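
Once the list of missing modules is known, the same fix-ups could in principle be scripted in a Singularity definition file rather than applied interactively. A hedged sketch — the source image tag and module list below are illustrative, not verified against this build:

<file bash tensorflow-rocm.def>
Bootstrap: docker
From: rocm/tensorflow:latest

%post
    # add the Python modules the stock image was missing
    pip3 install absl-py numpy protobuf
</file>

Such a definition file would then be built straight into a SIF image with ''sudo singularity build tensorflow.sif tensorflow-rocm.def''.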

====== Producing the 2.9 image ======

After adding Singularity to the runtime environment, a temporary directory is created and designated as the Singularity cache:

<code bash>
$ vpkg_require singularity/...
Adding dependency `squashfs-tools/...`
Adding package `singularity/...`

$ mkdir -p /...

$ export SINGULARITY_CACHEDIR="..."
</code>

For the duration of container builds in this shell, the temporary directory will hold all cached blobs (rather than putting them in my NFS home directory).
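
Singularity can also report and prune whatever it has cached, which is handy when quota is tight; for example:

<code bash>
# show what is in the cache (and how much space it uses)
$ singularity cache list

# remove cached blobs/images
$ singularity cache clean
</code>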

The sandbox image is built thusly:

<code bash>
$ singularity build --sandbox /... \
      docker://...
INFO:    Starting build...
Getting image source signatures
Copying blob 8751bf8569be done
   :
INFO:    Creating sandbox directory...
INFO:    Build complete: /...
</code>

Next, a shell is started in the sandbox container and used to iteratively add missing Python packages until tensorflow imports cleanly:

<code bash>
$ sudo -s
[root]$ singularity shell --pid --ipc --writable /...

Singularity> python3 -c 'import tensorflow'
   :
ModuleNotFoundError: No module named 'absl'

Singularity> pip3 install absl-py
Collecting absl-py
  Downloading absl_py-1.1.0-py3-none-any.whl (123 kB)
   :
Installing collected packages: absl-py
Successfully installed absl-py-1.1.0

Singularity> python3 -c 'import tensorflow'
   :
ModuleNotFoundError: No module named 'numpy'
Singularity> pip3 install numpy
Collecting numpy
  Using cached numpy-1.23.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.1 MB)
   :

Singularity> python3 -c 'import tensorflow'
2022-07-07 11:...
   :
</code>
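
With the full list of missing modules in hand, the interactive loop above could presumably be collapsed into a single non-interactive pass over the sandbox (module list illustrative):

<code bash>
$ sudo singularity exec --writable /... \
      pip3 install absl-py numpy protobuf
</code>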

Since the node on which I was building this container does not have an AMD GPU, no GPU is found by the now-working tensorflow module. With the import succeeding, the sandbox is converted to a production (read-only) SIF image:

<code bash>
[root]$ exit
$ singularity build /... \
      /...
INFO:    Starting build...
INFO:    Creating SIF file...
INFO:    Build complete: /...
</code>

====== Test the 2.9 image ======

At this point it's necessary to check that the read-only image sees the AMD GPU device, so an interactive job is requested on a node with the Mi100:

<code bash>
$ salloc --partition=idle --gpus=amd_mi100:1
salloc: Pending job allocation 918608
salloc: job 918608 queued and waiting for resources
salloc: job 918608 has been allocated resources
salloc: Granted job allocation 918608
salloc: Waiting for resource configuration
salloc: Nodes r0m01 are ready for job
</code>

Add Singularity to the runtime environment and reference the 2.9 container image:

<code bash>
$ vpkg_require singularity/...
Adding dependency `squashfs-tools/...`
Adding package `singularity/...`

$ export SINGULARITY_IMAGE=/...

$ Sshell
Singularity> rocm-smi

======================= ROCm System Management Interface =======================
================================= Concise Info =================================
GPU  Temp   AvgPwr  ...
0    27.0c  34.0W   ...
================================================================================
============================= End of ROCm SMI Log ==============================
</code>
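
For a second opinion, ''rocminfo'' (also part of the ROCm stack) enumerates the agents the runtime can see; a quick filter might look like:

<code bash>
Singularity> rocminfo | grep -i 'marketing name'
</code>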

Excellent! The GPU device is visible inside the container. The final check is that tensorflow itself can see it:

<code bash>
Singularity> python3
Python 3.9.13 (main, May 23 2022, 22:...
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> tf.config.list_physical_devices('GPU')
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
>>>
</code>
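
Device enumeration alone doesn't exercise the GPU. A short follow-up computation — a sketch using the standard TensorFlow 2.x API, not part of the original session — can confirm that kernels actually launch on the device:

<code bash>
Singularity> python3 - <<'EOF'
import tensorflow as tf

# pin a small matrix multiply to the first GPU and force execution
with tf.device('/GPU:0'):
    c = tf.matmul(tf.random.uniform((1024, 1024)),
                  tf.random.uniform((1024, 1024)))
print('matmul OK, result shape:', c.shape)
EOF
</code>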

====== VALET configuration ======

To make the containers as easy as possible for end users to use, VALET is leveraged:

<code bash>
$ cat /...
tensorflow:
    prefix: /...
    description: ...
    url: "..."

    actions:
        - variable: SINGULARITY_IMAGE
          action: set
          value: ${VALET_PATH_PREFIX}/...

    dependencies:
        - singularity/...

    versions:
        "2.3":
            description: ...
        "2.8":
            description: ...
        "2.9":
            description: ...

$ vpkg_versions tensorflow

Available versions in package (* = default version):

[/...]
tensorflow  ...
  2.3: ...
  2.8: ...
* 2.9: ...
</code>
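
The net effect for an end user: loading the package pulls Singularity into the environment and sets ''SINGULARITY_IMAGE'', which the ''Sshell'' and ''Srun'' helpers use. For example (image path elided):

<code bash>
$ vpkg_require tensorflow/2.9
Adding dependency `squashfs-tools/...`
Adding dependency `singularity/...`
Adding package `tensorflow/2.9...`

$ echo "$SINGULARITY_IMAGE"
/...
</code>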

====== End-user usage of 2.9 image ======

To test a Python script running in the tensorflow container, we first write the Python script:

<file python tf-test.py>
#!/usr/bin/env python3

import tensorflow as tf

print('There are {:d} GPU devices visible to this TF'.format(
        len(tf.config.list_physical_devices('GPU'))
    )
)
</file>

The job script requests all of the node's memory (''--mem=0'' is Slurm shorthand for that), all 128 CPU cores, and the Mi100 GPU device:

<file bash tf-test.qs>
#!/bin/bash
#
#SBATCH --partition=gpu-mi100
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=128
#SBATCH --mem=0
#SBATCH --gpus=amd_mi100:1
#

vpkg_require tensorflow/2.9

Srun python3 tf-test.py
</file>

These two files were copied to a working directory and the job was submitted from there:

<code bash>
$ cd ~/...

$ ls -l
total 19
-rw-r--r-- 1 frey it_nss 164 Jul 7 12:55 tf-test.py
-rw-r--r-- 1 frey it_nss 214 Jul 7 12:54 tf-test.qs

$ sbatch tf-test.qs
Submitted batch job 918657
</code>

After the job has run:

<code bash>
$ ls -l
total 29
-rw-r--r-- 1 frey it_nss 220 Jul 7 12:55 slurm-918657.out
-rw-r--r-- 1 frey it_nss 164 Jul 7 12:55 tf-test.py
-rw-r--r-- 1 frey it_nss 214 Jul 7 12:54 tf-test.qs

$ cat slurm-918657.out
Adding dependency `squashfs-tools/...`
Adding dependency `singularity/...`
Adding package `tensorflow/2.9...`
There are 1 GPU devices visible to this TF
</code>

Hooray!