TensorFlow is a combination of Python code and compiled libraries and tools. Building TensorFlow from source is extremely involved due to the number of dependencies and additional software packages required. Container images of pre-built TensorFlow environments are available on Docker Hub, and Conda packages are available (although they tend to lag behind the latest TensorFlow release).
The recommended way to use TensorFlow on DARWIN is through Conda environments. If Conda environments do not provide the functionality you need, another option may be the pre-installed container images available through VALET. Use of both variants is documented here.
Conda environments are available through the Miniforge VALET package. If you are new to Conda, make sure to review the documentation for Miniforge on DARWIN.
This recipe assumes the user is adding the software to shared workgroup storage, ${WORKDIR_SW}/tensorflow and ${WORKDIR_SW}/valet.
Start by adding Miniforge to the environment (here the default package is used, but you should always check for newer versions with vpkg_versions):
[(my_workgroup:user)@login00.darwin ~]$ vpkg_require miniforge
Adding package `miniforge/25.11.0-1` to your environment
Next, follow the steps in either the GPU support or CPU only section below, depending on whether or not you will be running TensorFlow on GPU nodes.
Although there are GPU-enabled Conda packages for TensorFlow, trying to create a Conda environment with one of these builds will likely fail with an error saying that Conda cannot resolve the environment, since some fundamental dependencies are missing on DARWIN. Therefore, we fall back to using pip from inside the Conda environment to install TensorFlow.1)
All versions of the TensorFlow virtualenv will be stored in the common base directory, ${WORKDIR_SW}/tensorflow; each virtualenv must have a unique name that will become the VALET version. In this tutorial, we will install TensorFlow version 2.17.0 with CUDA support. An appropriate VALET package ID for this version would be 2.17.0:cuda, which can be translated to a VALET-friendly directory name:
[(my_workgroup:user)@login00.darwin ~]$ vpkg_id2path --version-id=2.17.0:cuda
2.17.0-cuda
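The translation vpkg_id2path performs here appears to amount to turning the colon that separates version and variant tag into a hyphen, yielding a filesystem-safe directory name. A rough sketch of that mapping (an illustration only, not the actual VALET implementation):

```shell
# Hypothetical sketch of the version-ID-to-directory translation:
# the ':' between version and variant tag becomes a '-'.
version_id="2.17.0:cuda"
dir_name=$(printf '%s' "$version_id" | tr ':' '-')
echo "$dir_name"   # 2.17.0-cuda
```

The resulting name is what we use as the --prefix directory in the steps below.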
The virtualenv is created using the --prefix option to direct the installation to the desired directory. Note that the tensorflow package itself is not named on the conda create command, since we will be installing it later with pip. We do specify a version of Python (which must be compatible with the version of TensorFlow we will install later). At this point, you could also specify other Conda packages unrelated to TensorFlow if your code needs them, but we do not do so here:
[(my_workgroup:user)@login00.darwin ~]$ conda create --prefix=${WORKDIR_SW}/tensorflow/2.17.0-cuda python==3.10
  :
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use
#
#     $ conda activate /lustre/my_workgroup/sw/tensorflow/2.17.0-cuda
#
# To deactivate an active environment, use
#
#     $ conda deactivate
To complete the TensorFlow installation, we now need to activate the Conda environment and install TensorFlow with pip. The first step is to activate the environment we just created:
[(my_workgroup:user)@login00.darwin ~]$ conda activate ${WORKDIR_SW}/tensorflow/2.17.0-cuda
(/work/workgroup/sw/tensorflow/2.17.0-cuda) [(my_workgroup:user)@login00.darwin ~]$
Before we can run pip install, we need to use VALET to load some packages that pip requires in order to properly build some of TensorFlow's dependencies2):
(/work/workgroup/sw/tensorflow/2.17.0-cuda) [(my_workgroup:user)@login00.darwin ~]$ vpkg_devrequire gcc/14.2 hdf5
Adding package `gcc/14.2.0` to your environment
Adding package `hdf5/1.10.7` to your environment
Finally, we can now install TensorFlow (output from pip install is omitted):
(/work/workgroup/sw/tensorflow/2.17.0-cuda) [(my_workgroup:user)@login00.darwin ~]$ pip install 'tensorflow[and-cuda]==2.17.0'
Use conda deactivate to exit the virtual environment. Roll back the environment changes before proceeding:
(/work/workgroup/sw/tensorflow/2.17.0-cuda) [(my_workgroup:user)@login00.darwin ~]$ conda deactivate
[(my_workgroup:user)@login00.darwin ~]$ vpkg_rollback all
Assuming the workgroup does not already have a TensorFlow VALET package definition, the following YAML config can be modified (e.g. alter the prefix path) and added to the file ${WORKDIR_SW}/valet/tensorflow.vpkg_yaml:
tensorflow:
  prefix: /lustre/my_workgroup/sw/tensorflow
  description: TensorFlow Python environments
  url: "https://www.tensorflow.org"
  flags:
    - no-standard-paths
  versions:
    "2.17.0:cuda":
      description: 2.17.0 with CUDA support and Python 3.10
      dependencies:
        - miniforge
        - gcc/14.2
        - hdf5
      actions:
        - action: source
          script:
            sh: miniforge-activate.sh
          success: 0
If the ${WORKDIR_SW}/valet/tensorflow.vpkg_yaml file already exists, add the new version at the same level as others (under the versions key):
  :
    "2.17.0:cuda":
      description: 2.17.0 with CUDA support and Python 3.10
      dependencies:
        - miniforge
        - gcc/14.2
        - hdf5
      actions:
        - action: source
          script:
            sh: miniforge-activate.sh
          success: 0
    "2.17.0:cpu":
      description: 2.17.0 with no GPU support
  :
With a properly-constructed package definition file, you can now check for your versions of TensorFlow:
[(my_workgroup:user)@login00.darwin ~]$ vpkg_versions tensorflow

Available versions in package (* = default version):
[/lustre/my_workgroup/sw/valet/tensorflow.vpkg_yaml]
tensorflow
* 2.17.0:cuda  2.17.0 with CUDA support and Python 3.10
  :
Any job script designed to run code in this virtualenv should include something like the following toward its end:
  :
#
# Setup TensorFlow virtualenv:
#
vpkg_require tensorflow/2.17.0:cuda

#
# Run a Python script in that virtualenv:
#
python3 my_tf_work.py
rc=$?

#
# Do cleanup work, etc....
#

#
# Exit with whatever exit code our Python script handed back:
#
exit $rc
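The reason the script captures rc=$? immediately after the Python script and exits with it at the very end is that any cleanup commands run in between would otherwise overwrite the script's exit status, and Slurm would record the job as successful even when the script failed. A minimal sketch of the pattern, with a trivial failing command standing in for the Python script:

```shell
run_payload() {
    (exit 3)       # stand-in for "python3 my_tf_work.py" failing with status 3
    rc=$?          # capture the exit status immediately
    echo "cleanup work runs here regardless of the payload outcome"
    return $rc     # hand the remembered status back
}
run_payload
echo "job would exit with status $?"
```

If the capture were omitted, the status of the last cleanup command would be reported instead of the Python script's.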
The conda search tensorflow command can be used to locate the specific version you wish to install. Some examples are shown (with many omitted):
[(my_workgroup:user)@login00.darwin ~]$ conda search tensorflow
Loading channels: done
# Name                       Version           Build  Channel
tensorflow                    2.10.0 cpu_py310hd1aba9c_0  conda-forge
tensorflow                    2.10.0 cpu_py37h08536eb_0  conda-forge
  :
tensorflow                    2.19.1 cpu_py312h69ecde4_52  conda-forge
tensorflow                    2.19.1 cpu_py312h69ecde4_53  conda-forge
tensorflow                    2.19.1 cpu_py312h69ecde4_54  conda-forge
tensorflow                    2.19.1 cuda128py310h40b8f1e_200  conda-forge
tensorflow                    2.19.1 cuda128py310h40b8f1e_201  conda-forge
tensorflow                    2.19.1 cuda128py310h40b8f1e_203  conda-forge
  :
Note that the build tag provides the distinction between variants built on top of specific devices or libraries. For example, the third 2.19.1 package above is built for CPUs with Python 3.12 and translates to the qualified Conda package name tensorflow[version=2.19.1,build=cpu_py312h69ecde4_54]. We will use this version for the sake of this example, but make sure to choose the version that is most relevant for you when you follow along with these steps.
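As an informal illustration, a build tag such as cpu_py312h69ecde4_54 can be read as a device variant, a Python version, a build hash, and a trailing build number. The sketch below pulls those pieces apart with shell parameter expansion; note that this reflects an observed conda-forge naming convention, not a documented format, and may not hold for every package:

```shell
# Informal decoding of a conda-forge build tag:
#   <variant>_py<pyver><hash>_<buildnumber>, e.g. cpu_py312h69ecde4_54
tag="cpu_py312h69ecde4_54"
variant=${tag%%_*}                     # text before the first '_'  -> cpu
build_number=${tag##*_}                # text after the last '_'    -> 54
pyver=${tag#*py}; pyver=${pyver%%h*}   # digits between 'py' and the hash -> 312
py_major=$(printf '%s' "$pyver" | cut -c1)
py_minor=$(printf '%s' "$pyver" | cut -c2-)
echo "variant=$variant python=$py_major.$py_minor build=$build_number"
```

For the example tag this reports a CPU build for Python 3.12 with build number 54, matching the qualified package name used above.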
All versions of the TensorFlow virtualenv will be stored in the common base directory, ${WORKDIR_SW}/tensorflow; each virtualenv must have a unique name that will become the VALET version. In this tutorial, the latest version of TensorFlow built for CPUs will be installed using the tag cpu on the version:
[(my_workgroup:user)@login00.darwin ~]$ vpkg_id2path --version-id=2.19.1:cpu
2.19.1-cpu
The virtualenv is created using the --prefix option to direct the installation to the desired directory:
[(my_workgroup:user)@login00.darwin ~]$ conda create --prefix=${WORKDIR_SW}/tensorflow/2.19.1-cpu 'tensorflow[version=2.19.1,build=cpu_py312h69ecde4_54]'
  :
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use
#
#     $ conda activate /lustre/my_workgroup/sw/tensorflow/2.19.1-cpu
#
# To deactivate an active environment, use
#
#     $ conda deactivate
Assuming the workgroup does not already have a TensorFlow VALET package definition, the following YAML config can be modified (e.g. alter the prefix path) and added to the file ${WORKDIR_SW}/valet/tensorflow.vpkg_yaml:
tensorflow:
  prefix: /lustre/my_workgroup/sw/tensorflow
  description: TensorFlow Python environments
  url: "https://www.tensorflow.org"
  flags:
    - no-standard-paths
  versions:
    "2.19.1:cpu":
      description: 2.19.1, cpu_py312h69ecde4_54 build
      dependencies:
        - miniforge
      actions:
        - action: source
          script:
            sh: miniforge-activate.sh
          success: 0
If the ${WORKDIR_SW}/valet/tensorflow.vpkg_yaml file already exists, add the new version at the same level as others (under the versions key):
:
"2.19.1:cpu":
description: 2.19.1, cpu_py312h69ecde4_54 build
dependencies:
- miniforge
actions:
- action: source
script:
sh: miniforge-activate.sh
success: 0
"2.19.1:gpu":
description: 2.19.1, cuda129py312ha3fd0c4_252 build
:
With a properly-constructed package definition file, you can now check for your versions of TensorFlow:
[(my_workgroup:user)@login00.darwin ~]$ vpkg_versions tensorflow

Available versions in package (* = default version):
[/lustre/my_workgroup/sw/valet/tensorflow.vpkg_yaml]
tensorflow
* 2.19.1:cpu  2.19.1, cpu_py312h69ecde4_54 build
  :
Any job script designed to run code in this virtualenv should include something like the following toward its end:
  :
#
# Setup TensorFlow virtualenv:
#
vpkg_require tensorflow/2.19.1:cpu

#
# Run a Python script in that virtualenv:
#
python3 my_tf_work.py
rc=$?

#
# Do cleanup work, etc....
#

#
# Exit with whatever exit code our Python script handed back:
#
exit $rc
IT RCI maintains TensorFlow Singularity containers for all users of DARWIN:
$ vpkg_versions tensorflow

Available versions in package (* = default version):
[/opt/shared/valet/2.1/etc/tensorflow.vpkg_yaml]
tensorflow      official TensorFlow containers
  2.3:rocm      TF 2.3 with ROCM 4.2 AMD GPU support
* 2.8:rocm      TF 2.8 with ROCM 5.2.0 AMD GPU support
  2.9:rocm      TF 2.9 with ROCM 5.2.0 AMD GPU support
  2.14.0        TF 2.14.0 official Docker runtime image
  2.15:rocm     TF 2.15 with ROCM 6.1 AMD GPU support
  2.16.1        TF 2.16.1 official Docker runtime image
You can write your Python code either somewhere in your home directory ($HOME) or somewhere under your workgroup directory ($WORKDIR). Speak to other group members to understand how you should make use of the workgroup directory, e.g. whether to create a directory for yourself.
Assuming you will use your personal workgroup storage directory ($WORKDIR_USER), create a directory therein for your first TensorFlow job:
$ mkdir -p ${WORKDIR_USER}/tf-test-001
$ cd ${WORKDIR_USER}/tf-test-001
For example, if your TensorFlow Python script is called tf-script.py, copy it into (or create it in) the tf-test-001 directory, then copy the tensorflow.qs job script template:
$ cp /opt/shared/templates/slurm/applications/tensorflow.qs .
The job script template has extensive documentation that should assist you in customizing it for the job. Last but not least, you need to specify the version of TensorFlow you want via VALET, and change the last line to match your Python script name, which for this example is tf-script.py:
  :
#
# Add a TensorFlow container to the environment:
#
vpkg_require tensorflow/2.16.1

#
# Execute our TensorFlow Python script:
#
python3 tf-script.py
Finally, submit the job using the sbatch command:
$ sbatch tensorflow.qs
The DARWIN cluster includes nodes with NVIDIA (CUDA-based) GPGPUs and AMD (ROCM-based) GPUs. TensorFlow images with support for these coprocessors are available; check the vpkg_versions tensorflow listing for versions with the rocm or gpu tags.
This recipe assumes the user is adding the software to shared workgroup storage, ${WORKDIR_SW}/tensorflow and ${WORKDIR_SW}/valet.
Start by adding the Anaconda distribution base to the environment (here 2024.02:python3 is used, but you should always check for newer versions with vpkg_versions):
[(my_workgroup:user)@login01.darwin ~]$ vpkg_require anaconda/2024.02:python3
Adding package `anaconda/2024.02:python3` to your environment
[(my_workgroup:user)@login01.darwin ~]$
The conda search tensorflow command can be used to locate the specific version you wish to install. Some examples are shown (with many omitted):
[(my_workgroup:user)@login01.darwin ~]$ conda search tensorflow
Loading channels: done
# Name                       Version           Build  Channel
tensorflow                     1.4.1               0  pkgs/main
tensorflow                     1.5.0               0  pkgs/main
  :
tensorflow                    2.11.0 eigen_py310h0f08fec_0  pkgs/main
  :
tensorflow                    2.12.0 gpu_py38h03d86b3_0  pkgs/main
  :
tensorflow                    2.12.0 mkl_py39h5ea9445_0  pkgs/main
Note that the build tag provides the distinction between variants built on top of specific devices or libraries. For example, the final item above is built atop the Intel MKL infrastructure and translates to the qualified Conda package name tensorflow[version=2.12.0,build=mkl_py39h5ea9445_0].
All versions of the TensorFlow virtualenv will be stored in the common base directory, ${WORKDIR_SW}/tensorflow; each virtualenv must have a unique name that will become the VALET version. In this tutorial, the latest version of TensorFlow with MKL support will be installed using the tag mkl on the version:
[(my_workgroup:user)@login01 ~]$ vpkg_id2path --version-id=2.12.0:mkl
2.12.0-mkl
The virtualenv is created using the --prefix option to direct the installation to the desired directory:
[(my_workgroup:user)@login01 ~]$ conda create --prefix=${WORKDIR_SW}/tensorflow/2.12.0-mkl 'tensorflow[version=2.12.0,build=mkl_py39h5ea9445_0]'
  :
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use
#
#     $ conda activate /lustre/my_workgroup/sw/tensorflow/2.12.0-mkl
#
# To deactivate an active environment, use
#
#     $ conda deactivate
Assuming the workgroup does not already have a TensorFlow VALET package definition, the following YAML config can be modified (e.g. alter the prefix path) and added to the file ${WORKDIR_SW}/valet/tensorflow.vpkg_yaml:
tensorflow:
  prefix: /lustre/my_workgroup/sw/tensorflow
  description: TensorFlow Python environments
  url: "https://www.tensorflow.org"
  flags:
    - no-standard-paths
  versions:
    "2.12.0:mkl":
      description: 2.12.0, mkl_py39h5ea9445_0 build
      dependencies:
        - anaconda/2024.02:python3
      actions:
        - action: source
          script:
            sh: anaconda-activate-2024.sh
          success: 0
If the ${WORKDIR_SW}/valet/tensorflow.vpkg_yaml file already exists, add the new version at the same level as others (under the versions key):
:
"2.12.0:mkl":
description: 2.12.0, mkl_py39h5ea9445_0 build
dependencies:
- anaconda/2024.02:python3
actions:
- action: source
script:
sh: anaconda-activate-2024.sh
success: 0
"2.12.0:gpu":
description: 2.12.0, gpu_py311h65739b5_0 build
:
With a properly-constructed package definition file, you can now check for your versions of TensorFlow:
[(my_workgroup:user)@login00 ~]$ vpkg_versions tensorflow

Available versions in package (* = default version):
[/lustre/my_workgroup/sw/valet/tensorflow.vpkg_yaml]
tensorflow
* 2.12.0:mkl  2.12.0, mkl_py39h5ea9445_0 build
  :
Any job script designed to run code in this virtualenv should include something like the following toward its end:
  :
#
# Setup TensorFlow virtualenv:
#
vpkg_require tensorflow/2.12.0:mkl

#
# Run a Python script in that virtualenv:
#
python3 my_tf_work.py
rc=$?

#
# Do cleanup work, etc....
#

#
# Exit with whatever exit code our Python script handed back:
#
exit $rc
2) If you run pip install without loading these dependencies, you will see an error saying that a C++ compiler that supports the C++20 standard is needed, which is a hint that a newer version of GCC than the default on DARWIN is required. Similarly, there will be an error saying that an HDF5 shared library cannot be found. Luckily, HDF5 is available via VALET as well.