# DEMO - Accelerate Machine Learning Training with Fluid

This article describes how to use Fluid to deploy the ImageNet dataset stored on Alibaba Cloud OSS to a Kubernetes cluster, and how to train a ResNet-50 model on that dataset using Arena. In this article, we run the training on 4 nodes, each equipped with 8 GPUs.

## Prerequisites

NOTE:

  1. This document requires Fluid to be installed on your Kubernetes cluster. Please refer to the Fluid Installation Guide and finish the installation before going to the next step.

  2. Arena is a CLI that makes it convenient for data scientists to run and monitor machine learning jobs. See the Arena Installation Tutorial for more information.

  3. This demo uses Horovod + TensorFlow + Benchmark, all of them in their open-source versions.
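
Before moving on, it may be worth a quick sanity check that both prerequisites are in place. The commands below are only a suggested check and assume that Fluid was installed into its default fluid-system namespace:

$ # all Fluid controllers should be in the Running state
$ kubectl get pods -n fluid-system
$ # Arena should report its version if installed correctly
$ arena version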

## Deploy Dataset on Kubernetes Cluster with Fluid

### Create Dataset and Runtime

The following dataset.yaml file defines a Dataset and an AlluxioRuntime, separated by `---`.

The dataset is stored on Alibaba Cloud OSS. To ensure that Alluxio can successfully mount the dataset, please make sure that the configurations in dataset.yaml are set correctly, including mountPoint, fs.oss.accessKeyId, fs.oss.accessKeySecret and fs.oss.endpoint.

See Alluxio's official documentation on Aliyun Object Storage Service for more examples of using OSS with Alluxio.

If you'd like to prepare the dataset yourself, please download it from http://image-net.org/download-images.

If you'd like to obtain the ImageNet dataset from us, please open an issue in the Fluid community to ask for it.

This document uses 4 machines for the training task, so spec.replicas is set to 4. To ensure that the data is cached on the V100 machines, nodeAffinity is configured. In addition, the configuration file dataset.yaml below sets a number of parameters, based on our experience, to optimize Alluxio's I/O performance for machine learning tasks at the Alluxio, Fuse and JVM levels. You can adjust these parameters according to your test environment and task requirements.

$ cat << EOF > dataset.yaml
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: imagenet
spec:
  mounts:
  - mountPoint: oss://<OSS_BUCKET>/<OSS_DIRECTORY>/
    name: imagenet
    options:
      fs.oss.accessKeyId: <OSS_ACCESS_KEY_ID>
      fs.oss.accessKeySecret: <OSS_ACCESS_KEY_SECRET>
      fs.oss.endpoint: <OSS_ENDPOINT>
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: aliyun.accelerator/nvidia_name
              operator: In
              values:
                - Tesla-V100-SXM2-16GB
---
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: imagenet
spec:
  replicas: 4
  data:
    replicas: 1
#  alluxioVersion:
#    image: registry.cn-huhehaote.aliyuncs.com/alluxio/alluxio
#    imageTag: "2.3.0-SNAPSHOT-bbce37a"
#    imagePullPolicy: Always
  tieredstore:
    levels:
      - mediumtype: SSD
        path: /var/lib/docker/alluxio
        quota: 50Gi
        high: "0.99"
        low: "0.8"
EOF
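
Before applying the manifest, you may want to confirm that your GPU nodes actually carry the label referenced by the nodeAffinity rule above. This is only an optional check; adjust the label key and value if your nodes are labeled differently:

$ # show the label as an extra column for all nodes
$ kubectl get nodes -L aliyun.accelerator/nvidia_name
$ # or list only the nodes that match the affinity rule
$ kubectl get nodes -l aliyun.accelerator/nvidia_name=Tesla-V100-SXM2-16GB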

Create Dataset and Alluxio Runtime with:

$ kubectl create -f dataset.yaml
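
The Alluxio components run as ordinary pods, so you can watch them come up before inspecting the runtime in detail. The pod names below follow Fluid's usual <dataset-name>-master / -worker / -fuse pattern and may differ slightly between versions:

$ # expect one master, four workers and four fuse pods for the imagenet dataset
$ kubectl get pods -o wide | grep imagenet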

Check the status of the Alluxio Runtime; there should be 1 master, 4 workers and 4 Fuse components running:

$ kubectl describe alluxioruntime imagenet 
Name:         imagenet
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  data.fluid.io/v1alpha1
Kind:         AlluxioRuntime
Metadata:
  # more metadata
Spec:
  # more spec
Status:
  Cache States:
    Cache Capacity:     200GiB
    Cached:             0B
    Cached Percentage:  0%
  Conditions:
    # more conditions
  Current Fuse Number Scheduled:    4
  Current Master Number Scheduled:  1
  Current Worker Number Scheduled:  4
  Desired Fuse Number Scheduled:    4
  Desired Master Number Scheduled:  1
  Desired Worker Number Scheduled:  4
  Fuse Number Available:            4
  # more status
Events:         <none>
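
If you only need a quick summary instead of the full describe output, the short form also works; the exact columns depend on the Fluid version in use:

$ kubectl get alluxioruntime imagenet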

At the same time, the Dataset has been bound to the Alluxio Runtime:

$ kubectl describe dataset
Name:         imagenet
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  data.fluid.io/v1alpha1
Kind:         Dataset
Metadata:
  # more metadata
Spec:
  # more spec
Status:
  Cache States:
    Cache Capacity:     200GiB
    Cached:             0B
    Cached Percentage:  0%
  Conditions:
    Last Transition Time:  2020-08-18T11:01:09Z
    Last Update Time:      2020-08-18T11:02:48Z
    Message:               The ddc runtime is ready.
    Reason:                DatasetReady
    Status:                True
    Type:                  Ready
  Phase:                   Bound
  Runtimes:
    Category:   Accelerate
    Name:       imagenet
    Namespace:  default
    Type:       alluxio
  Ufs Total:    143.7GiB
Events:         <none>

A PV and a PVC, both named imagenet, have been created successfully. At this point, the dataset stored in the cloud has been deployed to the Kubernetes cluster.

$ kubectl get pv,pvc
NAME                        CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM              STORAGECLASS   REASON   AGE
persistentvolume/imagenet   100Gi      RWX            Retain           Bound    default/imagenet                           7m11s

NAME                             STATUS   VOLUME     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
persistentvolumeclaim/imagenet   Bound    imagenet   100Gi      RWX                           7m11s
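
Because the dataset is now exposed as a regular PVC, any pod can consume it by mounting that claim; the Arena command below does exactly this through its --data flag. As a minimal illustration only (the pod name and the busybox image are placeholders, not part of the demo), a pod could mount the cached data like this, provided it is scheduled onto a node running the Alluxio Fuse component:

$ cat << EOF > test-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: fluid-imagenet-test        # hypothetical name, for illustration only
spec:
  containers:
    - name: demo
      image: busybox               # any image works; busybox just lists the files
      command: ["sh", "-c", "ls -lh /data/imagenet && sleep 3600"]
      volumeMounts:
        - mountPath: /data
          name: imagenet-vol
  volumes:
    - name: imagenet-vol
      persistentVolumeClaim:
        claimName: imagenet        # the PVC created by Fluid above
EOF
$ kubectl create -f test-pod.yaml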

## Example: Run Deep Learning Frameworks Using Arena

Arena provides a convenient way for users to submit and monitor machine learning jobs. In this article, we use Arena to simplify the deployment of the training job.

If you have installed Arena and the dataset has been successfully deployed to the local cluster, you can start training a ResNet-50 model by simply executing the following command:

arena submit mpi \
    --name horovod-resnet50-v2-4x8-fluid \
    --gpus=8 \
    --workers=4 \
    --working-dir=/horovod-demo/tensorflow-demo/ \
    --data imagenet:/data \
    -e DATA_DIR=/data/imagenet \
    -e num_batch=1000 \
    -e datasets_num_private_threads=8 \
    --image=registry.cn-hangzhou.aliyuncs.com/tensorflow-samples/horovod-benchmark-dawnbench-v2:0.18.1-tf1.14.0-torch1.2.0-mxnet1.5.0-py3.6 \
    ./launch-example.sh 4 8

Notes:

  - --name: specify the name of the job, horovod-resnet50-v2-4x8-fluid in this example
  - --workers: specify the number of nodes (workers) participating in the training
  - --gpus: specify the number of GPUs used by each worker
  - --working-dir: specify the working directory
  - --data: tell the workers to mount the volume named imagenet to the directory /data
  - -e DATA_DIR: specify the directory where the dataset is located
  - ./launch-example.sh 4 8: run the shell script that launches the training process on 4 nodes with 8 GPUs each
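
After submission, the job should show up in Arena's job list. Listing jobs first is an optional step that gives a quick overview before drilling into a specific one:

$ arena list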

Check whether the job is running normally:

$ arena get horovod-resnet50-v2-4x8-fluid -e
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 16s

NAME                           STATUS   TRAINER  AGE  INSTANCE                                      NODE
horovod-resnet50-v2-4x8-fluid  RUNNING  MPIJOB   16s  horovod-resnet50-v2-4x8-fluid-launcher-czlfn  192.168.1.21
horovod-resnet50-v2-4x8-fluid  RUNNING  MPIJOB   16s  horovod-resnet50-v2-4x8-fluid-worker-0        192.168.1.16
horovod-resnet50-v2-4x8-fluid  RUNNING  MPIJOB   16s  horovod-resnet50-v2-4x8-fluid-worker-1        192.168.1.21
horovod-resnet50-v2-4x8-fluid  RUNNING  MPIJOB   16s  horovod-resnet50-v2-4x8-fluid-worker-2        192.168.1.25
horovod-resnet50-v2-4x8-fluid  RUNNING  MPIJOB   16s  horovod-resnet50-v2-4x8-fluid-worker-3        192.168.3.29

If you see that all 4 workers are in the RUNNING state, congratulations! You have successfully started the training.

If you want to follow the progress of the training, check the Arena logs:

$ arena logs --tail 100 -f horovod-resnet50-v2-4x8-fluid
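
As the training job reads data through the mounted volume, the Alluxio cache gradually warms up. You can also watch the cached ratio of the dataset grow while the job runs; the exact columns may vary between Fluid versions:

$ kubectl get dataset imagenet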

## Clean Up

$ kubectl delete -f dataset.yaml
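
The command above removes the Dataset and the AlluxioRuntime created earlier. If the training job submitted with Arena is still present, it can be removed as well; depending on your Arena version you may need to pass the job type explicitly:

$ arena delete horovod-resnet50-v2-4x8-fluid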