# DEMO - Accelerate PVC with Fluid
# Test scenario: ResNet50 model training
- Machine: V100 x8
- NFS Server: 38037492dc-pol25.cn-shanghai.nas.aliyuncs.com
# Settings
# Hardware
Cluster | Alibaba Cloud Kubernetes, v1.16.9-aliyun.1
---|---
ECS Instance | ecs.gn6v-c10g1.20xlarge (CPU: 82 cores)
Distributed Storage | NAS
# Software
Software versions (image tag): 0.18.1-tf1.14.0-torch1.2.0-mxnet1.5.0-py3.6, i.e. Horovod 0.18.1, TensorFlow 1.14.0, PyTorch 1.2.0, MXNet 1.5.0, Python 3.6.
# Prerequisites
- Fluid (version >= 0.3.0)
- Arena (version >= 0.4.0)
- Horovod (version = 0.18.1)
- Benchmark
# Prepare Dataset
- Download

```shell
$ wget http://imagenet-tar.oss-cn-shanghai.aliyuncs.com/imagenet.tar.gz
```

- Unpack (see the note on pigz below)

```shell
$ tar -I pigz -xvf imagenet.tar.gz
```
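`tar -I pigz` hands decompression to pigz (parallel gzip), which speeds up unpacking on a multi-core machine but is usually not preinstalled. A minimal sketch, assuming a Debian/Ubuntu host:

```shell
# Install pigz so that `tar -I pigz` can use it (Debian/Ubuntu shown)
$ sudo apt-get install -y pigz
# Plain gzip also works, just single-threaded:
$ tar -xzvf imagenet.tar.gz
```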
# NFS Dawnbench
# Deploy Dataset
- Export Dataset on Your NFS Server (a sketch of a typical export follows)
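With Alibaba Cloud NAS the export is managed by the service, so this step only applies to a self-hosted NFS server. For reference, a typical read-only export looks roughly like the following; the path and client CIDR are hypothetical:

```shell
# On the NFS server: export the dataset directory read-only (hypothetical values)
$ echo '/data/imagenet 192.168.0.0/16(ro,sync)' | sudo tee -a /etc/exports
$ sudo exportfs -ra   # re-export all entries in /etc/exports
```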
- Create Volume Using Kubernetes
```shell
$ cat <<EOF > nfs.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-imagenet
spec:
  capacity:
    storage: 150Gi
  volumeMode: Filesystem
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: nfs
  mountOptions:
    - vers=3
    - nolock
    - proto=tcp
    - rsize=1048576
    - wsize=1048576
    - hard
    - timeo=600
    - retrans=2
    - noresvport
  nfs:
    path: <YOUR_PATH_TO_DATASET>
    server: <YOUR_NFS_SERVER>
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-imagenet
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 150Gi
  storageClassName: nfs
EOF
```
NOTE: Please replace `<YOUR_PATH_TO_DATASET>` and `<YOUR_NFS_SERVER>` with the path to your dataset and your NFS server address.
```shell
$ kubectl create -f nfs.yaml
```
- Check Volume

```shell
$ kubectl get pv,pvc
NAME                            CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                  STORAGECLASS   REASON   AGE
persistentvolume/nfs-imagenet   150Gi      ROX            Retain           Bound    default/nfs-imagenet   nfs                     45s

NAME                                 STATUS   VOLUME         CAPACITY   ACCESS MODES   STORAGECLASS   AGE
persistentvolumeclaim/nfs-imagenet   Bound    nfs-imagenet   150Gi      ROX            nfs            45s
```
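Optionally, you can sanity-check that the claim actually exposes the dataset by mounting it in a throwaway pod before training; a minimal sketch (the pod name and image are arbitrary choices):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nfs-check
spec:
  containers:
    - name: shell
      image: busybox
      # List the dataset directory, then idle so the pod stays around for inspection
      command: ["sh", "-c", "ls /data/imagenet; sleep 3600"]
      volumeMounts:
        - name: imagenet
          mountPath: /data
          readOnly: true
  volumes:
    - name: imagenet
      persistentVolumeClaim:
        claimName: nfs-imagenet
```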
# Dawnbench
NOTE: `1x8` means 1 machine with 8 GPU cards.
# 1x8
```shell
arena submit mpi \
  --name horovod-resnet50-v2-1x8-nfs \
  --gpus=8 \
  --workers=1 \
  --working-dir=/horovod-demo/tensorflow-demo/ \
  --data nfs-imagenet:/data \
  -e DATA_DIR=/data/imagenet \
  -e num_batch=1000 \
  -e datasets_num_private_threads=8 \
  --image=registry.cn-hangzhou.aliyuncs.com/tensorflow-samples/horovod-benchmark-dawnbench-v2:0.18.1-tf1.14.0-torch1.2.0-mxnet1.5.0-py3.6 \
  ./launch-example.sh 1 8
```
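Once submitted, the job can be tracked with Arena; the job name is the `--name` value above:

```shell
$ arena list                                # status of all submitted jobs
$ arena logs horovod-resnet50-v2-1x8-nfs    # training logs
```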
# 4x8
```shell
arena submit mpi \
  --name horovod-resnet50-v2-4x8-nfs \
  --gpus=8 \
  --workers=4 \
  --working-dir=/horovod-demo/tensorflow-demo/ \
  --data nfs-imagenet:/data \
  -e DATA_DIR=/data/imagenet \
  -e num_batch=1000 \
  -e datasets_num_private_threads=8 \
  --image=registry.cn-hangzhou.aliyuncs.com/tensorflow-samples/horovod-benchmark-dawnbench-v2:0.18.1-tf1.14.0-torch1.2.0-mxnet1.5.0-py3.6 \
  ./launch-example.sh 4 8
```
NOTE: If the NFS volume cannot be deleted, it is because Arena leaves a launcher pod behind after training finishes, so Kubernetes still considers the volume in use. Execute the following command to force-delete the volume:

```shell
$ kubectl patch pvc nfs-imagenet -p '{"metadata":{"finalizers": []}}' --type=merge
```
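Alternatively, deleting the leftover launcher pod releases the volume without touching finalizers. With the MPI operator that Arena uses, the launcher pod is typically named `<job-name>-launcher`; the exact name below is an assumption:

```shell
$ kubectl get pods | grep launcher                          # find the leftover launcher pod
$ kubectl delete pod horovod-resnet50-v2-4x8-nfs-launcher   # hypothetical pod name
```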
# Accelerate PVC with Fluid
# Deploy Dataset
- Follow Previous Steps to Create NFS Volume
- Deploy Fluid to Accelerate NFS Volume
```shell
$ cat <<EOF > dataset.yaml
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: fluid-imagenet
spec:
  mounts:
    - mountPoint: pvc://nfs-imagenet
      name: nfs-imagenet
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: aliyun.accelerator/nvidia_name
              operator: In
              values:
                - Tesla-V100-SXM2-16GB
---
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: fluid-imagenet
spec:
  replicas: 4
  data:
    replicas: 1
  tieredstore:
    levels:
      - mediumtype: SSD
        path: /var/lib/docker/alluxio
        quota: 150Gi
        high: "0.99"
        low: "0.8"
EOF
```
NOTE:
- Please keep `spec.replicas` consistent with the number of machines you are going to use for machine learning.
- `nodeSelectorTerms` is used to restrict scheduling to machines with V100 GPUs only.
```shell
$ kubectl create -f dataset.yaml
```
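Since `Dataset` and `AlluxioRuntime` are CRDs, plain kubectl can report their status; the `Dataset` should eventually reach the `Bound` phase (output columns vary across Fluid versions):

```shell
$ kubectl get dataset fluid-imagenet         # PHASE should become Bound
$ kubectl get alluxioruntime fluid-imagenet  # worker/fuse readiness
```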
- Check Volume

```shell
$ kubectl get pv,pvc
NAME                              CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                    STORAGECLASS   REASON   AGE
persistentvolume/fluid-imagenet   100Gi      RWX            Retain           Bound    default/fluid-imagenet                           1s
persistentvolume/nfs-imagenet     150Gi      ROX            Retain           Bound    default/nfs-imagenet     nfs                     16m

NAME                                   STATUS   VOLUME           CAPACITY   ACCESS MODES   STORAGECLASS   AGE
persistentvolumeclaim/fluid-imagenet   Bound    fluid-imagenet   100Gi      RWX                           0s
persistentvolumeclaim/nfs-imagenet     Bound    nfs-imagenet     150Gi      ROX            nfs            16m
```
# Dawnbench
# 1x8
```shell
arena submit mpi \
  --name horovod-resnet50-v2-1x8-fluid \
  --gpus=8 \
  --workers=1 \
  --working-dir=/horovod-demo/tensorflow-demo/ \
  --data fluid-imagenet:/data \
  -e DATA_DIR=/data/nfs-imagenet/imagenet \
  -e num_batch=1000 \
  -e datasets_num_private_threads=8 \
  --image=registry.cn-hangzhou.aliyuncs.com/tensorflow-samples/horovod-benchmark-dawnbench-v2:0.18.1-tf1.14.0-torch1.2.0-mxnet1.5.0-py3.6 \
  ./launch-example.sh 1 8
```
# 4x8
```shell
arena submit mpi \
  --name horovod-resnet50-v2-4x8-fluid \
  --gpus=8 \
  --workers=4 \
  --working-dir=/horovod-demo/tensorflow-demo/ \
  --data fluid-imagenet:/data \
  -e DATA_DIR=/data/nfs-imagenet/imagenet \
  -e num_batch=1000 \
  -e datasets_num_private_threads=8 \
  --image=registry.cn-hangzhou.aliyuncs.com/tensorflow-samples/horovod-benchmark-dawnbench-v2:0.18.1-tf1.14.0-torch1.2.0-mxnet1.5.0-py3.6 \
  ./launch-example.sh 4 8
```
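In the results below, fluid (cold) means the Alluxio cache starts empty, while fluid (warm) means the dataset is already cached, for example after a previous run. Newer Fluid releases (v0.5.0 and later) also provide a `DataLoad` resource to prewarm the cache explicitly; a minimal sketch, assuming such a release:

```yaml
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: imagenet-warmup
spec:
  dataset:
    name: fluid-imagenet   # the Dataset created above
    namespace: default
```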
# Experiment Results
# horovod-1x8
Metric | nfs | fluid (cold) | fluid (warm)
---|---|---|---
Training time | 3h49m10s | 3h50m40s | 3h34m15s
Speed at step 1000 (images/second) | 2400.8 | 2378.4 | 9327.6
Speed at the last step (images/second) | 8696.8 | 8692.8 | 9301.6
Steps | 56300 | 56300 | 56300
Top-5 accuracy | 0.9282 | 0.9286 | 0.9285
# horovod-4x8
Metric | nfs | fluid (cold) | fluid (warm)
---|---|---|---
Training time | 2h15m59s | 1h43m43s | 1h32m22s
Speed at step 1000 (images/second) | 3136 | 8889.6 | 20859.5
Speed at the last step (images/second) | 15024 | 20506.3 | 21329
Steps | 14070 | 14070 | 14070
Top-5 accuracy | 0.9228 | 0.9204 | 0.9243
# Result Analysis
From the test results, Fluid brings no obvious acceleration in the 1x8 case, but in the 4x8 scenario the effect is significant. With warm data, training time is shortened by (135 - 92) / 135 = 31%; with cold data, by (135 - 103) / 135 = 23%. This is because NFS bandwidth becomes a bottleneck under 4x8, while Fluid, backed by Alluxio, serves reads from a distributed, peer-to-peer cache.
From the test results, the Fluid acceleration effect on 1x8 has no obvious effect, but in the scenario of 4x8, the effect is very obvious. In warm data scenario, the training time can be shortened (135-92)/135 = 31%; In cold data scenario, training time can be shortened (135-103) /135 = 23%. This is because NFS bandwidth became a bottleneck under 4x8; Fluid based on Alluxio provides distributed cache data reading capability for P2P data.