How to deploy GPU node groups in Amazon EKS (version >= 1.16) with autoscaling
This post walks you through deploying GPU node groups (worker nodes) in an existing EKS cluster that was created using eksctl. Normally, you want to fire up GPU instances only on demand to save on costs.
Prerequisites
- Make sure you have the following tools installed on your system:
  - eksctl - eksctl.io/introduction/#installation
  - kubectl - kubernetes.io/docs/tasks/tools/install-kube..
- You have the Cluster Autoscaler deployed to your EKS cluster (a quick check is shown after this list).
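If you installed the Cluster Autoscaler from the standard manifests, a quick way to confirm it is running is to check its deployment. The namespace and deployment name below assume the default installation and may differ in your setup:

kubectl -n kube-system get deployment cluster-autoscaler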
Create the node groups
- Create the GPU node group configuration in the EKS cluster's configuration file. Let's call this file cluster.yaml. An example configuration file is given below:
...
nodeGroups:
  - name: ng-1-gpu-1-16
    labels:
      nvidia.com/gpu: "true"
      name: nvidia-device-plugin-ds
      k8s.amazonaws.com/accelerator: nvidia-tesla
    taints:
      nvidia.com/gpu: "true:NoSchedule"
    tags:
      k8s.io/cluster-autoscaler/node-template/label/nvidia.com/gpu: 'true'
      k8s.io/cluster-autoscaler/node-template/taint/dedicated: nvidia.com/gpu=true
      k8s.io/cluster-autoscaler/node-template/label/name: "nvidia-device-plugin-ds"
      k8s.io/cluster-autoscaler/node-template/taint/nvidia.com/gpu: "true:NoSchedule"
      k8s.io/cluster-autoscaler/enabled: "true"
    instanceType: p3.2xlarge
    desiredCapacity: 0
    minSize: 0
    maxSize: 10
    privateNetworking: true
    volumeSize: 30
    ssh:
      publicKeyName: my-keypair
    iam:
      withAddonPolicies:
        autoScaler: true
        externalDNS: true
        ebs: true
        cloudWatch: true
        albIngress: true
...
Note the labels and taints used for this node group, and the matching k8s.io/cluster-autoscaler/node-template/... tags: the Cluster Autoscaler reads these tags to learn which labels and taints the nodes will carry, even when the group is scaled down to zero.
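To make the scheduling behaviour concrete, here is a minimal, illustrative sketch of the fields a GPU workload would typically carry to land on (and only on) these nodes: a nodeSelector matching the nvidia.com/gpu label and a toleration for the matching taint. The sample pod later in this post shows the toleration in a full manifest.

# Illustrative pod spec fragment, not a complete manifest
nodeSelector:
  nvidia.com/gpu: "true"
tolerations:
  - key: "nvidia.com/gpu"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"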
- Update the cluster's node groups:
eksctl create nodegroup --config-file=cluster.yaml
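Once the command completes, you can confirm that the new node group exists (with a desired capacity of 0) before moving on; replace the placeholder with your cluster name:

eksctl get nodegroup --cluster <your-cluster-name>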
Configure the EKS cluster to use GPU resources
- To be able to use GPU resources, the NVIDIA device plugin needs to be installed as a DaemonSet.
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.6.0/nvidia-device-plugin.yml
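You can verify that the plugin rolled out. The DaemonSet name below matches the v0.6.0 manifest and may differ for other versions; it will also report 0 desired pods until a GPU node actually joins the cluster. The custom-columns query is handy later for checking that GPU capacity is advertised on the nodes:

kubectl -n kube-system get daemonset nvidia-device-plugin-daemonset
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"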
Test the cluster by deploying a sample pod
- Create a file named nvidia-smi.yml with the following contents:
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi
spec:
  restartPolicy: OnFailure
  containers:
    - name: nvidia-smi
      image: nvidia/cuda:9.2-devel
      args:
        - "nvidia-smi"
      resources:
        limits:
          nvidia.com/gpu: 1
  tolerations:
    - key: "nvidia.com/gpu"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"
- Deploy the pod using the command:
kubectl apply -f nvidia-smi.yml
- You should now see autoscaling kick in: the Auto Scaling group for the GPU nodes launches a new GPU instance. You can watch this happen with the commands below.
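Two ways to watch the scale-up, assuming the Cluster Autoscaler runs as the cluster-autoscaler deployment in kube-system: the pending pod's events should show a TriggeredScaleUp entry, and the autoscaler logs show which Auto Scaling group it chose.

kubectl describe pod nvidia-smi
kubectl -n kube-system logs -f deployment/cluster-autoscaler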
- You can check that the pod is scheduled to run on this node. Once the pod completes, you should see logs like the following:
$ kubectl logs nvidia-smi
Tue Sep 29 13:20:05 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00    Driver Version: 418.87.00    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   45C    P0    25W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
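When you are done testing, delete the pod. Because the node group has minSize: 0, the Cluster Autoscaler should terminate the now idle GPU instance after its scale-down delay (roughly 10 minutes by default, depending on your autoscaler configuration), so you are not left paying for an unused p3.2xlarge:

kubectl delete -f nvidia-smi.yml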