Understanding AWS EKS Storage, EFS Driver and Node Groups
Gashida Study Week 3
Understanding Container Storage
It is crucial to understand how storage works within containers. Kubernetes, the underlying container orchestration platform used by EKS, organizes containers into pods, which are the smallest deployable units in a Kubernetes cluster.
Each container within a pod requires a file system to store its operating system files, system binaries, and dependency libraries. Containers utilize an overlay file system that consists of multiple layers:
- Lower Directory (Lower Dir): This is a read-only layer that contains the bare minimum files necessary to run the container’s operating system, such as system binaries and libraries.
- Upper Directory (Upper Dir): This layer is writable and stores any modifications or new files created during the container’s runtime.
- Merged View: This is the unified view of the Lower Dir and Upper Dir, presenting a single file system to the container.
While containers can read from the Lower Dir and write to the Upper Dir, there is a significant challenge when it comes to data persistence. When a container is terminated, any data stored in the Upper Dir is lost along with the container. This ephemerality poses a problem for stateful applications that require data to persist across container restarts or pod rescheduling.
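To make the layer mechanics concrete, you can assemble an overlay file system by hand with the Linux `overlay` mount type. This is a minimal sketch (run as root; the directory names are arbitrary):

```bash
# Create the lower (read-only), upper (writable), work, and merged directories
mkdir -p /tmp/overlay/{lower,upper,work,merged}
echo "from the image layer" > /tmp/overlay/lower/base.txt

# Mount the overlay: reads fall through to lower, writes land in upper
mount -t overlay overlay \
  -o lowerdir=/tmp/overlay/lower,upperdir=/tmp/overlay/upper,workdir=/tmp/overlay/work \
  /tmp/overlay/merged

echo "written at runtime" > /tmp/overlay/merged/new.txt
ls /tmp/overlay/upper    # new.txt appears here; base.txt stays in lower
```

Writes never touch the lower layer, which is exactly why a container's runtime data disappears when its Upper Dir is discarded.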
Data Management in Docker Containers
Docker provides three primary methods for managing data in containers.
1. Volumes

Volumes are the preferred method for persisting data in Docker containers. They are created and managed by Docker itself and stored in a part of the host file system under the runtime's control. Unlike the container's writable layer, a named volume persists until it is explicitly removed (for example with `docker volume rm`); only anonymous volumes started with the `--rm` flag are cleaned up together with their container. On Linux systems, volume data is typically stored in the `/var/lib/docker/volumes` directory.
2. Bind Mounts
Bind mounts also store data in the host operating system’s file system but are not managed by the container runtime. This means that the data persists even after the container is stopped, unless explicitly deleted by the host operating system. Bind mounts are commonly used for managing persistent data that needs to outlive the container’s lifecycle.
3. tmpfs Mounts
tmpfs mounts store data in the memory (RAM) of the host system rather than the file system. They are suitable for scenarios where data persistence is not required and fast I/O performance is desired. Since tmpfs mounts use memory, they offer faster read and write speeds compared to volumes and bind mounts.
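The three mechanisms map onto three `docker run` options. A quick sketch (container names and host paths here are illustrative):

```bash
# 1. Named volume: managed by Docker, persists until `docker volume rm app-data`
docker run -d --name web1 -v app-data:/var/lib/app nginx

# 2. Bind mount: a host directory mapped into the container, here read-only
docker run -d --name web2 -v /srv/app-config:/etc/app:ro nginx

# 3. tmpfs mount: RAM-backed scratch space, discarded when the container stops
docker run -d --name web3 --tmpfs /scratch:size=64m nginx
```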
Kubernetes supports various container runtimes, including containerd, CRI-O, and Docker Engine. Containers running within Kubernetes pods can read and write data to the file system layers included in the container image. However, this approach does not provide data persistence across pod restarts or rescheduling. To address this, Kubernetes offers different mechanisms for managing data in pods.
hostPath Volumes

The hostPath volume type allows pods to store data persistently on the file system of the Kubernetes node. Data stored in a hostPath volume remains intact even if the pod is terminated, which makes it similar to the bind mount concept in Docker. hostPath volumes are commonly used for accessing log files, kubeconfig files, or CA certificates from the node.
However, it is important to note that the Kubernetes documentation advises against using hostPath volumes for persistent storage of application data due to security risks. It is recommended to use hostPath volumes in read-only mode and consider using Persistent Volumes (PVs) instead.
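For illustration, a read-only hostPath mount might look like the sketch below. It assumes the node keeps logs under `/var/log`; the pod and volume names are arbitrary:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: log-reader
spec:
  containers:
  - name: reader
    image: busybox
    command: ["sh", "-c", "ls /host-logs && sleep 3600"]
    volumeMounts:
    - name: node-logs
      mountPath: /host-logs
      readOnly: true        # read-only, as the Kubernetes docs recommend
  volumes:
  - name: node-logs
    hostPath:
      path: /var/log
      type: Directory
```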
Warning: Using the hostPath volume type presents many security risks. If you can avoid using a hostPath volume, you should. For example, define a [local PersistentVolume](https://kubernetes.io/docs/concepts/storage/volumes/#local) and use that instead.
Persistent Volumes (PVs) and Persistent Volume Claims (PVCs)
Persistent Volumes (PVs) decouple storage from individual pods and nodes. A PV is a cluster-level resource that can be mounted from the node's file system or from a network file system using CSI drivers, and it can also be connected to cloud storage services like AWS EBS or EFS. Unlike hostPath volumes, access to a PV is mediated through namespaced claims rather than direct paths on a node.
Pods access PVs indirectly through Persistent Volume Claims (PVCs). PVCs act as a “key” to access the PV, allowing pods to request storage resources without directly interacting with the underlying storage infrastructure. PVCs specify the desired storage capacity and access modes required by the pod.
The use of PVCs abstracts the storage details from the developers, enabling them to focus on consuming storage resources without worrying about the specific storage implementation. System administrators define the available storage classes and configure the PVs, while developers simply specify the storage class in their PVCs.
Role-Based Access Control (RBAC) can be used to restrict access to PVCs and other resources based on user roles or service accounts. This allows fine-grained control over who can create, modify, or delete PVCs within a Kubernetes cluster.
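As a sketch of what that looks like, the following Role and RoleBinding would let a hypothetical `dev-team` group manage PVCs only in the `dev` namespace:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pvc-editor
  namespace: dev
rules:
- apiGroups: [""]
  resources: ["persistentvolumeclaims"]
  verbs: ["get", "list", "watch", "create", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pvc-editor-binding
  namespace: dev
subjects:
- kind: Group
  name: dev-team              # hypothetical group name
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pvc-editor
  apiGroup: rbac.authorization.k8s.io
```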
Here’s an example of a PV configuration.
```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: my-pv
spec:
  capacity:
    storage: 5Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  hostPath:
    path: /data/my-pv
```
In this example, a PV named `my-pv` is created with a capacity of 5Gi. It uses the `ReadWriteOnce` access mode, meaning it can be mounted as read-write by a single node. The `persistentVolumeReclaimPolicy` is set to `Retain`, indicating that the PV should be kept even after the claiming PVC is deleted. The PV is associated with the `local-storage` StorageClass and uses the `hostPath` volume type, storing data at the `/data/my-pv` path on the node.
Here’s an example of a PVC configuration, which acts as a request ticket for a specific amount of storage with certain access modes:
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-pvc
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 2Gi
  storageClassName: local-storage
```
A PVC named `my-pvc` is created, requesting 2Gi of storage with the `ReadWriteOnce` access mode. It is associated with the `local-storage` StorageClass. When a PVC is created, Kubernetes tries to find a matching PV that satisfies the PVC's requirements. If a suitable PV is found, the PVC is bound to that PV.
A Storage Class provides a way for administrators to describe the “classes” of storage they offer. Different classes might map to quality-of-service levels, backup policies, or arbitrary policies determined by the cluster administrators.
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
```
A Storage Class named `local-storage` is created, which uses the `kubernetes.io/no-provisioner` provisioner, indicating that PVs are manually created by the administrator. The `volumeBindingMode` is set to `WaitForFirstConsumer`, which delays the binding and provisioning of a PV until a pod using the PVC is created.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  - name: my-container
    image: nginx
    volumeMounts:
    - name: my-volume
      mountPath: /data
  volumes:
  - name: my-volume
    persistentVolumeClaim:
      claimName: my-pvc
```
A pod named `my-pod` is created with a single container. The container mounts a volume named `my-volume` at the `/data` path. The volume is defined using a PVC named `my-pvc`. Kubernetes will ensure that the requested PVC is bound to an appropriate PV before the pod is scheduled and started.
Let’s explore the behavior of data stored in a container’s file system when a pod is deleted and recreated without using a Persistent Volume (PV). The goal is to verify that data stored in the container’s file system is lost when the pod is deleted.
```yaml
# date-busybox-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: busybox
spec:
  terminationGracePeriodSeconds: 3
  containers:
  - name: busybox
    image: busybox
    command:
    - "/bin/sh"
    - "-c"
    - "while true; do date >> /home/pod-out.txt; cd /home; sync; sync; sleep 10; done"
```
The YAML file defines a pod named “busybox” that runs a container based on the “busybox” image. The container executes a command that appends the current date and time to a file located at “/home/pod-out.txt” every 10 seconds.
- The pod is deployed using the command `kubectl apply -f date-busybox-pod.yaml`.
- After the pod is created, the contents of the `/home/pod-out.txt` file are observed using the command `kubectl exec busybox -- tail -f /home/pod-out.txt`. The output shows timestamps being appended to the file every 10 seconds, with the last timestamp being "Tue Mar 19 09:24:39 UTC 2024".
- The pod is then deleted using the command `kubectl delete pod busybox`.
- The pod is recreated using the same YAML file with the command `kubectl apply -f date-busybox-pod.yaml`.
- After the pod is recreated, the contents of the `/home/pod-out.txt` file are observed again using the same `kubectl exec` command. The output shows that the file now contains only a single fresh timestamp, "Tue Mar 19 09:24:49 UTC 2024", confirming that everything written before the deletion was lost along with the container's writable layer.
Let’s also explore the usage of PVs and PVCs to store data persistently across pod deletions and recreations. The goal is to demonstrate that data stored in a PV remains intact even when the associated pod is deleted and recreated.
```yaml
# localpath1.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: localpath-claim
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: "local-path"
```
```yaml
# localpath2.yaml
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  terminationGracePeriodSeconds: 3
  containers:
  - name: app
    image: centos
    command: ["/bin/sh"]
    args: ["-c", "while true; do echo $(date -u) >> /data/out.txt; sleep 5; done"]
    volumeMounts:
    - name: persistent-storage
      mountPath: /data
  volumes:
  - name: persistent-storage
    persistentVolumeClaim:
      claimName: localpath-claim
```
```bash
$ kubectl get sc local-path
#NAME         PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
#local-path   rancher.io/local-path   Delete          WaitForFirstConsumer   false                  4m19s
```
- The local-path-provisioner is installed using its deployment YAML (see the install command after this list). It is a dynamic provisioner for Kubernetes that creates PVs using local storage on the node where a pod is scheduled, which simplifies allocating and managing local storage for pods. It allows PVs to be created on the node's file system via the `local-path` storage class.
- A PVC named "localpath-claim" is created using the `localpath1.yaml` file. This PVC requests 1Gi of storage from the `local-path` storage class.
- The status of the PVC is initially "Pending" because no pod is consuming it yet. The PVC waits for the first consumer (pod) to be created before binding to a PV.
- A pod named "app" is created using the `localpath2.yaml` file. This pod runs a container based on the `centos` image and executes a command that appends the current date and time to a file located at `/data/out.txt` every 5 seconds.
- The pod's specification includes a volume mount that mounts a volume named "persistent-storage" at the `/data` path within the container. This volume is defined using the previously created PVC, "localpath-claim".
- After applying the pod YAML file, the pod, PV, and PVC statuses are checked. The pod is in the "Running" state, and both the PV and PVC are in the "Bound" state, indicating that the PVC is successfully bound to a dynamically provisioned PV.
- The contents of the `/data/out.txt` file within the pod are observed using the command `kubectl exec -it app -- tail -f /data/out.txt`. The output shows timestamps being appended to the file every 5 seconds.
- It's important to note that the provisioned PV is specific to the node where the pod is running. When checking the PV's path on other nodes, it is not found.
- To test data persistence, the "app" pod is deleted using the command `kubectl delete pod app` and then recreated from the same YAML file.
- After the pod is recreated, the contents of the `/data/out.txt` file are observed again, this time using the command `kubectl exec -it app -- head /data/out.txt` to view the beginning of the file.
- The output shows that the previously appended timestamps are still present in the file, confirming that the data stored in the PV persists even after the pod is deleted and recreated.
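The install command referenced in the first bullet above; this assumes the upstream manifest published in the rancher/local-path-provisioner repository:

```bash
kubectl apply -f https://raw.githubusercontent.com/rancher/local-path-provisioner/master/deploy/local-path-storage.yaml
```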
This example demonstrates the power of using PVs and PVCs in Kubernetes. By decoupling the storage from the pod’s lifecycle, data can be persisted across pod restarts and deletions. The local-path-provisioner enables the dynamic provisioning of PVs on the node’s file system, providing a convenient way to allocate storage for pods.
In the aforementioned example, the PV provisioned by the local-path-provisioner is specific to the node where the pod is running. This means that the PV is created on the local file system of that particular node.
For instance, consider a Kubernetes cluster with three nodes: Node1, Node2, and Node3. When a pod is scheduled on Node1 and requests storage using a PVC with the “local-path” StorageClass, the local-path-provisioner creates a PV on Node1’s local file system. The pod can then access and use this PV for storing its data.
In a multi-node cluster, pods can be rescheduled to different nodes based on various factors such as node failures, resource constraints, or manual interventions. However, when a pod is rescheduled to a different node, it may lose access to the PV that was provisioned on the original node.
Suppose the pod running on Node1 is rescheduled to Node2 due to a node failure or maintenance. Since the PV was provisioned on Node1’s local file system, the pod will no longer have access to that PV when it starts running on Node2. The data stored in the PV on Node1 will not be available to the pod on Node2. This limitation arises because the local-path-provisioner creates PVs that are tied to specific nodes. The PVs are not automatically migrated or replicated across nodes when pods are rescheduled.
By leveraging NAS or distributed file systems, Kubernetes can provide cluster-wide persistent storage that is not tied to specific nodes. This enables pods to be rescheduled freely across nodes while maintaining access to their persistent data.
CSI Drivers
Kubernetes introduced the Container Storage Interface (CSI) to decouple pods from volumes; a CSI driver acts as an intermediary between storage solutions and the cluster. Thanks to CSI drivers, pods can mount network volumes to access and store data on external storage systems. The following example demonstrates how to define a PersistentVolume (PV) and a PersistentVolumeClaim (PVC) for an NFS network volume.
```yaml
# NFS PersistentVolume definition
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-pv
spec:
  capacity:
    storage: 5Gi
  accessModes:
  - ReadWriteMany
  nfs:
    path: /path/to/nfs/share
    server: nfs-server-ip
---
# NFS PersistentVolumeClaim definition
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-pvc
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 5Gi
---
# Pod definition
apiVersion: v1
kind: Pod
metadata:
  name: nfs-pod
spec:
  containers:
  - name: nginx
    image: nginx
    volumeMounts:
    - mountPath: "/usr/share/nginx/html"
      name: nfs-volume
  volumes:
  - name: nfs-volume
    persistentVolumeClaim:
      claimName: nfs-pvc
```
In this example, an NFS PV is defined with a capacity of 5Gi and an access mode of ReadWriteMany. The PVC claims this PV, and the pod mounts the NFS volume through the PVC.
CSI allows Kubernetes to interact with different storage solutions, such as AWS EBS, EFS, GCE Persistent Disk, and more, without requiring changes to the core Kubernetes code. It enables storage vendors to develop their own CSI drivers, which can be easily plugged into Kubernetes clusters.
Here’s an example of defining a storage class using a CSI driver.
```yaml
# NFS StorageClass definition
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-storage
provisioner: example.com/nfs
volumeBindingMode: WaitForFirstConsumer
---
# NFS PersistentVolumeClaim definition
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-pvc
spec:
  storageClassName: nfs-storage
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 5Gi
---
# Pod definition
apiVersion: v1
kind: Pod
metadata:
  name: nfs-pod
spec:
  containers:
  - name: nginx
    image: nginx
    volumeMounts:
    - mountPath: "/usr/share/nginx/html"
      name: nfs-volume
  volumes:
  - name: nfs-volume
    persistentVolumeClaim:
      claimName: nfs-pvc
```
In this example, a storage class named “nfs-storage” is defined using the CSI driver with the provisioner “example.com/nfs”. The PVC references this storage class, and the pod mounts the volume using the PVC.
When running Kubernetes on Amazon Web Services (AWS), you can consider various storage options available to persist and manage data for your applications.
AWS Storage Options
1. Elastic Block Store (EBS):
- EBS is a block-level storage service that provides persistent storage volumes for EC2 instances.
- It allows you to create and attach volumes to EC2 instances, enabling data persistence even if the instances are terminated.
- EBS volumes are network-attached and can only be mounted to a single EC2 instance at a time.
- EBS supports various volume types, such as General Purpose SSD (gp2 and gp3), Provisioned IOPS SSD (io1 and io2), and Throughput Optimized HDD (st1).
- EBS volumes are specific to an Availability Zone and can be easily backed up using snapshots.
- Because an EBS volume can be mounted to only a single EC2 instance at a time, when using EBS as a PV in Kubernetes the `accessModes` of both the PV and PVC must be set to `ReadWriteOnce` (RWO).
2. Elastic File System (EFS):
- EFS is a fully managed file storage service that provides scalable and elastic file storage for EC2 instances.
- It allows multiple EC2 instances to access the same file system simultaneously, making it suitable for shared storage scenarios.
- EFS supports the Network File System (NFS) protocol, enabling seamless integration with Linux-based instances.
- EFS provides a simple and scalable solution for storing and sharing files across multiple instances and Availability Zones.
3. Instance Store:
- Instance Store provides temporary block-level storage that is physically attached to the host computer of an EC2 instance.
- The storage is ephemeral, meaning that the data stored on Instance Store volumes is lost when the instance is stopped or terminated.
- Instance Store offers high I/O performance and is suitable for temporary storage needs, such as caches, buffers, or scratch data.
- The capacity and performance of Instance Store volumes vary depending on the EC2 instance type.
4. S3 (Mountpoint for Amazon S3 CSI driver)
- S3 CSI support is not compatible with Fargate, which is a serverless compute engine for containers.
- S3 CSI support is also not compatible with Windows-based container images.
- This means that if you are using Fargate or Windows containers, you may not be able to directly use S3 as a storage backend through the official CSI driver.
- The official S3 CSI driver only supports static provisioning of volumes.
- Static provisioning means that the S3 buckets need to be manually created beforehand, and the CSI driver can then use those pre-existing buckets as persistent volumes.
- Dynamic provisioning, which automatically creates S3 buckets on-demand when a persistent volume claim is made, is not supported by the official CSI driver.
To use these AWS storage services with Kubernetes, you can leverage the Container Storage Interface (CSI). For EBS, install the AWS EBS CSI driver addon, which provides the `ebs-csi-controller`.
```bash
eksctl create iamserviceaccount \
  --name ebs-csi-controller-sa \
  --namespace kube-system \
  --cluster ${CLUSTER_NAME} \
  --attach-policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy \
  --approve \
  --role-only \
  --role-name AmazonEKS_EBS_CSI_DriverRole
eksctl get iamserviceaccount --cluster ${CLUSTER_NAME}

eksctl create addon --name aws-ebs-csi-driver --cluster ${CLUSTER_NAME} \
  --service-account-role-arn arn:aws:iam::${ACCOUNT_ID}:role/AmazonEKS_EBS_CSI_DriverRole --force
eksctl get addon --cluster ${CLUSTER_NAME}

kubectl get pod -n kube-system -l app=ebs-csi-controller -o jsonpath='{.items[0].spec.containers[*].name}' ; echo
```
```yaml
# Create a StorageClass for EBS
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: gp3
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
parameters:
  type: gp3
  encrypted: 'true'
reclaimPolicy: Delete
allowVolumeExpansion: true
```
```bash
kubectl get sc
#NAME            PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
#gp2 (default)   kubernetes.io/aws-ebs   Delete          WaitForFirstConsumer   false                  48m
#gp3             ebs.csi.aws.com         Delete          WaitForFirstConsumer   true                   10s
#local-path      rancher.io/local-path   Delete          WaitForFirstConsumer   false                  30m
```
```yaml
# Create a PersistentVolumeClaim (PVC)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ebs-claim
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: gp3
```
```yaml
# Use the PVC
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  - name: my-container
    image: my-image
    volumeMounts:
    - name: ebs-volume
      mountPath: /data
  volumes:
  - name: ebs-volume
    persistentVolumeClaim:
      claimName: ebs-claim
```
In Kubernetes, `nodeAffinity` is a feature that allows you to specify which nodes a resource (e.g. a PV) should be scheduled on or bound to. It provides a way to control the placement of resources based on node labels and expressions.
```bash
$ kubectl get pv -o yaml | yh
    nodeAffinity:
      required:
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.ebs.csi.aws.com/zone
            operator: In
            values:
            - ap-northeast-2b
```
The `nodeAffinity` configuration in this PV restricts the nodes on which the PV can be bound and mounted. It ensures that the PV is only used by nodes that:

- carry the Amazon EBS CSI driver's topology label (the `topology.ebs.csi.aws.com/zone` label key exists), and
- are located in the `ap-northeast-2b` Availability Zone.
This is important because EBS volumes are specific to an Availability Zone, and they can only be attached to EC2 instances (nodes) within the same Availability Zone. By setting the `nodeAffinity`, Kubernetes ensures that the PV is only bound to nodes that have access to the corresponding EBS volume.
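You can confirm which Availability Zone each node is in through its standard topology label, for example:

```bash
kubectl get nodes --label-columns topology.kubernetes.io/zone
```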
Yandex S3 Driver
The Yandex S3 CSI driver is a third-party implementation that allows using S3-compatible storage with Kubernetes. It offers features like dynamic provisioning, which means it can automatically create S3 buckets when a persistent volume claim is made.
- To use the S3 CSI driver, you first need to install it in your Kubernetes cluster.
```bash
helm repo add yandex-s3 https://yandex-cloud.github.io/k8s-csi-s3/charts
helm install csi-s3 yandex-s3/csi-s3
```
- For manual installation, you need to create the necessary Kubernetes resources, such as the provisioner, attacher, and driver.
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: csi-s3-secret
  namespace: kube-system
stringData:
  accessKeyID: <YOUR_ACCESS_KEY_ID>
  secretAccessKey: <YOUR_SECRET_ACCESS_KEY>
  endpoint: https://s3.<region>.amazonaws.com
  region: ap-northeast-2
```
- Define a StorageClass that references the S3 CSI driver and specifies any additional parameters.
- The StorageClass determines how volumes are provisioned and configured.
```yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: csi-s3
provisioner: ru.yandex.s3.csi
parameters:
  mounter: geesefs
  options: "--memory-limit 1000 --dir-mode 0777 --file-mode 0666"
```
- Use the StorageClass in your PersistentVolumeClaim to request storage from S3. The PVC specifies the desired storage capacity and access mode.
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: csi-s3-pvc
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
  storageClassName: csi-s3
```
- Mount the PVC in your Pod to access the S3 storage. Specify the PVC name and the desired mount path in the Pod manifest.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  - name: my-container
    image: my-image
    volumeMounts:
    - name: s3-storage
      mountPath: /data
  volumes:
  - name: s3-storage
    persistentVolumeClaim:
      claimName: csi-s3-pvc
```
Volume Snapshots Controller
In Kubernetes, taking snapshots of persistent volumes is a valuable feature for data backup and recovery purposes. AWS Elastic Block Store (EBS) supports the creation of snapshots, allowing you to capture the state of an EBS volume at a specific point in time.
By integrating the AWS Volume Snapshot Controller with Kubernetes, you can manage and automate the process of creating and restoring EBS snapshots directly from within your Kubernetes cluster.
To enable the volume snapshot functionality in Kubernetes, you need to install the Volume Snapshot Controller and create the necessary Custom Resource Definitions (CRDs) for volume snapshots by applying the following YAML files.
```bash
curl -s -O https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/master/client/config/crd/snapshot.storage.k8s.io_volumesnapshots.yaml
curl -s -O https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/master/client/config/crd/snapshot.storage.k8s.io_volumesnapshotclasses.yaml
curl -s -O https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/master/client/config/crd/snapshot.storage.k8s.io_volumesnapshotcontents.yaml
kubectl apply -f snapshot.storage.k8s.io_volumesnapshots.yaml,snapshot.storage.k8s.io_volumesnapshotclasses.yaml,snapshot.storage.k8s.io_volumesnapshotcontents.yaml
```
These CRDs define the necessary objects for managing volume snapshots: `VolumeSnapshot`, `VolumeSnapshotClass`, and `VolumeSnapshotContent`.
Verify that the CRDs are created successfully, then install the Volume Snapshot Controller by applying the RBAC and deployment YAML files. This step sets up the necessary RBAC permissions and deploys the Volume Snapshot Controller in the `kube-system` namespace.
```bash
kubectl get crd | grep snapshot
kubectl api-resources | grep snapshot

curl -s -O https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/master/deploy/kubernetes/snapshot-controller/rbac-snapshot-controller.yaml
curl -s -O https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/master/deploy/kubernetes/snapshot-controller/setup-snapshot-controller.yaml
kubectl apply -f rbac-snapshot-controller.yaml,setup-snapshot-controller.yaml
```
Verify that the Volume Snapshot Controller is running, then create a VolumeSnapshotClass by applying the YAML file below, which creates a `VolumeSnapshotClass` named `csi-aws-vsc` that uses the EBS CSI driver (`ebs.csi.aws.com`) for snapshot provisioning.
```bash
kubectl get deploy -n kube-system snapshot-controller
kubectl get pod -n kube-system | grep snapshot-controller

curl -s -O https://raw.githubusercontent.com/kubernetes-sigs/aws-ebs-csi-driver/master/examples/kubernetes/snapshot/manifests/classes/snapshotclass.yaml
kubectl apply -f snapshotclass.yaml
```
Create a `VolumeSnapshot` object by applying the following YAML file, which defines a `VolumeSnapshot` named `ebs-volume-snapshot` that uses the `csi-aws-vsc` snapshot class and specifies the source PVC (`ebs-claim`) for the snapshot.
```yaml
# ebs-volume-snapshot.yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: ebs-volume-snapshot
spec:
  volumeSnapshotClassName: csi-aws-vsc
  source:
    persistentVolumeClaimName: ebs-claim
```

```bash
kubectl apply -f ebs-volume-snapshot.yaml
kubectl get volumesnapshot
```
To restore a PVC from a volume snapshot, you can do the following.
```bash
kubectl delete pod app && kubectl delete pvc ebs-claim
```

Create a new PVC that references the volume snapshot as the data source. Note that the restored PVC must request at least as much storage as the snapshot's source PVC (10Gi in this walkthrough):

```yaml
# ebs-snapshot-restored-claim.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ebs-snapshot-restored-claim
spec:
  storageClassName: gp3
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  dataSource:
    name: ebs-volume-snapshot
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
```
```bash
kubectl apply -f ebs-snapshot-restored-claim.yaml
```

Create a new pod that uses the restored PVC:

```yaml
# ebs-snapshot-restored-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
  - name: app
    image: centos
    command: ["/bin/sh"]
    args: ["-c", "while true; do echo $(date -u) >> /data/out.txt; sleep 5; done"]
    volumeMounts:
    - name: persistent-storage
      mountPath: /data
  volumes:
  - name: persistent-storage
    persistentVolumeClaim:
      claimName: ebs-snapshot-restored-claim
```

```bash
kubectl apply -f ebs-snapshot-restored-pod.yaml
kubectl exec app -- cat /data/out.txt
```
- It is also possible to create the S3 bucket manually beforehand; the driver then uses the pre-existing bucket as a persistent volume. This is called static provisioning.
- Specify the `snapshots-pv-live.yaml` as shown below.
```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: snapshots-pv-live
spec:
  accessModes:
  - ReadWriteMany
  capacity:
    storage: 21Ti
  csi:
    driver: ru.yandex.s3.csi
    controllerPublishSecretRef:
      name: csi-s3-secret
      namespace: csi-s3
    nodePublishSecretRef:
      name: csi-s3-secret
      namespace: csi-s3
    nodeStageSecretRef:
      name: csi-s3-secret
      namespace: csi-s3
    volumeAttributes:
      bucket: test-smh-bucket
      mounter: geesefs
      options: --no-systemd --memory-limit 1000 --dir-mode 0777 --file-mode 0666
    volumeHandle: test-xxx-bucket/pvc-xxxxx-a753-4e7f-b557-c0b1476a7819
  persistentVolumeReclaimPolicy: Retain
  storageClassName: csi-s3
  volumeMode: Filesystem
```
In this example, a PersistentVolume named `snapshots-pv-live` is created with the following details:

- Access mode is set to `ReadWriteMany`, allowing multiple nodes to read and write to the volume.
- Capacity is set to 21Ti.
- The CSI driver is specified as `ru.yandex.s3.csi`, indicating the Yandex S3 CSI driver.
- Secret references (`controllerPublishSecretRef`, `nodePublishSecretRef`, `nodeStageSecretRef`) are provided for authentication.
- Volume attributes specify the S3 bucket name (`test-smh-bucket`), the mounter (`geesefs`), and additional options.
- The volume handle includes the bucket name and a unique identifier.
- Reclaim policy is set to `Retain`, meaning the volume will be retained even if the PVC is deleted.
- Storage class is set to `csi-s3`.
- To use the statically provisioned PV, create a PersistentVolumeClaim that references the PV using the `volumeName` field.
- Specify the `snapshots-pvc-live.yaml` as shown below.
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: snapshots-pvc-live
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 21Ti
  storageClassName: csi-s3
  volumeName: snapshots-pv-live
```
In this example, a PersistentVolumeClaim named `snapshots-pvc-live` is created with the following details:

- Access mode is set to `ReadWriteMany`, matching the PV's access mode.
- Requested storage capacity is set to 21Ti.
- Storage class is set to `csi-s3`, matching the PV's storage class.
- The `volumeName` field specifies the name of the statically provisioned PV (`snapshots-pv-live`) to bind the PVC to.
Note that in static provisioning, the S3 bucket needs to be manually created and managed outside of Kubernetes. The S3 CSI driver simply uses the pre-existing bucket as a persistent volume and does not handle the creation or deletion of the bucket itself. Static provisioning can be useful in scenarios where you have existing S3 buckets that you want to use as persistent volumes in Kubernetes, or when you need more control over the bucket lifecycle and configuration.
Implementing Amazon Elastic File System (EFS)
When running stateful applications on Kubernetes, it’s common to require shared storage that can be accessed by multiple pods simultaneously. While Amazon Elastic Block Store (EBS) provides persistent storage, it has a limitation of being mounted to a single EC2 instance at a time. To overcome this challenge and enable shared storage across multiple nodes, we can leverage EFS with Kubernetes on Amazon EKS.
To use EFS with Kubernetes on EKS, we can leverage the Amazon EFS CSI driver, as we did above. The CSI driver provides a standard interface for Kubernetes to interact with storage systems like EFS.
Install the EFS CSI Driver
- Let us follow the example provided by AWS.
- Create an IAM role for the EFS CSI driver with the necessary permissions to manage EFS resources.
- Deploy the EFS CSI driver as an addon in your EKS cluster using the `eksctl` command-line tool.
```bash
export CLUSTER_NAME=myeks
export ROLE_NAME=AmazonEKS_EFS_CSI_DriverRole

eksctl create iamserviceaccount \
  --name efs-csi-controller-sa \
  --namespace kube-system \
  --cluster $CLUSTER_NAME \
  --role-name $ROLE_NAME \
  --role-only \
  --attach-policy-arn arn:aws:iam::aws:policy/service-role/AmazonEFSCSIDriverPolicy \
  --approve
```
- Update the IAM role’s trust policy to allow the EFS CSI controller to assume the role.
```bash
TRUST_POLICY=$(aws iam get-role --role-name $ROLE_NAME --query 'Role.AssumeRolePolicyDocument' | \
  sed -e 's/efs-csi-controller-sa/efs-csi-*/' -e 's/StringEquals/StringLike/')
aws iam update-assume-role-policy --role-name $ROLE_NAME --policy-document "$TRUST_POLICY"
```
- Create the EFS CSI driver addon in your EKS cluster.
```bash
eksctl create addon \
  --name aws-efs-csi-driver \
  --cluster ${CLUSTER_NAME} \
  --service-account-role-arn arn:aws:iam::${ACCOUNT_ID}:role/AmazonEKS_EFS_CSI_DriverRole \
  --force
```
Create an EFS File System
- Create an EFS file system in the desired AWS region and VPC using the AWS Management Console or AWS CLI. If you have a bastion host, SSH into the bastion host or a jumpbox instance that has access to the EKS cluster and the EFS file system.
- Install the `amazon-efs-utils` package if it's not already installed, and create a directory where you want to mount the EFS file system, for example `/mnt/myefs`.
```bash
sudo yum install -y amazon-efs-utils   # on Amazon Linux; other distros build from source
mkdir -p /mnt/myefs

EfsFsId=$(aws efs describe-file-systems --query "FileSystems[*].FileSystemId" --output text)
mount -t efs -o tls $EfsFsId:/ /mnt/myefs
```
- Define a StorageClass that uses the EFS CSI driver as the provisioner.
```yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
```
- Define a PersistentVolume that references the EFS file system using the EFS file system ID.
```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: efs-pv
spec:
  capacity:
    storage: 5Gi
  volumeMode: Filesystem
  accessModes:
  - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: efs-sc
  csi:
    driver: efs.csi.aws.com
    volumeHandle: <EFS_FILE_SYSTEM_ID>
```

```bash
kubectl apply -f pv.yaml
```
- Define a PersistentVolumeClaim that requests storage from the EFS StorageClass.
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: efs-claim
spec:
  accessModes:
  - ReadWriteMany
  storageClassName: efs-sc
  resources:
    requests:
      storage: 5Gi
```

```bash
kubectl apply -f claim.yaml
```
- Create pods that mount the PVC and specify the mount path where the EFS file system will be accessible.
```yaml
# pod1.yaml
apiVersion: v1
kind: Pod
metadata:
  name: app1
spec:
  containers:
  - name: app1
    image: busybox
    command: ["/bin/sh"]
    args: ["-c", "while true; do echo $(date -u) >> /data/out1.txt; sleep 5; done"]
    volumeMounts:
    - name: persistent-storage
      mountPath: /data
  volumes:
  - name: persistent-storage
    persistentVolumeClaim:
      claimName: efs-claim
```

```yaml
# pod2.yaml
apiVersion: v1
kind: Pod
metadata:
  name: app2
spec:
  containers:
  - name: app2
    image: busybox
    command: ["/bin/sh"]
    args: ["-c", "while true; do echo $(date -u) >> /data/out2.txt; sleep 5; done"]
    volumeMounts:
    - name: persistent-storage
      mountPath: /data
  volumes:
  - name: persistent-storage
    persistentVolumeClaim:
      claimName: efs-claim
```

```bash
kubectl apply -f pod1.yaml,pod2.yaml
```
- You can exec into the pods, or inspect the file system from the bastion host where it is mounted, to confirm that the data is being shared across pods.
```bash
tree /mnt/myefs
#/mnt/myefs
#|-- out1.txt
#`-- out2.txt
```
Exploring Node Groups in Amazon EKS
Node Groups in EKS are logical groupings of worker nodes that share similar characteristics, such as instance type, CPU, RAM, and storage configuration. They provide a way to define the desired capacity and scaling behavior of the worker nodes within an EKS cluster.
When the workload on the cluster increases, EKS can automatically add new worker nodes to the node group to handle the increased demand. Similarly, when the workload decreases, EKS can scale down the number of worker nodes to optimize cost and resource utilization.
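Besides automatic scaling, a node group can also be resized manually. A sketch with eksctl, reusing the cluster and node group names from the examples later in this post:

```bash
eksctl scale nodegroup --cluster=myeks --name=ng3 --nodes=2 --nodes-min=1 --nodes-max=5
```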
Let us discuss two autoscaling tools and explain how they work with Node Groups in EKS.
Kubernetes Cluster Autoscaler (CA)
The Kubernetes Cluster Autoscaler is a tool that automatically adjusts the size of a Kubernetes cluster based on the resource demands of the workloads running on it. It monitors the resource utilization of the pods and scales the number of worker nodes in the cluster accordingly.
You can define a Node Group with an autoscaling configuration like the one below.
```yaml
# myng-autoscaling.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
managedNodeGroups:
- name: ng-autoscaling
  minSize: 1
  maxSize: 5
  desiredCapacity: 2
  privateNetworking: true
  iam:
    withAddonPolicies:
      autoScaler: true
```
The example defines a Node Group named `ng-autoscaling` with the following autoscaling configurations:
- Minimum size: 1 (minimum number of worker nodes)
- Maximum size: 5 (maximum number of worker nodes)
- Desired capacity: 2 (initial number of worker nodes)
- IAM policy for the Cluster Autoscaler is enabled
Make sure that you deploy the Cluster Autoscaler itself. The example manifest can be applied directly:

```bash
kubectl apply -f https://raw.githubusercontent.com/kubernetes/autoscaler/master/cluster-autoscaler/cloudprovider/aws/examples/cluster-autoscaler-autodiscover.yaml
```
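Note that this example manifest ships with a `<YOUR CLUSTER NAME>` placeholder in its auto-discovery flag, so a common pattern (assuming the same manifest) is to download it, substitute your cluster name, and then apply:

```bash
curl -s -O https://raw.githubusercontent.com/kubernetes/autoscaler/master/cluster-autoscaler/cloudprovider/aws/examples/cluster-autoscaler-autodiscover.yaml
sed -i "s/<YOUR CLUSTER NAME>/${CLUSTER_NAME}/g" cluster-autoscaler-autodiscover.yaml
kubectl apply -f cluster-autoscaler-autodiscover.yaml
```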
With the Cluster Autoscaler deployed and configured, it will continuously monitor the resource utilization of the pods in the cluster. If there are unschedulable pods due to insufficient resources, the Cluster Autoscaler will automatically scale up the number of worker nodes in the Node Group, up to the specified maximum size. Conversely, if there are underutilized worker nodes, the Cluster Autoscaler will scale down the Node Group to the minimum size to save costs.
Karpenter
Karpenter is an open-source autoscaling tool specifically designed for Kubernetes clusters running on AWS. It provides advanced autoscaling capabilities and integrates seamlessly with EKS.
Karpenter observes the resource requests of pending pods and automatically provisions worker nodes to accommodate them. It selects instance types and capacity based on the provisioner configuration (spot instances, in the example below) and scales the worker nodes as needed.
- Install Karpenter in your EKS cluster
```bash
helm repo add karpenter https://charts.karpenter.sh
helm repo update
helm upgrade --install karpenter karpenter/karpenter --namespace karpenter \
  --create-namespace --set serviceAccount.create=true --version v0.20.0 --wait
```
- Create a Karpenter provisioner
```yaml
# karpenter-provisioner.yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
  - key: karpenter.sh/capacity-type
    operator: In
    values: ["spot"]
  limits:
    resources:
      cpu: 1000
  provider:
    subnetSelector:
      karpenter.sh/discovery: my-cluster
    securityGroupSelector:
      karpenter.sh/discovery: my-cluster
  ttlSecondsAfterEmpty: 30
```
In this example, we define a Karpenter provisioner that:
- Selects spot instances for worker nodes
- Limits the total CPU capacity to 1000 cores
- Discovers subnets and security groups tagged with `karpenter.sh/discovery: my-cluster`
- Terminates empty nodes after 30 seconds
Karpenter can select the most suitable instance types based on the workload requirements. It considers factors like CPU, memory, and other pod constraints to determine the optimal instance type for each workload.
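To watch this in action, you can deploy a small workload with explicit resource requests and scale it up. This sketch uses the `pause` image that Karpenter's getting-started guides commonly use; the deployment name is arbitrary:

```bash
# Pending pods with CPU requests give Karpenter something to provision for
kubectl create deployment inflate --image=public.ecr.aws/eks-distro/kubernetes/pause:3.7
kubectl set resources deployment inflate --requests=cpu=1
kubectl scale deployment inflate --replicas=5

# Watch Karpenter bring up capacity for the pending pods
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter -c controller -f
```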
If you are running a Kubernetes cluster on AWS and want advanced autoscaling features and dynamic provisioning, Karpenter may be a better fit. If you need a more general-purpose, cloud-agnostic autoscaling solution, the Kubernetes Cluster Autoscaler is a solid choice.
Creating a Node Group with eksctl
Define the Node Group configuration in a YAML file:
```yaml
# myng3.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
managedNodeGroups:
- amiFamily: AmazonLinux2
  desiredCapacity: 1
  instanceType: t4g.medium
  labels:
    family: graviton
  maxSize: 1
  minSize: 1
  name: ng3
  ssh:
    allow: true
    publicKeyPath: ~/.ssh/id_rsa.pub
  subnets:
  - subnet-082b50963d6c944ef
  - subnet-03e23edb6dd18f876
  - subnet-029b2b9933d83de6f
  tags:
    alpha.eksctl.io/nodegroup-name: ng3
    alpha.eksctl.io/nodegroup-type: managed
  volumeSize: 30
metadata:
  name: myeks
  region: ap-northeast-2
  version: "1.28"
```
The above example defines a Node Group named `ng3` with the following configurations:
- Instance type: `t4g.medium` (ARM-based Graviton processor)
- Desired capacity: 1 (initial number of worker nodes)
- Minimum size: 1 (minimum number of worker nodes)
- Maximum size: 1 (maximum number of worker nodes)
- Labels: `family=graviton` (custom label for the worker nodes)
- SSH access: enabled with the specified public key path
- Subnets: specified subnet IDs for the worker nodes
- Tags: additional tags for the Node Group
Create the Node Group using eksctl:

```bash
eksctl create nodegroup -f myng3.yaml
```

Verify the Node Group creation:

```bash
kubectl get nodes --label-columns eks.amazonaws.com/nodegroup,kubernetes.io/arch
```
Taints and Tolerations
Taints and tolerations provide a mechanism to control which pods can be scheduled on specific worker nodes. Taints are applied to worker nodes, while tolerations are specified in the pod specification.
Taints
- A taint is a property that can be set on each node.
- When a taint is applied to a node, pods will not be scheduled on that node unless they have a matching toleration.
- Taints are commonly used to dedicate nodes for specific purposes or roles.
- You can use taints to ensure that only pods that require GPU resources are scheduled on nodes with GPUs, while other pods are prevented from being scheduled on those nodes.
A taint carries one of three effects:
- NoSchedule: If a pod does not have a matching toleration, it will not be scheduled on the tainted node. However, this effect does not apply to pods that are already running on the node.
- PreferNoSchedule: If a pod does not have a matching toleration, Kubernetes will try to avoid scheduling it on the tainted node. However, if there are resource constraints or insufficient nodes in the cluster, the pod may still be scheduled on the tainted node.
- NoExecute: If a pod does not have a matching toleration, it will not be scheduled on the tainted node. Additionally, if a pod is already running on the node and does not have a matching toleration, it will be evicted (terminated).
Tolerations
- A toleration is a property that can be set on a pod.
- Tolerations are applied to pods and allow them to be scheduled on nodes with matching taints.
- Taints and tolerations are like a “no entry” sign on a node. Taints act as the “no entry” sign, preventing pods from being scheduled on the node. Tolerations are like a special “access pass” that allows pods to ignore the “no entry” sign and be scheduled on the node. NodeAffinity is like a “preferred location” sign. It guides pods towards specific nodes based on node labels, indicating where the pod should be scheduled.
The goal for this example is to ensure that pods are scheduled on the nodes in the `ng3` node group.
Let’s apply a taint to the Node Group.
```bash
aws eks update-nodegroup-config --cluster-name $CLUSTER_NAME --nodegroup-name ng3 \
  --taints "addOrUpdateTaints=[{key=frontend, value=true, effect=NO_EXECUTE}]"
```
This command adds a taint to the `ng3` node group with the following properties:

- Key: `frontend`
- Value: `true`
- Effect: `NO_EXECUTE`
To allow pods to run on the `ng3` node group, they need a matching toleration in their configuration. The toleration should specify the same key (`frontend`) and effect as the taint; note that `NO_EXECUTE` in the AWS API corresponds to `NoExecute` in the Kubernetes pod spec. Here's an example of a pod configuration with the required toleration:
```yaml
# busybox.yaml
apiVersion: v1
kind: Pod
metadata:
  name: busybox
spec:
  terminationGracePeriodSeconds: 3
  containers:
  - name: busybox
    image: busybox
    command:
    - "/bin/sh"
    - "-c"
    - "while true; do date >> /home/pod-out.txt; cd /home; sync; sync; sleep 10; done"
  tolerations:
  - effect: NoExecute
    key: frontend
    operator: Exists
```
In this example, the pod has a toleration that matches the taint on the `ng3` node group. The `operator: Exists` setting means that the toleration will match any taint with the specified key, regardless of the value.
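Keep in mind that a toleration only permits scheduling onto the tainted nodes; it does not force it. Strictly pinning the pod to ng3 would additionally take a nodeSelector on the `family: graviton` label defined earlier. You can verify where the pod landed and which nodes carry the taint:

```bash
# Which node did the pod land on?
kubectl get pod busybox -o wide

# Which nodes carry the frontend taint?
kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'
```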
Again, it's important to note that taints and tolerations are different from `nodeAffinity`. While `nodeAffinity` is used to specify which nodes a pod prefers or requires to be scheduled on, taints and tolerations provide a way to restrict or allow pod scheduling on specific nodes based on the presence or absence of tolerations.
The `NO_EXECUTE` effect is useful in scenarios where you want to strictly control which pods can run on specific nodes and ensure that pods without the necessary tolerations are not scheduled on, or are evicted from, those nodes. Some common use cases include:
- Dedicated Nodes: You can use taints with the `NO_EXECUTE` effect to dedicate certain nodes to specific workloads or services. By applying a taint to those nodes and requiring pods to have a matching toleration, you ensure that only the desired pods are scheduled on those nodes.
- Maintenance or Decommissioning: If you need to perform maintenance on a node or decommission it, you can apply a taint with the `NO_EXECUTE` effect. This will evict any pods that don't have a matching toleration, allowing you to safely perform the maintenance or decommissioning tasks without affecting the workloads.
- Workload Isolation: Taints with the `NO_EXECUTE` effect can be used to isolate specific workloads or applications from certain nodes. By applying taints to nodes and requiring pods to have specific tolerations, you can ensure that only the intended workloads are scheduled on those nodes.
It's important to note that the `NO_EXECUTE` effect is the strictest among the available taint effects (`NO_SCHEDULE`, `PREFER_NO_SCHEDULE`, `NO_EXECUTE`). It not only prevents new pods from being scheduled on the tainted node but also evicts existing pods that don't have a matching toleration.
When using taints with the `NO_EXECUTE` effect, it's crucial to carefully consider the impact on existing pods and ensure that the necessary tolerations are specified in the pod configurations, to avoid unexpected evictions and disruptions to the workloads.