Understanding AWS EKS Storage, EFS Driver and Node Groups
Gashida Study Week 3
Understanding Container Storage
It is crucial to understand how storage works within containers. Kubernetes, the underlying container orchestration platform used by EKS, organizes containers into pods, which are the smallest deployable units in a Kubernetes cluster.
Each container within a pod requires a file system to store its operating system files, system binaries, and dependency libraries. Containers utilize an overlay file system that consists of multiple layers:
- Lower Directory (Lower Dir): This is a read-only layer that contains the bare minimum files necessary to run the container’s operating system, such as system binaries and libraries.
- Upper Directory (Upper Dir): This layer is writable and stores any modifications or new files created during the container’s runtime.
- Merged View: This is the unified view of the Lower Dir and Upper Dir, presenting a single file system to the container.
While containers can read from the Lower Dir and write to the Upper Dir, there is a significant challenge when it comes to data persistence. When a container is terminated, any data stored in the Upper Dir is lost along with the container. This ephemerality poses a problem for stateful applications that require data to persist across container restarts or pod rescheduling.
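To make the layer mechanics concrete, you can assemble an overlay file system by hand with the Linux `overlay` mount type. This is a minimal sketch (run as root; the directory names are arbitrary):

```bash
# Create the lower (read-only), upper (writable), work, and merged directories
mkdir -p /tmp/overlay/{lower,upper,work,merged}
echo "from the image layer" > /tmp/overlay/lower/base.txt

# Mount the overlay: reads fall through to lower, writes land in upper
mount -t overlay overlay \
  -o lowerdir=/tmp/overlay/lower,upperdir=/tmp/overlay/upper,workdir=/tmp/overlay/work \
  /tmp/overlay/merged

echo "written at runtime" > /tmp/overlay/merged/new.txt
ls /tmp/overlay/upper    # new.txt appears here; base.txt stays in lower
```

Writes never touch the lower layer, which is exactly why a container's runtime data disappears when its Upper Dir is discarded.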
Data Management in Docker Containers
Docker provides three primary methods for managing data in containers.
1. Volumes

Volumes are the preferred method for persisting data in Docker containers. They are created and managed by Docker itself and stored in a part of the host file system under the runtime's control. Unlike the container's writable layer, a named volume persists until it is explicitly removed (for example with `docker volume rm`); only anonymous volumes started with the `--rm` flag are cleaned up together with their container. On Linux systems, volume data is typically stored in the `/var/lib/docker/volumes` directory.
2. Bind Mounts
Bind mounts also store data in the host operating system’s file system but are not managed by the container runtime. This means that the data persists even after the container is stopped, unless explicitly deleted by the host operating system. Bind mounts are commonly used for managing persistent data that needs to outlive the container’s lifecycle.
3. tmpfs Mounts
tmpfs mounts store data in the memory (RAM) of the host system rather than the file system. They are suitable for scenarios where data persistence is not required and fast I/O performance is desired. Since tmpfs mounts use memory, they offer faster read and write speeds compared to volumes and bind mounts.
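The three mechanisms map onto three `docker run` options. A quick sketch (container names and host paths here are illustrative):

```bash
# 1. Named volume: managed by Docker, persists until `docker volume rm app-data`
docker run -d --name web1 -v app-data:/var/lib/app nginx

# 2. Bind mount: a host directory mapped into the container, here read-only
docker run -d --name web2 -v /srv/app-config:/etc/app:ro nginx

# 3. tmpfs mount: RAM-backed scratch space, discarded when the container stops
docker run -d --name web3 --tmpfs /scratch:size=64m nginx
```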
Kubernetes supports various container runtimes, including containerd, CRI-O, and Docker Engine. Containers running within Kubernetes pods can read and write data to the file system layers included in the container image. However, this approach does not provide data persistence across pod restarts or rescheduling. To address this, Kubernetes offers different mechanisms for managing data in pods.
hostPath Volumes

The hostPath volume type allows pods to store data persistently on the file system of the Kubernetes node. Data stored in a hostPath volume remains intact even if the pod is terminated, which makes it similar to the bind mount concept in Docker. hostPath volumes are commonly used for accessing log files, kubeconfig files, or CA certificates from the node.
However, it is important to note that the Kubernetes documentation advises against using hostPath volumes for persistent storage of application data due to security risks. It is recommended to use hostPath volumes in read-only mode and consider using Persistent Volumes (PVs) instead.
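For illustration, a read-only hostPath mount might look like the sketch below. It assumes the node keeps logs under `/var/log`; the pod and volume names are arbitrary:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: log-reader
spec:
  containers:
  - name: reader
    image: busybox
    command: ["sh", "-c", "ls /host-logs && sleep 3600"]
    volumeMounts:
    - name: node-logs
      mountPath: /host-logs
      readOnly: true        # read-only, as the Kubernetes docs recommend
  volumes:
  - name: node-logs
    hostPath:
      path: /var/log
      type: Directory
```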
Warning: Using the hostPath volume type presents many security risks. If you can avoid using a hostPath volume, you should. For example, define a [local PersistentVolume](https://kubernetes.io/docs/concepts/storage/volumes/#local) and use that instead.
Persistent Volumes (PVs) and Persistent Volume Claims (PVCs)
Persistent Volumes (PVs) decouple storage from individual pods and nodes. A PV is a cluster-level resource that can be mounted from the node's file system or from a network file system using CSI drivers, and it can also be connected to cloud storage services like AWS EBS or EFS. Unlike hostPath volumes, access to a PV is mediated through namespaced claims rather than direct paths on a node.
Pods access PVs indirectly through Persistent Volume Claims (PVCs). PVCs act as a “key” to access the PV, allowing pods to request storage resources without directly interacting with the underlying storage infrastructure. PVCs specify the desired storage capacity and access modes required by the pod.
The use of PVCs abstracts the storage details from the developers, enabling them to focus on consuming storage resources without worrying about the specific storage implementation. System administrators define the available storage classes and configure the PVs, while developers simply specify the storage class in their PVCs.
Role-Based Access Control (RBAC) can be used to restrict access to PVCs and other resources based on user roles or service accounts. This allows fine-grained control over who can create, modify, or delete PVCs within a Kubernetes cluster.
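As a sketch of what that looks like, the following Role and RoleBinding would let a hypothetical `dev-team` group manage PVCs only in the `dev` namespace:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pvc-editor
  namespace: dev
rules:
- apiGroups: [""]
  resources: ["persistentvolumeclaims"]
  verbs: ["get", "list", "watch", "create", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pvc-editor-binding
  namespace: dev
subjects:
- kind: Group
  name: dev-team              # hypothetical group name
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pvc-editor
  apiGroup: rbac.authorization.k8s.io
```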
Here’s an example of a PV configuration.
```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: my-pv
spec:
  capacity:
    storage: 5Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  hostPath:
    path: /data/my-pv
```
In this example, a PV named `my-pv` is created with a capacity of 5Gi. It uses the `ReadWriteOnce` access mode, meaning it can be mounted as read-write by a single node. The `persistentVolumeReclaimPolicy` is set to `Retain`, indicating that the PV should be kept even after the claiming PVC is deleted. The PV is associated with the `local-storage` StorageClass and uses the `hostPath` volume type, storing data at the `/data/my-pv` path on the node.
Here’s an example of a PVC configuration, which acts as a request ticket for a specific amount of storage with certain access modes:
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-pvc
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 2Gi
  storageClassName: local-storage
```
A PVC named `my-pvc` is created, requesting 2Gi of storage with the `ReadWriteOnce` access mode. It is associated with the `local-storage` StorageClass. When a PVC is created, Kubernetes tries to find a matching PV that satisfies the PVC's requirements. If a suitable PV is found, the PVC is bound to that PV.
A Storage Class provides a way for administrators to describe the “classes” of storage they offer. Different classes might map to quality-of-service levels, backup policies, or arbitrary policies determined by the cluster administrators.
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
```
A Storage Class named `local-storage` is created, which uses the `kubernetes.io/no-provisioner` provisioner, indicating that PVs are manually created by the administrator. The `volumeBindingMode` is set to `WaitForFirstConsumer`, which delays the binding and provisioning of a PV until a pod using the PVC is created.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  - name: my-container
    image: nginx
    volumeMounts:
    - name: my-volume
      mountPath: /data
  volumes:
  - name: my-volume
    persistentVolumeClaim:
      claimName: my-pvc
```
A pod named `my-pod` is created with a single container. The container mounts a volume named `my-volume` at the `/data` path. The volume is defined using a PVC named `my-pvc`. Kubernetes will ensure that the requested PVC is bound to an appropriate PV before the pod is scheduled and started.
Let’s explore the behavior of data stored in a container’s file system when a pod is deleted and recreated without using a Persistent Volume (PV). The goal is to verify that data stored in the container’s file system is lost when the pod is deleted.
```yaml
# date-busybox-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: busybox
spec:
  terminationGracePeriodSeconds: 3
  containers:
  - name: busybox
    image: busybox
    command:
    - "/bin/sh"
    - "-c"
    - "while true; do date >> /home/pod-out.txt; cd /home; sync; sync; sleep 10; done"
```
The YAML file defines a pod named “busybox” that runs a container based on the “busybox” image. The container executes a command that appends the current date and time to a file located at “/home/pod-out.txt” every 10 seconds.
- The pod is deployed using the command `kubectl apply -f date-busybox-pod.yaml`.
- After the pod is created, the contents of the `/home/pod-out.txt` file are observed using the command `kubectl exec busybox -- tail -f /home/pod-out.txt`. The output shows timestamps being appended to the file every 10 seconds, with the last timestamp being "Tue Mar 19 09:24:39 UTC 2024".
- The pod is then deleted using the command `kubectl delete pod busybox`.
- The pod is recreated using the same YAML file with the command `kubectl apply -f date-busybox-pod.yaml`.
- After the pod is recreated, the contents of the `/home/pod-out.txt` file are observed again using the same `kubectl exec` command. The output shows that the file now contains only a single fresh timestamp, "Tue Mar 19 09:24:49 UTC 2024", confirming that everything written before the deletion was lost along with the container's writable layer.
Let’s also explore the usage of PVs and PVCs to store data persistently across pod deletions and recreations. The goal is to demonstrate that data stored in a PV remains intact even when the associated pod is deleted and recreated.
```yaml
# localpath1.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: localpath-claim
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: "local-path"
```
```yaml
# localpath2.yaml
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  terminationGracePeriodSeconds: 3
  containers:
  - name: app
    image: centos
    command: ["/bin/sh"]
    args: ["-c", "while true; do echo $(date -u) >> /data/out.txt; sleep 5; done"]
    volumeMounts:
    - name: persistent-storage
      mountPath: /data
  volumes:
  - name: persistent-storage
    persistentVolumeClaim:
      claimName: localpath-claim
```
```bash
$ kubectl get sc local-path
#NAME         PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
#local-path   rancher.io/local-path   Delete          WaitForFirstConsumer   false                  4m19s
```
- The local-path-provisioner is installed using its deployment YAML (see the install command after this list). It is a dynamic provisioner for Kubernetes that creates PVs using local storage on the node where a pod is scheduled, which simplifies allocating and managing local storage for pods. It allows PVs to be created on the node's file system via the `local-path` storage class.
- A PVC named "localpath-claim" is created using the `localpath1.yaml` file. This PVC requests 1Gi of storage from the `local-path` storage class.
- The status of the PVC is initially "Pending" because no pod is consuming it yet. The PVC waits for the first consumer (pod) to be created before binding to a PV.
- A pod named "app" is created using the `localpath2.yaml` file. This pod runs a container based on the `centos` image and executes a command that appends the current date and time to a file located at `/data/out.txt` every 5 seconds.
- The pod's specification includes a volume mount that mounts a volume named "persistent-storage" at the `/data` path within the container. This volume is defined using the previously created PVC, "localpath-claim".
- After applying the pod YAML file, the pod, PV, and PVC statuses are checked. The pod is in the "Running" state, and both the PV and PVC are in the "Bound" state, indicating that the PVC is successfully bound to a dynamically provisioned PV.
- The contents of the `/data/out.txt` file within the pod are observed using the command `kubectl exec -it app -- tail -f /data/out.txt`. The output shows timestamps being appended to the file every 5 seconds.
- It's important to note that the provisioned PV is specific to the node where the pod is running. When checking the PV's path on other nodes, it is not found.
- To test data persistence, the "app" pod is deleted using the command `kubectl delete pod app` and then recreated from the same YAML file.
- After the pod is recreated, the contents of the `/data/out.txt` file are observed again, this time using the command `kubectl exec -it app -- head /data/out.txt` to view the beginning of the file.
- The output shows that the previously appended timestamps are still present in the file, confirming that the data stored in the PV persists even after the pod is deleted and recreated.
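The install command referenced in the first bullet above; this assumes the upstream manifest published in the rancher/local-path-provisioner repository:

```bash
kubectl apply -f https://raw.githubusercontent.com/rancher/local-path-provisioner/master/deploy/local-path-storage.yaml
```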
This example demonstrates the power of using PVs and PVCs in Kubernetes. By decoupling the storage from the pod’s lifecycle, data can be persisted across pod restarts and deletions. The local-path-provisioner enables the dynamic provisioning of PVs on the node’s file system, providing a convenient way to allocate storage for pods.
In the aforementioned example, the PV provisioned by the local-path-provisioner is specific to the node where the pod is running. This means that the PV is created on the local file system of that particular node.
For instance, consider a Kubernetes cluster with three nodes: Node1, Node2, and Node3. When a pod is scheduled on Node1 and requests storage using a PVC with the “local-path” StorageClass, the local-path-provisioner creates a PV on Node1’s local file system. The pod can then access and use this PV for storing its data.
In a multi-node cluster, pods can be rescheduled to different nodes based on various factors such as node failures, resource constraints, or manual interventions. However, when a pod is rescheduled to a different node, it may lose access to the PV that was provisioned on the original node.
Suppose the pod running on Node1 is rescheduled to Node2 due to a node failure or maintenance. Since the PV was provisioned on Node1’s local file system, the pod will no longer have access to that PV when it starts running on Node2. The data stored in the PV on Node1 will not be available to the pod on Node2. This limitation arises because the local-path-provisioner creates PVs that are tied to specific nodes. The PVs are not automatically migrated or replicated across nodes when pods are rescheduled.
By leveraging NAS or distributed file systems, Kubernetes can provide cluster-wide persistent storage that is not tied to specific nodes. This enables pods to be rescheduled freely across nodes while maintaining access to their persistent data.
CSI Drivers
Kubernetes introduced the Container Storage Interface (CSI) to decouple pods from volumes; a CSI driver acts as an intermediary between storage solutions and the cluster. Thanks to CSI drivers, pods can mount network volumes to access and store data on external storage systems. The following example demonstrates how to define a PersistentVolume (PV) and a PersistentVolumeClaim (PVC) for an NFS network volume.
```yaml
# NFS PersistentVolume definition
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-pv
spec:
  capacity:
    storage: 5Gi
  accessModes:
  - ReadWriteMany
  nfs:
    path: /path/to/nfs/share
    server: nfs-server-ip
---
# NFS PersistentVolumeClaim definition
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-pvc
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 5Gi
---
# Pod definition
apiVersion: v1
kind: Pod
metadata:
  name: nfs-pod
spec:
  containers:
  - name: nginx
    image: nginx
    volumeMounts:
    - mountPath: "/usr/share/nginx/html"
      name: nfs-volume
  volumes:
  - name: nfs-volume
    persistentVolumeClaim:
      claimName: nfs-pvc
```
In this example, an NFS PV is defined with a capacity of 5Gi and an access mode of ReadWriteMany. The PVC claims this PV, and the pod mounts the NFS volume through the PVC.
CSI allows Kubernetes to interact with different storage solutions, such as AWS EBS, EFS, GCE Persistent Disk, and more, without requiring changes to the core Kubernetes code. It enables storage vendors to develop their own CSI drivers, which can be easily plugged into Kubernetes clusters.
Here’s an example of defining a storage class using a CSI driver.
```yaml
# NFS StorageClass definition
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-storage
provisioner: example.com/nfs
volumeBindingMode: WaitForFirstConsumer
---
# NFS PersistentVolumeClaim definition
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-pvc
spec:
  storageClassName: nfs-storage
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 5Gi
---
# Pod definition
apiVersion: v1
kind: Pod
metadata:
  name: nfs-pod
spec:
  containers:
  - name: nginx
    image: nginx
    volumeMounts:
    - mountPath: "/usr/share/nginx/html"
      name: nfs-volume
  volumes:
  - name: nfs-volume
    persistentVolumeClaim:
      claimName: nfs-pvc
```
In this example, a storage class named “nfs-storage” is defined using the CSI driver with the provisioner “example.com/nfs”. The PVC references this storage class, and the pod mounts the volume using the PVC.
When running Kubernetes on Amazon Web Services (AWS), you can consider various storage options available to persist and manage data for your applications.
AWS Storage Options
1. Elastic Block Store (EBS):
- EBS is a block-level storage service that provides persistent storage volumes for EC2 instances.
- It allows you to create and attach volumes to EC2 instances, enabling data persistence even if the instances are terminated.
- EBS volumes are network-attached and can only be mounted to a single EC2 instance at a time.
- EBS supports various volume types, such as General Purpose SSD (gp2 and gp3), Provisioned IOPS SSD (io1 and io2), and Throughput Optimized HDD (st1).
- EBS volumes are specific to an Availability Zone and can be easily backed up using snapshots.
- Because an EBS volume can be mounted to only a single EC2 instance at a time, when using EBS as a PV in Kubernetes the `accessModes` of both the PV and PVC must be set to `ReadWriteOnce` (RWO).
2. Elastic File System (EFS):
- EFS is a fully managed file storage service that provides scalable and elastic file storage for EC2 instances.
- It allows multiple EC2 instances to access the same file system simultaneously, making it suitable for shared storage scenarios.
- EFS supports the Network File System (NFS) protocol, enabling seamless integration with Linux-based instances.
- EFS provides a simple and scalable solution for storing and sharing files across multiple instances and Availability Zones.
3. Instance Store:
- Instance Store provides temporary block-level storage that is physically attached to the host computer of an EC2 instance.
- The storage is ephemeral, meaning that the data stored on Instance Store volumes is lost when the instance is stopped or terminated.
- Instance Store offers high I/O performance and is suitable for temporary storage needs, such as caches, buffers, or scratch data.
- The capacity and performance of Instance Store volumes vary depending on the EC2 instance type.
4. S3 (Mountpoint for Amazon S3 CSI driver)
- S3 CSI support is not compatible with Fargate, which is a serverless compute engine for containers.
- S3 CSI support is also not compatible with Windows-based container images.
- This means that if you are using Fargate or Windows containers, you may not be able to directly use S3 as a storage backend through the official CSI driver.
- The official S3 CSI driver only supports static provisioning of volumes.
- Static provisioning means that the S3 buckets need to be manually created beforehand, and the CSI driver can then use those pre-existing buckets as persistent volumes.
- Dynamic provisioning, which automatically creates S3 buckets on-demand when a persistent volume claim is made, is not supported by the official CSI driver.
To use these AWS storage services with Kubernetes, you can leverage the Container Storage Interface (CSI). For EBS, install the AWS EBS CSI driver addon, which provides the `ebs-csi-controller`.
```bash
eksctl create iamserviceaccount \
  --name ebs-csi-controller-sa \
  --namespace kube-system \
  --cluster ${CLUSTER_NAME} \
  --attach-policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy \
  --approve \
  --role-only \
  --role-name AmazonEKS_EBS_CSI_DriverRole
eksctl get iamserviceaccount --cluster ${CLUSTER_NAME}

eksctl create addon --name aws-ebs-csi-driver --cluster ${CLUSTER_NAME} \
  --service-account-role-arn arn:aws:iam::${ACCOUNT_ID}:role/AmazonEKS_EBS_CSI_DriverRole --force
eksctl get addon --cluster ${CLUSTER_NAME}

kubectl get pod -n kube-system -l app=ebs-csi-controller -o jsonpath='{.items[0].spec.containers[*].name}' ; echo
```
```yaml
# Create a StorageClass for EBS
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: gp3
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
parameters:
  type: gp3
  encrypted: 'true'
reclaimPolicy: Delete
allowVolumeExpansion: true
```
```bash
kubectl get sc
#NAME            PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
#gp2 (default)   kubernetes.io/aws-ebs   Delete          WaitForFirstConsumer   false                  48m
#gp3             ebs.csi.aws.com         Delete          WaitForFirstConsumer   true                   10s
#local-path      rancher.io/local-path   Delete          WaitForFirstConsumer   false                  30m
```
```yaml
# Create a PersistentVolumeClaim (PVC)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ebs-claim
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: gp3
```
```yaml
# Use the PVC
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  - name: my-container
    image: my-image
    volumeMounts:
    - name: ebs-volume
      mountPath: /data
  volumes:
  - name: ebs-volume
    persistentVolumeClaim:
      claimName: ebs-claim
```
In Kubernetes, `nodeAffinity` is a feature that allows you to specify which nodes a resource (e.g. a PV) should be scheduled on or bound to. It provides a way to control the placement of resources based on node labels and expressions.
```bash
$ kubectl get pv -o yaml | yh
    nodeAffinity:
      required:
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.ebs.csi.aws.com/zone
            operator: In
            values:
            - ap-northeast-2b
```
The `nodeAffinity` configuration in this PV restricts the nodes on which the PV can be bound and mounted. It ensures that the PV is only used by nodes that:

- carry the Amazon EBS CSI driver's topology label (the `topology.ebs.csi.aws.com/zone` label key exists), and
- are located in the `ap-northeast-2b` Availability Zone.
This is important because EBS volumes are specific to an Availability Zone, and they can only be attached to EC2 instances (nodes) within the same Availability Zone. By setting the `nodeAffinity`, Kubernetes ensures that the PV is only bound to nodes that have access to the corresponding EBS volume.
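You can confirm which Availability Zone each node is in through its standard topology label, for example:

```bash
kubectl get nodes --label-columns topology.kubernetes.io/zone
```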
Yandex S3 Driver
The Yandex S3 CSI driver is a third-party implementation that allows using S3-compatible storage with Kubernetes. It offers features like dynamic provisioning, which means it can automatically create S3 buckets when a persistent volume claim is made.
- To use the S3 CSI driver, you first need to install it in your Kubernetes cluster.
```bash
helm repo add yandex-s3 https://yandex-cloud.github.io/k8s-csi-s3/charts
helm install csi-s3 yandex-s3/csi-s3
```
- For manual installation, you need to create the necessary Kubernetes resources, such as the provisioner, attacher, and driver.
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: csi-s3-secret
  namespace: kube-system
stringData:
  accessKeyID: <YOUR_ACCESS_KEY_ID>
  secretAccessKey: <YOUR_SECRET_ACCESS_KEY>
  endpoint: https://s3.<region>.amazonaws.com
  region: ap-northeast-2
```
- Define a StorageClass that references the S3 CSI driver and specifies any additional parameters.
- The StorageClass determines how volumes are provisioned and configured.
```yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: csi-s3
provisioner: ru.yandex.s3.csi
parameters:
  mounter: geesefs
  options: "--memory-limit 1000 --dir-mode 0777 --file-mode 0666"
```
- Use the StorageClass in your PersistentVolumeClaim to request storage from S3. The PVC specifies the desired storage capacity and access mode.
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: csi-s3-pvc
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
  storageClassName: csi-s3
```
- Mount the PVC in your Pod to access the S3 storage. Specify the PVC name and the desired mount path in the Pod manifest.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  - name: my-container
    image: my-image
    volumeMounts:
    - name: s3-storage
      mountPath: /data
  volumes:
  - name: s3-storage
    persistentVolumeClaim:
      claimName: csi-s3-pvc
```
Volume Snapshots Controller
In Kubernetes, taking snapshots of persistent volumes is a valuable feature for data backup and recovery purposes. AWS Elastic Block Store (EBS) supports the creation of snapshots, allowing you to capture the state of an EBS volume at a specific point in time.
By integrating the AWS Volume Snapshot Controller with Kubernetes, you can manage and automate the process of creating and restoring EBS snapshots directly from within your Kubernetes cluster.
To enable the volume snapshot functionality in Kubernetes, you need to install the Volume Snapshot Controller and create the necessary Custom Resource Definitions (CRDs) for volume snapshots by applying the following YAML files.
```bash
curl -s -O https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/master/client/config/crd/snapshot.storage.k8s.io_volumesnapshots.yaml
curl -s -O https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/master/client/config/crd/snapshot.storage.k8s.io_volumesnapshotclasses.yaml
curl -s -O https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/master/client/config/crd/snapshot.storage.k8s.io_volumesnapshotcontents.yaml
kubectl apply -f snapshot.storage.k8s.io_volumesnapshots.yaml,snapshot.storage.k8s.io_volumesnapshotclasses.yaml,snapshot.storage.k8s.io_volumesnapshotcontents.yaml
```
These CRDs define the necessary objects for managing volume snapshots: `VolumeSnapshot`, `VolumeSnapshotClass`, and `VolumeSnapshotContent`.
Verify that the CRDs are created successfully, then install the Volume Snapshot Controller by applying the RBAC and deployment YAML files. This step sets up the necessary RBAC permissions and deploys the Volume Snapshot Controller in the `kube-system` namespace.
```bash
kubectl get crd | grep snapshot
kubectl api-resources | grep snapshot

curl -s -O https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/master/deploy/kubernetes/snapshot-controller/rbac-snapshot-controller.yaml
curl -s -O https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/master/deploy/kubernetes/snapshot-controller/setup-snapshot-controller.yaml
kubectl apply -f rbac-snapshot-controller.yaml,setup-snapshot-controller.yaml
```
Verify that the Volume Snapshot Controller is running, then create a VolumeSnapshotClass by applying the YAML file below, which creates a `VolumeSnapshotClass` named `csi-aws-vsc` that uses the EBS CSI driver (`ebs.csi.aws.com`) for snapshot provisioning.
```bash
kubectl get deploy -n kube-system snapshot-controller
kubectl get pod -n kube-system | grep snapshot-controller

curl -s -O https://raw.githubusercontent.com/kubernetes-sigs/aws-ebs-csi-driver/master/examples/kubernetes/snapshot/manifests/classes/snapshotclass.yaml
kubectl apply -f snapshotclass.yaml
```
Create a `VolumeSnapshot` object by applying the following YAML file, which defines a `VolumeSnapshot` named `ebs-volume-snapshot` that uses the `csi-aws-vsc` snapshot class and specifies the source PVC (`ebs-claim`) for the snapshot.
```yaml
# ebs-volume-snapshot.yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: ebs-volume-snapshot
spec:
  volumeSnapshotClassName: csi-aws-vsc
  source:
    persistentVolumeClaimName: ebs-claim
```

```bash
kubectl apply -f ebs-volume-snapshot.yaml
kubectl get volumesnapshot
```
To restore a PVC from a volume snapshot, you can do the following.
```bash
kubectl delete pod app && kubectl delete pvc ebs-claim
```

Create a new PVC that references the volume snapshot as the data source. Note that the restored PVC must request at least as much storage as the snapshot's source PVC (10Gi in this walkthrough):

```yaml
# ebs-snapshot-restored-claim.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ebs-snapshot-restored-claim
spec:
  storageClassName: gp3
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  dataSource:
    name: ebs-volume-snapshot
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
```
```bash
kubectl apply -f ebs-snapshot-restored-claim.yaml
```

Create a new pod that uses the restored PVC:

```yaml
# ebs-snapshot-restored-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
  - name: app
    image: centos
    command: ["/bin/sh"]
    args: ["-c", "while true; do echo $(date -u) >> /data/out.txt; sleep 5; done"]
    volumeMounts:
    - name: persistent-storage
      mountPath: /data
  volumes:
  - name: persistent-storage
    persistentVolumeClaim:
      claimName: ebs-snapshot-restored-claim
```

```bash
kubectl apply -f ebs-snapshot-restored-pod.yaml
kubectl exec app -- cat /data/out.txt
```
- It is also possible to create the S3 bucket manually beforehand; the driver then uses the pre-existing bucket as a persistent volume. This is called static provisioning.
- Specify the `snapshots-pv-live.yaml` as shown below.
```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: snapshots-pv-live
spec:
  accessModes:
  - ReadWriteMany
  capacity:
    storage: 21Ti
  csi:
    driver: ru.yandex.s3.csi
    controllerPublishSecretRef:
      name: csi-s3-secret
      namespace: csi-s3
    nodePublishSecretRef:
      name: csi-s3-secret
      namespace: csi-s3
    nodeStageSecretRef:
      name: csi-s3-secret
      namespace: csi-s3
    volumeAttributes:
      bucket: test-smh-bucket
      mounter: geesefs
      options: --no-systemd --memory-limit 1000 --dir-mode 0777 --file-mode 0666
    volumeHandle: test-xxx-bucket/pvc-xxxxx-a753-4e7f-b557-c0b1476a7819
  persistentVolumeReclaimPolicy: Retain
  storageClassName: csi-s3
  volumeMode: Filesystem
```
In this example, a PersistentVolume named `snapshots-pv-live` is created with the following details:

- Access mode is set to `ReadWriteMany`, allowing multiple nodes to read and write to the volume.
- Capacity is set to 21Ti.
- The CSI driver is specified as `ru.yandex.s3.csi`, indicating the Yandex S3 CSI driver.
- Secret references (`controllerPublishSecretRef`, `nodePublishSecretRef`, `nodeStageSecretRef`) are provided for authentication.
- Volume attributes specify the S3 bucket name (`test-smh-bucket`), the mounter (`geesefs`), and additional options.
- The volume handle includes the bucket name and a unique identifier.
- Reclaim policy is set to `Retain`, meaning the volume will be retained even if the PVC is deleted.
- Storage class is set to `csi-s3`.
- To use the statically provisioned PV, create a PersistentVolumeClaim that references the PV using the `volumeName` field.
- Specify the `snapshots-pvc-live.yaml` as shown below.
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: snapshots-pvc-live
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 21Ti
  storageClassName: csi-s3
  volumeName: snapshots-pv-live
```
In this example, a PersistentVolumeClaim named `snapshots-pvc-live` is created with the following details:

- Access mode is set to `ReadWriteMany`, matching the PV's access mode.
- Requested storage capacity is set to 21Ti.
- Storage class is set to `csi-s3`, matching the PV's storage class.
- The `volumeName` field specifies the name of the statically provisioned PV (`snapshots-pv-live`) to bind the PVC to.
Note that in static provisioning, the S3 bucket needs to be manually created and managed outside of Kubernetes. The S3 CSI driver simply uses the pre-existing bucket as a persistent volume and does not handle the creation or deletion of the bucket itself. Static provisioning can be useful in scenarios where you have existing S3 buckets that you want to use as persistent volumes in Kubernetes, or when you need more control over the bucket lifecycle and configuration.
Implementing Amazon Elastic File System (EFS)
When running stateful applications on Kubernetes, it’s common to require shared storage that can be accessed by multiple pods simultaneously. While Amazon Elastic Block Store (EBS) provides persistent storage, it has a limitation of being mounted to a single EC2 instance at a time. To overcome this challenge and enable shared storage across multiple nodes, we can leverage EFS with Kubernetes on Amazon EKS.
To use EFS with Kubernetes on EKS, we can leverage the Amazon EFS CSI driver, as we did above. The CSI driver provides a standard interface for Kubernetes to interact with storage systems like EFS.
Install the EFS CSI Driver
- Let us follow the example provided by AWS.
- Create an IAM role for the EFS CSI driver with the necessary permissions to manage EFS resources.
- Deploy the EFS CSI driver as an addon in your EKS cluster using the `eksctl` command-line tool.
```bash
export CLUSTER_NAME=myeks
export ROLE_NAME=AmazonEKS_EFS_CSI_DriverRole

eksctl create iamserviceaccount \
  --name efs-csi-controller-sa \
  --namespace kube-system \
  --cluster $CLUSTER_NAME \
  --role-name $ROLE_NAME \
  --role-only \
  --attach-policy-arn arn:aws:iam::aws:policy/service-role/AmazonEFSCSIDriverPolicy \
  --approve
```
- Update the IAM role’s trust policy to allow the EFS CSI controller to assume the role.
```bash
TRUST_POLICY=$(aws iam get-role --role-name $ROLE_NAME --query 'Role.AssumeRolePolicyDocument' | \
  sed -e 's/efs-csi-controller-sa/efs-csi-*/' -e 's/StringEquals/StringLike/')
aws iam update-assume-role-policy --role-name $ROLE_NAME --policy-document "$TRUST_POLICY"
```
- Create the EFS CSI driver addon in your EKS cluster.
```bash
eksctl create addon \
  --name aws-efs-csi-driver \
  --cluster ${CLUSTER_NAME} \
  --service-account-role-arn arn:aws:iam::${ACCOUNT_ID}:role/AmazonEKS_EFS_CSI_DriverRole \
  --force
```
Create an EFS File System
- Create an EFS file system in the desired AWS region and VPC using the AWS Management Console or AWS CLI. If you have a bastion host, SSH into the bastion host or a jumpbox instance that has access to the EKS cluster and the EFS file system.
- Install the `amazon-efs-utils` package if it's not already installed, and create a directory where you want to mount the EFS file system, for example `/mnt/myefs`.
```bash
sudo yum install -y amazon-efs-utils   # on Amazon Linux; other distros build from source
mkdir -p /mnt/myefs

EfsFsId=$(aws efs describe-file-systems --query "FileSystems[*].FileSystemId" --output text)
mount -t efs -o tls $EfsFsId:/ /mnt/myefs
```
- Define a StorageClass that uses the EFS CSI driver as the provisioner.
```yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
```
- Define a PersistentVolume that references the EFS file system using the EFS file system ID.
```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: efs-pv
spec:
  capacity:
    storage: 5Gi
  volumeMode: Filesystem
  accessModes:
  - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: efs-sc
  csi:
    driver: efs.csi.aws.com
    volumeHandle: <EFS_FILE_SYSTEM_ID>
```

```bash
kubectl apply -f pv.yaml
```
- Define a PersistentVolumeClaim that requests storage from the EFS StorageClass.
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: efs-claim
spec:
  accessModes:
  - ReadWriteMany
  storageClassName: efs-sc
  resources:
    requests:
      storage: 5Gi
```

```bash
kubectl apply -f claim.yaml
```
- Create pods that mount the PVC and specify the mount path where the EFS file system will be accessible.
```yaml
# pod1.yaml
apiVersion: v1
kind: Pod
metadata:
  name: app1
spec:
  containers:
  - name: app1
    image: busybox
    command: ["/bin/sh"]
    args: ["-c", "while true; do echo $(date -u) >> /data/out1.txt; sleep 5; done"]
    volumeMounts:
    - name: persistent-storage
      mountPath: /data
  volumes:
  - name: persistent-storage
    persistentVolumeClaim:
      claimName: efs-claim
```

```yaml
# pod2.yaml
apiVersion: v1
kind: Pod
metadata:
  name: app2
spec:
  containers:
  - name: app2
    image: busybox
    command: ["/bin/sh"]
    args: ["-c", "while true; do echo $(date -u) >> /data/out2.txt; sleep 5; done"]
    volumeMounts:
    - name: persistent-storage
      mountPath: /data
  volumes:
  - name: persistent-storage
    persistentVolumeClaim:
      claimName: efs-claim
```

```bash
kubectl apply -f pod1.yaml,pod2.yaml
```
- You can exec into the pods, or inspect the file system from the bastion host where it is mounted, to confirm that the data is being shared across pods.
```bash
tree /mnt/myefs
#/mnt/myefs
#|-- out1.txt
#`-- out2.txt
```
Exploring Node Groups in Amazon EKS
Node Groups in EKS are logical groupings of worker nodes that share similar characteristics, such as instance type, CPU, RAM, and storage configuration. They provide a way to define the desired capacity and scaling behavior of the worker nodes within an EKS cluster.
When the workload on the cluster increases, EKS can automatically add new worker nodes to the node group to handle the increased demand. Similarly, when the workload decreases, EKS can scale down the number of worker nodes to optimize cost and resource utilization.
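Besides automatic scaling, a node group can also be resized manually. A sketch with eksctl, reusing the cluster and node group names from the examples later in this post:

```bash
eksctl scale nodegroup --cluster=myeks --name=ng3 --nodes=2 --nodes-min=1 --nodes-max=5
```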
Let us discuss two autoscaling tools and explain how they work with Node Groups in EKS.
Kubernetes Cluster Autoscaler (CA)
The Kubernetes Cluster Autoscaler is a tool that automatically adjusts the size of a Kubernetes cluster based on the resource demands of the workloads running on it. It monitors the resource utilization of the pods and scales the number of worker nodes in the cluster accordingly.
You can define a Node Group with an autoscaling configuration like the one below.
```yaml
# myng-autoscaling.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
managedNodeGroups:
- name: ng-autoscaling
  minSize: 1
  maxSize: 5
  desiredCapacity: 2
  privateNetworking: true
  iam:
    withAddonPolicies:
      autoScaler: true
```
The example defines a Node Group named `ng-autoscaling` with the following autoscaling configurations:
- Minimum size: 1 (minimum number of worker nodes)
- Maximum size: 5 (maximum number of worker nodes)
- Desired capacity: 2 (initial number of worker nodes)
- IAM policy for the Cluster Autoscaler is enabled
Make sure that you deploy the Cluster Autoscaler itself. The example manifest can be applied directly:

```bash
kubectl apply -f https://raw.githubusercontent.com/kubernetes/autoscaler/master/cluster-autoscaler/cloudprovider/aws/examples/cluster-autoscaler-autodiscover.yaml
```
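Note that this example manifest ships with a `<YOUR CLUSTER NAME>` placeholder in its auto-discovery flag, so a common pattern (assuming the same manifest) is to download it, substitute your cluster name, and then apply:

```bash
curl -s -O https://raw.githubusercontent.com/kubernetes/autoscaler/master/cluster-autoscaler/cloudprovider/aws/examples/cluster-autoscaler-autodiscover.yaml
sed -i "s/<YOUR CLUSTER NAME>/${CLUSTER_NAME}/g" cluster-autoscaler-autodiscover.yaml
kubectl apply -f cluster-autoscaler-autodiscover.yaml
```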
With the Cluster Autoscaler deployed and configured, it will continuously monitor the resource utilization of the pods in the cluster. If there are unschedulable pods due to insufficient resources, the Cluster Autoscaler will automatically scale up the number of worker nodes in the Node Group, up to the specified maximum size. Conversely, if there are underutilized worker nodes, the Cluster Autoscaler will scale down the Node Group to the minimum size to save costs.
Karpenter
Karpenter is an open-source autoscaling tool specifically designed for Kubernetes clusters running on AWS. It provides advanced autoscaling capabilities and integrates seamlessly with EKS.
Karpenter observes the resource requests of pending pods and automatically provisions worker nodes to accommodate them. It selects instance types and capacity based on the provisioner configuration (spot instances, in the example below) and scales the worker nodes as needed.
- Install Karpenter in your EKS cluster
```bash
helm repo add karpenter https://charts.karpenter.sh
helm repo update
helm upgrade --install karpenter karpenter/karpenter --namespace karpenter \
  --create-namespace --set serviceAccount.create=true --version v0.20.0 --wait
```
- Create a Karpenter provisioner
```yaml
# karpenter-provisioner.yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
  - key: karpenter.sh/capacity-type
    operator: In
    values: ["spot"]
  limits:
    resources:
      cpu: 1000
  provider:
    subnetSelector:
      karpenter.sh/discovery: my-cluster
    securityGroupSelector:
      karpenter.sh/discovery: my-cluster
  ttlSecondsAfterEmpty: 30
```
In this example, we define a Karpenter provisioner that:
- Selects spot instances for worker nodes
- Limits the total CPU capacity to 1000 cores
- Discovers subnets and security groups tagged with `karpenter.sh/discovery: my-cluster`
- Terminates empty nodes after 30 seconds
Karpenter can select the most suitable instance types based on the workload requirements. It considers factors like CPU, memory, and other pod constraints to determine the optimal instance type for each workload.
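To watch this in action, you can deploy a small workload with explicit resource requests and scale it up. This sketch uses the `pause` image that Karpenter's getting-started guides commonly use; the deployment name is arbitrary:

```bash
# Pending pods with CPU requests give Karpenter something to provision for
kubectl create deployment inflate --image=public.ecr.aws/eks-distro/kubernetes/pause:3.7
kubectl set resources deployment inflate --requests=cpu=1
kubectl scale deployment inflate --replicas=5

# Watch Karpenter bring up capacity for the pending pods
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter -c controller -f
```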
If you are running a Kubernetes cluster on AWS and want advanced autoscaling features and dynamic provisioning, Karpenter may be a better fit. If you need a more general-purpose, cloud-agnostic autoscaling solution, the Kubernetes Cluster Autoscaler is a solid choice.
Creating a Node Group with eksctl
Define the Node Group configuration in a YAML file:
```yaml
# myng3.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
managedNodeGroups:
- amiFamily: AmazonLinux2
  desiredCapacity: 1
  instanceType: t4g.medium
  labels:
    family: graviton
  maxSize: 1
  minSize: 1
  name: ng3
  ssh:
    allow: true
    publicKeyPath: ~/.ssh/id_rsa.pub
  subnets:
  - subnet-082b50963d6c944ef
  - subnet-03e23edb6dd18f876
  - subnet-029b2b9933d83de6f
  tags:
    alpha.eksctl.io/nodegroup-name: ng3
    alpha.eksctl.io/nodegroup-type: managed
  volumeSize: 30
metadata:
  name: myeks
  region: ap-northeast-2
  version: "1.28"
```
The above example defines a Node Group named `ng3` with the following configurations:
- Instance type: `t4g.medium` (ARM-based Graviton processor)
- Desired capacity: 1 (initial number of worker nodes)
- Minimum size: 1 (minimum number of worker nodes)
- Maximum size: 1 (maximum number of worker nodes)
- Labels: `family=graviton` (custom label for the worker nodes)
- SSH access: enabled with the specified public key path
- Subnets: specified subnet IDs for the worker nodes
- Tags: additional tags for the Node Group
Create the Node Group using eksctl:

```bash
eksctl create nodegroup -f myng3.yaml
```

Verify the Node Group creation:

```bash
kubectl get nodes --label-columns eks.amazonaws.com/nodegroup,kubernetes.io/arch
```
Taints and Tolerations
Taints and tolerations provide a mechanism to control which pods can be scheduled on specific worker nodes. Taints are applied to worker nodes, while tolerations are specified in the pod specification.
Taints
- A taint is a property that can be set on each node.
- When a taint is applied to a node, pods will not be scheduled on that node unless they have a matching toleration.
- Taints are commonly used to dedicate nodes for specific purposes or roles.
- You can use taints to ensure that only pods that require GPU resources are scheduled on nodes with GPUs, while other pods are prevented from being scheduled on those nodes.
A taint carries one of three effects:
- NoSchedule: If a pod does not have a matching toleration, it will not be scheduled on the tainted node. However, this effect does not apply to pods that are already running on the node.
- PreferNoSchedule: If a pod does not have a matching toleration, Kubernetes will try to avoid scheduling it on the tainted node. However, if there are resource constraints or insufficient nodes in the cluster, the pod may still be scheduled on the tainted node.
- NoExecute: If a pod does not have a matching toleration, it will not be scheduled on the tainted node. Additionally, if a pod is already running on the node and does not have a matching toleration, it will be evicted (terminated).
Tolerations
- A toleration is a property that can be set on a pod.
- Tolerations are applied to pods and allow them to be scheduled on nodes with matching taints.
- Taints and tolerations are like a “no entry” sign on a node. Taints act as the “no entry” sign, preventing pods from being scheduled on the node. Tolerations are like a special “access pass” that allows pods to ignore the “no entry” sign and be scheduled on the node. NodeAffinity is like a “preferred location” sign. It guides pods towards specific nodes based on node labels, indicating where the pod should be scheduled.
The goal for this example is to ensure that pods are scheduled on the nodes in the `ng3` node group.
Let’s apply a taint to the Node Group.
```bash
aws eks update-nodegroup-config --cluster-name $CLUSTER_NAME --nodegroup-name ng3 \
  --taints "addOrUpdateTaints=[{key=frontend, value=true, effect=NO_EXECUTE}]"
```
This command adds a taint to the `ng3` node group with the following properties:

- Key: `frontend`
- Value: `true`
- Effect: `NO_EXECUTE`
To allow pods to run on the `ng3` node group, they need a matching toleration in their configuration. The toleration should specify the same key (`frontend`) and effect as the taint; note that `NO_EXECUTE` in the AWS API corresponds to `NoExecute` in the Kubernetes pod spec. Here's an example of a pod configuration with the required toleration:
```yaml
# busybox.yaml
apiVersion: v1
kind: Pod
metadata:
  name: busybox
spec:
  terminationGracePeriodSeconds: 3
  containers:
  - name: busybox
    image: busybox
    command:
    - "/bin/sh"
    - "-c"
    - "while true; do date >> /home/pod-out.txt; cd /home; sync; sync; sleep 10; done"
  tolerations:
  - effect: NoExecute
    key: frontend
    operator: Exists
```
In this example, the pod has a toleration that matches the taint on the `ng3` node group. The `operator: Exists` setting means that the toleration will match any taint with the specified key, regardless of the value.
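Keep in mind that a toleration only permits scheduling onto the tainted nodes; it does not force it. Strictly pinning the pod to ng3 would additionally take a nodeSelector on the `family: graviton` label defined earlier. You can verify where the pod landed and which nodes carry the taint:

```bash
# Which node did the pod land on?
kubectl get pod busybox -o wide

# Which nodes carry the frontend taint?
kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'
```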
Again, it's important to note that taints and tolerations are different from `nodeAffinity`. While `nodeAffinity` is used to specify which nodes a pod prefers or requires to be scheduled on, taints and tolerations provide a way to restrict or allow pod scheduling on specific nodes based on the presence or absence of tolerations.
The `NO_EXECUTE` effect is useful in scenarios where you want to strictly control which pods can run on specific nodes and ensure that pods without the necessary tolerations are not scheduled on, or are evicted from, those nodes. Some common use cases include:
- Dedicated Nodes: You can use taints with the `NO_EXECUTE` effect to dedicate certain nodes to specific workloads or services. By applying a taint to those nodes and requiring pods to have a matching toleration, you ensure that only the desired pods are scheduled on those nodes.
- Maintenance or Decommissioning: If you need to perform maintenance on a node or decommission it, you can apply a taint with the `NO_EXECUTE` effect. This will evict any pods that don't have a matching toleration, allowing you to safely perform the maintenance or decommissioning tasks without affecting the workloads.
- Workload Isolation: Taints with the `NO_EXECUTE` effect can be used to isolate specific workloads or applications from certain nodes. By applying taints to nodes and requiring pods to have specific tolerations, you can ensure that only the intended workloads are scheduled on those nodes.
It's important to note that the `NO_EXECUTE` effect is the strictest among the available taint effects (`NO_SCHEDULE`, `PREFER_NO_SCHEDULE`, `NO_EXECUTE`). It not only prevents new pods from being scheduled on the tainted node but also evicts existing pods that don't have a matching toleration.
When using taints with the `NO_EXECUTE` effect, it's crucial to carefully consider the impact on existing pods and ensure that the necessary tolerations are specified in the pod configurations, to avoid unexpected evictions and disruptions to the workloads.