Kubernetes Deep Dive: LoadBalancer Services and MetalLB

Sigrid Jin
17 min read · Oct 5, 2024


As we continue our journey through the Kubernetes Advanced Networking Study (KANS), led by the esteemed CloudNet@ team, we’ve reached an exciting milestone in week 5. Today, we’ll explore the intricacies of LoadBalancer-type Services in Kubernetes, with a special focus on MetalLB for on-premises environments.

Understanding LoadBalancer Services

LoadBalancer Services in Kubernetes play a crucial role in exposing applications to external traffic using external load balancers. However, it’s essential to understand that their behavior can vary significantly depending on whether you’re operating in a public cloud environment like AWS or Azure, or in an on-premises setup.

When working with cloud environments, LoadBalancer Services typically operate in one of two modes:

NodePort Approach

  • External load balancer connects to a NodePort-exposed service
  • Service then distributes traffic to target pods
  • Involves two load balancing steps: at the external load balancer and at the service level
  • Potential drawbacks: slightly reduced performance and increased network costs

Pod Direct Approach

  • External load balancer distributes traffic directly to pod IPs
  • A LoadBalancer Controller pod communicates pod IPs to the external load balancer
  • More efficient, involving only one load balancing step
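
On AWS, for instance, the AWS Load Balancer Controller chooses between these two modes per Service via a target-type annotation. As a hedged sketch (the annotation values come from that controller's conventions, not from anything configured in this lab):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "external"
    # "ip" targets pod IPs directly (Pod Direct Approach);
    # "instance" targets NodePorts (NodePort Approach)
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: "ip"
spec:
  type: LoadBalancer
  ports:
  - port: 80
    targetPort: 80
  selector:
    app: web
```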

However, in on-premises environments, achieving similar functionality requires additional tools like MetalLB. MetalLB operates by creating speaker pods as a DaemonSet in your Kubernetes cluster. These speaker pods are responsible for advertising the External IP addresses of your LoadBalancer Services to the outside world, making your services accessible externally.

To advertise External IPs to the surrounding network, MetalLB speaker pods use either ARP (Address Resolution Protocol) or BGP (Border Gateway Protocol). This gives rise to two operational modes.

Layer 2 mode (using ARP)

In Layer 2 mode, MetalLB operates by electing a leader among the speaker pods. This leader pod, and by extension, the node it’s running on, becomes the sole entry point for incoming traffic to the LoadBalancer Service.

  1. A leader pod is elected among the MetalLB speaker pods
  2. The leader pod’s node receives all incoming traffic for the LoadBalancer Service
  3. Traffic is then distributed to target pods using iptables rules on the leader node
  4. Speaker pods use the host network for optimal performance

If the leader pod or its node fails, a new leader is elected from the remaining speaker pods. This process typically takes 10–20 seconds, during which service interruption may occur. Layer 2 mode can lead to potential bottlenecks as all traffic is funneled through a single node. This, combined with the brief downtime during leader re-election, makes Layer 2 mode less suitable for production environments with high availability requirements.

BGP Mode

https://www.linkedin.com/pulse/metallb-loadbalancer-bgp-k8s-rock-music-dipankar-shaw

BGP mode offers a more robust and scalable solution for LoadBalancer Services in on-premises environments.

  1. Speaker pods use BGP to advertise External IPs to the network
  2. External routers distribute incoming traffic using Equal-Cost Multi-Path (ECMP) routing
  3. This approach eliminates the single-node bottleneck seen in Layer 2 mode, as traffic can be efficiently distributed across multiple nodes

BGP mode is generally recommended for production use due to its superior load distribution and faster failover capabilities.
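
As a minimal sketch, a BGP-mode configuration pairs a BGPPeer with a BGPAdvertisement. The ASNs and router address below are placeholders, not values from this lab:

```yaml
apiVersion: metallb.io/v1beta1
kind: BGPPeer
metadata:
  name: example-router
  namespace: metallb-system
spec:
  myASN: 64500          # placeholder: ASN the speakers present
  peerASN: 64501        # placeholder: ASN of the upstream router
  peerAddress: 10.0.0.1 # placeholder: upstream router IP
---
apiVersion: metallb.io/v1beta1
kind: BGPAdvertisement
metadata:
  name: example
  namespace: metallb-system
spec:
  ipAddressPools:
  - first-pool          # the IPAddressPool whose IPs are advertised over BGP
```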

Implementing MetalLB in Your Kubernetes Cluster

Let’s create a kind cluster for the sake of testing.

cat <<EOT> kind-svc-2w.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
  "InPlacePodVerticalScaling": true
  "MultiCIDRServiceAllocator": true
nodes:
- role: control-plane
  labels:
    mynode: control-plane
    topology.kubernetes.io/zone: ap-northeast-2a
  extraPortMappings:
  - containerPort: 30000
    hostPort: 30000
  - containerPort: 30001
    hostPort: 30001
  - containerPort: 30002
    hostPort: 30002
  - containerPort: 30003
    hostPort: 30003
  - containerPort: 30004
    hostPort: 30004
  kubeadmConfigPatches:
  - |
    kind: ClusterConfiguration
    apiServer:
      extraArgs:
        runtime-config: api/all=true
    controllerManager:
      extraArgs:
        bind-address: 0.0.0.0
    etcd:
      local:
        extraArgs:
          listen-metrics-urls: http://0.0.0.0:2381
    scheduler:
      extraArgs:
        bind-address: 0.0.0.0
  - |
    kind: KubeProxyConfiguration
    metricsBindAddress: 0.0.0.0
- role: worker
  labels:
    mynode: worker1
    topology.kubernetes.io/zone: ap-northeast-2a
- role: worker
  labels:
    mynode: worker2
    topology.kubernetes.io/zone: ap-northeast-2b
- role: worker
  labels:
    mynode: worker3
    topology.kubernetes.io/zone: ap-northeast-2c
networking:
  podSubnet: 10.10.0.0/16
  serviceSubnet: 10.200.1.0/24
EOT

kind create cluster --config kind-svc-2w.yaml --name myk8s --image kindest/node:v1.31.0

docker exec -it myk8s-control-plane sh -c 'apt update && apt install tree psmisc lsof wget bsdmainutils bridge-utils net-tools dnsutils ipset ipvsadm nfacct tcpdump ngrep iputils-ping arping git vim arp-scan -y'
for i in worker worker2 worker3; do echo ">> node myk8s-$i <<"; docker exec -it myk8s-$i sh -c 'apt update && apt install tree psmisc lsof wget bsdmainutils bridge-utils net-tools dnsutils ipset ipvsadm nfacct tcpdump ngrep iputils-ping arping -y'; echo; done

docker run -d --rm --name mypc --network kind --ip 172.18.0.100 nicolaka/netshoot sleep infinity # run with a fixed IP

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: webpod1
  labels:
    app: webpod
spec:
  nodeName: myk8s-worker
  containers:
  - name: container
    image: traefik/whoami
  terminationGracePeriodSeconds: 0
---
apiVersion: v1
kind: Pod
metadata:
  name: webpod2
  labels:
    app: webpod
spec:
  nodeName: myk8s-worker2
  containers:
  - name: container
    image: traefik/whoami
  terminationGracePeriodSeconds: 0
EOF

You can install MetalLB with a simple manifest configuration.

sigridjineth@sigridjineth-Z590-VISION-G:~$ kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/refs/heads/main/config/manifests/metallb-native.yaml
namespace/metallb-system created
customresourcedefinition.apiextensions.k8s.io/bfdprofiles.metallb.io created
customresourcedefinition.apiextensions.k8s.io/bgpadvertisements.metallb.io created
customresourcedefinition.apiextensions.k8s.io/bgppeers.metallb.io created
customresourcedefinition.apiextensions.k8s.io/communities.metallb.io created
customresourcedefinition.apiextensions.k8s.io/ipaddresspools.metallb.io created
customresourcedefinition.apiextensions.k8s.io/l2advertisements.metallb.io created
customresourcedefinition.apiextensions.k8s.io/servicel2statuses.metallb.io created
serviceaccount/controller created
serviceaccount/speaker created
role.rbac.authorization.k8s.io/controller created
role.rbac.authorization.k8s.io/pod-lister created
clusterrole.rbac.authorization.k8s.io/metallb-system:controller created
clusterrole.rbac.authorization.k8s.io/metallb-system:speaker created
rolebinding.rbac.authorization.k8s.io/controller created
rolebinding.rbac.authorization.k8s.io/pod-lister created
clusterrolebinding.rbac.authorization.k8s.io/metallb-system:controller created
clusterrolebinding.rbac.authorization.k8s.io/metallb-system:speaker created
configmap/metallb-excludel2 created
secret/metallb-webhook-cert created
service/metallb-webhook-service created
deployment.apps/controller created
daemonset.apps/speaker created

kubectl get crd | grep metallb

kubectl get all,configmap,secret,ep -n metallb-system

Now that we understand the theory behind MetalLB, let’s dive into its practical implementation. We’ll go through the process step-by-step, from configuration to testing failover scenarios.

The first crucial step in setting up MetalLB is defining the IP pool it will use for External IPs. We use an IPAddressPool resource to define the range of IP addresses MetalLB can assign:

apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: first-pool
  namespace: metallb-system
spec:
  addresses:
  - 172.19.255.200-172.19.255.250
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: example
  namespace: metallb-system
spec:
  ipAddressPools:
  - first-pool
# kubectl apply -f metallb-config.yaml

The first IPAddressPool configuration allows MetalLB to use IP addresses from 172.19.255.200 to 172.19.255.250, providing 51 available IPs.

After declaring the IP pool, we need to allow its use in Layer 2 mode. This L2Advertisement resource tells MetalLB to advertise the IPs from our first-pool using Layer 2 mode.

Now, let’s create some LoadBalancer-type services to test our MetalLB setup. This YAML creates three LoadBalancer services, each exposing port 80.

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: svc1
spec:
  ports:
  - name: svc1-webport
    port: 80
    targetPort: 80
  selector:
    app: webpod
  type: LoadBalancer # service type is LoadBalancer
---
apiVersion: v1
kind: Service
metadata:
  name: svc2
spec:
  ports:
  - name: svc2-webport
    port: 80
    targetPort: 80
  selector:
    app: webpod
  type: LoadBalancer
---
apiVersion: v1
kind: Service
metadata:
  name: svc3
spec:
  ports:
  - name: svc3-webport
    port: 80
    targetPort: 80
  selector:
    app: webpod
  type: LoadBalancer
EOF

# service/svc1 created
# service/svc2 created
# service/svc3 created
Note that the EXTERNAL-IP of each LoadBalancer service is the address that the MetalLB speaker pods will advertise for external access. Also, port 80 is the service (ClusterIP) port, while 32503, 30616, and 32667 are the automatically allocated NodePorts.

By default, LoadBalancer services include a NodePort. For security reasons, it’s often better to close unused ports. We can disable NodePort allocation:

kubectl patch svc svc1 -p '{"spec": {"allocateLoadBalancerNodePorts": false}}'

# kubectl get svc svc1 -o json | jq
# "spec": {
# "allocateLoadBalancerNodePorts": false,
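
The same field can also be set declaratively when the Service is created (supported in recent Kubernetes releases). A sketch, reusing the svc1 spec from above:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: svc1
spec:
  type: LoadBalancer
  allocateLoadBalancerNodePorts: false # skip NodePort allocation for this LB service
  ports:
  - name: svc1-webport
    port: 80
    targetPort: 80
  selector:
    app: webpod
```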

In Layer 2 mode, one MetalLB speaker pod is elected as the leader. We can identify this using ARP scanning, which helps us understand which node is handling traffic for each service.

Recall that we’re working with a kind Kubernetes cluster, which simulates a multi-node environment using Docker containers, and that we deployed MetalLB to provide LoadBalancer functionality, which isn’t natively available in kind.

Let’s take a look at the network configuration within our Kind cluster:

sigridjineth@sigridjineth-Z590-VISION-G:~$ docker exec -it myk8s-worker sh -c "arp -e"
Address HWtype HWaddress Flags Mask Iface
myk8s-worker2.kind ether 02:42:ac:13:00:04 C eth0
myk8s-control-plane.kin ether 02:42:ac:13:00:03 C eth0
myk8s-worker3.kind ether 02:42:ac:13:00:05 C eth0
10.10.3.2 ether 0e:d3:98:f4:fd:45 C vethe80a3506
sigridjineth-Z590-VISIO ether 02:42:fe:37:c0:e9 C eth0
10.10.3.4 ether 5e:04:ad:2d:fe:fe C veth9e0dbea4

This command reveals the ARP (Address Resolution Protocol) table of one of our worker nodes. The output shows:

  1. The cluster nodes (myk8s-worker2, myk8s-control-plane, myk8s-worker3) are on the same network (172.19.0.0/16).
  2. There are also some pod IPs (10.10.3.2, 10.10.3.4) visible, indicating the pod network is in the 10.10.0.0/16 range.

sigridjineth@sigridjineth-Z590-VISION-G:~$ docker exec -it myk8s-worker sh -c "iptables-save | grep 172.19.255"
-A KUBE-SERVICES -d 172.19.255.201/32 -p tcp -m comment --comment "default/svc2:svc2-webport loadbalancer IP" -m tcp --dport 80 -j KUBE-EXT-52NWEQM6QLDFKRCQ
-A KUBE-SERVICES -d 172.19.255.200/32 -p tcp -m comment --comment "default/svc1:svc1-webport loadbalancer IP" -m tcp --dport 80 -j KUBE-EXT-DLGPAL4ZCYSJ7UPR
-A KUBE-SERVICES -d 172.19.255.202/32 -p tcp -m comment --comment "default/svc3:svc3-webport loadbalancer IP" -m tcp --dport 80 -j KUBE-EXT-CY3XQR2NWKKCV4WA

Examining the iptables rules shows how traffic to our LoadBalancer IPs (172.19.255.200, 172.19.255.201, 172.19.255.202) is handled. Each service has a corresponding rule that forwards traffic to the appropriate Kubernetes service chain.

kubectl run test-pod --image=busybox -- sleep 3600
kubectl exec -it test-pod -- sh
# Inside the pod
wget -O- http://172.19.255.200
wget -O- http://172.19.255.201
wget -O- http://172.19.255.202

--2024-10-05 23:55:09-- http://172.19.255.200/
Connecting to 172.19.255.200:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 242 [text/plain]
Saving to: ‘STDOUT’

- 0%[ ] 0 --.-KB/s Hostname: webpod2
IP: 127.0.0.1
IP: ::1
IP: 10.10.2.2
IP: fe80::98ec:e2ff:fed5:1748
RemoteAddr: 172.19.0.5:38696
GET / HTTP/1.1
Host: 172.19.255.200
User-Agent: Wget/1.21.2
Accept: */*
Accept-Encoding: identity
Connection: Keep-Alive

--2024-10-05 23:55:09-- http://172.19.255.201/
Connecting to 172.19.255.201:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 242 [text/plain]
Saving to: ‘STDOUT’

- 0%[ ] 0 --.-KB/s Hostname: webpod2
IP: 127.0.0.1
IP: ::1
IP: 10.10.2.2
IP: fe80::98ec:e2ff:fed5:1748
RemoteAddr: 172.19.0.5:46804
GET / HTTP/1.1
Host: 172.19.255.201
User-Agent: Wget/1.21.2
Accept: */*
Accept-Encoding: identity
Connection: Keep-Alive

--2024-10-05 23:55:09-- http://172.19.255.202/
Connecting to 172.19.255.202:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 241 [text/plain]
Saving to: ‘STDOUT’

- 0%[ ] 0 --.-KB/s Hostname: webpod1
IP: 127.0.0.1
IP: ::1
IP: 10.10.3.2
IP: fe80::cd3:98ff:fef4:fd45
RemoteAddr: 172.19.0.4:21473
GET / HTTP/1.1
Host: 172.19.255.202
User-Agent: Wget/1.21.2
Accept: */*
Accept-Encoding: identity
Connection: Keep-Alive

The successful responses from these wget commands confirm that our services are accessible within the cluster, and MetalLB is correctly routing traffic to the backend pods.

kubectl logs -n metallb-system -l app=metallb,component=speaker

{"caller":"speakerlist.go:313","level":"info","msg":"node event - forcing sync","node addr":"172.19.0.2","node event":"NodeLeave","node name":"myk8s-worker","ts":"2024-10-05T14:52:49Z"}
{"caller":"service_controller_reload.go:63","controller":"ServiceReconciler - reprocessAll","level":"info","start reconcile":"metallbreload/reload","ts":"2024-10-05T14:52:49Z"}
{"caller":"service_controller_reload.go:119","controller":"ServiceReconciler - reprocessAll","end reconcile":"metallbreload/reload","level":"info","ts":"2024-10-05T14:52:49Z"}
{"caller":"speakerlist.go:313","level":"info","msg":"node event - forcing sync","node addr":"172.19.0.2","node event":"NodeJoin","node name":"myk8s-worker","ts":"2024-10-05T14:52:50Z"}
{"caller":"service_controller_reload.go:63","controller":"ServiceReconciler - reprocessAll","level":"info","start reconcile":"metallbreload/reload","ts":"2024-10-05T14:52:50Z"}
{"caller":"service_controller_reload.go:119","controller":"ServiceReconciler - reprocessAll","end reconcile":"metallbreload/reload","level":"info","ts":"2024-10-05T14:52:50Z"}
{"caller":"main.go:420","event":"serviceAnnounced","ips":["172.19.255.202"],"level":"info","msg":"service has IP, announcing","pool":"first-pool","protocol":"layer2","ts":"2024-10-05T14:52:50Z"}
{"caller":"service_controller_reload.go:119","controller":"ServiceReconciler - reprocessAll","end reconcile":"metallbreload/reload","level":"info","ts":"2024-10-05T14:52:50Z"}
{"caller":"layer2_status_controller.go:68","controller":"Layer2StatusReconciler","level":"info","start reconcile":"default/svc3","ts":"2024-10-05T14:52:50Z"}

{"caller":"layer2_status_controller.go:68","controller":"Layer2StatusReconciler","level":"info","start reconcile":"default/svc2","ts":"2024-10-05T14:52:50Z"}
{"caller":"layer2_status_controller.go:135","controller":"Layer2StatusReconciler","end reconcile":"default/svc2","level":"info","ts":"2024-10-05T14:52:50Z"}
{"caller":"layer2_status_controller.go:68","controller":"Layer2StatusReconciler","level":"info","start reconcile":"default/svc1","ts":"2024-10-05T14:52:50Z"}
{"caller":"layer2_status_controller.go:139","controller":"Layer2StatusReconciler","end reconcile":"default/svc1","level":"info","ts":"2024-10-05T14:52:50Z"}
{"caller":"layer2_status_controller.go:68","controller":"Layer2StatusReconciler","level":"info","start reconcile":"default/svc2","ts":"2024-10-05T14:52:50Z"}
{"caller":"layer2_status_controller.go:139","controller":"Layer2StatusReconciler","end reconcile":"default/svc2","level":"info","ts":"2024-10-05T14:52:50Z"}
{"caller":"layer2_status_controller.go:68","controller":"Layer2StatusReconciler","level":"info","start reconcile":"default/svc1","ts":"2024-10-05T14:52:50Z"}
{"caller":"layer2_status_controller.go:111","controller":"Layer2StatusReconciler","end reconcile":"default/svc1","level":"info","ts":"2024-10-05T14:52:50Z"}
{"caller":"layer2_status_controller.go:68","controller":"Layer2StatusReconciler","level":"info","start reconcile":"default/svc2","ts":"2024-10-05T14:52:50Z"}
{"caller":"layer2_status_controller.go:111","controller":"Layer2StatusReconciler","end reconcile":"default/svc2","level":"info","ts":"2024-10-05T14:52:50Z"}

MetalLB reacts to node join and leave events, forcing a sync each time. This ensures the load balancer configuration stays up-to-date with the cluster’s node state. We see log entries like "event":"serviceAnnounced", indicating that MetalLB is announcing the IP addresses for our services using the Layer 2 protocol.

MetalLB assigns IP addresses from the configured pool (172.19.255.200–250) to LoadBalancer services. In Layer 2 mode, MetalLB uses ARP to announce the service IPs. One node (the leader) responds to ARP requests for each service IP. Once traffic reaches the node handling a particular service IP, iptables rules direct it to the appropriate pods. MetalLB monitors node status and can reassign IPs if a node fails, ensuring continued service availability.

There are frequent “start reconcile” and “end reconcile” log entries, showing that MetalLB is constantly checking and updating its state to match the desired state of the cluster.

We’ll simulate a leader pod failure and observe how MetalLB handles the failover.

  1. First, let’s set up continuous access to a service.
  2. Then, we’ll identify and stop the node with the leader speaker pod.
  3. We’ll monitor service access during this process.

SVC1EXIP=$(kubectl get svc svc1 -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
while true; do
curl -s --connect-timeout 1 $SVC1EXIP | grep Hostname
echo "$(date): $?"
sleep 1
done

Run this in a separate terminal window. It will continuously attempt to access the service and print the hostname of the responding pod along with the exit status of the curl command.

To identify which speaker pod is handling a particular service IP, we need to look at the MetalLB speaker logs.

sigridjineth@sigridjineth-Z590-VISION-G:~$ for pod in $(kubectl get pods -n metallb-system -l app=metallb,component=speaker -o name); do
echo "Checking $pod"
kubectl logs -n metallb-system $pod | grep "serviceAnnounced" | grep "172.19.255.202"
done
Checking pod/speaker-gwt5t
Checking pod/speaker-mpvc5 <- the leader announcing this IP; this is the node we will stop.
{"caller":"main.go:420","event":"serviceAnnounced","ips":["172.19.255.202"],"level":"info","msg":"service has IP, announcing","pool":"first-pool","protocol":"layer2","ts":"2024-10-05T14:52:40Z"}
{"caller":"main.go:420","event":"serviceAnnounced","ips":["172.19.255.202"],"level":"info","msg":"service has IP, announcing","pool":"first-pool","protocol":"layer2","ts":"2024-10-05T14:52:49Z"}
{"caller":"main.go:420","event":"serviceAnnounced","ips":["172.19.255.202"],"level":"info","msg":"service has IP, announcing","pool":"first-pool","protocol":"layer2","ts":"2024-10-05T14:52:50Z"}
Checking pod/speaker-nwcds
Checking pod/speaker-qv42s

sigridjineth@sigridjineth-Z590-VISION-G:~$ kubectl get pods -o wide -n metallb-system
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
controller-549b46878c-swbx4 1/1 Running 0 17m 10.10.3.4 myk8s-worker <none> <none>
speaker-758nn 1/1 Running 0 39s 172.19.0.4 myk8s-worker2 <none> <none>
speaker-8bljr 1/1 Running 0 39s 172.19.0.2 myk8s-worker <none> <none>
speaker-dgpln 1/1 Running 0 38s 172.19.0.3 myk8s-control-plane <none> <none>
speaker-w4jjj 1/1 Running 0 39s 172.19.0.5 myk8s-worker3 <none> <none>
sigridjineth@sigridjineth-Z590-VISION-G:~$ docker stop myk8s-worker2 --signal 15
error: lost connection to pod

Initially, all requests should succeed, returning the hostname of a pod. When you stop the node, you may see some failed requests or timeouts.

In the MetalLB logs, you should see messages about the node leaving the cluster and potentially a new leader being elected. After a short period (usually 5–10 seconds), requests should start succeeding again as a new leader takes over.

When you restart the node, you’ll see logs about the node rejoining, and there might be another brief interruption as MetalLB reconfigures.

Understanding IPVS Proxy Mode

https://mokpolar.tistory.com/66

IPVS (IP Virtual Server) mode leverages the Linux kernel’s IPVS module to act as a service proxy in Kubernetes. Here’s why it’s gaining traction:

  • IPVS is a Layer 4 load balancer operating within the netfilter framework.
  • It offers superior performance compared to iptables.
  • IPVS reduces the number of rules required for service proxying.
  • It provides a variety of load balancing algorithms, offering flexibility for different use cases.

https://kubernetes.io/blog/2018/07/09/ipvs-based-in-cluster-load-balancing-deep-dive/

IPVS Load Balancing algorithms allow for more sophisticated traffic distribution compared to the basic iptables proxy mode.

  1. Round Robin (rr) : Distributes requests evenly across target pods without prioritization.
  2. Least Connection (lc) : Forwards traffic to pods with the fewest active connections.
  3. Destination Hashing (dh) : Uses a hash of the destination IP to determine the target pod.
  4. Source Hashing (sh) : Calculates a hash based on the source IP to select the destination pod.
  5. Shortest Expected Delay (sed) : Chooses the pod with the fastest expected response time.
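
The default is rr, as the kube-proxy logs later in this post confirm. To select a different algorithm, you can set the scheduler field of the kube-proxy configuration, for example inside the kind kubeadmConfigPatches used below. A sketch, with lc chosen purely as an example:

```yaml
kind: KubeProxyConfiguration
mode: ipvs
ipvs:
  scheduler: "lc" # lc = least connection; the default "" falls back to rr
  strictARP: true
```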

Hands-on: Setting Up IPVS Proxy Mode

Let’s walk through setting up a Kubernetes cluster with IPVS proxy mode enabled using kind.

cat <<EOT> kind-svc-2w-ipvs.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
  "InPlacePodVerticalScaling": true
  "MultiCIDRServiceAllocator": true
nodes:
- role: control-plane
  labels:
    mynode: control-plane
    topology.kubernetes.io/zone: ap-northeast-2a
  extraPortMappings:
  - containerPort: 30000
    hostPort: 30000
  - containerPort: 30001
    hostPort: 30001
  - containerPort: 30002
    hostPort: 30002
  - containerPort: 30003
    hostPort: 30003
  - containerPort: 30004
    hostPort: 30004
  kubeadmConfigPatches:
  - |
    kind: ClusterConfiguration
    apiServer:
      extraArgs:
        runtime-config: api/all=true
    controllerManager:
      extraArgs:
        bind-address: 0.0.0.0
    etcd:
      local:
        extraArgs:
          listen-metrics-urls: http://0.0.0.0:2381
    scheduler:
      extraArgs:
        bind-address: 0.0.0.0
  - |
    kind: KubeProxyConfiguration
    metricsBindAddress: 0.0.0.0
    ipvs:
      strictARP: true
- role: worker
  labels:
    mynode: worker1
    topology.kubernetes.io/zone: ap-northeast-2a
- role: worker
  labels:
    mynode: worker2
    topology.kubernetes.io/zone: ap-northeast-2b
- role: worker
  labels:
    mynode: worker3
    topology.kubernetes.io/zone: ap-northeast-2c
networking:
  podSubnet: 10.10.0.0/16
  serviceSubnet: 10.200.1.0/24
  kubeProxyMode: "ipvs"
EOT

# install the k8s cluster
kind create cluster --config kind-svc-2w-ipvs.yaml --name myk8s --image kindest/node:v1.31.0

# install basic tools on each node
docker exec -it myk8s-control-plane sh -c 'apt update && apt install tree psmisc lsof wget bsdmainutils bridge-utils net-tools dnsutils ipset ipvsadm nfacct tcpdump ngrep iputils-ping arping git vim arp-scan -y'
for i in worker worker2 worker3; do echo ">> node myk8s-$i <<"; docker exec -it myk8s-$i sh -c 'apt update && apt install tree psmisc lsof wget bsdmainutils bridge-utils net-tools dnsutils ipset ipvsadm nfacct tcpdump ngrep iputils-ping arping -y'; echo; done

# create a test container
docker run -d --rm --name mypc --network kind --ip 172.18.0.100 nicolaka/netshoot sleep infinity

Proceed with creating test pods and corresponding ClusterIP service.

# create the target pods
cat <<EOT> 3pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: webpod1
  labels:
    app: webpod
spec:
  nodeName: myk8s-worker
  containers:
  - name: container
    image: traefik/whoami
  terminationGracePeriodSeconds: 0
---
apiVersion: v1
kind: Pod
metadata:
  name: webpod2
  labels:
    app: webpod
spec:
  nodeName: myk8s-worker2
  containers:
  - name: container
    image: traefik/whoami
  terminationGracePeriodSeconds: 0
---
apiVersion: v1
kind: Pod
metadata:
  name: webpod3
  labels:
    app: webpod
spec:
  nodeName: myk8s-worker3
  containers:
  - name: container
    image: traefik/whoami
  terminationGracePeriodSeconds: 0
EOT

# create the client pod
cat <<EOT> netpod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: net-pod
spec:
  nodeName: myk8s-control-plane
  containers:
  - name: netshoot-pod
    image: nicolaka/netshoot
    command: ["tail"]
    args: ["-f", "/dev/null"]
  terminationGracePeriodSeconds: 0
EOT

# create the service
cat <<EOT> svc-clusterip.yaml
apiVersion: v1
kind: Service
metadata:
  name: svc-clusterip
spec:
  ports:
  - name: svc-webport
    port: 9000     # the port exposed on the service IP
    targetPort: 80 # the port on the destination pods that the service forwards to
  selector:
    app: webpod    # pods carrying the app: webpod label are attached to this service
  type: ClusterIP  # service type
EOT

# deploy
kubectl apply -f 3pod.yaml,netpod.yaml,svc-clusterip.yaml

# pod/webpod1 created
# pod/webpod2 created
# pod/webpod3 created
# pod/net-pod created
# service/svc-clusterip created

Let’s check the basic IPVS settings.

sigridjineth@sigridjineth-Z590-VISION-G:~$ kubectl krew install stern
Updated the local copy of plugin index.
Installing plugin: stern
Installed plugin: stern
| Use this plugin:
| kubectl stern
| Documentation:
| https://github.com/stern/stern
WARNING: You installed plugin "stern" from the krew-index plugin repository.
These plugins are not audited for security by the Krew maintainers.
Run them at your own risk.
sigridjineth@sigridjineth-Z590-VISION-G:~$ kubectl stern -n kube-system -l k8s-app=kube-proxy --since 2h | egrep '(ipvs|IPVS)'
+ kube-proxy-9s2gn › kube-proxy
+ kube-proxy-c8wlp › kube-proxy
+ kube-proxy-gxd7l › kube-proxy
+ kube-proxy-vg2nm › kube-proxy
kube-proxy-9s2gn kube-proxy I1005 15:15:44.912720 1 server_linux.go:230] "Using ipvs Proxier"
kube-proxy-9s2gn kube-proxy I1005 15:15:44.914038 1 proxier.go:364] "IPVS scheduler not specified, use rr by default" ipFamily="IPv4"
kube-proxy-9s2gn kube-proxy I1005 15:15:44.914139 1 proxier.go:364] "IPVS scheduler not specified, use rr by default" ipFamily="IPv6"
kube-proxy-gxd7l kube-proxy I1005 15:15:48.786143 1 server_linux.go:230] "Using ipvs Proxier"
kube-proxy-gxd7l kube-proxy I1005 15:15:48.787348 1 proxier.go:364] "IPVS scheduler not specified, use rr by default" ipFamily="IPv4"
kube-proxy-gxd7l kube-proxy I1005 15:15:48.787426 1 proxier.go:364] "IPVS scheduler not specified, use rr by default" ipFamily="IPv6"
kube-proxy-vg2nm kube-proxy I1005 15:15:48.423922 1 server_linux.go:230] "Using ipvs Proxier"
kube-proxy-vg2nm kube-proxy I1005 15:15:48.425051 1 proxier.go:364] "IPVS scheduler not specified, use rr by default" ipFamily="IPv4"
kube-proxy-vg2nm kube-proxy I1005 15:15:48.425121 1 proxier.go:364] "IPVS scheduler not specified, use rr by default" ipFamily="IPv6"
kube-proxy-c8wlp kube-proxy I1005 15:15:48.781916 1 server_linux.go:230] "Using ipvs Proxier"
kube-proxy-c8wlp kube-proxy I1005 15:15:48.783102 1 proxier.go:364] "IPVS scheduler not specified, use rr by default" ipFamily="IPv4"
kube-proxy-c8wlp kube-proxy I1005 15:15:48.783193 1 proxier.go:364] "IPVS scheduler not specified, use rr by default" ipFamily="IPv6"
- kube-proxy-c8wlp › kube-proxy
- kube-proxy-vg2nm › kube-proxy
^C
sigridjineth@sigridjineth-Z590-VISION-G:~$ kubectl get cm -n kube-system kube-proxy -o yaml | egrep 'mode|strictARP|scheduler'
scheduler: ""
strictARP: true
mode: ipvs

IPVS mode creates a kube-ipvs0 dummy interface on each node. When a ClusterIP service is created, you can observe its IP being assigned to this kube-ipvs0 interface:

sigridjineth@sigridjineth-Z590-VISION-G:~$ for i in control-plane worker worker2 worker3; do echo ">> node myk8s-$i <<"; docker exec -it myk8s-$i ip -br -c addr show kube-ipvs0; echo; done
>> node myk8s-control-plane <<
kube-ipvs0 DOWN 10.200.1.10/32 10.200.1.1/32 10.200.1.163/32

>> node myk8s-worker <<
kube-ipvs0 DOWN 10.200.1.1/32 10.200.1.10/32 10.200.1.163/32

>> node myk8s-worker2 <<
kube-ipvs0 DOWN 10.200.1.1/32 10.200.1.10/32 10.200.1.163/32

>> node myk8s-worker3 <<
kube-ipvs0 DOWN 10.200.1.1/32 10.200.1.10/32 10.200.1.163/32

Let’s examine the IPVS forwarding table and test the load balancing:

# Get service details
CIP=$(kubectl get svc svc-clusterip -o jsonpath="{.spec.clusterIP}")
CPORT=$(kubectl get svc svc-clusterip -o jsonpath="{.spec.ports[0].port}")

# echo $CIP $CPORT
# 10.200.1.163 9000

# View IPVS forwarding table
sigridjineth@sigridjineth-Z590-VISION-G:~$ docker exec -it myk8s-control-plane ipvsadm -Ln -t $CIP:$CPORT
Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port Forward Weight ActiveConn InActConn
TCP 10.200.1.163:9000 rr
-> 10.10.1.2:80 Masq 1 0 129
-> 10.10.2.2:80 Masq 1 0 129
-> 10.10.3.2:80 Masq 1 0 129

# Monitor IPVS statistics from the control-plane node
watch -d "docker exec -it myk8s-control-plane ipvsadm -Ln -t $CIP:$CPORT --stats; echo; docker exec -it myk8s-control-plane ipvsadm -Ln -t $CIP:$CPORT --rate"
Every 2.0s: docker exec -it myk8s-control-plane ipvsadm -Ln -t 10.200... sigridjineth-Z590-VISION-G: Sun Oct 6 00:25:43 2024

Prot LocalAddress:Port Conns InPkts OutPkts InBytes OutBytes
-> RemoteAddress:Port
TCP 10.200.1.163:9000 387 2322 1548 154800 203562
-> 10.10.1.2:80 129 774 516 51600 67854
-> 10.10.2.2:80 129 774 516 51600 67854
-> 10.10.3.2:80 129 774 516 51600 67854

Prot LocalAddress:Port CPS InPPS OutPPS InBPS OutBPS
-> RemoteAddress:Port
TCP 10.200.1.163:9000 1 6 4 374 491
-> 10.10.1.2:80 0 2 1 125 164
-> 10.10.2.2:80 0 2 1 124 164
-> 10.10.3.2:80 0 2 1 124 164

# Test service access
kubectl exec -it net-pod -- curl -s --connect-timeout 1 $CIP:9000

# Test load balancing (10, 100, 1000 requests)
kubectl exec -it net-pod -- zsh -c "for i in {1..1000}; do curl -s $CIP:9000 | grep Hostname; done | sort | uniq -c | sort -nr"

  • The load balancing tests show a consistent, even distribution of requests across all three pods.
  • Slight variations (333 vs 334) are normal in load balancing scenarios.
  • The IPVS statistics confirm that the ClusterIP service (10.200.1.163:9000) is load balancing traffic to three backend pods.
  • Total connections and packets are evenly distributed among the pods.
  • Real-time statistics show active traffic with about 166 new connections per second.

We conducted three tests with increasing numbers of requests to observe the load balancing behavior.

  1. Distribution evenness:
  • With just 10 requests, there’s a slight imbalance (4–3–3).
  • At 100 requests, the distribution becomes more even (34–33–33).
  • At 1000 requests, we see an almost perfect distribution (334–333–333).

  2. As the number of requests increases, the load distribution becomes increasingly balanced, demonstrating IPVS’s effectiveness at scale.

  3. The near-perfect distribution at 1000 requests shows that IPVS maintains consistent load balancing over a large number of connections.
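
The 4–3–3 split at 10 requests is exactly what a round-robin scheduler predicts. A quick self-contained sketch (pod names here are illustrative) reproduces it by dealing 10 "requests" across 3 backends strictly in turn, the way IPVS rr does:

```shell
# Deal 10 requests over 3 backends in rotation and count per backend.
for i in $(seq 1 10); do
  echo "webpod$(( (i - 1) % 3 + 1 ))"
done | sort | uniq -c | sort -nr
# -> 4 webpod1, 3 webpod2, 3 webpod3 (tie order may vary)
```

With 1000 requests the same loop yields 334–333–333, matching the cluster test above.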

The IPVS load balancing in this Kubernetes cluster is functioning effectively:

  1. It’s distributing incoming connections evenly across all backend pods.
  2. Real-time statistics show balanced traffic distribution.
  3. Multiple large-scale tests confirm consistent, fair load distribution among pods.

In summary, these tests demonstrate that IPVS Proxy mode in Kubernetes offers highly consistent and equitable load balancing across pods, especially under higher traffic volumes. This even distribution, combined with improved performance characteristics, makes IPVS mode a compelling choice for Kubernetes clusters, particularly those expecting to handle significant traffic loads.
