Upgrading Kubernetes often seems straightforward: bump the control plane, upgrade the node groups, and you're done. But what seemed like a simple version upgrade turned into a crash course on Kubernetes networking and the fragility of EKS clusters. This is the story of what happened when we upgraded our EKS cluster from 1.32 to 1.33, and how we recovered.
The Setup: Everything Was Fine... Until It Wasn't
Before the upgrade, the cluster was stable. Microservices were running across multiple namespaces, the Application Load Balancer (ALB) routed traffic perfectly, and health checks were all green. Running a simple kubectl get nodes -o wide showed all nodes ready, IPs aligned, and everything healthy:
kubectl get nodes -o wide
# NAME STATUS ROLES VERSION
# ip-10-0-7-74.us-east-1.compute.internal Ready <none> v1.32.7-eks-1-32-202
# ip-10-0-44-7.us-east-1.compute.internal Ready <none> v1.32.7-eks-1-32-202
Monitoring dashboards were silent. Logs were quiet. Traffic was flowing normally. Life was good.
The Trigger: Upgrading to EKS 1.33
AWS announced EKS 1.32 deprecation, so the plan was simple: upgrade the cluster to 1.33.
aws eks update-cluster-version \
--name my-eks-cluster \
--kubernetes-version 1.33
The control plane update went fine. Nodes showed the new version. Everything seemed fine... until alarms fired. Suddenly, services started returning 503 errors, and ALB target groups reported that all pods were unhealthy. It was clear this upgrade was not going to be a "quiet version bump."
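For completeness, the node groups were rolled to 1.33 as well. With managed node groups that is a separate call; a minimal sketch, assuming a managed node group named app-nodes (the real name isn't shown here):
# Roll the managed node group to the same Kubernetes version as the control plane
aws eks update-nodegroup-version \
--cluster-name my-eks-cluster \
--nodegroup-name app-nodes \
--kubernetes-version 1.33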
Initial Investigation: Pod IPs Don't Match Node IPs
The first clue came when we checked pod IPs:
kubectl get pods -o wide -A
# NAME READY STATUS IP NODE
# service-a-abc123 0/1 ContainerCreating 10.0.6.132 ip-10-0-7-74
# service-b-def456 0/1 ContainerCreating 10.0.42.35 ip-10-0-44-7
And the nodes:
kubectl get nodes -o wide
# NAME STATUS ROLES VERSION
# ip-10-0-7-74 NotReady <none> v1.33.0-eks-1-33
# ip-10-0-44-7 NotReady <none> v1.33.0-eks-1-33
Notice something? Pods were getting IPs from completely different subnets than their nodes, and the nodes themselves had gone NotReady. Pod networking is supposed to be invisible to workloads; when pod and node subnets aren't wired together correctly, routing breaks and services go down.
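To confirm where those pod IPs were actually coming from, the subnet layout can be pulled and compared against the addresses above. A quick sketch, with vpc-0123456789abcdef0 standing in for the cluster's real VPC ID:
# List each subnet's CIDR and AZ, then match pod and node IPs against them
aws ec2 describe-subnets \
--filters "Name=vpc-id,Values=vpc-0123456789abcdef0" \
--query "Subnets[].{Subnet:SubnetId,CIDR:CidrBlock,AZ:AvailabilityZone}" \
--output table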
Understanding Custom Networking and ENIConfig
It turned out we had VPC CNI custom networking enabled (the AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG flag on the aws-node daemonset). The cluster's VPC configuration showed which subnets were in play:
aws eks describe-cluster --name my-eks-cluster \
--query "cluster.resourcesVpcConfig"
# {
# "subnetIds": ["subnet-aaa","subnet-bbb","subnet-ccc"],
# "endpointPublicAccess": true,
# ...
# }
Custom networking lets pods pull IPs from different subnets than their nodes, which helps when the node subnets run out of addresses. But it also means every availability zone that runs nodes must have an ENIConfig pointing at a valid pod subnet:
kubectl get eniConfig -n kube-system
# NAME SUBNET SECURITYGROUPS
# us-east-1c subnet-ccc sg-xxxxxx
The problem was obvious: an ENIConfig existed only for us-east-1c, but our nodes ran in us-east-1a and us-east-1b. Pods on those nodes could not get IPs, the CNI plugin crashed, and the outage cascaded across the cluster.
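We didn't end up fixing it by hand (more on that below), but the configuration-level fix is one ENIConfig per availability zone that runs nodes, named after the AZ to match the existing us-east-1c one and pointing at a pod subnet in that AZ. A sketch of what was missing; the subnet and security group IDs are placeholders:
cat <<'EOF' | kubectl apply -f -
apiVersion: crd.k8s.amazonaws.com/v1alpha1
kind: ENIConfig
metadata:
  name: us-east-1a
spec:
  subnet: subnet-aaa-pods     # pod subnet in us-east-1a (placeholder)
  securityGroups:
    - sg-xxxxxx               # pod security group (placeholder)
---
apiVersion: crd.k8s.amazonaws.com/v1alpha1
kind: ENIConfig
metadata:
  name: us-east-1b
spec:
  subnet: subnet-bbb-pods     # pod subnet in us-east-1b (placeholder)
  securityGroups:
    - sg-xxxxxx
EOF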
The Cascade of Failures
We attempted to fix things manually. First, we tried disabling custom networking:
kubectl set env daemonset aws-node -n kube-system AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG=false
But the CNI daemon still failed to start:
kubectl logs -n kube-system aws-node-xxxxx
# dial tcp 127.0.0.1:50051: connect: connection refused
Port 50051 is the local IPAM daemon's gRPC endpoint, so the CNI binary couldn't reach ipamd because ipamd itself wasn't coming up. Nodes went NotReady, pods got stuck in ContainerCreating, and traffic stopped flowing. We also tried rolling back the CNI plugin to different versions:
kubectl set image daemonset aws-node -n kube-system \
aws-node=602401143452.dkr.ecr.us-east-1.amazonaws.com/amazon-k8s-cni:v1.18.5
Result: crash. Version v1.16.4? ImagePullBackOff. Version v1.15.5? Same crash. The cluster was effectively offline.
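In hindsight, the fastest way to understand why aws-node keeps dying is the IPAM daemon's own log on an affected node. A sketch, assuming SSM access and a placeholder instance ID:
# Open a shell on the node
aws ssm start-session --target i-0123456789abcdef0
# On the node: the VPC CNI writes its logs under /var/log/aws-routed-eni/
sudo tail -n 100 /var/log/aws-routed-eni/ipamd.log
sudo tail -n 100 /var/log/aws-routed-eni/plugin.log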
The Solution: Using the EKS-Managed Addon
What finally worked was letting AWS manage the CNI:
aws eks update-addon \
--cluster-name my-eks-cluster \
--addon-name vpc-cni \
--addon-version v1.18.3-eksbuild.3
The managed addon automatically ensured version compatibility with the cluster and node AMI. It applied the correct configuration for all subnets and prevented the misconfigurations that manual updates caused.
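Rather than guessing at image tags, the compatible versions can be listed from EKS directly before updating; a minimal sketch:
# List vpc-cni addon versions EKS reports as compatible with Kubernetes 1.33
aws eks describe-addon-versions \
--addon-name vpc-cni \
--kubernetes-version 1.33 \
--query "addons[].addonVersions[].addonVersion"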
After the update:
kubectl get pods -n kube-system | grep aws-node
# aws-node-abc123 1/1 Running
# aws-node-def456 1/1 Running
Nodes returned to Ready, and ALB target groups passed health checks. Traffic flowed normally again.
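The addon's health can also be confirmed from the EKS side; a small sketch:
# Confirm the managed addon is ACTIVE and on the expected version
aws eks describe-addon \
--cluster-name my-eks-cluster \
--addon-name vpc-cni \
--query "addon.{Status:status,Version:addonVersion}"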
How We Approach Networking Debugging Now
After this incident, we adopted a more systematic approach. We always check pod IPs versus node IPs:
kubectl get pods -o wide -A
kubectl get nodes -o wide
We verify CNI pod status and logs:
kubectl get pods -n kube-system | grep aws-node
kubectl logs -n kube-system aws-node-<pod-name>
We inspect node conditions for NotReady states:
kubectl describe node <node-name>
We review ENIConfig and VPC CNI settings:
kubectl get eniConfig -n kube-system
kubectl describe daemonset aws-node -n kube-system
Finally, we test connectivity between pods:
kubectl exec -it <pod-name> -- ping <other-pod-ip>
Takeaway
Kubernetes networking is unforgiving. A single misconfigured component can ripple across an entire cluster and bring down services. Upgrades are never "just a version bump." Custom networking adds flexibility but also risk. EKS-managed addons are not optional - they're a lifeline when things go wrong.
The lesson: plan upgrades carefully, validate networking in advance, and trust managed components for critical parts of the cluster.
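To make "validate networking in advance" concrete, this is the kind of pre-upgrade check worth scripting (a sketch; pair it with the addon-version compatibility check shown earlier):
# Which availability zones actually run nodes?
kubectl get nodes -L topology.kubernetes.io/zone
# Is there an ENIConfig for each of those zones?
kubectl get eniconfig
# Is VPC CNI custom networking enabled?
kubectl describe daemonset aws-node -n kube-system | grep AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG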