Upgrading Kubernetes often seems straightforward: bump the control plane, upgrade the node groups, and you're done. But what seemed like a simple version upgrade turned into a crash course on Kubernetes networking and the fragility of EKS clusters. This is the story of what happened when we upgraded our EKS cluster from 1.32 to 1.33, and how we recovered.
The Setup: Everything Was Fine... Until It Wasn't
Before the upgrade, the cluster was stable. Microservices were running across multiple namespaces, the Application Load Balancer (ALB) routed traffic perfectly, and health checks were all green. Running a simple kubectl get nodes -o wide showed all nodes ready, IPs aligned, and everything healthy:
kubectl get nodes -o wide
# NAME STATUS ROLES VERSION
# ip-10-0-7-74.us-east-1.compute.internal Ready <none> v1.32.7-eks-1-32-202
# ip-10-0-44-7.us-east-1.compute.internal Ready <none> v1.32.7-eks-1-32-202
Monitoring dashboards were silent. Logs were quiet. Traffic was flowing normally. Life was good.
The Trigger: Upgrading to EKS 1.33
AWS announced EKS 1.32 deprecation, so the plan was simple: upgrade the cluster to 1.33.
aws eks update-cluster-version \
--name my-eks-cluster \
--kubernetes-version 1.33
The control plane update went fine. Nodes showed the new version. Everything seemed fine... until alarms fired. Suddenly, services started returning 503 errors, and ALB target groups reported that all pods were unhealthy. It was clear this upgrade was not going to be a "quiet version bump."
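For completeness, the node groups were rolled to 1.33 as well. With managed node groups that is a separate call; a minimal sketch, assuming a managed node group named app-nodes (the real name isn't shown here):
# Roll the managed node group to the same Kubernetes version as the control plane
aws eks update-nodegroup-version \
--cluster-name my-eks-cluster \
--nodegroup-name app-nodes \
--kubernetes-version 1.33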
Initial Investigation: Pod IPs Don't Match Node IPs
The first clue came when we checked pod IPs:
kubectl get pods -o wide -A
# NAME READY STATUS IP NODE
# service-a-abc123 0/1 ContainerCreating 10.0.6.132 ip-10-0-7-74
# service-b-def456 0/1 ContainerCreating 10.0.42.35 ip-10-0-44-7
And the nodes:
kubectl get nodes -o wide
# NAME STATUS ROLES VERSION
# ip-10-0-7-74 NotReady <none> v1.33.0-eks-1-33
# ip-10-0-44-7 NotReady <none> v1.33.0-eks-1-33
Notice something? Pods were getting IPs from completely different subnets than their nodes, and the nodes themselves had gone NotReady. Pod networking is supposed to be invisible to workloads; when pod and node subnets aren't wired together correctly, routing breaks and services go down.
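To confirm where those pod IPs were actually coming from, the subnet layout can be pulled and compared against the addresses above. A quick sketch, with vpc-0123456789abcdef0 standing in for the cluster's real VPC ID:
# List each subnet's CIDR and AZ, then match pod and node IPs against them
aws ec2 describe-subnets \
--filters "Name=vpc-id,Values=vpc-0123456789abcdef0" \
--query "Subnets[].{Subnet:SubnetId,CIDR:CidrBlock,AZ:AvailabilityZone}" \
--output table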
Understanding Custom Networking and ENIConfig
It turned out we had VPC CNI custom networking enabled (the AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG flag on the aws-node daemonset). The cluster's VPC configuration showed which subnets were in play:
aws eks describe-cluster --name my-eks-cluster \
--query "cluster.resourcesVpcConfig"
# {
# "subnetIds": ["subnet-aaa","subnet-bbb","subnet-ccc"],
# "endpointPublicAccess": true,
# ...
# }
Custom networking lets pods pull IPs from different subnets than their nodes, which helps when the node subnets run out of addresses. But it also means every availability zone that runs nodes must have an ENIConfig pointing at a valid pod subnet:
kubectl get eniConfig -n kube-system
# NAME SUBNET SECURITYGROUPS
# us-east-1c subnet-ccc sg-xxxxxx
The problem was obvious: an ENIConfig existed only for us-east-1c, but our nodes ran in us-east-1a and us-east-1b. Pods on those nodes could not get IPs, the CNI plugin crashed, and the outage cascaded across the cluster.
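We didn't end up fixing it by hand (more on that below), but the configuration-level fix is one ENIConfig per availability zone that runs nodes, named after the AZ to match the existing us-east-1c one and pointing at a pod subnet in that AZ. A sketch of what was missing; the subnet and security group IDs are placeholders:
cat <<'EOF' | kubectl apply -f -
apiVersion: crd.k8s.amazonaws.com/v1alpha1
kind: ENIConfig
metadata:
  name: us-east-1a
spec:
  subnet: subnet-aaa-pods     # pod subnet in us-east-1a (placeholder)
  securityGroups:
    - sg-xxxxxx               # pod security group (placeholder)
---
apiVersion: crd.k8s.amazonaws.com/v1alpha1
kind: ENIConfig
metadata:
  name: us-east-1b
spec:
  subnet: subnet-bbb-pods     # pod subnet in us-east-1b (placeholder)
  securityGroups:
    - sg-xxxxxx
EOF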
The Cascade of Failures
We attempted to fix things manually. First, we tried disabling custom networking:
kubectl set env daemonset aws-node -n kube-system AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG=false
But the CNI daemon still failed to start:
kubectl logs -n kube-system aws-node-xxxxx
# dial tcp 127.0.0.1:50051: connect: connection refused
Port 50051 is the local IPAM daemon's gRPC endpoint, so the CNI binary couldn't reach ipamd because ipamd itself wasn't coming up. Nodes went NotReady, pods got stuck in ContainerCreating, and traffic stopped flowing. We also tried rolling back the CNI plugin to different versions:
kubectl set image daemonset aws-node -n kube-system \
aws-node=602401143452.dkr.ecr.us-east-1.amazonaws.com/amazon-k8s-cni:v1.18.5
Result: crash. Version v1.16.4? ImagePullBackOff. Version v1.15.5? Same crash. The cluster was effectively offline.
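In hindsight, the fastest way to understand why aws-node keeps dying is the IPAM daemon's own log on an affected node. A sketch, assuming SSM access and a placeholder instance ID:
# Open a shell on the node
aws ssm start-session --target i-0123456789abcdef0
# On the node: the VPC CNI writes its logs under /var/log/aws-routed-eni/
sudo tail -n 100 /var/log/aws-routed-eni/ipamd.log
sudo tail -n 100 /var/log/aws-routed-eni/plugin.log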
The Solution: Using the EKS-Managed Addon
What finally worked was letting AWS manage the CNI:
aws eks update-addon \
--cluster-name my-eks-cluster \
--addon-name vpc-cni \
--addon-version v1.18.3-eksbuild.3
The managed addon automatically ensured version compatibility with the cluster and node AMI. It applied the correct configuration for all subnets and prevented the misconfigurations that manual updates caused.
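Rather than guessing at image tags, the compatible versions can be listed from EKS directly before updating; a minimal sketch:
# List vpc-cni addon versions EKS reports as compatible with Kubernetes 1.33
aws eks describe-addon-versions \
--addon-name vpc-cni \
--kubernetes-version 1.33 \
--query "addons[].addonVersions[].addonVersion"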
After the update:
kubectl get pods -n kube-system | grep aws-node
# aws-node-abc123 1/1 Running
# aws-node-def456 1/1 Running
Nodes returned to Ready, and ALB target groups passed health checks. Traffic flowed normally again.
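The addon's health can also be confirmed from the EKS side; a small sketch:
# Confirm the managed addon is ACTIVE and on the expected version
aws eks describe-addon \
--cluster-name my-eks-cluster \
--addon-name vpc-cni \
--query "addon.{Status:status,Version:addonVersion}"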
How We Approach Networking Debugging Now
After this incident, we adopted a more systematic approach. We always check pod IPs versus node IPs:
kubectl get pods -o wide -A
kubectl get nodes -o wide
We verify CNI pod status and logs:
kubectl get pods -n kube-system | grep aws-node
kubectl logs -n kube-system aws-node-<pod-name>
We inspect node conditions for NotReady states:
kubectl describe node <node-name>
We review ENIConfig and VPC CNI settings:
kubectl get eniConfig -n kube-system
kubectl describe daemonset aws-node -n kube-system
Finally, we test connectivity between pods:
kubectl exec -it <pod-name> -- ping <other-pod-ip>
Takeaway
Kubernetes networking is unforgiving. A single misconfigured component can ripple across an entire cluster and bring down services. Upgrades are never "just a version bump." Custom networking adds flexibility but also risk. EKS-managed addons are not optional - they're a lifeline when things go wrong.
The lesson: plan upgrades carefully, validate networking in advance, and trust managed components for critical parts of the cluster.
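To make "validate networking in advance" concrete, this is the kind of pre-upgrade check worth scripting (a sketch; pair it with the addon-version compatibility check shown earlier):
# Which availability zones actually run nodes?
kubectl get nodes -L topology.kubernetes.io/zone
# Is there an ENIConfig for each of those zones?
kubectl get eniconfig
# Is VPC CNI custom networking enabled?
kubectl describe daemonset aws-node -n kube-system | grep AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG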