Amazon EKS Hybrid Nodes gateway troubleshooting
This page provides guidance for diagnosing and resolving common issues with the Amazon EKS Hybrid Nodes gateway. Each section describes a symptom, possible causes, diagnostic steps, and resolutions. For operational details, see Amazon EKS Hybrid Nodes gateway operations.
Pods on hybrid nodes unreachable from VPC
Pods running on hybrid nodes are not reachable from resources in the VPC, such as EC2 instances, load balancers, or the Kubernetes control plane.
Possible causes:

- VPC route table entries are missing or point to the wrong ENI.
- The gateway leader pod is not running or has not completed setup.
- Cilium VTEP is not enabled or configured on the hybrid nodes.
- Source/destination check is enabled on the gateway EC2 instance.
Diagnostic steps:

1. Check VPC route table entries. Verify that routes for your hybrid pod CIDRs exist and point to the active gateway instance's primary ENI:

    ```
    aws ec2 describe-route-tables \
      --route-table-ids ROUTE_TABLE_ID \
      --query "RouteTables[].Routes[?DestinationCidrBlock=='POD_CIDR']"
    ```

    If routes are missing, check the gateway logs for route table errors. If routes point to the wrong ENI, a failover may not have completed successfully.

2. Check gateway pod status and leader election. Confirm that two gateway pods are running and one holds the leader lease:

    ```
    kubectl get pods -n eks-hybrid-nodes-gateway
    kubectl get lease -n eks-hybrid-nodes-gateway
    ```

    If no pod holds the lease, see Leader election issues.

3. Check Cilium VTEP configuration on hybrid nodes. Verify that the `CiliumVTEPConfig` resource exists and contains the leader's node IP:

    ```
    kubectl get ciliumvtepconfig hybrid-gateway -o yaml
    ```

    The `spec.endpoints[0].tunnelEndpoint` field should match the leader gateway node's IP address. If the resource is missing or has a stale IP, the gateway may not have completed leader setup.

4. Check source/destination check. Verify that source/destination check is disabled on the gateway EC2 instances:

    ```
    aws ec2 describe-instance-attribute \
      --instance-id GATEWAY_INSTANCE_ID \
      --attribute sourceDestCheck
    ```

    If `sourceDestCheck` is `true`, disable it. See Get started with EKS Hybrid Nodes gateway.
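As a sketch, the route-table check in step 1 can be automated by parsing the `describe-route-tables` JSON output. The sample data, route table ID, and ENI ID below are illustrative, not real resources:

```python
# Hypothetical sample of `aws ec2 describe-route-tables` output;
# real runs return this structure under the "RouteTables" key.
sample = {
    "RouteTables": [
        {
            "RouteTableId": "rtb-0example",
            "Routes": [
                {"DestinationCidrBlock": "10.80.0.0/16",
                 "NetworkInterfaceId": "eni-0aaa", "State": "active"},
                {"DestinationCidrBlock": "0.0.0.0/0",
                 "GatewayId": "igw-0bbb", "State": "active"},
            ],
        }
    ]
}

def find_pod_cidr_route(output, pod_cidr):
    """Return the route entry for pod_cidr, or None if it is missing."""
    for table in output["RouteTables"]:
        for route in table["Routes"]:
            if route.get("DestinationCidrBlock") == pod_cidr:
                return route
    return None

route = find_pod_cidr_route(sample, "10.80.0.0/16")
if route is None:
    print("Route missing: check the gateway logs for route table errors")
elif route.get("NetworkInterfaceId") != "eni-0aaa":
    print("Route points to an unexpected ENI: a failover may not have completed")
else:
    print("Route OK")
```

Replace the sample with the actual CLI output and the expected ENI ID of the active gateway instance.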
Webhook calls to hybrid nodes fail
The Kubernetes API server cannot reach webhook endpoints running on hybrid nodes. Webhook admission requests time out or return connection errors.
Possible causes:

- The gateway is not routing traffic from the control plane to hybrid pods.
- The `CiliumVTEPConfig` resource is missing or has a stale endpoint IP.
Diagnostic steps:

1. Verify the control plane can reach the gateway node IP. The control plane sends traffic to the VPC route table, which forwards it to the gateway's ENI. Confirm the VPC route table entries are correct using the steps in Pods on hybrid nodes unreachable from VPC.

2. Check the `CiliumVTEPConfig` resource. Verify the resource exists and the `tunnelEndpoint` matches the current leader's node IP:

    ```
    kubectl get ciliumvtepconfig hybrid-gateway -o yaml
    ```

    If the tunnel endpoint is stale (points to a previous leader), the gateway may not have completed the leader setup sequence. Check the gateway logs for errors during `CiliumVTEPConfig` upsert.
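The staleness check above can be sketched as a simple comparison between the `CiliumVTEPConfig` tunnel endpoint and the current leader's node IP. The field names follow the resource described in this guide; the sample objects are illustrative, not real cluster output:

```python
# Illustrative sample of a CiliumVTEPConfig spec and the leader's node IP.
vtep_config = {
    "spec": {"endpoints": [{"tunnelEndpoint": "10.0.1.25"}]}
}
leader_node_ip = "10.0.1.25"  # e.g. taken from `kubectl get node -o wide`

def vtep_is_current(config, node_ip):
    """True when the first tunnel endpoint matches the leader's node IP."""
    endpoints = config.get("spec", {}).get("endpoints", [])
    return bool(endpoints) and endpoints[0].get("tunnelEndpoint") == node_ip

if vtep_is_current(vtep_config, leader_node_ip):
    print("tunnelEndpoint matches the leader; VTEP config is current")
else:
    print("Stale or missing tunnelEndpoint: leader setup may not have completed")
```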
VPC route table updates fail
The gateway logs show errors related to VPC route table operations, and routes for hybrid pod CIDRs are not created or updated.
Possible causes:

- The gateway's IAM role does not have the required EC2 permissions.
- The route table IDs in the configuration are incorrect, or the route tables do not exist.
- The gateway cannot reach the EC2 API endpoint.
Diagnostic steps:

1. Verify IAM permissions. The gateway requires the following IAM actions:

    - `ec2:DescribeRouteTables`
    - `ec2:CreateRoute`
    - `ec2:ReplaceRoute`
    - `ec2:DescribeInstances`

    Check the IAM role attached to the gateway node's instance profile or pod identity configuration.

2. Check route table IDs in the configuration. Verify that the `ROUTE_TABLE_IDS` environment variable contains valid route table IDs in the gateway deployment:

    ```
    kubectl get deployment eks-hybrid-nodes-gateway -n eks-hybrid-nodes-gateway -o jsonpath='{.spec.template.spec.containers[0].env}' | jq .
    ```

    Confirm the route table IDs exist in your VPC:

    ```
    aws ec2 describe-route-tables --route-table-ids ROUTE_TABLE_ID
    ```

3. Check gateway logs for route table errors. Look for error messages related to route table operations:

    ```
    kubectl logs -n eks-hybrid-nodes-gateway LEADER_POD | grep -i "route table"
    ```

    Common error messages include:

    - `Failed to verify route table access` — The gateway cannot describe the route table. Check IAM permissions and route table IDs.
    - `Failed to update route tables` — The gateway cannot create or replace routes. Check IAM permissions.
    - `failed to access route table` — The route table ID may be incorrect, or the IAM role lacks `ec2:DescribeRouteTables`.
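The four actions listed in step 1 can be granted with a minimal IAM policy statement like the following sketch. The `Resource: "*"` scoping is for brevity only; tighten it to your route tables and instances for production:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeRouteTables",
        "ec2:CreateRoute",
        "ec2:ReplaceRoute",
        "ec2:DescribeInstances"
      ],
      "Resource": "*"
    }
  ]
}
```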
Gateway pods fail to start or are unhealthy
Gateway pods are in `CrashLoopBackOff`, `Error`, or `Pending` state, or the health endpoint returns an error.
Possible causes:

- Required environment variables (`VPC_CIDR`, `POD_CIDRS`, `ROUTE_TABLE_IDS`) are not set.
- IP forwarding is not enabled on the gateway node.
- Node label or anti-affinity constraints prevent scheduling.
Diagnostic steps:

1. Check pod logs. View the logs for the failing pod to identify the error:

    ```
    kubectl logs -n eks-hybrid-nodes-gateway LEADER_POD
    ```

2. Check required environment variables. The gateway requires `NODE_IP`, `VPC_CIDR`, and `POD_CIDRS`. If any are missing, the gateway exits immediately. Verify the pod spec:

    ```
    kubectl get pod -n eks-hybrid-nodes-gateway LEADER_POD -o jsonpath='{.spec.containers[0].env}' | jq .
    ```

    - `NODE_IP` is set automatically from `status.hostIP` in the pod spec. If it is empty, the pod may not be scheduled on a node yet.
    - `VPC_CIDR` and `POD_CIDRS` come from the Helm values. Verify they are set correctly.

3. Check IP forwarding. The gateway checks that IP forwarding is enabled at startup and exits if it is not. Look for the error message `IP forwarding is not enabled` in the pod logs. Enable IP forwarding on the node:

    ```
    # Check current setting
    cat /proc/sys/net/ipv4/ip_forward

    # Enable if not set
    sudo sysctl -w net.ipv4.ip_forward=1
    ```

    For a persistent setting, configure IP forwarding through the kubelet or add `net.ipv4.ip_forward=1` to a file in `/etc/sysctl.d/`.

4. Check node label and scheduling constraints. The gateway pods require nodes with the `hybrid-gateway-node=true` label. Pod anti-affinity ensures each pod runs on a separate node. If pods are `Pending`, check for scheduling issues:

    ```
    kubectl describe pod -n eks-hybrid-nodes-gateway LEADER_POD
    ```

    Look for events indicating insufficient nodes, missing labels, or anti-affinity conflicts.
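For the persistent IP forwarding setting mentioned above, a sysctl drop-in file is one option. The filename below is illustrative:

```
# /etc/sysctl.d/99-ip-forward.conf
net.ipv4.ip_forward = 1
```

Apply the setting without a reboot by running `sudo sysctl --system`.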
Leader election issues
The gateway pods are running but no pod acquires the leader lease, or leadership transitions happen frequently.
Possible causes:

- RBAC permissions for Lease objects are missing.
- Network connectivity between gateway pods and the Kubernetes API server is unreliable.
- Leader election parameters are misconfigured.
Diagnostic steps:

1. Check the Lease object. Verify the Lease exists and inspect its current holder:

    ```
    kubectl get lease -n eks-hybrid-nodes-gateway hybrid-gateway-leader -o yaml
    ```

    The `spec.holderIdentity` field shows the current leader. The `spec.renewTime` field shows when the lease was last renewed. If `renewTime` is stale, the leader may have lost connectivity to the API server.

2. Check RBAC permissions. The gateway service account needs permissions to get, create, and update Lease objects in the gateway namespace. Verify the Role and RoleBinding:

    ```
    kubectl get role -n eks-hybrid-nodes-gateway
    kubectl get rolebinding -n eks-hybrid-nodes-gateway
    ```

    The Role should include the `get`, `create`, and `update` verbs for the `leases` resource in the `coordination.k8s.io` API group.

3. Check pod logs for lease errors. Look for leader election errors in the pod logs:

    ```
    kubectl logs -n eks-hybrid-nodes-gateway LEADER_POD | grep -i "leader\|lease"
    ```

    Common issues include:

    - `Failed to acquire lease` — The pod cannot create or update the Lease object. Check RBAC permissions.
    - Frequent `Leadership ended` followed by `Leader setup complete` messages — The leader is losing and re-acquiring the lease. This may indicate network instability between the pod and the API server. Consider increasing `--leader-election-lease-duration`.

4. Check leader election parameters. Verify the configured values:

    ```
    kubectl get deployment eks-hybrid-nodes-gateway -n eks-hybrid-nodes-gateway -o jsonpath='{.spec.template.spec.containers[0].args}'
    ```

    Ensure `--leader-election-renew-deadline` is less than `--leader-election-lease-duration`. If the renew deadline exceeds the lease duration, the leader loses the lease before it can renew. For more information, see Leader election tuning.
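The timing invariant in step 4 can be checked mechanically against the args list returned by the kubectl command. This sketch assumes `--flag=<N>s`-style values; the example args are illustrative, not the gateway's defaults:

```python
# Illustrative container args, as returned by the jsonpath query above.
args = [
    "--leader-election-lease-duration=15s",
    "--leader-election-renew-deadline=10s",
]

def flag_seconds(arg_list, name):
    """Extract an integer seconds value from a `--name=<N>s` style flag."""
    for arg in arg_list:
        if arg.startswith(name + "="):
            return int(arg.split("=", 1)[1].rstrip("s"))
    return None

lease = flag_seconds(args, "--leader-election-lease-duration")
renew = flag_seconds(args, "--leader-election-renew-deadline")
if lease is not None and renew is not None and renew >= lease:
    print("Misconfigured: renew deadline must be less than lease duration")
else:
    print("Leader election timing looks consistent")
```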
Common error messages
The following table lists error conditions you may see in the gateway pod logs, with causes and resolutions. Entries in code font are verbatim log messages; the remaining entries describe the failure.

| Error message | Cause | Resolution |
|---|---|---|
| `IP forwarding is not enabled` | The kernel parameter `net.ipv4.ip_forward` is not enabled on the gateway node. | Enable IP forwarding through the kubelet configuration or by running `sudo sysctl -w net.ipv4.ip_forward=1` on the node. |
| VXLAN interface creation failure | The gateway cannot create the VXLAN network interface. This typically occurs when the pod lacks the required Linux capability (creating network interfaces requires `NET_ADMIN`). | Verify the Deployment spec grants the `NET_ADMIN` capability to the gateway container. |
| `Failed to verify route table access` | The gateway cannot describe one or more VPC route tables at startup. | Verify the IAM role has `ec2:DescribeRouteTables` and that the configured route table IDs are correct. |
| `Failed to update route tables` | The gateway cannot create or replace routes in the VPC route tables. | Verify the IAM role has `ec2:CreateRoute` and `ec2:ReplaceRoute`. |
| EC2 client initialization failure | The gateway cannot initialize the AWS EC2 client or retrieve the instance's primary ENI. | Verify the IAM role has `ec2:DescribeInstances` and that the gateway can reach the EC2 API endpoint. |
| Missing node IP | The `NODE_IP` environment variable is empty or unset. | Verify the pod spec sets `NODE_IP` from `status.hostIP`. |
| Invalid CIDR value | The value provided for `VPC_CIDR` or `POD_CIDRS` cannot be parsed as a valid CIDR. | Check the corresponding Helm value for typos or an invalid CIDR. |
| Missing required configuration | A required environment variable (`VPC_CIDR`, `POD_CIDRS`, or `ROUTE_TABLE_IDS`) is not set. | Set the missing value through the Helm values and redeploy the gateway. |
| Invalid route table configuration | The `ROUTE_TABLE_IDS` value is malformed or references route tables that do not exist. | Check the `ROUTE_TABLE_IDS` Helm value and confirm the IDs exist in your VPC. |
| Cannot determine AWS Region | The gateway cannot retrieve the AWS Region from EC2 instance metadata. | Verify the instance metadata service (IMDS) is accessible. Alternatively, set the Region explicitly in the gateway configuration. |
| Cannot determine instance ID | The gateway cannot retrieve the instance ID from EC2 instance metadata. | Verify the instance metadata service (IMDS) is accessible. Alternatively, set the instance ID explicitly in the gateway configuration. |
| Hybrid node has no node IP | A hybrid node's `CiliumNode` resource does not contain a node IP address. | Verify the hybrid node is registered correctly and the Cilium agent is running. Check the node's `CiliumNode` resource. |
| Hybrid node has no pod CIDR | A hybrid node's `CiliumNode` resource does not contain an allocated pod CIDR. | Verify Cilium IPAM is configured correctly on the hybrid node. Check the node's `CiliumNode` resource. |
| `CiliumVTEPConfig` upsert failure | The gateway cannot create or update the `CiliumVTEPConfig` resource. | Verify the CRD is installed in the cluster and the gateway service account has permissions to manage `CiliumVTEPConfig` resources. |
| Manager initialization failure | The controller-runtime manager failed to initialize. | Check the pod logs for additional context. Common causes include an invalid kubeconfig or inability to reach the Kubernetes API server. |
| Runnable registration failure | The leader-elected runnable could not be registered with the controller manager. | This is typically an internal error. Check the full pod logs for additional context and report the issue on the GitHub repository. |
| Reconciler registration failure | The CiliumNode reconciler could not be registered with the controller manager. | Check the pod logs for additional context. Verify that the CiliumNode CRD is installed in the cluster. |
| Controller manager exited | The controller manager exited unexpectedly. | Check the pod logs for the underlying error. Common causes include loss of connectivity to the Kubernetes API server or a port conflict on the metrics or health probe bind addresses. |
| `failed to access route table` | The gateway cannot describe a specific VPC route table during the startup verification check. | Verify the IAM role has `ec2:DescribeRouteTables` and that the route table ID is correct. |
Related topics
- Amazon EKS Hybrid Nodes gateway — Overview of the gateway architecture and use cases.
- Get started with EKS Hybrid Nodes gateway — Prerequisites and installation instructions.
- Amazon EKS Hybrid Nodes gateway configuration reference — Complete reference for Helm values, CLI flags, and environment variables.
- Amazon EKS Hybrid Nodes gateway operations — Monitoring, failover behavior, and scaling guidance.