Deploy intelligent agents that analyze cluster data and historical cases to deliver actionable recommendations. Reduce mean time to resolution by automating diagnostic workflows through natural language interactions in Slack.
Overview
This Guidance demonstrates how to simplify troubleshooting of Amazon EKS environments, where engineers must correlate many metrics and logs. It implements an agentic AI workflow that combines generative AI, knowledge bases enabled for Retrieval Augmented Generation (RAG), and chat interfaces to accelerate problem diagnosis. The implementation uses Terraform to deploy a complete EKS environment with managed node groups, an observability stack that includes Amazon Managed Service for Prometheus and Amazon Managed Grafana, and RBAC-based security mapped to IAM roles. A Slack-integrated AI agent system runs on the EKS cluster: an orchestrator agent receives troubleshooting requests and delegates tasks to specialized agents, which query Amazon S3 vector-based knowledge bases to find semantically similar historical troubleshooting cases. This approach can reduce mean time to triage for EKS issues while improving the accuracy of root cause analysis and the initiation of remediation across a range of infrastructure and application problems.
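To make the idea of semantic similarity matching against historical cases concrete, here is a minimal sketch. It is not the Guidance's implementation: the case texts, the `embed` helper, and `find_similar_cases` are hypothetical names, and simple word counts stand in for the embeddings that a real vector-based knowledge base would store.

```python
import math
from collections import Counter

# Hypothetical historical cases; in the Guidance these would live in the
# vector-based knowledge base rather than in an in-memory dictionary.
HISTORICAL_CASES = {
    "case-001": "pod stuck in CrashLoopBackOff after out of memory kill",
    "case-002": "node NotReady because disk pressure on worker node",
    "case-003": "service unreachable due to misconfigured security group",
}


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_similar_cases(query: str, top_k: int = 2) -> list[tuple[str, float]]:
    """Return the top_k case IDs ranked by similarity to the query."""
    query_vec = embed(query)
    scored = [
        (case_id, cosine_similarity(query_vec, embed(text)))
        for case_id, text in HISTORICAL_CASES.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    for case_id, score in find_similar_cases("pod keeps restarting with CrashLoopBackOff"):
        print(case_id, round(score, 3))
```

In a real deployment, the word-count vectors would be replaced by embeddings from a model, and the ranking would be performed by the vector store instead of a Python loop.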
Benefits
Accelerate Kubernetes troubleshooting with AI
Provision production-ready EKS clusters efficiently
Implement infrastructure as code with Terraform to deploy secure, multi-AZ Amazon EKS environments with essential add-ons and observability. Ensure consistent configurations across development, staging, and production environments.
Enable collaborative platform engineering practices
Empower DevOps teams, SREs, and developers to troubleshoot Kubernetes issues directly from Slack using ChatOps patterns. Leverage real-time cluster insights and semantic search of past incidents to resolve problems faster.
How it works
This architecture diagram shows how to provision an Amazon Elastic Kubernetes Service (Amazon EKS) cluster with a best-practice configuration and critical add-ons.
This architecture diagram shows the agentic AI troubleshooting workflow, which works with real-time EKS cluster data and integrates with AWS AI services for analysis and with Slack for ChatOps.
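The orchestrator pattern in this workflow can be sketched as follows. This is an illustrative outline only: the three specialist agents, the `ROUTES` keyword table, and `orchestrate` are hypothetical, and keyword matching stands in for the model-driven delegation an actual agentic system would use.

```python
from typing import Callable


# Hypothetical specialist agents. Each one only describes what it would do;
# real agents would call cluster APIs, log stores, or a knowledge base.
def metrics_agent(request: str) -> str:
    return f"[metrics] would inspect cluster metrics for: {request}"


def logs_agent(request: str) -> str:
    return f"[logs] would search pod logs for: {request}"


def knowledge_agent(request: str) -> str:
    return f"[knowledge] would look up similar past cases for: {request}"


# Assumed keyword-to-agent routing table (a stand-in for LLM-based routing).
ROUTES: dict[str, Callable[[str], str]] = {
    "cpu": metrics_agent,
    "memory": metrics_agent,
    "latency": metrics_agent,
    "error": logs_agent,
    "crash": logs_agent,
    "log": logs_agent,
}


def orchestrate(request: str) -> list[str]:
    """Delegate a troubleshooting request to the matching specialist agents."""
    words = request.lower().split()
    selected: list[Callable[[str], str]] = []
    for keyword, agent in ROUTES.items():
        if any(keyword in word for word in words) and agent not in selected:
            selected.append(agent)
    # Always consult historical cases last, whatever else was selected.
    selected.append(knowledge_agent)
    return [agent(request) for agent in selected]


if __name__ == "__main__":
    for finding in orchestrate("Pods crash with high memory usage"):
        print(finding)
```

Keeping routing in one function means specialists can be added or swapped without changing how requests arrive, for example from a Slack message handler.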
Deploy with confidence
Everything you need to launch this Guidance in your account is right here.
We'll walk you through it
Get started fast. Read the implementation guide for deployment steps, architecture details, cost information, and customization options.
Let's make it happen
Ready to deploy? Review the sample code on GitHub for detailed deployment instructions. You can deploy it as-is or customize it to fit your needs.
Related content
Architecting conversational observability for cloud applications
This blog post walks through building a generative AI–powered troubleshooting assistant for Kubernetes.
Streamline Amazon EKS operations with Agentic AI
This video explains how to build agentic AI systems that automate Amazon EKS cluster management, from real-time issue diagnosis to guided remediation.