

# Generate incident reports
<a name="Investigations-Incident-Reports"></a>

Incident reports help you quickly and easily write a report about your incident investigation. You can use this report to provide details to management or to help your team learn from the incident and take action to prevent similar occurrences. The structure of the report is based on industry standards for these types of reports, and the report can be copied into other repositories for long-term retention.

When you use the AWS Management Console to create an *investigation group* resource in CloudWatch investigations, an IAM role is created for the group to give it access to resources during the investigation. Generating CloudWatch investigations incident reports requires additional permissions be granted to your investigation group. The new managed policy `AIOpsAssistantIncidentReportPolicy` provides the required permissions and is automatically added to investigation groups created using the AWS Management Console after October 10, 2025. For more information, see [AIOpsAssistantIncidentReportPolicy](managed-policies-cloudwatch.md#managed-policies-QInvestigations-AIOpsAssistantIncidentReportPolicy).

**Note**  
If you are using the AWS CDK or an AWS SDK, you must explicitly add the investigation group role and specify the role policy or equivalent inline permissions on the role. For more details about permissions, see [Security in CloudWatch investigations](Investigations-Security.md).

These reports capture investigation findings, root causes, timeline events, and recommended corrective actions in a structured format that can be easily shared with stakeholders and used for organizational learning.

Incident report generation is included at no additional charge for all CloudWatch investigations users and integrates seamlessly with your investigation workflow.

**How incident reports work**

1. Run an investigation on your incident.

1. Accept at least one hypothesis. Each hypothesis you accept is considered for the report. The hypothesis doesn't need to be 100% accurate.

1. Choose **Incident report**. During the investigation, the AI parses the data collected for your investigation and derives facts. Facts are atomic pieces of information about your incident that form the basis for generating the report. Fact extraction can take a few minutes.

1. When fact extraction is finished, you can review the facts available in the following areas:

   1. **Incident Overview** – High-level overview of the incident including its severity, duration, and operational hypothesis.

   1. **Impact Assessment** – Metrics and analysis related to the impact of the incident on customers, service function, and business operations.

   1. **Detection and Response** – Metrics and analysis related to how and when the incident was detected and how you responded to the incident.

   1. **Root Cause Analysis** – Detailed analysis of underlying causes based on investigation hypotheses.

   1. **Mitigation and Resolution** – Metrics and analysis related to mitigation steps and resolution measures, along with the time measurement for incident mitigation and resolution.

   1. **Learning and Next Steps** – A list of recommended actions for your team to consider, automatically generated from the investigation findings. These recommendations may include preventive measures against similar incidents, as well as suggested improvements to your monitoring and response processes.

1. After reviewing the facts, choose **Generate report** to create a comprehensive analysis of the incident. While the selected facts serve as key reference points, the report draws from all available information gathered during the investigation. This process can take a few minutes.

1. After generating the report, you can do any of the following:
   + Use the report as is:
     + Copy it to edit in your external editor if needed
     + Save it for later reference
   + Enhance the report by adding more data:
     + Choose **Add facts** (recommended method) to input additional text-based content such as incident tickets or custom narratives. The AI will analyze this content to augment existing facts or infer new ones.
     + Edit facts directly (use sparingly) – Manually edited facts may create inconsistencies with the investigation timeline. Use this only as a last resort when **Add facts** doesn't achieve the desired result.
   + Choose **Regenerate report** to produce a new report using the updated information.

**Topics**
+ [Understanding AI-derived facts in incident reports](Investigations-IncidentReports-ai-facts.md)
+ [Incident report terminology](Investigations-IncidentReports-terms.md)
+ [Generate a report from an investigation](Investigations-IncidentReports-Generate.md)
+ [Using 5 Whys analysis in incident reports](incident-report-5whys.md)

# Understanding AI-derived facts in incident reports
<a name="Investigations-IncidentReports-ai-facts"></a>

AI-derived facts form the foundation of CloudWatch investigations incident reports, representing information that the AI system considers objectively true or highly probable based on comprehensive analysis of your AWS environment. These facts emerge through a sophisticated process that combines machine learning pattern recognition with systematic verification methods, creating a robust framework for incident analysis that maintains the operational rigor required for production environments.

Understanding how AI-derived facts are developed helps you evaluate their reliability and make informed decisions during incident response. The process represents a hybrid approach where artificial intelligence augments human expertise rather than replacing it, ensuring that the insights generated are both comprehensive and trustworthy.

## The development process of AI-derived facts
<a name="Investigations-ai-facts-development"></a>

The journey from raw telemetry data to actionable AI-derived facts begins with pattern observation, where the CloudWatch investigations AI analyzes vast amounts of AWS telemetry using sophisticated machine learning algorithms. The AI examines your CloudWatch metrics, logs, and traces across multiple dimensions simultaneously, identifying recurring patterns and relationships that might not be immediately apparent to human operators. The analysis encompasses temporal patterns that reveal when incidents typically occur and their duration characteristics, service correlations that show how different AWS services interact during failure scenarios, metric anomalies that precede or accompany incidents, and log event sequences that indicate specific failure modes.

Consider, for example, how the AI might observe that in your environment, Amazon EC2 instance CPU utilization consistently spikes to above 90% approximately 15 minutes before application response times exceed acceptable thresholds. This temporal relationship, when observed across multiple incidents, becomes a significant pattern worthy of further investigation. The AI doesn't simply note the correlation; it measures the statistical significance of the relationship and considers various confounding factors that might influence the pattern.
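
This kind of lagged relationship can be illustrated with a simple computation. The sketch below is purely illustrative (synthetic data; this is not the algorithm CloudWatch investigations actually uses): it scans candidate lags and reports the one at which a leading metric best correlates with a trailing one.

```python
import numpy as np

def best_lag_correlation(leading, trailing, max_lag):
    """Return (lag, r): the shift of `leading` that best correlates with `trailing`."""
    best = (0, 0.0)
    for lag in range(1, max_lag + 1):
        r = np.corrcoef(leading[:-lag], trailing[lag:])[0, 1]
        if abs(r) > abs(best[1]):
            best = (lag, r)
    return best

# Synthetic 1-minute samples: a CPU spike leads a latency spike by ~15 minutes.
rng = np.random.default_rng(0)
cpu = rng.normal(50, 5, 240)
cpu[60:80] += 45                    # CPU spikes above 90% at minute 60
latency = rng.normal(200, 10, 240)
latency[75:95] += 4800              # response times degrade ~15 minutes later

lag, r = best_lag_correlation(cpu, latency, max_lag=30)
print(f"strongest correlation at lag {lag} min (r={r:.2f})")
```

A real analysis would also test statistical significance and control for confounding factors, as described above; this sketch only recovers the lag.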

From these observed patterns, the AI moves into hypothesis generation, formulating potential explanations for the relationships it has discovered. This process involves creating multiple competing hypotheses and ranking them by probability based on the strength of supporting evidence. When the AI observes that CPU spikes precede response time degradation, it might generate several hypotheses: resource exhaustion due to insufficient compute capacity, memory leaks causing increased CPU overhead, or inefficient algorithms triggered by specific input patterns. Each hypothesis receives a preliminary confidence level based on how well it explains the observed data and aligns with known AWS service behaviors.

Human verification and validation of these hypotheses ensures that AI-generated insights meet operational standards before becoming facts in your incident reports. This process involves correlating AI-derived patterns with established AWS service behavior models, checking consistency with industry best practices for incident response, and validating against historical incident data from similar environments. The AI must demonstrate that its findings are reproducible across different analysis methods and time periods, meet statistical significance requirements for operational decision-making, align with empirical observations of AWS service behavior, and provide actionable insights for incident resolution or prevention.

Throughout this process, the AI faces several inherent challenges that you should understand when interpreting AI-derived facts. The distinction between correlation and causation remains a fundamental challenge; while the AI might identify strong correlations between network traffic spikes and incident occurrence, establishing direct causation requires additional investigation and domain expertise. Hidden variables that exist outside the scope of AWS telemetry, such as third-party service dependencies or external network provider issues, may influence incidents without being captured in the AI analysis. The quality of AI-derived facts depends entirely on the completeness and accuracy of the underlying CloudWatch data, making comprehensive monitoring coverage essential for reliable insights.

Novel incident patterns present another challenge: because they are absent from the AI's training data, the system can struggle to interpret unfamiliar failure modes. This limitation underscores the importance of human expertise in interpreting AI-derived facts and supplementing them with domain knowledge and contextual understanding.

## Applying AI-derived facts in incident response
<a name="Investigations-ai-facts-practical-application"></a>

AI excels at identifying patterns across large datasets that would be impractical for humans to analyze manually, providing insights that can significantly accelerate incident diagnosis and resolution. AI works best when combined with human expertise that can provide context, validate conclusions, and identify factors that may not be captured in telemetry data.

The most effective approach involves treating AI-derived facts as highly informed starting points for investigation rather than definitive conclusions. When the AI identifies a fact such as "Database connection pool exhaustion preceded the incident by 8 minutes," this provides a valuable lead that can be quickly verified through targeted analysis of database metrics and application logs. The fact gives you a specific timeframe and potential root cause to investigate, dramatically reducing the time needed to identify the issue compared to manually searching through all available telemetry.

Data quality plays a crucial role in the reliability of AI-derived facts. Comprehensive CloudWatch monitoring coverage gives the AI access to complete and accurate information for analysis. Gaps in monitoring can lead to incomplete or misleading facts, because the AI can only work with the data available to it. Organizations that follow thorough observability practices, including detailed metrics collection, comprehensive logging, and distributed tracing, are more likely to have accurate and actionable AI-derived facts in their incident reports.

# Incident report terminology
<a name="Investigations-IncidentReports-terms"></a>

The following terms are used in CloudWatch investigations incident reports:

AI-derived fact  
A piece of information or observation that the AI system considers to be objectively true or highly probable based on the available data, telemetry, logs, and historical patterns within AWS services. These facts are derived through algorithmic analysis and machine learning models, and while they are treated as reliable by the system, they should be subject to human verification, especially in critical decision-making contexts. AI-derived facts may include correlations between events, anomaly detections, or inferences about system behavior that might not be immediately apparent to human operators.

Corrective actions  
Specific, actionable steps recommended by CloudWatch investigations to address the root cause of an incident and prevent its recurrence, based on AWS best practices and the specific context of the affected resources.

Fact categories  
Structured groupings of incident-related information, such as impact metrics, detection details, and mitigation steps, used to organize data for report generation.

Impact assessment  
A quantitative and qualitative evaluation of an incident's effects on system performance, user experience, and business operations, derived from CloudWatch metrics and other AWS service data added to the investigation.

Incident report generation  
An automated process that creates comprehensive documentation of an operational incident, including its timeline, impact, root cause, and resolution steps, based on data collected during an investigation in CloudWatch investigations.

Investigation Feed  
A chronological display of accepted observations, hypotheses, and user-added notes within an investigation, serving as the primary record of the investigation's progress and findings.

Lessons learned  
Automatically generated insights and improvement opportunities identified through the incident investigation process, aimed at enhancing system reliability, operational efficiency, and incident response capabilities across the organization.

Report assessment  
An automated evaluation of the generated incident report, identifying potential data gaps or areas requiring additional information to improve report completeness and quality.

Root cause analysis  
A systematic process of identifying the fundamental reason for an operational issue, leveraging CloudWatch investigations AI-driven hypotheses and correlations across multiple AWS services.

Suggestions tab  
A feature in CloudWatch investigations that presents AI-generated observations and hypotheses about potential causes or related issues, based on analysis of system telemetry and logs.

Timeline events  
A chronological sequence of significant occurrences during an incident, automatically extracted from CloudWatch logs, metrics, and other AWS service data to provide a clear overview of incident progression.

# Generate a report from an investigation
<a name="Investigations-IncidentReports-Generate"></a>

You can generate incident reports from in-progress or completed investigations. Incident reports generated early in an investigation may not include key facts such as root causes and recommended actions. While the investigation is active, you can edit the available facts to supplement the investigation with additional information. After the investigation has ended, you can't edit or add facts to the investigation.

**Prerequisites**

Before generating an incident report, confirm that the following requirements are met:
+ Ensure the investigation group uses the required KMS key and has appropriate IAM policies attached to its role for decrypting data from AWS services. If your AWS resources are encrypted with customer-managed KMS keys, you must add IAM policy statements to the investigation group role to grant CloudWatch Investigations the permissions needed to decrypt and access this data.
+ The investigation group role has been granted the following permissions:
  + `aiops:GetInvestigation`
  + `aiops:ListInvestigationEvents`
  + `aiops:GetInvestigationEvent`
  + `aiops:PutFact`
  + `aiops:UpdateReport`
  + `aiops:CreateReport`
  + `aiops:GetReport`
  + `aiops:ListFacts`
  + `aiops:GetFact`
  + `aiops:GetFactVersions`

**Note**  
You can add these permissions as an inline policy on the investigation group role, or attach an additional permissions policy to the investigation group role. For more information, see [Permissions for incident report generation](Investigations-Security.md#Investigations-Security-IAM-IRG).  
The new managed policy `AIOpsAssistantIncidentReportPolicy` provides the required permissions and is automatically added to investigation groups created after October 10, 2025. For more information, see [AIOpsAssistantIncidentReportPolicy](managed-policies-cloudwatch.md#managed-policies-QInvestigations-AIOpsAssistantIncidentReportPolicy).
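
For example, the permissions listed above can be granted with an inline policy statement such as the following. The `Sid` value is arbitrary; where possible, scope `Resource` down from `*` to your investigation group ARN.

```
{
    "Sid": "IncidentReportPermissions",
    "Effect": "Allow",
    "Action": [
        "aiops:GetInvestigation",
        "aiops:ListInvestigationEvents",
        "aiops:GetInvestigationEvent",
        "aiops:PutFact",
        "aiops:UpdateReport",
        "aiops:CreateReport",
        "aiops:GetReport",
        "aiops:ListFacts",
        "aiops:GetFact",
        "aiops:GetFactVersions"
    ],
    "Resource": "*"
}
```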

**To generate an incident report**

1. Open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/).

1. In the left navigation pane, choose **AI Operations**, **Investigations**.

1. Choose the name of an investigation.

1. On the investigation page, under **Feed**, accept any additional relevant hypotheses and add any notes to the investigation.
**Note**  
Report generation requires an investigation with at least one accepted hypothesis.

1. At the top of the investigation page, choose **Incident report**. Wait while the relevant facts of the investigation are collected and synced.

1. On the **Incident Report** page, review the facts used to generate the report. The facts are available in the right pane. Navigate through the fact category tabs using the left and right arrows, or expand the pane to see all of the categories.

   1. Choose **Edit** on a fact panel to manually add or edit the data in that category.

   1. Choose **View details** on a fact panel to see the supporting evidence and fact history gathered by the AI assistant. You can also choose **Edit** within the fact detail window.

   1. Choose **Add facts** if you want to provide additional context to the investigation, such as external events or extenuating circumstances.

1. Choose **Generate report**.

   CloudWatch investigations will analyze the investigation data and generate a structured report. This process might take some time.

1. Review the generated report in the preview pane. The report will include:
   + Automatically extracted timeline events
   + Root cause analysis based on accepted hypotheses
   + Impact assessment derived from investigation telemetry
   + Recommended corrective actions and lessons learned following AWS best practices

1. To retain a copy of the report in a different location, copy the text of the report and paste it into your desired destination.

1. Choose **Report assessment** to review a list of data gaps in the report. You can use this information to gather additional data for the report and then update the facts accordingly and regenerate the report.

# Using 5 Whys analysis in incident reports
<a name="incident-report-5whys"></a>

When generating incident reports, CloudWatch investigations can perform a 5 Whys root cause analysis to systematically identify the underlying causes of operational issues. This structured approach enhances your incident reports with deeper insights and actionable remediation steps.

This feature uses Amazon Q to provide a conversational chat. The user signed into the AWS Management Console must have the following permissions:

```
{
    "Sid": "AmazonQAccess",
    "Effect": "Allow",
    "Action": [
        "q:StartConversation",
        "q:SendMessage",
        "q:GetConversation",
        "q:ListConversations",
        "q:UpdateConversation",
        "q:DeleteConversation",
        "q:PassRequest"
    ],
    "Resource": "*"
}
```

You can add these permissions directly, or attach either the [AIOpsConsoleAdminPolicy](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AIOpsConsoleAdminPolicy.html) or [AIOpsOperatorAccess](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AIOpsOperatorAccess.html) managed policy to the user or role.

## What is 5 Whys analysis?
<a name="5whys-overview"></a>

The 5 Whys is a root cause analysis technique that asks "why" repeatedly to drill down from incident symptoms to fundamental causes. Each answer becomes the basis for the next question, creating a logical chain that reveals the true root cause rather than just surface-level symptoms.

During incident report generation, CloudWatch investigations uses this method to analyze investigation findings and provide structured root cause analysis that goes beyond immediate technical failures to identify process, configuration, or systemic issues.
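
Conceptually, the chain is an ordered list in which each answer becomes the subject of the next question. The following sketch (the helper names are hypothetical, not part of any CloudWatch API) shows that structure, using a condensed version of the database scenario later in this topic:

```python
from dataclasses import dataclass

@dataclass
class WhyStep:
    question: str
    answer: str

def build_chain(symptom, answers):
    """Build a 5 Whys chain: each answer seeds the next question."""
    chain, subject = [], symptom
    for answer in answers:
        chain.append(WhyStep(f"Why {subject}?", answer))
        subject = answer  # the answer becomes the next question's subject
    return chain

chain = build_chain(
    "are users getting 500 errors",
    ["the app cannot reach the database",
     "the database ran out of connections",
     "a batch job leaked connections",
     "its error handling skips cleanup",
     "code review does not check resource management"])
root_cause = chain[-1].answer  # the final answer is the root cause
```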

## Benefits for incident reporting
<a name="why-5whys-incidents"></a>

Including 5 Whys analysis in incident reports provides several advantages:
+ **Comprehensive root cause identification** - Moves beyond immediate technical causes to identify underlying process or system issues
+ **Actionable remediation plans** - Provides specific, targeted actions to prevent recurrence rather than temporary fixes
+ **Organizational learning** - Documents the complete causal chain for future reference and team knowledge sharing
+ **Structured analysis** - Ensures systematic investigation rather than ad-hoc problem solving

## Example scenarios in incident reports
<a name="5whys-incident-examples"></a>

### Database connection failure incident
<a name="example-database-outage"></a>

**Initial incident:** E-commerce application experiencing widespread 500 errors

1. **Why 1:** Why are users getting 500 errors? The application cannot connect to the primary database.

1. **Why 2:** Why can't the application connect to the database? The database instance ran out of available connections.

1. **Why 3:** Why did the database run out of connections? A batch processing job opened many connections without properly closing them.

1. **Why 4:** Why didn't the batch job close connections properly? The job's error handling doesn't include connection cleanup in failure scenarios.

1. **Why 5:** Why wasn't proper error handling implemented? Code review process doesn't include specific checks for resource management patterns.

**Root cause:** Inadequate code review standards for resource management

**Recommended actions:** Update code review checklist, implement connection pooling monitoring, add automated resource leak detection
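
The failure mode identified in Why 4 (connections left open when the job fails) is typically prevented by guaranteeing cleanup on every code path. A minimal, database-agnostic sketch, using Python's built-in `sqlite3` as a stand-in for the production database:

```python
import sqlite3

def run_batch_step(db_path, rows):
    """Insert rows, closing the connection on success *and* failure paths."""
    conn = sqlite3.connect(db_path)
    try:
        conn.executemany("INSERT INTO items (name) VALUES (?)", rows)
        conn.commit()
    finally:
        conn.close()  # cleanup runs even if the insert raises
```

A code review checklist item for resource management would flag any acquire-without-`finally` (or without a `with` block) pattern like the one this replaces.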

### Performance degradation incident
<a name="example-auto-scaling"></a>

**Initial incident:** API response times increased from 200ms to 5000ms during traffic spike

1. **Why 1:** Why did response times increase? CPU utilization reached 100% on all application instances.

1. **Why 2:** Why didn't auto scaling add more instances? Auto scaling was triggered but new instances failed health checks.

1. **Why 3:** Why did new instances fail health checks? The application startup process takes 8 minutes, longer than the health check timeout.

1. **Why 4:** Why does startup take so long? The application downloads large configuration files from S3 on every startup.

1. **Why 5:** Why wasn't this startup delay considered in auto scaling configuration? Performance testing was done with pre-warmed instances, not cold starts.

**Root cause:** Performance testing methodology doesn't reflect production auto scaling scenarios

**Recommended actions:** Include cold start testing, optimize application startup, adjust health check timeouts, implement configuration caching
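
The configuration-caching recommendation amounts to reusing a local copy instead of re-downloading on every startup. The sketch below is generic: the `fetch` callback stands in for the real S3 download, which is not shown.

```python
import os

def load_config(cache_path, fetch):
    """Return config text, calling `fetch` only when no cached copy exists."""
    if os.path.exists(cache_path):
        with open(cache_path) as f:   # warm start: skip the download
            return f.read()
    data = fetch()                    # cold start: fetch once...
    with open(cache_path, "w") as f:  # ...then cache for the next startup
        f.write(data)
    return data
```

Production code would also validate freshness (for example, a TTL or object ETag) rather than trusting the cache indefinitely.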

### Complex incident with branch analysis
<a name="example-complex-branch"></a>

**Initial incident:** OpenSearch Serverless customers experienced 48.3% availability degradation for 11 hours

**Main analysis chain:**

1. **Why 1:** Why did customers experience service degradation? Service availability dropped to 48.3% due to incorrect ingester scaling.

1. **Why 2:** Why was ingester scaling incorrect? CortexOperator reduced ingesters from 223 to 174 due to AZ balance miscalculation.

1. **Why 3:** Why did CortexOperator miscalculate AZ balance? The code couldn't process new Kubernetes label formats after version 1.17 upgrade.

1. **Why 4 (Branch A - Technical):** Why didn't the code handle new label formats? The code expected 'failure-domain.beta.kubernetes.io/zone' labels but Kubernetes 1.17 changed to 'topology.kubernetes.io/zone'.

1. **Why 5 (Branch A):** Why wasn't backward compatibility implemented? The label format change wasn't documented in the upgrade notes reviewed during deployment planning.

**Branch B - Process Analysis:**

1. **Why 4 (Branch B - Process):** Why wasn't this caught in testing? Integration tests used pre-configured clusters with old label formats.

1. **Why 5 (Branch B):** Why didn't testing include label format validation? Test environment setup didn't mirror production Kubernetes version upgrade sequence.

**Root causes identified:**
+ Technical: Missing backward compatibility for Kubernetes label format changes
+ Process: Testing methodology doesn't validate version upgrade impacts

**Integrated remediation plan:** Implement label format detection logic, enhance upgrade testing procedures, add automated compatibility validation, and establish version change impact assessment process.
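
The backward-compatibility gap identified in Branch A is commonly closed by accepting both label keys. A minimal sketch (the helper name is hypothetical; the two label keys are the real Kubernetes ones described above):

```python
# New key since Kubernetes 1.17, followed by the legacy key it replaced.
ZONE_LABELS = ("topology.kubernetes.io/zone",
               "failure-domain.beta.kubernetes.io/zone")

def node_zone(node_labels):
    """Return the node's availability zone from whichever label key is present."""
    for key in ZONE_LABELS:
        if key in node_labels:
            return node_labels[key]
    raise KeyError("node has no zone label")
```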

## Using the guided 5 Whys workflow
<a name="accessing-5whys"></a>

CloudWatch investigations provides a guided 5 Whys analysis workflow to help you address missing facts and strengthen your incident reports. This feature appears as a suggested workflow when the system identifies opportunities to enhance root cause analysis.

### Interactive analysis experience
<a name="interactive-analysis"></a>

The 5 Whys analysis in CloudWatch investigations uses an interactive, chat-based approach that guides you through the investigation process. This conversational method helps ensure comprehensive analysis while maintaining logical flow between questions.

**Key features of the interactive experience:**
+ **Fact-based initialization** - The system presents relevant facts from your investigation upfront, using them to pre-populate obvious answers and clearly indicating fact-based versus inference-based suggestions
+ **Guided probing** - For each "why" question, the system suggests answers based on available facts, requests specific additional context, and guides you to consider important aspects before proceeding
+ **Branch management** - When multiple contributing factors are identified, the system clearly presents branch options, explains relationships between branches, and helps prioritize parallel investigations
+ **Progressive validation** - For each response, the system reformulates answers for clarity, seeks confirmation, highlights key insights, and connects findings to broader context

This approach ensures that you capture all relevant information while maintaining focus on the most critical causal relationships.

**Accessing the guided workflow:**

1. During incident report generation, review the **Facts need attention** section in the right panel.

1. Look for the **Guided 5-Whys analysis** suggestion under **Suggested workflow**.

1. Choose **Guide me** to start the interactive 5 Whys process.

1. Follow the guided prompts to systematically work through each "why" question, building a complete causal chain from symptoms to root cause.

The guided workflow helps ensure you capture comprehensive root cause information by walking you through each step of the 5 Whys methodology. The analysis results are automatically incorporated into your incident report, providing structured documentation for post-incident reviews and organizational learning.

You can also request a 5 Whys analysis through the chat interface by asking questions such as "Perform a 5 Whys analysis for this incident" or "What is the root cause using 5 Whys methodology?"

## Handling complex incidents with multiple causes
<a name="branch-analysis"></a>

Some incidents involve multiple contributing factors that require parallel analysis paths. CloudWatch investigations supports branch analysis to ensure all significant causes are identified and addressed.

**When branch analysis is needed:**
+ Multiple independent failures occurred simultaneously
+ Different system components contributed to the same customer impact
+ Both technical and process failures played significant roles
+ Cascading failures created multiple causal chains

**Branch analysis process:**

1. **Branch identification** - The system identifies points where multiple causes converge or diverge

1. **Parallel investigation** - Each branch is analyzed using the complete 5 Whys methodology

1. **Connection mapping** - Relationships between branches are documented to show how they interact

1. **Integrated resolution** - Remediation plans address all identified root causes and their interactions

This comprehensive approach ensures that complex incidents receive thorough analysis and that all contributing factors are addressed in the final remediation plan.
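
Structurally, branch analysis turns the linear 5 Whys chain into a small tree: a shared prefix of "why" steps, then one child chain per contributing factor, with one root cause per leaf. A minimal sketch (hypothetical names), condensing the OpenSearch Serverless example above:

```python
from dataclasses import dataclass, field

@dataclass
class WhyNode:
    answer: str
    branches: list = field(default_factory=list)  # child "why" chains

def root_causes(node):
    """Collect the leaf answers: one root cause per branch."""
    if not node.branches:
        return [node.answer]
    causes = []
    for child in node.branches:
        causes.extend(root_causes(child))
    return causes

# Shared chain diverging into technical and process branches.
tree = WhyNode("availability dropped",
    [WhyNode("ingester scaling was incorrect",
        [WhyNode("code missed new label format",   # Branch A (technical)
            [WhyNode("no backward compatibility")]),
         WhyNode("tests used old label formats",   # Branch B (process)
            [WhyNode("tests skip upgrade sequence")])])])
```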

## Best practices for effective 5 Whys analysis
<a name="5whys-best-practices"></a>

To maximize the effectiveness of 5 Whys analysis in your incident reports, follow these best practices derived from operational experience:

### Question formulation guidelines
<a name="question-formulation"></a>
+ **Start with customer impact** - Begin each analysis with the customer-facing problem to maintain focus on business impact
+ **Increase technical depth progressively** - Move from business impact to technical details as you progress through the questions
+ **Maintain logical continuity** - Ensure each answer naturally leads to the next question without logical gaps
+ **Include supporting evidence** - Reference specific metrics, logs, or timeline events to validate each answer

### Analysis validation
<a name="validation-criteria"></a>

Validate your 5 Whys analysis using these criteria:
+ **Logical flow** - Clear progression from symptoms to root cause with no missing steps
+ **Technical accuracy** - Correct terminology, accurate system behavior descriptions, and valid component interactions
+ **Completeness** - The analysis explains all observed symptoms and reaches a fundamental cause that, if addressed, would prevent recurrence
+ **Actionability** - The root cause identified leads to specific, implementable remediation actions

### Common pitfalls to avoid
<a name="common-pitfalls"></a>
+ **Stopping at symptoms** - Don't conclude the analysis at the first technical failure; continue until you reach systemic or process causes
+ **Blame-focused analysis** - Focus on system and process failures rather than individual actions
+ **Single-path thinking** - Consider multiple contributing factors and use branch analysis when appropriate
+ **Insufficient evidence** - Ensure each answer is supported by concrete data from your investigation

### Integration with incident report sections
<a name="5whys-integration"></a>

The 5 Whys analysis integrates with other sections of your incident report to provide comprehensive documentation:
+ **Timeline correlation** - Each "why" question can reference specific timeline events, providing temporal context for causal relationships
+ **Metrics validation** - Answers are supported by metrics and graphs that demonstrate the technical behaviors described
+ **Impact assessment alignment** - The first "why" directly connects to customer impact metrics documented in the impact assessment section
+ **Lessons learned foundation** - Root causes identified through 5 Whys analysis directly inform the lessons learned and corrective actions sections

This integration ensures consistency across your incident report and provides stakeholders with a complete, coherent narrative from initial symptoms through root cause to remediation plans.