

# Continuous monitoring
<a name="continuous-monitoring"></a>

 Continuous monitoring is the real-time observation and analysis of telemetry data to help optimize system performance. It encompasses alert configuration to notify teams of potential issues, promoting rapid response. Post-event investigations provide valuable insights to continuously optimize the monitoring process. By integrating artificial intelligence (AI) and machine learning (ML), continuous monitoring can achieve a higher level of precision and speed in detecting and responding to system issues. 

**Topics**
+ [Indicators for continuous monitoring](indicators-for-continuous-monitoring.md)
+ [Anti-patterns for continuous monitoring](anti-patterns-for-continuous-monitoring.md)
+ [Metrics for continuous monitoring](metrics-for-continuous-monitoring.md)

# Indicators for continuous monitoring
<a name="indicators-for-continuous-monitoring"></a>

Indicators for continuous monitoring cover the real-time observation and analysis of telemetry data. This capability drives continuous optimization through alert tuning and post-event investigations.

**Topics**
+ [[O.CM.1] Automate alerts for security and performance issues](o.cm.1-automate-alerts-for-security-and-performance-issues.md)
+ [[O.CM.2] Plan for large scale events](o.cm.2-plan-for-large-scale-events.md)
+ [[O.CM.3] Conduct post-incident analysis for continuous improvement](o.cm.3-conduct-post-incident-analysis-for-continuous-improvement.md)
+ [[O.CM.4] Report on business metrics to drive data-driven decision making](o.cm.4-report-on-business-metrics-to-drive-data-driven-decision-making.md)
+ [[O.CM.5] Detect performance issues using application performance monitoring](o.cm.5-detect-performance-issues-using-application-performance-monitoring.md)
+ [[O.CM.6] Gather user experience insights using digital experience monitoring](o.cm.6-gather-user-experience-insights-using-digital-experience-monitoring.md)
+ [[O.CM.7] Visualize telemetry data in real-time](o.cm.7-visualize-telemetry-data-in-real-time.md)
+ [[O.CM.8] Hold operational review meetings for data transparency](o.cm.8-hold-operational-review-meetings-for-data-transparency.md)
+ [[O.CM.9] Optimize alerts to prevent fatigue and minimize monitoring costs](o.cm.9-optimize-alerts-to-prevent-fatigue-and-minimize-monitoring-costs.md)
+ [[O.CM.10] Proactively detect issues using AI/ML](o.cm.10-proactively-detect-issues-using-aiml.md)

# [O.CM.1] Automate alerts for security and performance issues
<a name="o.cm.1-automate-alerts-for-security-and-performance-issues"></a>

 **Category:** FOUNDATIONAL 

 Alerts should automatically notify teams when there are indicators of malicious activity, compromise, or performance degradation. Effective alerting accelerates incident response times, enabling teams to quickly address and resolve issues before they can significantly impact system performance or security. Without automatic alerting, teams can suffer from delayed response times that can lead to prolonged system downtime or increased exposure to security threats. 

 Implement centralized alerting mechanisms to track anomalous behavior across all systems. Define specific conditions and thresholds that, when breached, will raise alerts. Verify that the alerts are delivered to the appropriate teams by email, text message, or the team's preferred notification system. Integrating these alerts into your centralized incident management systems can also help in the automatic creation of tickets, aiding faster resolution. 

 In a more advanced workflow, alerts can be integrated with automated governance systems to start remediation actions immediately upon detection or to gather additional insights that will aid investigations. 
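The core of such alerting can be sketched as a simple rule engine that evaluates telemetry against defined thresholds. The `AlertRule` fields and metric names below are illustrative; in practice, a managed service such as Amazon CloudWatch would evaluate these conditions and route notifications:

```python
from dataclasses import dataclass

@dataclass
class AlertRule:
    """One alert condition: fire when the metric breaches the threshold."""
    metric: str
    threshold: float
    comparison: str  # "gt" or "lt"

def evaluate(rules, datapoint):
    """Return the list of rules breached by a single telemetry datapoint."""
    breached = []
    for rule in rules:
        value = datapoint.get(rule.metric)
        if value is None:
            continue  # metric absent from this datapoint; nothing to compare
        if rule.comparison == "gt" and value > rule.threshold:
            breached.append(rule)
        elif rule.comparison == "lt" and value < rule.threshold:
            breached.append(rule)
    return breached

rules = [
    AlertRule("cpu_percent", 90.0, "gt"),
    AlertRule("failed_logins_per_min", 20.0, "gt"),
    AlertRule("free_disk_gb", 5.0, "lt"),
]
alerts = evaluate(rules, {"cpu_percent": 97.2, "free_disk_gb": 12.0})
print([r.metric for r in alerts])  # ['cpu_percent']
```

In a real deployment, each breached rule would feed a notification channel (email, chat, or an incident management system) rather than a print statement.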

**Related information:**
+  [AWS Well-Architected Performance Pillar: PERF07-BP06 Monitor and alarm proactively](https://docs.aws.amazon.com/wellarchitected/latest/performance-efficiency-pillar/perf_monitor_instances_post_launch_proactive.html) 
+  [AWS Well-Architected Reliability Pillar: REL06-BP03 Send notifications (Real-time processing and alarming)](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/rel_monitor_aws_resources_notification_monitor.html) 
+  [What is Anomaly Detection?](https://aws.amazon.com/what-is/anomaly-detection/) 
+  [AWS Security Hub CSPM](https://aws.amazon.com/security-hub/) 
+  [Amazon OpenSearch Service](https://aws.amazon.com/opensearch-service/) 
+  [AWS Health Aware](https://github.com/aws-samples/aws-health-aware/) 
+  [Amazon's approach to high-availability deployment: Anomaly detection](https://youtu.be/bCgD2bX1LI4?t=2493) 

# [O.CM.2] Plan for large scale events
<a name="o.cm.2-plan-for-large-scale-events"></a>

 **Category:** FOUNDATIONAL 

 A large scale event (LSE) is an incident that has a wide impact, such as a service outage or major security incident. Proper management of LSEs helps ensure business continuity, maintain customer trust, and reduce the negative impact of such events. 

 Prepare a detailed incident management plan, outlining the roles, responsibilities, and processes to be followed in the event of a large-scale incident. At a minimum, the plan should outline how teams expect to maintain availability and reliability of systems by having the capability to automatically scale resources, re-route traffic, and failover to backup systems when required. 

**Related information:**
+  [Disaster Recovery of Workloads on AWS: Recovery in the Cloud](https://docs.aws.amazon.com/whitepapers/latest/disaster-recovery-workloads-on-aws/disaster-recovery-workloads-on-aws.html) 
+  [Incident management](https://docs.aws.amazon.com/whitepapers/latest/tagging-best-practices/incident-management.html) 
+  [Disaster recovery plan](https://aws.amazon.com/disaster-recovery/faqs/#Core_concepts) 
+  [Amazon's approach to security during development: Handling a security incident](https://youtu.be/NeR7FhHqDGQ?t=1962) 

# [O.CM.3] Conduct post-incident analysis for continuous improvement
<a name="o.cm.3-conduct-post-incident-analysis-for-continuous-improvement"></a>

 **Category:** FOUNDATIONAL 

 Drive the continuous improvement of analysis and response mechanisms by holding post-incident retrospectives. These retrospectives allow teams to identify gaps and areas for improvement by analyzing the actions that were taken during an incident. They should not be used to place blame or point fingers at individuals. Instead, they give teams time to optimize their response process for future incidents and help ensure that teams are continuously learning and improving their incident response capabilities. This approach leads to more efficient and effective resolution of incidents over time. 

 All relevant stakeholders involved with the incident and the system should attend the retrospective. At a minimum, this should include the leaders and individual contributors who support the system, the customer advocates, those who were impacted by the issue internally, and those involved with resolving the issue. The post-incident retrospective findings should be anonymized, so as not to place blame on any individuals, and should be well documented and shared with the broader organization so that others can learn as well. 

**Related information:**
+  [AWS Well-Architected Performance Pillar: PERF07-BP02 Analyze metrics when events or incidents occur](https://docs.aws.amazon.com/wellarchitected/latest/performance-efficiency-pillar/perf_monitor_instances_post_launch_review_metrics.html) 
+  [AWS Well-Architected Reliability Pillar: REL12-BP02 Perform post-incident analysis](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/rel_testing_resiliency_rca_resiliency.html) 
+  [AWS Well-Architected Operational Excellence Pillar: OPS11-BP02 Perform post-incident analysis](https://docs.aws.amazon.com/wellarchitected/latest/operational-excellence-pillar/ops_evolve_ops_perform_rca_process.html) 

# [O.CM.4] Report on business metrics to drive data-driven decision making
<a name="o.cm.4-report-on-business-metrics-to-drive-data-driven-decision-making"></a>

 **Category:** FOUNDATIONAL 

 Business metrics for all systems should be accessible and comprehensible to leaders and key stakeholders. These metrics should inform key performance indicators (KPIs), service level objectives (SLOs), service level agreement (SLA) adherence, user engagement, conversion rates, and other metrics relevant to the business sides of your operations. 

 Just like with technology metrics, continuous monitoring tools should be used to detect when business metrics cross predefined thresholds, triggering alerts that highlight significant deviations or potential issues. These alerts should inform timely and data-driven decision-making, helping identify areas for improvement, optimizing system performance, and aligning actions with overarching business goals. 

 Create dashboards or reports that present these metrics, as well as how they are tracking against KPIs and SLAs, in a user-friendly, non-technical format. Ensure the data is up-to-date, accurate, and accessible to less technical leaders so that it can be used to make informed business decisions. Observability isn't merely about data collection—it is about turning that data into actionable insights that drive better outcomes for both the technology and business sides of the organization. 

 Fast feedback leads to success. Continuously monitoring and alerting on business metrics is becoming foundational for organizations committed to maximizing the value they get from their technology investments and for maintaining the quality of their digital services. 

**Related information:**
+  [AWS Well-Architected Performance Pillar: PERF07-BP05 Review metrics at regular intervals](https://docs.aws.amazon.com/wellarchitected/latest/performance-efficiency-pillar/perf_monitor_instances_post_launch_review_metrics_collected.html) 
+  [Operational observability](https://docs.aws.amazon.com/whitepapers/latest/tagging-best-practices/operational-observability.html) 
+  [The Amazon Software Development Process: Measure Everything](https://youtu.be/52SC80SFPOw?t=1922) 
+  [Using Cloud Fitness Functions to Drive Evolutionary Architecture](https://aws.amazon.com/blogs/architecture/using-cloud-fitness-functions-to-drive-evolutionary-architecture/) 

# [O.CM.5] Detect performance issues using application performance monitoring
<a name="o.cm.5-detect-performance-issues-using-application-performance-monitoring"></a>

 **Category:** RECOMMENDED 

 Application Performance Monitoring (APM) refers to the use of tools to monitor and manage the ongoing, real-time performance and availability of systems in production environments. APM tools help in maintaining the performance of systems by identifying performance issues early on. This leads to quicker resolution of issues, improved user experience, and reduced downtime. 

 To comprehensively monitor application performance, implement both Real-User Monitoring (RUM) and Synthetic Monitoring. These APM tools enable teams to proactively detect and diagnose complex application performance problems in production systems, helping maintain an expected level of service. 

 RUM captures performance metrics based on actual user interactions. Analyze real user data to understand areas of the system that are frequently used and might benefit from performance improvements. This data can then be used to identify and debug client-side issues to optimize end-user experience. 

 On the other hand, Synthetic Monitoring involves writing scripts that simulate user interactions, known as canaries, to continuously monitor endpoints and APIs. Canaries follow the same routes and perform the same actions as a customer, allowing for the continuous verification of the customer experience even in the absence of actual customer traffic. By using insights from RUM, you can optimize which canaries to run continuously, ensuring they closely mimic the most common user paths. This strategy ensures potential issues are identified before impacting users, offering a seamless user experience. 

 Both tools collect metrics on response time, resource utilization, and other performance-related indicators, forming a holistic approach to continuous performance monitoring in production environments. 
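As a rough sketch of the canary idea, the check below simulates one scripted user interaction and verifies both the response and its latency. The `fetch` callable, URL, and thresholds are placeholders standing in for a real HTTP client, endpoint, and service-level targets:

```python
import time

def run_canary(fetch, url, expected_status=200, max_latency_ms=500):
    """Run one synthetic check: call the endpoint and verify that the
    response status and latency meet expectations.

    `fetch` is injected so the canary logic stays testable offline;
    in production it would wrap a real HTTP client call.
    """
    start = time.monotonic()
    status, body = fetch(url)
    latency_ms = (time.monotonic() - start) * 1000
    return {
        "url": url,
        "status": status,
        "latency_ms": round(latency_ms, 1),
        "healthy": status == expected_status and latency_ms <= max_latency_ms,
    }

# Stubbed fetch standing in for a real HTTP client:
result = run_canary(lambda url: (200, "ok"), "https://example.com/checkout")
print(result["healthy"])  # True
```

A scheduler would run such checks continuously against the most common user paths identified by RUM, alerting when `healthy` flips to false.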

**Related information:**
+  [AWS Well-Architected Performance Pillar: PERF01-BP06 Benchmark existing workloads](https://docs.aws.amazon.com/wellarchitected/latest/performance-efficiency-pillar/perf_performing_architecture_benchmark.html) 
+  [What is APM (Application Performance Monitoring)?](https://aws.amazon.com/what-is/application-performance-monitoring/) 
+  [Real-User Monitoring (RUM) for Amazon CloudWatch](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-RUM.html) 
+  [Amazon CloudWatch ServiceLens](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ServiceLens.html) 
+  [Amazon CloudWatch Synthetics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Synthetics_Canaries.html) 
+  [Amazon CloudWatch Internet Monitor](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-InternetMonitor.html) 

# [O.CM.6] Gather user experience insights using digital experience monitoring
<a name="o.cm.6-gather-user-experience-insights-using-digital-experience-monitoring"></a>

 **Category:** RECOMMENDED 

 Digital Experience Monitoring (DEM) involves simulating user interactions with applications to measure the performance and availability of services from the perspective of end users. DEM allows teams to proactively detect and resolve issues that may impact user experience. It also helps in validating that application updates or changes do not negatively impact user experience. 

 Implement APM tools, such as synthetic transaction monitoring using canaries to simulate user interactions with your application and measure the response times and accuracy of the results. 

 DEM is recommended because it provides important insights into the user experience and helps detect issues that may degrade it. 

**Related information:**
+  [Amazon CloudWatch Synthetics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Synthetics_Canaries.html) 
+  [AWS Marketplace - Digital Experience Monitoring](https://aws.amazon.com/marketplace/search/results?searchTerms=Digital+Experience+Monitoring) 

# [O.CM.7] Visualize telemetry data in real-time
<a name="o.cm.7-visualize-telemetry-data-in-real-time"></a>

 **Category:** RECOMMENDED 

 Visualization tools simplify the task of correlating and understanding large, complex datasets. Using these tools, teams can detect trends, patterns, and anomalies in data in a readily available, easy-to-understand way. 

 Use visualization tools to correlate and comprehend large sets of telemetry data in real-time. Visualization tools support the uniquely human capability to discover patterns that automated tools may otherwise miss. Choose a tool that provides a clear view of system data at varying time intervals, allowing teams to easily detect issues both during and after they arise. Ensure that the tool is flexible and customizable, so that teams can adjust the views and create dashboards based on their unique needs. 

**Related information:**
+  [Building dashboards for operational visibility](https://aws.amazon.com/builders-library/building-dashboards-for-operational-visibility) 
+  [Building Prowler into a QuickSight powered AWS Security Dashboard](https://catalog.us-east-1.prod.workshops.aws/workshops/b1cdc52b-eb11-44ed-8dc8-9dfe5fb254f5/en-US) 

# [O.CM.8] Hold operational review meetings for data transparency
<a name="o.cm.8-hold-operational-review-meetings-for-data-transparency"></a>

 **Category:** RECOMMENDED 

 Operational review meetings are regular gatherings where teams from across the organization come prepared with an operational dashboard that showcases telemetry data, performance metrics, and other insights into operations for their products. The aim is to present to a broad audience in order to share and gain different perspectives on changes in the data, whether a spike, dip, or trend. This promotes a culture of transparency, preparedness, and continuous improvement throughout the organization. 

 Amazon implements this by holding weekly Ops review meetings and using the [spinning wheel](https://github.com/aws/aws-ops-wheel) as a random selection method for which team will present. The randomness of the selection ensures that each team comes prepared, as any team can be called upon to present. When presenting, teams must be capable of deep diving into the data, explaining root causes behind notable data changes, and articulating the steps taken or planned to rectify any anomalies. This pushes teams to maintain high-quality operational dashboards that reflect the real-time health and performance of their services. 
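The selection mechanic can be illustrated in a few lines; this is only a sketch of the idea behind a random presenter wheel, not the AWS Ops Wheel implementation itself:

```python
import random

def spin_wheel(teams, recent, rng=random):
    """Pick a presenting team at random, excluding recent presenters so no
    team repeats back-to-back -- but every team must still come prepared,
    because any eligible team can be selected."""
    eligible = [t for t in teams if t not in recent]
    if not eligible:  # everyone presented recently; reset to the full list
        eligible = list(teams)
    return rng.choice(eligible)

teams = ["payments", "search", "identity", "fulfillment"]
picked = spin_wheel(teams, recent={"payments"})
assert picked != "payments"
```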

**Related information:**
+  [AWS Well-Architected Performance Pillar: PERF07-BP05 Review metrics at regular intervals](https://docs.aws.amazon.com/wellarchitected/latest/performance-efficiency-pillar/perf_monitor_instances_post_launch_review_metrics_collected.html) 
+  [AWS Well-Architected Reliability Pillar: REL06-BP06 Conduct reviews regularly](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/rel_monitor_aws_resources_review_monitoring.html) 
+  [AWS Ops Wheel](https://github.com/aws/aws-ops-wheel) 
+  [AWS Well-Architected Operational Excellence Pillar: OPS11-BP07 Perform operations metrics reviews](https://docs.aws.amazon.com/wellarchitected/latest/operational-excellence-pillar/ops_evolve_ops_metrics_review.html) 
+  [The Amazon Software Development Process: Monitor Everything](https://youtu.be/52SC80SFPOw?t=1548) 

# [O.CM.9] Optimize alerts to prevent fatigue and minimize monitoring costs
<a name="o.cm.9-optimize-alerts-to-prevent-fatigue-and-minimize-monitoring-costs"></a>

 **Category:** RECOMMENDED 

 Reduce the number of ineffective alerts as well as the costs associated with monitoring by optimizing rules and thresholds for alerts based on business impact and issue severity. By continuously refining rules and thresholds for alerts, teams can minimize unnecessary notifications, reducing the time and resources spent on non-critical issues. This helps teams focus on high-impact issues, enhancing productivity and efficiency. 

 Set up alert rules and thresholds based on the severity and business impact of potential issues. Teams should leverage cost-effective methods for delivering notifications, and work to reduce the amount of false positive notifications. Regular reviews and adjustments of these rules and thresholds should be done based on usage patterns to further minimize costs, while still ensuring that teams are alerted to critical issues in a timely and effective manner. 

 Implementing intelligent alerting strategies, such as alert deduplication, aggregation, and comprehensive data visualization can help to reduce cost, alert fatigue, and data overload that comes with having too many alerts. 
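Alert deduplication, one of the strategies above, can be sketched as grouping alerts with the same fingerprint inside a time window into a single notification. The alert shape, field names, and five-minute window below are illustrative assumptions:

```python
from collections import defaultdict

def deduplicate(alerts, window_s=300):
    """Collapse repeated alerts sharing a (source, rule) fingerprint within
    a time window into one notification carrying an occurrence count."""
    grouped = defaultdict(list)
    for a in sorted(alerts, key=lambda a: a["ts"]):
        grouped[(a["source"], a["rule"])].append(a)

    notifications = []
    for (source, rule), items in grouped.items():
        batch = [items[0]]
        for a in items[1:]:
            if a["ts"] - batch[0]["ts"] <= window_s:
                batch.append(a)  # same window: fold into the open batch
            else:
                notifications.append({"source": source, "rule": rule, "count": len(batch)})
                batch = [a]      # window elapsed: start a new batch
        notifications.append({"source": source, "rule": rule, "count": len(batch)})
    return notifications

raw = [
    {"source": "web-1", "rule": "high_cpu", "ts": 0},
    {"source": "web-1", "rule": "high_cpu", "ts": 60},
    {"source": "web-1", "rule": "high_cpu", "ts": 120},
    {"source": "db-1", "rule": "disk_low", "ts": 30},
]
print(deduplicate(raw))  # 4 raw alerts collapse into 2 notifications
```

Each resulting notification carries a count, so responders still see how noisy the underlying condition was without receiving every repeat.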

# [O.CM.10] Proactively detect issues using AI/ML
<a name="o.cm.10-proactively-detect-issues-using-aiml"></a>

 **Category:** OPTIONAL 

 Adopt data-driven AI/ML monitoring tools and techniques like Artificial Intelligence Operations (AIOps), ML-powered anomaly detection, and predictive analytics solutions, to detect issues and performance bottlenecks proactively—even before system performance is impacted. 

 Choose a tool that can leverage data and analytics to automatically infer predictions, and begin to feed data to it and inject failure to test the validity of the tool. These tools should have access to both historical and real-time data. Once operational, the tool can automatically detect issues, predict impending resource exhaustion, detail likely causes, and recommend remediation actions to the team. Ensure that there is a feedback loop to continuously train and refine these models based on real-world data and incidents. 

 Start small when setting up alerts from these tools to avoid alert fatigue and maintain trust in the system. As the tool becomes more familiar with the data patterns, teams can gradually increase the alerting scope. Regularly validate the tool's predictions by injecting failures and observing the responses. 
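As a minimal stand-in for the ML-powered detectors described above, a rolling z-score flags datapoints that deviate sharply from recent history. Real AIOps tools use far richer models; the window size and threshold here are assumptions for illustration:

```python
import statistics

def detect_anomalies(series, window=10, z_threshold=3.0):
    """Flag indices whose value deviates more than z_threshold standard
    deviations from the mean of the preceding `window` datapoints."""
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mean = statistics.mean(history)
        stdev = statistics.stdev(history)
        if stdev > 0 and abs(series[i] - mean) / stdev > z_threshold:
            anomalies.append(i)
    return anomalies

# A latency spike at index 10 stands out against a stable baseline:
latency = [100, 102, 99, 101, 100, 98, 103, 101, 100, 99, 450, 101]
print(detect_anomalies(latency))  # [10]
```

Injecting a synthetic spike like this one is a simple way to validate that a detector responds before trusting it with production alerting.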

**Related information:**
+  [Machine-Learning-Powered DevOps - Amazon DevOps Guru](https://aws.amazon.com/devops-guru/) 
+  [Amazon GuardDuty](https://aws.amazon.com/guardduty/) 
+  [Continuous Monitoring and Threat Detection](https://aws.amazon.com/security/continuous-monitoring-threat-detection/) 
+  [Gaining operational insights with AIOps using Amazon DevOps Guru Workshop](https://catalog.us-east-1.prod.workshops.aws/workshops/f92df379-6add-4101-8b4b-38b788e1222b/en-US) 
+  [What Is Anomaly Detection?](https://aws.amazon.com/what-is/anomaly-detection) 
+  [What Is Predictive Analytics?](https://aws.amazon.com/what-is/predictive-analytics) 

# Anti-patterns for continuous monitoring
<a name="anti-patterns-for-continuous-monitoring"></a>
+  **Blame culture**: Encouraging a culture where individuals are blamed for errors or failures can deter open communication, and the collaborative diagnosis of issues. In a blame culture, team members may hide or underreport issues for fear of retribution. Instead, foster a culture of shared responsibility where failures are seen as opportunities for learning and improvement. Encourage open discussions and retrospectives to understand the root causes and to find ways to prevent similar issues in the future. 
+  **Overlooking derived metrics**: Relying solely on surface-level metrics without deriving deeper insights can lead to unaddressed issues and potential service disruptions. Ensure that monitoring includes a comprehensive understanding of system performance by analyzing metrics in depth, such as distinguishing between latencies based on query size or categorizing error types. Use techniques like anomaly detection and consider metrics like trimmed means for latency to reveal patterns obscured by averages. 
+  **Inadequate monitoring coverage**: Failing to monitor every critical system, or to review your monitoring strategy regularly, can lead to undetected issues or performance degradation. Regularly assess and update monitoring coverage, ensuring that all systems and applications are being observed. A symptom of this anti-pattern is "no dogs barking," where the absence of expected alerts or metrics itself can indicate an issue. 
+  **Noisy and unactionable alarms:** If alarms frequently sound without actionable cause, trust in the alerting system diminishes, risking slower response times or overlooked genuine alerts. Ensure that alerts are both actionable and significant by continuously evaluating the outcomes they lead to. Implement mechanisms to mute false positives and adjust overly sensitive alarms. 

# Metrics for continuous monitoring
<a name="metrics-for-continuous-monitoring"></a>
+  **Mean time to detect (MTTD)**: The average time it takes to detect a performance issue, attack, or compromise. A shorter MTTD helps organizations respond more quickly to incidents, minimizing damage and downtime. Track this metric by calculating the average time from when incidents occur to when they're detected by the monitoring systems. This includes both automated system detections and manual reporting. 
+  **Mean time between failures (MTBF)**: The average time interval between consecutive failures in the production environment. Tracking this metric helps to gauge the reliability and stability of a system. It can be improved by improving testing capabilities, proactively monitoring for system health, and holding post-incident reviews to address root causes. Monitor system outages and failures, then calculate the average time between these events over a given period. 
+  **Post-incident retrospective frequency**: The frequency at which post-incident retrospectives are held. Holding regular retrospectives help teams continuously improve analysis and incident response processes. Measure this metric by counting the number of retrospectives conducted within specified intervals, such as monthly or quarterly. This can also be validated against the total number of incidents to understand if all incidents are followed up with a retrospective. 
+  **False positive rate**: The percentage of alerts generated that are false positives, or incidents that do not require action. A lower false positive rate reduces alert fatigue and ensures that teams can focus on genuine issues. Calculate by dividing the number of false positive alerts by the total number of alerts generated and multiplying by 100 to get the percentage. 
+  **Application performance index ([Apdex](https://en.wikipedia.org/wiki/Apdex))**: Measures user satisfaction with application responsiveness using a scale from 0 to 1. A higher Apdex score indicates better application performance, likely resulting in improved user experience, while a lower score means that users might become frustrated.

  To determine the Apdex score, start by defining a target response time that represents an acceptable user experience for your application. Then, categorize every transaction in one of three ways:
  + **Satisfied**, if its response time is up to and including the target time.
  + **Tolerating**, if its response time is more than the target time but no more than four times the target time.
  + **Frustrated**, for any response time beyond four times the target time.

  Calculate the Apdex score by adding the number of *Satisfied* transactions with half the *Tolerating* transactions. Then, divide this sum by the total number of transactions. Continuously monitor and adjust your target time based on evolving user expectations and leverage the score to identify and rectify areas that contribute to user dissatisfaction. 
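The calculation above can be sketched as follows; the sample response times and the 500 ms target are illustrative:

```python
def apdex(response_times_ms, target_ms):
    """Compute the Apdex score from raw response times and a target time.

    Satisfied: <= target; Tolerating: <= 4 * target; Frustrated: beyond.
    Score = (satisfied + tolerating / 2) / total, on a 0-to-1 scale.
    """
    if not response_times_ms:
        return None
    satisfied = sum(1 for t in response_times_ms if t <= target_ms)
    tolerating = sum(1 for t in response_times_ms if target_ms < t <= 4 * target_ms)
    return (satisfied + tolerating / 2) / len(response_times_ms)

# 6 satisfied, 2 tolerating, 2 frustrated out of 10 samples:
samples = [120, 180, 250, 300, 90, 400, 900, 1500, 2100, 3500]
print(apdex(samples, target_ms=500))  # 0.7
```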