# Monitoring and observability
<a name="observability"></a>

 Like security, monitoring and observability are required for all teams who operate and administer cloud applications and services. As described in the [Operational Excellence Pillar whitepaper](https://docs.aws.amazon.com/wellarchitected/latest/operational-excellence-pillar/welcome.html), your teams must define, capture, and analyze operations metrics to gain visibility into workload events so that you can take appropriate action. In the management layer, this also means understanding operational metrics as you provide guardrails, network, security, and identity services in your management platform. 

 All of your teams, whether responsible for many cloud environments or a single application, must be able to understand the health of their operations easily. Your teams will want to use metrics based on operations outcomes to gain useful insights. You should use these metrics to make informed decisions, and as key inputs into each of the eight M&G Guide capabilities. AWS makes it easier to bring together and analyze your operations logs so that you can generate metrics, know the status of your operations, and gain insight from operations over time. These activities are supported centrally when you provide an observability solution for the consumption, storage, analysis, and presentation of operational data. 

 As described in [Responding to Events](https://docs.aws.amazon.com/wellarchitected/latest/operational-excellence-pillar/responding-to-events.html), you should anticipate both planned operational events (such as sales promotions, deployments, and failure tests) and unplanned ones (such as surges in utilization and component failures). Use simulations, custom runbooks, and playbooks, and iterate to deliver consistent results when you respond to alerts. Each defined alert should be owned by a role or a team that is accountable for the response and escalations. You will also want to know the business impact of your system components and use this to target efforts when needed. Perform a root cause analysis (RCA) after events, and then either introduce the changes and controls needed to prevent recurrence of failures, or document workarounds. 

 In many enterprises, technical teams share integrated systems to monitor the services or infrastructure they manage. Shared observability systems bring together all the performance data for an entire organization, enabling teams to visualize the connections between services and components, collaborate with real-time data, and quickly identify the source of performance or security issues. 

 Observability systems collect data directly from applications and from AWS logging and service metric capabilities. AWS provides several services that can help improve your monitoring and observability posture. These services include [AWS CloudTrail](https://aws.amazon.com/cloudtrail/), [Amazon CloudWatch](https://aws.amazon.com/cloudwatch/), [Amazon Managed Service for Prometheus](https://aws.amazon.com/prometheus/), [VPC Flow Logs](https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs.html), [AWS X-Ray traces](https://aws.amazon.com/xray/features/), [Amazon EventBridge events](https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-events.html), [Amazon Managed Grafana](https://aws.amazon.com/grafana/), [Elastic Load Balancing](https://aws.amazon.com/elasticloadbalancing/), and [AWS Network Firewall](https://aws.amazon.com/network-firewall/). 

# Interoperable functions
<a name="interoperable-functions-5"></a>

 The eight management and governance functions, supported by AWS services and AWS Partner solutions, work together and interoperate to reduce complexity. Outputs from these functions are used to inform or integrate with other functions. For monitoring and observability this includes: 
+  Incorporating complementary **Controls** to observe changes and highlight them in the observability tools. 
+  **Network capabilities** that have VPC Flow Logs archived with the central infrastructure log archives and included in the log aggregation tools. 
+  Access to observability tooling defined by **Identity management** with changes to configuration recorded. 
+  **Security management** with observability by design, and specific systems to alert for changes in observability practices. 
+  **Service management** frameworks integrated with observability through operational tooling such as patch management and change and incident management. 
+  **Cloud Financial Management** with observability measures to alert for changes (including outliers in both upper and lower spend) in incurred and forecasted costs. 
+  **Sourcing and distribution** for both custom solutions and purchased solutions with specific logging integrated with your observability design. 
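The Cloud Financial Management item above mentions alerting on both upper and lower spend outliers. The following is a minimal sketch of such a two-sided check in plain Python. The function name, the use of the mean plus or minus `k` standard deviations, the default `k=3`, and the sample figures are all illustrative assumptions, not a description of how any AWS cost tool works.

```python
from statistics import mean, pstdev
from typing import Optional, Sequence


def classify_spend(history: Sequence[float], latest: float, k: float = 3.0) -> Optional[str]:
    """Return "high", "low", or None for the latest spend value.

    The value is compared against the mean of the trailing history plus or
    minus k population standard deviations. Both the method and the default
    k are illustrative choices.
    """
    center = mean(history)
    spread = pstdev(history)
    if latest > center + k * spread:
        return "high"
    if latest < center - k * spread:
        return "low"
    return None


# Hypothetical daily spend figures for one account.
DAILY_SPEND = [100.0, 102.0, 98.0, 101.0, 99.0]

if __name__ == "__main__":
    for value in (150.0, 40.0, 101.0):
        print(value, classify_spend(DAILY_SPEND, value))
```

Checking the lower bound as well as the upper bound matters because an unexpected drop in spend can indicate a stopped workload just as a spike can indicate misuse.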

# Implementation priorities
<a name="implementation-priorities-5"></a>

## Collect, aggregate, and protect event and log data
<a name="mon-coll"></a>

 After you have provisioned your multi-account framework with [AWS Control Tower](https://aws.amazon.com/controltower/), you will have enabled the centralized collection of observable metrics and events to a log archive account, using CloudTrail. This collection uses a dedicated and encrypted Amazon S3 bucket, in a dedicated account, with access restricted. Encryption keys should be rotated on a regular basis to increase the security posture of the log archive. Use log aggregation to increase your visibility at scale. Use a service control policy to prevent changes to log configurations. 
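As a minimal sketch of the last recommendation, the following Python builds a service control policy document that denies changes to CloudTrail logging. The statement ID, the selection of actions, and the helper name are illustrative assumptions. AWS Control Tower provisions its own controls, and this sketch is not a copy of them.

```python
import json

# CloudTrail actions that would stop or alter logging (illustrative selection).
PROTECTED_ACTIONS = [
    "cloudtrail:StopLogging",
    "cloudtrail:DeleteTrail",
    "cloudtrail:UpdateTrail",
    "cloudtrail:PutEventSelectors",
]


def build_log_protection_scp(actions=PROTECTED_ACTIONS):
    """Return an SCP document (as a dict) that denies the given actions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyCloudTrailChanges",  # hypothetical statement ID
                "Effect": "Deny",
                "Action": list(actions),
                "Resource": "*",
            }
        ],
    }


if __name__ == "__main__":
    # Print the policy as JSON so it can be reviewed before it is attached.
    print(json.dumps(build_log_protection_scp(), indent=2))
```

In practice you would likely add a condition that exempts the role used by your central platform team, so that authorized changes remain possible.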

 Use [AWS Systems Manager Quick Setup](https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-quick-setup.html) with policies defined at the organization level, to deploy the [CloudWatch agent](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Install-CloudWatch-Agent.html) to EC2 instances across your environments. This will enable system-level metrics to be aggregated alongside your other log data. Feed events into an event management or SIEM platform that has been adapted for AWS environments via API integration. Logs, metrics, and traces should be collected across the following observability categories: 
+  **Control plane observability**—Enable CloudTrail logging to capture API call activity. As accounts are provisioned from AWS Control Tower, a service control policy will be provisioned which prevents changes to the CloudTrail configuration and log archive account. 
+  **Network observability**—Monitor and track network events and behaviors including network firewalls, network intrusion detection and prevention, load balancers, AWS WAF, proxy tools, and network flow data collection and monitoring. Track events and behaviors related to access controls (for example, security groups and firewall services) and monitor network activity with Amazon VPC Flow Logs and packet inspection with Amazon VPC Traffic Mirroring. 
+  **Workload observability** (including distributed tracing within your application observability solutions for serverless, container, storage, and database workloads)—Track events and behaviors at scale as workloads communicate within the cloud environment as a whole, in addition to the local application logs on individual systems. 
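To make the network flow data in the list above more concrete, the following sketch parses records in the default (version 2) VPC Flow Logs format and keeps the rejected flows. The helper names and the sample records are hypothetical. Custom log formats use a different field order and would need a different field list.

```python
from typing import Dict, List

# Field order of the default (version 2) VPC Flow Logs record format.
DEFAULT_FIELDS = [
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
]


def parse_flow_log_record(line: str) -> Dict[str, str]:
    """Split one space-separated record into a dict keyed by field name."""
    parts = line.split()
    if len(parts) != len(DEFAULT_FIELDS):
        raise ValueError(f"expected {len(DEFAULT_FIELDS)} fields, got {len(parts)}")
    return dict(zip(DEFAULT_FIELDS, parts))


def rejected_flows(lines: List[str]) -> List[Dict[str, str]]:
    """Return the parsed records whose action is REJECT."""
    records = [parse_flow_log_record(line) for line in lines]
    return [record for record in records if record["action"] == "REJECT"]


# Hypothetical sample records, made up for illustration.
SAMPLE_RECORDS = [
    "2 123456789012 eni-0123456789abcdef0 10.0.1.5 10.0.2.9 49152 443 6 10 8400 1700000000 1700000060 ACCEPT OK",
    "2 123456789012 eni-0123456789abcdef0 203.0.113.7 10.0.1.5 40022 22 6 3 180 1700000000 1700000060 REJECT OK",
]

if __name__ == "__main__":
    for flow in rejected_flows(SAMPLE_RECORDS):
        print(flow["srcaddr"], "->", flow["dstaddr"], "port", flow["dstport"])
```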

## Build capabilities to analyze and visualize log events and traces
<a name="mon-build"></a>

 Build capabilities to interactively search and analyze your local and centralized log data. As you scale with AWS, you will need to include the ability to index and visualize your log insights and metrics. Correlate logs and performance metrics across different types of data collection to drive meaningful conclusions and insights. Use rules to effectively respond to security events or patterns identified in your logs. Develop a nearly continuous monitoring strategy to scale your observability capabilities as you migrate and grow solutions on AWS. 
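Correlating logs with performance metrics can be as simple as joining both on a shared time bucket. The following sketch counts error log events per minute and reports the minutes in which both the error count and a latency metric exceed a threshold. The record shape, the thresholds, and the sample data are assumptions made for illustration. A managed query or visualization tool would replace this logic at scale.

```python
from collections import Counter
from datetime import datetime
from typing import Dict, List, Tuple


def minute_bucket(timestamp: str) -> str:
    """Truncate an ISO 8601 timestamp to minute resolution."""
    return datetime.fromisoformat(timestamp).strftime("%Y-%m-%dT%H:%M")


def correlate(
    log_events: List[Dict[str, str]],
    latency_ms: Dict[str, float],
    min_errors: int = 2,
    latency_threshold_ms: float = 500.0,
) -> List[Tuple[str, int, float]]:
    """Return (minute, error_count, latency) where both signals are elevated."""
    errors_per_minute = Counter(
        minute_bucket(event["timestamp"])
        for event in log_events
        if event["level"] == "ERROR"
    )
    return sorted(
        (minute, errors_per_minute[minute], latency)
        for minute, latency in latency_ms.items()
        if errors_per_minute[minute] >= min_errors and latency >= latency_threshold_ms
    )


# Hypothetical log events and per-minute latency datapoints.
LOG_EVENTS = [
    {"timestamp": "2024-05-01T12:00:15", "level": "ERROR"},
    {"timestamp": "2024-05-01T12:00:40", "level": "ERROR"},
    {"timestamp": "2024-05-01T12:01:05", "level": "INFO"},
    {"timestamp": "2024-05-01T12:02:30", "level": "ERROR"},
]
LATENCY_MS = {
    "2024-05-01T12:00": 820.0,
    "2024-05-01T12:01": 120.0,
    "2024-05-01T12:02": 900.0,
}

if __name__ == "__main__":
    print(correlate(LOG_EVENTS, LATENCY_MS))
```

With the sample data, only the 12:00 minute qualifies: 12:02 has high latency but a single error, which falls below the `min_errors` threshold.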

## Add detection and alerts for anomalous patterns across environments
<a name="mon-add"></a>

 Proactively assess environments for known vulnerabilities and add detection for anomalous patterns of events and activities. Monitor for unusual activity or behavior related to users and workloads using tools such as [Amazon GuardDuty](https://aws.amazon.com/guardduty/), [Amazon CloudWatch ServiceLens](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ServiceLens.html), and [Amazon CloudWatch dashboards](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Dashboards.html). Start with patterns or indicators of unintended account usage or permissions, including any login activity to cloud management consoles, any changes or attempted changes to important cloud objects and data, and any creation, deletion, or modification of credentials or cryptographic keys. Detect incidents and patterns of denials of access, unidentified network traffic, atypical increases in cloud services costs, and unusual application traffic behavior. Configure Amazon CloudWatch alarms, GuardDuty, and SIEMs to initiate alerts and notifications using [Amazon Simple Notification Service](https://aws.amazon.com/sns/) (Amazon SNS). Identify anomalous behavior with [Amazon DevOps Guru](https://aws.amazon.com/devops-guru/), [AWS X-Ray Insights](https://docs.aws.amazon.com/xray/latest/devguide/xray-console-insights.html), and [Amazon CloudWatch Contributor Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ContributorInsights.html). 
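The starting indicators named above (console logins, credential and key changes, and denials of access) can be expressed as simple rules over CloudTrail records. The following is a minimal sketch that assumes each record is a dict with the standard `eventName` and `errorCode` fields. The selection of event names, the category labels, and the sample events are illustrative assumptions, not a complete detection set.

```python
from typing import Dict, List

# Event names to watch, mapped to the indicator they represent.
WATCHED_EVENTS = {
    "ConsoleLogin": "console login",
    "CreateAccessKey": "credential change",
    "DeleteAccessKey": "credential change",
    "UpdateAccessKey": "credential change",
    "ScheduleKeyDeletion": "cryptographic key change",
    "DisableKey": "cryptographic key change",
}

# Error codes that indicate a denied request.
DENIED_ERROR_CODES = {"AccessDenied", "Client.UnauthorizedOperation"}


def classify_event(event: Dict[str, str]) -> List[str]:
    """Return the indicator labels that apply to one CloudTrail record."""
    findings = []
    category = WATCHED_EVENTS.get(event.get("eventName", ""))
    if category:
        findings.append(category)
    if event.get("errorCode") in DENIED_ERROR_CODES:
        findings.append("access denied")
    return findings


# Hypothetical, heavily trimmed CloudTrail records.
SAMPLE_EVENTS = [
    {"eventName": "ConsoleLogin", "eventSource": "signin.amazonaws.com"},
    {"eventName": "DeleteAccessKey", "eventSource": "iam.amazonaws.com"},
    {"eventName": "GetObject", "eventSource": "s3.amazonaws.com", "errorCode": "AccessDenied"},
    {"eventName": "DescribeInstances", "eventSource": "ec2.amazonaws.com"},
]

if __name__ == "__main__":
    for sample in SAMPLE_EVENTS:
        print(sample["eventName"], classify_event(sample))
```

Rules like these are a starting point. Managed detection services apply broader and continuously updated logic, so treat hand-written rules as a complement to them.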

## Define, automate, and measure response and remediation
<a name="mon-def"></a>

 Establish expected behavior thresholds paired with business metrics to understand KPIs for workloads and environments. Determine appropriate incident and response actions to pursue. Use SIEM solutions to monitor workloads in real time, identify security issues, and expedite root-cause analysis. 

 Automations can be initiated by [several different triggers](https://docs.aws.amazon.com/systems-manager/latest/userguide/automation-executing-triggers.html), such as EventBridge, State Manager associations, and maintenance windows. By using triggers, you can run automations in response to a specific event or on a scheduled basis. Events can be derived from pattern matching using Amazon CloudWatch alarms or a SIEM. Take advantage of security orchestration, automation, and response (SOAR) platforms, and pair them with responses to recorded events that you build with tools like AWS Lambda. Maintain a process to continually improve mean time to identify (MTTI) root cause and mean time to respond (MTTR) to problems. Establish and measure goals to reduce the time to detect, identify, and remediate issues. This can also be done in conjunction with post-mortem or lessons learned procedures that align with your existing software development lifecycle or management practices. 
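To measure MTTI and MTTR you need consistent timestamps for each incident. The following sketch computes both as simple means over incident records. The field names (`detected_at`, `identified_at`, `responded_at`), the choice to measure both intervals from detection, and the sample incidents are assumptions made for illustration. Adapt the interval definitions to your own incident process.

```python
from datetime import datetime, timedelta
from statistics import mean
from typing import Dict, List


def mean_duration(incidents: List[Dict[str, str]], start_key: str, end_key: str) -> timedelta:
    """Mean elapsed time between two ISO 8601 timestamps across incidents."""
    seconds = [
        (
            datetime.fromisoformat(incident[end_key])
            - datetime.fromisoformat(incident[start_key])
        ).total_seconds()
        for incident in incidents
    ]
    return timedelta(seconds=mean(seconds))


# Hypothetical incident records.
INCIDENTS = [
    {
        "detected_at": "2024-05-01T10:00:00",
        "identified_at": "2024-05-01T10:20:00",
        "responded_at": "2024-05-01T10:30:00",
    },
    {
        "detected_at": "2024-05-02T14:00:00",
        "identified_at": "2024-05-02T14:40:00",
        "responded_at": "2024-05-02T15:30:00",
    },
]

if __name__ == "__main__":
    mtti = mean_duration(INCIDENTS, "detected_at", "identified_at")
    mttr = mean_duration(INCIDENTS, "detected_at", "responded_at")
    print("MTTI:", mtti)
    print("MTTR:", mttr)
```

Tracking these values per reporting period, rather than as a single all-time figure, makes it easier to see whether changes to runbooks and automation are reducing them.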

# AWS observability tools
<a name="aws-observability-tools"></a>

 The following AWS services can be used to help you meet the prescribed benefits of the M&G Guide: 

 [AWS CloudTrail](https://aws.amazon.com/cloudtrail/) provides event history of your AWS API activity, including actions taken through the AWS Management Console, AWS SDKs, command line tools, and other AWS services that you specifically enable. By default, AWS Control Tower uses AWS CloudTrail where it is enabled as a multi-account guardrail control, and stores control plane logs in a centralized account. Use the central account to store and analyze all trails. 

 [Amazon CloudWatch](https://aws.amazon.com/cloudwatch/) is a monitoring and observability service built for DevOps engineers, developers, site reliability engineers, and IT managers. CloudWatch provides you with data and actionable insights to monitor your applications, respond to system-wide performance changes, optimize resource utilization, and get a unified view of operational health. CloudWatch collects monitoring and operational data as logs, metrics, and events, providing you with a unified view of AWS resources, applications, and services that run on AWS and on-premises servers. CloudWatch should be used to integrate AWS service, resource, and application logs. 

 With [AWS X-Ray](https://aws.amazon.com/xray/), you can understand how your application and its underlying services are performing to identify and troubleshoot the root cause of performance issues and errors. X-Ray provides an end-to-end view of requests as they travel through your application, and shows a map of your application’s underlying components. You can use X-Ray to analyze both applications in development and in production, from simple three-tier applications to complex microservices applications consisting of thousands of services. 

 To visualize, query, and correlate your metrics, logs, and traces at scale, and to provide a deeper analysis of your observability data, we recommend [Amazon Managed Grafana](https://aws.amazon.com/grafana/). Developed in collaboration with Grafana Labs, Amazon Managed Grafana manages the provisioning, setup, scaling, and maintenance of Grafana servers, decreasing the need for you to manage the underlying infrastructure. Based on open source Grafana with enhanced features such as single sign-on support, Amazon Managed Grafana enables you to query, visualize, alert on, and understand your observability metrics, logs, and traces no matter where the data is stored, such as querying container metrics stored in [Amazon Managed Service for Prometheus](https://aws.amazon.com/prometheus/). 

 Amazon Managed Service for Prometheus is a fully managed, Prometheus-compatible service that enables you to securely ingest, store, and query metrics from container environments. Amazon Managed Service for Prometheus scales on demand, collecting and accessing performance and operational data from container workloads on AWS and on premises. With Amazon Managed Service for Prometheus, you can use the open source Prometheus query language (PromQL) to monitor the performance of containerized workloads without having to manage the underlying infrastructure. Amazon Managed Service for Prometheus automatically scales as your workloads grow or shrink, and uses AWS security services to enable fast and secure access to data. You can use Amazon Managed Service for Prometheus to collect and query metrics from AWS container services including Amazon Elastic Kubernetes Service (Amazon EKS) and Amazon Elastic Container Service (Amazon ECS), using AWS Distro for OpenTelemetry or Prometheus servers as the collection agents. 
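As a rough illustration of querying with PromQL, the following sketch builds the URL for an instant query against a workspace using the Prometheus-compatible `/api/v1/query` HTTP endpoint. The Region, the workspace ID, the helper name, and the metric in the PromQL expression are assumptions. The metric exists only if your collectors scrape it. Requests to the service must also be signed with AWS Signature Version 4, which this sketch deliberately leaves out.

```python
from urllib.parse import urlencode


def build_query_url(region: str, workspace_id: str, promql: str) -> str:
    """Build an instant-query URL for a workspace (request signing not included)."""
    base = f"https://aps-workspaces.{region}.amazonaws.com/workspaces/{workspace_id}"
    return f"{base}/api/v1/query?{urlencode({'query': promql})}"


# Hypothetical PromQL expression: CPU usage rate per namespace over 5 minutes.
EXAMPLE_PROMQL = "sum by (namespace) (rate(container_cpu_usage_seconds_total[5m]))"

if __name__ == "__main__":
    # "ws-EXAMPLE" is a placeholder workspace ID, not a real workspace.
    print(build_query_url("us-east-1", "ws-EXAMPLE", EXAMPLE_PROMQL))
```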

[Amazon OpenSearch Service](https://aws.amazon.com/opensearch-service/the-elk-stack/what-is-opensearch/) (successor to Amazon Elasticsearch Service) is a distributed, open-source search and analytics suite used for a broad set of use cases, such as real-time application monitoring, log analytics, and website search. Amazon OpenSearch Service provides a highly scalable system for providing fast access and response to large volumes of data with an integrated visualization tool, OpenSearch Dashboards, that makes it easy for users to explore their data. Like Elasticsearch and Apache Solr, OpenSearch Service is powered by the Apache Lucene search library. OpenSearch Service and OpenSearch Dashboards were originally derived from Elasticsearch 7.10.2 and Kibana 7.10.2.

 If you would like support implementing this guidance, or assisting you with building the foundational elements prescribed by the M&G Guide, we recommend you review the offerings provided by [AWS Professional Services](https://aws.amazon.com/professional-services/) or the AWS Partners in the [Built on Control Tower program](https://aws.amazon.com/controltower/partners/). 

 If you are seeking help to operate your workloads in AWS following this guidance, [AWS Managed Services (AMS)](https://aws.amazon.com/managed-services/) can augment your operational capabilities as a short-term accelerator or a long-term solution, letting you focus on transforming your applications and businesses in the cloud. 

# Integrated observability partners
<a name="integrated-observability-partners"></a>

 The M&G Guide recommends you consider the following questions when choosing an AWS Partner solution for observability: 
+  Does it continually monitor security risk? Does it provide or implement configuration changes across cloud environments? 
+  Does it offer threat detection, logging, and reports that align to your specific enterprise standards or regulatory compliance needs? 
+  Does it provide automation to address issues ranging from cloud service configurations to security settings as they relate to governance, compliance, and security for AWS resources? 
+  Does it highlight over-allocation of permissions and permissive traffic policies? 
+  Does it allow for interoperability between observability and automation? 

 The following integrated monitoring and observability AWS Partners have provided integrations that align to the M&G Guide, and are available for entitlement in AWS Marketplace: 

 [AppDynamics](https://aws.amazon.com/marketplace/pp/B06Y25R2BW?ref_=srh_res_product_title) is designed for production and pre-production environments, and gives you visibility into your entire application topology from a single pane of glass. It allows you to monitor and manage: 
+  End-to-end performance of complex distributed applications with Application Performance Management 
+  Real user monitoring and browser synthetic monitoring 
+  Insights from correlating server and database performance with application performance 
+  Real-time business awareness into IT operations, customer experience, and business outcomes with transaction, log, browser, and mobile analytics 

 [Datadog](https://aws.amazon.com/marketplace/solutions/control-tower/operational-intelligence/#Datadog) collects and unifies data streaming from complex AWS environments, with a one-click integration for pulling in metrics and tags from over 70 AWS services. You can deploy the Datadog Agent directly on your hosts and compute instances to collect metrics with greater granularity—down to one-second resolution. And with Datadog's out-of-the-box integration dashboards, you get not only a high-level view into the health of your infrastructure and applications but also a deeper visibility into individual services such as AWS Lambda and Amazon EKS. 

 [Dynatrace](https://aws.amazon.com/marketplace/solutions/control-tower/operational-intelligence/#Dynatrace) provides software intelligence to simplify cloud complexity. With automatic and intelligent observability at scale, it delivers precise answers about the performance of cloud platform environments. It seamlessly integrates with AWS Control Tower and securely governs AWS accounts as soon as they are created. A smart baselining capability adapts dynamically and monitors the performance of your environments in real time. 

 [ExtraHop Reveal(x) 360](https://aws.amazon.com/marketplace/solutions/control-tower/network-orchestration/#ExtraHop) provides multi-layered visibility, threat detection, and investigation in AWS via integrations with Amazon VPC Traffic Mirroring for packet-level visibility and VPC Flow Logs for broad coverage. ExtraHop is an AWS Security Competency Partner and offers a free trial of Reveal(x) 360. To learn more, see Reveal(x) 360 in the AWS Marketplace. 

 [New Relic One](https://aws.amazon.com/marketplace/solutions/control-tower/operational-intelligence/#NewRelic) includes a Telemetry Data Platform to ingest, analyze, and alert on your metrics, events, logs, and traces, full-stack observability to quickly visualize and troubleshoot your entire software stack in one connected experience, and applied intelligence to automatically detect anomalies, correlate issues, and reduce alert noise. 

 [Splunk Cloud](https://aws.amazon.com/marketplace/solutions/control-tower/siem/#Splunk) enables you to search, monitor, and analyze machine data from various sources to gain valuable intelligence and insights across your entire organization.

[Sysdig Monitor](https://aws.amazon.com/marketplace/pp/prodview-dq475uhgg4o6g) provides real-time, deep visibility into rapidly changing AWS Cloud and container environments. You can resolve issues faster using granular data derived from Linux system calls enriched with cloud and Kubernetes context along with Prometheus metrics. With Sysdig, cloud teams can optimize costs by visualizing capacity utilization across regions, services, and clusters.