

# Observability in Amazon OpenSearch Service
<a name="observability"></a>

Observability is the practice of understanding the internal state and performance of complex systems by examining their outputs. Traditional monitoring tells you a system is down; observability helps you understand why by letting you ask new questions about your telemetry data.

## What Amazon OpenSearch Service provides
<a name="observability-what"></a>

Amazon OpenSearch Service provides a unified observability solution by collecting, correlating, and visualizing three types of telemetry data:
+ **Logs** – Timestamped records of events.
+ **Traces** – End-to-end journey of requests through distributed services.
+ **Metrics** – Time-series measurements of system health and performance, queried through the Amazon Managed Service for Prometheus direct query integration.

By bringing these together in a single interface, Amazon OpenSearch Service helps operations teams, SREs, and developers detect, diagnose, and resolve issues faster.

## The Amazon OpenSearch Service approach to observability
<a name="observability-approach"></a>

Amazon OpenSearch Service differentiates itself in three key ways:
+ **OpenTelemetry-native with OpenSearch Ingestion as the last mile** – Standardize on OTel for instrumentation and collection. Amazon OpenSearch Ingestion serves as the fully managed pipeline that filters, enriches, transforms, and routes your telemetry data before indexing.
+ **Unified logs, traces, and metrics in OpenSearch UI** – Analyze all three signal types from a single observability workspace. Correlate a slow trace to its application logs, or overlay Prometheus metrics on your service dashboards.
+ **Purpose-driven query languages** – Use [Piped Processing Language (PPL)](https://observability.opensearch.org/docs/ppl/) for logs and traces, and PromQL for metrics. Each language is optimized for its signal type, giving you expressive querying without compromise.

![OpenTelemetry SDKs sending telemetry into Amazon OpenSearch Service](http://docs.aws.amazon.com/opensearch-service/latest/developerguide/images/otel-sdk-service.png)


**Note**  
The observability features described in this section are available only in OpenSearch UI. They are not available in OpenSearch Dashboards. For new observability workloads, we recommend setting up an OpenSearch application with an observability workspace. For more information, see [Using OpenSearch UI in Amazon OpenSearch Service](application.md).

# Get started
<a name="observability-get-started"></a>

Get your observability stack running on AWS and start sending telemetry data in minutes.

## Quick start
<a name="observability-get-started-quick"></a>

The fastest way to deploy an end-to-end observability stack on AWS is to use the CLI installer. It creates the following resources:
+ An Amazon OpenSearch Service domain
+ An Amazon Managed Service for Prometheus workspace
+ An Amazon OpenSearch Ingestion pipeline
+ An OpenSearch UI application with an observability workspace

Optionally, the installer launches an EC2 instance with the OpenTelemetry Demo for sample telemetry.

Run the following command to start the installation:

```
bash -c "$(curl -fsSL https://raw.githubusercontent.com/opensearch-project/observability-stack/main/install.sh)" -- --deployment-target=aws
```

The installation takes approximately 15 minutes.

## CDK deployment
<a name="observability-get-started-cdk"></a>

For infrastructure-as-code, use AWS CDK. The CDK deployment creates two stacks:


| Stack | What it creates | Deploy time | 
| --- | --- | --- | 
| ObsInfra | OpenSearch domain, Amazon Managed Service for Prometheus workspace, direct query data source, pipeline IAM role | 17 min | 
| ObservabilityStack | Fine-grained access control mapping, OpenSearch Ingestion pipeline, OpenSearch UI application, dashboard initialization, demo workload (optional) | 6 min | 

Run the following commands to deploy:

```
cd aws/cdk
npm install
cdk deploy --all
```
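If you haven't deployed CDK apps in this account and Region before, bootstrap the environment first. This is a standard one-time CDK requirement, not specific to the observability stack:

```
cdk bootstrap
```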

For more information, see the [CDK deployment README](https://github.com/opensearch-project/observability-stack/tree/main/aws/cdk) on GitHub.

## Sending telemetry
<a name="observability-get-started-send"></a>

Both deployment methods create an OpenSearch Ingestion endpoint that accepts OTLP data. Configure your OTel Collector to export using SigV4 authentication:

```
extensions:
  sigv4auth:
    region: us-west-2
    service: osis

exporters:
  otlphttp/logs:
    logs_endpoint: ${OSIS_ENDPOINT}/v1/logs
    auth: { authenticator: sigv4auth }
    compression: none
  otlphttp/traces:
    traces_endpoint: ${OSIS_ENDPOINT}/v1/traces
    auth: { authenticator: sigv4auth }
    compression: none
  otlphttp/metrics:
    metrics_endpoint: ${OSIS_ENDPOINT}/v1/metrics
    auth: { authenticator: sigv4auth }
    compression: none
```
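The exporters above must also be wired into collector pipelines before any data flows. The following sketch completes the configuration with an OTLP receiver and a `service` section; the receiver protocols shown are an assumption, so adjust them to match how your applications send data:

```
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:

service:
  extensions: [sigv4auth]
  pipelines:
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/logs]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/traces]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/metrics]
```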

**Note**  
The IAM principal sending data needs `osis:Ingest` permission on the pipeline ARN and `aps:RemoteWrite` permission on the Amazon Managed Service for Prometheus workspace.
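For reference, the following identity-based policy sketch grants both permissions. The ARNs are placeholders; scope them to your own pipeline and Prometheus workspace:

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "osis:Ingest",
      "Resource": "arn:aws:osis:region:account-id:pipeline/pipeline-name"
    },
    {
      "Effect": "Allow",
      "Action": "aps:RemoteWrite",
      "Resource": "arn:aws:aps:region:account-id:workspace/workspace-id"
    }
  ]
}
```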

## Learn more
<a name="observability-get-started-learn-more"></a>

Use the following resources to learn more about sending telemetry data:
+ [OpenTelemetry instrumentation guides (per-language)](https://observability.opensearch.org/docs/send-data/applications/)
+ [Infrastructure monitoring (AWS, Docker, Kubernetes, Prometheus)](https://observability.opensearch.org/docs/send-data/infrastructure/)
+ [OTel Collector configuration](https://observability.opensearch.org/docs/send-data/opentelemetry/collector/)
+ [Data pipeline and batching](https://observability.opensearch.org/docs/send-data/data-pipeline/)
+ [Overview of Amazon OpenSearch Ingestion](ingestion.md) in this guide

# Ingesting application telemetry
<a name="observability-ingestion"></a>

To use observability features in Amazon OpenSearch Service, you need to ingest application traces, logs, and metrics. This page covers configuring the OpenTelemetry Collector and OpenSearch Ingestion pipelines to process and route telemetry data to OpenSearch and Amazon Managed Service for Prometheus.

## Configuring the OpenTelemetry Collector
<a name="observability-ingestion-otel"></a>

The OpenTelemetry (OTel) Collector is the entry point for all application telemetry. It receives data through OTLP and routes traces and logs to OpenSearch Ingestion while sending metrics to Prometheus.

The OTel Collector exports traces and logs to an OpenSearch Ingestion endpoint using SigV4 authentication, and exports metrics to Amazon Managed Service for Prometheus using the Prometheus remote write exporter. OpenSearch Ingestion handles processing, enrichment, and routing to OpenSearch.

### OTel Collector configuration with OpenSearch Ingestion
<a name="observability-ingestion-otel-osis"></a>

The following example configuration uses SigV4 authentication to export traces and logs to an OpenSearch Ingestion endpoint, and metrics to Prometheus:

```
extensions:
  sigv4auth:
    region: us-west-2
    service: osis

exporters:
  otlphttp/osis-traces:
    traces_endpoint: ${OSIS_ENDPOINT}/v1/traces
    auth: { authenticator: sigv4auth }
    compression: none
  otlphttp/osis-logs:
    logs_endpoint: ${OSIS_ENDPOINT}/v1/logs
    auth: { authenticator: sigv4auth }
    compression: none
  # Amazon Managed Service for Prometheus via Prometheus Remote Write with SigV4 auth
  prometheusremotewrite/amp:
    endpoint: "https://aps-workspaces.region.amazonaws.com/workspaces/workspace-id/api/v1/remote_write"
    auth:
      authenticator: sigv4auth

service:
  extensions: [sigv4auth]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/osis-traces]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/osis-logs]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite/amp]
```

**Note**  
The IAM principal sending data needs `osis:Ingest` permission on the pipeline ARN and `aps:RemoteWrite` permission on the Amazon Managed Service for Prometheus workspace.

## Configuring OpenSearch Ingestion pipelines
<a name="observability-ingestion-pipelines"></a>

OpenSearch Ingestion (or self-managed Data Prepper) receives telemetry from the OTel Collector and processes it for application performance monitoring (APM).

### Pipeline architecture
<a name="observability-ingestion-pipeline-arch"></a>

The pipeline processes telemetry data in the following stages:

1. The entry pipeline receives all telemetry and routes logs and traces to separate sub-pipelines.

1. The log pipeline writes log data to OpenSearch using the `log-analytics-plain` index type.

1. The trace pipeline distributes spans to the raw storage pipeline and the service map pipeline.

1. The raw trace pipeline processes spans with the `otel_traces` processor and stores them in the `trace-analytics-plain-raw` index type.

1. The service map pipeline uses the `otel_apm_service_map` processor to generate topology and RED (Rate, Errors, Duration) metrics. It writes to OpenSearch and to Prometheus through remote write.

### Pipeline configuration
<a name="observability-ingestion-pipeline-config"></a>

The following example shows a complete pipeline configuration for OpenSearch Ingestion that covers all observability signal types – logs, traces, and metrics. You can include all pipelines or only the ones relevant to your use case. Replace the *placeholder* values with your own information.

```
version: '2'

# Main OTLP pipeline - receives all telemetry and routes by signal type
otlp-pipeline:
  source:
    otlp:
      logs_path: '/pipeline-name/v1/logs'
      traces_path: '/pipeline-name/v1/traces'
      metrics_path: '/pipeline-name/v1/metrics'
  route:
    - logs: 'getEventType() == "LOG"'
    - traces: 'getEventType() == "TRACE"'
    - metrics: 'getEventType() == "METRIC"'
  processor: []
  sink:
    - pipeline:
        name: otel-logs-pipeline
        routes:
          - logs
    - pipeline:
        name: otel-traces-pipeline
        routes:
          - traces
    - pipeline:
        name: otel-metrics-pipeline
        routes:
          - metrics

# Log processing pipeline
otel-logs-pipeline:
  source:
    pipeline:
      name: otlp-pipeline
  processor:
    - copy_values:
        entries:
          - from_key: "time"
            to_key: "@timestamp"
  sink:
    - opensearch:
        hosts:
          - 'https://opensearch-endpoint'
        index_type: log-analytics-plain
        aws:
          serverless: false
          region: 'region'
          sts_role_arn: "arn:aws:iam::account-id:role/pipeline-role"

# Trace fan-out pipeline
otel-traces-pipeline:
  source:
    pipeline:
      name: otlp-pipeline
  processor: []
  sink:
    - pipeline:
        name: traces-raw-pipeline
        routes: []
    - pipeline:
        name: service-map-pipeline
        routes: []

# Raw trace storage pipeline
traces-raw-pipeline:
  source:
    pipeline:
      name: otel-traces-pipeline
  processor:
    - otel_traces:
  sink:
    - opensearch:
        hosts:
          - 'https://opensearch-endpoint'
        index_type: trace-analytics-plain-raw
        aws:
          serverless: false
          region: 'region'
          sts_role_arn: "arn:aws:iam::account-id:role/pipeline-role"

# Service map generation pipeline (APM)
service-map-pipeline:
  source:
    pipeline:
      name: otel-traces-pipeline
  processor:
    - otel_apm_service_map:
        group_by_attributes:
          - telemetry.sdk.language # Add any resource attribute to group by
        window_duration: 30s
  route:
    - otel_apm_service_map_route: 'getEventType() == "SERVICE_MAP"'
    - service_processed_metrics: 'getEventType() == "METRIC"'
  sink:
    - opensearch:
        hosts:
          - 'https://opensearch-endpoint'
        aws:
          serverless: false
          region: 'region'
          sts_role_arn: "arn:aws:iam::account-id:role/pipeline-role"
        routes:
          - otel_apm_service_map_route
        index_type: otel-v2-apm-service-map
    - prometheus:
        url: 'https://aps-workspaces.region.amazonaws.com/workspaces/workspace-id/api/v1/remote_write'
        aws:
          region: 'region'
        routes:
          - service_processed_metrics

# Metrics processing pipeline
otel-metrics-pipeline:
  source:
    pipeline:
      name: otlp-pipeline
  processor:
    - otel_metrics:
  sink:
    - prometheus:
        url: 'https://aps-workspaces.region.amazonaws.com/workspaces/workspace-id/api/v1/remote_write'
        aws:
          region: 'region'
```

## Verifying ingestion
<a name="observability-ingestion-verify"></a>

After you configure the OTel Collector and pipelines, verify that telemetry data is flowing correctly.
+ **Verify OpenSearch indexes** – Confirm that the following indexes exist in your domain: `otel-v1-apm-span-*`, `otel-v2-apm-service-map`, and `logs-otel-v1-*`.
+ **Verify Prometheus targets** – Confirm that the Prometheus remote write target is receiving metrics from the service map pipeline.
+ **Verify in OpenSearch UI** – Navigate to **Observability**, then **Application Monitoring** to confirm that your services appear.
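As a quick spot check, you can also list the observability indexes from the Dev Tools console or any authenticated client. For example:

```
GET _cat/indices/otel-*,logs-otel-*?v
```

An empty response usually means telemetry hasn't reached the pipeline yet; recheck the collector's exporter endpoints and the IAM permissions on the sending principal.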

## Next steps
<a name="observability-ingestion-next"></a>

After you verify that telemetry data is ingested, explore the following topics:
+ [Application monitoring](observability-app-monitoring.md) – Monitor application health with service maps and RED metrics.
+ [Discover traces](observability-analyze-traces.md) – Discover and analyze distributed traces.
+ [Discover logs](observability-analyze-logs.md) – Discover and query application logs.
+ [Discover metrics](observability-metrics.md) – Discover and query Prometheus metrics using PromQL.

# Datasets
<a name="observability-datasets"></a>

Datasets are collections of indexes that represent a logical grouping of your observability data. You use datasets to organize log and trace data so that you can query and analyze related indexes together in the Discover experience. Each dataset maps to one or more indexes in your OpenSearch Service domain and defines the data type, time field, and query language for the Discover page.

## Dataset types
<a name="observability-datasets-types"></a>

The following table describes the dataset types that you can create.


| Type | Description | Query language | 
| --- | --- | --- | 
| Logs | Groups one or more log indexes for querying and visualization in the Discover Logs page. | PPL | 
| Traces | Groups trace span indexes for querying and visualization in the Discover Traces page. | PPL | 

**Note**  
Metrics do not require a dataset because metric data is not stored in OpenSearch. Metrics are queried directly from Amazon Managed Service for Prometheus using PromQL.

## To create a logs dataset
<a name="observability-datasets-create-logs"></a>

Complete the following steps to create a logs dataset in OpenSearch UI.

1. In your observability workspace, expand **Discover** in the left navigation and choose **Logs**.

1. Choose **Create dataset**.

1. Select a data source from the list of available OpenSearch Service connections.  
![Select a data source](http://docs.aws.amazon.com/opensearch-service/latest/developerguide/images/datasets/datasets-select-data-source.png)

1. Configure the dataset by entering a name, selecting the index, and specifying the timestamp field.  
![Configure a logs dataset](http://docs.aws.amazon.com/opensearch-service/latest/developerguide/images/datasets/datasets-configure-logs.png)

1. Choose **Create dataset** to save the configuration.

## To create a traces dataset
<a name="observability-datasets-create-traces"></a>

Complete the following steps to create a traces dataset in OpenSearch UI.

1. In your observability workspace, expand **Discover** in the left navigation and choose **Traces**.

1. Choose **Create dataset**.

1. Select a data source from the list of available OpenSearch Service connections.

1. Configure the dataset by entering a name, selecting the span index, and specifying the timestamp field.  
![Configure a traces dataset](http://docs.aws.amazon.com/opensearch-service/latest/developerguide/images/datasets/datasets-configure-traces.png)

1. Choose **Create dataset** to save the configuration.

## To view datasets
<a name="observability-datasets-view"></a>

You can view all configured datasets from the dataset selector on the Discover Logs or Discover Traces page. The dataset list shows the name, type, data source, and timestamp field for each dataset.

![List of configured datasets](http://docs.aws.amazon.com/opensearch-service/latest/developerguide/images/datasets/datasets-list.png)


## Analyzing datasets in Discover
<a name="observability-datasets-analyze"></a>

After you create a dataset, you can analyze it in the corresponding Discover page.

### Logs
<a name="observability-datasets-analyze-logs"></a>

Select a logs dataset from the dataset selector on the Discover Logs page to query and visualize your log data using PPL. For more information, see [Discover Logs](observability-analyze-logs.md).

### Traces
<a name="observability-datasets-analyze-traces"></a>

Select a traces dataset from the dataset selector on the Discover Traces page to explore trace spans, view RED metrics, and drill into individual traces. For more information, see [Discover Traces](observability-analyze-traces.md).

# Discover Logs
<a name="observability-analyze-logs"></a>

The Discover Logs page provides a dedicated interface for exploring and analyzing log data in your OpenSearch Service observability workspace. You can write PPL queries to filter and aggregate log data, create visualizations directly from query results, and add those visualizations to dashboards. The page also provides natural language query assistance powered by the OpenSearch AI assistant.

## To access the Logs page
<a name="observability-logs-access"></a>

In your observability workspace, expand **Discover** in the left navigation and choose **Logs**.

## Exploring log data
<a name="observability-logs-explore"></a>

The Discover Logs interface provides the following components for exploring your log data.

![Discover Logs interface](http://docs.aws.amazon.com/opensearch-service/latest/developerguide/images/discover-logs/discover-logs-interface.png)

+ **Dataset selector** – Choose the logs dataset that you want to query. Each dataset maps to one or more indexes in your OpenSearch Service domain.
+ **Query editor** – Write PPL queries to filter, aggregate, and transform your log data. The editor provides autocomplete suggestions and syntax highlighting.
+ **Time filter** – Specify the time range for your query results. You can choose a relative range or specify absolute start and end times.
+ **Results panel** – View query results as a table of log events. You can expand individual events to see all fields.
+ **Histogram** – View the distribution of log events over time. The histogram updates automatically based on your query and time filter.
+ **Fields panel** – Browse available fields in your dataset and add them as columns to the results table.

## Querying logs using PPL
<a name="observability-logs-query-ppl"></a>

Piped Processing Language (PPL) is a query language that uses pipe-based (`|`) syntax for chaining commands. You can use PPL to filter, aggregate, and transform your log data.

### Basic queries
<a name="observability-logs-basic-queries"></a>

To retrieve all log events from a dataset, use the `source` command:

```
source = my-logs-dataset
```

To limit the number of results, use the `head` command:

```
source = my-logs-dataset | head 20
```

### Filtering with WHERE
<a name="observability-logs-where-clause"></a>

Use the `where` clause to filter log events based on field values:

```
source = my-logs-dataset | where severity_text = 'ERROR'
```

You can combine multiple conditions:

```
source = my-logs-dataset |
    where severity_text = 'ERROR' and service_name = 'payment-service'
```
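You can also filter on message content. For example, assuming your dataset stores the log message in a `body` field, the `like` function matches SQL-style wildcards:

```
source = my-logs-dataset | where like(body, '%timeout%')
```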

### Managing queries
<a name="observability-logs-manage-queries"></a>

You can save frequently used queries for reuse. To save a query, choose **Save** in the query editor toolbar and enter a name for the query. To load a saved query, choose **Open** and select the query from the list.

For the complete list of PPL commands and functions, see the [Piped Processing Language reference](https://observability.opensearch.org/docs/ppl/).

## Creating visualizations from logs
<a name="observability-logs-visualizations"></a>

You can create visualizations directly from your PPL query results. Use the `stats` command to aggregate data for visualization:

```
source = my-logs-dataset |
    stats count() as error_count by service_name, span(timestamp, 1h)
```

After you run a `stats` query, choose the **Visualization** tab to see the results as a chart.

![Visualization created from a PPL stats query](http://docs.aws.amazon.com/opensearch-service/latest/developerguide/images/discover-logs/discover-logs-visualization.png)


### Visualization types
<a name="observability-logs-viz-types"></a>

The following table describes the visualization types that you can use.


| Type | Description | 
| --- | --- | 
| Line | Displays data points connected by lines, useful for showing trends over time. | 
| Area | Similar to a line chart with the area under the line filled in, useful for showing volume over time. | 
| Bar | Displays data as vertical or horizontal bars, useful for comparing values across categories. | 
| Metric | Displays a single numeric value, useful for showing key performance indicators. | 
| State timeline | Displays state changes over time as colored bands, useful for monitoring status transitions. | 
| Heatmap | Displays data as a matrix of colored cells, useful for showing density and patterns. | 
| Bar gauge | Displays a single value as a filled bar within a range, useful for showing progress toward a threshold. | 
| Pie | Displays data as proportional slices of a circle, useful for showing composition. | 

![Available visualization types](http://docs.aws.amazon.com/opensearch-service/latest/developerguide/images/discover-logs/discover-logs-viz-types.png)


### Visualization settings
<a name="observability-logs-viz-settings"></a>

When the **Visualization** tab is active, a settings panel appears on the right side of the screen. Use this panel to configure the chart type, map fields to axes, and customize visual styles such as colors and legends.

To switch the axes of a visualization, use the axis configuration in the settings panel.

![Axis configuration in the visualization settings panel](http://docs.aws.amazon.com/opensearch-service/latest/developerguide/images/discover-logs/discover-logs-switch-axes.png)


## Adding visualizations to dashboards
<a name="observability-logs-add-to-dashboard"></a>

After you create a visualization, you can add it to a dashboard for ongoing monitoring. Choose **Save to dashboard** in the visualization toolbar, then select an existing dashboard or create a new one. The visualization is saved with its underlying PPL query so that it refreshes automatically when you open the dashboard.

# Discover Traces
<a name="observability-analyze-traces"></a>

The Discover Traces page provides a dedicated interface for exploring distributed trace data in your OpenSearch Service observability workspace. You can view RED metrics (rate, error rate, duration) for your services, browse trace spans with faceted filtering, and drill into individual spans and traces to diagnose performance issues. The page also supports correlating traces with related log data.

## To access the Traces page
<a name="observability-traces-access"></a>

In your observability workspace, expand **Discover** in the left navigation and choose **Traces**.

![Discover Traces page](http://docs.aws.amazon.com/opensearch-service/latest/developerguide/images/discover-traces/discover-traces.png)


## Configuring trace datasets
<a name="observability-traces-configure-datasets"></a>

Before you can explore trace data, you must configure a traces dataset. You can create a dataset automatically or manually.

### Automatic dataset creation
<a name="observability-traces-auto-create"></a>

When you navigate to the Discover Traces page for the first time and trace data exists in your domain, the page prompts you to create a dataset automatically. Choose **Create dataset** to accept the default configuration.

![Automatic trace dataset creation prompt](http://docs.aws.amazon.com/opensearch-service/latest/developerguide/images/discover-traces/trace-auto-create.png)


### Manual dataset creation
<a name="observability-traces-manual-create"></a>

To manually create a traces dataset, follow the steps in [To create a traces dataset](observability-datasets.md#observability-datasets-create-traces). Manual creation gives you control over the index pattern, timestamp field, and dataset name.

## Exploring trace data
<a name="observability-traces-explore"></a>

The Discover Traces page provides the following components for exploring your trace data.
+ **RED metrics** – View rate (requests per second), error rate (percentage of failed requests), and duration (latency percentiles) for the selected dataset. These metrics update based on your time filter.
+ **Faceted fields** – Filter trace spans by service name, operation, status code, and other span attributes. Select values in the faceted fields panel to narrow your results.
+ **Span table** – Browse individual spans with columns for trace ID, span ID, service name, operation, duration, and status. You can sort by any column and expand rows to see span details.

## Viewing a specific span
<a name="observability-traces-view-span"></a>

To view details for a specific span, choose the span row in the span table. A flyout panel opens with the span attributes, resource attributes, and event information.

![Span details flyout panel](http://docs.aws.amazon.com/opensearch-service/latest/developerguide/images/discover-traces/trace-details-flyout.png)


## Trace detail page
<a name="observability-traces-detail-page"></a>

To view the complete trace, choose the trace ID link in the span table or flyout panel. The trace detail page displays a waterfall chart showing all spans in the trace, their timing relationships, and the overall trace duration. You can expand individual spans to view their attributes and identify bottlenecks.

![Trace detail page with waterfall chart](http://docs.aws.amazon.com/opensearch-service/latest/developerguide/images/discover-traces/trace-detail-page.png)


## Correlating traces with logs
<a name="observability-traces-correlate-logs"></a>

When you configure a correlation between a traces dataset and a logs dataset, you can view related log entries directly from the Discover Traces page. For information about creating correlations, see [Correlations](observability-correlations.md).

### Viewing related logs
<a name="observability-traces-related-logs"></a>

In the span details flyout or trace detail page, choose the **Related logs** tab to view log entries that match the span's trace ID, service name, and time range. This correlation helps you understand what happened in your application during the span execution.

![Related logs tab for a span](http://docs.aws.amazon.com/opensearch-service/latest/developerguide/images/discover-traces/related-logs.png)


### Log redirection with context
<a name="observability-traces-log-redirection"></a>

You can navigate from a trace span directly to the Discover Logs page with the relevant context preserved. Choose **View in Logs** from the related logs panel to open the Discover Logs page with the query pre-populated to filter by the span's trace ID and time range.

![Redirection from a trace span to Discover Logs](http://docs.aws.amazon.com/opensearch-service/latest/developerguide/images/discover-traces/logs-redirection.png)


## Querying traces using PPL
<a name="observability-traces-querying"></a>

You can use PPL to query trace data directly. PPL chains commands using the pipe character to filter, transform, and aggregate span data.

The following example finds the 10 slowest traces:

```
source = otel-v1-apm-span-*
| where durationInNanos > 5000000000
| fields traceId, serviceName, name, durationInNanos
| sort - durationInNanos
| head 10
```

The following example counts errors by service:

```
source = otel-v1-apm-span-*
| where status.code = 2
| stats count() as errorCount by serviceName
| sort - errorCount
```

The following example finds traces for a specific service:

```
source = otel-v1-apm-span-*
| where serviceName = 'checkout-service'
| where parentSpanId = ''
| sort - startTime
| head 20
```
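Building on the same span index, the following example computes average latency per service, which is useful for spotting consistently slow services rather than individual slow traces:

```
source = otel-v1-apm-span-*
| stats avg(durationInNanos) as avgDurationNanos by serviceName
| sort - avgDurationNanos
```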

# Metrics
<a name="observability-metrics"></a>

The Discover Metrics page in OpenSearch UI provides a dedicated interface for discovering, querying, and visualizing time-series metric data. This page is optimized for working with Prometheus metrics using PromQL.

![Discover Metrics page](http://docs.aws.amazon.com/opensearch-service/latest/developerguide/images/dashboards/prometheus.png)


The Discover Metrics page is available in Observability workspaces. To access it, navigate to an Observability workspace, expand **Discover** in the left navigation, and choose **Metrics**.

## Configuring a Prometheus data source
<a name="observability-metrics-data-source"></a>

Before you start, configure a Prometheus data source using one of the following methods:
+ [Creating an Amazon Managed Service for Prometheus data source](direct-query-prometheus-creating.md) in the AWS Management Console
+ [AWS CLI](https://docs.aws.amazon.com/cli/latest/reference/opensearch/add-direct-query-data-source.html)

## Query panel
<a name="observability-metrics-query"></a>

You can write and run metric queries in the query panel at the top of the Discover Metrics page. The query editor provides autocomplete suggestions and syntax highlighting for PromQL.

**Writing queries**  
Write queries using PromQL syntax. For example:

```
up{job="prometheus"}
```

**Running queries**  
To run a query, enter your query in the query editor and select **Refresh**.

You can run multiple PromQL queries together by separating them with a semicolon (`;`):

```
up{job="prometheus"};
node_cpu_seconds_total{mode="idle"};
```

Each query runs independently, and the results are combined in the output.
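Counters such as request or CPU-second totals only ever increase, so most dashboard queries wrap them in `rate()` to produce a per-second value. For example, the following computes non-idle CPU usage over a 5-minute window, assuming the standard node exporter metric is present in your workspace:

```
rate(node_cpu_seconds_total{mode!="idle"}[5m])
```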

## Time filter
<a name="observability-metrics-time-filter"></a>

Use the time filter to specify the time range for your metric data:
+ **Quick select** – Choose a relative time range (for example, the last 15 minutes or the last 1 hour).
+ **Commonly used** – Select from predefined time ranges.
+ **Custom** – Specify absolute start and end times.
+ **Auto-refresh** – Set an automatic refresh interval.

## Viewing results
<a name="observability-metrics-results"></a>

After you run a query, the results are displayed in a tabbed interface:
+ **Metrics** – Displays the latest data point for each series in a table format.
+ **Raw** – Shows the latest data point for each series as raw JSON returned by the data source.
+ **Visualization** – Provides interactive charts for your metric data.

## Configuring visualizations
<a name="observability-metrics-visualizations"></a>

When the **Visualization** tab is selected, a settings panel appears on the right side of the screen. Use this panel to:
+ **Select a chart type** – Choose from line, bar, pie, gauge, or table visualizations.
+ **Map axes** – Assign fields to the X and Y axes.
+ **Customize styles** – Adjust colors, legends, gridlines, and other visual options.

When you modify the settings, the visualization is updated automatically.

# Correlations
<a name="observability-correlations"></a>

Correlations link a traces dataset to a logs dataset so that you can view related log entries when you investigate trace spans. By defining a correlation, you enable the Discover Traces page to display logs that occurred during a span's execution, helping you diagnose issues faster without switching between pages.

## Correlation requirements
<a name="observability-correlations-requirements"></a>

To create a correlation, your log and trace data must contain matching fields. The following table describes the fields that the correlation uses to join trace and log data.


| Field | Description | Required | 
| --- | --- | --- | 
| Trace ID | The unique identifier for the trace. Must exist in both the trace span index and the log index. | Yes | 
| Span ID | The unique identifier for the span. Used to match logs to a specific span within a trace. | No | 
| Service name | The name of the service that generated the telemetry. Used to filter related logs by service. | No | 
| Timestamp | The time field used to scope related logs to the span's time range. | Yes | 
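
Conceptually, the correlation scopes related logs to a span by matching the trace ID and constraining the timestamp to the span's time window. The following sketch illustrates that matching logic with hypothetical field names; in practice the join is performed by OpenSearch UI, not by your code.

```python
from datetime import datetime

def related_logs(span, logs):
    """Return log entries that share the span's trace ID and fall
    inside the span's time window (hypothetical field names)."""
    start, end = span["startTime"], span["endTime"]
    return [
        log for log in logs
        if log["traceId"] == span["traceId"] and start <= log["timestamp"] <= end
    ]

span = {
    "traceId": "abc123",
    "startTime": datetime(2025, 1, 1, 12, 0, 0),
    "endTime": datetime(2025, 1, 1, 12, 0, 5),
}
logs = [
    {"traceId": "abc123", "timestamp": datetime(2025, 1, 1, 12, 0, 2), "message": "charge failed"},
    {"traceId": "abc123", "timestamp": datetime(2025, 1, 1, 12, 1, 0), "message": "outside window"},
    {"traceId": "zzz999", "timestamp": datetime(2025, 1, 1, 12, 0, 3), "message": "different trace"},
]
# Only the first entry matches both the trace ID and the time window.
matches = related_logs(span, logs)
```

This is why the trace ID and timestamp fields are required: without either one, related logs cannot be scoped to a specific span.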

## To create a trace-to-logs correlation
<a name="observability-correlations-create"></a>

Complete the following steps to create a correlation between a traces dataset and a logs dataset.

1. In your observability workspace, expand **Discover** in the left navigation and choose **Traces**.

1. Select the traces dataset that you want to correlate.

1. Choose the **Correlations** tab in the dataset configuration panel.  
![\[alt text not found\]](http://docs.aws.amazon.com/opensearch-service/latest/developerguide/images/datasets/correlations-trace-dataset-tab.png)

1. Choose **Create correlation**.

1. In the configuration dialog, select the target logs dataset and map the required correlation fields (trace ID and timestamp). Optionally, map span ID and service name for more precise matching.  
![\[alt text not found\]](http://docs.aws.amazon.com/opensearch-service/latest/developerguide/images/datasets/correlations-configure-dialog.png)

1. Choose **Create** to save the correlation.

1. Verify that the correlation appears in the correlations table.  
![\[alt text not found\]](http://docs.aws.amazon.com/opensearch-service/latest/developerguide/images/datasets/correlations-created-table.png)

## Viewing correlations in logs datasets
<a name="observability-correlations-view-logs"></a>

After you create a correlation, you can also view it from the logs dataset side. Navigate to the Discover Logs page, select the correlated logs dataset, and choose the **Correlations** tab to see the linked traces dataset.

![\[alt text not found\]](http://docs.aws.amazon.com/opensearch-service/latest/developerguide/images/datasets/correlations-logs-dataset-tab.png)


## Using correlations in the Traces page
<a name="observability-correlations-use-traces"></a>

When a correlation exists, the Discover Traces page displays related logs in the span details view. Choose a span in the span table to open the details flyout, then choose the **Related logs** tab to view correlated log entries.

![\[alt text not found\]](http://docs.aws.amazon.com/opensearch-service/latest/developerguide/images/datasets/correlations-span-details-logs.png)


## Managing correlations
<a name="observability-correlations-manage"></a>

You can edit or remove correlations from the **Correlations** tab of either the traces or logs dataset.
+ **Editing** – Choose the correlation in the table and choose **Edit** to update the field mappings or target dataset.
+ **Removing** – Choose the correlation in the table and choose **Delete** to remove the correlation. Removing a correlation does not delete any data.

# Dashboards
<a name="observability-dashboards"></a>

Dashboards combine visualizations from logs, traces, and metrics into a single view. You can use dashboards to monitor operational health, respond to incidents, and track resource utilization across your distributed system.

The following table describes common use cases for dashboards.


| Use case | Example | 
| --- | --- | 
| Operational monitoring | Track service health, throughput, and error rates in real time. | 
| Incident response | Correlate logs, traces, and metrics during an active incident. | 
| Capacity planning | Monitor resource utilization trends to plan for scaling. | 
| Availability tracking | Measure uptime and availability against service-level objectives. | 
| Post-incident review | Analyze historical data to understand the root cause of past incidents. | 

![\[alt text not found\]](http://docs.aws.amazon.com/opensearch-service/latest/developerguide/images/otel-dashboard.png)


## Dashboard structure
<a name="observability-dashboards-structure"></a>

A dashboard is a collection of panels arranged on a grid. Each panel consists of the following components.
+ **Data source** – The OpenSearch index or Amazon Managed Service for Prometheus data source that the panel queries.
+ **Query** – A PPL or PromQL query that retrieves the data for the panel.
+ **Visualization type** – The chart type used to render the query results, such as line, bar, or metric value.
+ **Optional configuration** – Axes, legends, thresholds, and formatting options.

The time range picker at the top of the dashboard applies to all panels. You can override the time range for individual panels when needed.

## Building dashboards from Discover
<a name="observability-dashboards-discover"></a>

The recommended workflow for building dashboards starts in Discover. This workflow is consistent across logs, traces, and metrics.

1. **Query your data in Discover** – Navigate to Discover Logs, Discover Traces, or Discover Metrics and write a query using PPL (for logs and traces) or PromQL (for metrics).

1. **Build a visualization** – When your query returns results, use the visualization tab to choose a chart type and configure the display. For log and trace queries, aggregation commands like `stats` automatically switch to the visualization view.

1. **Save to a dashboard** – Choose **Add to dashboard** to save the visualization to a new or existing dashboard. The panel stays live, updating as new data arrives.

1. **Iterate** – Repeat for each question you want the dashboard to answer. When something looks wrong on a dashboard, choose any panel to open the underlying query in Discover for further investigation.

**Important**  
Visualizations created through the **Visualizations** page in OpenSearch UI use DQL (Dashboards Query Language) and query DSL (domain-specific language), which do not support Piped Processing Language (PPL) at this time. To create PPL-based visualizations, use the Discover workflow described above.

## Dashboard filters
<a name="observability-dashboards-filters"></a>

Filters let you narrow the data displayed across all panels on a dashboard without editing individual queries.

**To add a filter**

1. Open the dashboard you want to filter.

1. Choose **Add filter** in the filter bar.

1. Select a field name from the dropdown list.

1. Select an operator and enter a value.

1. Choose **Save**.

The following table describes common filter use cases.


| Scenario | Field | Operator | Value | 
| --- | --- | --- | --- | 
| View a single environment | environment | is | production | 
| Isolate errors | status_code | is greater than or equal to | 400 | 
| Focus on a specific service | service.name | is | order-service | 
| Exclude health checks | http.url | is not | /health | 

**Pinned compared to unpinned filters** – A pinned filter persists when you navigate between dashboards. An unpinned filter applies only to the current dashboard. To pin a filter, choose the pin icon next to the filter badge.
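
Filters compile down to the same constraints you could write in a query. For reference, the "production errors" narrowing from the table above might look like the following in PPL; the field names are illustrative and depend on your index mapping:

```
source = logs-dataset
| where environment = 'production' and status_code >= 400
| where `http.url` != '/health'
```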

## Building dashboards
<a name="observability-dashboards-build"></a>

### Visualization types
<a name="observability-dashboards-build-viz-types"></a>

The following table describes the visualization types available for dashboard panels.


| Type | Use case | 
| --- | --- | 
| Line | Trends over time, such as request rates or latency | 
| Area | Volume over time with stacked breakdowns | 
| Bar | Comparing values across categories | 
| Horizontal bar | Ranked comparisons, such as top services by error count | 
| Data table | Tabular data with sorting and pagination | 
| Metric value | Single key performance indicators, such as total requests | 
| Gauge | Progress toward a threshold, such as CPU utilization | 
| Pie | Composition and proportions, such as traffic by region | 
| Heat map | Density and distribution patterns over two dimensions | 
| Tag cloud | Relative frequency of terms, such as common error messages | 

### Configuring panels
<a name="observability-dashboards-build-configure"></a>

Each panel has a query editor where you write PPL or PromQL queries. The following examples show common panel queries.

Error count by service (PPL):

```
source = logs-dataset
| where severity_text = 'ERROR'
| stats count() as error_count by service_name, span(timestamp, 5m)
```

CPU utilization rate (PromQL):

```
rate(container_cpu_usage_seconds_total{namespace="production"}[5m])
```

You can also configure the following panel options.
+ **Axes** – Set axis labels, scales (linear or logarithmic), and value ranges.
+ **Legends** – Control legend position and which series to display.
+ **Thresholds** – Add horizontal threshold lines to highlight warning or critical levels.

### Layout tips
<a name="observability-dashboards-build-layout"></a>

Use the following tips to organize your dashboard panels effectively.
+ Place high-level summary panels (metric values, gauges) at the top of the dashboard.
+ Group related panels together, such as all panels for a single service.
+ Use consistent widths for panels in the same row.
+ Drag panel edges to resize, and drag panel headers to reposition.

### Recommended layouts
<a name="observability-dashboards-build-recommended"></a>

The following tables describe recommended panel layouts for common dashboard types.

**Service health dashboard**


| Panel | Visualization type | 
| --- | --- | 
| Request rate | Line | 
| Error rate | Line | 
| P99 latency | Line | 
| Active alerts | Metric value | 
| Top errors by service | Horizontal bar | 

**Incident response dashboard**


| Panel | Visualization type | 
| --- | --- | 
| Error logs | Data table | 
| Error count over time | Area | 
| Affected services | Pie | 
| Latency spikes | Line | 

**Resource utilization dashboard**


| Panel | Visualization type | 
| --- | --- | 
| CPU utilization | Gauge | 
| Memory usage over time | Area | 
| Disk I/O | Line | 
| Network throughput | Line | 

### Time range controls
<a name="observability-dashboards-build-time-range"></a>

The time range picker at the top of the dashboard controls the time window for all panels. You can select a preset range (such as **Last 15 minutes** or **Last 24 hours**) or specify a custom absolute range.

To enable auto-refresh, choose the refresh interval dropdown next to the time range picker and select an interval. Auto-refresh re-runs all panel queries at the specified interval so that your dashboard displays the latest data.

## Sharing dashboards
<a name="observability-dashboards-sharing"></a>

You can share dashboards with other users in your organization through URLs, snapshots, and exports.

### Share through URL
<a name="observability-dashboards-sharing-url"></a>

Copy the dashboard URL from your browser address bar and share it directly. The URL preserves the current time range and filters. You can include dashboard links in bookmarks, runbooks, or incident response documentation.

### Snapshots
<a name="observability-dashboards-sharing-snapshots"></a>

A snapshot captures the current state of a dashboard, including all panel data, at a specific point in time. Snapshots are read-only and do not update when the underlying data changes. Use snapshots to preserve a record of dashboard state during incidents or reviews.

### Import and export definitions
<a name="observability-dashboards-sharing-import-export"></a>

You can export a dashboard definition as JSON and import it into another workspace or environment. This approach is useful for promoting dashboards from development to production or sharing standard layouts across teams.
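
If your workspace exposes the standard OpenSearch Dashboards saved objects API, one way to script the export and import is sketched below. The endpoints are placeholders, and your environment may require different authentication:

```
# Export all dashboard definitions as NDJSON (placeholder endpoint).
curl -X POST "https://<dashboards-endpoint>/api/saved_objects/_export" \
  -H "osd-xsrf: true" \
  -H "Content-Type: application/json" \
  -d '{"type": "dashboard"}' > dashboards.ndjson

# Import the definitions into another workspace.
curl -X POST "https://<other-endpoint>/api/saved_objects/_import" \
  -H "osd-xsrf: true" \
  -F "file=@dashboards.ndjson"
```

Storing the exported NDJSON in version control pairs naturally with the best practices below.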

### Best practices for sharing
<a name="observability-dashboards-sharing-best-practices"></a>
+ **Audience** – Design dashboards for a specific audience, such as on-call engineers or leadership.
+ **Focus** – Limit each dashboard to a single purpose or workflow.
+ **Conventions** – Use consistent naming conventions for dashboards and panels across your organization.
+ **Version control** – Export dashboard JSON definitions and store them in version control to track changes over time.

## Troubleshooting dashboards
<a name="observability-dashboards-troubleshooting"></a>

This section describes common dashboard issues and how to resolve them.

### No data in a panel
<a name="observability-dashboards-troubleshooting-no-data"></a>

If a panel displays no data, check the following common causes.


| Cause | Check | Fix | 
| --- | --- | --- | 
| Time range too narrow | Verify that the dashboard time range covers the period when data was ingested. | Expand the time range or select Last 24 hours. | 
| Active filter excluding data | Review the filter bar for filters that might exclude all matching documents. | Remove or adjust the filter, then verify that data appears. | 
| Incorrect index pattern | Confirm that the panel data source points to an index that contains data. | Update the data source to the correct index pattern in the panel editor. | 
| Query syntax error | Look for error messages in the panel header or query editor. | Correct the PPL or PromQL syntax and re-run the query. | 

### Wrong data in a panel
<a name="observability-dashboards-troubleshooting-wrong-data"></a>

If a panel displays unexpected results, try the following steps.
+ Verify that the query returns the expected fields by running it in Discover first.
+ Check that the visualization type matches the data shape (for example, use a line chart for time-series data).
+ Confirm that the correct data source is selected in the panel editor.

### Stale data
<a name="observability-dashboards-troubleshooting-stale-data"></a>

If dashboard panels display outdated information, try the following steps.
+ Choose the refresh icon in the toolbar to manually refresh all panels.
+ Verify that auto-refresh is enabled and set to an appropriate interval.
+ Confirm that your ingestion pipeline is actively sending data to the configured indexes.

### Performance issues
<a name="observability-dashboards-troubleshooting-performance"></a>

The following tips can help you resolve common performance issues.
+ **Slow dashboard** – Reduce the number of panels or narrow the time range. Dashboards with many panels run multiple queries simultaneously, which can increase load times.
+ **Slow panel** – Simplify the panel query. Avoid using wildcard patterns in PPL `where` clauses and limit the number of aggregation buckets.
+ **Browser lag** – Reduce the data density in visualizations. For example, increase the time span interval in `stats` commands to produce fewer data points.
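
For example, widening the `span` interval in a `stats` command reduces the number of points each panel renders. Over a 24-hour window, the first query below produces 1,440 one-minute buckets, while the second produces only 48:

```
source = logs-dataset
| stats count() by span(timestamp, 1m)

source = logs-dataset
| stats count() by span(timestamp, 30m)
```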

### Filter issues
<a name="observability-dashboards-troubleshooting-filters"></a>

If filters do not behave as expected, try the following steps.
+ Verify that the field name in the filter matches the field name in the index mapping.
+ Check whether a pinned filter from another dashboard is affecting results.
+ Remove all filters and add them back one at a time to isolate the issue.

### Inspect a panel
<a name="observability-dashboards-troubleshooting-inspect"></a>

The panel inspector helps you debug data and query issues. To open the inspector, choose the panel menu (three dots) and select **Inspect**. The inspector provides the following tabs.
+ **Data** – Displays the raw data returned by the query in tabular format.
+ **Request** – Shows the query sent to the data source, including the full PPL or PromQL statement.
+ **Response** – Shows the raw response from the data source, including timing and status information.

### Browser developer tools
<a name="observability-dashboards-troubleshooting-browser"></a>

For advanced troubleshooting, use your browser developer tools to inspect network requests. Open the **Network** tab, filter for API calls, and look for failed requests or slow responses. Check the response body for error messages that can help you identify the root cause.

# Application monitoring
<a name="observability-app-monitoring"></a>

Application monitoring provides a real-time view of how your services are performing. It combines topology data stored in OpenSearch with time-series RED metrics (Rate, Errors, Duration) from Amazon Managed Service for Prometheus to surface health, latency, throughput, and error information across your distributed system.

To access application monitoring, in OpenSearch UI navigate to **Observability** > **Application Monitoring**. The sidebar shows two views:
+ **Application Map** – Interactive topology graph of service dependencies
+ **Services** – Catalog of all instrumented services with filtering, detail views, and correlation links

## Prerequisites
<a name="observability-app-monitoring-prereqs"></a>

Before you can use application monitoring, you must have the following resources configured.
+ [OTLP trace data flowing from your OTel Collectors to OpenSearch Ingestion](observability-ingestion.md) (metrics and logs are optional)
+ [Amazon Managed Service for Prometheus configured to receive remote write from OpenSearch Ingestion](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/configure-client-prometheus.html)
+ An OpenSearch UI workspace with Observability enabled

## How it works
<a name="observability-app-monitoring-how-it-works"></a>

The following diagram shows the end-to-end architecture for application monitoring.

![\[alt text not found\]](http://docs.aws.amazon.com/opensearch-service/latest/developerguide/images/otel-sdk-service.png)


1. Your applications and infrastructure emit telemetry through OpenTelemetry SDKs, auto-instrumentation, or the OTel API to the OTel Collector.

1. The OTel Collector forwards trace data to OpenSearch Ingestion over OTLP.

1. The OpenSearch Ingestion `otel_apm_service_map` processor extracts service-to-service relationships and computes RED metrics.

1. Topology and raw trace data are indexed into OpenSearch. RED metrics are exported to Amazon Managed Service for Prometheus through remote write.

1. OpenSearch UI queries both stores to render the Application Map, Services catalog, and service detail views.
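
For step 2, a minimal OTel Collector configuration might look like the following. This is a sketch, not a production config: the OpenSearch Ingestion endpoint is a placeholder, and the SigV4 authentication extension your environment requires is omitted for brevity:

```
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:

exporters:
  otlphttp:
    # Placeholder: your OpenSearch Ingestion pipeline endpoint
    traces_endpoint: "https://<pipeline-endpoint>/v1/traces"

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
```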

## Services
<a name="observability-app-services"></a>

The Services view provides a centralized catalog of all instrumented services, displaying RED metrics (Rate, Errors, Duration) at a glance. You can use this view to quickly identify unhealthy services and drill into detail views for deeper analysis.

To access the Services view, navigate to the Observability workspace in OpenSearch UI and choose **APM** > **Services**.

The Services home page displays a table of all instrumented services along with summary panels. The following image shows the Services home page.

![\[alt text not found\]](http://docs.aws.amazon.com/opensearch-service/latest/developerguide/images/apm/services-home.png)


The following table describes the columns in the services table.


| Column | Description | 
| --- | --- | 
| Service name | The name of the instrumented service. | 
| P99 latency | The 99th percentile latency for the service. | 
| P90 latency | The 90th percentile latency for the service. | 
| P50 latency | The 50th percentile (median) latency for the service. | 
| Total requests | The total number of requests processed during the selected time range. | 
| Failure ratio | The ratio of failed requests to total requests. | 
| Environment | The deployment environment of the service, such as production or staging. | 

The home page also includes the following summary panels:
+ **Top services by fault rate** – Services with the highest percentage of 5xx responses.
+ **Top dependency paths by fault rate** – Service-to-service dependency paths with the highest fault rates.

You can filter the services table by using the following filters:
+ **Environment** – Filter by deployment environment.
+ **Latency** – Filter by latency range.
+ **Throughput** – Filter by request throughput range.
+ **Failure ratio** – Filter by failure ratio range.
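
To make the table's columns concrete, the following sketch computes the same statistics — percentile latency, total requests, and failure ratio — from a list of request records. The records are hypothetical; in the Services view these values come from Amazon Managed Service for Prometheus.

```python
import math

# Hypothetical request records: (latency in ms, request succeeded?)
requests = [(12, True), (18, True), (25, True), (40, True), (300, False)]

def percentile(values, p):
    """Nearest-rank percentile over a sorted copy of values."""
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]

latencies = [lat for lat, _ in requests]
total_requests = len(requests)
failure_ratio = sum(1 for _, ok in requests if not ok) / total_requests

p50 = percentile(latencies, 50)  # median latency
p99 = percentile(latencies, 99)  # tail latency, dominated by the slowest request
```

Note how a single slow, failed request dominates P99 while barely moving the median — which is why the table surfaces both.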

### Service overview
<a name="observability-app-services-overview"></a>

To open the service detail view, select a service name in the services table. The Overview tab displays metric tiles and time-series charts for the selected service.

![\[alt text not found\]](http://docs.aws.amazon.com/opensearch-service/latest/developerguide/images/apm/services-overview.png)


The Overview tab includes the following time-series charts:
+ **Latency by service dependencies** – P50, P90, and P99 latency broken down by downstream dependencies.
+ **Requests by operations** – Request volume for each operation of the service.
+ **Availability by operations** – Percentage of successful responses for each operation.
+ **Fault rate and error rate by operations** – Percentage of 5xx and 4xx responses for each operation.

### Operations
<a name="observability-app-services-operations"></a>

The Operations tab provides a per-operation breakdown for the selected service. You can sort the table by any column to identify problematic operations.

![\[alt text not found\]](http://docs.aws.amazon.com/opensearch-service/latest/developerguide/images/apm/service-operations.png)


The following table describes the columns in the operations table.


| Column | Description | 
| --- | --- | 
| Operation name | The name of the operation. | 
| P50/P90/P99 latency | The 50th, 90th, and 99th percentile latency for the operation. | 
| Total requests | The total number of requests for the operation during the selected time range. | 
| Error rate | The percentage of requests that returned errors. | 
| Availability | The percentage of successful responses for the operation. | 

### Dependencies
<a name="observability-app-services-dependencies"></a>

The Dependencies tab displays the downstream services that the selected service calls.

![\[alt text not found\]](http://docs.aws.amazon.com/opensearch-service/latest/developerguide/images/apm/service-dependencies.png)


The following table describes the columns in the dependencies table.


| Column | Description | 
| --- | --- | 
| Dependency service | The name of the downstream service. | 
| Remote operation | The operation called on the downstream service. | 
| Service operations | The operations on the current service that call this dependency. | 
| P99/P90/P50 latency | The 99th, 90th, and 50th percentile latency for the dependency path. | 
| Total requests | The total number of requests to the dependency during the selected time range. | 
| Error rate | The percentage of requests to the dependency that returned errors. | 
| Availability | The percentage of successful responses from the dependency. | 

### Correlations
<a name="observability-app-services-correlations"></a>

The service detail view provides in-context correlations that let you navigate from service metrics directly to related traces and logs. You can use correlations to investigate the root cause of latency spikes or error rate increases.

![\[alt text not found\]](http://docs.aws.amazon.com/opensearch-service/latest/developerguide/images/apm/service-span-correlations.png)


The following correlation options are available:
+ **View related traces** – Opens a filtered trace view for the selected service or operation.
+ **View related logs** – Opens a filtered log view for the selected service or operation.
+ **Filter by attributes** – Narrows correlation results by specific span attributes.

## Application Map
<a name="observability-app-map"></a>

The Application Map is an interactive topology visualization that OpenSearch Ingestion auto-generates from your trace data by using the `otel_apm_service_map` processor. The map displays services as nodes with directional edges that show communication patterns, overlaid with RED metrics (Rate, Errors, Duration).

To access the Application Map, navigate to the Observability workspace in OpenSearch UI and choose **APM** > **Application map**.

The following image shows the Application Map.

![\[alt text not found\]](http://docs.aws.amazon.com/opensearch-service/latest/developerguide/images/apm/application-map.png)


The map displays the following RED metrics for each service:
+ **Rate** – Requests per second processed by the service.
+ **Errors** – Percentage of 4xx and 5xx responses.
+ **Duration** – P50 and P99 latency for the service.

The `otel_apm_service_map` processor generates these metrics and stores them in Amazon Managed Service for Prometheus through remote write.

The topology visualization represents services as nodes and communication direction as edges. Color coding indicates the health status of each service. The map updates automatically as OpenSearch Ingestion ingests new trace data.
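
Because the RED metrics live in Amazon Managed Service for Prometheus, you can also chart them directly with PromQL. The metric names below are hypothetical — check your workspace for the names the `otel_apm_service_map` processor actually emits:

```
# Hypothetical metric names -- verify against your pipeline's output
sum by (service) (rate(request_total{environment="production"}[5m]))

histogram_quantile(
  0.99,
  sum by (service, le) (rate(duration_seconds_bucket[5m]))
)
```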

### Grouping services
<a name="observability-app-map-groupby"></a>

You can group services by attributes such as programming language, team, or environment. When you select a group-by attribute, the map switches from a topology graph to a card grid view. Each card represents a group of services that share the same attribute value.

![\[alt text not found\]](http://docs.aws.amazon.com/opensearch-service/latest/developerguide/images/apm/groupby-attributes.png)


The available group-by attributes are determined by the `group_by_attributes` setting in the `otel_apm_service_map` processor configuration in OpenSearch Ingestion.

### Viewing node details
<a name="observability-app-map-node-details"></a>

To view details for a service, select a node on the map. A detail panel opens with the following sections.

![\[alt text not found\]](http://docs.aws.amazon.com/opensearch-service/latest/developerguide/images/apm/application-map-node-details.png)


The **Health** section displays the following summary metrics:
+ Total Requests
+ Total Errors 4xx
+ Total Faults 5xx

The **Metrics** section displays the following time-series charts:
+ Requests
+ Latency P50/P90/P99
+ Faults 5xx
+ Errors 4xx

Choose **View details** to navigate to the Services detail view for the selected service.

### Filtering the map
<a name="observability-app-map-filters"></a>

You can filter the Application Map by using the following filters:
+ **Fault rate** – Filter services by server-side fault rate (5xx).
+ **Error rate** – Filter services by client-side error rate (4xx).
+ **Environment** – Filter services by deployment environment.

The following image shows the map filtered by error rate.

![\[alt text not found\]](http://docs.aws.amazon.com/opensearch-service/latest/developerguide/images/apm/filter-by-error-rate.png)


### In-context correlations
<a name="observability-app-map-correlations"></a>

You can navigate from the topology view directly to related traces and logs. From any service node, the following correlation options are available:
+ **View related traces** – Opens a filtered trace view for the selected service.
+ **View related logs** – Opens a filtered log view for the selected service.

# AI observability
<a name="observability-ai"></a>

AI observability in OpenSearch provides end-to-end tooling for monitoring, debugging, and optimizing AI agent and large language model (LLM) workflows. Built on GenAI semantic conventions and natively integrated with OpenTelemetry (OTel), it gives you full visibility into how your AI applications behave in production.

AI observability includes the following capabilities:
+ **Agent tracing** – Capture hierarchical execution traces across agent orchestration steps, LLM calls, tool invocations, and retrieval operations.
+ **GenAI semantic conventions** – Use standardized OTel attributes such as `gen_ai.system`, `gen_ai.request.model`, and `gen_ai.usage.input_tokens` to describe AI-specific telemetry.
+ **Auto-instrumentation** – Automatically capture traces from popular AI frameworks and providers, including OpenAI, Anthropic, Amazon Bedrock, LangChain, and more than 20 additional libraries.
+ **PPL querying** – Query and aggregate trace data using Piped Processing Language (PPL) directly from OpenSearch UI.

## Getting started
<a name="observability-ai-getting-started"></a>

This section walks you through instrumenting an AI agent, sending traces to Amazon OpenSearch Service, and viewing them in OpenSearch UI.

### To install the SDK
<a name="observability-ai-install-sdk"></a>

Install the OpenTelemetry GenAI instrumentation package:

```
pip install opentelemetry-instrumentation-openai-v2 opentelemetry-sdk opentelemetry-exporter-otlp
```

### To instrument your agent code
<a name="observability-ai-instrument"></a>

The following example shows how to register the OTel SDK, create a span for your agent function, and enrich the span with GenAI attributes.

```
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Register the tracer provider
provider = TracerProvider()
processor = BatchSpanProcessor(
    OTLPSpanExporter(endpoint="<your-osis-endpoint>")  # Use your OpenSearch Ingestion endpoint with SigV4
)
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

def run_agent(prompt: str):
    with tracer.start_as_current_span("invoke_agent") as span:
        span.set_attribute("gen_ai.operation.name", "invoke_agent")
        span.set_attribute("gen_ai.system", "openai")
        span.set_attribute("gen_ai.request.model", "gpt-4")

        # Enrich with token usage after the LLM call
        # (call_llm represents your own LLM client code)
        response = call_llm(prompt)
        span.set_attribute("gen_ai.usage.input_tokens", response.input_tokens)
        span.set_attribute("gen_ai.usage.output_tokens", response.output_tokens)
        return response
```

**Note**  
When you send traces to Amazon OpenSearch Service, use your OpenSearch Ingestion pipeline endpoint with SigV4 authentication instead of a localhost endpoint. For more information about configuring OpenSearch Ingestion pipelines, see [Overview of Amazon OpenSearch Ingestion](ingestion.md).

### To view traces in OpenSearch UI
<a name="observability-ai-view-traces"></a>

After your instrumented application sends trace data, you can explore it in OpenSearch UI. In your observability workspace, expand **Discover** in the left navigation and choose **Agent Traces**.

## Agent Tracing UI
<a name="observability-ai-agent-tracing"></a>

The Agent Traces page in OpenSearch UI provides a purpose-built interface for exploring, debugging, and monitoring LLM agent execution traces. It gives developers and platform operators full observability into agentic AI applications, including hierarchical trace views, detail flyouts, flow visualizations, and aggregate metrics.

### Architecture
<a name="observability-ai-architecture"></a>

The following diagram shows the data flow from instrumented applications to the Agent Traces UI:

```
LLM Application (with OTel SDK + GenAI instrumentation)
    |
    |  OTLP (gRPC/HTTP)
    v
OTel Collector (batch, transform)
    |
    +---- OTLP ----> OpenSearch Ingestion --> OpenSearch (otel-v1-apm-span-*)
    |
    +---- Prometheus Remote Write --> Prometheus (metrics)
                                          |
                                          v
                              OpenSearch UI
                              +-- Agent Traces Plugin
```

### Prerequisites
<a name="observability-ai-prereqs"></a>

Before you use Agent Traces, make sure you have the following:
+ An OpenSearch cluster with trace data indexed in `otel-v1-apm-span-*` indexes.
+ OpenTelemetry instrumentation with GenAI semantic conventions enabled in your LLM application.
+ OpenSearch Ingestion configured with the `otel_trace_raw` processor to ingest spans into OpenSearch.
+ PPL query support enabled in OpenSearch UI.

### Required span attributes
<a name="observability-ai-span-attributes"></a>

Agent Traces requires specific span attributes to render trace data correctly. The following tables describe the core fields and GenAI-specific attributes.

**Core span fields**  
Each span must include the following core fields:


| Field | Type | Description | 
| --- | --- | --- | 
| traceId | String | Unique identifier for the entire trace. | 
| spanId | String | Unique identifier for this span. | 
| parentSpanId | String | Identifier of the parent span. Empty for root spans. | 
| startTime | Timestamp | Time when the span started. | 
| endTime | Timestamp | Time when the span ended. | 
| durationInNanos | Long | Duration of the span in nanoseconds. | 
| status.code | Integer | Span status code (0 = unset, 1 = OK, 2 = error). | 

**GenAI attributes**  
The following `gen_ai.*` attributes enable AI-specific features in the Agent Traces UI:


| Attribute | Example value | Description | 
| --- | --- | --- | 
| gen\_ai.operation.name | chat | The type of GenAI operation. Determines the span category. | 
| gen\_ai.system | openai | The AI system or provider. | 
| gen\_ai.request.model | gpt-4 | The model used for the request. | 
| gen\_ai.usage.input\_tokens | 150 | Number of input tokens consumed. | 
| gen\_ai.usage.output\_tokens | 85 | Number of output tokens generated. | 
| gen\_ai.response.finish\_reasons | ["stop"] | Reasons the model stopped generating. | 
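
Taken together, the core fields and GenAI attributes yield span documents shaped roughly like the following. This is an illustrative sample with made-up values; the exact field mapping in your `otel-v1-apm-span-*` indices depends on your ingestion pipeline configuration.

```
{
  "traceId": "4bf92f3577b34da6a3ce929d0e0e4736",
  "spanId": "00f067aa0ba902b7",
  "parentSpanId": "",
  "startTime": "2025-01-15T10:00:00.000Z",
  "endTime": "2025-01-15T10:00:02.500Z",
  "durationInNanos": 2500000000,
  "status.code": 1,
  "attributes.gen_ai.operation.name": "chat",
  "attributes.gen_ai.system": "openai",
  "attributes.gen_ai.request.model": "gpt-4",
  "attributes.gen_ai.usage.input_tokens": 150,
  "attributes.gen_ai.usage.output_tokens": 85,
  "attributes.gen_ai.response.finish_reasons": ["stop"]
}
```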

### Page layout and metrics bar
<a name="observability-ai-page-layout"></a>

The Agent Traces page displays a metrics bar at the top that summarizes key statistics across all visible traces. Metrics include total trace count, average duration, error rate, and token usage. These values update dynamically based on your time filter and query.
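
The following sketch illustrates how these summary statistics can be derived from span documents. It is not the plugin's actual implementation: the field names follow the tables in this section, the sample values are invented, and trace-level duration is approximated from root spans.

```python
# Illustrative only: aggregates the kind of statistics the metrics bar shows
# (trace count, average duration, error rate, token usage) from span documents
# shaped like the indexed data described in this section.

def summarize_traces(spans):
    """Aggregate metrics across a list of span documents."""
    # Root spans (empty parentSpanId) represent one trace each.
    roots = [s for s in spans if s["parentSpanId"] == ""]
    total = len(roots)
    return {
        "traces": total,
        # Convert nanoseconds to milliseconds for display.
        "avg_duration_ms": sum(s["durationInNanos"] for s in roots) / total / 1e6,
        # status.code 2 indicates an error.
        "error_rate": sum(1 for s in roots if s["status.code"] == 2) / total,
        # Token counts are summed across all spans, not only roots.
        "input_tokens": sum(s.get("attributes.gen_ai.usage.input_tokens", 0)
                            for s in spans),
        "output_tokens": sum(s.get("attributes.gen_ai.usage.output_tokens", 0)
                             for s in spans),
    }

# Invented sample data: two root spans and one nested LLM call.
spans = [
    {"parentSpanId": "", "durationInNanos": 2_000_000_000, "status.code": 1},
    {"parentSpanId": "", "durationInNanos": 4_000_000_000, "status.code": 2},
    {"parentSpanId": "a1", "durationInNanos": 500_000_000, "status.code": 1,
     "attributes.gen_ai.usage.input_tokens": 150,
     "attributes.gen_ai.usage.output_tokens": 85},
]
print(summarize_traces(spans))
# → {'traces': 2, 'avg_duration_ms': 3000.0, 'error_rate': 0.5,
#    'input_tokens': 150, 'output_tokens': 85}
```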

### Traces tab
<a name="observability-ai-traces-tab"></a>

The Traces tab lists all root agent traces that match your current query and time range. Each row represents a single agent invocation.

![\[alt text not found\]](http://docs.aws.amazon.com/opensearch-service/latest/developerguide/images/agent-traces/traces-table.png)


The following table describes the columns in the traces table:


| Column | Description | 
| --- | --- | 
| Trace ID | Unique identifier for the trace. Choose the link to open the trace details flyout. | 
| Agent name | Name of the agent that initiated the trace. | 
| Status | Overall trace status (OK or Error). | 
| Duration | Total time from the first span to the last span in the trace. | 
| Spans | Total number of spans in the trace. | 
| Input tokens | Total input tokens consumed across all LLM calls in the trace. | 
| Output tokens | Total output tokens generated across all LLM calls in the trace. | 
| Start time | Timestamp when the trace started. | 

### Span categories
<a name="observability-ai-span-categories"></a>

Spans are categorized based on the `gen_ai.operation.name` attribute. Each category is displayed with a unique color and icon in the UI.


| Operation name | Category | Description | 
| --- | --- | --- | 
| invoke\_agent, create\_agent | Agent | Agent orchestration step. | 
| chat | LLM | LLM chat completion call. | 
| text\_completion, generate\_content | Content | Text generation operation. | 
| execute\_tool | Tool | Tool invocation. | 
| embeddings | Embeddings | Embedding generation. | 
| retrieval | Retrieval | Data retrieval operation. | 

### Spans tab
<a name="observability-ai-spans-tab"></a>

The Spans tab displays individual spans across all traces. You can filter and sort spans to find specific operations.

![\[alt text not found\]](http://docs.aws.amazon.com/opensearch-service/latest/developerguide/images/agent-traces/spans-table.png)


### Trace details flyout
<a name="observability-ai-trace-details"></a>

When you choose a trace ID, a flyout panel opens with two main areas:
+ **Left panel** – Displays the trace tree, which shows the hierarchical parent-child relationships between spans. It also includes a flow DAG (directed acyclic graph) that visualizes the execution path of the agent.
+ **Right panel** – Contains two tabs. The **Detail** tab shows span attributes, resource attributes, and GenAI-specific metadata. The **Timeline** tab shows a waterfall chart of span durations and their timing relationships.

### Querying traces
<a name="observability-ai-querying"></a>

Agent Traces uses Piped Processing Language (PPL) for all data retrieval. You can write queries in the query panel at the top of the page.

**To list root traces**  
The following query returns the 100 most recent root agent traces:

```
source = otel-v1-apm-span-*
| where parentSpanId = "" AND isnotnull(`attributes.gen_ai.operation.name`)
| sort - startTime
| head 100
```

**To fetch all spans for a trace**  
The following query returns the complete span tree for a specific trace:

```
source = otel-v1-apm-span-*
| where traceId = "trace-id"
| head 1000
```

**To compute aggregate metrics**  
The following query computes average duration and total token usage grouped by model:

```
source = otel-v1-apm-span-*
| where isnotnull(`attributes.gen_ai.request.model`)
| stats avg(durationInNanos) as avg_duration,
        sum(`attributes.gen_ai.usage.input_tokens`) as total_input_tokens,
        sum(`attributes.gen_ai.usage.output_tokens`) as total_output_tokens
    by `attributes.gen_ai.request.model`
```
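
**To find failed agent runs**  
As an additional sketch, a query along these lines surfaces root traces that ended in an error, assuming the same index pattern and the status encoding described earlier (`status.code` = 2 for errors):

```
source = otel-v1-apm-span-*
| where parentSpanId = "" AND `status.code` = 2
| sort - startTime
| head 100
```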