# Monitoring streams
<a name="cdc-monitoring"></a>

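**Important**  
This feature is provided as an AWS preview and is subject to change. For more information, see Section 2, Betas and Previews, of the [AWS Service Terms](https://aws.amazon.com/service-terms/). For more information about CDC stream pricing, see the [Aurora DSQL pricing](https://aws.amazon.com/rds/aurora/dsql/pricing/) page.  
Before general availability, a new operation type (`"op": "u"` for updates) will be added to the stream payload. So that your application handles this change without modification, treat any unrecognized `op` value as an upsert by applying the `after` payload. For details, see [Understanding CDC records](cdc-record-format.md).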
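
The upsert rule in the preceding note can be sketched in Python as follows. This is a minimal illustration, not the documented record format: only the `op` and `after` field names and the `"u"` value come from the note. The `key_of` callback, the `delete_ops` set, and the in-memory `table` are hypothetical stand-ins that you would replace after checking the record format reference.

```python
from typing import Any, Callable, Dict, FrozenSet, Hashable

Row = Dict[str, Any]


def apply_record(
    table: Dict[Hashable, Row],
    record: Dict[str, Any],
    key_of: Callable[[Dict[str, Any]], Hashable],
    delete_ops: FrozenSet[str],
) -> None:
    """Apply one CDC record to an in-memory copy of a table.

    delete_ops and key_of are supplied by the caller because the exact
    delete marker and key layout are assumptions in this sketch.
    """
    op = record.get("op")
    if op in delete_ops:
        # Deletes remove the row if it is present.
        table.pop(key_of(record), None)
        return
    # Every other op value, including ones this code has never seen
    # (such as a future "u"), is treated as an upsert of `after`.
    table[key_of(record)] = dict(record["after"])
```

Because the fallback branch is the upsert, a consumer written this way keeps working when a new non-delete `op` value starts to appear in the payload.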

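If Aurora DSQL encounters an error while delivering CDC records, the stream transitions to the `IMPAIRED` state. An impaired stream continues to process and deliver other records. Aurora DSQL retries only the failed records. Aurora DSQL measures replication lag from the oldest undelivered record, so lag grows until you resolve the issue. Aurora DSQL retains undelivered changes internally for 1 week.

If you resolve the underlying issue within this window, the next retry succeeds, the error state clears, and the stream transitions back to `ACTIVE`. You fix the external issue (for example, an IAM policy, an AWS KMS key, or Amazon Kinesis capacity), and Aurora DSQL retries automatically.

If replication lag exceeds the failure threshold, the stream transitions to `FAILED`.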

**Important**  
A failed stream can't be recovered. You must delete the failed stream and create a new one.

## Stream lifecycle
<a name="cdc-lifecycle"></a>

A stream transitions through the following states during its lifecycle.
+ **`CREATING`** - Aurora DSQL is setting up the stream. Aurora DSQL doesn't deliver CDC records yet.
+ **`ACTIVE`** - The stream is operational and delivers CDC records to the target.
+ **`IMPAIRED`** - The stream encountered an issue that requires action. Other records can still be delivered, but Aurora DSQL retries the failed records with exponential backoff. Aurora DSQL measures replication lag from the oldest undelivered record, so lag grows until you resolve the issue. Aurora DSQL buffers undelivered changes internally for 1 week. See [Error code reference](#cdc-failure-reasons).
+ **`FAILED`** - The stream encountered a persistent error and no longer delivers CDC records. A failed stream can't be recovered and must be deleted. For the conditions that put a stream in this state, see [Error code reference](#cdc-failure-reasons).
+ **`DELETING`** - Aurora DSQL is removing the stream resources.
+ **`DELETED`** - Aurora DSQL deleted the stream. After deletion completes, `GetStream` returns `ResourceNotFoundException`.

Call `GetStream` at any time to check the stream status. When the stream is `IMPAIRED` or `FAILED`, the response includes a `statusReason` object that contains an error code and a timestamp. For more information about the `GetStream` response fields, see [GetStream](https://docs.aws.amazon.com/aurora-dsql/latest/APIReference/API_GetStream.html) in the Amazon Aurora DSQL API Reference.
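
As an illustration of checking status through the lifecycle, the following Python sketch waits for a new stream to leave `CREATING`. It assumes a caller-supplied `fetch_stream` function (hypothetical) that calls `GetStream` with whatever client you use and returns the response as a dictionary with the `status` and `statusReason` fields described above. The timeout and polling interval are arbitrary example values.

```python
import time
from typing import Any, Callable, Dict


def wait_until_active(
    fetch_stream: Callable[[], Dict[str, Any]],
    timeout_s: float = 600.0,
    interval_s: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll until the stream is ACTIVE, or raise if it can't get there."""
    deadline = time.monotonic() + timeout_s
    while True:
        stream = fetch_stream()
        status = stream["status"]
        if status == "ACTIVE":
            return stream
        if status != "CREATING":
            # IMPAIRED, FAILED, DELETING, and DELETED all need attention,
            # so surface them instead of waiting.
            code = stream.get("statusReason", {}).get("error", "unknown")
            raise RuntimeError(f"stream is {status} (error: {code})")
        if time.monotonic() >= deadline:
            raise TimeoutError("stream is still CREATING")
        sleep(interval_s)
```

Injecting `fetch_stream` and `sleep` keeps the loop independent of any particular SDK and makes it easy to exercise with canned responses.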

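## Troubleshooting an impaired or failed stream
<a name="cdc-troubleshooting"></a>

If a CDC stream becomes impaired or fails, follow these steps. If the stream is `FAILED`, you can't recover it. Delete the stream, resolve the underlying issue, and create a new stream.

1. **Get the stream status.** Call `GetStream` and check the `status` field. If the status is `ACTIVE`, the stream is healthy.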

   ```
   aws dsql get-stream \
     --cluster-identifier {{cluster-id}} \
     --stream-identifier {{stream-id}} \
     --region {{region}}
   ```

1. **Read the error code.** If the status is `IMPAIRED` or `FAILED`, the response includes a `statusReason` object. The `error` field contains the error code.

   ```
   {
       "status": "IMPAIRED",
       "statusReason": {
           "error": "KINESIS_THROUGHPUT_EXCEEDED",
           "updatedAt": "2025-01-15T14:30:00Z"
       }
   }
   ```

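1. **Follow the resolution.** If the stream is `IMPAIRED`, look up the error code in the table under [Error code reference](#cdc-failure-reasons) and apply the recommended fix. After you resolve the underlying issue, Aurora DSQL retries automatically. If the stream is `FAILED`, delete the stream, resolve the issue, and then create a new stream.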
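
The three steps above can be condensed into a small decision function. The following Python sketch takes a `GetStream` response that is already parsed into a dictionary (how you obtain it is up to you) and returns a suggested next action. The function name and the wording of the returned strings are illustrative only.

```python
from typing import Any, Dict


def triage(stream: Dict[str, Any]) -> str:
    """Map a parsed GetStream response to the next troubleshooting action."""
    status = stream["status"]
    if status == "ACTIVE":
        return "healthy: no action needed"

    # statusReason is only expected for IMPAIRED and FAILED streams.
    code = stream.get("statusReason", {}).get("error", "UNKNOWN")
    if status == "IMPAIRED":
        return f"fix the cause of {code}; Aurora DSQL retries automatically"
    if status == "FAILED":
        return f"delete the stream, fix the cause of {code}, create a new stream"
    return f"no troubleshooting action: stream is {status}"
```

For the sample response shown in step 2, `triage` returns the `IMPAIRED` branch with `KINESIS_THROUGHPUT_EXCEEDED` as the code to look up.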

## Error code reference
<a name="cdc-failure-reasons"></a>

The following table describes each error code, its cause, whether the stream can recover, and the resolution steps.


| Error code | Cause | Recoverable | Resolution | 
| --- |--- |--- |--- |
| KINESIS\_THROUGHPUT\_EXCEEDED | Your Kinesis data stream exceeded its throughput limit, or AWS KMS throttled encryption operations on the Kinesis data stream, and the replication lag has grown. | Yes | Increase the number of shards on your Kinesis data stream, or switch to on-demand capacity mode. If the Kinesis data stream uses an AWS KMS customer managed key, verify that the key's request quota is large enough. After you increase capacity, Aurora DSQL retries automatically. | 
| KINESIS\_STREAM\_NOT\_FOUND | The target Kinesis data stream no longer exists. | No | The stream transitions directly to FAILED. Delete the CDC stream and create a new one pointing to a valid Kinesis data stream. | 
| ROLE\_ACCESS\_DENIED | Aurora DSQL can't assume the IAM role specified in the target definition. The AWS STS AssumeRole call returned AccessDenied. | Yes | Verify the role's trust policy allows the Aurora DSQL service principal (dsql.amazonaws.com) to assume it. Verify the aws:SourceAccount and aws:SourceArn conditions match your cluster. For details, see [Service role trust policy](cdc-iam.md#cdc-iam-trust-policy). After you fix the trust policy, Aurora DSQL retries automatically. | 
| KINESIS\_ACCESS\_DENIED | The assumed role doesn't have permission to write to the Kinesis data stream. Kinesis returned AccessDeniedException. | Yes | Add kinesis:PutRecord and kinesis:PutRecords permissions to the role's policy for the target Kinesis data stream Amazon Resource Name (ARN). After you fix the policy, Aurora DSQL retries automatically. | 
| KINESIS\_KMS\_ACCESS\_DENIED | The assumed role doesn't have permission to use the AWS KMS key that encrypts the Kinesis data stream. This error covers AWS KMS access denial and invalid key states. | Yes | Verify the role has kms:GenerateDataKey permission on the AWS KMS key that the Kinesis data stream uses. Also verify that the AWS KMS key is in an enabled and valid state. This key is the encryption key on the Kinesis data stream, not the cluster's AWS KMS key. For details, see [Service role permissions policy](cdc-iam.md#cdc-iam-permissions-policy). After you fix the permissions or key state, Aurora DSQL retries automatically. | 
| KINESIS\_OVERSIZE\_RECORD | A CDC record exceeded the maximum record size configured on the Kinesis data stream. | Yes | Increase MaxRecordSizeInKiB on the Kinesis data stream to 10240 (10 MiB). You can update this setting on an existing Kinesis data stream without deleting it. After you increase the limit, Aurora DSQL retries the oversized record automatically and the stream transitions back to ACTIVE. | 
| CLUSTER\_CMK\_INACCESSIBLE | The AWS KMS customer managed key that encrypts the Aurora DSQL cluster is inaccessible. | Yes | Verify the AWS KMS key policy and key state. Re-enable or restore access to the key. After the key becomes accessible again, the stream transitions back to ACTIVE. | 

The preceding table lists all `StreamFailureErrorCode` values. For more information about the `statusReason` response field, see [GetStream](https://docs.aws.amazon.com/aurora-dsql/latest/APIReference/API_GetStream.html) in the [Amazon Aurora DSQL API Reference](https://docs.aws.amazon.com/aurora-dsql/latest/userguide/CHAP_api_reference.html).

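## Recovering an impaired stream
<a name="cdc-how-recovery-works"></a>

Most errors first transition the stream to `IMPAIRED`. An impaired stream continues to process other records and automatically retries the failed records. A `FAILED` stream can't be recovered. You must delete it and create a new stream.
+ **For recoverable errors:** Fix the external issue (IAM policy, AWS KMS key, Kinesis capacity, or Kinesis record size limit). When the next retry succeeds, the error state clears and the stream transitions back to `ACTIVE`.
+ **For `KINESIS_STREAM_NOT_FOUND`:** The stream transitions directly to `FAILED`. Delete the failed stream and create a new stream that points to a valid Kinesis data stream.

For all other error codes, if replication lag exceeds the failure threshold before you resolve the issue, the stream transitions from `IMPAIRED` to `FAILED`. A failed stream can't transition back to `ACTIVE`. Delete the failed stream, resolve the underlying issue, and then create a new stream.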

## Monitoring stream health
<a name="cdc-stream-health"></a>

Use CloudWatch metrics and the `GetStream` API to monitor stream health. CloudWatch metrics provide continuous visibility into CDC pipeline performance, and `GetStream` provides the specific error code when a stream is impaired or failed.

For the full list of CDC metrics, including `IsImpaired`, `BehindSourceLag`, `PublishedBytes`, and `PublishedRecords`, see [CloudWatch metrics for CDC streams](#cdc-cloudwatch-metrics). For more information about the `GetStream` response fields, see [GetStream](https://docs.aws.amazon.com/aurora-dsql/latest/APIReference/API_GetStream.html) in the Amazon Aurora DSQL API Reference.

## CloudWatch metrics for CDC streams
<a name="cdc-cloudwatch-metrics"></a>

Use the following CloudWatch metrics to monitor the health and throughput of each CDC stream. Aurora DSQL publishes these metrics to the `AWS/AuroraDSQL` namespace with the dimensions `ClusterId` and `StreamId`. The last metric in the table is a standard Amazon Kinesis metric in the `AWS/Kinesis` namespace that measures downstream read lag.

**Note**  
Aurora DSQL also publishes the `BytesStreamed` and `StreamDPU` metrics to the `AWS/AuroraDSQL` namespace for usage and billing tracking. For descriptions, see [CDC stream metrics](cloudwatch-monitoring.md#cdc-stream-metrics).


| Metric name | Useful statistics | Description | 
| --- |--- |--- |
| IsImpaired | Maximum | Indicates whether the stream is impaired. The value is 1 when the stream is in the IMPAIRED state, and 0 when the stream is healthy. Aurora DSQL emits this metric continuously for each active or impaired stream. Use this metric to create a CloudWatch alarm that notifies you when a stream becomes impaired. | 
| BehindSourceLag | Average | The delay, in milliseconds, between when a transaction commits in Aurora DSQL and when the CDC system processes the resulting record. A rising value indicates that the CDC pipeline is falling behind the write workload. | 
| PublishedBytes | Sum | The total bytes of CDC records that Aurora DSQL wrote to the target during the period. Use this metric together with your Kinesis shard count to determine whether you've provisioned enough write capacity. | 
| PublishedRecords | Sum | The total number of CDC records that Aurora DSQL wrote to the target during the period. Each committed row change produces one record. | 
| GetRecords.IteratorAgeMilliseconds (AWS/Kinesis) | Average | A standard Kinesis metric that reports the age of the last record read from the Kinesis data stream by your downstream app, in milliseconds. Use the StreamName dimension. A rising value indicates that your downstream app can't keep up with the rate at which Aurora DSQL writes CDC records to Kinesis. | 

The **Monitoring** tab of the Aurora DSQL console shows an **Average end-to-end latency** value that combines `BehindSourceLag` (CDC source lag) and `GetRecords.IteratorAgeMilliseconds` (Kinesis reader lag). This combined value represents the total delay from database commit to downstream read.

## Monitoring best practices
<a name="cdc-monitoring-best-practices"></a>

Use the following practices to detect and resolve CDC pipeline issues before they affect downstream systems.

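**Set an alarm on `BehindSourceLag`**  
Create a CloudWatch alarm that fires when `BehindSourceLag` exceeds a threshold that matters to your workload. For example, for a 1-minute latency target, set the threshold to 60 seconds (a value of 60,000, because the metric is reported in milliseconds). A sustained increase in this metric means that the CDC pipeline is falling behind. If lag reaches the failure threshold, the stream transitions to `FAILED`. Catching the trend early lets you increase Kinesis capacity or investigate throughput bottlenecks before the stream degrades.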
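
As a sketch of such an alarm, the following Python function builds the keyword arguments for the CloudWatch `PutMetricAlarm` operation (for example, boto3's `put_metric_alarm`). The namespace, metric name, dimensions, and statistic come from the metrics table in this topic, which reports `BehindSourceLag` in milliseconds. The alarm name, the 60-second period, and the five evaluation periods are illustrative assumptions to tune for your workload.

```python
from typing import Any, Dict


def lag_alarm_params(
    cluster_id: str, stream_id: str, target_seconds: float
) -> Dict[str, Any]:
    """Build PutMetricAlarm arguments for a BehindSourceLag alarm."""
    return {
        "AlarmName": f"dsql-cdc-lag-{stream_id}",  # hypothetical naming scheme
        "Namespace": "AWS/AuroraDSQL",
        "MetricName": "BehindSourceLag",
        "Dimensions": [
            {"Name": "ClusterId", "Value": cluster_id},
            {"Name": "StreamId", "Value": stream_id},
        ],
        "Statistic": "Average",
        "Period": 60,            # seconds per data point (assumed)
        "EvaluationPeriods": 5,  # require sustained lag, not one spike (assumed)
        # The metric is in milliseconds, so convert the target.
        "Threshold": target_seconds * 1000.0,
        "ComparisonOperator": "GreaterThanThreshold",
    }
```

With boto3 installed and credentials configured, the result could be passed as `boto3.client("cloudwatch").put_metric_alarm(**lag_alarm_params(cluster_id, stream_id, 60))`. Add `AlarmActions` if you want the alarm to notify an Amazon SNS topic.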

**Monitor `GetRecords.IteratorAgeMilliseconds` on the Kinesis side**  
Even when Aurora DSQL delivers records on time, your downstream app can fall behind. Create a CloudWatch alarm on `GetRecords.IteratorAgeMilliseconds` (`AWS/Kinesis` namespace, `StreamName` dimension) to detect downstream lag independently. If this metric rises while `BehindSourceLag` stays flat, the bottleneck is in your downstream app, not in Aurora DSQL.
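
The comparison described above can be expressed as a small helper. This Python sketch takes two lists of recent metric samples (oldest first, in milliseconds) and reports where the bottleneck most likely is. The "rising" test, which compares the last sample with the first against a minimum increase, is a deliberately simple assumption and not a CloudWatch feature.

```python
from typing import Sequence


def is_rising(samples: Sequence[float], min_increase: float) -> bool:
    """True when the newest sample exceeds the oldest by more than min_increase."""
    return len(samples) >= 2 and samples[-1] - samples[0] > min_increase


def locate_bottleneck(
    behind_source_lag_ms: Sequence[float],
    iterator_age_ms: Sequence[float],
    min_increase_ms: float = 1000.0,
) -> str:
    """Classify lag as a CDC pipeline problem, a downstream problem, or neither."""
    if is_rising(behind_source_lag_ms, min_increase_ms):
        # Aurora DSQL itself is falling behind the write workload.
        return "cdc-pipeline"
    if is_rising(iterator_age_ms, min_increase_ms):
        # Delivery is on time, but the reader isn't keeping up.
        return "downstream-app"
    return "none"
```

For example, a flat `BehindSourceLag` series together with a climbing iterator age yields `"downstream-app"`.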

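**Track `PublishedBytes` against Kinesis shard capacity**  
Each Kinesis shard supports up to 1 MiB per second for writes. Compare the per-minute `PublishedBytes` Sum against your total shard write capacity (number of shards × 60 MiB per minute). When usage approaches 80%, add shards or switch to on-demand capacity mode before throttling triggers `KINESIS_THROUGHPUT_EXCEEDED`.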
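
The capacity arithmetic above works out as in the following Python sketch. It uses only the figures stated in this section (1 MiB per second per shard and an 80% warning level). The function names are illustrative.

```python
MIB = 1024 * 1024
SHARD_WRITE_BYTES_PER_MINUTE = 1 * MIB * 60  # 1 MiB/s per shard, per the text


def write_utilization(published_bytes_per_minute: float, shard_count: int) -> float:
    """Fraction of provisioned write capacity used in one minute."""
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    capacity = shard_count * SHARD_WRITE_BYTES_PER_MINUTE
    return published_bytes_per_minute / capacity


def needs_more_capacity(
    published_bytes_per_minute: float, shard_count: int, threshold: float = 0.8
) -> bool:
    """True when utilization has reached the warning threshold."""
    return write_utilization(published_bytes_per_minute, shard_count) >= threshold
```

For example, a stream with 4 shards has a capacity of 240 MiB per minute, so a per-minute `PublishedBytes` Sum of 192 MiB is exactly 80% utilization. A per-minute sum averages out short bursts, so a stream can still be throttled briefly below this level.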

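**Alarm on `IsImpaired` for immediate failure detection**  
Create a CloudWatch alarm that fires when the `IsImpaired` Maximum is greater than or equal to `1` for one evaluation period. This gives you a direct signal when the stream transitions to the `IMPAIRED` state, without polling the API. After the alarm fires, call `GetStream` to read the `statusReason.error` field, and then follow the steps in [Troubleshooting an impaired or failed stream](#cdc-troubleshooting).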

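**Poll `GetStream` for detailed status**  
The `IsImpaired` metric tells you that a stream is impaired, but the `GetStream` API provides the specific error code and timestamp. Poll `GetStream` on a schedule (for example, every 5 minutes) or in response to an `IsImpaired` alarm. The `statusReason.error` field tells you what went wrong. Pair it with the steps in [Troubleshooting an impaired or failed stream](#cdc-troubleshooting) to resolve the issue quickly.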

**Correlate metrics with a dashboard**  
Create a CloudWatch dashboard that shows `IsImpaired`, `BehindSourceLag`, `PublishedRecords`, `PublishedBytes`, and `GetRecords.IteratorAgeMilliseconds` side by side. Correlating these metrics helps you distinguish CDC pipeline issues (rising `BehindSourceLag`) from downstream read issues (rising `IteratorAge` with stable `BehindSourceLag`).
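
One way to define such a dashboard is to generate the dashboard body JSON that the CloudWatch `PutDashboard` operation accepts. The following Python sketch places three metric widgets side by side on the 24-column grid. Metric names, namespaces, and dimensions come from the metrics table in this topic. The widget titles, sizes, and 60-second period are illustrative assumptions.

```python
import json
from typing import Any, List


def dashboard_body(
    cluster_id: str, stream_id: str, kinesis_stream_name: str, region: str
) -> str:
    """Return a CloudWatch dashboard body (JSON string) for one CDC stream."""

    def dsql(metric: str, stat: str) -> List[Any]:
        # Metric entry: namespace, name, dimension pairs, rendering options.
        return ["AWS/AuroraDSQL", metric,
                "ClusterId", cluster_id, "StreamId", stream_id,
                {"stat": stat}]

    iterator_age = ["AWS/Kinesis", "GetRecords.IteratorAgeMilliseconds",
                    "StreamName", kinesis_stream_name, {"stat": "Average"}]
    panels = [
        ("Impaired", [dsql("IsImpaired", "Maximum")]),
        ("Lag (ms)", [dsql("BehindSourceLag", "Average"), iterator_age]),
        ("Published", [dsql("PublishedRecords", "Sum"),
                       dsql("PublishedBytes", "Sum")]),
    ]
    widgets = [
        {
            "type": "metric",
            "x": 8 * index, "y": 0, "width": 8, "height": 6,
            "properties": {
                "title": title,
                "metrics": metrics,
                "period": 60,
                "region": region,
                "view": "timeSeries",
            },
        }
        for index, (title, metrics) in enumerate(panels)
    ]
    return json.dumps({"widgets": widgets})
```

With boto3, the returned string could be passed as the `DashboardBody` argument of `put_dashboard`, together with a `DashboardName` of your choice. Plotting both lag metrics in one widget makes the divergence between source lag and reader lag easy to spot.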