# Responding to CloudWatch events from Amazon EMR

This section describes ways to respond to the actionable events that Amazon EMR emits as [CloudWatch event messages](emr-manage-cloudwatch-events.md). You can respond to events by creating rules, setting alarms, and using other responses. The sections that follow include links to procedures and recommended responses to common events.

**Topics**
+ [Creating rules for Amazon EMR events with CloudWatch](emr-events-cloudwatch-console.md)
+ [Setting alarms on CloudWatch metrics from Amazon EMR](UsingEMR_ViewingMetrics_Alarm.md)
+ [Responding to Amazon EMR cluster insufficient instance capacity events](emr-events-response-insuff-capacity.md)
+ [Responding to Amazon EMR cluster instance fleet resize timeout events](emr-events-response-timeout-events.md)

# Creating rules for Amazon EMR events with CloudWatch

Amazon EMR automatically sends events to a CloudWatch event stream. You can create rules that match events according to a specified pattern, and route the events to targets to take action, such as sending an email notification. Patterns are matched against the event JSON object. For more information about Amazon EMR event details, see [Amazon EMR events](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/EventTypes.html#emr_event_type) in the *Amazon CloudWatch Events User Guide*.

For information about setting up CloudWatch event rules, see [Creating a CloudWatch rule that triggers on an event](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/Create-CloudWatch-Events-Rule.html).

# Setting alarms on CloudWatch metrics from Amazon EMR

Amazon EMR pushes metrics to Amazon CloudWatch. In response, you can use CloudWatch to set alarms on your Amazon EMR metrics. For example, you can configure an alarm in CloudWatch to send you an email any time HDFS utilization rises above 80%. For detailed instructions, see [Create or edit a CloudWatch alarm](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ConsoleAlarms.html) in the *Amazon CloudWatch User Guide*.

# Responding to Amazon EMR cluster insufficient instance capacity events

## Overview

Amazon EMR clusters return the event code `EC2 provisioning - Insufficient Instance Capacity` when the selected Availability Zone doesn't have enough capacity to fulfill a cluster start or resize request. The event is emitted periodically, for both instance groups and instance fleets, when Amazon EMR repeatedly encounters insufficient capacity exceptions and can't fulfill the provisioning request for a cluster start or cluster resize operation.

This page describes how best to respond to this event type when it occurs for your EMR cluster.
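As a rough illustration of the rule-based approach described earlier, the following sketch builds an event pattern for this event and checks it against a sample event locally. The `detail-type` and `eventCode` strings are the ones used by the example Lambda function in this section. The `aws.emr` source value, the rule name, the `matches_pattern` and `create_rule` helpers, and the sample event are assumptions made for illustration. `matches_pattern` only emulates exact-value matching, not the full pattern syntax that rules support.

```python
import json

# Event pattern for insufficient capacity events on instance groups.
# "aws.emr" is assumed as the event source; the other strings come from
# the example Lambda function in this section.
INSUFFICIENT_CAPACITY_PATTERN = {
    "source": ["aws.emr"],
    "detail-type": ["EMR Instance Group Provisioning"],
    "detail": {
        "eventCode": ["EC2 provisioning - Insufficient Instance Capacity"]
    },
}


def matches_pattern(pattern, event):
    """Hypothetical local check that emulates exact-value pattern matching.

    Each pattern leaf is a list of allowed values; nested dicts are matched
    recursively. Real rule patterns support more operators than this.
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict):
                return False
            if not matches_pattern(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True


def create_rule(events_client, rule_name="emr-insufficient-capacity"):
    """Register the pattern as a rule.

    events_client is expected to be a boto3 "events" client; the rule name
    is a placeholder.
    """
    return events_client.put_rule(
        Name=rule_name,
        EventPattern=json.dumps(INSUFFICIENT_CAPACITY_PATTERN),
        State="ENABLED",
    )


# Illustrative event with made-up identifiers, for local testing only.
sample_event = {
    "source": "aws.emr",
    "detail-type": "EMR Instance Group Provisioning",
    "detail": {
        "clusterId": "j-EXAMPLE",
        "instanceGroupType": "CORE",
        "eventCode": "EC2 provisioning - Insufficient Instance Capacity",
    },
}

print(matches_pattern(INSUFFICIENT_CAPACITY_PATTERN, sample_event))  # prints True
```

Keeping the pattern as a plain dictionary makes it possible to test the filter locally before registering the rule and attaching a target such as an email notification or a Lambda function.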
## Recommended response to an insufficient capacity event

We recommend that you respond to an insufficient capacity event in one of the following ways:
+ Wait for capacity to recover. Capacity shifts frequently, so an insufficient capacity exception can recover on its own. Your cluster starts or finishes resizing as soon as Amazon EC2 capacity becomes available.
+ Alternatively, terminate your cluster, modify your instance type configurations, and create a new cluster with the updated cluster configuration request. For more information, see [Availability Zone flexibility for an Amazon EMR cluster](emr-flexibility.md).

You can also set up rules or automated responses to an insufficient capacity event, as described in the next section.

## Automated recovery from an insufficient capacity event

You can build automation in response to Amazon EMR events such as those with the event code `EC2 provisioning - Insufficient Instance Capacity`. For example, the following AWS Lambda function terminates an EMR cluster with an instance group that uses On-Demand Instances, and then creates a new EMR cluster with an instance group that contains different instance types than the original request.

The following conditions trigger the automated process:
+ The insufficient capacity event has been emitted for primary or core nodes for more than 20 minutes. The example function approximates this by checking that the cluster was created more than 20 minutes ago.
+ The cluster is not in a **RUNNING** or **WAITING** state. For more information about EMR cluster states, see [Understanding the cluster lifecycle](emr-overview.md#emr-overview-cluster-lifecycle).

**Note**
When you build an automated process for an insufficient capacity exception, consider that the insufficient capacity event is recoverable. Capacity shifts often, and your clusters resume the resize or start operation as soon as Amazon EC2 capacity becomes available.

**Example function to respond to an insufficient capacity event**

```
# Lambda code for Python 3.10; the handler is lambda_function.lambda_handler
# Note: the related IAM role requires permission to use Amazon EMR

import datetime
from datetime import timezone

import boto3

INSUFFICIENT_CAPACITY_EXCEPTION_DETAIL_TYPE = "EMR Instance Group Provisioning"
INSUFFICIENT_CAPACITY_EXCEPTION_EVENT_CODE = (
    "EC2 provisioning - Insufficient Instance Capacity"
)
ALLOWED_INSTANCE_TYPES_TO_USE = [
    "m5.xlarge",
    "c5.xlarge",
    "m5.4xlarge",
    "m5.2xlarge",
    "t3.xlarge",
]
CLUSTER_START_ACCEPTABLE_STATES = ["WAITING", "RUNNING"]
CLUSTER_START_SLA = 20  # minutes

CLIENT = boto3.client("emr", region_name="us-east-1")


# Checks if the incoming event is 'EMR Instance Group Provisioning' with
# eventCode 'EC2 provisioning - Insufficient Instance Capacity'
def is_insufficient_capacity_event(event):
    detail = event.get("detail")
    if not detail:
        return False
    return (
        event.get("detail-type") == INSUFFICIENT_CAPACITY_EXCEPTION_DETAIL_TYPE
        and detail.get("eventCode") == INSUFFICIENT_CAPACITY_EXCEPTION_EVENT_CODE
    )


# Checks if the cluster is eligible for termination
def is_cluster_eligible_for_termination(event, describeClusterResponse):
    # instanceGroupType could be CORE, MASTER, or TASK
    instanceGroupType = event["detail"]["instanceGroupType"]
    clusterCreationTime = describeClusterResponse["Cluster"]["Status"]["Timeline"][
        "CreationDateTime"
    ]
    clusterState = describeClusterResponse["Cluster"]["Status"]["State"]

    # Use a timezone-aware UTC timestamp so the comparison with the
    # timezone-aware CreationDateTime is correct in any environment
    now = datetime.datetime.now(timezone.utc)

    isClusterStartSlaBreached = clusterCreationTime < now - datetime.timedelta(
        minutes=CLUSTER_START_SLA
    )

    # Eligible if the instance group that received the insufficient capacity
    # exception is CORE or PRIMARY (MASTER), more than 20 minutes have passed
    # since the cluster was created, and the cluster state has not been
    # updated to RUNNING or WAITING
    return (
        instanceGroupType in ("CORE", "MASTER")
        and isClusterStartSlaBreached
        and clusterState not in CLUSTER_START_ACCEPTABLE_STATES
    )


# Returns the first allowed instance type that differs from the exempt value
# (None if every allowed type equals the exempt value)
def choice_excluding(exempt):
    for i in ALLOWED_INSTANCE_TYPES_TO_USE:
        if i != exempt:
            return i


# Creates a new cluster by choosing a different InstanceType
def create_cluster(event):
    # instanceGroupType could be CORE, MASTER, or TASK
    instanceGroupType = event["detail"]["instanceGroupType"]

    # The following two lines assume that the customer who created the cluster
    # already knows which instance types were used in the original request
    instanceTypesFromOriginalRequestMaster = "m5.xlarge"
    instanceTypesFromOriginalRequestCore = "m5.xlarge"

    # Select new instance types to include in the new createCluster request
    instanceTypeForMaster = (
        instanceTypesFromOriginalRequestMaster
        if instanceGroupType != "MASTER"
        else choice_excluding(instanceTypesFromOriginalRequestMaster)
    )
    instanceTypeForCore = (
        instanceTypesFromOriginalRequestCore
        if instanceGroupType != "CORE"
        else choice_excluding(instanceTypesFromOriginalRequestCore)
    )

    print("Starting to create cluster...")
    instances = {
        "InstanceGroups": [
            {
                "InstanceRole": "MASTER",
                "InstanceCount": 1,
                "InstanceType": instanceTypeForMaster,
                "Market": "ON_DEMAND",
                "Name": "Master",
            },
            {
                "InstanceRole": "CORE",
                "InstanceCount": 1,
                "InstanceType": instanceTypeForCore,
                "Market": "ON_DEMAND",
                "Name": "Core",
            },
        ]
    }
    response = CLIENT.run_job_flow(
        Name="Test Cluster",
        Instances=instances,
        VisibleToAllUsers=True,
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
        ReleaseLabel="emr-6.10.0",
    )

    return response["JobFlowId"]


# Terminates the cluster using the clusterId received in the event
def terminate_cluster(event):
    print("Trying to terminate cluster, clusterId: " + event["detail"]["clusterId"])
    response = CLIENT.terminate_job_flows(JobFlowIds=[event["detail"]["clusterId"]])
    print(f"Terminate cluster response: {response}")


def describe_cluster(event):
    response = CLIENT.describe_cluster(ClusterId=event["detail"]["clusterId"])
    return response


def lambda_handler(event, context):
    if is_insufficient_capacity_event(event):
        print(
            "Received insufficient capacity event for instanceGroup, clusterId: "
            + event["detail"]["clusterId"]
        )

        describeClusterResponse = describe_cluster(event)

        shouldTerminateCluster = is_cluster_eligible_for_termination(
            event, describeClusterResponse
        )
        if shouldTerminateCluster:
            terminate_cluster(event)

            clusterId = create_cluster(event)
            print("Created a new cluster, clusterId: " + clusterId)
        else:
            print(
                "Cluster is not eligible for termination, clusterId: "
                + event["detail"]["clusterId"]
            )

    else:
        print("Received event is not insufficient capacity event, skipping")
```

# Responding to Amazon EMR cluster instance fleet resize timeout events

## Overview

Amazon EMR clusters emit [events](emr-manage-cloudwatch-events.md#emr-cloudwatch-instance-fleet-resize-events) while they run resize operations for instance fleet clusters. A provisioning timeout event is emitted when Amazon EMR stops provisioning Spot or On-Demand capacity for the fleet after the timeout expires. You can configure the timeout duration as part of the [resize specifications](https://docs.aws.amazon.com/emr/latest/APIReference/API_InstanceFleetResizingSpecifications.html) for the instance fleet. When the same instance fleet is resized consecutively, Amazon EMR emits the `Spot provisioning timeout - continuing resize` or `On-Demand provisioning timeout - continuing resize` event when the timeout for the current resize operation expires. It then starts provisioning capacity for the fleet's next resize operation.

## Responding to instance fleet resize timeout events

We recommend that you respond to a provisioning timeout event in one of the following ways:
+ Revisit the [resize specifications](https://docs.aws.amazon.com/emr/latest/APIReference/API_InstanceFleetResizingSpecifications.html) and retry the resize operation. Capacity shifts frequently, so your cluster can resize successfully as soon as Amazon EC2 capacity becomes available. For jobs that require stricter SLAs, we recommend that you configure a lower timeout duration.
+ Alternatively, you can do one of the following:
  + Launch a new cluster with diversified instance types, based on the [best practices for instance and Availability Zone flexibility](emr-flexibility.md#emr-flexibility-types), or
  + Launch a cluster with On-Demand capacity
+ For the provisioning timeout - continuing resize event, you can also wait for the resize operations to be processed. Amazon EMR continues to process the resize operations triggered for the fleet sequentially, according to the configured resize specifications.

You can also set up rules or automated responses to this event, as described in the next section.

## Automated recovery from a provisioning timeout event

You can build automation in response to Amazon EMR events with the `Spot Provisioning timeout` event code. For example, the following AWS Lambda function terminates an EMR cluster with an instance fleet that uses Spot Instances for task nodes, and then creates a new EMR cluster with an instance fleet that contains different instance types than the original request.
In this example, a `Spot Provisioning timeout` event emitted for task nodes triggers the execution of the Lambda function.

**Example function to respond to a `Spot Provisioning timeout` event**

```
# Lambda code for Python 3.10; the handler is lambda_function.lambda_handler
# Note: the related IAM role requires permission to use Amazon EMR

import boto3

SPOT_PROVISIONING_TIMEOUT_EXCEPTION_DETAIL_TYPE = "EMR Instance Fleet Resize"
SPOT_PROVISIONING_TIMEOUT_EXCEPTION_EVENT_CODE = "Spot Provisioning timeout"

CLIENT = boto3.client("emr", region_name="us-east-1")


# Checks if the incoming event is 'EMR Instance Fleet Resize' with
# eventCode 'Spot Provisioning timeout'
def is_spot_provisioning_timeout_event(event):
    detail = event.get("detail")
    if not detail:
        return False
    return (
        event.get("detail-type") == SPOT_PROVISIONING_TIMEOUT_EXCEPTION_DETAIL_TYPE
        and detail.get("eventCode") == SPOT_PROVISIONING_TIMEOUT_EXCEPTION_EVENT_CODE
    )


# Checks if the cluster is eligible for termination
# (describeClusterResponse is not used by this check; it is available
# if you want to add conditions based on the cluster state)
def is_cluster_eligible_for_termination(event, describeClusterResponse):
    # instanceFleetType could be CORE, MASTER, or TASK
    instanceFleetType = event["detail"]["instanceFleetType"]

    # Eligible only if the instance fleet that received the
    # Spot provisioning timeout event is TASK
    return instanceFleetType == "TASK"


# Creates a new cluster by choosing different instance types
def create_cluster(event):
    # The following two lines assume that the customer who created the cluster
    # already knows which instance types were used in the original request
    instanceTypesFromOriginalRequestMaster = "m5.xlarge"
    instanceTypesFromOriginalRequestCore = "m5.xlarge"

    # Select new instance types to include in the new createCluster request
    instanceTypesForTask = [
        "m5.xlarge",
        "m5.2xlarge",
        "m5.4xlarge",
        "m5.8xlarge",
        "m5.12xlarge",
    ]

    print("Starting to create cluster...")
    instances = {
        "InstanceFleets": [
            {
                "InstanceFleetType": "MASTER",
                "TargetOnDemandCapacity": 1,
                "TargetSpotCapacity": 0,
                "InstanceTypeConfigs": [
                    {
                        "InstanceType": instanceTypesFromOriginalRequestMaster,
                        "WeightedCapacity": 1,
                    }
                ],
            },
            {
                "InstanceFleetType": "CORE",
                "TargetOnDemandCapacity": 1,
                "TargetSpotCapacity": 0,
                "InstanceTypeConfigs": [
                    {
                        "InstanceType": instanceTypesFromOriginalRequestCore,
                        "WeightedCapacity": 1,
                    }
                ],
            },
            {
                "InstanceFleetType": "TASK",
                "TargetOnDemandCapacity": 0,
                "TargetSpotCapacity": 100,
                "LaunchSpecifications": {},
                "InstanceTypeConfigs": [
                    {"InstanceType": instanceTypesForTask[0], "WeightedCapacity": 1},
                    {"InstanceType": instanceTypesForTask[1], "WeightedCapacity": 2},
                    {"InstanceType": instanceTypesForTask[2], "WeightedCapacity": 4},
                    {"InstanceType": instanceTypesForTask[3], "WeightedCapacity": 8},
                    {"InstanceType": instanceTypesForTask[4], "WeightedCapacity": 12},
                ],
                # Stop provisioning Spot capacity for a resize after 30 minutes
                "ResizeSpecifications": {
                    "SpotResizeSpecification": {"TimeoutDurationMinutes": 30}
                },
            },
        ]
    }
    response = CLIENT.run_job_flow(
        Name="Test Cluster",
        Instances=instances,
        VisibleToAllUsers=True,
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
        ReleaseLabel="emr-6.10.0",
    )

    return response["JobFlowId"]


# Terminates the cluster using the clusterId received in the event
def terminate_cluster(event):
    print("Trying to terminate cluster, clusterId: " + event["detail"]["clusterId"])
    response = CLIENT.terminate_job_flows(JobFlowIds=[event["detail"]["clusterId"]])
    print(f"Terminate cluster response: {response}")


def describe_cluster(event):
    response = CLIENT.describe_cluster(ClusterId=event["detail"]["clusterId"])
    return response


def lambda_handler(event, context):
    if is_spot_provisioning_timeout_event(event):
        print(
            "Received spot provisioning timeout event for instanceFleet, clusterId: "
            + event["detail"]["clusterId"]
        )

        describeClusterResponse = describe_cluster(event)

        shouldTerminateCluster = is_cluster_eligible_for_termination(
            event, describeClusterResponse
        )
        if shouldTerminateCluster:
            terminate_cluster(event)

            clusterId = create_cluster(event)
            print("Created a new cluster, clusterId: " + clusterId)
        else:
            print(
                "Cluster is not eligible for termination, clusterId: "
                + event["detail"]["clusterId"]
            )

    else:
        print("Received event is not spot provisioning timeout event, skipping")
```
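The earlier recommendation to use a lower timeout for jobs with stricter SLAs can be sketched as a small helper that builds the `ResizeSpecifications` structure for a fleet. This is a minimal sketch under stated assumptions: `build_resize_specifications` is a hypothetical helper, the 10-minute value is only an example, and `task_fleet` is a fragment that omits `InstanceTypeConfigs`. The `SpotResizeSpecification` key matches the example function above, and `OnDemandResizeSpecification` is the corresponding key from the resize specifications API reference.

```python
def build_resize_specifications(spot_timeout_minutes=None,
                                on_demand_timeout_minutes=None):
    """Hypothetical helper: build a fleet ResizeSpecifications dict.

    Only the timeouts that are provided are included in the result.
    """
    specs = {}
    for key, minutes in (
        ("SpotResizeSpecification", spot_timeout_minutes),
        ("OnDemandResizeSpecification", on_demand_timeout_minutes),
    ):
        if minutes is None:
            continue
        if minutes <= 0:
            raise ValueError(f"{key} timeout must be a positive number of minutes")
        specs[key] = {"TimeoutDurationMinutes": int(minutes)}
    return specs


# Fragment of a TASK fleet definition with a shorter Spot timeout than the
# 30 minutes used in the example function (InstanceTypeConfigs omitted).
task_fleet = {
    "InstanceFleetType": "TASK",
    "TargetOnDemandCapacity": 0,
    "TargetSpotCapacity": 100,
    "ResizeSpecifications": build_resize_specifications(spot_timeout_minutes=10),
}

print(task_fleet["ResizeSpecifications"])
```

A shorter timeout makes the provisioning timeout event, and any automation attached to it, fire sooner. That suits jobs that cannot wait long for Spot capacity, at the cost of giving capacity less time to recover on its own.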