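# Responding to Amazon EMR cluster insufficient instance capacity events

## Overview

Amazon EMR clusters return the event code `EC2 provisioning - Insufficient Instance Capacity` when the selected Availability Zone does not have enough capacity to fulfill a cluster start or resize request. The event is emitted periodically, for both instance groups and instance fleets, when Amazon EMR repeatedly encounters insufficient capacity exceptions and cannot fulfill the provisioning request for a cluster start or cluster resize operation.

This page describes how best to respond to this event type when it occurs for your EMR cluster.

## Recommended response to an insufficient capacity event

We recommend that you respond to an insufficient capacity event in one of the following ways:

+ Wait for capacity to recover. Capacity shifts frequently, so an insufficient capacity exception can recover on its own. Your cluster starts or finishes resizing as soon as Amazon EC2 capacity becomes available.
+ Alternatively, terminate the cluster, modify the instance type configuration, and create a new cluster with the updated cluster configuration request. For more information, see [Availability Zone flexibility for an Amazon EMR cluster](emr-flexibility.md).

You can also set up rules or automated responses to an insufficient capacity event, as described in the next section.

## Automated recovery from an insufficient capacity event

You can build automation in response to Amazon EMR events, such as events with the event code `EC2 provisioning - Insufficient Instance Capacity`. For example, the following AWS Lambda function terminates an EMR cluster whose instance groups use On-Demand Instances, and then creates a new EMR cluster whose instance groups contain different instance types than the original request.

The automated process is triggered when both of the following conditions hold:

+ The insufficient capacity event has been emitted for primary or core nodes for more than 20 minutes.
+ The cluster is not in a **RUNNING** or **WAITING** state. For more information about EMR cluster states, see [Understanding the cluster lifecycle](emr-overview.md#emr-overview-cluster-lifecycle).

**Note**
When you build an automated process for insufficient capacity exceptions, keep in mind that the insufficient capacity event is recoverable. Capacity shifts often, and your cluster resumes the resize or start operation as soon as Amazon EC2 capacity becomes available.
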
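For the automation to run, the event first has to reach the Lambda function. The following is a minimal sketch of an event pattern that selects only insufficient capacity events for instance groups. It assumes the function is invoked through an Amazon EventBridge rule and that the events carry the source value `aws.emr`; neither detail is stated on this page, so verify both against your own events. The `build_event_pattern` and `matches` helpers are hypothetical names, and `matches` is a simplified local stand-in that handles only exact-value lists, not the full EventBridge matching syntax.

```python
import json

DETAIL_TYPE = "EMR Instance Group Provisioning"
EVENT_CODE = "EC2 provisioning - Insufficient Instance Capacity"


def build_event_pattern():
    """Return an event pattern as a plain dict (assumed source: aws.emr)."""
    return {
        "source": ["aws.emr"],  # assumption: verify against a real event
        "detail-type": [DETAIL_TYPE],
        "detail": {"eventCode": [EVENT_CODE]},
    }


def matches(pattern, event):
    """Simplified local check: every pattern key must be present in the
    event, and the event value must be one of the listed values."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            # Nested pattern: recurse into the nested event object.
            if not isinstance(event[key], dict):
                return False
            if not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True


if __name__ == "__main__":
    # Print the pattern as JSON, ready to paste into a rule definition.
    print(json.dumps(build_event_pattern(), indent=2))
```

Filtering at the rule level keeps unrelated EMR events from invoking the function at all. The function should still validate the event itself, because a rule can be edited independently of the code.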
**Example Function that responds to an insufficient capacity event**

```
# Lambda code for Python 3.10; the handler is lambda_function.lambda_handler
# Note: the related IAM role requires permission to use Amazon EMR

import boto3
import datetime
from datetime import timezone

INSUFFICIENT_CAPACITY_EXCEPTION_DETAIL_TYPE = "EMR Instance Group Provisioning"
INSUFFICIENT_CAPACITY_EXCEPTION_EVENT_CODE = (
    "EC2 provisioning - Insufficient Instance Capacity"
)
ALLOWED_INSTANCE_TYPES_TO_USE = [
    "m5.xlarge",
    "c5.xlarge",
    "m5.4xlarge",
    "m5.2xlarge",
    "t3.xlarge",
]
CLUSTER_START_ACCEPTABLE_STATES = ["WAITING", "RUNNING"]
CLUSTER_START_SLA = 20  # minutes

CLIENT = boto3.client("emr", region_name="us-east-1")


# Checks if the incoming event is 'EMR Instance Group Provisioning' with
# eventCode 'EC2 provisioning - Insufficient Instance Capacity'
def is_insufficient_capacity_event(event):
    detail = event.get("detail")
    if not detail:
        return False
    return (
        event.get("detail-type") == INSUFFICIENT_CAPACITY_EXCEPTION_DETAIL_TYPE
        and detail.get("eventCode") == INSUFFICIENT_CAPACITY_EXCEPTION_EVENT_CODE
    )


# Checks if the cluster is eligible for termination
def is_cluster_eligible_for_termination(event, describeClusterResponse):
    # instanceGroupType could be CORE, MASTER, or TASK
    instanceGroupType = event["detail"]["instanceGroupType"]
    clusterCreationTime = describeClusterResponse["Cluster"]["Status"]["Timeline"][
        "CreationDateTime"
    ]
    clusterState = describeClusterResponse["Cluster"]["Status"]["State"]

    # Use a timezone-aware UTC timestamp so the comparison with the
    # timezone-aware CreationDateTime is correct in any local time zone
    now = datetime.datetime.now(timezone.utc)

    isClusterStartSlaBreached = clusterCreationTime < now - datetime.timedelta(
        minutes=CLUSTER_START_SLA
    )

    # Eligible if the instance group receiving the insufficient capacity
    # exception is CORE or PRIMARY (MASTER), more than 20 minutes have passed
    # since the cluster was created, and the cluster state has not yet
    # reached RUNNING or WAITING
    return (
        instanceGroupType in ("CORE", "MASTER")
        and isClusterStartSlaBreached
        and clusterState not in CLUSTER_START_ACCEPTABLE_STATES
    )


# Chooses the first item from the allowed list that differs from the exempt value
def choice_excluding(exempt):
    for i in ALLOWED_INSTANCE_TYPES_TO_USE:
        if i != exempt:
            return i


# Creates a new cluster by choosing a different instance type
def create_cluster(event):
    # instanceGroupType could be CORE, MASTER, or TASK
    instanceGroupType = event["detail"]["instanceGroupType"]

    # The following two lines assume that the customer who created the cluster
    # already knows which instance types were used in the original request
    instanceTypesFromOriginalRequestMaster = "m5.xlarge"
    instanceTypesFromOriginalRequestCore = "m5.xlarge"

    # Select new instance types to include in the new createCluster request
    instanceTypeForMaster = (
        instanceTypesFromOriginalRequestMaster
        if instanceGroupType != "MASTER"
        else choice_excluding(instanceTypesFromOriginalRequestMaster)
    )
    instanceTypeForCore = (
        instanceTypesFromOriginalRequestCore
        if instanceGroupType != "CORE"
        else choice_excluding(instanceTypesFromOriginalRequestCore)
    )

    print("Starting to create cluster...")
    instances = {
        "InstanceGroups": [
            {
                "InstanceRole": "MASTER",
                "InstanceCount": 1,
                "InstanceType": instanceTypeForMaster,
                "Market": "ON_DEMAND",
                "Name": "Master",
            },
            {
                "InstanceRole": "CORE",
                "InstanceCount": 1,
                "InstanceType": instanceTypeForCore,
                "Market": "ON_DEMAND",
                "Name": "Core",
            },
        ]
    }
    response = CLIENT.run_job_flow(
        Name="Test Cluster",
        Instances=instances,
        VisibleToAllUsers=True,
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
        ReleaseLabel="emr-6.10.0",
    )

    return response["JobFlowId"]


# Terminates the cluster using the clusterId received in the event
def terminate_cluster(event):
    print("Trying to terminate cluster, clusterId: " + event["detail"]["clusterId"])
    response = CLIENT.terminate_job_flows(JobFlowIds=[event["detail"]["clusterId"]])
    print(f"Terminate cluster response: {response}")


# Fetches the current cluster description for the clusterId in the event
def describe_cluster(event):
    response = CLIENT.describe_cluster(ClusterId=event["detail"]["clusterId"])
    return response


def lambda_handler(event, context):
    if is_insufficient_capacity_event(event):
        print(
            "Received insufficient capacity event for instanceGroup, clusterId: "
            + event["detail"]["clusterId"]
        )

        describeClusterResponse = describe_cluster(event)

        shouldTerminateCluster = is_cluster_eligible_for_termination(
            event, describeClusterResponse
        )
        if shouldTerminateCluster:
            terminate_cluster(event)

            clusterId = create_cluster(event)
            print("Created a new cluster, clusterId: " + clusterId)
        else:
            print(
                "Cluster is not eligible for termination, clusterId: "
                + event["detail"]["clusterId"]
            )

    else:
        print("Received event is not insufficient capacity event, skipping")
```
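Because the function above terminates a cluster, it can help to exercise the eligibility decision locally before deploying it. The following is a minimal sketch that restates the same rule as a pure function with no AWS calls. The name `should_replace_cluster` and its parameters are hypothetical and are not part of the example above; the caller passes in the current time so that the check is deterministic.

```python
from datetime import datetime, timedelta, timezone

ACCEPTABLE_STATES = ("WAITING", "RUNNING")
START_SLA = timedelta(minutes=20)


def should_replace_cluster(group_type, state, created_at, now):
    """Mirror of the eligibility rule: the affected group is CORE or MASTER,
    the cluster is older than the SLA, and it has not reached an acceptable state."""
    sla_breached = (now - created_at) > START_SLA
    return (
        group_type in ("CORE", "MASTER")
        and sla_breached
        and state not in ACCEPTABLE_STATES
    )


if __name__ == "__main__":
    # Fixed, timezone-aware timestamps keep the example reproducible.
    now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    created = now - timedelta(minutes=30)

    for group_type, state in [
        ("CORE", "STARTING"),
        ("TASK", "STARTING"),
        ("MASTER", "RUNNING"),
    ]:
        decision = should_replace_cluster(group_type, state, created, now)
        print(group_type, state, decision)
```

Keeping the decision separate from the boto3 calls means the time and state thresholds can be checked with ordinary unit tests, without any clusters being created or terminated.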