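Model observability for training jobs on SageMaker HyperPod clusters orchestrated by Amazon EKS
SageMaker HyperPod clusters orchestrated with Amazon EKS can integrate with the MLflow application on Amazon SageMaker Studio. Cluster administrators set up the MLflow server and connect it to the SageMaker HyperPod clusters. Data scientists can then gain insights into their models.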
To set up an MLflow server using the AWS CLI
The cluster administrator must create an MLflow tracking server.
-
Ensure that the eks-auth:AssumeRoleForPodIdentity permission exists in the IAM execution role for SageMaker HyperPod.
-
If the eks-pod-identity-agent add-on is not already installed on the EKS cluster, install it.

    aws eks create-addon \
        --cluster-name <eks_cluster_name> \
        --addon-name eks-pod-identity-agent \
        --addon-version vx.y.z-eksbuild.1

-
Create a trust-relationship.json file for a new role that lets the Pod call MLflow APIs.

    cat >trust-relationship.json <<EOF
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowEksAuthToAssumeRoleForPodIdentity",
                "Effect": "Allow",
                "Principal": {
                    "Service": "pods.eks.amazonaws.com"
                },
                "Action": [
                    "sts:AssumeRole",
                    "sts:TagSession"
                ]
            }
        ]
    }
    EOF

Run the following command to create the role and attach the trust relationship.

    aws iam create-role --role-name hyperpod-mlflow-role \
        --assume-role-policy-document file://trust-relationship.json \
        --description "allow pods to emit mlflow metrics and put data in s3"

-
Create the following policy, which grants the Pod permission to call all sagemaker-mlflow operations and to put model artifacts in S3. The tracking server already has S3 permissions, but if the model artifacts are too large, the MLflow code calls S3 directly to upload them.

    cat >hyperpod-mlflow-policy.json <<EOF
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "sagemaker-mlflow:AccessUI",
                    "sagemaker-mlflow:CreateExperiment",
                    "sagemaker-mlflow:SearchExperiments",
                    "sagemaker-mlflow:GetExperiment",
                    "sagemaker-mlflow:GetExperimentByName",
                    "sagemaker-mlflow:DeleteExperiment",
                    "sagemaker-mlflow:RestoreExperiment",
                    "sagemaker-mlflow:UpdateExperiment",
                    "sagemaker-mlflow:CreateRun",
                    "sagemaker-mlflow:DeleteRun",
                    "sagemaker-mlflow:RestoreRun",
                    "sagemaker-mlflow:GetRun",
                    "sagemaker-mlflow:LogMetric",
                    "sagemaker-mlflow:LogBatch",
                    "sagemaker-mlflow:LogModel",
                    "sagemaker-mlflow:LogInputs",
                    "sagemaker-mlflow:SetExperimentTag",
                    "sagemaker-mlflow:SetTag",
                    "sagemaker-mlflow:DeleteTag",
                    "sagemaker-mlflow:LogParam",
                    "sagemaker-mlflow:GetMetricHistory",
                    "sagemaker-mlflow:SearchRuns",
                    "sagemaker-mlflow:ListArtifacts",
                    "sagemaker-mlflow:UpdateRun",
                    "sagemaker-mlflow:CreateRegisteredModel",
                    "sagemaker-mlflow:GetRegisteredModel",
                    "sagemaker-mlflow:RenameRegisteredModel",
                    "sagemaker-mlflow:UpdateRegisteredModel",
                    "sagemaker-mlflow:DeleteRegisteredModel",
                    "sagemaker-mlflow:GetLatestModelVersions",
                    "sagemaker-mlflow:CreateModelVersion",
                    "sagemaker-mlflow:GetModelVersion",
                    "sagemaker-mlflow:UpdateModelVersion",
                    "sagemaker-mlflow:DeleteModelVersion",
                    "sagemaker-mlflow:SearchModelVersions",
                    "sagemaker-mlflow:GetDownloadURIForModelVersionArtifacts",
                    "sagemaker-mlflow:TransitionModelVersionStage",
                    "sagemaker-mlflow:SearchRegisteredModels",
                    "sagemaker-mlflow:SetRegisteredModelTag",
                    "sagemaker-mlflow:DeleteRegisteredModelTag",
                    "sagemaker-mlflow:DeleteModelVersionTag",
                    "sagemaker-mlflow:DeleteRegisteredModelAlias",
                    "sagemaker-mlflow:SetRegisteredModelAlias",
                    "sagemaker-mlflow:GetModelVersionByAlias"
                ],
                "Resource": "arn:aws:sagemaker:us-west-2:111122223333:mlflow-tracking-server/<ml tracking server name>"
            },
            {
                "Effect": "Allow",
                "Action": [
                    "s3:PutObject"
                ],
                "Resource": "arn:aws:s3:::<mlflow-s3-bucket_name>/*"
            }
        ]
    }
    EOF

Note
The ARNs must be those of the MLflow server and of the S3 bucket that was set up with the MLflow server when you created it by following the instructions in Set up MLflow infrastructure. The S3 statement targets the objects in the bucket (the /* suffix) because s3:PutObject is an object-level action.
-
Attach the mlflow-metrics-emit-policy policy to hyperpod-mlflow-role, using the policy document saved in the previous step.

    aws iam put-role-policy \
        --role-name hyperpod-mlflow-role \
        --policy-name mlflow-metrics-emit-policy \
        --policy-document file://hyperpod-mlflow-policy.json

-
Create a Kubernetes service account that the Pod uses to access the MLflow server.

    cat >mlflow-service-account.yaml <<EOF
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: mlflow-service-account
      namespace: kubeflow
    EOF

Run the following command to apply it to the EKS cluster.

    kubectl apply -f mlflow-service-account.yaml

-
Create a Pod identity association.

    aws eks create-pod-identity-association \
        --cluster-name EKS_CLUSTER_NAME \
        --role-arn arn:aws:iam::111122223333:role/hyperpod-mlflow-role \
        --namespace kubeflow \
        --service-account mlflow-service-account
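The Region, account ID, tracking server name, and bucket name recur across the files in the steps above, so it can help to generate the JSON documents from a single set of values. The following is a minimal sketch of that idea and is not part of the original procedure: the function names are hypothetical, the action list is a short illustrative subset of the actions in the policy above, and the S3 statement is scoped to the bucket's objects (bucket ARN plus /*) on the basis that s3:PutObject is an object-level action.

```python
import json
import re
from typing import Sequence

# Illustrative subset only; the policy in the steps above lists many more actions.
EXAMPLE_ACTIONS = (
    "sagemaker-mlflow:CreateRun",
    "sagemaker-mlflow:LogMetric",
    "sagemaker-mlflow:LogParam",
)


def tracking_server_arn(region: str, account_id: str, server_name: str) -> str:
    """Build a tracking server ARN in the same shape as the policy's Resource."""
    if re.fullmatch(r"[0-9]{12}", account_id) is None:
        raise ValueError("account_id must be exactly 12 digits")
    return (
        f"arn:aws:sagemaker:{region}:{account_id}"
        f":mlflow-tracking-server/{server_name}"
    )


def build_trust_policy() -> dict:
    """Trust policy that lets EKS Pod Identity assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowEksAuthToAssumeRoleForPodIdentity",
                "Effect": "Allow",
                "Principal": {"Service": "pods.eks.amazonaws.com"},
                "Action": ["sts:AssumeRole", "sts:TagSession"],
            }
        ],
    }


def build_permissions_policy(
    server_arn: str, bucket: str, actions: Sequence[str] = EXAMPLE_ACTIONS
) -> dict:
    """Permissions policy: MLflow actions on the server, uploads to the bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": list(actions), "Resource": server_arn},
            {
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                # Object-level action, so the resource is the bucket's objects.
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }


if __name__ == "__main__":
    # Placeholder values; substitute your own Region, account, and names.
    arn = tracking_server_arn("us-west-2", "111122223333", "tracking-server-name")
    print(json.dumps(build_trust_policy(), indent=2))
    print(json.dumps(build_permissions_policy(arn, "mlflow-s3-bucket-name"), indent=2))
```

The printed documents can be redirected to trust-relationship.json and hyperpod-mlflow-policy.json and then passed to the aws iam commands shown in the steps above.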
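To collect metrics from training jobs to the MLflow server
Data scientists need to set up the training script and the Docker image so that they emit metrics to the MLflow server.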
-
Add the following lines at the beginning of your training script.

    import os

    import mlflow

    # Set the Tracking Server URI using the ARN of the Tracking Server you created
    mlflow.set_tracking_uri(os.environ['MLFLOW_TRACKING_ARN'])
    # Enable autologging in MLflow
    mlflow.autolog()

-
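Build a Docker image with the training script and push it to Amazon ECR. Get the ARN of the ECR container. For more information about building and pushing a Docker image, see Pushing a Docker image in the ECR User Guide.
Tip
Make sure that the Docker file installs the mlflow and sagemaker-mlflow packages. To learn more about installing the packages, their requirements, and compatible package versions, see Install MLflow and the SageMaker AI MLflow plugin.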
-
Add the service account to the training job Pods to give them access to hyperpod-mlflow-role. This allows the Pods to call MLflow APIs. Use the following SageMaker HyperPod CLI job submission template, saved in a file named mlflow-test.yaml.

    defaults:
      - override hydra/job_logging: stdout

    hydra:
      run:
        dir: .
      output_subdir: null

    training_cfg:
      entry_script: ./train.py
      script_args: []
      run:
        name: test-job-with-mlflow # Current run name
        nodes: 2 # Number of nodes to use for current training
        # ntasks_per_node: 1 # Number of devices to use per node
    cluster:
      cluster_type: k8s # currently k8s only
      instance_type: ml.c5.2xlarge
      cluster_config:
        # name of service account associated with the namespace
        service_account_name: mlflow-service-account
        # persistent volume, usually used to mount FSx
        persistent_volume_claims: null
        namespace: kubeflow
        # required node affinity to select nodes with SageMaker HyperPod
        # labels and passed health check if burn-in enabled
        label_selector:
          required:
            sagemaker.amazonaws.com/node-health-status:
              - Schedulable
          preferred:
            sagemaker.amazonaws.com/deep-health-check-status:
              - Passed
          weights:
            - 100
        pullPolicy: IfNotPresent # policy to pull container, can be Always, IfNotPresent and Never
        restartPolicy: OnFailure # restart policy

    base_results_dir: ./result # Location to store the results, checkpoints and logs.
    container: 111122223333.dkr.ecr.us-west-2.amazonaws.com/tag # container to use

    env_vars:
      NCCL_DEBUG: INFO # Logging level for NCCL. Set to "INFO" for debug information
      MLFLOW_TRACKING_ARN: arn:aws:sagemaker:us-west-2:111122223333:mlflow-tracking-server/tracking-server-name

-
Start the job using the YAML file as follows.

    hyperpod start-job --config-file /path/to/mlflow-test.yaml

-
Generate a presigned URL for the MLflow tracking server. You can open the link in your browser and start tracking your training job.

    aws sagemaker create-presigned-mlflow-tracking-server-url \
        --tracking-server-name "tracking-server-name" \
        --session-expiration-duration-in-seconds 1800 \
        --expires-in-seconds 300 \
        --region region
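Because the training script reads the tracking server ARN from the MLFLOW_TRACKING_ARN environment variable, a mistyped value in the job template (for example, an account ID with a missing digit) may only surface once the job is already running. The following is a minimal sketch of a fail-fast check that a training script could run before configuring MLflow. It is an illustration rather than part of the documented procedure: the helper names are hypothetical, and the regular expression is an assumed approximation of the ARN shape shown in the steps above, not an official validation rule.

```python
import os
import re
from typing import Mapping, Optional

# Assumed ARN shape, based on the examples in this procedure:
# arn:<partition>:sagemaker:<region>:<12-digit account>:mlflow-tracking-server/<name>
_TRACKING_ARN = re.compile(
    r"arn:(?P<partition>aws[a-z-]*):sagemaker:(?P<region>[a-z0-9-]+):"
    r"(?P<account>[0-9]{12}):mlflow-tracking-server/(?P<name>[A-Za-z0-9-]+)"
)


def parse_tracking_arn(arn: str) -> dict:
    """Split a tracking server ARN into its parts, or raise ValueError."""
    match = _TRACKING_ARN.fullmatch(arn.strip())
    if match is None:
        raise ValueError(f"not a well-formed MLflow tracking server ARN: {arn!r}")
    return match.groupdict()


def tracking_arn_from_env(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the validated ARN from MLFLOW_TRACKING_ARN, failing early if unset."""
    env = os.environ if env is None else env
    arn = env.get("MLFLOW_TRACKING_ARN", "").strip()
    if not arn:
        raise RuntimeError("MLFLOW_TRACKING_ARN is not set in the job environment")
    parse_tracking_arn(arn)  # raises ValueError on a malformed value
    return arn
```

In a training script, the validated value could then be passed to MLflow in place of the direct environment lookup, for example mlflow.set_tracking_uri(tracking_arn_from_env()).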