

This page is a machine-translated version. In the event of any discrepancy between this translation and the original English content, the English original prevails.

# Customize a model with Amazon SageMaker AI
<a name="customize-model"></a>

Amazon SageMaker AI model customization turns what has traditionally been a complex, months-long effort to customize AI models into a streamlined workflow that can be completed in days. It addresses a key challenge for AI developers who need to customize foundation models with proprietary data to create highly differentiated customer experiences. This SageMaker AI guide provides detailed, step-by-step customization documentation, including guidance and advanced configuration options. For a brief overview of Nova model customization in SageMaker, see [Customizing and fine-tuning](https://docs.aws.amazon.com/nova/latest/userguide/customize-fine-tune-sagemaker.html) in the Amazon Nova User Guide.

The capability includes a new guided user interface that understands natural-language requirements, along with a comprehensive set of advanced model customization techniques, all backed by serverless infrastructure that removes the operational overhead of managing compute resources. Whether you are building a legal research application, enhancing a customer service chatbot, or developing a domain-specific AI agent, this capability accelerates your path from proof-of-concept to production deployment.

Model customization, powered by Amazon Bedrock evaluations, can securely transfer data for processing within the AWS Regions in your geography. For more information, see the [Amazon Bedrock evaluation documentation](https://docs.aws.amazon.com/bedrock/latest/userguide/evaluation-judge.html).

## Key concepts
<a name="model-customize-concepts"></a>

**Serverless training**

A fully managed compute infrastructure that removes all infrastructure complexity so you can focus entirely on model development. This includes automatic provisioning of GPU instances (P5, P4de, P4d, G5) based on model size and training requirements, pre-optimized training recipes that encode best practices for each customization technique, real-time monitoring through metrics and logs accessible in the UI, and automatic resource cleanup after training completes to optimize cost.

**Model customization techniques**

A comprehensive set of advanced methods, including supervised fine-tuning (SFT), direct preference optimization (DPO), reinforcement learning with verifiable rewards (RLVR), and reinforcement learning from AI feedback (RLAIF).

**Custom model**

A specialized version of a foundation model that has been adapted to a specific use case by training on your own data. The resulting AI model retains the general capabilities of the original foundation model while adding domain-specific knowledge, terminology, style, or behavior tailored to your requirements.

**AI model customization assets**

The resources and artifacts used to train, refine, and evaluate a custom model during the customization process. These assets include datasets, which are collections of training examples (prompt-response pairs, domain-specific text, or labeled data) used to fine-tune a foundation model to learn specific behaviors, knowledge, or styles, and evaluators, which are mechanisms for assessing and improving model performance through either reward functions (code-based logic that scores model outputs against specific criteria, used for RLVR training and custom scorer evaluation) or reward prompts (natural-language instructions that guide an LLM in judging the quality of model responses, used for RLAIF training and LLM-as-a-judge evaluation).

**Model package group**

A collection container that tracks all logged models from your training jobs, providing a central location for model versions and their lineage.

**Logged model**

The output SageMaker AI creates when it runs a serverless training job. This can be a fine-tuned model (successful job), checkpoints (failed job with checkpoints), or associated metadata (failed job without checkpoints).

**Registered model**

A logged model that has been marked for formal tracking and governance purposes, enabling full lineage and lifecycle management.

**Lineage**

The relationships among training jobs, input datasets, output models, evaluation jobs, and deployments, captured automatically across SageMaker AI and Amazon Bedrock.

**Cross-account sharing**

The ability to share models, datasets, and evaluators across AWS accounts using AWS Resource Access Manager (RAM) while maintaining full lineage visibility.

# Customize Amazon Nova models on Amazon SageMaker AI
<a name="nova-model"></a>

This topic has moved. For the latest information, see [Customize Amazon Nova models on Amazon SageMaker AI](https://docs.aws.amazon.com//nova/latest/userguide/nova-model.html) in the Amazon Nova 1.0 User Guide or the [Amazon Nova 2.0 User Guide](https://docs.aws.amazon.com//nova/latest/nova2-userguide/nova-model.html).

# Amazon Nova recipes
<a name="nova-model-recipes"></a>

This topic has moved. For the latest information, see [Amazon Nova recipes](https://docs.aws.amazon.com//nova/latest/userguide/nova-model-recipes.html) in the Amazon Nova 1.0 User Guide or the [Amazon Nova 2.0 User Guide](https://docs.aws.amazon.com//nova/latest/nova2-userguide/nova-model-recipes.html).

# Amazon Nova customization on SageMaker training jobs
<a name="nova-model-training-job"></a>

This topic has moved. For the latest information, see [Amazon Nova customization on SageMaker training jobs](https://docs.aws.amazon.com//nova/latest/userguide/nova-model-training-job.html) in the Amazon Nova 1.0 User Guide or the [Amazon Nova 2.0 User Guide](https://docs.aws.amazon.com//nova/latest/nova2-userguide/nova-model-training-job.html).

**Topics**
+ [Amazon Nova distillation](nova-distillation.md)
+ [Nova customization SDK](nova-customization-sdk.md)
+ [Fine-tune Amazon Nova models with SageMaker training jobs](nova-fine-tuning-training-job.md)
+ [Monitor progress across iterations](nova-model-monitor.md)
+ [Evaluate your SageMaker AI-trained model](nova-model-evaluation.md)

# Amazon Nova distillation
<a name="nova-distillation"></a>

This topic has moved. For the latest information, see [Amazon Nova distillation](https://docs.aws.amazon.com//nova/latest/userguide/nova-distillation.html) in the Amazon Nova 1.0 User Guide.

# Nova customization SDK
<a name="nova-customization-sdk"></a>

The Amazon Nova customization SDK is a comprehensive Python SDK for customizing Amazon Nova models across their entire lifecycle, from training and evaluation to deployment and inference.

This topic has moved. For the latest information, see [Nova customization SDK](https://docs.aws.amazon.com//nova/latest/userguide/nova-customization-sdk.html) in the Amazon Nova 1.0 User Guide or the [Amazon Nova 2.0 User Guide](https://docs.aws.amazon.com//nova/latest/nova2-userguide/nova-customization-sdk.html).

# Fine-tune Amazon Nova models with SageMaker training jobs
<a name="nova-fine-tuning-training-job"></a>

This topic has moved. For the latest information, see [Fine-tune Amazon Nova models with SageMaker training jobs](https://docs.aws.amazon.com//nova/latest/userguide/nova-fine-tune-1.html) in the Amazon Nova 1.0 User Guide or the [Amazon Nova 2.0 User Guide](https://docs.aws.amazon.com//nova/latest/nova2-userguide/smtj-training.html).

# Monitor progress across iterations
<a name="nova-model-monitor"></a>

This topic has moved. For the latest information, see [Monitor progress across iterations](https://docs.aws.amazon.com//nova/latest/userguide/nova-model-monitor.html) in the Amazon Nova 1.0 User Guide or the [Amazon Nova 2.0 User Guide](https://docs.aws.amazon.com//nova/latest/nova2-userguide/nova-model-monitor.html).

# Evaluate your SageMaker AI-trained model
<a name="nova-model-evaluation"></a>

This topic has moved. For the latest information, see [Evaluate your SageMaker AI-trained model](https://docs.aws.amazon.com//nova/latest/userguide/nova-model-evaluation.html) in the Amazon Nova 1.0 User Guide or the [Amazon Nova 2.0 User Guide](https://docs.aws.amazon.com//nova/latest/nova2-userguide/nova-model-evaluation.html).

# Amazon Nova customization on Amazon SageMaker HyperPod
<a name="nova-hp"></a>

This topic has moved. For the latest information, see [Amazon Nova customization on Amazon SageMaker HyperPod](https://docs.aws.amazon.com//nova/latest/userguide/nova-hp.html) in the Amazon Nova 1.0 User Guide or the [Amazon Nova 2.0 User Guide](https://docs.aws.amazon.com//nova/latest/nova2-userguide/nova-hp.html).

# Nova customization SDK
<a name="nova-hp-customization-sdk"></a>

This topic has moved. For the latest information, see [Nova customization SDK](https://docs.aws.amazon.com//nova/latest/userguide/nova-hp-customization-sdk.html) in the Amazon Nova 1.0 User Guide or the [Amazon Nova 2.0 User Guide](https://docs.aws.amazon.com//nova/latest/nova2-userguide/nova-hp-customization-sdk.html).

# SageMaker HyperPod essential commands guide
<a name="nova-hp-essential-commands-guide"></a>

This topic has moved. For the latest information, see [SageMaker HyperPod essential commands guide](https://docs.aws.amazon.com//nova/latest/userguide/nova-hp-essential-commands-guide.html) in the Amazon Nova 1.0 User Guide or the [Amazon Nova 2.0 User Guide](https://docs.aws.amazon.com//nova/latest/nova2-userguide/nova-hp-essential-commands-guide.html).

# Create a HyperPod EKS cluster with a restricted instance group (RIG)
<a name="nova-hp-cluster"></a>

This topic has moved. For the latest information, see [Create a SageMaker HyperPod EKS cluster with a RIG](https://docs.aws.amazon.com//nova/latest/userguide/nova-hp-cluster.html) in the Amazon Nova 1.0 User Guide or the [Amazon Nova 2.0 User Guide](https://docs.aws.amazon.com//nova/latest/nova2-userguide/nova-hp-cluster.html).

# Nova Forge SageMaker AI access and setup for HyperPod
<a name="nova-forge-hp-access"></a>

This topic has moved. For the latest information, see [Nova Forge SageMaker AI access and setup for HyperPod](https://docs.aws.amazon.com//nova/latest/userguide/nova-forge-hp-access.html) in the Amazon Nova 1.0 User Guide or the [Amazon Nova 2.0 User Guide](https://docs.aws.amazon.com//nova/latest/nova2-userguide/nova-forge-hp-access.html).

# Amazon Nova model training
<a name="nova-hp-training"></a>

This topic has moved. For the latest information, see [Amazon Nova model training](https://docs.aws.amazon.com//nova/latest/userguide/nova-hp-training.html) in the Amazon Nova 1.0 User Guide or the [Amazon Nova 2.0 User Guide](https://docs.aws.amazon.com//nova/latest/nova2-userguide/nova-hp-training.html).

# Fine-tune Amazon Nova models on Amazon SageMaker HyperPod
<a name="nova-hp-fine-tune"></a>

This topic has moved. For the latest information, see [Fine-tune Amazon Nova models on Amazon SageMaker HyperPod](https://docs.aws.amazon.com//nova/latest/userguide/nova-hp-fine-tune.html) in the Amazon Nova 1.0 User Guide.

# Evaluate your trained model
<a name="nova-hp-evaluate"></a>

This topic has moved. For the latest information, see [Evaluate your trained model](https://docs.aws.amazon.com//nova/latest/userguide/nova-hp-evaluate.html) in the Amazon Nova 1.0 User Guide or the [Amazon Nova 2.0 User Guide](https://docs.aws.amazon.com//nova/latest/nova2-userguide/nova-hp-evaluate.html).

# Iterative training
<a name="nova-iterative-training"></a>

Iterative training lets you improve model performance over multiple training cycles, building on previous checkpoints to systematically address failure modes and adapt to evolving requirements.

This topic has moved. For the latest information, see [Iterative training](https://docs.aws.amazon.com//nova/latest/userguide/nova-iterative-training.html) in the Amazon Nova 1.0 User Guide or the [Amazon Nova 2.0 User Guide](https://docs.aws.amazon.com//nova/latest/nova2-userguide/nova-iterative-training.html).

# Amazon Bedrock inference
<a name="nova-model-bedrock-inference"></a>

This topic has moved. For the latest information, see [Amazon Bedrock inference](https://docs.aws.amazon.com//nova/latest/userguide/nova-model-bedrock-inference.html) in the Amazon Nova 1.0 User Guide or the [Amazon Nova 2.0 User Guide](https://docs.aws.amazon.com//nova/latest/nova2-userguide/nova-model-bedrock-inference.html).

# Limitations for customizing Amazon Nova models
<a name="nova-model-limitations"></a>

This topic has moved. For the latest information, see [Limitations for customizing Amazon Nova models](https://docs.aws.amazon.com//nova/latest/nova2-userguide/nova-model-limitations.html) in the Amazon Nova 2.0 User Guide.

# Open-weight model customization
<a name="model-customize-open-weight"></a>

This section walks you through getting started with open-weight model customization.

**Topics**
+ [Prerequisites](model-customize-open-weight-prereq.md)
+ [Create assets for model customization in the UI](model-customize-open-weight-create-assets-ui.md)
+ [AI model customization job submission](model-customize-open-weight-job.md)
+ [Model evaluation job submission](model-customize-open-weight-evaluation.md)
+ [Model deployment](model-customize-open-weight-deployment.md)
+ [Sample datasets and evaluators](model-customize-open-weight-samples.md)

# Prerequisites
<a name="model-customize-open-weight-prereq"></a>

Before you begin, complete the following prerequisites:
+ Sign in to a SageMaker AI domain with Studio access. If you don't have permission to set Studio as the default experience for your domain, contact your administrator. For more information, see [Amazon SageMaker AI domain overview](https://docs.aws.amazon.com/sagemaker/latest/dg/gs-studio-onboard.html).
+ Update the AWS CLI by following the steps in [Installing the current AWS CLI version](https://docs.aws.amazon.com/cli/latest/userguide/install-cliv1.html#install-tool-bundled).
+ Run `aws configure` on your local machine and provide your AWS credentials. For information about AWS credentials, see [Understanding and getting your AWS credentials](https://docs.aws.amazon.com/IAM/latest/UserGuide/security-creds.html).

## Required IAM permissions
<a name="model-customize-open-weight-iam"></a>

Customizing AI models with SageMaker requires adding the appropriate permissions to your SageMaker AI domain execution role. To do this, you can create an inline IAM permissions policy and attach it to the IAM role. For information about adding policies, see [Adding and removing IAM identity permissions](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_manage-attach-detach.html) in the *AWS Identity and Access Management User Guide*.

```
{
    "Version": "2012-10-17",		 	 	 
    "Statement": [
        {
            "Sid": "AllowNonAdminStudioActions",
            "Effect": "Allow",
            "Action": [
                "sagemaker:CreatePresignedDomainUrl",
                "sagemaker:DescribeDomain",
                "sagemaker:DescribeUserProfile",
                "sagemaker:DescribeSpace",
                "sagemaker:ListSpaces",
                "sagemaker:DescribeApp",
                "sagemaker:ListApps"
            ],
            "Resource": [
                "arn:aws:sagemaker:*:*:domain/*",
                "arn:aws:sagemaker:*:*:user-profile/*",
                "arn:aws:sagemaker:*:*:app/*",
                "arn:aws:sagemaker:*:*:space/*"
             ]
        },
        {
            "Sid": "LambdaListPermissions",
            "Effect": "Allow",
            "Action": [
                "lambda:ListFunctions"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Sid": "LambdaPermissionsForRewardFunction",
            "Effect": "Allow",
            "Action": [
                "lambda:CreateFunction",
                "lambda:DeleteFunction",
                "lambda:InvokeFunction",
                "lambda:GetFunction"
            ],
            "Resource": [
                "arn:aws:lambda:*:*:function:*SageMaker*",
                "arn:aws:lambda:*:*:function:*sagemaker*",
                "arn:aws:lambda:*:*:function:*Sagemaker*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "LambdaLayerForAWSSDK",
            "Effect": "Allow",
            "Action": [
                "lambda:GetLayerVersion"
            ],
            "Resource": [
                "arn:aws:lambda:*:336392948345:layer:AWSSDK*"
            ]
        },
        {
            "Sid": "SageMakerPublicHubPermissions",
            "Effect": "Allow",
            "Action": [
                "sagemaker:ListHubContents"
            ],
            "Resource": [
                "arn:aws:sagemaker:*:aws:hub/SageMakerPublicHub"
            ]
        },
        {
            "Sid": "SageMakerHubPermissions",
            "Effect": "Allow",
            "Action": [
                "sagemaker:ListHubs",
                "sagemaker:ListHubContents",
                "sagemaker:DescribeHubContent",
                "sagemaker:DeleteHubContent",
                "sagemaker:ListHubContentVersions",
                "sagemaker:Search"
            ],
            "Resource": [
                "*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "JumpStartAccess",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::jumpstart*"
            ]
        },
        {
            "Sid": "ListMLFlowOperations",
            "Effect": "Allow",
            "Action": [
                "sagemaker:ListMlflowApps",
                "sagemaker:ListMlflowTrackingServers"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Sid": "MLFlowAccess",
            "Effect": "Allow",
            "Action": [
                "sagemaker:UpdateMlflowApp",
                "sagemaker:DescribeMlflowApp",
                "sagemaker:CreatePresignedMlflowAppUrl",
                "sagemaker:CallMlflowAppApi",
                "sagemaker-mlflow:*"
            ],
            "Resource": [
                "arn:aws:sagemaker:*:*:mlflow-app/*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "BYODataSetS3Access",
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetObject",
                "s3:PutObject"
            ],
            "Resource": [
                "arn:aws:s3:::*SageMaker*",
                "arn:aws:s3:::*Sagemaker*",
                "arn:aws:s3:::*sagemaker*"
            ]
        },
        {
            "Sid": "AllowHubPermissions",
            "Effect": "Allow",
            "Action": [
                "sagemaker:ImportHubContent"
            ],
            "Resource": [
                "arn:aws:sagemaker:*:*:hub/*",
                "arn:aws:sagemaker:*:*:hub-content/*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "PassRoleForSageMaker",
            "Effect": "Allow",
            "Action": [
                "iam:PassRole"
            ],
            "Resource": [
                "arn:aws:iam::*:role/service-role/AmazonSageMaker-ExecutionRole-*"
            ],
            "Condition": {
                "StringEquals": {
                    "iam:PassedToService": "sagemaker.amazonaws.com",
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "PassRoleForAWSLambda",
            "Effect": "Allow",
            "Action": [
                "iam:PassRole"
            ],
            "Resource": [
                "arn:aws:iam::*:role/service-role/AmazonSageMaker-ExecutionRole-*"
            ],
            "Condition": {
                "StringEquals": {
                    "iam:PassedToService": "lambda.amazonaws.com",
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "PassRoleForBedrock",
            "Effect": "Allow",
            "Action": [
                "iam:PassRole"
            ],
            "Resource": [
                "arn:aws:iam::*:role/service-role/AmazonSageMaker-ExecutionRole-*"
            ],
            "Condition": {
                "StringEquals": {
                    "iam:PassedToService": "bedrock.amazonaws.com",
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "TrainingJobRun",
            "Effect": "Allow",
            "Action": [
                "sagemaker:CreateTrainingJob",
                "sagemaker:DescribeTrainingJob",
                "sagemaker:ListTrainingJobs"
            ],
            "Resource": [
                "arn:aws:sagemaker:*:*:training-job/*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "ModelPackageAccess",
            "Effect": "Allow",
            "Action": [
                "sagemaker:CreateModelPackage",
                "sagemaker:DescribeModelPackage",
                "sagemaker:ListModelPackages",
                "sagemaker:CreateModelPackageGroup",
                "sagemaker:DescribeModelPackageGroup",
                "sagemaker:ListModelPackageGroups",
                "sagemaker:CreateModel"
            ],
            "Resource": [
                "arn:aws:sagemaker:*:*:model-package-group/*",
                "arn:aws:sagemaker:*:*:model-package/*",
                "arn:aws:sagemaker:*:*:model/*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "TagsPermission",
            "Effect": "Allow",
            "Action": [
                "sagemaker:AddTags",
                "sagemaker:ListTags"
            ],
            "Resource": [
                "arn:aws:sagemaker:*:*:model-package-group/*",
                "arn:aws:sagemaker:*:*:model-package/*",
                "arn:aws:sagemaker:*:*:hub/*",
                "arn:aws:sagemaker:*:*:hub-content/*",
                "arn:aws:sagemaker:*:*:training-job/*",
                "arn:aws:sagemaker:*:*:model/*",
                "arn:aws:sagemaker:*:*:endpoint/*",
                "arn:aws:sagemaker:*:*:endpoint-config/*",
                "arn:aws:sagemaker:*:*:pipeline/*",
                "arn:aws:sagemaker:*:*:inference-component/*",
                "arn:aws:sagemaker:*:*:action/*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "LogAccess",
            "Effect": "Allow",
            "Action": [
                "logs:DescribeLogGroups",
                "logs:DescribeLogStreams",
                "logs:GetLogEvents"
            ],
            "Resource": [
                "arn:aws:logs:*:*:log-group*",
                "arn:aws:logs:*:*:log-group:/aws/sagemaker/TrainingJobs:log-stream:*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "BedrockDeploy",
            "Effect": "Allow",
            "Action": [
                "bedrock:CreateModelImportJob"
            ],
            "Resource": [
                "arn:aws:bedrock:*:*:*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "BedrockOperations",
            "Effect": "Allow",
            "Action": [
                "bedrock:GetModelImportJob",
                "bedrock:GetImportedModel",
                "bedrock:ListProvisionedModelThroughputs",
                "bedrock:ListCustomModelDeployments",
                "bedrock:ListCustomModels",
                "bedrock:ListModelImportJobs",
                "bedrock:GetEvaluationJob",
                "bedrock:CreateEvaluationJob", 
                "bedrock:InvokeModel"
            ],
            "Resource": [
                "arn:aws:bedrock:*:*:evaluation-job/*",
                "arn:aws:bedrock:*:*:imported-model/*",
                "arn:aws:bedrock:*:*:model-import-job/*",
                "arn:aws:bedrock:*:*:foundation-model/*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "BedrockFoundationModelOperations",
            "Effect": "Allow",
            "Action": [
                "bedrock:GetFoundationModelAvailability",
                "bedrock:ListFoundationModels"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Sid": "SageMakerPipelinesAndLineage",
            "Effect": "Allow",
            "Action": [
                "sagemaker:ListActions",
                "sagemaker:ListArtifacts",
                "sagemaker:QueryLineage",
                "sagemaker:ListAssociations",
                "sagemaker:AddAssociation",
                "sagemaker:DescribeAction",
                "sagemaker:AddAssociation",
                "sagemaker:CreateAction",
                "sagemaker:CreateContext",
                "sagemaker:DescribeTrialComponent"
            ],
            "Resource": [
                "arn:aws:sagemaker:*:*:artifact/*",
                "arn:aws:sagemaker:*:*:action/*",
                "arn:aws:sagemaker:*:*:context/*",
                "arn:aws:sagemaker:*:*:action/*",
                "arn:aws:sagemaker:*:*:model-package/*",
                "arn:aws:sagemaker:*:*:context/*",
                "arn:aws:sagemaker:*:*:pipeline/*",
                "arn:aws:sagemaker:*:*:experiment-trial-component/*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "ListOperations",
            "Effect": "Allow",
            "Action": [
                "sagemaker:ListInferenceComponents",
                "sagemaker:ListWorkforces"
            ],
            "Resource": [
                "*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "SageMakerInference",
            "Effect": "Allow",
            "Action": [
                "sagemaker:DescribeInferenceComponent",
                "sagemaker:CreateEndpoint",
                "sagemaker:CreateEndpointConfig",
                "sagemaker:DescribeEndpoint",
                "sagemaker:DescribeEndpointConfig",
                "sagemaker:ListEndpoints"
            ],
            "Resource": [
                "arn:aws:sagemaker:*:*:inference-component/*",
                "arn:aws:sagemaker:*:*:endpoint/*",
                "arn:aws:sagemaker:*:*:endpoint-config/*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "SageMakerPipelines",
            "Effect": "Allow",
            "Action": [
                "sagemaker:DescribePipelineExecution",
                "sagemaker:ListPipelineExecutions",
                "sagemaker:ListPipelineExecutionSteps",
                "sagemaker:CreatePipeline",
                "sagemaker:UpdatePipeline",
                "sagemaker:StartPipelineExecution"
            ],
            "Resource": [
                "arn:aws:sagemaker:*:*:pipeline/*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        }
    ]
}
```

If you have already attached the [AmazonSageMakerFullAccess](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonSageMakerFullAccess.html) managed policy to your execution role, you can add the following simplified policy instead:

```
{
    "Version": "2012-10-17",		 	 	 
    "Statement": [
        {
            "Sid": "LambdaListPermissions",
            "Effect": "Allow",
            "Action": [
                "lambda:ListFunctions"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Sid": "LambdaPermissionsForRewardFunction",
            "Effect": "Allow",
            "Action": [
                "lambda:CreateFunction",
                "lambda:DeleteFunction",
                "lambda:InvokeFunction",
                "lambda:GetFunction"
            ],
            "Resource": [
                "arn:aws:lambda:*:*:function:*SageMaker*",
                "arn:aws:lambda:*:*:function:*sagemaker*",
                "arn:aws:lambda:*:*:function:*Sagemaker*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "LambdaLayerForAWSSDK",
            "Effect": "Allow",
            "Action": [
                "lambda:GetLayerVersion"
            ],
            "Resource": [
                "arn:aws:lambda:*:336392948345:layer:AWSSDK*"
            ]
        },
        {
            "Sid": "S3Access",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject"
            ],
            "Resource": [
                "arn:aws:s3:::*SageMaker*",
                "arn:aws:s3:::*Sagemaker*",
                "arn:aws:s3:::*sagemaker*",
                "arn:aws:s3:::jumpstart*"
            ]
        },
        {
            "Sid": "PassRoleForSageMakerAndLambdaAndBedrock",
            "Effect": "Allow",
            "Action": [
                "iam:PassRole"
            ],
            "Resource": [
                "arn:aws:iam::*:role/service-role/AmazonSageMaker-ExecutionRole-*"
            ],
            "Condition": { 
                "StringEquals": { 
                    "iam:PassedToService": [ 
                        "lambda.amazonaws.com", 
                        "bedrock.amazonaws.com"
                     ],
                     "aws:ResourceAccount": "${aws:PrincipalAccount}" 
                 } 
            }
        },
        {
            "Sid": "BedrockDeploy",
            "Effect": "Allow",
            "Action": [
                "bedrock:CreateModelImportJob"
            ],
            "Resource": [
                "*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "BedrockOperations",
            "Effect": "Allow",
            "Action": [
                "bedrock:GetModelImportJob",
                "bedrock:GetImportedModel",
                "bedrock:ListProvisionedModelThroughputs",
                "bedrock:ListCustomModelDeployments",
                "bedrock:ListCustomModels",
                "bedrock:ListModelImportJobs",
                "bedrock:GetEvaluationJob",
                "bedrock:CreateEvaluationJob",
                "bedrock:InvokeModel"
            ],
            "Resource": [
                "arn:aws:bedrock:*:*:evaluation-job/*",
                "arn:aws:bedrock:*:*:imported-model/*",
                "arn:aws:bedrock:*:*:model-import-job/*",
                "arn:aws:bedrock:*:*:foundation-model/*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "BedrockFoundationModelOperations",
            "Effect": "Allow",
            "Action": [
                "bedrock:GetFoundationModelAvailability",
                "bedrock:ListFoundationModels"
            ],
            "Resource": [
                "*"
            ]
        }
    ]
}
```

Then choose **Edit trust policy**, replace the policy with the following, and choose **Update policy**.

```
{
    "Version": "2012-10-17",		 	 	                    
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                 "Service": "lambda.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        },
        {
            "Effect": "Allow",
            "Principal": {
                   "Service": "sagemaker.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        },
        {
            "Effect": "Allow",
            "Principal": {
                  "Service": "bedrock.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
```
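
As a quick sanity check before you choose **Update policy**, you can verify that a trust policy grants `sts:AssumeRole` to all three services with a few lines of standard-library Python. This is an illustrative sketch, not official tooling:

```
# Illustrative check that a trust policy covers the three service
# principals the customization workflow relies on.
REQUIRED_SERVICES = {
    "lambda.amazonaws.com",
    "sagemaker.amazonaws.com",
    "bedrock.amazonaws.com",
}

def missing_principals(trust_policy):
    """Return the required service principals not granted sts:AssumeRole."""
    granted = set()
    for stmt in trust_policy.get("Statement", []):
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        if "sts:AssumeRole" not in actions:
            continue
        services = stmt.get("Principal", {}).get("Service", [])
        if isinstance(services, str):
            services = [services]
        granted.update(services)
    return REQUIRED_SERVICES - granted

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Principal": {"Service": s}, "Action": "sts:AssumeRole"}
        for s in sorted(REQUIRED_SERVICES)
    ],
}
print(missing_principals(trust_policy))  # set() means nothing is missing
```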

# Create assets for model customization in the UI
<a name="model-customize-open-weight-create-assets-ui"></a>

You can create and manage dataset and evaluator assets for model customization in the UI.

## Assets
<a name="model-customize-open-weight-assets"></a>

In the Amazon SageMaker Studio UI, choose **Assets** in the left panel, then choose **Datasets**.

![\[Image showing model customization access.\]](http://docs.aws.amazon.com/zh_cn/sagemaker/latest/dg/images/screenshot-open-model-16.png)


Choose **Upload dataset** to add a dataset to use in your model customization jobs. Choose **Required data input format** to access a reference for the dataset format to use.

![\[Image showing model customization access.\]](http://docs.aws.amazon.com/zh_cn/sagemaker/latest/dg/images/screenshot-open-model-15.png)
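
The **Required data input format** reference in the UI is the authoritative schema for each technique. As an illustration only (the `prompt`/`response` field names below are assumptions, not the official schema), a small SFT-style JSONL dataset can be written and sanity-checked with standard-library Python:

```
import json

# Hypothetical prompt-response pairs; real datasets should follow the
# "Required data input format" reference shown in the Studio UI.
examples = [
    {"prompt": "Summarize the return policy.",
     "response": "Items may be returned within 30 days."},
    {"prompt": "What is the warranty period?",
     "response": "All products carry a one-year warranty."},
]

path = "sft-dataset.jsonl"
with open(path, "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Sanity check: every line parses and carries the two expected fields.
with open(path, encoding="utf-8") as f:
    rows = [json.loads(line) for line in f]
assert all({"prompt", "response"} <= set(row) for row in rows)
print(f"{len(rows)} examples written to {path}")
```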


## Evaluators
<a name="model-customize-open-weight-evaluators"></a>

You can also add **reward functions** and **reward prompts** for reinforcement learning customization jobs.

![\[Image showing model customization access.\]](http://docs.aws.amazon.com/zh_cn/sagemaker/latest/dg/images/screenshot-open-model-14.png)


The UI also provides guidance on the required format for reward functions and reward prompts.

![\[Image showing model customization access.\]](http://docs.aws.amazon.com/zh_cn/sagemaker/latest/dg/images/screenshot-open-model-13.png)
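
To make the reward-function idea concrete, the sketch below shows the general shape of a Lambda-style scorer: it receives a model output, applies code-based criteria, and returns a numeric score. The event and response field names here are assumptions for illustration; the format guidance shown in the UI is authoritative.

```
# Illustrative reward function in the style of an AWS Lambda handler.
# The "completion" and "score" field names are assumed for illustration;
# follow the format guidance shown in the Studio UI for real jobs.

def lambda_handler(event, context):
    completion = event.get("completion", "")
    # Example criteria: reward answers that are concise and cite a source.
    score = 0.0
    if 0 < len(completion.split()) <= 100:
        score += 0.5
    if "source:" in completion.lower():
        score += 0.5
    return {"score": score}

# Local smoke test of the scoring logic.
result = lambda_handler({"completion": "Short answer. Source: internal FAQ."}, None)
print(result)  # {'score': 1.0}
```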


## Assets for model customization with the AWS SDK
<a name="model-customize-open-weight-create-assets-sdk"></a>

You can also create assets using the SageMaker AI Python SDK. See the following example snippets:

```
from pprint import pprint

from sagemaker.ai_registry.air_constants import REWARD_FUNCTION, REWARD_PROMPT
from sagemaker.ai_registry.dataset import DataSet, CustomizationTechnique
from sagemaker.ai_registry.evaluator import Evaluator

# Creating a dataset example
dataset = DataSet.create(
            name="sdkv3-gen-ds2",
            source="s3://sample-test-bucket/datasets/training-data/jamjee-sft-ds1.jsonl", # or use local filepath as source.
            customization_technique=CustomizationTechnique.SFT
        )

# Refreshes status from hub
dataset.refresh()
pprint(dataset.__dict__)

# Creating an evaluator. Method : Lambda
evaluator = Evaluator.create(
                name = "sdk-new-rf11",
                source="arn:aws:lambda:us-west-2:<>:function:<function-name>8",
                type=REWARD_FUNCTION
        )

# Creating an evaluator. Method : Bring your own code
evaluator = Evaluator.create(
                name = "eval-lambda-test",
                source="/path_to_local/eval_lambda_1.py",
                type = REWARD_FUNCTION
        )

# Optional wait, by default we have wait = True during create call.
evaluator.wait()

evaluator.refresh()
pprint(evaluator)
```

# AI model customization job submission
<a name="model-customize-open-weight-job"></a>

You can access the SageMaker AI model customization capability from the Amazon SageMaker Studio Models page in the left panel. You can also find the Assets page, where you can create and manage model customization datasets and evaluators.

![\[Image showing model customization access.\]](http://docs.aws.amazon.com/zh_cn/sagemaker/latest/dg/images/screenshot-open-model-12.png)


To start submitting a model customization job, choose the Models option to access the JumpStart foundation models tab:

![\[Image showing how to choose a foundation model.\]](http://docs.aws.amazon.com/zh_cn/sagemaker/latest/dg/images/screenshot-open-model-11.png)


你可以直接点击模型卡片中的自定义模型，也可以从 Meta 中搜索任何你感兴趣的自定义模型。

![\[包含模型卡片以及如何选择要自定义的模型的图片。\]](http://docs.aws.amazon.com/zh_cn/sagemaker/latest/dg/images/screenshot-open-model-10.png)


单击模型卡片后，您可以访问模型详细信息页面并启动自定义作业，方法是单击 “自定义模型”，然后选择 “使用用户界面定制”，开始配置 RLVR 作业。

![\[包含如何启动自定义任务的图片。\]](http://docs.aws.amazon.com/zh_cn/sagemaker/latest/dg/images/screenshot-open-model-9.png)


然后，您可以输入您的自定义模型名称，选择要使用的模型自定义技术并配置您的作业超参数：

![\[包含精选模型定制技术的图像。\]](http://docs.aws.amazon.com/zh_cn/sagemaker/latest/dg/images/screenshot-open-model-8.png)


![\[包含精选模型定制技术的图像。\]](http://docs.aws.amazon.com/zh_cn/sagemaker/latest/dg/images/screenshot-open-model-7.png)


## Submit an AI model customization job with the SDK
<a name="model-customize-open-weight-job-sdk"></a>

You can also submit model customization jobs with the SageMaker AI Python SDK:

```
# Submit a DPO model customization job

from sagemaker.modules.train.dpo_trainer import DPOTrainer
from sagemaker.modules.train.common import TrainingType

trainer = DPOTrainer(
    model=BASE_MODEL,
    training_type=TrainingType.LORA,
    model_package_group_name=MODEL_PACKAGE_GROUP_NAME,
    training_dataset=TRAINING_DATASET,
    s3_output_path=S3_OUTPUT_PATH,
    sagemaker_session=sagemaker_session,
    role=ROLE_ARN
)
```

## Monitor your customization job
<a name="model-customize-open-weight-monitor"></a>

After you submit the job, you are redirected to the model customization training jobs page.

![\[Image showing the model customization training jobs page.\]](http://docs.aws.amazon.com/zh_cn/sagemaker/latest/dg/images/screenshot-open-model-6.png)


When the job completes, you can open the custom model details page by choosing the **Go to custom model** button in the upper right.

![\[Image showing the custom model details page.\]](http://docs.aws.amazon.com/zh_cn/sagemaker/latest/dg/images/screenshot-open-model-5.png)


From the custom model details page, you can continue working with your custom model in the following ways:

1. Review information about performance, generated artifact locations, training configuration hyperparameters, and training logs.

1. Launch evaluation jobs with different datasets (continued customization).

1. Deploy the model using SageMaker AI inference endpoints or Amazon Bedrock Custom Model Import.  
![\[Image showing deployment options for the custom model.\]](http://docs.aws.amazon.com/zh_cn/sagemaker/latest/dg/images/screenshot-open-model-4.png)

# Model evaluation job submission
<a name="model-customize-open-weight-evaluation"></a>

This section covers open-weight custom model evaluation. It walks you through the evaluation job submission process step by step. Additional resources are provided for more advanced evaluation job submission use cases.

**Topics**
+ [

# Getting started
](model-customize-evaluation-getting-started.md)
+ [

# Evaluation types and job submission
](model-customize-evaluation-types.md)
+ [

# Evaluation metrics formats
](model-customize-evaluation-metrics-formats.md)
+ [

# Supported dataset formats for Bring-Your-Own-Dataset (BYOD) jobs
](model-customize-evaluation-dataset-formats.md)
+ [

# Evaluation with preset and custom scorers
](model-customize-evaluation-preset-custom-scorers.md)

# Getting started
<a name="model-customize-evaluation-getting-started"></a>

## Submit your evaluation job through SageMaker Studio
<a name="model-customize-evaluation-studio"></a>

### Step 1: Navigate to Evaluate from your model card
<a name="model-customize-evaluation-studio-step1"></a>

After customizing a model, navigate from its model card to the evaluation page.

For information about open-weight custom model training, see [https://docs.aws.amazon.com/sagemaker/latest/dg/model-customize-open-weight-job.html](https://docs.aws.amazon.com/sagemaker/latest/dg/model-customize-open-weight-job.html).

SageMaker displays your custom models on the My models tab:

![\[Registered model card page\]](http://docs.aws.amazon.com/zh_cn/sagemaker/latest/dg/images/getting-started-registered-model-card.png)


Choose View latest version, then choose Evaluate:

![\[Model customization page\]](http://docs.aws.amazon.com/zh_cn/sagemaker/latest/dg/images/getting-started-evaluate-from-model-card.png)


### Step 2: Submit your evaluation job
<a name="model-customize-evaluation-studio-step2"></a>

Choose the Submit button to submit your evaluation job. This submits a minimal MMLU benchmark job.

For information about supported evaluation job types, see [Evaluation types and job submission](model-customize-evaluation-types.md).

![\[Evaluation job submission page\]](http://docs.aws.amazon.com/zh_cn/sagemaker/latest/dg/images/getting-started-benchmark-submission.png)


### Step 3: Track your evaluation job progress
<a name="model-customize-evaluation-studio-step3"></a>

Your evaluation job's progress is tracked on the Evaluation steps tab:

![\[Your evaluation job progress\]](http://docs.aws.amazon.com/zh_cn/sagemaker/latest/dg/images/getting-started-benchmark-tracking.png)


### Step 4: View your evaluation job results
<a name="model-customize-evaluation-studio-step4"></a>

Your evaluation job results appear on the Evaluation results tab:

![\[Your evaluation job metrics\]](http://docs.aws.amazon.com/zh_cn/sagemaker/latest/dg/images/getting-started-benchmark-results.png)


### Step 5: View completed evaluations
<a name="model-customize-evaluation-studio-step5"></a>

Your completed evaluation jobs appear in the model card's evaluations:

![\[Your completed evaluation jobs\]](http://docs.aws.amazon.com/zh_cn/sagemaker/latest/dg/images/getting-started-benchmark-completed-model-card.png)


## Submit your evaluation job through the SageMaker Python SDK
<a name="model-customize-evaluation-sdk"></a>

### Step 1: Create your BenchMarkEvaluator
<a name="model-customize-evaluation-sdk-step1"></a>

Pass your registered trained model, Amazon S3 output location, and MLflow resource ARN to `BenchMarkEvaluator`, then initialize it.

```
from sagemaker.train.evaluate import BenchMarkEvaluator, Benchmark  
  
evaluator = BenchMarkEvaluator(  
    benchmark=Benchmark.MMLU,  
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",  
    s3_output_path="s3://<bucket-name>/<prefix>/eval/",  
    mlflow_resource_arn="arn:aws:sagemaker:<region>:<account-id>:mlflow-tracking-server/<tracking-server-name>",  
    evaluate_base_model=False  
)
```

### Step 2: Submit your evaluation job
<a name="model-customize-evaluation-sdk-step2"></a>

Call the `evaluate()` method to submit the evaluation job.

```
execution = evaluator.evaluate()
```

### Step 3: Track your evaluation job progress
<a name="model-customize-evaluation-sdk-step3"></a>

Call the execution's `wait()` method to get live updates on the evaluation job's progress.

```
execution.wait(target_status="Succeeded", poll=5, timeout=3600)
```

### Step 4: View your evaluation job results
<a name="model-customize-evaluation-sdk-step4"></a>

Call the `show_results()` method to display your evaluation job results.

```
execution.show_results()
```

# Evaluation types and job submission
<a name="model-customize-evaluation-types"></a>

## Benchmarking with standardized datasets
<a name="model-customize-evaluation-benchmarking"></a>

Use the benchmark evaluation type to assess your model's quality on standardized benchmark datasets, including popular datasets such as MMLU and BBH.


| Benchmark | Custom dataset support | Modality | Description | Metric | Strategy | Subtasks available | 
| --- | --- | --- | --- | --- | --- | --- | 
| mmlu | No | Text | Massive multitask language understanding: tests knowledge across 57 subjects. | accuracy | zs_cot | Yes | 
| mmlu_pro | No | Text | MMLU (professional subset), focused on professional domains such as law, medicine, accounting, and engineering. | accuracy | zs_cot | No | 
| bbh | No | Text | Advanced reasoning tasks: a set of challenging problems that test higher-level cognitive and problem-solving skills. | accuracy | fs_cot | Yes | 
| gpqa | No | Text | General physics question answering: assesses understanding of physics concepts and related problem-solving ability. | accuracy | zs_cot | No | 
| math | No | Text | Mathematical problem solving: measures mathematical reasoning in areas such as algebra, calculus, and word problems. | exact_match | zs_cot | Yes | 
| strong_reject | No | Text | Quality-control task: tests the model's ability to detect and reject inappropriate, harmful, or incorrect content. | deflection | zs | Yes | 
| ifeval | No | Text | Instruction-following evaluation: measures how accurately the model follows given instructions and completes tasks to specification. | accuracy | zs | No | 

For more information about BYOD formats, see [Supported dataset formats for Bring-Your-Own-Dataset (BYOD) jobs](model-customize-evaluation-dataset-formats.md).

### Available subtasks
<a name="model-customize-evaluation-benchmarking-subtasks"></a>

The following lists the available subtasks for model evaluation across multiple domains, including MMLU (Massive Multitask Language Understanding), BBH (BIG-Bench Hard), MATH, and StrongReject. These subtasks let you assess model performance on specific capabilities and knowledge areas.

**MMLU subtasks**

```
MMLU_SUBTASKS = [
    "abstract_algebra",
    "anatomy",
    "astronomy",
    "business_ethics",
    "clinical_knowledge",
    "college_biology",
    "college_chemistry",
    "college_computer_science",
    "college_mathematics",
    "college_medicine",
    "college_physics",
    "computer_security",
    "conceptual_physics",
    "econometrics",
    "electrical_engineering",
    "elementary_mathematics",
    "formal_logic",
    "global_facts",
    "high_school_biology",
    "high_school_chemistry",
    "high_school_computer_science",
    "high_school_european_history",
    "high_school_geography",
    "high_school_government_and_politics",
    "high_school_macroeconomics",
    "high_school_mathematics",
    "high_school_microeconomics",
    "high_school_physics",
    "high_school_psychology",
    "high_school_statistics",
    "high_school_us_history",
    "high_school_world_history",
    "human_aging",
    "human_sexuality",
    "international_law",
    "jurisprudence",
    "logical_fallacies",
    "machine_learning",
    "management",
    "marketing",
    "medical_genetics",
    "miscellaneous",
    "moral_disputes",
    "moral_scenarios",
    "nutrition",
    "philosophy",
    "prehistory",
    "professional_accounting",
    "professional_law",
    "professional_medicine",
    "professional_psychology",
    "public_relations",
    "security_studies",
    "sociology",
    "us_foreign_policy",
    "virology",
    "world_religions"
]
```

**BBH subtasks**

```
BBH_SUBTASKS = [
    "boolean_expressions",
    "causal_judgement",
    "date_understanding",
    "disambiguation_qa",
    "dyck_languages",
    "formal_fallacies",
    "geometric_shapes",
    "hyperbaton",
    "logical_deduction_five_objects",
    "logical_deduction_seven_objects",
    "logical_deduction_three_objects",
    "movie_recommendation",
    "multistep_arithmetic_two",
    "navigate",
    "object_counting",
    "penguins_in_a_table",
    "reasoning_about_colored_objects",
    "ruin_names",
    "salient_translation_error_detection",
    "snarks",
    "sports_understanding",
    "temporal_sequences",
    "tracking_shuffled_objects_five_objects",
    "tracking_shuffled_objects_seven_objects",
    "tracking_shuffled_objects_three_objects",
    "web_of_lies",
    "word_sorting"
]
```

**MATH subtasks**

```
MATH_SUBTASKS = [
    "algebra", 
    "counting_and_probability", 
    "geometry",
    "intermediate_algebra", 
    "number_theory", 
    "prealgebra", 
    "precalculus"
]
```

**StrongReject subtasks**

```
STRONG_REJECT_SUBTASKS = [
    "gcg_transfer_harmbench", 
    "gcg_transfer_universal_attacks",
    "combination_3", 
    "combination_2", 
    "few_shot_json", 
    "dev_mode_v2",
    "dev_mode_with_rant",
    "wikipedia_with_title", 
    "distractors",
    "wikipedia",
    "style_injection_json",
    "style_injection_short",
    "refusal_suppression",
    "prefix_injection",
    "distractors_negated",
    "poems",
    "base64",
    "base64_raw",
    "base64_input_only",
    "base64_output_only", 
    "evil_confidant", 
    "aim", 
    "rot_13",
    "disemvowel", 
    "auto_obfuscation", 
    "auto_payload_splitting", 
    "pair",
    "pap_authority_endorsement", 
    "pap_evidence_based_persuasion",
    "pap_expert_endorsement", 
    "pap_logical_appeal", 
    "pap_misrepresentation"
]
```

### Submit your benchmark job
<a name="model-customize-evaluation-benchmarking-submit"></a>

------
#### [ SageMaker Studio ]

![\[Minimal configuration for benchmarking through SageMaker Studio\]](http://docs.aws.amazon.com/zh_cn/sagemaker/latest/dg/images/benchmark-submission-sagemaker-studio.png)


------
#### [ SageMaker Python SDK ]

```
from sagemaker.train.evaluate import get_benchmarks
from sagemaker.train.evaluate import BenchMarkEvaluator

Benchmark = get_benchmarks()

# Create evaluator with MMLU benchmark
evaluator = BenchMarkEvaluator(
    benchmark=Benchmark.MMLU,
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    s3_output_path="s3://<bucket-name>/<prefix>/",
    evaluate_base_model=False
)

execution = evaluator.evaluate()
```

For more information about submitting evaluation jobs through the SageMaker Python SDK, see: [https://sagemaker.readthedocs.io/en/stable/model_customization/evaluation.html](https://sagemaker.readthedocs.io/en/stable/model_customization/evaluation.html)

------

## LLM-as-a-judge (LLMAJ) evaluation
<a name="model-customize-evaluation-llmaj"></a>

Use LLM-as-a-judge (LLMAJ) evaluation to score your target model's responses with another frontier model. You can start an evaluation job that uses an Amazon Bedrock model as the judge by calling the `create_evaluation_job` API.

For more information about supported judge models, see: [https://docs.aws.amazon.com/bedrock/latest/userguide/models-supported.html](https://docs.aws.amazon.com/bedrock/latest/userguide/models-supported.html)

You can define evaluations using two different metric formats:
+ **Built-in metrics:** Use Amazon Bedrock built-in metrics to analyze the quality of model inference responses. For more information, see: [https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-type-judge-prompt.html](https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-type-judge-prompt.html)
+ **Custom metrics:** Define your own custom metrics in the Bedrock evaluation custom metric format, analyzing the quality of model inference responses with your own instructions. For more information, see: [https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-custom-metrics-prompt-formats.html](https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-custom-metrics-prompt-formats.html)

### Submit a built-in-metrics LLMAJ job
<a name="model-customize-evaluation-llmaj-builtin"></a>

------
#### [ SageMaker Studio ]

![\[Minimal configuration for an LLMAJ job through SageMaker Studio\]](http://docs.aws.amazon.com/zh_cn/sagemaker/latest/dg/images/llmaj-as-judge-submission-sagemaker-studio.png)


------
#### [ SageMaker Python SDK ]

```
from sagemaker.train.evaluate import LLMAsJudgeEvaluator

evaluator = LLMAsJudgeEvaluator(
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    evaluator_model="<bedrock-judge-model-id>",
    dataset="s3://<bucket-name>/<prefix>/<dataset-file>.jsonl",
    builtin_metrics=["<builtin-metric-1>", "<builtin-metric-2>"],
    s3_output_path="s3://<bucket-name>/<prefix>/",
    evaluate_base_model=False
)

execution = evaluator.evaluate()
```

For more information about submitting evaluation jobs through the SageMaker Python SDK, see: [https://sagemaker.readthedocs.io/en/stable/model_customization/evaluation.html](https://sagemaker.readthedocs.io/en/stable/model_customization/evaluation.html)

------

### Submit a custom-metrics LLMAJ job
<a name="model-customize-evaluation-llmaj-custom"></a>

Define your custom metric:

```
{
    "customMetricDefinition": {
        "name": "PositiveSentiment",
        "instructions": (
            "You are an expert evaluator. Your task is to assess if the sentiment of the response is positive. "
            "Rate the response based on whether it conveys positive sentiment, helpfulness, and constructive tone.\n\n"
            "Consider the following:\n"
            "- Does the response have a positive, encouraging tone?\n"
            "- Is the response helpful and constructive?\n"
            "- Does it avoid negative language or criticism?\n\n"
            "Rate on this scale:\n"
            "- Good: Response has positive sentiment\n"
            "- Poor: Response lacks positive sentiment\n\n"
            "Here is the actual task:\n"
            "Prompt: {{prompt}}\n"
            "Response: {{prediction}}"
        ),
        "ratingScale": [
            {"definition": "Good", "value": {"floatValue": 1}},
            {"definition": "Poor", "value": {"floatValue": 0}}
        ]
    }
}
```
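At evaluation time, Bedrock fills the `{{prompt}}` and `{{prediction}}` placeholders in the instructions with the dataset prompt and the model's response. The substitution can be sketched roughly as follows (an illustrative helper of our own, not Bedrock's actual implementation):

```python
def render_judge_prompt(instructions: str, prompt: str, prediction: str) -> str:
    """Fill the custom-metric template placeholders with concrete values."""
    return (instructions
            .replace("{{prompt}}", prompt)
            .replace("{{prediction}}", prediction))

rendered = render_judge_prompt(
    "Here is the actual task:\nPrompt: {{prompt}}\nResponse: {{prediction}}",
    "Summarize our return policy.",
    "Happy to help! You can return items within 30 days.",
)
```

The judge model then scores the rendered prompt against the `ratingScale` defined in the metric.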

For more information, see: [https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-custom-metrics-prompt-formats.html](https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-custom-metrics-prompt-formats.html)

------
#### [ SageMaker Studio ]

![\[Upload your custom metric through Custom metrics > Add custom metrics\]](http://docs.aws.amazon.com/zh_cn/sagemaker/latest/dg/images/custom-llmaj-metrics-submission-sagemaker-studio.png)


------
#### [ SageMaker Python SDK ]

```
evaluator = LLMAsJudgeEvaluator(
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    evaluator_model="<bedrock-judge-model-id>",
    dataset="s3://<bucket-name>/<prefix>/<dataset-file>.jsonl",
    custom_metrics={
        "customMetricDefinition": {
            "name": "PositiveSentiment",
            "instructions": (
                "You are an expert evaluator. Your task is to assess if the sentiment of the response is positive. "
                "Rate the response based on whether it conveys positive sentiment, helpfulness, and constructive tone.\n\n"
                "Consider the following:\n"
                "- Does the response have a positive, encouraging tone?\n"
                "- Is the response helpful and constructive?\n"
                "- Does it avoid negative language or criticism?\n\n"
                "Rate on this scale:\n"
                "- Good: Response has positive sentiment\n"
                "- Poor: Response lacks positive sentiment\n\n"
                "Here is the actual task:\n"
                "Prompt: {{prompt}}\n"
                "Response: {{prediction}}"
            ),
            "ratingScale": [
                {"definition": "Good", "value": {"floatValue": 1}},
                {"definition": "Poor", "value": {"floatValue": 0}}
            ]
        }
    },
    s3_output_path="s3://<bucket-name>/<prefix>/",
    evaluate_base_model=False
)
```

------

## Custom scorers
<a name="model-customize-evaluation-custom-scorers"></a>

Define your own custom scorer function to launch an evaluation job. The system provides two built-in scorers, Prime Math and Prime Code. You can also bring your own scorer function: either paste your scorer function code directly, or bring your own Lambda function definition referenced by its ARN. By default, both scorer types produce evaluation results that include standard metrics such as F1 score, ROUGE, and BLEU.

For more information about built-in and custom scorers and their respective requirements/contracts, see [Evaluation with preset and custom scorers](model-customize-evaluation-preset-custom-scorers.md).

### Register your dataset
<a name="model-customize-evaluation-custom-scorers-register-dataset"></a>

Bring your own dataset for custom scorers by registering it as a SageMaker hub content dataset.

------
#### [ SageMaker Studio ]

In Studio, upload your dataset from the dedicated Datasets page.

![\[Registered evaluation dataset in SageMaker Studio\]](http://docs.aws.amazon.com/zh_cn/sagemaker/latest/dg/images/dataset-registration-sagemaker-studio.png)


------
#### [ SageMaker Python SDK ]

In the SageMaker Python SDK, register your dataset as follows:

```
from sagemaker.ai_registry.dataset import DataSet

dataset = DataSet.create(
    name="your-bring-your-own-dataset",
    source="s3://<bucket-name>/<prefix>/<dataset-file>.jsonl"
)
dataset.refresh()
```

------

### Submit a built-in scorer job
<a name="model-customize-evaluation-custom-scorers-builtin"></a>

------
#### [ SageMaker Studio ]

![\[Choose a built-in custom scorer: code execution or math answers\]](http://docs.aws.amazon.com/zh_cn/sagemaker/latest/dg/images/builtin-scorer-submission-sagemaker-studio.png)


------
#### [ SageMaker Python SDK ]

```
from sagemaker.train.evaluate import CustomScorerEvaluator
from sagemaker.train.evaluate import get_builtin_metrics

BuiltInMetric = get_builtin_metrics()

evaluator_builtin = CustomScorerEvaluator(
    evaluator=BuiltInMetric.PRIME_MATH,
    dataset="arn:aws:sagemaker:<region>:<account-id>:hub-content/<hub-content-id>/DataSet/your-bring-your-own-dataset/<version>",
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    s3_output_path="s3://<bucket-name>/<prefix>/",
    evaluate_base_model=False
)

execution = evaluator_builtin.evaluate()
```

Choose `BuiltInMetric.PRIME_MATH` or `BuiltInMetric.PRIME_CODE` as the built-in scorer.

------

### Submit a custom scorer job
<a name="model-customize-evaluation-custom-scorers-custom"></a>

Define a custom reward function. For more information, see [Custom scorers (bring your own metric)](model-customize-evaluation-preset-custom-scorers.md#model-customize-evaluation-custom-scorers-byom).

**Register your custom reward function**

------
#### [ SageMaker Studio ]

![\[Navigate to SageMaker Studio > Assets > Evaluators > Create evaluator > Create reward function\]](http://docs.aws.amazon.com/zh_cn/sagemaker/latest/dg/images/custom-scorer-submission-sagemaker-studio.png)


![\[Submit a custom scorer evaluation job under Custom scorer > Custom metrics, referencing the registered preset\]](http://docs.aws.amazon.com/zh_cn/sagemaker/latest/dg/images/custom-scorer-benchmark-submission-sagemaker-studio.png)


------
#### [ SageMaker Python SDK ]

```
from sagemaker.ai_registry.evaluator import Evaluator
from sagemaker.ai_registry.air_constants import REWARD_FUNCTION

evaluator = Evaluator.create(
    name = "your-reward-function-name",
    source="/path_to_local/custom_lambda_function.py",
    type = REWARD_FUNCTION
)
```

```
from sagemaker.train.evaluate import CustomScorerEvaluator

custom_scorer_evaluator = CustomScorerEvaluator(
    evaluator=evaluator,
    dataset="s3://<bucket-name>/<prefix>/<dataset-file>.jsonl",
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    s3_output_path="s3://<bucket-name>/<prefix>/",
    evaluate_base_model=False
)

execution = custom_scorer_evaluator.evaluate()
```

------

# Evaluation metrics formats
<a name="model-customize-evaluation-metrics-formats"></a>

Evaluate your model's quality with the following metrics formats:
+ Model evaluation summary
+ MLflow
+ TensorBoard

## Model evaluation summary
<a name="model-customize-evaluation-metrics-summary"></a>

When you submit an evaluation job, you specify an Amazon S3 output location. SageMaker automatically uploads an evaluation summary .json file to that location. The benchmark summary S3 path is as follows:

```
s3://<your-provide-s3-location>/<training-job-name>/output/output/<evaluation-job-name>/eval_results/
```
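As a small illustrative helper (the function name and arguments are ours, not part of the SDK), the summary prefix can be assembled from the job names like this:

```python
def eval_results_prefix(s3_location: str, training_job_name: str,
                        evaluation_job_name: str) -> str:
    """Build the S3 prefix where the evaluation summary .json is written,
    mirroring the documented layout shown above."""
    return (f"{s3_location.rstrip('/')}/{training_job_name}"
            f"/output/output/{evaluation_job_name}/eval_results/")

prefix = eval_results_prefix("s3://my-bucket/eval", "train-job-1", "eval-job-1")
```

Note the doubled `output/output/` segment, which matches the layout SageMaker actually writes.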

**Pass your Amazon S3 location**

------
#### [ SageMaker Studio ]

![\[Pass the output artifact location (Amazon S3 URI)\]](http://docs.aws.amazon.com/zh_cn/sagemaker/latest/dg/images/s3-output-path-submission-sagemaker-studio.png)


------
#### [ SageMaker Python SDK ]

```
evaluator = BenchMarkEvaluator(
    benchmark=Benchmark.MMLU,
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    s3_output_path="s3://<bucket-name>/<prefix>/eval/",
    evaluate_base_model=False
)

execution = evaluator.evaluate()
```

------

Read the `.json` file directly from the Amazon S3 location, or view it automatically visualized in the UI:

```
{
  "results": {
    "custom|gen_qa_gen_qa|0": {
      "rouge1": 0.9152812653966208,
      "rouge1_stderr": 0.003536439199232507,
      "rouge2": 0.774569918517409,
      "rouge2_stderr": 0.006368825746765958,
      "rougeL": 0.9111255645823356,
      "rougeL_stderr": 0.003603841524881021,
      "em": 0.6562150055991042,
      "em_stderr": 0.007948251702846893,
      "qem": 0.7522396416573348,
      "qem_stderr": 0.007224355240883467,
      "f1": 0.8428757602152095,
      "f1_stderr": 0.005186300690881584,
      "f1_score_quasi": 0.9156170336744968,
      "f1_score_quasi_stderr": 0.003667700152375464,
      "bleu": 100.00000000000004,
      "bleu_stderr": 1.464411857851008
    },
    "all": {
      "rouge1": 0.9152812653966208,
      "rouge1_stderr": 0.003536439199232507,
      "rouge2": 0.774569918517409,
      "rouge2_stderr": 0.006368825746765958,
      "rougeL": 0.9111255645823356,
      "rougeL_stderr": 0.003603841524881021,
      "em": 0.6562150055991042,
      "em_stderr": 0.007948251702846893,
      "qem": 0.7522396416573348,
      "qem_stderr": 0.007224355240883467,
      "f1": 0.8428757602152095,
      "f1_stderr": 0.005186300690881584,
      "f1_score_quasi": 0.9156170336744968,
      "f1_score_quasi_stderr": 0.003667700152375464,
      "bleu": 100.00000000000004,
      "bleu_stderr": 1.464411857851008
    }
  }
}
```
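Because the summary is plain JSON, you can extract the aggregate scores under the `all` key with the standard library; a minimal sketch against a trimmed copy of the summary above:

```python
import json

# A trimmed copy of the evaluation summary shown above.
summary_text = """
{
  "results": {
    "all": {
      "rouge1": 0.9152812653966208,
      "f1": 0.8428757602152095,
      "em": 0.6562150055991042
    }
  }
}
"""

aggregate = json.loads(summary_text)["results"]["all"]
for name, value in aggregate.items():
    print(f"{name}: {value:.4f}")  # e.g. "rouge1: 0.9153"
```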

![\[Example performance metrics for a custom gen-qa benchmark job, visualized in SageMaker Studio\]](http://docs.aws.amazon.com/zh_cn/sagemaker/latest/dg/images/gen-qa-metrics-visualization-sagemaker-studio.png)


## MLflow logging
<a name="model-customize-evaluation-metrics-mlflow"></a>

**Provide your SageMaker MLflow resource ARN**

When you first use the model customization feature, SageMaker Studio uses the default MLflow app configured for each Studio domain. SageMaker Studio uses the ARN associated with the default MLflow app when submitting evaluation jobs.

You can also submit an evaluation job and explicitly provide an MLflow resource ARN to stream metrics to the associated tracking server/app for real-time analysis.

**SageMaker Python SDK**

```
evaluator = BenchMarkEvaluator(
    benchmark=Benchmark.MMLU,
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    s3_output_path="s3://<bucket-name>/<prefix>/eval/",
    mlflow_resource_arn="arn:aws:sagemaker:<region>:<account-id>:mlflow-tracking-server/<tracking-server-name>",
    evaluate_base_model=False
)

execution = evaluator.evaluate()
```

Model-level and system-level metric visualizations:

![\[Sample model-level error and accuracy for an MMLU benchmark job\]](http://docs.aws.amazon.com/zh_cn/sagemaker/latest/dg/images/model-metrics-mlflow.png)


![\[Sample built-in metrics for an LLMAJ benchmark job\]](http://docs.aws.amazon.com/zh_cn/sagemaker/latest/dg/images/llmaj-metrics-mlflow.png)


![\[Sample system-level metrics for an MMLU benchmark job\]](http://docs.aws.amazon.com/zh_cn/sagemaker/latest/dg/images/system-metrics-mlflow.png)


## TensorBoard
<a name="model-customize-evaluation-metrics-tensorboard"></a>

Submit your evaluation job with an Amazon S3 output location. SageMaker automatically uploads TensorBoard files to that location.

SageMaker uploads the TensorBoard files to the following Amazon S3 location:

```
s3://<your-provide-s3-location>/<training-job-name>/output/output/<evaluation-job-name>/tensorboard_results/eval/
```

**Pass the Amazon S3 location as follows**

------
#### [ SageMaker Studio ]

![\[Pass the output artifact location (Amazon S3 URI)\]](http://docs.aws.amazon.com/zh_cn/sagemaker/latest/dg/images/s3-output-path-submission-sagemaker-studio.png)


------
#### [ SageMaker Python SDK ]

```
evaluator = BenchMarkEvaluator(
    benchmark=Benchmark.MMLU,
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    s3_output_path="s3://<bucket-name>/<prefix>/eval/",
    evaluate_base_model=False
)

execution = evaluator.evaluate()
```

------

**Example model-level metrics**

![\[SageMaker TensorBoard showing benchmark job results\]](http://docs.aws.amazon.com/zh_cn/sagemaker/latest/dg/images/metrics-in-tensorboard.png)


# Supported dataset formats for Bring-Your-Own-Dataset (BYOD) jobs
<a name="model-customize-evaluation-dataset-formats"></a>

The custom scorer and LLM-as-a-judge evaluation types require a custom dataset JSONL file located in Amazon S3. You must provide it as a JSON Lines file that conforms to one of the supported formats below. The examples in this documentation are expanded for clarity.

Each format has its own nuances, but at a minimum all formats require a user prompt.


**Required fields**  

| Field | Required | 
| --- | --- | 
| User prompt | Yes | 
| System prompt | No | 
| Ground truth | Custom scorers only | 
| Category | No | 
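As an illustrative pre-upload check (a helper of our own, not part of the SDK), you can verify that every record carries the one universally required field, a user prompt, across the record shapes shown below:

```python
import json

def has_user_prompt(record: dict) -> bool:
    """Return True if a dataset record contains a user prompt in a supported shape."""
    # Conversational shapes: a "messages" or "prompt" array with a user-role message.
    for key in ("messages", "prompt"):
        value = record.get(key)
        if isinstance(value, list) and any(
                isinstance(m, dict) and m.get("role") == "user" for m in value):
            return True
    # SageMaker Evaluation uses a "query" string; HuggingFace standard uses a "prompt" string.
    return isinstance(record.get("query"), str) or isinstance(record.get("prompt"), str)

def missing_user_prompt_lines(jsonl_lines):
    """Yield 1-based indices of lines whose record lacks a user prompt."""
    for i, line in enumerate(jsonl_lines, start=1):
        if not has_user_prompt(json.loads(line)):
            yield i
```

For example, `list(missing_user_prompt_lines(['{"query": "Hi"}', '{"category": "Grammar"}']))` reports line 2 as missing a user prompt.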

**1. OpenAI format**

```
{
    "messages": [
        {
            "role": "system",    # System prompt (looks for system role)
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",       # Query (looks for user role)
            "content": "Hello!"
        },
        {
            "role": "assistant",  # Ground truth (looks for assistant role)
            "content": "Hello to you!"
        }
    ]
}
```

**2. SageMaker Evaluation format**

```
{
   "system":"You are an English major with top marks in class who likes to give minimal word responses: ",
   "query":"What is the symbol that ends the sentence as a question",
   "response":"?", # Ground truth
   "category": "Grammar"
}
```

**3. HuggingFace prompt completion format**

Supports both standard and conversational formats.

```
# Standard

{
    "prompt" : "What is the symbol that ends the sentence as a question", # Query
    "completion" : "?" # Ground truth
}

# Conversational
{
    "prompt": [
        {
            "role": "user", # Query (looks for user role)
            "content": "What is the symbol that ends the sentence as a question"
        }
    ],
    "completion": [
        {
            "role": "assistant", # Ground truth (looks for assistant role)
            "content": "?"
        }
    ]
}
```

**4. HuggingFace preference format**

Supports both standard (string) and conversational (message array) formats.

```
# Standard: {"prompt": "text", "chosen": "text", "rejected": "text"}
{
     "prompt" : "The sky is", # Query
     "chosen" : "blue", # Ground truth
     "rejected" : "green"
}

# Conversational:
{
    "prompt": [
        {
            "role": "user", # Query (looks for user role)
            "content": "What color is the sky?"
        }
    ],
    "chosen": [
        {
            "role": "assistant", # Ground truth (looks for assistant role)
            "content": "It is blue."
        }
    ],
    "rejected": [
        {
            "role": "assistant",
            "content": "It is green."
        }
    ]
}
```

**5. VERL format**

Reinforcement learning use cases support the VERL format (both the current and legacy variants). VERL documentation for reference: [https://verl.readthedocs.io/en/latest/preparation/prepare_data.html](https://verl.readthedocs.io/en/latest/preparation/prepare_data.html)

Users of the VERL format typically do not provide a ground-truth response. To provide one anyway, use either the `extra_info.answer` or `reward_model.ground_truth` field; `extra_info.answer` takes precedence.

SageMaker preserves the following VERL-specific fields as metadata, if present:
+ `id`
+ `data_source`
+ `ability`
+ `reward_model`
+ `extra_info`
+ `attributes`
+ `difficulty`

```
# Newest VERL format where `prompt` is an array of messages.
{
  "data_source": "openai/gsm8k",
  "prompt": [
    {
      "content": "You are a helpful math tutor who explains solutions to questions step-by-step.",
      "role": "system"
    },
    {
      "content": "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? Let's think step by step and output the final answer after \"####\".",
      "role": "user"
    }
  ],
  "ability": "math",
  "extra_info": {
    "answer": "Natalia sold 48/2 = <<48/2=24>>24 clips in May.\nNatalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n#### 72",
    "index": 0,
    "question": "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?",
    "split": "train"
  },
  "reward_model": {
    "ground_truth": "72" # Ignored in favor of extra_info.answer
  }
}

# Legacy VERL format where `prompt` is a string. Also supported.
{
  "data_source": "openai/gsm8k",
  "prompt": "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? Let's think step by step and output the final answer after \"####\".",
  "extra_info": {
    "answer": "Natalia sold 48/2 = <<48/2=24>>24 clips in May.\nNatalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n#### 72",
    "index": 0,
    "question": "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?",
    "split": "train"
  }
}
```
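The ground-truth precedence described above (`extra_info.answer` over `reward_model.ground_truth`) can be sketched in a few lines; a hypothetical helper, not part of the SDK:

```python
def verl_ground_truth(entry: dict):
    """Return the ground truth for a VERL record, preferring extra_info.answer
    over reward_model.ground_truth."""
    answer = (entry.get("extra_info") or {}).get("answer")
    if answer is not None:
        return answer
    return (entry.get("reward_model") or {}).get("ground_truth")

entry = {
    "extra_info": {"answer": "... #### 72"},
    "reward_model": {"ground_truth": "72"},
}
# extra_info.answer wins, so verl_ground_truth(entry) returns "... #### 72".
```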

# Evaluation with preset and custom scorers
<a name="model-customize-evaluation-preset-custom-scorers"></a>

When you use the custom scorer evaluation type, SageMaker evaluation supports two built-in scorers (also known as reward functions), Prime Math and Prime Code, taken from the [volcengine/verl](https://github.com/volcengine/verl) RL training library, as well as your own custom scorer implemented as a Lambda function.

## Built-in scorers
<a name="model-customize-evaluation-builtin-scorers"></a>

**Prime Math**

The Prime Math scorer expects a custom JSONL dataset whose entries contain a math problem as the prompt/query and the correct answer as the ground truth. The dataset can be in any of the supported formats listed in [Supported dataset formats for Bring-Your-Own-Dataset (BYOD) jobs](model-customize-evaluation-dataset-formats.md).

Example dataset entry (expanded for clarity):

```
{
    "system":"You are a math expert: ",
    "query":"How many vertical asymptotes does the graph of $y=\\frac{2}{x^2+x-6}$ have?",
    "response":"2" # Ground truth aka correct answer
}
```
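Writing such entries as JSON Lines takes only the standard library; a minimal sketch with made-up questions and an assumed output file name:

```python
import json

entries = [
    {"system": "You are a math expert: ",
     "query": "What is 7 * 8?",
     "response": "56"},   # ground truth, aka correct answer
    {"system": "You are a math expert: ",
     "query": "How many primes are below 10?",
     "response": "4"},
]

with open("prime_math_dataset.jsonl", "w") as f:
    for entry in entries:
        f.write(json.dumps(entry) + "\n")  # one JSON object per line
```

You can then upload the resulting file to Amazon S3 and register it as a dataset as shown earlier.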

**Prime Code**

The Prime Code scorer expects a custom JSONL dataset containing coding problems, with test cases specified in the `metadata` field. Construct test cases using each entry's expected function name, sample inputs, and expected outputs.

Example dataset entry (expanded for clarity):

```
{
    "system":"\\nWhen tackling complex reasoning tasks, you have access to the following actions. Use them as needed to progress through your thought process.\\n\\n[ASSESS]\\n\\n[ADVANCE]\\n\\n[VERIFY]\\n\\n[SIMPLIFY]\\n\\n[SYNTHESIZE]\\n\\n[PIVOT]\\n\\n[OUTPUT]\\n\\nYou should strictly follow the format below:\\n\\n[ACTION NAME]\\n\\n# Your action step 1\\n\\n# Your action step 2\\n\\n# Your action step 3\\n\\n...\\n\\nNext action: [NEXT ACTION NAME]\\n\\n",
    "query":"A number N is called a factorial number if it is the factorial of a positive integer. For example, the first few factorial numbers are 1, 2, 6, 24, 120,\\nGiven a number N, the task is to return the list/vector of the factorial numbers smaller than or equal to N.\\nExample 1:\\nInput: N = 3\\nOutput: 1 2\\nExplanation: The first factorial number is \\n1 which is less than equal to N. The second \\nnumber is 2 which is less than equal to N,\\nbut the third factorial number is 6 which \\nis greater than N. So we print only 1 and 2.\\nExample 2:\\nInput: N = 6\\nOutput: 1 2 6\\nExplanation: The first three factorial \\nnumbers are less than equal to N but \\nthe fourth factorial number 24 is \\ngreater than N. So we print only first \\nthree factorial numbers.\\nYour Task:  \\nYou don't need to read input or print anything. Your task is to complete the function factorialNumbers() which takes an integer N as an input parameter and return the list/vector of the factorial numbers smaller than or equal to N.\\nExpected Time Complexity: O(K), Where K is the number of factorial numbers.\\nExpected Auxiliary Space: O(1)\\nConstraints:\\n1<=N<=10^{18}\\n\\nWrite Python code to solve the problem. Present the code in \\n```python\\nYour code\\n```\\nat the end.",
    "response": "", # Dummy string for ground truth. Provide a value if you want NLP metrics like ROUGE, BLEU, and F1.
    ### Define test cases in metadata field
    "metadata": {
        "fn_name": "factorialNumbers",
        "inputs": ["5"],
        "outputs": ["[1, 2]"]
    }
}
```

## Custom scorers (bring your own metric)
<a name="model-customize-evaluation-custom-scorers-byom"></a>

Fully customize your model evaluation workflow with custom post-processing logic that lets you compute metrics tailored to your own needs. You must implement your custom scorer as an AWS Lambda function that accepts model responses and returns reward scores.

### Example Lambda input payload
<a name="model-customize-evaluation-custom-scorers-lambda-input"></a>

Your custom AWS Lambda function expects input in the OpenAI format. Example:

```
{
    "id": "123",
    "messages": [
        {
            "role": "user",
            "content": "Do you have a dedicated security team?"
        },
        {
            "role": "assistant",
            "content": "As an AI developed by Amazon, I do not have a dedicated security team..."
        }
    ],
    "reference_answer": {
        "compliant": "No",
        "explanation": "As an AI developed by Company, I do not have a traditional security team..."
    }
}
```

### Example Lambda output payload
<a name="model-customize-evaluation-custom-scorers-lambda-output"></a>

The SageMaker evaluation container expects your Lambda response to follow this format:

```
{
    "id": str,                              # Same id as input sample
    "aggregate_reward_score": float,        # Overall score for the sample
    "metrics_list": [                       # OPTIONAL: Component scores
        {
            "name": str,                    # Name of the component score
            "value": float,                 # Value of the component score
            "type": str                     # "Reward" or "Metric"
        }
    ]
}
```

### 自定义 Lambda 定义
<a name="model-customize-evaluation-custom-scorers-lambda-definition"></a>

在:[https://docs.aws.amazon.com/sagemaker/latest/dg/nova-.html\$1-implementing-reward-functions](https://docs.aws.amazon.com/sagemaker/latest/dg/nova-implementing-reward-functions.html#nova-reward-llm-judge-example) example 中查找带有示例输入和预期输出的完全实现的自定义评分器的示例 nova-reward-llm-judge

使用以下骨架作为你自己的函数的起点。

```
def lambda_handler(event, context):
    return lambda_grader(event)

def lambda_grader(samples: list[dict]) -> list[dict]:
    """
    Args:
        Samples: List of dictionaries in OpenAI format
            
        Example input:
        {
            "id": "123",
            "messages": [
                {
                    "role": "user",
                    "content": "Do you have a dedicated security team?"
                },
                {
                    "role": "assistant",
                    "content": "As an AI developed by Company, I do not have a dedicated security team..."
                }
            ],
            # This section is the same as your training dataset
            "reference_answer": {
                "compliant": "No",
                "explanation": "As an AI developed by Company, I do not have a traditional security team..."
            }
        }
        
    Returns:
        List of dictionaries with reward scores:
        {
            "id": str,                              # Same id as input sample
            "aggregate_reward_score": float,        # Overall score for the sample
            "metrics_list": [                       # OPTIONAL: Component scores
                {
                    "name": str,                    # Name of the component score
                    "value": float,                 # Value of the component score
                    "type": str                     # "Reward" or "Metric"
                }
            ]
        }
    """
```

### 输入和输出字段
<a name="model-customize-evaluation-custom-scorers-fields"></a>

**输入字段**


| 字段 | 说明 | 附加说明 | 
| --- | --- | --- | 
| id | 样品的唯一标识符 | 在输出中回声。字符串格式 | 
| 消息 | 以 OpenAI 格式排序的聊天记录 | 消息对象数组 | 
| 消息 [] .role | 留言的发言人 | 常用值：“用户”、“助手”、“系统” | 
| 消息 [] .content | 消息的文字内容 | 纯字符串 | 
| 元数据 | 有助于评分的自由格式信息 | 对象；从训练数据传递的可选字段 | 

**输出字段**


**输出字段**  

| 字段 | 说明 | 附加说明 | 
| --- | --- | --- | 
| id | 与输入样本相同的标识符 | 必须匹配输入 | 
| 聚合\$1奖励\$1分数 | 样本的总分数 | 浮动（例如，0.0—1.0 或任务定义的范围） | 
| 指标列表 | 构成汇总的分量分数 | 指标对象数组 | 

### 所需权限
<a name="model-customize-evaluation-custom-scorers-permissions"></a>

确保您用于运行评估的 SageMaker 执行角色具有 AWS Lambda 权限。

```
{
    "Version": "2012-10-17",		 	 	 
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "lambda:InvokeFunction"
            ],
            "Resource": "arn:aws:lambda:region:account-id:function:function-name"
        }
    ]
}
```

确保您的 AWS Lambda 函数的执行角色具有基本的 Lambda 执行权限，以及任何下游调用可能需要的额外权限。 AWS 

```
{
  "Version": "2012-10-17",		 	 	 
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:*:*:*"
    }
  ]
}
```

# 模型部署
<a name="model-customize-open-weight-deployment"></a>

在自定义模型详情页面中，您还可以使用 SageMaker AI 推理终端节点或 Amazon Bedrock 部署您的自定义模型。

![\[包含精选模型定制技术的图像。\]](http://docs.aws.amazon.com/zh_cn/sagemaker/latest/dg/images/screenshot-open-model-1.png)


# 样本数据集和赋值器
<a name="model-customize-open-weight-samples"></a>

## 监督微调 (SFT)
<a name="model-customize-open-weight-samples-sft"></a>
+ 名称：TAT-QA
+ 许可证：CC-BY-4.0
+ 链接：[https://huggingface。 co/datasets/next-tat/TAT-QA](https://huggingface.co/datasets/next-tat/TAT-QA)
+ 预处理-格式化

**一个样本**

```
{
    "prompt": "Given a table and relevant text descriptions, answer the following question.\n\nTable:\n<table border=\"1\" class=\"dataframe\">\n  <tbody>\n    <tr>\n      <td></td>\n      <td>2019</td>\n      <td>2018</td>\n    </tr>\n    <tr>\n      <td></td>\n      <td>$'000</td>\n      <td>$'000</td>\n    </tr>\n    <tr>\n      <td>Revenue from external customers</td>\n      <td></td>\n      <td></td>\n    </tr>\n    <tr>\n      <td>Australia</td>\n      <td>144,621</td>\n      <td>129,431</td>\n    </tr>\n    <tr>\n      <td>New Zealand</td>\n      <td>13,036</td>\n      <td>8,912</td>\n    </tr>\n    <tr>\n      <td>Total</td>\n      <td>157,657</td>\n      <td>138,343</td>\n    </tr>\n  </tbody>\n</table>\n\nParagraphs:\n    4. SEGMENT INFORMATION\n\n    During the 2019 and 2018 financial years, the Group operated wholly within one business segment being the operation and management of storage centres in Australia and New Zealand.\n\n    The Managing Director is the Group\u2019s chief operating decision maker and monitors the operating results on a portfolio wide basis. Monthly management reports are evaluated based upon the overall performance of NSR consistent with the presentation within the consolidated financial statements. The Group\u2019s financing (including finance costs and finance income) are managed on a Group basis and not allocated to operating segments.\n\n    The operating results presented in the statement of profit or loss represent the same segment information as reported in internal management information.\n\n    The revenue information above excludes interest income and is based on the location of storage centres.\n\nQuestion: How many business segments were present in 2019 and 2018?\n\nAnswer:",
    "completion": "one",
    "data_idx": "2951"
}
```

## 直接偏好优化（DPO）
<a name="model-customize-open-weight-samples-dpo"></a>
+ 名称：超反馈
+ 执照：麻省理工
+ 链接：[https://huggingface。 co/datasets/openbmb/UltraFeedback](https://huggingface.co/datasets/openbmb/UltraFeedback)
+ 预处理步骤：格式化和二次采样。

**一个样本**

```
{
    "source": "evol_instruct",
    "prompt": "Can you write a C++ program that prompts the user to enter the name of a country and checks if it borders the Mediterranean Sea? Here's some starter code to help you out:\n#include <iostream>\n#include <string>\nusing namespace std;\nint main() {\n    string country;\n    // prompt user for input\n    cout << \"Enter the name of a country: \";\n    cin >> country;\n    // check if country borders the Mediterranean Sea\n    // [C++ code]\n    return 0;\n}",
    "chosen": "Here's a C++ program that prompts the user to enter the name of a country and checks if it borders the Mediterranean Sea:\n\n#include <iostream>\n#include <string>\n#include <set>\n#include <map>\n#include <algorithm>\n\nusing namespace std;\n\nint main() {\n    // store countries and their bordering seas in a map\n    map<string, set<string>> countries;\n    countries[\"Algeria\"] = {\"Mediterranean Sea\", \"North African Coast\"};\n    countries[\"France\"] = {\"Mediterranean Sea\", \"English Channel\"};\n    countries[\"Italy\"] = {\"Mediterranean Sea\", \"Adriatic Sea\"};\n    countries[\"Malta\"] = {\"Mediterranean Sea\", \"North African Coast\"};\n    countries[\"Spain\"] = {\"Mediterranean Sea\", \"Atlantic Ocean\"};\n    countries[\"Tunisia\"] = {\"Mediterranean Sea\", \"North African Coast\"};\n\n    string country;\n    cout << \"Enter the name of a country: \";\n    cin >> country;\n\n    // check if country is in the map\n    if (countries.find(country)!= countries.end()) {\n        // check if country borders the Mediterranean Sea\n        set<string> borders = countries.at(country);\n        if (borders.find(\"Mediterranean Sea\")!= borders.end()) {\n            cout << country << \" borders the Mediterranean Sea.\" << endl;\n        } else {\n            cout << country << \" does not border the Mediterranean Sea.\" << endl;\n        }\n    } else {\n        cout << country << \" is not a recognized country.\" << endl;\n    }\n\n    return 0;\n}\n\nIn this program, we first define a map called countries that stores each country's name as a key and a set of its surrounding seas as the value. We use a set to store each sea name so that we can easily check if a country borders a certain sea.\n\nWe then present the user with a prompt to enter the name of a country. We use a while loop to ensure that the user enters a valid country name. 
We check if the country is in the map, and if so, we retrieve its bordering seas from the map.\n\nFinally, we check if the Mediterranean Sea is among the set of borders. If it is, we inform the user that the country borders the Mediterranean Sea. If not, we inform the user that the country does not border the Mediterranean Sea.",
    "chosen-rating": 5.0,
    "chosen-model": "starchat",
    "rejected": "Sure, here is the program using the C++11 algorithm \"cds::algorithm::GreaterEqual\":\n#include <iostream>\n#include <string>\n#include <algorithm>\n#include <vector>\n#include <cctype>\n\nusing namespace std;\n\nint main() {\n    string country;\n    cout << \"Enter the name of a country: \";\n    cin >> country;\n    std::vector<string> vec;\n    vec.push_back(country);\n    size_t index = std::find_if(vec.begin(), vec.end(), [](const string& s) {\n        return std::any_of(s.begin(), s.end(), [](const char& c) {\n            return c == '}}';\n    });\n    if (index != vec.end()) {\n        if ((*index)[0] == 'M') {\n            cout << \"Country is bordered by the Mediterranean Sea.\" << endl;\n        } else {\n            cout << \"Country does not border the Mediterranean Sea.\" << endl;\n        }\n    } else {\n        cout << \"Country is not found.\" << endl;\n    }\n    return 0;\n}",
    "rejected-rating": 1.25,
    "rejected-model": "pythia-12b"
}
```

## 通过人工智能反馈进行强化学习 (RLAIF)-成对评判
<a name="model-customize-open-weight-samples-rlaif"></a>

**输入数据集**

源数据集：[https://github.com/WeOpenML/PandalM](https://github.com/WeOpenML/PandaLM)

**一个样本**

```
{
    "data_source": "WeOpenML/PandaLM",
    "prompt": [
        {
            "role": "user",
            "content": "Below are two responses for a given task. The task is defined by the Instruction with an Input that provides further context. Evaluate the responses and generate a reference answer for the task.\n\n
            ### Instruction:\nCompare the given products.\n\n### Input:\niPhone 11 and Google Pixel 4\n\n
            ### Response 1:\nThe iPhone 11 has a larger screen size and a longer battery life than the Google Pixel 4.\n\n
            ### Response 2:\nThe iPhone 11 and Google Pixel 4 are both flagship smartphones released in 2018. The iPhone 11 has a 6.1-inch LCD display, while the Google Pixel 4 has a 5.7-inch OLED display. The iPhone 11 has an A13 Bionic chipset, while the Google Pixel 4 has a Qualcomm Snapdragon 845 chipset. The iPhone 11 has a dual-camera system, while the Google Pixel 4 has a single camera system. The iPhone 11 has a longer battery life, while the Google Pixel 4 has a faster processor.\n\n### Evaluation:\n"
        }
    ],
    "ability": "pairwise-judging",
    "reward_model": {
        "style": "llmj",
        "ground_truth": "2\n\n### Reason: Response 2 provides a more detailed and comprehensive comparison of the two products, including their specifications and features. Response 1 only mentions two aspects of the products and does not provide as much information.\n\n### Reference: The iPhone 11 and Google Pixel 4 are both flagship smartphones released in 2018. The iPhone 11 has a 6.1-inch LCD display, while the Google Pixel 4 has a 5.7-inch OLED display. The iPhone 11 has an A13 Bionic chipset, while the Google Pixel 4 has a Qualcomm Snapdragon 845 chipset. The iPhone 11 has a dual-camera system, while the Google Pixel 4 has a single camera system. The iPhone 11 has a longer battery life, while the Google Pixel 4 has a faster processor."
    },
    "extra_info": {
        "split": "train",
        "index": 0,
        "raw_output_sequence": "2\n\n### Reason: Response 2 provides a more detailed and comprehensive comparison of the two products, including their specifications and features. Response 1 only mentions two aspects of the products and does not provide as much information.\n\n### Reference: The iPhone 11 and Google Pixel 4 are both flagship smartphones released in 2018. The iPhone 11 has a 6.1-inch LCD display, while the Google Pixel 4 has a 5.7-inch OLED display. The iPhone 11 has an A13 Bionic chipset, while the Google Pixel 4 has a Qualcomm Snapdragon 845 chipset. The iPhone 11 has a dual-camera system, while the Google Pixel 4 has a single camera system. The iPhone 11 has a longer battery life, while the Google Pixel 4 has a faster processor.\n",
        "llmj": {
            "question": "Below are two responses for a given task. The task is defined by the Instruction with an Input that provides further context. Evaluate the responses and generate a reference answer for the task.\n\n### Instruction:\nCompare the given products.\n\n### Input:\niPhone 11 and Google Pixel 4\n\n### Response 1:\nThe iPhone 11 has a larger screen size and a longer battery life than the Google Pixel 4.\n\n### Response 2:\nThe iPhone 11 and Google Pixel 4 are both flagship smartphones released in 2018. The iPhone 11 has a 6.1-inch LCD display, while the Google Pixel 4 has a 5.7-inch OLED display. The iPhone 11 has an A13 Bionic chipset, while the Google Pixel 4 has a Qualcomm Snapdragon 845 chipset. The iPhone 11 has a dual-camera system, while the Google Pixel 4 has a single camera system. The iPhone 11 has a longer battery life, while the Google Pixel 4 has a faster processor.\n\n### Evaluation:\n",
            "ground_truth": "2\n\n### Reason: Response 2 provides a more detailed and comprehensive comparison of the two products, including their specifications and features. Response 1 only mentions two aspects of the products and does not provide as much information.\n\n### Reference: The iPhone 11 and Google Pixel 4 are both flagship smartphones released in 2018. The iPhone 11 has a 6.1-inch LCD display, while the Google Pixel 4 has a 5.7-inch OLED display. The iPhone 11 has an A13 Bionic chipset, while the Google Pixel 4 has a Qualcomm Snapdragon 845 chipset. The iPhone 11 has a dual-camera system, while the Google Pixel 4 has a single camera system. The iPhone 11 has a longer battery life, while the Google Pixel 4 has a faster processor.",
            "document_in_context": null
        },
        "sample_size": 1980
    }
}
```

## RLAIF-思想链
<a name="model-customize-open-weight-samples-rlaif2"></a>

**输入数据集**

来源数据：[https://huggingface。 co/datasets/thesven/gsm8k-reasoning/tree/main/data](https://huggingface.co/datasets/thesven/gsm8k-reasoning/tree/main/data)

**一个样本**

```
{
    "data_source": "openai/gsm8k",
    "prompt": [
        {
            "role": "system",
            "content": "\nYou are an AI assistant that uses a Chain of Thought (CoT) approach with reflection to answer queries.\n\nFollow these steps:\n- Think through the problem step by step within the ‹thinking> tags.\n- Reflect on your thinking\n\nto check for any errors or improvements within the ‹reflection› tags.\n- Make any necessary adjustments based on your reflection.\n- Provide your final, concise answer within the ‹output> tags.\n\nImportant: The <thinking> and ‹reflection› sections are for your internal reasoning process only.\nDo not include any part of the final answer in these sections.\nThe actual response to the query must be entirely contained within the ‹output› tags.\n\nUse the following format for your response:\n<thinking>\n[Your initial thought process goes here]\n</thinking›\n\n<reasoning>\n[Your step-by-step reasoning goes here. This is your internal thought process, not the final answer. You can create as many reasoning steps as necessary in your process.]\n</reasoning>\n\n‹reflection>\n[Your reflection on your reasoning, checking for errors or improvements. You can create as many reflection steps as necessary in your process.]\n</ reflection>\n\n<adjustment>\n[Any adjustments to your thinking based on your reflection]\n</adjustment>\n\n<output>\n[Your final, concise answer to the query. This is the only part that will be shown to the user.]\n</output>\n"
        },
        {
            "role": "user",
            "content": "A craft store makes a third of its sales in the fabric section, a quarter of its sales in the jewelry section, and the rest in the stationery section. They made 36 sales today. How many sales were in the stationery section?"
        }
    ],
    "ability": "chain-of-thought",
    "reward_model": {
        "style": "llmj-cot",
        "ground_truth": "Thus, there were 36 - 12 - 9 = <<36-12-9=15>>15 sales in the stationery section."
    },
    "extra_info": {
        "split": "train",
        "index": 0,
        "question": "A craft store makes a third of its sales in the fabric section, a quarter of its sales in the jewelry section, and the rest in the stationery section. They made 36 sales today. How many sales were in the stationery section?",
        "short_answer": "15",
        "model_output": "<thinking>\nTo find the number of sales in the stationery section, I need to first calculate the number of sales in the fabric and jewelry sections. The store makes a third of its sales in the fabric section, which is 1/3 of 36 sales. I can calculate this as 36 * 1/3 = 12 sales in the fabric section. The store makes a quarter of its sales in the jewelry section, which is 1/4 of 36 sales. I can calculate this as 36 * 1/4 = 9 sales in the jewelry section. Since the rest of the sales are in the stationery section, I can subtract the sales from the other two sections from the total number of sales to find the number of sales in the stationery section.\n</thinking>\n\n<reasoning>\n1. Calculate the number of sales in the fabric section: 36 * 1/3 = 12\n2. Calculate the number of sales in the jewelry section: 36 * 1/4 = 9\n3. Calculate the number of sales in the stationery section: 36 - 12 - 9 = 15\n</reasoning>\n\n‹reflection>\nUpon reviewing my reasoning, I noticed that my calculations seem correct, but I should ensure that I understand the question correctly. The question asks for the number of sales in the stationery section, and I believe I have correctly calculated that. However, I should double-check my math to make sure it's accurate.\n</reflection>\n\n<adjustment>\nUpon re-examining my math, I am confident that my calculations are correct. The fabric section makes up 1/3 of the sales, the jewelry section makes up 1/4, and the rest is the stationery section. My math checks out, and I am confident in my answer.\n</adjustment>\n\n<output>\n15\n</output>"
    }
}
```

## RLAIF-Faithfulnesss
<a name="model-customize-open-weight-samples-rlaif3"></a>

**输入数据集**

来源：[https://huggingface。 co/datasets/rajpurkar/squad\$1v2/blob/main/squad\$1v2/train-00000-of-00001.parquet](https://huggingface.co/datasets/rajpurkar/squad_v2/blob/main/squad_v2/train-00000-of-00001.parquet)

**一个样本**

```
{
    "data_source": "squad_v2",
    "prompt": [
        {
            "role": "system",
            "content": "You are a helpful assistant that answers questions based on the provided context. Only use information from the context."
        },
        {
            "role": "user",
            "content": "Context: Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles \"Crazy in Love\" and \"Baby Boy\".\n\nQuestion: When did Beyonce start becoming popular?"
        }
    ],
    "ability": "faithfulness",
    "reward_model": {
        "style": "llmj-faithfulness",
        "ground_truth": "Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles \"Crazy in Love\" and \"Baby Boy\"."
    },
    "extra_info": {
        "question": "When did Beyonce start becoming popular?",
        "split": "train",
        "index": 0
    }
}
```

## RLAIF-摘要
<a name="model-customize-open-weight-samples-rlaif4"></a>

**输入数据集**

[来源：已清理的 gsm8k 数据集 https://huggingface。 co/datasets/thesven/gsm8k-reasoning/tree/main/data](https://huggingface.co/datasets/thesven/gsm8k-reasoning/tree/main/data)

**一个样本**

```
{
    "data_source": "cnn_dailymail",
    "prompt": [
        {
            "role": "system",
            "content": "You are a helpful assistant that creates concise, accurate summaries of news articles. Focus on the key facts and main points."
        },
        {
            "role": "user",
            "content": "Summarize the following article:\n\nLONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won't cast a spell on him. Daniel Radcliffe as Harry Potter in \"Harry Potter and the Order of the Phoenix\" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. \"I don't plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar,\" he told an Australian interviewer earlier this month. \"I don't think I'll be particularly extravagant. \"The things I like buying are things that cost about 10 pounds -- books and CDs and DVDs.\" At 18, Radcliffe will be able to gamble in a casino, buy a drink in a pub or see the horror film \"Hostel: Part II,\" currently six places below his number one movie on the UK box office chart. Details of how he'll mark his landmark birthday are under wraps. His agent and publicist had no comment on his plans. \"I'll definitely have some sort of party,\" he said in an interview. \"Hopefully none of you will be reading about it.\" Radcliffe's earnings from the first five Potter films have been held in a trust fund which he has not been able to touch. Despite his growing fame and riches, the actor says he is keeping his feet firmly on the ground. \"People are always looking to say 'kid star goes off the rails,'\" he told reporters last month. \"But I try very hard not to go that way because it would be too easy for them.\" His latest outing as the boy wizard in \"Harry Potter and the Order of the Phoenix\" is breaking records on both sides of the Atlantic and he will reprise the role in the last two films.  Watch I-Reporter give her review of Potter's latest » . There is life beyond Potter, however. 
The Londoner has filmed a TV movie called \"My Boy Jack,\" about author Rudyard Kipling and his son, due for release later this year. He will also appear in \"December Boys,\" an Australian film about four boys who escape an orphanage. Earlier this year, he made his stage debut playing a tortured teenager in Peter Shaffer's \"Equus.\" Meanwhile, he is braced for even closer media scrutiny now that he's legally an adult: \"I just think I'm going to be more sort of fair game,\" he told Reuters. E-mail to a friend . Copyright 2007 Reuters. All rights reserved.This material may not be published, broadcast, rewritten, or redistributed."
        }
    ],
    "ability": "summarization",
    "reward_model": {
        "style": "llmj-summarization",
        "ground_truth": "Harry Potter star Daniel Radcliffe gets £20M fortune as he turns 18 Monday .\nYoung actor says he has no plans to fritter his cash away .\nRadcliffe's earnings from first five Potter films have been held in trust fund ."
    },
    "extra_info": {
        "question": "LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won't cast a spell on him. Daniel Radcliffe as Harry Potter in \"Harry Potter and the Order of the Phoenix\" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. \"I don't plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar,\" he told an Australian interviewer earlier this month. \"I don't think I'll be particularly extravagant. \"The things I like buying are things that cost about 10 pounds -- books and CDs and DVDs.\" At 18, Radcliffe will be able to gamble in a casino, buy a drink in a pub or see the horror film \"Hostel: Part II,\" currently six places below his number one movie on the UK box office chart. Details of how he'll mark his landmark birthday are under wraps. His agent and publicist had no comment on his plans. \"I'll definitely have some sort of party,\" he said in an interview. \"Hopefully none of you will be reading about it.\" Radcliffe's earnings from the first five Potter films have been held in a trust fund which he has not been able to touch. Despite his growing fame and riches, the actor says he is keeping his feet firmly on the ground. \"People are always looking to say 'kid star goes off the rails,'\" he told reporters last month. \"But I try very hard not to go that way because it would be too easy for them.\" His latest outing as the boy wizard in \"Harry Potter and the Order of the Phoenix\" is breaking records on both sides of the Atlantic and he will reprise the role in the last two films.  Watch I-Reporter give her review of Potter's latest » . There is life beyond Potter, however. 
The Londoner has filmed a TV movie called \"My Boy Jack,\" about author Rudyard Kipling and his son, due for release later this year. He will also appear in \"December Boys,\" an Australian film about four boys who escape an orphanage. Earlier this year, he made his stage debut playing a tortured teenager in Peter Shaffer's \"Equus.\" Meanwhile, he is braced for even closer media scrutiny now that he's legally an adult: \"I just think I'm going to be more sort of fair game,\" he told Reuters. E-mail to a friend . Copyright 2007 Reuters. All rights reserved.This material may not be published, broadcast, rewritten, or redistributed.",
        "split": "train",
        "index": 0,
        "source_id": "42c027e4ff9730fbb3de84c1af0d2c50"
    }
}
```

## RLAIF-自定义提示音
<a name="model-customize-open-weight-samples-rlaif5"></a>

在这个例子中，我们[RLAIF-思想链](#model-customize-open-weight-samples-rlaif2)用来讨论自定义 jinja 提示如何取代其中一个预设提示。

**以下是 CoT 的自定义提示示例：**

```
You are an expert logical reasoning evaluator specializing in Chain-of-Thought (CoT) analysis. 

Given: A problem prompt and a model's reasoning-based response. 

Goal: Assess the quality and structure of logical reasoning, especially for specialized domains (law, medicine, finance, etc.).

Scoring rubric (start at 0.0, then add or subtract):

Core Components:

Structural Completeness (0.3 max)
- Clear problem statement: +0.05
- Defined variables/terminology: +0.05
- Organized given information: +0.05
- Explicit proof target: +0.05
- Step-by-step reasoning: +0.05
- Clear conclusion: +0.05

Logical Quality (0.4 max)
- Valid logical flow: +0.1
- Proper use of if-then relationships: +0.1
- Correct application of domain principles: +0.1
- No logical fallacies: +0.1

Technical Accuracy (0.3 max)
- Correct use of domain terminology: +0.1
- Accurate application of domain rules: +0.1
- Proper citation of relevant principles: +0.1

Critical Deductions:
A. Invalid logical leap: -0.3
B. Missing critical steps: -0.2
C. Incorrect domain application: -0.2
D. Unclear/ambiguous reasoning: -0.1

Additional Instructions:
- Verify domain-specific terminology and principles
- Check for logical consistency throughout
- Ensure conclusions follow from premises
- Flag potential domain-specific compliance issues
- Consider regulatory/professional standards where applicable

Return EXACTLY this JSON (no extra text):
{
    "score": <numerical score 0.0-1.0>,
    "component_scores": {
        "structural_completeness": <score>,
        "logical_quality": <score>,
        "technical_accuracy": <score>
    },
    "steps_present": {
        "problem_statement": <true/false>,
        "variable_definitions": <true/false>,
        "given_information": <true/false>,
        "proof_target": <true/false>,
        "step_reasoning": <true/false>,
        "conclusion": <true/false>
    },
    "reasoning": "<explain scoring decisions and identify any logical gaps>",
    "domain_flags": ["<any domain-specific concerns or compliance issues>"]
}

### (Prompt field from dataset)
Problem Prompt: {{ prompt }}

Model's Response: {{ model_output }}

### Ground truth (if applicable):
{{ ground_truth }}
```

## 通过可验证奖励 (RLVR) 进行强化学习-精确匹配
<a name="model-customize-open-weight-samples-RLVR"></a>

**输入数据集**

来源：[https://huggingface。 co/datasets/openai/gsm8k](https://huggingface.co/datasets/openai/gsm8k)

**样本**

```
{
  "data_source": "openai/gsm8k",
  "prompt": [
    {
      "content": "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? Let\'s think step by step and output the final answer after \\"####\\".",
      "role": "user"
    }
  ],
  "ability": "math",
  "reward_model": {
    "ground_truth": "72",
    "style": "rule"
  },
  "extra_info": {
    "answer": "Natalia sold 48\\/2 = <<48\\/2=24>>24 clips in May.\\nNatalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\\n#### 72",
    "index": 0,
    "question": "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?",
    "split": "train"
  }
}
```

## RLVR-代码执行
<a name="model-customize-open-weight-samples-RLVR2"></a>

**输入数据集**

来源：[https://huggingface。 co/datasets/open-r1/codeforces](https://huggingface.co/datasets/open-r1/codeforces)

**样本**

```
{
  "data_source": "codeforces",
  "prompt": [
    {
      "content": "\nWhen tackling complex reasoning tasks, you have access to the following actions. Use them as needed to progress through your thought process.\n\n[ASSESS]\n\n[ADVANCE]\n\n[VERIFY]\n\n[SIMPLIFY]\n\n[SYNTHESIZE]\n\n[PIVOT]\n\n[OUTPUT]\n\nYou should strictly follow the format below:\n\n[ACTION NAME]\n\n# Your action step 1\n\n# Your action step 2\n\n# Your action step 3\n\n...\n\nNext action: [NEXT ACTION NAME]\n\n",
      "role": "system"
    },
    {
      "content": "Title: Zebras\n\nTime Limit: None seconds\n\nMemory Limit: None megabytes\n\nProblem Description:\nOleg writes down the history of the days he lived. For each day he decides if it was good or bad. Oleg calls a non-empty sequence of days a zebra, if it starts with a bad day, ends with a bad day, and good and bad days are alternating in it. Let us denote bad days as 0 and good days as 1. Then, for example, sequences of days 0, 010, 01010 are zebras, while sequences 1, 0110, 0101 are not.\n\nOleg tells you the story of days he lived in chronological order in form of string consisting of 0 and 1. Now you are interested if it is possible to divide Oleg's life history into several subsequences, each of which is a zebra, and the way it can be done. Each day must belong to exactly one of the subsequences. For each of the subsequences, days forming it must be ordered chronologically. Note that subsequence does not have to be a group of consecutive days.\n\nInput Specification:\nIn the only line of input data there is a non-empty string *s* consisting of characters 0 and 1, which describes the history of Oleg's life. Its length (denoted as |*s*|) does not exceed 200<=000 characters.\n\nOutput Specification:\nIf there is a way to divide history into zebra subsequences, in the first line of output you should print an integer *k* (1<=\u2264<=*k*<=\u2264<=|*s*|), the resulting number of subsequences. In the *i*-th of following *k* lines first print the integer *l**i* (1<=\u2264<=*l**i*<=\u2264<=|*s*|), which is the length of the *i*-th subsequence, and then *l**i* indices of days forming the subsequence. Indices must follow in ascending order. Days are numbered starting from 1. Each index from 1 to *n* must belong to exactly one subsequence. If there is no way to divide day history into zebra subsequences, print -1.\n\nSubsequences may be printed in any order. If there are several solutions, you may print any of them. 
You do not have to minimize nor maximize the value of *k*.\n\nDemo Input:\n['0010100\\n', '111\\n']\n\nDemo Output:\n['3\\n3 1 3 4\\n3 2 5 6\\n1 7\\n', '-1\\n']\n\nNote:\nnone\n\nWrite Python code to solve the problem. Present the code in \n```python\nYour code\n```\nat the end.",
      "role": "user"
    }
  ],
  "ability": "code",
  "reward_model": {
    "ground_truth": "{\"inputs\": [\"0010100\", \"111\", \"0\", \"1\", \"0101010101\", \"010100001\", \"000111000\", \"0101001000\", \"0000001000\", \"0101\", \"000101110\", \"010101010\", \"0101001010\", \"0100101100\", \"0110100000\", \"0000000000\", \"1111111111\", \"0010101100\", \"1010000\", \"0001110\", \"0000000000011001100011110101000101000010010111000100110110000011010011110110001100100001001001010010\", \"01010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010\", \"0010011100000000\"], \"outputs\": [\"3\\n1 1\\n5 2 3 4 5 6\\n1 7\", \"-1\", \"1\\n1 1\", \"-1\", \"-1\", \"-1\", \"3\\n3 1 6 7\\n3 2 5 8\\n3 3 4 9\", \"4\\n5 1 2 3 4 5\\n3 6 7 8\\n1 9\\n1 10\", \"8\\n1 1\\n1 2\\n1 3\\n1 4\\n1 5\\n3 6 7 8\\n1 9\\n1 10\", \"-1\", \"-1\", \"1\\n9 1 2 3 4 5 6 7 8 9\", \"2\\n5 1 2 3 4 5\\n5 6 7 8 9 10\", \"2\\n5 1 2 3 8 9\\n5 4 5 6 7 10\", \"-1\", \"10\\n1 1\\n1 2\\n1 3\\n1 4\\n1 5\\n1 6\\n1 7\\n1 8\\n1 9\\n1 10\", \"-1\", \"2\\n3 1 8 9\\n7 2 3 4 5 6 7 10\", \"-1\", \"-1\", \"22\\n1 1\\n1 2\\n1 3\\n1 4\\n1 5\\n1 6\\n1 7\\n1 8\\n7 9 24 25 26 27 28 29\\n7 10 13 14 17 18 23 30\\n11 11 12 15 16 19 22 31 32 33 34 35\\n3 20 21 36\\n3 37 46 47\\n9 38 39 40 45 48 57 58 75 76\\n17 41 42 43 44 49 50 51 54 55 56 59 72 73 74 77 80 81\\n9 52 53 60 71 78 79 82 83 84\\n7 61 64 65 66 67 70 85\\n5 62 63 68 69 86\\n3 87 88 89\\n3 90 91 92\\n5 93 94 95 96 97\\n3 98 99 100\", \"1\\n245 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 
133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 ...\", \"8\\n3 1 8 9\\n5 2 3 4 7 10\\n3 5 6 11\\n1 12\\n1 13\\n1 14\\n1 15\\n1 16\"]}",
    "style": "rule"
  },
  "extra_info": {
    "index": 49,
    "split": "train"
  }
}
```

**奖励功能**

奖励功能：[https://github.com/volcengine/verl/tree/main/verl/utils/reward\$1score/prime\$1cod](https://github.com/volcengine/verl/tree/main/verl/utils/reward_score/prime_code) e

## RLVR-数学答案
<a name="model-customize-open-weight-samples-RLVR3"></a>

**输入数据集**

[来源：已清理的 gsm8k 数据集 https://huggingface。 co/datasets/thesven/gsm8k-reasoning/tree/main/data](https://huggingface.co/datasets/thesven/gsm8k-reasoning/tree/main/data)

**样本**

```
[
    {
        "content": "\nYou are an AI assistant that uses a Chain of Thought (CoT) approach with reflection to answer queries...",
        "role": "system"
    },
    {
        "content": "A craft store makes a third of its sales in the fabric section, a quarter of its sales in the jewelry section, and the rest in the stationery section. They made 36 sales today. How many sales were in the stationery section?",
        "role": "user"
    },
    {
        "content": "\nTo find the number of sales in the stationery section, I need to first calculate the number of sales in the fabric and jewelry sections...\n\n\n<reasoning>\n1. Calculate the number of sales in the fabric section: 36 * 1/3 = 12\n2. Calculate the number of sales in the jewelry section: 36 * 1/4 = 9\n3. Calculate the number of sales in the stationery section: 36 - 12 - 9 = 15\n</reasoning>\n\n<reflection>\nUpon reviewing my reasoning, I noticed that my calculations seem correct...\n</reflection>\n\n<adjustment>\nUpon re-examining my math, I am confident that my calculations are correct...\n</adjustment>\n\n<output>\n15\n</output>",
        "role": "assistant"
    }
]
```

**奖励计算**

奖励功能：[https://github.com/volcengine/verl/blob/main/verl/utils/reward\$1score/gsm8k.py](https://github.com/volcengine/verl/blob/main/verl/utils/reward_score/gsm8k.py)

## RLVR-自定义 Lambda
<a name="model-customize-open-weight-samples-RLVR4"></a>

**输入数据集**

[来源：已清理的 gsm8k 数据集 https://huggingface。 co/datasets/thesven/gsm8k-reasoning/tree/main/data](https://huggingface.co/datasets/thesven/gsm8k-reasoning/tree/main/data)

**样本**

```
[
  {
    "content": "\nYou are an AI assistant that uses a Chain of Thought (CoT) approach with reflection to answer queries.\n\nFollow these steps:\n- Think through the problem step by step within the ‹thinking> tags.\n- Reflect on your thinking\n\nto check for any errors or improvements within the ‹reflection› tags.\n- Make any necessary adjustments based on your reflection.\n- Provide your final, concise answer within the ‹output> tags.\n\nImportant: The <thinking> and ‹reflection› sections are for your internal reasoning process only.\nDo not include any part of the final answer in these sections.\nThe actual response to the query must be entirely contained within the ‹output› tags.\n\nUse the following format for your response:\n<thinking>\n[Your initial thought process goes here]\n</thinking›\n\n<reasoning>\n[Your step-by-step reasoning goes here. This is your internal thought process, not the final answer. You can create as many reasoning steps as necessary in your process.]\n</reasoning>\n\n‹reflection>\n[Your reflection on your reasoning, checking for errors or improvements. You can create as many reflection steps as necessary in your process.]\n</ reflection>\n\n<adjustment>\n[Any adjustments to your thinking based on your reflection]\n</adjustment>\n\n<output>\n[Your final, concise answer to the query. This is the only part that will be shown to the user.]\n</output>\n",
    "role": "system"
  },
  {
    "content": "A craft store makes a third of its sales in the fabric section, a quarter of its sales in the jewelry section, and the rest in the stationery section. They made 36 sales today. How many sales were in the stationery section?",
    "role": "user"
  },
  {
    "content": "<thinking>\nTo find the number of sales in the stationery section, I need to first calculate the number of sales in the fabric and jewelry sections. The store makes a third of its sales in the fabric section, which is 1/3 of 36 sales. I can calculate this as 36 * 1/3 = 12 sales in the fabric section. The store makes a quarter of its sales in the jewelry section, which is 1/4 of 36 sales. I can calculate this as 36 * 1/4 = 9 sales in the jewelry section. Since the rest of the sales are in the stationery section, I can subtract the sales from the other two sections from the total number of sales to find the number of sales in the stationery section.\n</thinking>\n\n<reasoning>\n1. Calculate the number of sales in the fabric section: 36 * 1/3 = 12\n2. Calculate the number of sales in the jewelry section: 36 * 1/4 = 9\n3. Calculate the number of sales in the stationery section: 36 - 12 - 9 = 15\n</reasoning>\n\n‹reflection>\nUpon reviewing my reasoning, I noticed that my calculations seem correct, but I should ensure that I understand the question correctly. The question asks for the number of sales in the stationery section, and I believe I have correctly calculated that. However, I should double-check my math to make sure it's accurate.\n</reflection>\n\n<adjustment>\nUpon re-examining my math, I am confident that my calculations are correct. The fabric section makes up 1/3 of the sales, the jewelry section makes up 1/4, and the rest is the stationery section. My math checks out, and I am confident in my answer.\n</adjustment>\n\n<output>\n15\n</output>",
    "role": "assistant"
  }
]
```

**奖励计算示例**

```
# RLVR Evaluator for OSS

# lambda_grader.py
import json
import re
import uuid
from typing import Any, Dict, List
 
def custom_reward(assistant_answer: str, ground_truth: str) -> float:
    """
    Add custom reward computation here
 
    Example:-
    Reward = fraction of ground-truth words that are correct
    in the correct position.
 
    Example:
      gt:   "the cat sat"
      ans:  "the dog sat"
 
      word-by-word:
        "the" == "the"  -> correct
        "dog" != "cat"  -> wrong
        "sat" == "sat"  -> correct
 
      correct = 2 out of 3 -> reward = 2/3 ≈ 0.67
    """
    ans_words = assistant_answer.strip().lower().split()
    gt_words = ground_truth.strip().lower().split()
 
    if not gt_words:
        return 0.0
 
    correct = 0
    for aw, gw in zip(ans_words, gt_words):
        if aw == gw:
            correct += 1
 
    return correct / len(gt_words)
 
 
# Lambda utility functions
def _ok(body: Any, code: int = 200) -> Dict[str, Any]:
    return {
        "statusCode": code,
        "headers": {
            "content-type": "application/json",
            "access-control-allow-origin": "*",
            "access-control-allow-methods": "POST,OPTIONS",
            "access-control-allow-headers": "content-type",
        },
        "body": json.dumps(body),
    }
 
def _assistant_text(sample: Dict[str, Any]) -> str:
    """Extract assistant text from sample messages."""
    for m in reversed(sample.get("messages", [])):
        if m.get("role") == "assistant":
            return (m.get("content") or "").strip()
    return ""
 
def _sample_id(sample: Dict[str, Any]) -> str:
    """Generate or extract sample ID."""
    if isinstance(sample.get("id"), str) and sample["id"]:
        return sample["id"]
 
    return str(uuid.uuid4())
 
def _ground_truth(sample: Dict[str, Any]) -> str:
    """Extract ground truth from sample or metadata if available"""
 
    if isinstance(sample.get("reference_answer"), str) and sample["reference_answer"]:
        return sample["reference_answer"].strip()
 
    md = sample.get("metadata") or {}
    gt = md.get("reference_answer", None) or md.get("ground_truth", None)
    if gt is None:
        return ""
    return str(gt).strip()
 
 
def _score_and_metrics(sample: Dict[str, Any]) -> Dict[str, Any]:
    sid = _sample_id(sample)
    solution_text = _assistant_text(sample)
 
    # Extract ground truth
    gt = _ground_truth(sample)
 
    metrics_list: List[Dict[str, Any]] = []
 
    # Custom rlvr scoring
    if solution_text and gt:
        
        # Compute score
        reward_score = custom_reward(
            assistant_answer=solution_text,
            ground_truth=gt
        )
        
        # Add detailed metrics
        metrics_list.append({
            "name": "custom_reward_score", 
            "value": float(reward_score), 
            "type": "Reward"
        })
       
        # The aggregate reward score is the custom reward score
        aggregate_score = float(reward_score)
        
    else:
        # No solution text or ground truth - default to 0
        aggregate_score = 0.0
        metrics_list.append({
            "name": "default_zero", 
            "value": 0.0, 
            "type": "Reward"
        })
    print("detected score", {
        "id": sid,
        "aggregate_reward_score": float(aggregate_score),
        "metrics_list": metrics_list,
    })
    return {
        "id": sid,
        "aggregate_reward_score": float(aggregate_score),
        "metrics_list": metrics_list,
    }
 
def lambda_handler(event, context):
    """AWS Lambda handler for custom reward lambda grading."""
    # CORS preflight
    if event.get("requestContext", {}).get("http", {}).get("method") == "OPTIONS":
        return _ok({"ok": True})
 
    # Body may be a JSON string (API GW/Function URL) or already a dict (Invoke)
    raw = event.get("body") or "{}"
    try:
        body = json.loads(raw) if isinstance(raw, str) else raw
    except Exception as e:
        return _ok({"error": f"invalid JSON body: {e}"}, 400)
 
    # Accept top-level list, {"batch":[...]}, or single sample object
    if isinstance(body, dict) and isinstance(body.get("batch"), list):
        samples = body["batch"]
    else:
        return _ok({
            "error": "Send a sample object, or {'batch':[...]} , or a top-level list of samples."
        }, 400)
 
    try:
        results = [_score_and_metrics(s) for s in samples]
    except Exception as e:
        return _ok({"error": f"Custom scoring failed: {e}"}, 500)
 
    return _ok(results)
```

**奖励函数代码示例**

```
# RLVR Evaluator for OSS
# lambda_grader.py

import json
import re
import uuid from typing 
import Any, Dict, List
 
def custom_reward(assistant_answer: str, ground_truth: str) -> float:
    """
    Add custom reward computation here
 
    Example:-
    Reward = fraction of ground-truth words that are correct
    in the correct position.
 
    Example:
      gt:   "the cat sat"
      ans:  "the dog sat"
 
      word-by-word:
        "the" == "the"  -> correct
        "dog" != "cat"  -> wrong
        "sat" == "sat"  -> correct
 
      correct = 2 out of 3 -> reward = 2/3 ≈ 0.67
    """
    ans_words = assistant_answer.strip().lower().split()
    gt_words = ground_truth.strip().lower().split()
 
    if not gt_words:
        return 0.0
 
    correct = 0
    for aw, gw in zip(ans_words, gt_words):
        if aw == gw:
            correct += 1
 
    return correct / len(gt_words)
 
 
# Lambda utility functions
def _ok(body: Any, code: int = 200) -> Dict[str, Any]:
    return {
        "statusCode": code,
        "headers": {
            "content-type": "application/json",
            "access-control-allow-origin": "*",
            "access-control-allow-methods": "POST,OPTIONS",
            "access-control-allow-headers": "content-type",
        },
        "body": json.dumps(body),
    }
 
def _assistant_text(sample: Dict[str, Any]) -> str:
    """Extract assistant text from sample messages."""
    for m in reversed(sample.get("messages", [])):
        if m.get("role") == "assistant":
            return (m.get("content") or "").strip()
    return ""
 
def _sample_id(sample: Dict[str, Any]) -> str:
    """Generate or extract sample ID."""
    if isinstance(sample.get("id"), str) and sample["id"]:
        return sample["id"]
 
    return str(uuid.uuid4())
 
def _ground_truth(sample: Dict[str, Any]) -> str:
    """Extract ground truth from sample or metadata if available"""
 
    if isinstance(sample.get("reference_answer"), str) and sample["reference_answer"]:
        return sample["reference_answer"].strip()
 
    md = sample.get("metadata") or {}
    gt = md.get("reference_answer", None) or md.get("ground_truth", None)
    if gt is None:
        return ""
    return str(gt).strip()
 
 
def _score_and_metrics(sample: Dict[str, Any]) -> Dict[str, Any]:
    sid = _sample_id(sample)
    solution_text = _assistant_text(sample)
 
    # Extract ground truth
    gt = _ground_truth(sample)
 
    metrics_list: List[Dict[str, Any]] = []
 
    # Custom rlvr scoring
    if solution_text and gt:
        
        # Compute score
        reward_score = custom_reward(
            assistant_answer=solution_text,
            ground_truth=gt
        )
        
        # Add detailed metrics
        metrics_list.append({
            "name": "custom_reward_score", 
            "value": float(reward_score), 
            "type": "Reward"
        })
       
        # The aggregate reward score is the custom reward score
        aggregate_score = float(reward_score)
        
    else:
        # No solution text or ground truth - default to 0
        aggregate_score = 0.0
        metrics_list.append({
            "name": "default_zero", 
            "value": 0.0, 
            "type": "Reward"
        })
    print("detected score", {
        "id": sid,
        "aggregate_reward_score": float(aggregate_score),
        "metrics_list": metrics_list,
    })
    return {
        "id": sid,
        "aggregate_reward_score": float(aggregate_score),
        "metrics_list": metrics_list,
    }
 
def lambda_handler(event, context):
    """AWS Lambda handler for custom reward lambda grading."""
    # CORS preflight
    if event.get("requestContext", {}).get("http", {}).get("method") == "OPTIONS":
        return _ok({"ok": True})
 
    # Body may be a JSON string (API GW/Function URL) or already a dict (Invoke)
    raw = event.get("body") or "{}"
    try:
        body = json.loads(raw) if isinstance(raw, str) else raw
    except Exception as e:
        return _ok({"error": f"invalid JSON body: {e}"}, 400)
 
    # Accept top-level list, {"batch":[...]}, or single sample object
    if isinstance(body, dict) and isinstance(body.get("batch"), list):
        samples = body["batch"]
    else:
        return _ok({
            "error": "Send a sample object, or {'batch':[...]} , or a top-level list of samples."
        }, 400)
 
    try:
        results = [_score_and_metrics(s) for s in samples]
    except Exception as e:
        return _ok({"error": f"Custom scoring failed: {e}"}, 500)
 
    return _ok(results)
```

**奖励提示示例**

```
You are an expert RAG response evaluator specializing in faithfulness and relevance assessment.
Given: Context documents, a question, and response statements.
Goal: Evaluate both statement-level faithfulness and overall response relevance to the question.

Scoring rubric (start at 0.0, then add or subtract):

Core Components:

Faithfulness Assessment (0.6 max)
Per statement evaluation:
- Direct support in context: +0.2
- Accurate inference from context: +0.2
- No contradictions with context: +0.2
Deductions:
- Hallucination: -0.3
- Misrepresentation of context: -0.2
- Unsupported inference: -0.1

Question Relevance (0.4 max)
- Direct answer to question: +0.2
- Appropriate scope/detail: +0.1
- Proper context usage: +0.1
Deductions:
- Off-topic content: -0.2
- Implicit/meta responses: -0.2
- Missing key information: -0.1

Critical Flags:
A. Complete hallucination
B. Context misalignment
C. Question misinterpretation
D. Implicit-only responses

Additional Instructions:
- Evaluate each statement independently
- Check for direct textual support
- Verify logical inferences
- Assess answer completeness
- Flag any unsupported claims

Return EXACTLY this JSON (no extra text):
{
    "statements_evaluation": [
        {
            "statement": "<statement_text>",
            "verdict": <0 or 1>,
            "reason": "<detailed explanation>",
            "context_support": "<relevant context quote or 'None'>"
        }
    ],
    "overall_assessment": {
        "question_addressed": <0 or 1>,
        "reasoning": "<explanation>",
        "faithfulness_score": <0.0-1.0>,
        "relevance_score": <0.0-1.0>
    },
    "flags": ["<any critical issues>"]
}

## Current Evaluation Task

### Context
{{ ground_truth }}

### Question
{{ extra_info.question }}

### Model's Response
{{ model_output }}
```

# 发行说明
<a name="model-customize-release-note"></a>

**A SageMaker I 模型自定义图片**

**Support 计划**
+ 主要版本：下一个主要版本发布后 12 个月
+ 次要版本：下一个次要版本发布后 6 个月
+ 补丁版本：不保证支持（升级到最新补丁）

以下是适用于亚马逊 EKS (EKS) 的 Base Deep Learning Containers (EKS) 和 SageMaker 人工智能训练作业 (SMTJ) 的发行说明：


****  

| 版本 | Type | 服务 | 图片网址 | 
| --- | --- | --- | --- | 
|  1.0.0  | CUDA | EKS |  `652744875666.dkr.ecr.amazonaws.com/hyperpod-model-customization:verl-eks-v1.0.0`  | 
|  1.0.0  | CUDA | SMTJ |  `652744875666.dkr.ecr.amazonaws.com/hyperpod-model-customization:verl-smtj-v1.0.0`  | 
|  1.0.0  | CUDA | SMJT |  `652744875666.dkr.ecr.amazonaws.com/hyperpod-model-customization:v1-v1.0.0`  | 
|  1.0.0  | CUDA | SMTJ |  `652744875666.dkr.ecr.amazonaws.com/hyperpod-model-customization:llama-90b-v1.0.0`  | 

**AWS 区域 支持**


****  

| Region | 代码 | 无服务器 SMTJ 支持 | 
| --- | --- | --- | 
| 亚太地区（孟买） | ap-south-1 | 否 | 
| 亚太地区（新加坡） | ap-southeast-1 | 否 | 
| 亚太地区（悉尼） | ap-southeast-2 | 否 | 
| 亚太地区（东京) | ap-northeast-1 | 是 | 
| 欧洲地区（法兰克福） | eu-central-1 | 否 | 
| 欧洲地区（爱尔兰） | eu-west-1 | 是 | 
| 欧洲地区（斯德哥尔摩） | eu-north-1 | 否 | 
| 南美洲（圣保罗） | sa-east-1 | 否 | 
| 美国东部（弗吉尼亚州北部） | us-east-1 | 是 | 
| 美国东部（俄亥俄州） | us-east-2 | 否 | 
| 美国西部（北加利福尼亚） | us-west-1 | 否 | 
| 美国西部（俄勒冈州） | us-west-2 | 是 | 