
Create a prompt dataset for a model evaluation job that uses a model as a judge

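To create a model evaluation job that uses a model as a judge, you must specify a prompt dataset. This prompt dataset uses the same format as automatic model evaluation jobs and is used during inference with the models you select for evaluation.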

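If you want to evaluate non-Amazon Bedrock models using responses that you have already generated, include those responses in the prompt dataset as described in Prepare a dataset for an evaluation job using your own inference response data. When you provide your own inference response data, Amazon Bedrock skips the model invocation step and performs the evaluation job with the data you provide.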

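Custom prompt datasets must be stored in Amazon S3, use the JSON Lines format, and have the .jsonl file extension. Each line must be a valid JSON object. A dataset can contain up to 1000 prompts per evaluation job.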
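
The format rules above can be checked locally before a file is uploaded. The following is a minimal sketch, not an official validator: the function name `find_jsonl_problems` is hypothetical, and treating a blank line as an error is an assumption drawn from the rule that each line must be a valid JSON object.

```python
import json

MAX_PROMPTS = 1000  # per-job limit stated in this section


def find_jsonl_problems(text: str, max_prompts: int = MAX_PROMPTS) -> list[str]:
    """Return a list of human-readable problems found in JSON Lines text."""
    problems = []
    lines = text.splitlines()
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            # Assumption: a blank line is not a valid JSON object, so flag it.
            problems.append(f"line {number}: blank line")
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append(f"line {number}: invalid JSON ({err.msg})")
            continue
        if not isinstance(record, dict):
            # Arrays, strings, and numbers are valid JSON but not JSON objects.
            problems.append(f"line {number}: not a JSON object")
    if len(lines) > max_prompts:
        problems.append(f"{len(lines)} lines exceed the limit of {max_prompts} prompts")
    return problems
```

An empty result means the text passed these checks only; it does not guarantee that Amazon Bedrock will accept the dataset.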

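Evaluation jobs that use an LLM as a judge don't require a CORS configuration. For human-based evaluation jobs, CORS must be configured on the S3 output bucket. To learn more, see Required Cross Origin Resource Sharing (CORS) permissions on S3 buckets.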

Prepare a dataset for an evaluation job in which Amazon Bedrock invokes a model for you

To run an evaluation job in which Amazon Bedrock invokes the models for you, create a prompt dataset that contains the following key-value pairs:

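prompt – The prompt you want the models to respond to.
referenceResponse – (Optional) The ground truth response.
category – (Optional) Generates evaluation scores reported for each category.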

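Note

If you choose to supply a ground truth response (referenceResponse), Amazon Bedrock uses this parameter when calculating the Completeness (Builtin.Completeness) and Correctness (Builtin.Correctness) metrics. You can also use these metrics without supplying a ground truth response. To see the judge prompts for both scenarios, see the section for your chosen judge model in Built-in metric evaluator prompts for model-as-a-judge evaluation jobs.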

The following is an example custom dataset that contains 6 inputs and uses the JSON Lines format.


{"prompt":"Provide the prompt you want the model to use during inference","category":"(Optional) Specify an optional category","referenceResponse":"(Optional) Specify a ground truth response."}
{"prompt":"Provide the prompt you want the model to use during inference","category":"(Optional) Specify an optional category","referenceResponse":"(Optional) Specify a ground truth response."}
{"prompt":"Provide the prompt you want the model to use during inference","category":"(Optional) Specify an optional category","referenceResponse":"(Optional) Specify a ground truth response."}
{"prompt":"Provide the prompt you want the model to use during inference","category":"(Optional) Specify an optional category","referenceResponse":"(Optional) Specify a ground truth response."}
{"prompt":"Provide the prompt you want the model to use during inference","category":"(Optional) Specify an optional category","referenceResponse":"(Optional) Specify a ground truth response."}
{"prompt":"Provide the prompt you want the model to use during inference","category":"(Optional) Specify an optional category","referenceResponse":"(Optional) Specify a ground truth response."}

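For clarity, the following example shows a single entry expanded. In your actual prompt dataset, each entry must be a valid JSON object on a single line.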


{
  "prompt": "What is high intensity interval training?",
  "category": "Fitness",
  "referenceResponse": "High-Intensity Interval Training (HIIT) is a cardiovascular exercise approach that involves short, intense bursts of exercise followed by brief recovery or rest periods."
}
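
Records written in the expanded form above have to be collapsed to one line each before they go into the .jsonl file. The following is a minimal sketch of that step; `to_jsonl` is a hypothetical helper, and omitting optional keys whose value is `None` (rather than writing `null`) is an assumption, not documented behavior.

```python
import json


def to_jsonl(records: list[dict]) -> str:
    """Serialize records as JSON Lines: one compact JSON object per line."""
    lines = []
    for record in records:
        if not record.get("prompt"):
            raise ValueError("every record needs a non-empty 'prompt'")
        # Assumption: drop optional keys that were left unset instead of writing null.
        cleaned = {key: value for key, value in record.items() if value is not None}
        # json.dumps escapes embedded newlines, so each record stays on one line.
        lines.append(json.dumps(cleaned, ensure_ascii=False))
    return "\n".join(lines) + "\n"


records = [
    {
        "prompt": "What is high intensity interval training?",
        "category": "Fitness",
        "referenceResponse": None,  # hypothetical: no ground truth for this prompt
    },
]
print(to_jsonl(records), end="")
```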

Prepare a dataset for an evaluation job using your own inference response data

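To run an evaluation job using responses that you have already generated, create a prompt dataset that contains the following key-value pairs: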

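prompt – The prompt that your model used to generate the response.
referenceResponse – (Optional) The ground truth response.
category – (Optional) Generates evaluation scores reported for each category.
modelResponses – The response from your own inference that you want Amazon Bedrock to evaluate. Evaluation jobs that use a model as a judge support only one model response for each prompt, defined using the following keys:
- response – A string that contains the response from your model inference.
- modelIdentifier – A string that identifies the model that generated the response. You can use only one unique modelIdentifier in an evaluation job, and every prompt in your dataset must use this identifier.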
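
The two constraints on modelResponses (exactly one response per prompt, and a single modelIdentifier across the dataset) are easy to violate when merging outputs from several runs. The following is a minimal sketch of a local check over already-parsed records; `find_model_response_problems` is a hypothetical name and the error wording is illustrative only.

```python
def find_model_response_problems(records: list[dict]) -> list[str]:
    """Check the modelResponses constraints described above on parsed records."""
    problems = []
    identifiers = set()
    for index, record in enumerate(records, start=1):
        responses = record.get("modelResponses")
        if not isinstance(responses, list) or len(responses) != 1:
            problems.append(f"record {index}: expected exactly one entry in modelResponses")
            continue
        entry = responses[0] if isinstance(responses[0], dict) else {}
        for key in ("response", "modelIdentifier"):
            if not isinstance(entry.get(key), str):
                problems.append(f"record {index}: '{key}' must be a string")
        if isinstance(entry.get("modelIdentifier"), str):
            identifiers.add(entry["modelIdentifier"])
    if len(identifiers) > 1:
        # Only one unique identifier is allowed per evaluation job.
        problems.append("multiple modelIdentifier values: " + ", ".join(sorted(identifiers)))
    return problems
```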

The following is an example custom dataset that contains 6 inputs in the JSON Lines format.


{"prompt":"The prompt you used to generate the model response","referenceResponse":"(Optional) a ground truth response","category":"(Optional) a category for the prompt","modelResponses":[{"response":"The response your model generated","modelIdentifier":"A string identifying your model"}]}
{"prompt":"The prompt you used to generate the model response","referenceResponse":"(Optional) a ground truth response","category":"(Optional) a category for the prompt","modelResponses":[{"response":"The response your model generated","modelIdentifier":"A string identifying your model"}]}
{"prompt":"The prompt you used to generate the model response","referenceResponse":"(Optional) a ground truth response","category":"(Optional) a category for the prompt","modelResponses":[{"response":"The response your model generated","modelIdentifier":"A string identifying your model"}]}
{"prompt":"The prompt you used to generate the model response","referenceResponse":"(Optional) a ground truth response","category":"(Optional) a category for the prompt","modelResponses":[{"response":"The response your model generated","modelIdentifier":"A string identifying your model"}]}
{"prompt":"The prompt you used to generate the model response","referenceResponse":"(Optional) a ground truth response","category":"(Optional) a category for the prompt","modelResponses":[{"response":"The response your model generated","modelIdentifier":"A string identifying your model"}]}
{"prompt":"The prompt you used to generate the model response","referenceResponse":"(Optional) a ground truth response","category":"(Optional) a category for the prompt","modelResponses":[{"response":"The response your model generated","modelIdentifier":"A string identifying your model"}]}

For clarity, the following example shows a single entry in a prompt dataset expanded.


{
    "prompt": "What is high intensity interval training?",
    "referenceResponse": "High-Intensity Interval Training (HIIT) is a cardiovascular exercise approach that involves short, intense bursts of exercise followed by brief recovery or rest periods.",
    "category": "Fitness",
    "modelResponses": [
        {
            "response": "High intensity interval training (HIIT) is a workout strategy that alternates between short bursts of intense, maximum-effort exercise and brief recovery periods, designed to maximize calorie burn and improve cardiovascular fitness.",
            "modelIdentifier": "my_model"
        }
    ]
}
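
If you already have a prompt dataset in the first format and a set of stored responses, entries like the one above can be assembled programmatically. The following is a minimal sketch under stated assumptions: `attach_responses` is a hypothetical helper, and it assumes your stored responses are held in a dictionary keyed by the exact prompt text.

```python
def attach_responses(
    prompt_records: list[dict],
    responses_by_prompt: dict[str, str],
    model_identifier: str,
) -> list[dict]:
    """Return copies of the prompt records with a single modelResponses entry each."""
    merged = []
    for record in prompt_records:
        prompt = record["prompt"]
        if prompt not in responses_by_prompt:
            raise KeyError(f"no stored response for prompt: {prompt!r}")
        entry = dict(record)  # shallow copy so the input records are not modified
        entry["modelResponses"] = [
            {
                "response": responses_by_prompt[prompt],
                # The same identifier is reused for every prompt in the dataset.
                "modelIdentifier": model_identifier,
            }
        ]
        merged.append(entry)
    return merged
```

Raising on a missing response, rather than skipping the prompt, is a design choice in this sketch so that an incomplete dataset is caught before the job is created.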
