
Define evaluation methods - Amazon Bedrock


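Define evaluation methods

Overview

Choose one evaluation method for each prompt template, or omit all optional evaluation fields to use the system default. Different templates in the same job can use different methods. Evaluation guides the prompt optimization, so define your method and criteria as precisely as possible.

Default evaluation

Omit all optional evaluation fields (steeringCriteria, customLLMJConfig, evaluationMetricLambdaArn). The service then uses a built-in, general-purpose LLM-as-judge powered by Anthropic Claude Sonnet 4.6, which evaluates three default criteria: answer accuracy, answer completeness, and expression quality. Given the prompt, the target model's answer, and the reference answer, the judge assigns a score to each dimension, dynamically assigns weights appropriate to the task, and then produces a weighted overall score.

We recommend that you define your own evaluation method for best results.

System-provided default judge prompt

The following is the complete system-provided judge prompt used in default evaluation with Anthropic Claude Sonnet 4.6: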

Please act as an impartial judge and evaluate the quality of an answer to a user question, with the help of a reference answer.

You will be given:
(1) a user question, enclosed in <user_question></user_question> tags
(2) an answer, enclosed in <answer></answer> tags
(3) a reference answer, enclosed in <reference_answer></reference_answer> tags

## Universal Evaluation Dimensions

Evaluate the answer across these core dimensions:

**(1) Answer Accuracy:** examines correctness, consistency, and factuality alignment between the <answer> and the <user_question>; examines if the <answer> contains irrelevant or wrongful information/hallucination.

**(2) Answer Completeness:** examines if the <answer> is fully addressing the <user_question>; examines if the <answer> is good at relevance/informativeness: selection of important/key content from <user_question>

**(3) Expression Quality:** examines if the <answer> is concise at answering the <user_question>. NOTE that unless there is special instruction, more concise <answer> is always better, and explanation or rational is strictly NOT needed - THIS IS THE MOST IMPORTANT! examines the alignment on instruction following, e.g., if the <answer> adheres to both explicit guidelines and implicit guidelines (like few-shot examples) in the <user_question>;

## Scoring Rubric

For each dimension, assign one score:
- **3 points**: Fully satisfies the dimension requirements
- **2 points**: Mostly satisfies with minor issues or gaps
- **1 point**: Partially satisfies but has notable limitations
- **0 points**: Does not satisfy the dimension requirements

## Evaluation Process

1. First, identify the task type from the user question
2. Consider any additional criteria provided
3. Score each dimension independently
4. Determine appropriate weights and calculate final weighted score

## Dimension Weighting and Final Scoring

**Weight Determination Process:**
Assign weights (must sum to 1.0) based on:
- Explicit weights in evaluation_criteria (if provided)
- Task analysis and question requirements (if no explicit weights)
- Default weights (Answer Accuracy: 0.35, Answer Completeness: 0.30, Expression Quality: 0.35) as fallback

**Weight Guidelines:**
- **High Accuracy Weight (0.4-0.6)**: Factual questions, multiple choice, technical problems
- **High Completeness Weight (0.4-0.6)**: Complex explanatory tasks, multi-part questions
- **High Expression Weight (0.4-0.6)**: Creative tasks, presentation-focused questions, format-specific requirements
{custom_eval_weight_guideline}

**Overall Score Calculation:**
Overall = (Answer_Accuracy x Weight_A) + (Answer_Completeness x Weight_C) + (Expression_Quality x Weight_E)

## Output Format

Provide your evaluation in this exact format:

<Task_Analysis>Brief analysis of task type and appropriate weight rationale</Task_Analysis>
<Weights>Answer Accuracy: 0.XX, Answer Completeness: 0.XX, Expression Quality: 0.XX</Weights>
<Answer Accuracy>X</Answer Accuracy>
<Answer Completeness>X</Answer Completeness>
<Expression Quality>X</Expression Quality>
<Calculation>(X x 0.XX) + (X x 0.XX) + (X x 0.XX) = X.XX</Calculation>
<Overall>X.XX</Overall>
<Justification>
**Answer Accuracy**: [Evaluate factual accuracy, alignment with reference answer, absence of errors/hallucinations, and logical consistency]
**Answer Completeness**: [Assess whether all aspects of the question are addressed, necessary information is included, and content stays relevant]
**Expression Quality**: [Examine formatting/style adherence, appropriate detail level, communication clarity, and instruction following]
**Weight Application**: [Explain how the chosen weights reflect the task requirements and impact the final score]
</Justification>

---

## Current Evaluation Task

<user_question>
{prompt}
</user_question>

<answer>
{prediction}
</answer>

<reference_answer>
{gold}
</reference_answer>

Based on the above guidelines and criteria, provide your evaluation:
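To make the output format and the weighted-score formula in the judge prompt concrete, the following is a minimal sketch that parses a judge response and recomputes the overall score. It assumes the judge follows the output format exactly; the function names parse_judge_output and recompute_overall are hypothetical and are not part of the service.

```python
import re
from typing import Dict

# Dimension names as they appear in the default judge prompt.
DIMENSIONS = ("Answer Accuracy", "Answer Completeness", "Expression Quality")


def parse_judge_output(text: str) -> Dict[str, object]:
    """Extract per-dimension scores, weights, and the reported overall score."""
    weights_block = re.search(r"<Weights>(.*?)</Weights>", text, re.DOTALL)
    overall = re.search(r"<Overall>\s*([0-9]*\.?[0-9]+)\s*</Overall>", text)
    if weights_block is None or overall is None:
        raise ValueError("judge output is missing <Weights> or <Overall>")

    # "Answer Accuracy: 0.50, ..." -> {"Answer Accuracy": 0.5, ...}
    pairs = re.findall(r"([A-Za-z ]+):\s*([0-9]*\.?[0-9]+)", weights_block.group(1))
    weights = {name.strip(): float(value) for name, value in pairs}

    scores = {}
    for dim in DIMENSIONS:
        match = re.search(rf"<{dim}>\s*([0-3])\s*</{dim}>", text)
        if match is None:
            raise ValueError(f"judge output is missing a score for {dim}")
        scores[dim] = int(match.group(1))

    return {"scores": scores, "weights": weights, "overall": float(overall.group(1))}


def recompute_overall(scores: Dict[str, int], weights: Dict[str, float]) -> float:
    """Overall = sum of (dimension score x dimension weight), on the 0-3 scale."""
    return sum(scores[dim] * weights[dim] for dim in DIMENSIONS)
```

For a response that scores 3, 2, and 3 with weights 0.50, 0.30, and 0.20, the recomputed total is (3 x 0.50) + (2 x 0.30) + (3 x 0.20) = 2.70, which you can compare with the value reported in the Overall tag.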

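Steering criteria

Steering criteria are short natural-language descriptors that guide the direction of optimization.

  • Format: "steeringCriteria": ["string1", "string2"]

  • What they can be: anything from a single word to a few sentences that describes, qualitatively or quantitatively, the model response you want.

  • Limit: at most 5 per prompt template.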
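The constraints above can be checked before you write the input file. The following is a minimal sketch, assuming you assemble the steeringCriteria list in Python; the helper name validate_steering_criteria is hypothetical, and the rule against empty strings is an assumption rather than a documented requirement.

```python
from typing import List

# Limit per prompt template, as stated in the list above.
MAX_STEERING_CRITERIA = 5


def validate_steering_criteria(criteria: List[str]) -> List[str]:
    """Return a cleaned copy of the criteria, or raise if a constraint is violated."""
    if not isinstance(criteria, list):
        raise TypeError("steeringCriteria must be a list of strings")
    if len(criteria) > MAX_STEERING_CRITERIA:
        raise ValueError(
            f"at most {MAX_STEERING_CRITERIA} steering criteria per prompt template, "
            f"got {len(criteria)}"
        )
    cleaned = []
    for item in criteria:
        # Assumption: empty or non-string entries are never useful, so reject them.
        if not isinstance(item, str) or not item.strip():
            raise ValueError(f"each criterion must be a non-empty string, got {item!r}")
        cleaned.append(item.strip())
    return cleaned
```

Running the check locally catches an oversized or malformed list before the job is submitted.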

Example

"steeringCriteria": ["PROFESSIONAL", "CONCISE"]

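Custom LLM-as-a-judge

Provide a complete scoring rubric with a scale that you define. Your custom judge prompt is merged with the service's system judge prompt and is given stronger weight.

Configuration

  • Format: "customLLMJConfig": {"customLLMJPrompt": "...", "customLLMJModelId": "..."} plus "customEvaluationMetricLabel": "My Metric"

  • Available judge models: anthropic.claude-opus-4-6-v1, anthropic.claude-sonnet-4-5-20250929-v1:0, anthropic.claude-sonnet-4-6

  • Placeholders in the judge prompt:

    • {{prompt}}: the fully rendered prompt (the prompt template combined with the evaluation sample)

    • {{response}}: the model output

    • {{referenceResponse}}: the ground truth

  • Scoring: define your scale so that higher scores are better. The service normalizes all scores for the final result.

  • If you have multiple rubrics, combine them into a single judge prompt.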
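As an illustration of the configuration above, the following sketch assembles the custom judge fields and rejects unknown placeholders or judge model IDs before they reach the input file. The helper name build_custom_judge_fields is hypothetical, and the requirement that the prompt contain {{response}} is an assumption made here, not a documented rule.

```python
import re
from typing import Dict

# Judge model IDs and placeholder names taken from the configuration list above.
ALLOWED_JUDGE_MODELS = {
    "anthropic.claude-opus-4-6-v1",
    "anthropic.claude-sonnet-4-5-20250929-v1:0",
    "anthropic.claude-sonnet-4-6",
}
KNOWN_PLACEHOLDERS = {"prompt", "response", "referenceResponse"}


def build_custom_judge_fields(judge_prompt: str, model_id: str, metric_label: str) -> Dict[str, object]:
    """Return the custom LLM-as-a-judge fields for one prompt template."""
    if model_id not in ALLOWED_JUDGE_MODELS:
        raise ValueError(f"unsupported judge model: {model_id}")

    used = set(re.findall(r"\{\{(\w+)\}\}", judge_prompt))
    unknown = used - KNOWN_PLACEHOLDERS
    if unknown:
        raise ValueError(f"unknown placeholders in judge prompt: {sorted(unknown)}")
    # Assumption: a judge that never sees the model output cannot score it.
    if "response" not in used:
        raise ValueError("judge prompt should reference {{response}}")

    return {
        "customLLMJConfig": {
            "customLLMJPrompt": judge_prompt,
            "customLLMJModelId": model_id,
        },
        "customEvaluationMetricLabel": metric_label,
    }
```

The returned dictionary can be merged into the record for the prompt template that should use this judge.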

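Best practices for writing judge prompts

Use a well-defined scoring rubric with explicit criteria and a concrete example for each score level. Anchor each rubric level with a behavioral description rather than a subjective adjective. Include at least one worked example that shows a non-perfect score, to steer the judge away from defaulting to high scores. Instruct the model to provide a written justification before it computes a score. Consider evaluating specific dimensions independently before assigning an overall score. The most trustworthy and helpful LLM-as-a-judge evaluators are typically those where you agree with the answers the judge model gives, so it can help to use evaluations that you have already reviewed.

How your custom judge prompt is merged with the system prompt at runtime

When you provide your own LLM-as-a-judge evaluator prompt, it is merged with the general service-provided judge prompt, which contains specific instructions about format and other best practices that help optimization progress. In the final judgment, your custom judge prompt carries more weight than the general criteria. Specifically, the service:

  • Extracts the intent from your custom prompt

  • Normalizes the scale to match the system's 0 to 3 scoring rubric

  • Injects it as a named dimension inside the CUSTOM_CRITERIA_DESCRIPTION tag

  • Biases the weighting instructions to raise the importance of the custom criteria (0.3 to 0.6)

  • Adds precedence rules stating that the custom criteria override the other dimensions when they conflict

  • Preserves the original semantics of your evaluation

Example: you might provide the following custom LLM-as-a-judge prompt for evaluating faithfulness: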

You are given a task in some context (Input), and a candidate answer. Is the candidate answer faithful to the task description and context?

A response is unfaithful only when (1) it clearly contradicts the context, or (2) the task implies that the response must be based on the context, like in a summarization task. If the task does not ask to respond based on the context, the model is allowed to use its own knowledge to provide a response, even if its claims are not verifiable.

Task: {{prompt}}

Candidate Response: {{response}}

First provide your explanation, then state your final answer. Use the following format: Explanation: [Explanation], Answer: [Answer], where '[Answer]' must be one of:
none is faithful
some is faithful
approximately half is faithful
most is faithful
all is faithful

This prompt is then merged with the default LLM-as-a-judge prompt and given strong weight. The net effect: your single-criterion faithfulness prompt becomes one heavily weighted axis in a multi-dimensional evaluation, while the system adds structure around it (accuracy, completeness, expression).

The following is the resulting merged judge prompt:

"""Please act as an impartial judge and evaluate the quality of an answer to a user question, with the help of a reference answer.

You will be given:
(1) a user question, enclosed in <user_question></user_question> tags
(2) an answer, enclosed in <answer></answer> tags
(3) a reference answer, enclosed in <reference_answer></reference_answer> tags
(4) custom evaluation criteria that have been integrated into the evaluation dimensions below

**IMPORTANT**:
- Custom criteria requirements take absolute precedence over user requirements specified inside <user_question> </user_question>

**IMPORTANT**:
- If there is any conflict between custom criteria and user question requirements, prioritize custom criteria

## Universal Evaluation Dimensions

Evaluate the answer across these core dimensions:

**(1) Answer Accuracy:** examines correctness, consistency, and factuality alignment between the <answer> and the <user_question>; examines if the <answer> contains irrelevant or wrongful information/hallucination.

**(2) Answer Completeness:** examines if the <answer> is fully addressing the <user_question>; examines if the <answer> is good at relevance/informativeness: selection of important/key content from <user_question>

**(3) Expression Quality:** examines if the <answer> is concise at answering the <user_question>. NOTE that unless there is special instruction, more concise <answer> is always better, and explanation or rational is strictly NOT needed - THIS IS THE MOST IMPORTANT! examines the alignment on instruction following, e.g., if the <answer> adheres to both explicit guidelines and implicit guidelines (like few-shot examples) in the <user_question>;

<CUSTOM_CRITERIA_DESCRIPTION>
**(4) Faithfulness to Context:** examines whether the candidate answer is faithful to the task description and context provided in the user question. A response is unfaithful only when (1) it clearly contradicts the context, or (2) the task implies that the response must be based on the context (like in a summarization task). If the task does not ask to respond based on the context, the model is allowed to use its own knowledge to provide a response, even if its claims are not verifiable.

Evaluate the degree of faithfulness on the following scale:
- **3 points**: All content is faithful (no contradictions, fully grounded when required)
- **2 points**: Most content is faithful (minor deviations or unverifiable claims when context-grounding is required)
- **1 point**: Some content is faithful or approximately half is faithful (notable contradictions or significant departures from context when required)
- **0 points**: None or minimal content is faithful (clear contradictions or complete disregard of context when grounding is required)
</CUSTOM_CRITERIA_DESCRIPTION>

## Dimension Weighting and Final Scoring

**Weight Determination Process:**
Assign weights (must sum to 1.0) based on:
- Explicit weights in evaluation_criteria (if provided)
- Task analysis and question requirements (if no explicit weights)
- Default weights (Answer Accuracy: 0.25, Answer Completeness: 0.25, Expression Quality: 0.25, Faithfulness to Context: 0.25) as fallback

**Weight Guidelines:**
- **High Accuracy Weight (0.3-0.5)**: Factual questions, multiple choice, technical problems
- **High Completeness Weight (0.3-0.5)**: Complex explanatory tasks, multi-part questions
- **High Expression Weight (0.3-0.5)**: Creative tasks, presentation-focused questions, format-specific requirements
- [*IMPORTANT*] **High Custom Criteria Weight**: The custom criteria (Faithfulness to Context) should always be *prioritized*. Assign it significant weight (0.3-0.6) and adjust other weights accordingly.

**Overall Score Calculation:**
Overall = (Answer_Accuracy x Weight_A) + (Answer_Completeness x Weight_C) + (Expression_Quality x Weight_E) + (Faithfulness_to_Context x Weight_F)

## Current Evaluation Task

<user_question>
{prompt}
</user_question>

<answer>
{prediction}
</answer>

<reference_answer>
{gold}
</reference_answer>

Based on the above guidelines and criteria, provide your evaluation:"""
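To illustrate what the merge does to the scale and the weights, the following sketch encodes the five faithfulness labels on the 0 to 3 scale as read from the merged prompt above, and computes a four-dimension weighted score. It only mirrors the arithmetic; the service performs the actual merge and scoring through the judge model, and the names used here are hypothetical.

```python
from typing import Dict

# Five-level labels from the custom prompt, mapped to the 0-3 rubric
# as described in the merged prompt (a reading of the example, not a service API).
FAITHFULNESS_TO_POINTS = {
    "none is faithful": 0,
    "some is faithful": 1,
    "approximately half is faithful": 1,
    "most is faithful": 2,
    "all is faithful": 3,
}

# Weight range the merged prompt asks the judge to give the custom criteria.
CUSTOM_WEIGHT_RANGE = (0.3, 0.6)


def weighted_overall(scores: Dict[str, int], weights: Dict[str, float]) -> float:
    """Weighted sum over all dimensions; weights must cover every dimension and sum to 1.0."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must name the same dimensions")
    if abs(sum(weights.values()) - 1.0) > 1e-6:
        raise ValueError("weights must sum to 1.0")
    return sum(scores[dim] * weights[dim] for dim in scores)


def in_custom_guideline(weight: float) -> bool:
    """True if the custom-criteria weight falls in the prioritized 0.3-0.6 range."""
    low, high = CUSTOM_WEIGHT_RANGE
    return low <= weight <= high
```

With scores of 3, 2, 3 for the default dimensions, "some is faithful" (1 point) for the custom dimension, and weights of 0.2, 0.2, 0.2, and 0.4, the overall score is 0.6 + 0.4 + 0.6 + 0.4 = 2.0. Note that the fallback weight of 0.25 quoted in the merged prompt sits below the prioritized range, so the range check is kept separate from the score calculation.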

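Custom Lambda evaluator

Use your own scoring function, deployed as a Lambda function.

Configuration

In your input JSONL file, specify the Lambda ARN for each prompt template that should use it. You also provide the customEvaluationMetricLabel field, which names the metric: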

"evaluationMetricLambdaArn": "arn:aws:lambda:us-west-2:123456789012:function:my-eval-function", "customEvaluationMetricLabel": "My Custom Metric"

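When you create the job through the API, no additional evaluation configuration is needed in the CreateAdvancedPromptOptimizationJob request itself. The evaluation method is determined per template from the input JSONL file.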
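Because the evaluation method lives in the input file, it can help to attach the two fields programmatically. The following is a minimal sketch, assuming each prompt template is one JSON object per line; the helper names and the ARN shape check are hypothetical, and the rest of each record is treated as opaque because its schema is not described in this section.

```python
import json
import re
from typing import Any, Dict, Iterable

# Loose shape check for a Lambda function ARN (an assumption, not an official validator).
LAMBDA_ARN_RE = re.compile(
    r"^arn:aws[a-z-]*:lambda:[a-z0-9-]+:\d{12}:function:[A-Za-z0-9_-]+(:[A-Za-z0-9$_-]+)?$"
)


def with_lambda_evaluator(record: Dict[str, Any], lambda_arn: str, metric_label: str) -> Dict[str, Any]:
    """Return a copy of one template record with the Lambda evaluation fields set."""
    if not LAMBDA_ARN_RE.match(lambda_arn):
        raise ValueError(f"not a Lambda function ARN: {lambda_arn}")
    updated = dict(record)  # leave the caller's record untouched
    updated["evaluationMetricLambdaArn"] = lambda_arn
    updated["customEvaluationMetricLabel"] = metric_label
    return updated


def to_jsonl(records: Iterable[Dict[str, Any]]) -> str:
    """Serialize records as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(record, ensure_ascii=False) + "\n" for record in records)
```

Each template can carry a different ARN, which matches the per-template selection described above.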

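Lambda requirements

  • A single .py file that contains all of the code

  • Handler set to lambda_function.lambda_handler

  • Must implement compute_score(preds, golds), which returns {"score": float, "scores": [float, ...]}

  • The golds parameter contains the referenceResponse values. If you don't provide referenceResponse in your input dataset, you don't need to pass golds into your compute_score function.

  • Never crash; on error, return 0.0 instead of raising an exception

  • Prefer continuous scores (0.0 to 1.0) over binary 0/1 scores to speed up optimization convergence

  • For large batches, set the timeout to the maximum of 15 minutes (900 seconds) to avoid premature timeouts

  • Add a resource-based policy that allows bedrock.amazonaws.com to invoke your Lambda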
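The last two requirements can be applied with the AWS SDK for Python. The following is a sketch under stated assumptions: boto3 is installed, credentials are already configured, and the statement ID AllowBedrockInvoke is a hypothetical name. It uses the Lambda client's add_permission and update_function_configuration calls; you may want to scope the permission further for your account.

```python
from typing import Dict

MAX_LAMBDA_TIMEOUT_SECONDS = 900  # 15 minutes, the maximum mentioned above


def invoke_permission_kwargs(function_name: str, statement_id: str = "AllowBedrockInvoke") -> Dict[str, str]:
    """Arguments for a resource-based policy statement that lets Bedrock invoke the function."""
    return {
        "FunctionName": function_name,
        "StatementId": statement_id,  # hypothetical name; must be unique within the policy
        "Action": "lambda:InvokeFunction",
        "Principal": "bedrock.amazonaws.com",
    }


def configure_evaluator_lambda(function_name: str, region: str) -> None:
    """Grant invoke permission to Bedrock and raise the timeout to the maximum."""
    import boto3  # imported lazily so the helper above works without the SDK installed

    client = boto3.client("lambda", region_name=region)
    client.add_permission(**invoke_permission_kwargs(function_name))
    client.update_function_configuration(
        FunctionName=function_name,
        Timeout=MAX_LAMBDA_TIMEOUT_SECONDS,
    )
```

Calling configure_evaluator_lambda("my-eval-function", "us-west-2") once after deployment is enough; calling it again with the same statement ID fails because the statement already exists.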

Lambda template

"""
APO Custom Metric Lambda - Minimal Template

Handler: lambda_function.lambda_handler
"""
import logging
from typing import List, Dict, Any

logger = logging.getLogger()
logger.setLevel(logging.INFO)


def compute_score(preds: List[str], golds: List[str]) -> Dict[str, Any]:
    """
    Score predictions against ground truths.

    Args:
        preds: Model outputs (one per sample)
        golds: Expected answers (one per sample)

    Returns:
        Must contain:
            "score": float - aggregate score (higher is better)
            "scores": list[float] - per-instance scores
    """
    # --- REPLACE THIS with your scoring logic ---
    scores = []
    for pred, gold in zip(preds, golds):
        # Example: exact match (case-insensitive)
        scores.append(1.0 if pred.strip().lower() == gold.strip().lower() else 0.0)

    return {
        "score": sum(scores) / len(scores) if scores else 0.0,
        "scores": scores,
    }


def lambda_handler(event, context):
    """
    Lambda entry point.

    APO service sends:
        event = {"preds": ["output1", ...], "golds": ["truth1", ...]}
    """
    logger.info(f"Received {len(event.get('preds', []))} predictions")

    try:
        preds = event.get("preds", [])
        golds = event.get("golds", [])

        if not preds:
            return {"score": 0.0, "scores": []}

        return compute_score(preds, golds)

    except Exception as e:
        logger.error(f"Error: {e}", exc_info=True)
        return {"score": 0.0, "scores": [0.0] * len(event.get("preds", [])), "error": str(e)}

For more detailed examples, including boilerplate code for error handling and Lambda function input validation, see the AWS samples on GitHub.
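The template's exact-match scorer is binary, while the requirements favor continuous scores. As one possible replacement, the following sketch scores each prediction with token-level F1 against its reference. The choice of F1 and whitespace tokenization are assumptions for illustration; any metric that returns values from 0.0 to 1.0 fits the same contract.

```python
from collections import Counter
from typing import Any, Dict, List


def token_f1(pred: str, gold: str) -> float:
    """Token-overlap F1 between a prediction and its reference, in [0.0, 1.0]."""
    pred_tokens = pred.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return 1.0 if pred_tokens == gold_tokens else 0.0
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def compute_score(preds: List[str], golds: List[str]) -> Dict[str, Any]:
    """Drop-in scorer with the required return shape; never raises."""
    count = len(preds) if isinstance(preds, (list, tuple)) else 0
    try:
        scores = [token_f1(str(p), str(g)) for p, g in zip(preds, golds)]
    except Exception:
        # Follow the "never crash" requirement: fall back to zeros.
        scores = [0.0] * count
    return {
        "score": sum(scores) / len(scores) if scores else 0.0,
        "scores": scores,
    }
```

For example, the prediction "the cat sat" against the reference "the cat" has precision 2/3 and recall 1, so its F1 is 0.8 rather than the 0.0 that exact match would give, which provides the optimizer with a smoother signal.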