
Define evaluation methods - Amazon Bedrock


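Define evaluation methods

Overview

Choose one evaluation method for each prompt template, or omit all optional evaluation fields to use the system default. Different templates in the same job can use different methods. Evaluation guides prompt optimization, so define your method and criteria as precisely as possible.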
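As a minimal sketch of the "one method per template" rule, the following Python helper inspects a template record and reports which method it selects. The function name `evaluation_method` and the returned labels are hypothetical, and the check that the optional fields are mutually exclusive is an assumption drawn from the sentence above, not a documented service validation.

```python
from typing import Any, Dict

# Optional evaluation fields named in this guide, mapped to illustrative labels.
METHOD_FIELDS = {
    "steeringCriteria": "steering_criteria",
    "customLLMJConfig": "custom_llm_judge",
    "evaluationMetricLambdaArn": "custom_lambda",
}


def evaluation_method(template_record: Dict[str, Any]) -> str:
    """Return the evaluation method a template record selects.

    Assumption: at most one optional evaluation field is set per template.
    A record with none of them falls back to the default evaluation.
    """
    chosen = [label for field, label in METHOD_FIELDS.items() if template_record.get(field)]
    if len(chosen) > 1:
        raise ValueError(f"Expected one evaluation method, found: {chosen}")
    return chosen[0] if chosen else "default"
```

Running a check like this over every line of the input file before you submit a job can catch templates that accidentally set two methods.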

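Default evaluation

Omit all optional evaluation fields (steeringCriteria, customLLMJConfig, evaluationMetricLambdaArn). The service then uses a built-in, general-purpose LLM-as-a-judge, powered by Anthropic Claude Sonnet 4.6, that evaluates three default criteria: answer accuracy, answer completeness, and expression quality. Given the prompt, the target model's answer, and the reference answer, the judge assigns a score to each dimension, dynamically assigns weights appropriate to the task, and then produces a weighted overall score.

For best results, we recommend that you define your own evaluation method.

Default system-provided judge prompt

The following is the complete system-provided judge prompt that the default evaluation uses with Anthropic Claude Sonnet 4.6: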

Please act as an impartial judge and evaluate the quality of an answer to a user question, with the help of a reference answer.

You will be given:
(1) a user question, enclosed in <user_question></user_question> tags
(2) an answer, enclosed in <answer></answer> tags
(3) a reference answer, enclosed in <reference_answer></reference_answer> tags

## Universal Evaluation Dimensions

Evaluate the answer across these core dimensions:

**(1) Answer Accuracy:** examines correctness, consistency, and factuality alignment between the <answer> and the <user_question>; examines if the <answer> contains irrelevant or wrongful information/hallucination.

**(2) Answer Completeness:** examines if the <answer> is fully addressing the <user_question>; examines if the <answer> is good at relevance/informativeness: selection of important/key content from <user_question>

**(3) Expression Quality:** examines if the <answer> is concise at answering the <user_question>. NOTE that unless there is special instruction, more concise <answer> is always better, and explanation or rational is strictly NOT needed - THIS IS THE MOST IMPORTANT! examines the alignment on instruction following, e.g., if the <answer> adheres to both explicit guidelines and implicit guidelines (like few-shot examples) in the <user_question>;

## Scoring Rubric

For each dimension, assign one score:
- **3 points**: Fully satisfies the dimension requirements
- **2 points**: Mostly satisfies with minor issues or gaps
- **1 point**: Partially satisfies but has notable limitations
- **0 points**: Does not satisfy the dimension requirements

## Evaluation Process

1. First, identify the task type from the user question
2. Consider any additional criteria provided
3. Score each dimension independently
4. Determine appropriate weights and calculate final weighted score

## Dimension Weighting and Final Scoring

**Weight Determination Process:**
Assign weights (must sum to 1.0) based on:
- Explicit weights in evaluation_criteria (if provided)
- Task analysis and question requirements (if no explicit weights)
- Default weights (Answer Accuracy: 0.35, Answer Completeness: 0.30, Expression Quality: 0.35) as fallback

**Weight Guidelines:**
- **High Accuracy Weight (0.4-0.6)**: Factual questions, multiple choice, technical problems
- **High Completeness Weight (0.4-0.6)**: Complex explanatory tasks, multi-part questions
- **High Expression Weight (0.4-0.6)**: Creative tasks, presentation-focused questions, format-specific requirements
{custom_eval_weight_guideline}

**Overall Score Calculation:**
Overall = (Answer_Accuracy x Weight_A) + (Answer_Completeness x Weight_C) + (Expression_Quality x Weight_E)

## Output Format

Provide your evaluation in this exact format:

<Task_Analysis>Brief analysis of task type and appropriate weight rationale</Task_Analysis>
<Weights>Answer Accuracy: 0.XX, Answer Completeness: 0.XX, Expression Quality: 0.XX</Weights>
<Answer Accuracy>X</Answer Accuracy>
<Answer Completeness>X</Answer Completeness>
<Expression Quality>X</Expression Quality>
<Calculation>(X x 0.XX) + (X x 0.XX) + (X x 0.XX) = X.XX</Calculation>
<Overall>X.XX</Overall>
<Justification>
**Answer Accuracy**: [Evaluate factual accuracy, alignment with reference answer, absence of errors/hallucinations, and logical consistency]
**Answer Completeness**: [Assess whether all aspects of the question are addressed, necessary information is included, and content stays relevant]
**Expression Quality**: [Examine formatting/style adherence, appropriate detail level, communication clarity, and instruction following]
**Weight Application**: [Explain how the chosen weights reflect the task requirements and impact the final score]
</Justification>

---

## Current Evaluation Task

<user_question>
{prompt}
</user_question>

<answer>
{prediction}
</answer>

<reference_answer>
{gold}
</reference_answer>

Based on the above guidelines and criteria, provide your evaluation:
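To make the output format and the overall score formula in this prompt concrete, the following Python sketch parses a judge response written in that format and recomputes the weighted overall score. The function name `recompute_overall` is hypothetical, and the parsing is only an illustration of the documented format. It is not how the service itself processes judge output.

```python
import re

# Dimension names exactly as they appear in the judge prompt's output format.
DIMENSIONS = ("Answer Accuracy", "Answer Completeness", "Expression Quality")


def recompute_overall(judge_output: str) -> float:
    """Recompute Overall = sum(score x weight) from a judge response."""
    weights_block = re.search(r"<Weights>(.*?)</Weights>", judge_output, re.DOTALL)
    if weights_block is None:
        raise ValueError("missing <Weights> block")

    overall = 0.0
    total_weight = 0.0
    for name in DIMENSIONS:
        # Scores use the 0-3 rubric; weights are decimals listed by dimension name.
        score = re.search(rf"<{name}>\s*([0-3])\s*</{name}>", judge_output)
        weight = re.search(rf"{name}:\s*([0-9.]+)", weights_block.group(1))
        if score is None or weight is None:
            raise ValueError(f"missing score or weight for {name}")
        overall += int(score.group(1)) * float(weight.group(1))
        total_weight += float(weight.group(1))

    if abs(total_weight - 1.0) > 1e-6:
        raise ValueError("weights must sum to 1.0")
    return overall
```

For example, scores of 3, 2, and 3 with the fallback weights 0.35, 0.30, and 0.35 give (3 x 0.35) + (2 x 0.30) + (3 x 0.35) = 2.70.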

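Steering criteria

Steering criteria are short natural-language descriptors that guide the direction of optimization.

  • Format: "steeringCriteria": ["string1", "string2"]

  • Content: Anything from a single word to a few sentences that describes, qualitatively or quantitatively, how you want the model response to look.

  • Limit: A maximum of 5 per prompt template.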
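The format and limit above can be checked before you write the input file. The following is a minimal sketch. The helper name `steering_fragment` is hypothetical, and rejecting empty strings is an assumption rather than a documented rule.

```python
import json
from typing import List

MAX_STEERING_CRITERIA = 5  # limit per prompt template stated in this guide


def steering_fragment(criteria: List[str]) -> str:
    """Validate steering criteria and return them as a JSON fragment."""
    if not criteria:
        raise ValueError("provide at least one criterion, or omit the field entirely")
    if len(criteria) > MAX_STEERING_CRITERIA:
        raise ValueError(f"at most {MAX_STEERING_CRITERIA} criteria per prompt template")
    if any(not isinstance(c, str) or not c.strip() for c in criteria):
        raise ValueError("each criterion must be a non-empty string")
    return json.dumps({"steeringCriteria": criteria})
```

A criterion can be a single word such as "CONCISE" or a full sentence such as "Responses cite the source document for every claim."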

Example

"steeringCriteria": ["PROFESSIONAL", "CONCISE"]

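Custom LLM-as-a-judge

Provide a complete rubric with a grading scale that you define. Your custom judge prompt is merged with the service's system judge prompt and is given stronger weight.

Configuration

  • Format: "customLLMJConfig": {"customLLMJPrompt": "...", "customLLMJModelId": "..."}, plus "customEvaluationMetricLabel": "My Metric"

  • Available judge models: anthropic.claude-opus-4-6-v1, anthropic.claude-sonnet-4-5-20250929-v1:0, anthropic.claude-sonnet-4-6

  • Placeholders in your judge prompt:

    • {{prompt}}: The fully rendered prompt (the prompt template combined with the evaluation sample)

    • {{response}}: The model output

    • {{referenceResponse}}: The ground truth

  • Scoring: Define your scoring scale so that higher numbers are better. The service normalizes all scores for the final result.

  • If you have multiple rubrics, combine them into a single judge prompt.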
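The configuration fields above can be assembled and sanity-checked in code. The following is a minimal sketch. The helper name `build_llm_judge_fields` is hypothetical, and requiring the {{response}} placeholder is an assumption (a judge prompt that never sees the model output has nothing to grade), not a documented validation rule.

```python
from typing import Any, Dict

# Judge model IDs listed in this guide.
JUDGE_MODEL_IDS = {
    "anthropic.claude-opus-4-6-v1",
    "anthropic.claude-sonnet-4-5-20250929-v1:0",
    "anthropic.claude-sonnet-4-6",
}


def build_llm_judge_fields(judge_prompt: str, model_id: str, metric_label: str) -> Dict[str, Any]:
    """Return the custom LLM-as-a-judge fields for one prompt template."""
    if model_id not in JUDGE_MODEL_IDS:
        raise ValueError(f"unsupported judge model: {model_id}")
    # Assumption: the judge prompt must at least reference the model output.
    if "{{response}}" not in judge_prompt:
        raise ValueError("judge prompt should include the {{response}} placeholder")
    return {
        "customLLMJConfig": {
            "customLLMJPrompt": judge_prompt,
            "customLLMJModelId": model_id,
        },
        "customEvaluationMetricLabel": metric_label,
    }
```

The returned dictionary can be merged into the template's record in the input JSONL file.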

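Best practices for writing judge prompts

Use a well-defined rubric with explicit scoring criteria and a concrete example for each score level. Anchor each rubric level with a behavioral description rather than subjective adjectives. Include at least one worked example that shows a less-than-perfect score, to calibrate the judge away from defaulting to high ratings. Instruct the model to provide a written rationale before the numeric score. Consider evaluating specific dimensions independently before assigning an overall score. The most trusted and useful LLM-as-a-judge evaluators are usually the ones whose answers you agree with, so it can help to use an evaluator that you have already vetted.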
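As an illustration of these practices, the following Python constant holds a judge prompt with behaviorally anchored levels, one worked example that scores below the maximum, and a rationale requested before the score. The rubric topic (coverage of key points) and every line of wording are invented for this sketch. Adapt them to your own task.

```python
# A hypothetical judge prompt; only the {{...}} placeholders come from this guide.
JUDGE_PROMPT = """You are grading how well a response covers the key points of the reference.

Task: {{prompt}}
Candidate response: {{response}}
Reference: {{referenceResponse}}

Rubric (higher is better):
3 - States every key point in the reference and adds no unsupported claims.
2 - Omits one minor detail or adds one unsupported but harmless claim.
1 - Omits a major key point or contradicts the reference once.
0 - Omits most key points or contradicts the reference repeatedly.

Worked example (score 2):
Reference: "Restart the service, then clear the cache. A restart takes about a minute."
Response: "Restart the service, then clear the cache."
Rationale: Both steps are present, but the restart duration is missing.
Score: 2

First write your rationale, then the score, in this format:
Rationale: [rationale]
Score: [0-3]
"""
```

Each level describes observable behavior (what is stated, omitted, or contradicted) instead of adjectives such as "good" or "poor", which makes scores easier to reproduce.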

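How your custom judge prompt is merged with the system prompt at runtime

When you provide your own LLM-as-a-judge evaluator prompt, it is merged with the general service-provided judge prompt, which contains specific instructions about format and other best practices that help optimization progress. Your custom judge prompt receives stronger weight than the general criteria in the final judgment. Specifically, the service:

  • Extracts the intent from your custom prompt

  • Normalizes the scale to match the system's 0-3 rubric

  • Injects it as a named dimension inside CUSTOM_CRITERIA_DESCRIPTION tags

  • Biases the weighting instructions so that the custom criteria carry more importance (0.3 to 0.6)

  • Adds precedence rules stating that the custom criteria override other dimensions when they conflict

  • Preserves the original semantics of the evaluation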
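To make the scale normalization step concrete, the following Python sketch writes out the label-to-points mapping that the merged prompt later in this section applies to the five faithfulness labels. The service performs this step by rewriting the rubric text, not by running code, so the dictionary and the helper `to_rubric_points` are purely illustrative.

```python
# Mapping taken from the 0-3 scale in the merged faithfulness prompt in this section.
FAITHFULNESS_LABEL_TO_POINTS = {
    "none is faithful": 0,
    "some is faithful": 1,
    "approximately half is faithful": 1,
    "most is faithful": 2,
    "all is faithful": 3,
}


def to_rubric_points(label: str) -> int:
    """Map one of the five faithfulness labels onto the 0-3 rubric."""
    key = " ".join(label.lower().split())  # normalize case and whitespace
    if key not in FAITHFULNESS_LABEL_TO_POINTS:
        raise ValueError(f"unknown label: {label!r}")
    return FAITHFULNESS_LABEL_TO_POINTS[key]
```

Two of the five original labels collapse into the same level (1 point), which shows why a custom scale that already has four ordered levels loses the least detail in the merge.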

Example: You might provide the following custom LLM-as-a-judge prompt to evaluate faithfulness:

You are given a task in some context (Input), and a candidate answer. Is the candidate answer faithful to the task description and context? A response is unfaithful only when (1) it clearly contradicts the context, or (2) the task implies that the response must be based on the context, like in a summarization task. If the task does not ask to respond based on the context, the model is allowed to use its own knowledge to provide a response, even if its claims are not verifiable.

Task: {{prompt}}

Candidate Response: {{response}}

First provide your explanation, then state your final answer. Use the following format: Explanation: [Explanation], Answer: [Answer], where '[Answer]' must be one of:
none is faithful
some is faithful
approximately half is faithful
most is faithful
all is faithful

This prompt is then merged with the default LLM-as-a-judge prompt and given strong weight. The net effect is that your single-criterion faithfulness prompt becomes one heavily weighted axis in a multi-dimensional evaluation, and the system adds structure around it (accuracy, completeness, expression).

The following is the resulting merged judge prompt:

"""Please act as an impartial judge and evaluate the quality of an answer to a user question, with the help of a reference answer.

You will be given:
(1) a user question, enclosed in <user_question></user_question> tags
(2) an answer, enclosed in <answer></answer> tags
(3) a reference answer, enclosed in <reference_answer></reference_answer> tags
(4) custom evaluation criteria that have been integrated into the evaluation dimensions below

**IMPORTANT**:
- Custom criteria requirements take absolute precedence over user requirements specified inside <user_question> </user_question>

**IMPORTANT**:
- If there is any conflict between custom criteria and user question requirements, prioritize custom criteria

## Universal Evaluation Dimensions

Evaluate the answer across these core dimensions:

**(1) Answer Accuracy:** examines correctness, consistency, and factuality alignment between the <answer> and the <user_question>; examines if the <answer> contains irrelevant or wrongful information/hallucination.

**(2) Answer Completeness:** examines if the <answer> is fully addressing the <user_question>; examines if the <answer> is good at relevance/informativeness: selection of important/key content from <user_question>

**(3) Expression Quality:** examines if the <answer> is concise at answering the <user_question>. NOTE that unless there is special instruction, more concise <answer> is always better, and explanation or rational is strictly NOT needed - THIS IS THE MOST IMPORTANT! examines the alignment on instruction following, e.g., if the <answer> adheres to both explicit guidelines and implicit guidelines (like few-shot examples) in the <user_question>;

<CUSTOM_CRITERIA_DESCRIPTION>
**(4) Faithfulness to Context:** examines whether the candidate answer is faithful to the task description and context provided in the user question. A response is unfaithful only when (1) it clearly contradicts the context, or (2) the task implies that the response must be based on the context (like in a summarization task). If the task does not ask to respond based on the context, the model is allowed to use its own knowledge to provide a response, even if its claims are not verifiable.

Evaluate the degree of faithfulness on the following scale:
- **3 points**: All content is faithful (no contradictions, fully grounded when required)
- **2 points**: Most content is faithful (minor deviations or unverifiable claims when context-grounding is required)
- **1 point**: Some content is faithful or approximately half is faithful (notable contradictions or significant departures from context when required)
- **0 points**: None or minimal content is faithful (clear contradictions or complete disregard of context when grounding is required)
</CUSTOM_CRITERIA_DESCRIPTION>

## Dimension Weighting and Final Scoring

**Weight Determination Process:**
Assign weights (must sum to 1.0) based on:
- Explicit weights in evaluation_criteria (if provided)
- Task analysis and question requirements (if no explicit weights)
- Default weights (Answer Accuracy: 0.25, Answer Completeness: 0.25, Expression Quality: 0.25, Faithfulness to Context: 0.25) as fallback

**Weight Guidelines:**
- **High Accuracy Weight (0.3-0.5)**: Factual questions, multiple choice, technical problems
- **High Completeness Weight (0.3-0.5)**: Complex explanatory tasks, multi-part questions
- **High Expression Weight (0.3-0.5)**: Creative tasks, presentation-focused questions, format-specific requirements
- [*IMPORTANT*] **High Custom Criteria Weight**: The custom criteria (Faithfulness to Context) should always be *prioritized*. Assign it significant weight (0.3-0.6) and adjust other weights accordingly.

**Overall Score Calculation:**
Overall = (Answer_Accuracy x Weight_A) + (Answer_Completeness x Weight_C) + (Expression_Quality x Weight_E) + (Faithfulness_to_Context x Weight_F)

## Current Evaluation Task

<user_question>
{prompt}
</user_question>

<answer>
{prediction}
</answer>

<reference_answer>
{gold}
</reference_answer>

Based on the above guidelines and criteria, provide your evaluation:"""
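The effect of prioritizing the custom dimension can be seen by applying the four-term overall score formula with two different weight sets. The scores and the prioritized weights in the following Python sketch are made-up illustration values. Only the 0.25 fallback weights and the 0.3-0.6 range for the custom criterion come from the merged prompt.

```python
from typing import Dict


def overall_score(scores: Dict[str, int], weights: Dict[str, float]) -> float:
    """Weighted sum of per-dimension rubric scores (0-3); weights must sum to 1.0."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same dimensions")
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    return sum(scores[name] * weights[name] for name in scores)


# Hypothetical judgment: a strong answer that is only partly faithful to the context.
scores = {"accuracy": 3, "completeness": 2, "expression": 3, "faithfulness": 1}

fallback = dict.fromkeys(scores, 0.25)  # equal fallback weights
prioritized = {"accuracy": 0.2, "completeness": 0.2, "expression": 0.2, "faithfulness": 0.4}

print(round(overall_score(scores, fallback), 2))     # 2.25
print(round(overall_score(scores, prioritized), 2))  # 2.0
```

With the custom criterion weighted at 0.4, the low faithfulness score pulls the overall result down from 2.25 to 2.0, which is the intended bias toward your own rubric.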

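Custom Lambda evaluator

Provide your own scoring function as an AWS Lambda function.

Configuration

For each prompt template that should use it, specify the Lambda ARN in your input JSONL file. You can also provide the customEvaluationMetricLabel field to name the metric: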

"evaluationMetricLambdaArn": "arn:aws:lambda:us-west-2:123456789012:function:my-eval-function", "customEvaluationMetricLabel": "My Custom Metric"

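When you create a job through the API, the CreateAdvancedPromptOptimizationJob request itself needs no additional evaluation configuration. The evaluation method is determined per template from the input JSONL file.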
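Because the evaluator is selected per template in the input file, a small helper can add the two Lambda fields to an existing template record and serialize it as one JSONL line. The following is a sketch. The helper name `with_lambda_evaluator` and the ARN pattern are assumptions, and the other fields of the record are left to whatever your template already contains.

```python
import json
import re
from typing import Any, Dict, Optional

# Loose pattern for a Lambda function ARN (optionally with a version or alias suffix).
LAMBDA_ARN_PATTERN = re.compile(
    r"^arn:aws[a-z-]*:lambda:[a-z0-9-]+:\d{12}:function:[A-Za-z0-9_-]+(:[A-Za-z0-9_$-]+)?$"
)


def with_lambda_evaluator(
    template_record: Dict[str, Any],
    lambda_arn: str,
    metric_label: Optional[str] = None,
) -> str:
    """Return one JSONL line for a template record that uses a Lambda evaluator."""
    if not LAMBDA_ARN_PATTERN.match(lambda_arn):
        raise ValueError(f"not a Lambda function ARN: {lambda_arn}")
    record = dict(template_record)  # copy so the caller's dict is not modified
    record["evaluationMetricLambdaArn"] = lambda_arn
    if metric_label:
        record["customEvaluationMetricLabel"] = metric_label
    return json.dumps(record, ensure_ascii=False)
```

Write one such line per prompt template. Templates without these fields in the same file continue to use their own evaluation method.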

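Lambda requirements

  • A single .py file that contains all of the code

  • Handler set to lambda_function.lambda_handler

  • Must implement compute_score(preds, golds), which returns {"score": float, "scores": [float, ...]}

  • The golds parameter contains the referenceResponse values. If you don't provide referenceResponse in your input dataset, golds does not need to be passed to the compute_score function.

  • Never crash. On error, return 0.0 instead of raising an exception.

  • Prefer continuous scores (0.0 to 1.0) over binary 0/1 scores for faster optimization convergence

  • For large batches, set the timeout to the maximum of 15 minutes (900 seconds) to avoid premature timeouts

  • Add a resource-based policy that allows bedrock.amazonaws.com to invoke your Lambda function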
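The timeout and resource-based policy requirements can be applied with the AWS SDK for Python (Boto3). The following is a minimal sketch that uses the Lambda client's `update_function_configuration` and `add_permission` operations. The helper names and the statement ID are hypothetical, and your security requirements might call for additional conditions (such as a source account) that this guide does not specify.

```python
from typing import Any, Dict


def invoke_permission_kwargs(function_name: str, statement_id: str = "AllowBedrockInvoke") -> Dict[str, Any]:
    """Arguments for Lambda add_permission that let Amazon Bedrock invoke the function."""
    return {
        "FunctionName": function_name,
        "StatementId": statement_id,  # hypothetical ID; any unique string works
        "Action": "lambda:InvokeFunction",
        "Principal": "bedrock.amazonaws.com",
    }


def configure_evaluator_lambda(function_name: str) -> None:
    """Set the 900-second timeout and add the resource-based policy statement."""
    import boto3  # imported here so the helper above works without the SDK installed

    client = boto3.client("lambda")
    client.update_function_configuration(FunctionName=function_name, Timeout=900)
    client.add_permission(**invoke_permission_kwargs(function_name))
```

Calling `configure_evaluator_lambda("my-eval-function")` requires AWS credentials with permission to update the function. Calling `add_permission` again with the same statement ID fails, so run it once per function.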

Lambda template

"""
APO Custom Metric Lambda - Minimal Template
Handler: lambda_function.lambda_handler
"""
import logging
from typing import List, Dict, Any

logger = logging.getLogger()
logger.setLevel(logging.INFO)


def compute_score(preds: List[str], golds: List[str]) -> Dict[str, Any]:
    """
    Score predictions against ground truths.

    Args:
        preds: Model outputs (one per sample)
        golds: Expected answers (one per sample)

    Returns:
        Must contain:
            "score": float - aggregate score (higher is better)
            "scores": list[float] - per-instance scores
    """
    # --- REPLACE THIS with your scoring logic ---
    scores = []
    for pred, gold in zip(preds, golds):
        # Example: exact match (case-insensitive)
        scores.append(1.0 if pred.strip().lower() == gold.strip().lower() else 0.0)

    return {
        "score": sum(scores) / len(scores) if scores else 0.0,
        "scores": scores,
    }


def lambda_handler(event, context):
    """
    Lambda entry point.

    APO service sends:
        event = {"preds": ["output1", ...], "golds": ["truth1", ...]}
    """
    logger.info(f"Received {len(event.get('preds', []))} predictions")
    try:
        preds = event.get("preds", [])
        golds = event.get("golds", [])
        if not preds:
            return {"score": 0.0, "scores": []}
        return compute_score(preds, golds)
    except Exception as e:
        logger.error(f"Error: {e}", exc_info=True)
        return {"score": 0.0, "scores": [0.0] * len(event.get("preds", [])), "error": str(e)}

For more detailed examples, including boilerplate code for error handling and input validation in the Lambda function, see the AWS Samples GitHub.
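The Lambda requirements recommend continuous scores over binary 0/1 scores. As one way to follow that advice, the following sketch replaces the exact-match logic in compute_score with a token-level F1 score. Token F1 is a choice made for this example and is not a metric that the service prescribes. The helper `token_f1` is hypothetical.

```python
from collections import Counter
from typing import Any, Dict, List


def token_f1(pred: str, gold: str) -> float:
    """Token-overlap F1 between a prediction and a reference, in [0.0, 1.0]."""
    pred_tokens = pred.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Two empty strings match perfectly; one empty string matches nothing.
        return 1.0 if pred_tokens == gold_tokens else 0.0
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def compute_score(preds: List[str], golds: List[str]) -> Dict[str, Any]:
    """Same contract as the template: an aggregate score plus per-instance scores."""
    scores = []
    for pred, gold in zip(preds, golds):
        try:
            scores.append(token_f1(str(pred), str(gold)))
        except Exception:
            scores.append(0.0)  # never crash; score the sample as 0.0 instead
    return {
        "score": sum(scores) / len(scores) if scores else 0.0,
        "scores": scores,
    }
```

A prediction of "the cat" against the reference "the cat sat" scores 0.8 instead of 0.0, so the optimizer can tell a near miss from a complete miss.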