Rubric-based judge

Overview

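Rubric Judge is an LLM-as-a-judge evaluation model built on Nova 2. Unlike the original judge model, which provides only a preference verdict (A>B, B>A, or tie), Rubric Judge dynamically generates custom evaluation criteria tailored to each prompt and assigns fine-grained scores across multiple dimensions.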

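Key features

Dynamic criteria generation – Automatically creates relevant evaluation dimensions based on the input prompt.
Weighted scoring – Assigns an importance weight to each criterion to reflect its relative significance.
Granular assessment – Provides a detailed score for each criterion on either a binary (true/false) or scale (1–5) basis.
Quality metrics – Calculates continuous quality scores (0–1 scale) that quantify the magnitude of the difference between responses.

Example criterion generated by the model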


price_validation:  
  description: "The response includes validation to ensure price is a positive value."  
  type: "scale"  
  weight: 0.3

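The model evaluates both responses against every generated criterion, then uses these criterion-level scores to inform its final preference decision.

Recipe configuration

Rubric Judge recipe

Enable Rubric Judge by setting task: rubric_llm_judge in your recipe.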


run:  
  name: nova-eval-job-name                              # [MODIFIABLE] Unique identifier for your evaluation job  
  model_type: amazon.nova-2-lite-v1:0:256k              # [FIXED] Rubric Judge model type  
  model_name_or_path: "nova-lite-2/prod"                # [FIXED] Path to model checkpoint or identifier  
  replicas: 1                                           # [MODIFIABLE] Number of replicas for SageMaker Training job  
  data_s3_path: ""                                      # [FIXED] Leave empty for SageMaker Training job  
  output_s3_path: ""                                    # [FIXED] Leave empty for SageMaker Training job  
    
evaluation:  
  task: rubric_llm_judge                                # [FIXED] Evaluation task - enables Rubric Judge  
  strategy: judge                                       # [FIXED] Evaluation strategy  
  metric: all                                           # [FIXED] Metric calculation method  
    
inference:  
  max_new_tokens: 12000                                 # [MODIFIABLE] Maximum tokens to generate  
  top_k: -1                                             # [MODIFIABLE] Top-k sampling parameter  
  top_p: 1.0                                            # [MODIFIABLE] Nucleus sampling parameter  
  temperature: 0                                        # [MODIFIABLE] Sampling temperature (0 = deterministic)

Original LLM-as-a-judge recipe (for comparison)

The original judge model uses task: llm_judge.


run:  
  name: eval-job-name                                   # [MODIFIABLE] Unique identifier for your evaluation job  
  model_type: amazon.nova-micro-v1:0:128k               # [FIXED] Model type   
  model_name_or_path: "nova-micro/prod"                 # [FIXED] Path to model checkpoint or identifier  
  replicas: 1                                           # [MODIFIABLE] Number of replicas for SageMaker Training job  
  data_s3_path: ""                                      # [FIXED] Leave empty for SageMaker Training job  
  output_s3_path: ""                                    # [FIXED] Leave empty for SageMaker Training job  
    
evaluation:  
  task: llm_judge                                       # [FIXED] Original judge task  
  strategy: judge                                       # [FIXED] Evaluation strategy  
  metric: all                                           # [FIXED] Metric calculation method  
  
inference:  
  max_new_tokens: 12000                                 # [MODIFIABLE] Maximum tokens to generate  
  top_k: -1                                             # [MODIFIABLE] Top-k sampling parameter  
  top_p: 1.0                                            # [MODIFIABLE] Nucleus sampling parameter  
  temperature: 0                                        # [MODIFIABLE] Sampling temperature (0 = deterministic)

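Input dataset format

The input dataset format is identical to that of the original judge model.

Required fields

prompt – String containing the input prompt and instructions
response_A – String containing the baseline model output
response_B – String containing the customized model output

Example dataset (JSONL format)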


{"prompt": "What is the most effective way to combat climate change?", "response_A": "The most effective way to combat climate change is through a combination of transitioning to renewable energy sources and implementing strict carbon pricing policies. This creates economic incentives for businesses to reduce emissions while promoting clean energy adoption.", "response_B": "We should focus on renewable energy. Solar and wind power are good. People should drive electric cars. Companies need to pollute less."}  
{"prompt": "Explain how a computer's CPU works", "response_A": "CPU is like brain of computer. It does math and makes computer work fast. Has lots of tiny parts inside.", "response_B": "A CPU (Central Processing Unit) functions through a fetch-execute cycle, where instructions are retrieved from memory, decoded, and executed through its arithmetic logic unit (ALU). It coordinates with cache memory and registers to process data efficiently using binary operations."}  
{"prompt": "How does photosynthesis work?", "response_A": "Plants do photosynthesis to make food. They use sunlight and water. It happens in leaves.", "response_B": "Photosynthesis is a complex biochemical process where plants convert light energy into chemical energy. They utilize chlorophyll to absorb sunlight, combining CO2 and water to produce glucose and oxygen through a series of chemical reactions in chloroplasts."}

Format requirements

Each entry must be a single-line JSON object.
Separate entries with newlines.
Use the exact field names shown in the example.
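
The format requirements can be checked before submitting a job. The following is a minimal Python sketch, not an official tool: the function name validate_jsonl is hypothetical, and treating blank lines and non-string field values as errors is a strict assumption rather than documented behavior.

```python
import json
from typing import List

# Field names taken from the required-fields list above.
REQUIRED_FIELDS = ("prompt", "response_A", "response_B")


def validate_jsonl(text: str) -> List[str]:
    """Return a list of human-readable problems found in JSONL text."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            # Assumption: blank lines between entries are not allowed.
            problems.append(f"line {number}: empty line")
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {number}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(entry, dict):
            problems.append(f"line {number}: expected a JSON object")
            continue
        for field in REQUIRED_FIELDS:
            if not isinstance(entry.get(field), str):
                problems.append(
                    f"line {number}: '{field}' missing or not a string"
                )
    return problems


sample = "\n".join([
    '{"prompt": "p", "response_A": "a", "response_B": "b"}',
    '{"prompt": "p", "response_A": "a"}',
])
for problem in validate_jsonl(sample):
    print(problem)
```

An empty result means every line parsed as a JSON object carrying all three string fields.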

Evaluation output

Output structure

Rubric Judge produces enhanced evaluation metrics compared to the original judge model.


{  
  "config_general": {  
    "lighteval_sha": "string",  
    "num_fewshot_seeds": "int",  
    "max_samples": "int | null",  
    "job_id": "int",  
    "start_time": "float",  
    "end_time": "float",  
    "total_evaluation_time_secondes": "string",  
    "model_name": "string",  
    "model_sha": "string",  
    "model_dtype": "string | null",  
    "model_size": "string"  
  },  
  "results": {  
    "custom|rubric_llm_judge_judge|0": {  
      "a_scores": "float",  
      "a_scores_stderr": "float",  
      "b_scores": "float",  
      "b_scores_stderr": "float",  
      "ties": "float",  
      "ties_stderr": "float",  
      "inference_error": "float",  
      "inference_error_stderr": "float",  
      "score": "float",  
      "score_stderr": "float",  
      "weighted_score_A": "float",  
      "weighted_score_A_stderr": "float",  
      "weighted_score_B": "float",  
      "weighted_score_B_stderr": "float",  
      "score_margin": "float",  
      "score_margin_stderr": "float",  
      "winrate": "float",  
      "lower_rate": "float",  
      "upper_rate": "float"  
    }  
  },  
  "versions": {  
    "custom|rubric_llm_judge_judge|0": "int"  
  }  
}

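New metrics in Rubric Judge

The following six metrics are unique to Rubric Judge and provide fine-grained quality assessment.

Metric	Description
weighted_score_A	Average normalized quality score for response_A across all model-generated evaluation criteria. Scores are weighted by criterion importance and normalized to a 0–1 scale (higher = better quality).
weighted_score_A_stderr	Standard error of the mean for weighted_score_A, indicating statistical uncertainty.
weighted_score_B	Average normalized quality score for response_B across all model-generated evaluation criteria. Scores are weighted by criterion importance and normalized to a 0–1 scale (higher = better quality).
weighted_score_B_stderr	Standard error of the mean for weighted_score_B, indicating statistical uncertainty.
score_margin	Difference between the weighted scores (calculated as weighted_score_A - weighted_score_B). Range: -1.0 to 1.0. Positive = response_A is better; negative = response_B is better; near zero = similar quality.
score_margin_stderr	Standard error of the mean for score_margin, indicating uncertainty in the measured quality difference.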
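
As a rough illustration of reading these metrics, the Python sketch below pulls the six rubric metrics out of a results object shaped like the output structure above and derives an approximate 95% interval for score_margin. The sample values, the helper names rubric_metrics and margin_interval, and the normal-approximation interval (margin ± 1.96 × stderr) are assumptions for illustration; none of them are part of the evaluation output.

```python
from typing import Dict, Tuple

# The six Rubric Judge metrics listed in the table above.
RUBRIC_KEYS = (
    "weighted_score_A", "weighted_score_A_stderr",
    "weighted_score_B", "weighted_score_B_stderr",
    "score_margin", "score_margin_stderr",
)

TASK_KEY = "custom|rubric_llm_judge_judge|0"


def rubric_metrics(results: dict, task_key: str = TASK_KEY) -> Dict[str, float]:
    """Select only the Rubric Judge metrics from a parsed results object."""
    task = results["results"][task_key]
    return {key: task[key] for key in RUBRIC_KEYS}


def margin_interval(margin: float, stderr: float,
                    z: float = 1.96) -> Tuple[float, float]:
    """Approximate interval for the margin (normal approximation, assumed)."""
    return (margin - z * stderr, margin + z * stderr)


# Hypothetical values, not real evaluation output.
sample_results = {
    "results": {
        TASK_KEY: {
            "weighted_score_A": 0.65, "weighted_score_A_stderr": 0.02,
            "weighted_score_B": 0.78, "weighted_score_B_stderr": 0.02,
            "score_margin": -0.13, "score_margin_stderr": 0.03,
            "winrate": 0.4,
        }
    }
}

metrics = rubric_metrics(sample_results)
low, high = margin_interval(metrics["score_margin"],
                            metrics["score_margin_stderr"])
print(f"score_margin interval: [{low:.4f}, {high:.4f}]")
```

If the whole interval lies on one side of zero, the measured quality difference is less likely to be noise; an interval that straddles zero suggests the two responses are hard to separate on this dataset.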

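Understanding weighted score metrics

Purpose: Weighted scores provide a continuous quality measurement that complements the binary preference verdict, giving deeper insight into model performance.

Key differences from the original judge

Original judge – Outputs only discrete preferences (A>B, B>A, A=B)
Rubric Judge – Outputs both a preference and continuous quality scores (0–1 scale) based on custom criteria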

Interpreting score_margin

score_margin = -0.128: response_B scores 12.8 percentage points higher than response_A
|score_margin| < 0.1: Narrow quality difference (close decision)
|score_margin| > 0.2: Clear quality difference (confident decision)
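
These thresholds can be expressed as a small helper. The following Python sketch is illustrative only: the function name is hypothetical, and the "moderate" label for magnitudes from 0.1 to 0.2 is an assumption, because that band is not described here.

```python
def interpret_score_margin(margin: float) -> str:
    """Describe a score_margin value using the thresholds listed above."""
    size = abs(margin)
    if size < 0.1:
        strength = "narrow"      # close decision
    elif size > 0.2:
        strength = "clear"       # confident decision
    else:
        strength = "moderate"    # assumption: band not defined in the text

    if margin > 0:
        leader = "response_A"
    elif margin < 0:
        leader = "response_B"
    else:
        leader = "neither"
    return f"{strength} difference, {leader} ahead"


print(interpret_score_margin(-0.128))
print(interpret_score_margin(0.05))
print(interpret_score_margin(0.31))
```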

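Use cases

Model improvement – Identify the specific areas where your model underperforms
Quality quantification – Measure the size of the performance gap, not just the win/loss ratio
Confidence assessment – Distinguish close decisions from clear quality differences

Important

The final verdict is based on the judge model's explicit preference label, which preserves holistic reasoning and ensures proper position-bias mitigation through forward/backward evaluation. Weighted scores serve as an observability tool, not as a replacement for the primary verdict.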

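Calculation method

Weighted scores are computed through the following process:

Extract criterion data – Parse the judge's YAML output to extract criterion scores and weights.
Normalize scores:
- Scale-type criteria (1–5): normalize to 0–1 by computing (score - 1) / 4
- Binary criteria (true/false): convert to 1.0/0.0
Apply weights – Multiply each normalized score by its criterion weight.
Aggregate – Sum all weighted scores for each response.
Calculate margin – Compute score_margin = weighted_score_A - weighted_score_B

Example: If response_A has a weighted sum of 0.65 and response_B has a weighted sum of 0.78, score_margin is -0.13, indicating that response_B scores 13 percentage points higher in quality across all weighted criteria.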
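
The normalization, weighting, aggregation, and margin steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the evaluator's actual code: the CriterionScore class, its field names, and every criterion except price_validation are hypothetical, and YAML parsing is skipped by starting from already-extracted scores. The weights in the example sum to 1, which is what keeps the aggregated score on the 0–1 scale.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class CriterionScore:
    """One judged criterion; field names are illustrative only."""
    name: str
    kind: str                  # "scale" (1-5) or "binary" (true/false)
    weight: float
    score: Union[int, bool]


def normalize(kind: str, score: Union[int, bool]) -> float:
    """Map a raw criterion score onto the 0-1 range."""
    if kind == "scale":
        return (score - 1) / 4
    if kind == "binary":
        return 1.0 if score else 0.0
    raise ValueError(f"unknown criterion type: {kind}")


def weighted_score(criteria: List[CriterionScore]) -> float:
    """Sum of weight * normalized score over all criteria."""
    return sum(c.weight * normalize(c.kind, c.score) for c in criteria)


# Hypothetical scores for the same three criteria on each response.
response_a = [
    CriterionScore("price_validation", "scale", 0.3, 3),
    CriterionScore("handles_errors", "binary", 0.2, True),
    CriterionScore("clarity", "scale", 0.5, 4),
]
response_b = [
    CriterionScore("price_validation", "scale", 0.3, 5),
    CriterionScore("handles_errors", "binary", 0.2, True),
    CriterionScore("clarity", "scale", 0.5, 4),
]

score_a = weighted_score(response_a)
score_b = weighted_score(response_b)
score_margin = score_a - score_b
print(round(score_a, 3), round(score_b, 3), round(score_margin, 3))
```

Here the two responses differ only on price_validation (3 versus 5 on the 1–5 scale), so the whole margin comes from that criterion's weight of 0.3 applied to a normalized difference of 0.5.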
