Create a prompt dataset for a model evaluation job that uses a model as a judge

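To create a model evaluation job that uses a model as a judge, you must specify a prompt dataset. This prompt dataset uses the same format as automatic model evaluation jobs and is used during inference with the models you select for evaluation.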

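To evaluate non-Amazon Bedrock models using responses that you have already generated, include those responses in the prompt dataset as described in "Prepare a dataset for an evaluation job using your own inference response data". When you provide your own inference response data, Amazon Bedrock skips the model invocation step and runs the evaluation job on the data you provide.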

Custom prompt datasets must be stored in Amazon S3 and must use the JSON Lines format with the .jsonl file extension. Each line must be a valid JSON object. A dataset can contain at most 1,000 prompts per evaluation job.
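The format rules above can be checked locally before the file is uploaded to Amazon S3. The following is a minimal sketch, not an official validator: the function name `validate_jsonl` and the constant `MAX_PROMPTS` are hypothetical, and the sketch assumes a strict reading in which blank lines also count as invalid.

```python
import json

MAX_PROMPTS = 1000  # per-job prompt limit stated in this section


def validate_jsonl(text: str, filename: str = "dataset.jsonl") -> int:
    """Return the number of entries, or raise ValueError on the first problem.

    Hypothetical helper: checks only the rules described in this section
    (file extension, one JSON object per line, prompt count).
    """
    if not filename.endswith(".jsonl"):
        raise ValueError("file name must end in .jsonl")

    count = 0
    for line_no, line in enumerate(text.splitlines(), start=1):
        try:
            record = json.loads(line)  # a blank line also fails here
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_no}: not valid JSON ({exc.msg})") from exc
        if not isinstance(record, dict):
            raise ValueError(f"line {line_no}: expected a JSON object")
        count += 1

    if count > MAX_PROMPTS:
        raise ValueError(f"{count} prompts exceeds the limit of {MAX_PROMPTS}")
    return count
```

Calling `validate_jsonl(open(path, encoding="utf-8").read(), path)` on a local file raises a `ValueError` that names the first offending line, which is usually easier to act on than a failed evaluation job.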

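CORS configuration is not required for LLM-as-a-judge evaluation jobs. Human-based evaluation jobs require CORS on the S3 output bucket. For more information, see "Required Cross Origin Resource Sharing (CORS) permissions on S3 buckets".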

Prepare a dataset for an evaluation job where Amazon Bedrock invokes a model

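To run an evaluation job where Amazon Bedrock invokes the models, create a prompt dataset containing the following key-value pairs:

prompt – The prompt you want the models to respond to.
referenceResponse – (Optional) The ground truth response.
category – (Optional) A category for the prompt. When supplied, evaluation scores are also reported per category.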

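Note

If you supply a ground truth response (referenceResponse), Amazon Bedrock uses this parameter when calculating the Completeness (Builtin.Completeness) and Correctness (Builtin.Correctness) metrics. You can also use these metrics without supplying a ground truth response. To see the judge prompts for both scenarios, see the section for your chosen judge model in "Built-in metric evaluator prompts for model-as-a-judge evaluation jobs".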

The following is an example custom dataset that contains 6 inputs and uses the JSON Lines format.


{"prompt":"Provide the prompt you want the model to use during inference","category":"(Optional) Specify an optional category","referenceResponse":"(Optional) Specify a ground truth response."}
{"prompt":"Provide the prompt you want the model to use during inference","category":"(Optional) Specify an optional category","referenceResponse":"(Optional) Specify a ground truth response."}
{"prompt":"Provide the prompt you want the model to use during inference","category":"(Optional) Specify an optional category","referenceResponse":"(Optional) Specify a ground truth response."}
{"prompt":"Provide the prompt you want the model to use during inference","category":"(Optional) Specify an optional category","referenceResponse":"(Optional) Specify a ground truth response."}
{"prompt":"Provide the prompt you want the model to use during inference","category":"(Optional) Specify an optional category","referenceResponse":"(Optional) Specify a ground truth response."}
{"prompt":"Provide the prompt you want the model to use during inference","category":"(Optional) Specify an optional category","referenceResponse":"(Optional) Specify a ground truth response."}

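The following example shows a single entry expanded for clarity. In your actual prompt dataset, each line must be a valid JSON object, so an entry cannot span multiple lines as it does here.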


{
  "prompt": "What is high intensity interval training?",
  "category": "Fitness",
  "referenceResponse": "High-Intensity Interval Training (HIIT) is a cardiovascular exercise approach that involves short, intense bursts of exercise followed by brief recovery or rest periods."
}
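A dataset in this shape can be generated with the standard `json` module instead of being written by hand. The sketch below is illustrative only: `make_record` and `to_jsonl` are hypothetical helper names, and the second prompt is made up to show an entry without the optional keys.

```python
import json


def make_record(prompt, reference_response=None, category=None):
    """Build one dataset entry; optional keys are omitted when not supplied."""
    if not prompt:
        raise ValueError("prompt is required")
    record = {"prompt": prompt}
    if category is not None:
        record["category"] = category
    if reference_response is not None:
        record["referenceResponse"] = reference_response
    return record


def to_jsonl(records):
    """Serialize one entry per line.

    json.dumps escapes any newline inside a string value, so each entry
    stays on a single physical line.
    """
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records) + "\n"


records = [
    make_record(
        "What is high intensity interval training?",
        reference_response=(
            "High-Intensity Interval Training (HIIT) is a cardiovascular "
            "exercise approach that involves short, intense bursts of exercise "
            "followed by brief recovery or rest periods."
        ),
        category="Fitness",
    ),
    # Made-up prompt with no category or ground truth response.
    make_record("Name one benefit of interval training."),
]

jsonl_text = to_jsonl(records)
```

Writing `jsonl_text` to a file with the .jsonl extension produces a dataset in the layout described above; uploading that file to Amazon S3 is a separate step not shown here.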

Prepare a dataset for an evaluation job using your own inference response data

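To run an evaluation job using responses that you have already generated, create a prompt dataset containing the following key-value pairs:

prompt – The prompt your model used to generate the response.
referenceResponse – (Optional) The ground truth response.
category – (Optional) A category for the prompt. When supplied, evaluation scores are also reported per category.
modelResponses – The response from your own inference that you want Amazon Bedrock to evaluate. Evaluation jobs that use a model as a judge support only one model response per prompt. The model response is defined using the following keys:
- response – A string containing the response from your model inference.
- modelIdentifier – A string identifying the model that generated the response. An evaluation job can use only one unique modelIdentifier, and every prompt in the dataset must use this identifier.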

The following is an example custom dataset in JSON Lines format that contains 6 inputs.


{"prompt":"The prompt you used to generate the model response","referenceResponse":"(Optional) a ground truth response","category":"(Optional) a category for the prompt","modelResponses":[{"response":"The response your model generated","modelIdentifier":"A string identifying your model"}]}
{"prompt":"The prompt you used to generate the model response","referenceResponse":"(Optional) a ground truth response","category":"(Optional) a category for the prompt","modelResponses":[{"response":"The response your model generated","modelIdentifier":"A string identifying your model"}]}
{"prompt":"The prompt you used to generate the model response","referenceResponse":"(Optional) a ground truth response","category":"(Optional) a category for the prompt","modelResponses":[{"response":"The response your model generated","modelIdentifier":"A string identifying your model"}]}
{"prompt":"The prompt you used to generate the model response","referenceResponse":"(Optional) a ground truth response","category":"(Optional) a category for the prompt","modelResponses":[{"response":"The response your model generated","modelIdentifier":"A string identifying your model"}]}
{"prompt":"The prompt you used to generate the model response","referenceResponse":"(Optional) a ground truth response","category":"(Optional) a category for the prompt","modelResponses":[{"response":"The response your model generated","modelIdentifier":"A string identifying your model"}]}
{"prompt":"The prompt you used to generate the model response","referenceResponse":"(Optional) a ground truth response","category":"(Optional) a category for the prompt","modelResponses":[{"response":"The response your model generated","modelIdentifier":"A string identifying your model"}]}

The following example shows a single entry from a prompt dataset, expanded for clarity.


{
    "prompt": "What is high intensity interval training?",
    "referenceResponse": "High-Intensity Interval Training (HIIT) is a cardiovascular exercise approach that involves short, intense bursts of exercise followed by brief recovery or rest periods.",
    "category": "Fitness",
    "modelResponses": [
        {
            "response": "High intensity interval training (HIIT) is a workout strategy that alternates between short bursts of intense, maximum-effort exercise and brief recovery periods, designed to maximize calorie burn and improve cardiovascular fitness.",
            "modelIdentifier": "my_model"
        }
    ]
}
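The two constraints on modelResponses (exactly one response per prompt, and a single modelIdentifier shared by the whole dataset) are easy to break when responses are collected from several runs. The following sketch checks them locally; `check_own_response_dataset` is a hypothetical helper name, and the checks cover only the rules stated in this section.

```python
import json


def check_own_response_dataset(text: str) -> str:
    """Return the single modelIdentifier used in the dataset.

    Raises ValueError if an entry lacks a prompt, does not carry exactly one
    model response, or if more than one modelIdentifier appears.
    """
    identifiers = set()
    for line_no, line in enumerate(text.splitlines(), start=1):
        record = json.loads(line)
        if not isinstance(record, dict) or not record.get("prompt"):
            raise ValueError(f"line {line_no}: missing prompt")

        responses = record.get("modelResponses")
        if not isinstance(responses, list) or len(responses) != 1:
            raise ValueError(
                f"line {line_no}: expected exactly one entry in modelResponses"
            )

        entry = responses[0]
        if not isinstance(entry, dict):
            raise ValueError(f"line {line_no}: model response must be an object")
        for key in ("response", "modelIdentifier"):
            if not isinstance(entry.get(key), str):
                raise ValueError(f"line {line_no}: {key} must be a string")
        identifiers.add(entry["modelIdentifier"])

    if len(identifiers) != 1:
        raise ValueError(
            f"expected one modelIdentifier, found {sorted(identifiers)}"
        )
    return identifiers.pop()
```

Run against the expanded entry above (collapsed to one line), the function returns `my_model`. A dataset that mixes two identifiers raises a `ValueError` that lists both.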