


# Model evaluation results
<a name="clarify-foundation-model-reports"></a>

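Accuracy metrics for LLMs are numerical values that represent how well a model responded to your prompt. However, a numerical value sometimes cannot capture the intricacies of human language. We report different accuracy metrics for each task, each designed to measure a different aspect of answer quality. For example, recall measures whether the correct answer is included in the model output, while precision indicates how verbose the model answer is. Compare multiple metrics and, where possible, combine them with qualitative evaluation (that is, manually inspecting samples) to determine whether your model is producing the desired output.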

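**Example Accuracy for the question answering task type**  
This example describes how accuracy metrics can be understood in the context of a model's response.  
This example is based on the `huggingface-llm-falcon-40b-bf16` model.  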

```
Prompt: When was the movie Inception released?
Target output: July 16, 2010
Model output: \nInception was released on July 16, 2010.\nWhen was the movie Inception released?\nInception was released on July 16, 2010.\nWhat is the movie Inception about?\nInception is about a group of people who are able to enter people's dreams and steal their ideas.\nWhat is the movie Inception rated?\nInception is rated PG-13.\nWhat is the movie Inception about?\nInception is about a group of people who are able to enter people's dreams and steal their ideas.\nWhat is the movie Inception rated?\nInception is rated PG-13.\nWhat is the movie Inception about?\nInception is about a group of people who are able to enter people's dreams and steal their ideas.\nWhat is the movie Inception rated?\nInception is rated PG-13.\nWhat is the movie Inception about?\nInception is about a group of people who are able to enter people's dreams and steal their ideas.\nWhat is the movie Inception rated?\nInception is rated PG-13.\nWhat is the movie Inception about?\nInception is about a group of people who“
```
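To score this response, let's break it down by each computed metric.  
+ `recall_over_words` is 1.0 because the model returned the correct output.
+ `precision_over_words` is low (0.11) because the response is very verbose compared to the *Target output*.
+ `f1_score`, which combines precision and recall, is low (0.19).
+ The model output scores 0.0 for all the other accuracy metrics.
From these calculated metrics we can conclude that yes, the target output was returned in the response, but the response was overall too verbose.  
You can also see these scores in the following radar plot.  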

![\[Radar plot showing each returned metric.\]](http://docs.aws.amazon.com/zh_cn/sagemaker/latest/dg/images/radar-plot-example-01.png)
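To make the relationship between these three metrics concrete, the following is a minimal sketch of word-overlap scoring. It assumes a simple normalization (lowercasing, stripping punctuation, splitting on whitespace) and compares sets of unique words. The helper names `tokenize` and `word_overlap_scores` are hypothetical, and the evaluation tooling may normalize text differently, so this sketch is not expected to reproduce the exact scores reported above.

```python
import string


def tokenize(text):
    """Assumed normalization: lowercase, strip punctuation, split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def word_overlap_scores(target, output):
    """Return recall, precision, and F1 over the sets of unique words."""
    target_words = set(tokenize(target))
    output_words = set(tokenize(output))
    common = target_words & output_words
    if not common:
        return {"recall": 0.0, "precision": 0.0, "f1": 0.0}
    # Recall: share of target words that appear in the model output.
    recall = len(common) / len(target_words)
    # Precision: share of model output words that appear in the target.
    precision = len(common) / len(output_words)
    f1 = 2 * precision * recall / (precision + recall)
    return {"recall": recall, "precision": precision, "f1": f1}


# A shortened, illustrative response (not the full model output above).
scores = word_overlap_scores(
    "July 16, 2010",
    "Inception was released on July 16, 2010. Inception is rated PG-13.",
)
print(scores)
```

In this sketch every target word appears in the response, so recall is 1.0, while each extra word in the response lowers precision and, with it, the F1 score.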


**Example Accuracy for the question answering task type**  
This example shows the model struggling to return the target output.  

```
Prompt: Who are some influential people in the field of technology?
Target output: Influential people in technology include Bill Gates, Steve Jobs, Mark Zuckerberg, Elon Musk, and others.
Model output: I would say that the most influential person in the field of technology is Steve Jobs. He has changed the way we use technology.\nSteve Jobs is a good one. I would also say Bill Gates. He has changed the way we use computers.
```
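The model output does not exactly match the target output, so both `exact_match_score` and `quasi_exact_match_score` evaluate to 0. Because the model's output contains about half of the words of the target output, `recall_over_words` is 0.47. The target output contains about a quarter of the words in the model output, so `precision_over_words` is 0.27. Consequently, the harmonic mean of the two, reported as `f1_score`, is 0.34. The scores are shown in the following radar plot.  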

![\[Radar plot showing each returned metric.\]](http://docs.aws.amazon.com/zh_cn/sagemaker/latest/dg/images/radar-plot-example-02.png)
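As a quick arithmetic check, the F1 score is conventionally defined as the harmonic mean of precision and recall, and the reported values are consistent with that definition. The following sketch uses a hypothetical helper, `f1_from`, and the rounded precision and recall values quoted in this section.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (the conventional F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded values quoted above for this example.
print(round(f1_from(0.27, 0.47), 2))  # 0.34
```

Applying the same formula to the rounded values from the first example (precision 0.11, recall 1.0) gives roughly 0.198, close to the reported 0.19. The small gap is plausibly due to the inputs being rounded before they were reported.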


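**Example Accuracy scoring for a question and answer pair is not correct**  
In this example, the model responds with an output that does not contain the target output.  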

```
Prompt: Who is the greatest musician of all time?
Target output: Michael Jackson
Model output: "\nWho is the greatest musician of all time?\nThe 50 Greatest Musicians of All Time\n- 1. Bob Dylan.\n- 2. The Beatles.\n- 3. Jimi Hendrix.\n- 4. Aretha Franklin.\n- 5. Elvis Presley.\n- 6. Marvin Gaye.\n- 7. Ray Charles.\n- 8. John Lennon.\nWho is the most famous musician in the world?\nThe 10 Most Famous Musicians in the World\n- Beyonc\u00e9.\n- Taylor Swift.\n- Drake.\n- Ed Sheeran.\n- Rihanna.\n- Justin Bieber.\n- Lady Gaga.\n- The Weeknd.\nWho is the most famous singer in the world 2020?\nTop 10 Most Popular Singers in the World 2020\n- Justin Bieber.\n- Ed Sheeran.\n- Drake.\n- Rihanna.\n- Ariana Grande.\n- Taylor Swift.\n- Beyonce.\n- Bruno Mars.\nWho is the most famous singer in the world 2019?\nTop 10 Most Popular Singers in the World 2019\n- Justin Bieber.\n- Ed Sheeran"
```
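In this example, the question and the target output are both subjective. The model responded by returning questions similar to the prompt, along with their answers. Because the model did not return the subjective answer that was provided, this output scored 0.0 on all accuracy metrics, as shown below. Given the subjective nature of the question, an additional human evaluation is recommended.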
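As a rough illustration of why the match-based metrics are 0 here, the following sketch compares the target against a short excerpt of the model output. It assumes that quasi exact match normalizes both strings (lowercasing, removing punctuation and English articles) before comparing them. The helper names `normalize`, `exact_match`, and `quasi_exact_match` are hypothetical, and the actual normalization used by the evaluation tooling may differ.

```python
import re
import string


def normalize(text):
    """Assumed normalization: lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(target, output):
    """1.0 only if the two strings are identical."""
    return 1.0 if target == output else 0.0


def quasi_exact_match(target, output):
    """1.0 if the two strings are identical after normalization."""
    return 1.0 if normalize(target) == normalize(output) else 0.0


target = "Michael Jackson"
# A short excerpt of the model output above, used only for illustration.
excerpt = "The 50 Greatest Musicians of All Time\n- 1. Bob Dylan.\n- 2. The Beatles."

print(exact_match(target, excerpt), quasi_exact_match(target, excerpt))

# No target word appears in the excerpt, so word-level recall and precision are also 0.
shared = set(normalize(target).split()) & set(normalize(excerpt).split())
print(shared)
```

Under these assumptions, a response such as `michael jackson.` would fail the exact match but pass the quasi exact match, whereas a list of other musicians fails both.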