


# Prompting with multimodal inputs
<a name="prompting-multimodal"></a>

The following sections provide guidance for image and video understanding. For audio-related prompting, see the [Speech conversation prompting](sonic-system-prompts.md) section.

## General multimodal guidelines
<a name="general-multimodal-guidelines"></a>

### User prompts and system prompts
<a name="user-system-prompts"></a>

For multimodal understanding use cases, every request should include user prompt text. A system prompt, which may contain only text, is optional.

A system prompt can be used to assign the model a role and to define its general persona and response style, but it should not be used for detailed task definitions or output formatting instructions.

Include the task definition, instructions, and formatting details in the user prompt; for multimodal use cases, this has a stronger effect than placing them in the system prompt.

### Content ordering
<a name="content-order"></a>

Multimodal understanding requests sent to Amazon Nova should include one or more files and a user prompt. The user text prompt should be the last item in the message, always after the image, document, or video content.

```
message = {
  "role": "user",
  "content": [
    { "document|image|video|audio": {...} },
    { "document|image|video|audio": {...} },
    ...
    { "text": "<user prompt>" }
  ]
}
```

If you want to refer to specific files in the user prompt, use text elements to define a label before each file block.

```
message = {
  "role": "user",
  "content": [
    { "text": "<label for item 1>" },
    { "document|image|video|audio": {...} },
    { "text": "<label for item 2>" },
    { "document|image|video|audio": {...} },
    ...
    { "text": "<user prompt>" }
  ]
}
```
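As an illustration, a message in this labeled layout can be assembled with a small helper. The following Python sketch is hypothetical: the helper name, label text, and document payload are examples, not part of any SDK.

```python
# Hypothetical helper for assembling a labeled multimodal message in the
# structure shown above: a text label before each file block, with the
# user prompt as the final content item.
def build_labeled_message(labeled_files, user_prompt):
    content = []
    for label, file_block in labeled_files:
        content.append({"text": label})
        content.append(file_block)
    content.append({"text": user_prompt})
    return {"role": "user", "content": content}

message = build_labeled_message(
    [("Document 1:", {"document": {"format": "pdf", "name": "doc1",
                                   "source": {"bytes": b"..."}}})],
    "Summarize Document 1.",
)
```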

## Document and image understanding
<a name="document-image-understanding"></a>

The following sections provide guidance on building prompts for tasks that require understanding or analyzing images and documents.

### Extracting text from images
<a name="extracting-text-images"></a>

Amazon Nova models can extract text from images, a capability known as optical character recognition (OCR). For best results, make sure the image input you provide to the model has high enough resolution for the text characters to be legible.

For text extraction use cases, we recommend the following inference configuration:
+ **Temperature:** default (0.7)
+ **topP:** default (0.9)
+ Do not enable reasoning

Amazon Nova models can extract text into Markdown, HTML, or LaTeX format. We recommend the following user prompt template:

```
## Instructions
Extract all information from this page using only {text_formatting} formatting. Retain the original layout and structure including lists, tables, charts and math formulae. 

## Rules
1. For math formulae, always use LaTeX syntax. 
2. Describe images using only text.
3. NEVER use HTML image tags `<img>` in the output.
4. NEVER use Markdown image tags `![]()` in the output.
5. Always wrap the entire output in ``` tags.
```

The output will be wrapped in complete or partial Markdown code fences (`` ``` ``). You can strip the outer code fences with code similar to the following:

```
def strip_outer_code_fences(text):
    lines = text.split("\n")
    # Remove only the outer code fences if present
    if lines and lines[0].startswith("```"):
        lines = lines[1:]
        if lines and lines[-1].startswith("```"):
            lines = lines[:-1]
    return "\n".join(lines).strip()
```

### Extracting structured information from images or text
<a name="extracting-structured-info"></a>

Amazon Nova models can extract information from images into a machine-parseable JSON format, a task known as key information extraction (KIE). To perform KIE, provide the following:
+ A JSON schema: a formal schema definition that follows the JSON Schema specification.
+ One or more of the following: a document file, an image, or document text.

The document or image must always be placed before the user prompt in the request.

For KIE use cases, we recommend the following inference configuration:
+ **Temperature:** 0
+ **Reasoning:** not required, but can improve results when using image-only input or complex schemas.

#### Prompt templates
<a name="kie-prompt-templates"></a>

##### Document or image input only
<a name="doc-or-image-only"></a>

```
Given the image representation of a document, extract information in JSON format according to the given schema.
     
Follow these guidelines:
- Ensure that every field is populated, provided the document includes the corresponding value. Only use null when the value is absent from the document.
- When instructed to read tables or lists, read each row from every page. Ensure every field in each row is populated if the document contains the field.

JSON Schema:
{json_schema}
```
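For example, the `{json_schema}` placeholder in the template above can be filled programmatically. The following Python sketch uses a hypothetical invoice schema; the field names are illustrative only.

```python
import json

# A hypothetical JSON Schema for an invoice extraction task.
# The field names are illustrative, not part of the Amazon Nova API.
invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": ["string", "null"]},
        "total": {"type": ["number", "null"]},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": ["string", "null"]},
                    "amount": {"type": ["number", "null"]},
                },
            },
        },
    },
}

PROMPT_TEMPLATE = """Given the image representation of a document, extract information in JSON format according to the given schema.

Follow these guidelines:
- Ensure that every field is populated, provided the document includes the corresponding value. Only use null when the value is absent from the document.
- When instructed to read tables or lists, read each row from every page. Ensure every field in each row is populated if the document contains the field.

JSON Schema:
{json_schema}"""

# Serialize the schema into the user prompt.
user_prompt = PROMPT_TEMPLATE.format(json_schema=json.dumps(invoice_schema, indent=2))
```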

##### Text input only
<a name="text-only"></a>

```
Given the OCR representation of a document, extract information in JSON format according to the given schema.

Follow these guidelines:
- Ensure that every field is populated, provided the document includes the corresponding value. Only use null when the value is absent from the document.
- When instructed to read tables or lists, read each row from every page. Ensure every field in each row is populated if the document contains the field.

JSON Schema:
{json_schema}

OCR:
{document_text}
```

##### Document or image plus text input
<a name="doc-or-image-and-text"></a>

```
Given the image and OCR representations of a document, extract information in JSON format according to the given schema.
       
Follow these guidelines:
- Ensure that every field is populated, provided the document includes the corresponding value. Only use null when the value is absent from the document.
- When instructed to read tables or lists, read each row from every page. Ensure every field in each row is populated if the document contains the field.

JSON Schema:
{json_schema}

OCR:
{document_text}
```

#### Detecting objects and their positions in images
<a name="detecting-objects"></a>

Amazon Nova 2 models can identify objects in an image along with their positions, a task sometimes called image grounding or object localization. Practical applications include image analysis and tagging, user interface automation, image editing, and more.

Regardless of the input image's resolution and aspect ratio, the model uses a coordinate space that divides the image into 1,000 units horizontally and 1,000 units vertically, where position x: 0, y: 0 is the top-left corner of the image.

Bounding boxes are described using the `[x1, y1, x2, y2]` format, representing the left, top, right, and bottom edges, respectively. Two-dimensional points are expressed using the `[x, y]` format.

For object detection use cases, we recommend the following inference parameter values:
+ **Temperature:** 0
+ Do not enable reasoning

##### Prompt templates: general object detection
<a name="general-object-detection-templates"></a>

We recommend the following user prompt templates.

**Detecting multiple instances with bounding boxes:**

```
Please identify {target_description} in the image and provide the bounding box coordinates for each one you detect. Represent the bounding box as the [x1, y1, x2, y2] format, where the coordinates are scaled between 0 and 1000 to the image width and height, respectively.
```

**Detecting a single region with a bounding box:**

```
Please generate the bounding box coordinates corresponding to the region described in this sentence: {target_description}. Represent the bounding box as the [x1, y1, x2, y2] format, where the coordinates are scaled between 0 and 1000 to the image width and height, respectively.
```

**Detecting multiple instances with center points:**

```
Please identify {target_description} in the image and provide the center point coordinates for each one you detect. Represent the point as the [x, y] format, where the coordinates are scaled between 0 and 1000 to the image width and height, respectively.
```

**Detecting a single region with a center point:**

```
Please generate the center point coordinates corresponding to the region described in this sentence: {target_description}. Represent the center point as the [x, y] format, where the coordinates are scaled between 0 and 1000 to the image width and height, respectively.
```

**Parsing the model output:**

Each of the recommended prompts above produces a comma-separated string containing one or more bounding box descriptions in a format similar to the following. There may be slight variation in whether a "." is included at the end of the string. For example: `[356, 770, 393, 872], [626, 770, 659, 878].`

You can parse the coordinate information the model produces with a regular expression, as shown in the following Python code example.

##### Code example
<a name="parse-coord-code"></a>

```
import re

def parse_coord_text(text):
    """Parses a model response which uses array formatting ([x, y, ...])
    to describe points and bounding boxes. Returns an array of tuples."""
    pattern = r"\[([^\[\]]*?)\]"
    return [
        tuple(int(x.strip()) for x in match.split(","))
        for match in re.findall(pattern, text)
    ]
```

##### Code example
<a name="remap-bbox-code"></a>

To remap a bounding box's normalized coordinates to the input image's coordinate space, you can use a function similar to the following Python example.

```
def remap_bbox_to_image(bounding_box, image_width, image_height):
    return [
        bounding_box[0] * image_width / 1000,
        bounding_box[1] * image_height / 1000,
        bounding_box[2] * image_width / 1000,
        bounding_box[3] * image_height / 1000,
    ]
```
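As a usage sketch, the two helpers can be chained to go from the model's raw text to pixel coordinates. They are repeated here so the snippet runs standalone, and the 1920x1080 image size is just an example.

```python
import re

# Parse the model's coordinate text (repeated from the example above
# so this snippet runs standalone).
def parse_coord_text(text):
    pattern = r"\[([^\[\]]*?)\]"
    return [
        tuple(int(x.strip()) for x in match.split(","))
        for match in re.findall(pattern, text)
    ]

# Remap a normalized box to the input image's coordinate space.
def remap_bbox_to_image(bounding_box, image_width, image_height):
    return [
        bounding_box[0] * image_width / 1000,
        bounding_box[1] * image_height / 1000,
        bounding_box[2] * image_width / 1000,
        bounding_box[3] * image_height / 1000,
    ]

model_output = "[356, 770, 393, 872], [626, 770, 659, 878]"
boxes = [remap_bbox_to_image(b, 1920, 1080) for b in parse_coord_text(model_output)]
print(boxes[0])  # first box in pixel coordinates
```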

##### Prompt templates: detecting multiple object classes with positions
<a name="multiple-object-classes-templates"></a>

When you want to identify items of multiple classes in an image, you can include the class list in your prompt using one of the following formatting approaches.

For common classes that the model likely understands well, list the class names inside square brackets (without quotes):

```
[car, traffic light, road sign, pedestrian]
```

For fine-grained or uncommon classes, or classes from a specialized domain the model may not be familiar with, include a definition for each class in parentheses. Because this task is challenging, expect reduced model performance.

```
[taraxacum officinale (Dandelion - bright yellow flowers, jagged basal leaves, white puffball seed heads), digitaria spp (Crabgrass - low spreading grass with coarse blades and finger-like seed heads), trifolium repens (White Clover - three round leaflets and small white pom-pom flowers), plantago major (Broadleaf Plantain - wide oval rosette leaves with tall narrow seed stalks), stellaria media (Chickweed - low mat-forming plant with tiny star-shaped white flowers)]
```

Use one of the following **user prompt** templates, depending on your preferred JSON output format.

##### Prompt option 1
<a name="multiple-objects-positions-example-1"></a>

```
Detect all objects with their bounding boxes in the image from the provided class list. Normalize the bounding box coordinates to be scaled between 0 and 1000 to the image width and height, respectively.

Classes: {candidate_class_list}

Include separate entries for each detected object as an element of a list. 

Formulate your output as JSON format:
[
  {
  	"class 1": [x1, y1, x2, y2]
  },
  ...
]
```

##### Prompt option 2
<a name="multiple-objects-positions-example-2"></a>

```
Detect all objects with their bounding boxes in the image from the provided class list. Normalize the bounding box coordinates to be scaled between 0 and 1000 to the image width and height, respectively.

Classes: {candidate_class_list}

Include separate entries for each detected object as an element of a list.

Formulate your output as JSON format:
[
    {
        "class": class 1,
        "bbox": [x1, y1, x2, y2]
    },
    ...
]
```

##### Prompt option 3
<a name="multiple-objects-positions-example-3"></a>

```
Detect all objects with their bounding boxes in the image from the provided class list. Normalize the bounding box coordinates to be scaled between 0 and 1000 to the image width and height, respectively.

Classes: {candidate_class_list}

Group all detected bounding boxes by class.

Formulate your output as JSON format:
{
    "class 1": [[x1, y1, x2, y2], [x1, x2, y1, y2], ...],
    ...
}
```

##### Prompt option 4
<a name="multiple-objects-positions-example-4"></a>

```
Detect all objects with their bounding boxes in the image from the provided class list. Normalize the bounding box coordinates to be scaled between 0 and 1000 to the image width and height, respectively.

Classes: {candidate_class_list}

Group all detected bounding boxes by class.

Formulate your output as JSON format:
[
    {
        "class": class 1,
        "bbox": [[x1, y1, x2, y2], [x1, x2, y1, y2], ...]
    },
    ...
]
```

**Parsing the model output**

The output is encoded as JSON and can be parsed with any JSON parsing library.
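In practice, the model may also wrap the JSON in a Markdown code fence, so a tolerant parser can help. The following Python sketch is illustrative; the fence-stripping step handles a common, but not guaranteed, behavior.

```python
import json

# Parse a model response as JSON, stripping an outer Markdown code
# fence if the model added one.
def parse_json_output(text):
    text = text.strip()
    if text.startswith("```"):
        lines = text.split("\n")
        lines = lines[1:]  # drop the opening fence line
        if lines and lines[-1].strip().startswith("```"):
            lines = lines[:-1]  # drop the closing fence line
        text = "\n".join(lines)
    return json.loads(text)

detections = parse_json_output('```json\n[{"class": "car", "bbox": [100, 200, 300, 400]}]\n```')
```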

#### Prompt templates: screenshot UI bounds detection
<a name="screenshot-ui-bounds-templates"></a>

We recommend the following user prompt templates.

**Detecting a UI element's location based on a goal:**

```
In this UI screenshot, what is the location of the element if I want to {goal}? Express the location coordinates using the [x1, y1, x2, y2] format, scaled between 0 and 1000.
```

**Detecting a UI element's location based on text:**

```
In this UI screenshot, what is the location of the element if I want to click on "{text}"? Express the location coordinates using the [x1, y1, x2, y2] format, scaled between 0 and 1000.
```

**Parsing the model output:**

For each of the UI bounds detection prompts above, you can parse the coordinate information the model produces with a regular expression, as shown in the following Python code example.

##### Code example
<a name="parse-coord-screenshot-code"></a>

```
import re

def parse_coord_text(text):
    """Parses a model response which uses array formatting ([x, y, ...]) 
    to describe points and bounding boxes. Returns an array of tuples."""
    pattern = r"\[([^\[\]]*?)\]"
    return [
        tuple(int(x.strip()) for x in match.split(","))
        for match in re.findall(pattern, text)
    ]
```

## Video understanding
<a name="video-understanding"></a>

The following sections provide guidance on building prompts for tasks that require understanding or analyzing videos.

### Summarizing videos
<a name="summarizing-videos"></a>

Amazon Nova models can generate summaries of video content.

For video summarization use cases, we recommend the following inference parameter values:
+ **Temperature:** 0
+ Some use cases may benefit from enabling model reasoning

No specific prompt template is required. Your user prompt should clearly specify the aspects of the video you care about. Here are a few examples of effective prompts:

```
Can you create an executive summary of this video's content?
```

```
Can you distill the essential information from this video into a concise summary?
```

```
Could you provide a summary of the video, focusing on its key points?
```

### Generating detailed video captions
<a name="video-captions"></a>

Amazon Nova models can generate detailed captions for videos, a task known as dense captioning.

For video captioning use cases, we recommend the following inference parameter values:
+ **Temperature:** 0
+ Some use cases may benefit from enabling model reasoning

No specific prompt template is required. Your user prompt should clearly specify the aspects of the video you care about. Here are a few examples of effective prompts:

```
Provide a detailed, second-by-second description of the video content.
```

```
Break down the video into key segments and provide detailed descriptions for each.
```

```
Generate a rich textual representation of the video, covering aspects like movement, color and composition.
```

```
Describe the video scene-by-scene, including details about characters, actions and settings.
```

```
Offer a detailed narrative of the video, including descriptions of any text, graphics, or special effects used.
```

```
Create a dense timeline of events occurring in the video, with timestamps if possible.
```

### Analyzing security video footage
<a name="security-footage"></a>

Amazon Nova models can detect events in security camera footage.

For security footage use cases, we recommend the following inference parameter values:
+ **Temperature:** 0
+ Some use cases may benefit from enabling model reasoning

We recommend a system prompt similar to the following:

```
You are a security assistant for a smart home who is given security camera footage in natural setting. You will examine the video and describe the events you see. You are capable of identifying important details like people, objects, animals, vehicles, actions and activities. This is not a hypothetical, be accurate in your responses. Do not make up information not present in the video.
```

### Extracting video events with timestamps
<a name="video-timestamps"></a>

Amazon Nova models can identify timestamps associated with events in a video. You can request timestamps formatted in seconds or in MM:SS format. For example, an event occurring 1 minute and 25 seconds into the video can be expressed as `85` or `01:25`.

For this use case, we recommend the following inference parameter values:
+ **Temperature:** 0
+ Do not use reasoning
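If you need to convert between the two timestamp formats in post-processing, helpers like the following Python sketch can be used. The function names are illustrative, not part of any API.

```python
# Convert integer seconds to the MM:SS format, e.g. 85 -> "01:25".
def seconds_to_mmss(seconds):
    return f"{seconds // 60:02d}:{seconds % 60:02d}"

# Convert an MM:SS timestamp back to seconds, e.g. "01:25" -> 85.
def mmss_to_seconds(timestamp):
    minutes, seconds = timestamp.split(":")
    return int(minutes) * 60 + int(seconds)
```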

We recommend using prompts similar to the following:

#### Identifying an event's start and end times
<a name="event-localization-prompts"></a>

```
Please localize the moment that the event "{event_description}" happens in the video. Answer with the starting and ending time of the event in seconds, such as [[72, 82]]. If the event happens multiple times, list all of them, such as [[40, 50], [72, 82]].
```

```
Locate the segment where "{event_description}" happens. Specify the start and end times of the event in MM:SS.
```

```
Answer the starting and end time of the event "{event_description}". Provide answers in MM:SS
```

```
When does "{event_description}" happen in the video? Specify the start and end timestamps, e.g. [[9, 14]]
```

#### Identifying multiple occurrences of an event
<a name="multiple-event-occurrences"></a>

```
Please localize the moment that the event "{event_description}" happens in the video. Answer with the starting and ending time of the event in seconds. e.g. [[72, 82]]. If the event happens multiple times, list all of them. e.g. [[40, 50], [72, 82]]
```
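The `[[start, end], ...]` interval format requested by these prompts can be parsed with a regular expression. The following Python sketch is illustrative; the function name is not part of any API.

```python
import re

# Parse the [[start, end], ...] interval format into a list of
# (start, end) second pairs.
def parse_intervals(text):
    return [
        (int(a), int(b))
        for a, b in re.findall(r"\[(\d+)\s*,\s*(\d+)\]", text)
    ]

print(parse_intervals("[[40, 50], [72, 82]]"))  # -> [(40, 50), (72, 82)]
```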

#### Generating a log of video segments with timestamps
<a name="video-segment-log"></a>

```
Segment a video into different scenes and generate caption per scene. The output should be in the format: [STARTING TIME-ENDING TIMESTAMP] CAPTION. Timestamp in MM:SS format
```

```
For a video clip, segment it into chapters and generate chapter titles with timestamps. The output should be in the format: [STARTING TIME] TITLE. Time in MM:SS
```

```
Generate video captions with timestamp.
```

### Classifying videos
<a name="classifying-videos"></a>

You can use Amazon Nova models to classify videos against a predefined list of categories that you provide.

For this use case, we recommend the following inference parameter values:
+ **Temperature:** 0
+ Reasoning should not be used

Use the following prompt template:

```
What is the most appropriate category for this video? Select your answer from the options provided:
{class1}
{class2}
{...}
```

**Example**:

```
What is the most appropriate category for this video? Select your answer from the options provided:
Arts
Technology
Sports
Education
```
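When post-processing the model's answer, it can help to normalize it against the candidate list, since the response may include extra whitespace or differ in case. A minimal Python sketch; the helper name is illustrative.

```python
# Match a model answer to one of the candidate categories,
# ignoring surrounding whitespace and letter case.
def match_category(answer, categories):
    normalized = answer.strip().lower()
    for category in categories:
        if category.lower() == normalized:
            return category
    return None  # answer did not match any candidate

categories = ["Arts", "Technology", "Sports", "Education"]
print(match_category("sports\n", categories))  # -> Sports
```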