准备数据以用于多模态微调

重要

在开始准备数据集之前，请确认监督式微调（SFT）方法适合您的使用案例。SFT 向模型传授新行为、响应格式和推理模式，但不会向模型传授新的事实知识。如果您的主要目的是引入模型尚未见过的特定领域事实、术语或知识，可在推理阶段通过检索增强生成（RAG）提供相关上下文。有关如何 SFT、强化式微调（RFT）和 RAG 的指导，请参阅基于 SageMaker 训练作业自定义 Amazon Nova。

以下是为微调理解模型准备数据要遵循的指南和要求：

微调所需的最小数据量取决于任务（即是复杂任务还是简单任务），但建议至少为希望模型学习的每项任务提供 100 个数据样本。
建议在训练和推理期间，在零样本设置中使用经过优化的提示，以便获得最佳结果。
训练数据集和验证数据集必须是 JSONL 文件，其中的每一行都是与一条记录对应的一个 JSON 对象。这些文件名只能包含字母数字字符、下划线、连字符、斜杠和句点。
图像和视频限制
1. 数据集不能包含不同的媒体模态。也就是说，数据集可以是带图像的文本，也可以是带视频的文本。
2. 一个样本（消息中的一条记录）可以有多张图像
3. 一个样本（消息中的一条记录）只能有一个视频
schemaVersion 可以是任何字符串值
（可选）system 轮次可以是客户提供的自定义系统提示。
支持的角色为 user 和 assistant。
messages 中的第一轮应始终以 "role": "user" 开头。最后一轮是机器人的响应，以 "role": "assistant" 表示。
image.source.s3Location.uri 和 video.source.s3Location.uri 必须可供 Amazon Bedrock 访问。
Amazon Bedrock 服务角色必须能够访问 Amazon S3 中的图像文件。有关授予访问权限的更多信息，请参阅 Create a service role for model customization
图像或视频必须与数据集位于同一个 Amazon S3 存储桶中。例如，若数据集位于 s3://amzn-s3-demo-bucket/train/train.jsonl 中，则图像或视频必须位于 s3://amzn-s3-demo-bucket 中
术语 User:、Bot:、Assistant:、System:、<image>、<video> 和 [EOS] 是保留关键字。如果用户提示或系统提示以其中任何一个关键字开头，或在提示中的任意位置使用这些关键字，则训练作业将会因数据问题而失败。如果您的使用案例需要使用这些关键字，则必须将其替换为具有近似含义的其他关键字，然后才能继续进行训练。

注意

要在提交微调作业之前验证数据集，可以使用 GitHub 上提供的数据集验证脚本。

设计有效的训练示例

您的训练数据应会表现出您希望模型表现的行为。SFT 仅传授模型如何响应，而不会传授需要知道的知识。如果您发现自己创建的训练示例主要是为了注入事实性知识（例如，“错误代码 E-45 是什么意思？”，答案为“E-45 表示传感器超时”），则应考虑 RAG 或提示工程是否能够在不进行微调的情况下获得相同的结果。

将源数据转换为对话格式时，应遵循以下原则：

从真实用户查询出发。编写能够反映最终用户在生产中实际向模型发出提示的方式的用户轮次。避免不能反映真实使用模式的人造问题或过度简化的问题。
编写黄金标准的助手响应。每个助手轮次都应是您希望模型产生的理想响应——在所有示例中都正确、格式恰当、语气和结构一致。
策略性地使用系统提示。将输出格式、角色和领域上下文等永久说明放入系统轮次。在训练和推理期间使用相同的系统提示，以获得最佳结果。
确保每个示例专注于一项技能。如果需要模型执行分类、回答问题和进行总结，应为每项技能创建单独的示例集，而不是将多个技能组合到单个示例中。
质量的重要性高于数量。一般而言，几百个高质量、一致的示例优于数千个杂乱或相互矛盾的示例。在训练之前删除重复数据并检查数据。

常用数据转换模式

以下模式演示了如何将常用的企业数据格式转换为有效的 SFT 训练示例。

模式 1：将结构化数据转换为分类或查询

当源数据是由具有诸如 issue、cause 和 resolution 之类字段的记录组成的表或数据库时，则将每条具有相关输入字段的记录作为用户问题，并将具有目标输出字段的记录作为助手的响应。


// train.jsonl
{
  "schemaVersion": "bedrock-conversation-2024",
  "system": [
    {
      "text": "You are a technical support assistant. Given an issue description, identify the most likely root cause and recommended resolution."
    }
  ],
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "text": "A user reports: 'Application crashes on startup after the latest update.' What is the likely cause and resolution?"
        }
      ]
    },
    {
      "role": "assistant",
      "content": [
        {
          "text": "Cause: Incompatible configuration file from the previous version.\nResolution: Delete the local configuration cache and restart the application to regenerate default settings."
        }
      ]
    }
  ]
}

模式 2：将层次结构或分类法数据转换为结构化问答

当源数据是诸如产品类别树或分类系统之类的代码层次结构或分类法数据时，则将生成要求模型浏览层次结构的问题。在系统提示或用户轮次中包含分类上下文。


// train.jsonl
{
  "schemaVersion": "bedrock-conversation-2024",
  "system": [
    {
      "text": "You are a product catalog assistant. Answer questions about product categories and classifications accurately."
    }
  ],
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "text": "What top-level category does product code 'EL-4520' belong to, and what subcategories does it include?"
        }
      ]
    },
    {
      "role": "assistant",
      "content": [
        {
          "text": "Product code EL-4520 belongs to the Electronics category. It includes the subcategories: Audio Equipment, Portable Speakers, and Wireless Accessories."
        }
      ]
    }
  ]
}

模式 3：从流程文档转换为指令性对话

当源数据由标准操作程序、运行手册或分步指南组成时，则会将流程转换为以任务为导向的问题。当用户询问如何执行某个任务时，助手以一致的格式提供操作步骤。


// train.jsonl
{
  "schemaVersion": "bedrock-conversation-2024",
  "system": [
    {
      "text": "You are an IT operations assistant. Provide step-by-step instructions based on approved runbooks and standard operating procedures."
    }
  ],
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "text": "How do I reset a user's multi-factor authentication when they have lost access to their device?"
        }
      ]
    },
    {
      "role": "assistant",
      "content": [
        {
          "text": "Follow these steps to reset a user's multi-factor authentication:\n1. Verify the user's identity using the approved verification procedure.\n2. Disable the existing MFA configuration in the identity management console.\n3. Generate a new temporary access code and send it to the user's verified email.\n4. Instruct the user to log in with the temporary code and re-enroll their new device.\n5. Confirm the new MFA device is working by requesting a test authentication.\n6. Document the reset in the support ticket."
        }
      ]
    }
  ]
}

注意

在模式 1 和 2 中，如果模型的预训练数据中还没有事实内容（例如代码、类别或解决方法），可考虑在推理时使用 RAG 提供这些信息，而不是完全依靠 SFT 来记住这些信息。SFT 对于向模型传授响应格式和推理模式最为有效，而 RAG 则负责事实基础。

示例数据集格式

以下示例数据集格式可供参考。

以下示例仅适用于对文本进行自定义微调。


// train.jsonl
{
  "schemaVersion": "bedrock-conversation-2024",
  "system": [
    {
      "text": "You are a digital assistant with a friendly personality"
    }
  ],
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "text": "What is the capital of Mars?"
        }
      ]
    },
    {
      "role": "assistant",
      "content": [
        {
          "text": "Mars does not have a capital. Perhaps it will one day."
        }
      ]
    }
  ]
}

以下示例适用于对文本和单张图像进行自定义微调。


// train.jsonl{
    "schemaVersion": "bedrock-conversation-2024",
    "system": [{
        "text": "You are a smart assistant that answers questions respectfully"
    }],
    "messages": [{
            "role": "user",
            "content": [{
                    "text": "What does the text in this image say?"
                },
                {
                    "image": {
                        "format": "png",
                        "source": {
                            "s3Location": {
                                "uri": "s3://your-bucket/your-path/your-image.png",
                                "bucketOwner": "your-aws-account-id"
                            }
                        }
                    }
                }
            ]
        },
        {
            "role": "assistant",
            "content": [{
                "text": "The text in the attached image says 'LOL'."
            }]
        }
    ]
}

以下示例适用于对文本和视频进行自定义微调。


{
    "schemaVersion": "bedrock-conversation-2024",
    "system": [{
        "text": "You are a helpful assistant designed to answer questions crisply and to the point"
    }],
    "messages": [{
            "role": "user",
            "content": [{
                    "text": "How many white items are visible in this video?"
                },
                {
                    "video": {
                        "format": "mp4",
                        "source": {
                            "s3Location": {
                                "uri": "s3://your-bucket/your-path/your-video.mp4",
                                "bucketOwner": "your-aws-account-id"
                            }
                        }
                    }
                }
            ]
        },
        {
            "role": "assistant",
            "content": [{
                "text": "There are at least eight visible items that are white"
            }]
        }
    ]
}

数据集限制

Amazon Nova 对理解模型的模型自定义应用以下限制。

模型	最小样本数	最大样本数	上下文长度
Amazon Nova Micro	8	20k	32k
Amazon Nova Lite	8	20k	32k
Amazon Nova Pro	8	20k	32k

图像和视频限制
最大图像数	10 张/样本
图像文件最大大小	10 MB
最大视频数	1 个/样本
视频最大长度/时长	90 秒
视频文件最大大小	50 MB

支持的媒体格式

图像 – png、jpeg、gif、webp
视频 – mov、mkv、mp4、webm

Javascript 在您的浏览器中被禁用或不可用。

要使用 Amazon Web Services 文档，必须启用 Javascript。请参阅浏览器的帮助页面以了解相关说明。

文档惯例

微调 Nova 1.0

蒸馏