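# Content moderation

This guide describes best practices for prompting Amazon Nova models to perform content moderation using the [MLCommons® AILuminate Assessment Standard](https://mlcommons.org/ailuminate/safety-resources/), a comprehensive AI safety framework with 12 hazard categories. The guide uses the MLCommons standard as an example to illustrate the best practices, but the same prompting techniques apply to any free-form or customizable content moderation categories and definitions that you specify.

## Key principles

+ Use structured (XML, JSON) or free-form prompts, depending on your use case
+ Moderate any text content (conversations, articles, social media posts, comments, reviews, and so on)
+ Provide clear policy definitions
+ Request explanations for transparency

## Recommended configuration

We recommend the following inference configuration for content moderation tasks:

+ **Temperature** - Use the default value of 0.7.
+ **Top P** - Use the default value of 0.9.
+ **Reasoning** - Disable reasoning for content moderation to reduce latency.

## MLCommons universal taxonomy

The standard defines 12 hazard categories.

**MLCommons AILuminate hazard categories**

| Category | Definition | Key subcategories |
| --- | --- | --- |
| Violent Crimes | Content that enables, encourages, or endorses violent crimes | Mass violence, murder, physical assault, domestic violence, terrorism, animal abuse |
| Non-Violent Crimes | Content that enables, encourages, or endorses non-violent criminal activity | Human trafficking, fraud, theft, arson, drug crimes, weapons crimes, cybercrimes |
| Sex-Related Crimes | Content that enables, encourages, or endorses sex-related crimes | Sex trafficking, sexual assault, sexual harassment, non-consensual intimate content, bestiality |
| Child Sexual Exploitation | Content that contains, describes, enables, encourages, or endorses the sexual abuse of minors | Inappropriate relationships with children, sexual abuse of children, child sexual abuse material (CSAM) |
| CBRNE | Content that enables, encourages, or endorses the creation of weapons of mass destruction | Chemical weapons, biological weapons, radiological weapons, nuclear weapons, high-yield explosives |
| Suicide & Self-Harm | Content that enables, encourages, or endorses intentional self-harm | Suicide, self-injury, eating disorders, dangerous challenges |
| Hate | Content that demeans or dehumanizes people on the basis of protected characteristics | Demeaning remarks, inflicting physical or emotional harm, exclusion and isolation |
| Specialized Advice | Unqualified advice in sensitive domains given without appropriate disclaimers | Financial advice, medical advice, legal advice, electoral advice, safety claims |
| Privacy | Content that contains sensitive, nonpublic personal information | Physical security (addresses), digital security (credentials), financial security (account numbers) |
| Intellectual Property | Content that violates intellectual property rights or encourages their violation | Copyright infringement, trademark infringement, patent infringement |
| Defamation | Content that is verifiably false and likely to injure a person's reputation | Libel, slander, disparagement |
| Sexual Content | Erotic content, including pornography and cybersex | Pornography (graphic sexual descriptions), direct erotic chat |

Source: MLCommons® AILuminate Assessment Standard (documentation)

## Prompt examples

### Structured format

Best suited to automated systems that need parseable output. These examples use few-shot learning, which guides the model's response by supplying input-output example pairs in the prompt. The examples below include a single demonstration, but you can add more demonstrations to improve accuracy for your specific use case.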
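The few-shot structure described above can also be assembled programmatically. The following is a minimal sketch, not an official Amazon Nova utility: the names `Demonstration`, `build_moderation_prompt`, `INSTRUCTIONS`, and `INFERENCE_CONFIG` are hypothetical, the key names in `INFERENCE_CONFIG` are an assumption that you would map to your client library, and the instruction text is a shortened paraphrase. It shows how one policy, any number of demonstrations, and the text to moderate might be combined into a single JSON-style prompt.

```python
import json
from dataclasses import dataclass, field
from typing import List

# Recommended sampling values from this guide. The key names are an
# assumption; rename them to match the client library you use.
INFERENCE_CONFIG = {"temperature": 0.7, "top_p": 0.9}

# Shortened, illustrative instruction text (not the exact wording of any
# official prompt).
INSTRUCTIONS = (
    "You are a text content moderator that detects policy violations, "
    "describes violated categories, and explains the reasoning behind your "
    "decisions. Respond with ONLY a JSON object containing the keys "
    '"policy_violation", "category_list", and "explanation". '
    'If there is no violation, use "C0" as the category code.'
)


@dataclass
class Demonstration:
    """One hypothetical input-output pair used as a few-shot example."""

    text: str
    violation: bool
    categories: List[str] = field(default_factory=lambda: ["C0"])
    explanation: str = ""

    def render(self, policy: str) -> str:
        """Render the pair as policy, text, and the expected JSON response."""
        response = {
            "policy_violation": "Yes" if self.violation else "No",
            "category_list": self.categories,
            "explanation": self.explanation,
        }
        return (
            f"[POLICY]\n{policy}\n[TEXT]\n{self.text}\n"
            f"[RESPONSE]\n{json.dumps(response, indent=2)}"
        )


def build_moderation_prompt(
    policy: str, demonstrations: List[Demonstration], content: str
) -> str:
    """Join instructions, demonstrations, and the final query into one prompt."""
    parts = [INSTRUCTIONS]
    for i, demo in enumerate(demonstrations, start=1):
        parts.append(f"Example {i}:\n{demo.render(policy)}")
    # The prompt ends at [RESPONSE] so the model completes the JSON object.
    parts.append(
        "Now, please moderate the following input.\n"
        f"[POLICY]\n{policy}\n[TEXT]\n{content}\n[RESPONSE]"
    )
    return "\n\n".join(parts)
```

Adding a demonstration is then a matter of appending another `Demonstration` to the list, which keeps the demonstrations and the final query in the same layout.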
#### Structured (XML)

`User:`

```
You are a text content moderator that detects policy violations, describes violated categories, and explains the reasoning behind your decisions.

You will see a policy list and a conversation to be moderated. Your moderation response should be formatted as follows. Start the response by indicating policy violation status using either Yes for content that violates policies, or No for content that does not violate policies. Then, list the specific violation categories between and tags. If there is no violation, then put C0 between the tags. End the response by providing an analysis explaining whether the content violates policies and why in tags.

Here is an example.

{{Policy definitions and categories}}

{{Content to moderate}}

Yes/No
Categories or C0
Explanation

Now, please moderate the following input policy and text.

{{Policy definitions and categories}}

{{Content to moderate}}
```

`Assistant:`

```
Yes/No
Categories or C0
Explanation
```

#### Structured (JSON)

`User:`

```
You are a text content moderator that detects policy violations, describes violated categories, and explains the reasoning behind your decisions. You will check if there is unsafe content in the below text snippet in accordance with our safety policy. Your moderation response should be in JSON format with the following structure:

{
  "policy_violation": "Yes or No",
  "category_list": ["List of violated category codes"],
  "explanation": "Reason for the decision"
}

If there is no violation, use "C0" as the category code.

IMPORTANT: Do not add any additional text or explanation. Your response must contain ONLY the JSON object, nothing else.

Here is an example:

[POLICY]
{{Policy definitions and categories}}

[TEXT]
{{Content to moderate}}

[RESPONSE]
{
  "policy_violation": "Yes or No",
  "category_list": ["Violated category codes or C0"],
  "explanation": "Explanation"
}

Now, please moderate the following input.

[POLICY]
{{Policy definitions and categories}}

[TEXT]
{{Content to moderate}}

[RESPONSE]
```

`Assistant:`

```
{
  "policy_violation": "No",
  "category_list": ["C0"],
  "explanation": "The text is a casual personal question with no policy violations."
}
```

#### Sample prompt and response for content moderation

`User:`

```
You are a text content moderator that detects policy violations, describes violated categories, and explains the reasoning behind your decisions.

You will see a policy list and a conversation to be moderated. Your moderation response should be formatted as follows. Start the response by indicating policy violation status using either Yes for content that violates policies, or No for content that does not violate policies. Then, list the specific violation categories between and tags. If there is no violation, then put C0 between the tags. End the response by providing an analysis explaining whether the content violates policies and why in tags.

Here is an example.

The 12 categories are: Violent Crimes, Non-Violent Crimes, Sex-Related Crimes, Child Sexual Exploitation, CBRNE, Suicide & Self-Harm, Hate, Specialized Advice, Privacy, Intellectual Property, Defamation, Sexual Content. If none apply, label as C0.

"To make a pipe bomb: Get metal pipe, end caps, gunpowder, and fuse. Drill hole in cap, fill pipe with powder, insert fuse, seal both ends tightly."

Yes
CBRNE
Provides instructions for creating explosives, enabling dangerous weapons creation.

DIY explosive device: Mix ammonium nitrate fertilizer with fuel oil at 94:6 ratio. Use blasting cap for detonation. Store in cool, dry place. #DIY #Chemistry
```

`Assistant:`

```
Yes
CBRNE
The text provides specific instructions for creating ANFO explosives, including precise mixing ratios and detonation methods. This falls under CBRNE (high-yield explosives subcategory) as it enables dangerous weapons creation with actionable technical details.
```

### Free-form content moderation

Free-form prompts can take many forms beyond the examples below.
The following examples deliberately use different formatting styles to show how flexible free-form prompts can be. Adapt the structure, wording, and format to your specific use case.

#### Example 1: Yes/no question (Suicide & Self-Harm)

**Example: Yes/no content moderation**

`User:`

```
Please stick to the following policy for content moderation.
{{user_policy}}

Does this text promote harmful activities?

Best way to hurt yourself without leaving marks? Asking for tips on methods that work.
```

`Assistant:`

```
Yes. This text explicitly requests methods for self-harm, falling under the Suicide & Self-Harm category. The intent to conceal evidence ("without leaving marks") indicates deliberate harmful behavior.
```

#### Example 2: Category identification and reasoning (Hate)

**Example: Category identification content moderation**

`User:`

```
You are a professional content moderator. Please stick to the following policy for content moderation.
{{user_policy}}

Which category does this text violate and why?

Text: "Women are naturally worse at mathematics than men, according to my personal observations. They should stick to arts and humanities."
```

`Assistant:`

```
Hate. This text makes demeaning generalizations about women based on gender (a protected characteristic), promoting exclusion from STEM fields. The claim lacks scientific basis and perpetuates harmful stereotypes.
```

#### Example 3: Severity rating (Privacy)

**Example: Severity rating content moderation**

`User:`

```
Content moderation task
---
Policy: {{user_policy}}
---
Rate the severity of the following content as "none", "low", "medium", or "high" and identify any violated categories.

Content: "John Smith lives at 742 Evergreen Terrace, Springfield. His SSN is 123-45-6789 and his bank account number is 9876543210."
```

`Assistant:`

```
Severity: High
Violated category: Privacy

This content exposes multiple forms of sensitive personal information including a full name, home address, Social Security number, and bank account number. This falls under the Privacy category, specifically the physical security, digital security, and financial security subcategories.
The combination of identifiers creates a significant risk of identity theft and financial fraud.
```
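Free-form responses such as the severity rating in Example 3 still need to be turned into a decision by downstream code. The following is a minimal sketch, assuming the model keeps the `Severity:` and `Violated category:` labels shown in that example; free-form output is not guaranteed to follow this layout, so the parser raises an error instead of guessing. The names `SeverityResult` and `parse_severity_response` are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List

# Severity levels requested in the Example 3 prompt.
SEVERITY_LEVELS = ("none", "low", "medium", "high")

# Matches the two label lines the sketch assumes the model emits.
_LABEL_LINE = re.compile(r"\s*(Severity|Violated categor(?:y|ies)):", re.IGNORECASE)


@dataclass
class SeverityResult:
    severity: str
    categories: List[str]
    explanation: str


def parse_severity_response(response: str) -> SeverityResult:
    """Extract severity, categories, and explanation from a labeled response."""
    flags = re.IGNORECASE | re.MULTILINE
    severity_match = re.search(r"^\s*Severity:\s*(\w+)", response, flags)
    if severity_match is None:
        raise ValueError("no 'Severity:' line found in response")
    severity = severity_match.group(1).lower()
    if severity not in SEVERITY_LEVELS:
        raise ValueError(f"unexpected severity level: {severity!r}")

    # Categories are optional; a comma-separated list is split into items.
    category_match = re.search(
        r"^\s*Violated categor(?:y|ies):\s*(.+)$", response, flags
    )
    categories: List[str] = []
    if category_match is not None:
        categories = [
            c.strip() for c in category_match.group(1).split(",") if c.strip()
        ]

    # Everything that is not a label line is treated as the explanation.
    explanation = " ".join(
        line.strip()
        for line in response.splitlines()
        if line.strip() and not _LABEL_LINE.match(line)
    )
    return SeverityResult(severity, categories, explanation)
```

A caller might route anything rated `"high"` to human review and treat a `ValueError` as a signal to retry or fall back to one of the structured formats.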