

# Generative AI and NLP approaches for healthcare and life sciences
<a name="hcls-options"></a>

Natural language processing (NLP) is a machine learning technology that gives computers the ability to interpret, manipulate, and comprehend human language. Healthcare and life science organizations have large volumes of data from patient records. They can use NLP software to automatically process this data. For example, they can combine NLP with generative AI to streamline medical coding, extract patient information, and summarize records.

Depending on the NLP task that you want to perform, different architectures might be best suited for your use case. This guide addresses the following generative AI and NLP options for healthcare and life science applications on AWS:
+ [Using Amazon Comprehend Medical](comprehend-medical.md) – Learn about how to use Amazon Comprehend Medical independently, without integrating it with a large language model (LLM).
+ [Combining Amazon Comprehend Medical with large language models](comprehend-medical-rag.md) – Learn about how to combine Amazon Comprehend Medical with an LLM in a Retrieval Augmented Generation (RAG) architecture.
+ [Using large language models for healthcare and life science use cases](llms.md) – Learn about how to use an LLM for healthcare and life science applications, either by using a fine-tuned LLM or a RAG architecture.

# Using Amazon Comprehend Medical
<a name="comprehend-medical"></a>

[Amazon Comprehend Medical](https://docs.aws.amazon.com/comprehend-medical/latest/dev/comprehendmedical-welcome.html) is an AWS service that detects and returns useful information in unstructured clinical text such as physician's notes, discharge summaries, test results, and case notes. It uses natural language processing (NLP) models to detect entities. *Entities* are textual references to medical information, such as medical conditions, medications, or protected health information (PHI).

**Important**  
Amazon Comprehend Medical is not a substitute for professional medical advice, diagnosis, or treatment. Amazon Comprehend Medical provides confidence scores that indicate the level of confidence in the accuracy of the detected entities. Identify the right confidence threshold for your use case, and use high confidence thresholds in situations that require high accuracy. In certain use cases, results should be reviewed and verified by appropriately trained human reviewers. For example, Amazon Comprehend Medical should only be used in patient care scenarios after review for accuracy and sound medical judgment by trained medical professionals.

You can access Amazon Comprehend Medical through the AWS Management Console, the AWS Command Line Interface (AWS CLI), or through the AWS SDKs. The AWS SDKs are available for various programming languages and platforms, such as Java, Python, Ruby, .NET, iOS, and Android. You can use the SDKs to programmatically access Amazon Comprehend Medical from your client application.

This section reviews the main capabilities of Amazon Comprehend Medical. It also discusses the advantages of using this service compared to a large language model (LLM).

## Amazon Comprehend Medical capabilities
<a name="comprehend-medical-capabilities"></a>

Amazon Comprehend Medical offers APIs for near real-time and batch inference. These APIs can ingest medical text and provide results for medical NLP tasks by using medical entity recognition and identifying entity relationships. You can perform analysis either on single files or as a batch analysis on multiple files stored in an Amazon Simple Storage Service (Amazon S3) bucket. Amazon Comprehend Medical offers the following text analysis API operations for synchronous entity detection:
+ [Detect entities](https://docs.aws.amazon.com/comprehend-medical/latest/dev/textanalysis-entitiesv2.html) – Detects general medical categories such as anatomy, medical condition, PHI category, procedures, and time expressions.
+ [Detect PHI](https://docs.aws.amazon.com/comprehend-medical/latest/dev/textanalysis-phi.html) – Detects specific entities such as age, date, name, and similar personal information.
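
As a sketch only, the following shows how these synchronous operations might be called and their results flattened. The `detect_medical_entities` function assumes the AWS SDK for Python (Boto3) and valid credentials, so it is defined but not invoked here; the helper and sample data below it are illustrative, shaped like the documented `DetectEntitiesV2` response:

```python
def detect_medical_entities(text):
    """Call the DetectEntitiesV2 API (requires boto3 and AWS credentials)."""
    import boto3  # imported here so the pure helper below runs without the SDK
    client = boto3.client("comprehendmedical")
    return client.detect_entities_v2(Text=text)


def summarize_entities(response):
    """Flatten a DetectEntitiesV2-shaped response into (text, category, score) tuples."""
    return [
        (e["Text"], e["Category"], round(e["Score"], 2))
        for e in response.get("Entities", [])
    ]


# Illustrative fragment shaped like the documented DetectEntitiesV2 output.
sample_response = {
    "Entities": [
        {"Text": "Topamax", "Category": "MEDICATION", "Score": 0.97},
        {"Text": "seizure", "Category": "MEDICAL_CONDITION", "Score": 0.91},
    ]
}
print(summarize_entities(sample_response))
```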

Amazon Comprehend Medical also includes multiple API operations that you can use to perform batch text analysis on clinical documents. To learn more about how to use these API operations, see [Text analysis batch APIs](https://docs.aws.amazon.com/comprehend-medical/latest/dev/textanalysis-batchapi.html).

Use Amazon Comprehend Medical to detect entities in clinical text and link those entities to concepts in standardized medical ontologies, including the RxNorm, ICD-10-CM, and SNOMED CT knowledge bases. You can perform analysis either on single files or as a batch analysis on large documents or multiple files stored in an Amazon S3 bucket. Amazon Comprehend Medical offers the following ontology linking API operations:
+ [InferICD10CM](https://docs.aws.amazon.com/comprehend-medical/latest/dev/ontology-icd10.html) – The **InferICD10CM** operation detects potential medical conditions and links them to codes from the 2019 version of the International Classification of Diseases, 10th Revision, Clinical Modification (ICD-10-CM). For each potential medical condition detected, Amazon Comprehend Medical lists the matching ICD-10-CM codes and descriptions. Listed medical conditions in the results include a confidence score, which indicates the confidence that Amazon Comprehend Medical has in the accuracy of the entities to the matched concepts in the results.
+ [InferRxNorm](https://docs.aws.amazon.com/comprehend-medical/latest/dev/ontology-RxNorm.html) – The **InferRxNorm** operation identifies medications that are listed in a patient record as entities. It links entities to concept identifiers (RxCUI) from the RxNorm database from the National Library of Medicine. Each RxCUI is unique for different strengths and dose forms. Listed medications in the results include a confidence score, which indicates the confidence that Amazon Comprehend Medical has in the accuracy of the entities matched to the concepts from the RxNorm knowledge base. Amazon Comprehend Medical lists the top RxCUIs that potentially match for each medication that it detects in descending order based on confidence score.
+ [InferSNOMEDCT](https://docs.aws.amazon.com/comprehend-medical/latest/dev/ontology-linking-snomed.html) – The **InferSNOMEDCT** operation identifies possible medical concepts as entities and links them to codes from the 2021-03 version of the Systematized Nomenclature of Medicine, Clinical Terms (SNOMED CT). SNOMED CT provides a comprehensive vocabulary of medical concepts, including medical conditions and anatomy, as well as medical tests, treatments, and procedures. For each matched concept ID, Amazon Comprehend Medical returns the top five medical concepts, each with a confidence score and contextual information such as traits and attributes. The SNOMED CT concept IDs can then be used to structure patient clinical data for medical coding, reporting, or clinical analytics when used with the SNOMED CT polyhierarchy.

For more information, see [Text analysis APIs](https://docs.aws.amazon.com/comprehend-medical/latest/dev/comprehendmedical-textanalysis.html) and [Ontology Linking APIs](https://docs.aws.amazon.com/comprehend-medical/latest/dev/comprehendmedical-ontologies.html) in the Amazon Comprehend Medical documentation.
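
Because the ontology linking operations return several candidate concepts per entity, ranked by confidence score, client code often keeps only the top match. The following is a minimal sketch; the sample data is illustrative, shaped like the documented `InferICD10CM` response:

```python
def top_icd10cm_codes(response):
    """For each detected entity, keep the highest-scoring linked ICD-10-CM concept."""
    results = {}
    for entity in response.get("Entities", []):
        concepts = entity.get("ICD10CMConcepts", [])
        if concepts:
            best = max(concepts, key=lambda c: c["Score"])
            results[entity["Text"]] = (best["Code"], best["Description"])
    return results


# Illustrative fragment shaped like an InferICD10CM response.
sample = {
    "Entities": [
        {
            "Text": "seizure",
            "ICD10CMConcepts": [
                {"Code": "R56.9", "Description": "Unspecified convulsions", "Score": 0.83},
                {"Code": "G40.909", "Description": "Epilepsy, unspecified", "Score": 0.54},
            ],
        }
    ]
}
print(top_icd10cm_codes(sample))  # {'seizure': ('R56.9', 'Unspecified convulsions')}
```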

## Use cases for Amazon Comprehend Medical
<a name="comprehend-medical-use-cases"></a>

As a standalone service, Amazon Comprehend Medical might address your organization's use case. Amazon Comprehend Medical can perform tasks such as the following:
+ Help with medical coding in patient records
+ Detect protected health information (PHI) data
+ Validate medications, including attributes such as dosage, frequency, and form

The results from Amazon Comprehend Medical meet the needs of the majority of medical practices. However, you might need to consider alternatives if you have limitations such as the following:
+ **Different entity definitions** – For example, your definition of `FREQUENCY` of a medication entity might differ. For frequency, Amazon Comprehend Medical predicts *as needed*, but your organization might use the term *pro re nata (PRN)*.
+ **Overwhelming quantity of results** – For example, patient notes frequently contain multiple symptoms and keywords that map to multiple ICD-10-CM codes. However, several of the keywords are not applicable for diagnosis. In this case, the provider has to evaluate numerous ICD-10-CM entities and their confidence scores, which requires manual processing time.
+ **Custom entities or NLP tasks** – For example, providers might want to extract PRN evidence, such as *take as needed for pain*. Because this isn't available through Amazon Comprehend Medical, a different AI/ML model is warranted. A different AI/ML solution is required if the NLP task is outside of entity recognition, such as summarization, question-answering, and sentiment analysis.

# Combining Amazon Comprehend Medical with large language models
<a name="comprehend-medical-rag"></a>

A [2024 study by NEJM AI](https://ai.nejm.org/doi/pdf/10.1056/AIdbp2300040) showed that using an LLM, with zero-shot prompting, for medical coding tasks generally leads to poor performance. Using Amazon Comprehend Medical with an LLM can help mitigate these performance issues. Amazon Comprehend Medical results are helpful context for an LLM that is performing NLP tasks. For example, providing context from Amazon Comprehend Medical to the large language model can help you:
+ Enhance the accuracy of entity selections by using the initial results from Amazon Comprehend Medical as context for the LLM
+ Implement custom entity recognition, summarization, question-answering, and additional use cases

This section describes how you can combine Amazon Comprehend Medical with an LLM by using a Retrieval Augmented Generation (RAG) approach. *Retrieval Augmented Generation (RAG)* is a generative AI technology in which an LLM references an authoritative data source that is outside of its training data sources before generating a response. For more information, see [What is RAG](https://aws.amazon.com/what-is/retrieval-augmented-generation/).

To illustrate this approach, this section uses the example of medical (diagnosis) coding related to ICD-10-CM. It includes a sample architecture and prompt engineering templates to help accelerate your innovation. It also includes best practices for using Amazon Comprehend Medical within a RAG workflow.

## RAG-based architecture with Amazon Comprehend Medical
<a name="comprehend-medical-rag-architecture"></a>

The following diagram illustrates a RAG approach for identifying ICD-10-CM diagnosis codes from patient notes. It uses Amazon Comprehend Medical as a knowledge source. In a RAG approach, the retrieval method commonly retrieves information from a vector database containing applicable knowledge. Instead of a vector database, this architecture uses Amazon Comprehend Medical for the retrieval task. The orchestrator sends the patient note information to Amazon Comprehend Medical and retrieves the ICD-10-CM code information. The orchestrator sends this context to the downstream foundation model (LLM), through Amazon Bedrock. The LLM generates a response by using the ICD-10-CM code information, and that response is sent back to the client application.

![\[A RAG workflow that uses Amazon Comprehend Medical as a knowledge source.\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/generative-ai-nlp-healthcare/images/architecture-comprehend-medical-rag-workflow.png)


The diagram shows the following RAG workflow:

1. The client application sends the patient notes as a query to the orchestrator. An example of these patient notes might be "The patient is a 71-year-old female patient of Dr. X. The patient presented to the emergency room last evening with approximately 7-day to 8-day history of abdominal pain, which has been persistent. She has had no definite fevers or chills and no history of jaundice. The patient denies any significant recent weight loss."

1. The orchestrator uses Amazon Comprehend Medical to retrieve ICD-10-CM codes relevant to the medical information in the query. It uses the **InferICD10CM** API to extract and infer the ICD-10-CM codes from the patient notes.

1. The orchestrator constructs a prompt that includes the prompt template, the original query, and the ICD-10-CM codes retrieved from Amazon Comprehend Medical. It sends this enhanced context to Amazon Bedrock.

1. Amazon Bedrock processes the input and uses a foundation model to generate a response that includes the ICD-10-CM codes and their corresponding evidence from the query. The generated response includes the identified ICD-10-CM codes and evidence from the patient notes that supports each code. The following is a sample response:

   ```
   <response>
   <icd10>
   <code>R10.9</code>
   <evidence>history of abdominal pain</evidence>
   </icd10>
   <icd10>
   <code>R10.30</code>
   <evidence>history of abdominal pain</evidence>
   </icd10>
   </response>
   ```

1. Amazon Bedrock sends the generated response to the orchestrator.

1. The orchestrator sends the response back to the client application, where the user can review the response.
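
The workflow steps above can be sketched in code. The retrieval (step 2) and generation (step 4) calls require boto3, credentials, and model access, so this illustrative sketch covers only step 3, the part the orchestrator controls directly: merging the original query and the retrieved ICD-10-CM codes into one enhanced prompt. The template and helper names are assumptions, not an AWS-provided interface:

```python
PROMPT_TEMPLATE = """<patient_note>
{note}
</patient_note>

<comprehend_medical_results>
{codes}
</comprehend_medical_results>

<prompt>
Given the patient note and the ICD-10-CM code results above, select the most
relevant diagnosis codes and cite the supporting evidence from the note.
</prompt>"""


def build_rag_prompt(note, icd10_codes):
    """Step 3 of the workflow: merge the query and retrieved codes into one prompt."""
    code_lines = "\n".join(
        f"<code>{code}</code> <description>{desc}</description>"
        for code, desc in icd10_codes
    )
    return PROMPT_TEMPLATE.format(note=note, codes=code_lines)


prompt = build_rag_prompt(
    "71-year-old female with a 7- to 8-day history of abdominal pain.",
    [("R10.9", "Unspecified abdominal pain")],
)
print(prompt)
```

The resulting string is what the orchestrator would send to the foundation model through Amazon Bedrock in step 4.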

## Use cases for using Amazon Comprehend Medical in a RAG workflow
<a name="comprehend-medical-rag-use-cases"></a>

Amazon Comprehend Medical can perform specific NLP tasks. For more information, see [Use cases for Amazon Comprehend Medical](comprehend-medical.md#comprehend-medical-use-cases).

You might want to integrate Amazon Comprehend Medical into a RAG workflow for advanced use cases, such as the following:
+ Generate detailed clinical summaries by combining extracted medical entities with contextual information from patient records
+ Automate medical coding for complex cases by using extracted entities with ontology-linked information for code assignment
+ Automate the creation of structured clinical notes from unstructured text by using extracted medical entities
+ Analyze medication side effects based on extracted medication names and attributes
+ Develop intelligent clinical support systems that combine extracted medical information with up-to-date research and guidelines

## Best practices for using Amazon Comprehend Medical in a RAG workflow
<a name="comprehend-medical-rag-best-practices"></a>

When integrating Amazon Comprehend Medical results into a prompt for an LLM, it's essential to follow best practices that can improve performance and accuracy. The following are key recommendations:
+ **Understand Amazon Comprehend Medical confidence scores** – Amazon Comprehend Medical provides confidence scores for each detected entity and ontology linking. It's crucial to understand the meaning of these scores and establish appropriate thresholds for your specific use case. Confidence scores help filter out low-confidence entities, reducing noise and improving the quality of the LLM's input.
+ **Use confidence scores in prompt engineering** – When crafting prompts for the LLM, consider incorporating Amazon Comprehend Medical confidence scores as additional context. This helps the LLM prioritize or weigh entities based on their confidence levels, potentially improving the quality of the output.
+ **Evaluate Amazon Comprehend Medical results with ground truth data** – *Ground truth data* is information that is known to be true. It can be used to validate that an AI/ML application is producing accurate results. Before integrating Amazon Comprehend Medical results into your LLM workflow, evaluate the service's performance on a representative sample of your data. Compare the results with ground truth annotations to identify potential discrepancies or areas for improvement. This evaluation helps you understand the strengths and limitations of Amazon Comprehend Medical for your use case.
+ **Strategically select relevant information** – Amazon Comprehend Medical can provide a large amount of information, but not all of it may be relevant to your task. Carefully select the entities, attributes, and metadata that are most relevant to your use case. Providing too much irrelevant information to the LLM can introduce noise and potentially decrease performance.
+ **Align entity definitions** – Ensure that the definitions of entities and attributes used by Amazon Comprehend Medical align with your interpretation. If there are discrepancies, consider providing additional context or clarification to the LLM to bridge the gap between the Amazon Comprehend Medical output and your requirements. If an Amazon Comprehend Medical entity doesn't meet your expectations, you can implement custom entity detection by including additional instructions (and possibly examples) within the prompt.
+ **Provide domain-specific knowledge** – While Amazon Comprehend Medical provides valuable medical information, it might not capture all the nuances of your specific domain. Consider supplementing Amazon Comprehend Medical results with additional domain-specific knowledge sources, such as ontologies, terminologies, or expert-curated datasets. This provides more comprehensive context to the LLM.
+ **Adhere to ethical and regulatory guidelines** – When dealing with medical data, it's important to adhere to ethical principles and regulatory guidelines, such as those related to data privacy, security, and responsible use of AI systems in healthcare. Make sure that your implementation complies with relevant laws and industry best practices.

By following these best practices, AI/ML practitioners can effectively use the strengths of both Amazon Comprehend Medical and LLMs. For medical NLP tasks, these best practices help mitigate potential risks and can improve performance.

## Prompt engineering for Amazon Comprehend Medical context
<a name="comprehend-medical-rag-prompt-engineering"></a>

[Prompt engineering](https://aws.amazon.com/what-is/prompt-engineering/) is the process of designing and refining prompts to guide a generative AI solution to generate desired outputs. You choose the most appropriate formats, phrases, words, and symbols that guide the AI to interact with your users more meaningfully.

Depending on the API operation you perform, Amazon Comprehend Medical returns the detected entities, ontology codes and descriptions, and confidence scores. These results become context within the prompt when your solution invokes the target LLM. You must engineer the prompt to present the context within the prompt template.

**Note**  
The example prompts in this section follow [Anthropic guidance](https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/overview). If you're using a different LLM provider, follow the recommendations from that provider.

In general, you insert both the original medical text and the Amazon Comprehend Medical results into the prompt. The following is a common prompt structure:

```
<medical_text>
medical text
</medical_text>

<comprehend_medical_text_results>
comprehend medical text results
</comprehend_medical_text_results>

<prompt_instructions>
prompt instructions
</prompt_instructions>
```

This section provides strategies for including Amazon Comprehend Medical results as prompt context for the following common medical NLP tasks:
+ [Filter Amazon Comprehend Medical results](#prompt-engineering-filter-results)
+ [Extend medical NLP tasks with Amazon Comprehend Medical](#prompt-engineering-extend-nlp)
+ [Apply guardrails with Amazon Comprehend Medical](#prompt-engineering-guardrails)

### Filter Amazon Comprehend Medical results
<a name="prompt-engineering-filter-results"></a>

Amazon Comprehend Medical typically provides a large amount of information. You might want to reduce the number of results that the medical professional must review. In this case, you can use an LLM to filter these results. Amazon Comprehend Medical entities include a confidence score that you can use as a filtering mechanism when designing the prompt.

The following is an example patient note:

```
Carlie had a seizure 2 weeks ago. She is complaining of frequent headaches
Nausea is also present. She also complains of eye trouble with blurry vision
Meds : Topamax 50 mgs at breakfast daily,
Send referral order to neurologist
Follow-up as scheduled
```

In this patient note, Amazon Comprehend Medical detects the following entities.

![\[Entity detection in Amazon Comprehend Medical.\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/generative-ai-nlp-healthcare/images/comprehend-medical-entity-detection.png)


The entities link to the following ICD-10-CM codes for seizure and headaches.


| Category | ICD-10-CM code | ICD-10-CM description | Confidence score |
| --- | --- | --- | --- |
| Seizure | R56.9 | Unspecified convulsions | 0.8348 | 
| Seizure | G40.909 | Epilepsy, unspecified, not intractable, without status epilepticus | 0.5424 | 
| Seizure | R56.00 | Simple febrile convulsions | 0.4937 | 
| Seizure | G40.89 | Other seizures | 0.4397 | 
| Seizure | G40.409 | Other generalized epilepsy and epileptic syndromes, not intractable, without status epilepticus | 0.4138 | 
| Headaches | R51 | Headache | 0.4067 | 
| Headaches | R51.9 | Headache, unspecified | 0.3844 | 
| Headaches | G44.52 | New daily persistent headache (NDPH) | 0.3005 | 
| Headaches | G44 | Other headache syndrome | 0.2670 | 
| Headaches | G44.8 | Other specified headache syndromes | 0.2542 | 

You can pass ICD-10-CM codes into the prompt to increase LLM precision. To reduce noise, you can filter the ICD-10-CM codes by using the confidence score included in the Amazon Comprehend Medical results. The following is an example prompt that includes only ICD-10-CM codes that have a confidence score higher than 0.4:

```
<patient_note>
Carlie had a seizure 2 weeks ago. She is complaining of frequent headaches
Nausea is also present. She also complains of eye trouble with blurry vision
Meds : Topamax 50 mgs at breakfast daily,
Send referral order to neurologist
Follow-up as scheduled
</patient_note>

<comprehend_medical_results>
<icd-10>
  <entity>
    <text>seizure</text>
    <code>
      <description>Unspecified convulsions</description>
      <code_value>R56.9</code_value>
      <score>0.8347607851028442</score>
    </code>
    <code>
      <description>Epilepsy, unspecified, not intractable, without status epilepticus</description>
      <code_value>G40.909</code_value>
      <score>0.542376697063446</score>
    </code>
    <code>
      <description>Other seizures</description>
      <code_value>G40.89</code_value>
      <score>0.43966275453567505</score>
    </code>
    <code>
      <description>Other generalized epilepsy and epileptic syndromes, not intractable, without status epilepticus</description>
      <code_value>G40.409</code_value>
      <score>0.41382506489753723</score>
    </code>
  </entity>
  <entity>
    <text>headaches</text>
    <code>
      <description>Headache</description>
      <code_value>R51</code_value>
      <score>0.4066613018512726</score>
    </code>
  </entity>
  <entity>
    <text>Nausea</text>
    <code>
      <description>Nausea</description>
      <code_value>R11.0</code_value>
      <score>0.6460834741592407</score>
    </code>
  </entity>
  <entity>
    <text>eye trouble</text>
    <code>
      <description>Unspecified disorder of eye and adnexa</description>
      <code_value>H57.9</code_value>
      <score>0.6780954599380493</score>
    </code>
    <code>
      <description>Unspecified visual disturbance</description>
      <code_value>H53.9</code_value>
      <score>0.5871203541755676</score>
    </code>
    <code>
      <description>Unspecified disorder of binocular vision</description>
      <code_value>H53.30</code_value>
      <score>0.5539672374725342</score>
    </code>
  </entity>
  <entity>
    <text>blurry vision</text>
    <code>
      <description>Other visual disturbances</description>
      <code_value>H53.8</code_value>
      <score>0.9001834392547607</score>
    </code>
  </entity>
</icd-10>
</comprehend_medical_results>

<prompt>
Given the patient note and Amazon Comprehend Medical ICD-10-CM code results above, please select the most relevant ICD-10-CM diagnosis codes for the patient. 
For each selected code, provide a brief explanation of why it is relevant based on the information in the patient note.
</prompt>
```
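
The confidence-score filter illustrated above can be applied programmatically before the prompt is assembled. The following is a minimal sketch; it assumes the Amazon Comprehend Medical results have already been parsed into `(code, description, score)` tuples per detected entity, and the function name is an illustrative choice:

```python
def filter_codes_by_confidence(entities, threshold=0.4):
    """Drop ICD-10-CM candidates whose confidence score is at or below the threshold."""
    filtered = {}
    for entity_text, codes in entities.items():
        kept = [(code, desc, score) for code, desc, score in codes if score > threshold]
        if kept:
            filtered[entity_text] = kept
    return filtered


# Abbreviated scores from the example table above.
entities = {
    "seizure": [
        ("R56.9", "Unspecified convulsions", 0.8348),
        ("G40.909", "Epilepsy, unspecified, not intractable", 0.5424),
        ("G40.409", "Other generalized epilepsy", 0.4138),
    ],
    "headaches": [
        ("R51", "Headache", 0.4067),
        ("G44.52", "New daily persistent headache (NDPH)", 0.3005),
    ],
}
filtered = filter_codes_by_confidence(entities, threshold=0.4)
print({k: [c for c, _, _ in v] for k, v in filtered.items()})
```

Only the codes that pass the threshold are serialized into the `<comprehend_medical_results>` context block.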

### Extend medical NLP tasks with Amazon Comprehend Medical
<a name="prompt-engineering-extend-nlp"></a>

When processing medical text, context from Amazon Comprehend Medical can help the LLM to select better tokens. In this example, you want to match diagnosis symptoms to medications. You also want to find text that relates to medical tests, such as terms that relate to a blood panel test. You can use Amazon Comprehend Medical to detect the entities and the medication names. In this case, you would use the [DetectEntitiesV2](https://docs.aws.amazon.com/comprehend-medical/latest/dev/textanalysis-entitiesv2.html) and [InferRxNorm](https://docs.aws.amazon.com/comprehend-medical/latest/dev/ontology-RxNorm.html) APIs for Amazon Comprehend Medical.

The following is an example patient note:

```
Carlie had a seizure 2 weeks ago. She is complaining of increased frequent headaches
Given lyme disease symptoms such as muscle ache and stiff neck will order prescription.
Meds : Topamax 50 mgs at breakfast daily. Amoxicillan 25 mg by mouth twice a day
Place MRI radiology order at RadNet
```

To focus on the diagnosis codes, only the `MEDICAL_CONDITION` entities that have the `DX_NAME` type are used in the prompt. For medication entities, the medication name and its extracted attributes are included. All other entity metadata from Amazon Comprehend Medical is excluded because it isn't relevant. The following example prompt uses these filtered Amazon Comprehend Medical results. It is designed to more precisely link diagnosis codes with medications and to more precisely extract entities related to medical order tests:

```
<patient_note>
Carlie had a seizure 2 weeks ago. She is complaining of increased frequent headaches
Given lyme disease symptoms such as muscle ache and stiff neck will order prescription. 
Meds : Topamax 50 mgs at breakfast daily. Amoxicillan 25 mg by mouth twice a day
Place MRI radiology order at RadNet
</patient_note>

<detect_entity_results>
<entity>
    <text>seizure</text>
    <category>MEDICAL_CONDITION</category>
    <type>DX_NAME</type>
</entity>
<entity>
    <text>headaches</text>
    <category>MEDICAL_CONDITION</category>
    <type>DX_NAME</type>
</entity>
<entity>
    <text>lyme disease</text>
    <category>MEDICAL_CONDITION</category>
    <type>DX_NAME</type>
</entity>
<entity>
    <text>muscle ache</text>
    <category>MEDICAL_CONDITION</category>
    <type>DX_NAME</type>
</entity>
<entity>
    <text>stiff neck</text>
    <category>MEDICAL_CONDITION</category>
    <type>DX_NAME</type>
</entity>
</detect_entity_results>

<rx_results>
<entity>
    <text>Topamax</text>
    <category>MEDICATION</category>
    <type>BRAND_NAME</type>
    <attributes>
        <attribute>
            <type>FREQUENCY</type>
            <text>at breakfast daily</text>
        </attribute>
        <attribute>
            <type>DOSAGE</type>
            <text>50 mgs</text>
        </attribute>
        <attribute>
            <type>ROUTE_OR_MODE</type>
            <text>by mouth</text>
        </attribute>
    </attributes>
</entity>
<entity>
    <text>Amoxicillan</text>
    <category>MEDICATION</category>
    <type>GENERIC_NAME</type>
    <attributes>
        <attribute>
            <type>ROUTE_OR_MODE</type>
            <text>by mouth</text>
        </attribute>
        <attribute>
            <type>DOSAGE</type>
            <text>25 mg</text>
        </attribute>
        <attribute>
            <type>FREQUENCY</type>
            <text>twice a day</text>
        </attribute>
    </attributes>
</entity>
</rx_results>

<prompt>
Based on the patient note and the detected entities, can you please:
1. Link the diagnosis symptoms with the medications prescribed. 
Provide your reasoning for the linkages.
2. Extract any entities related to medical order tests mentioned in the note.
</prompt>
```
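
The filtering described above (keeping only `MEDICAL_CONDITION` entities with the `DX_NAME` type) can be done with a small helper before the prompt is built. This is an illustrative sketch; the entity dictionaries mirror the shape of the `DetectEntitiesV2` response fields used in this example:

```python
def select_diagnosis_entities(entities):
    """Keep only MEDICAL_CONDITION entities with the DX_NAME type, as used in the prompt."""
    return [
        e["Text"]
        for e in entities
        if e.get("Category") == "MEDICAL_CONDITION" and e.get("Type") == "DX_NAME"
    ]


# Illustrative entities shaped like DetectEntitiesV2 output for the note above.
entities = [
    {"Text": "seizure", "Category": "MEDICAL_CONDITION", "Type": "DX_NAME"},
    {"Text": "headaches", "Category": "MEDICAL_CONDITION", "Type": "DX_NAME"},
    {"Text": "Topamax", "Category": "MEDICATION", "Type": "BRAND_NAME"},
]
print(select_diagnosis_entities(entities))  # ['seizure', 'headaches']
```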

### Apply guardrails with Amazon Comprehend Medical
<a name="prompt-engineering-guardrails"></a>

You can use an LLM and Amazon Comprehend Medical to create guardrails before the generated response is used. You can run this workflow on either unmodified or post-processed medical text. Use cases include addressing protected health information (PHI), detecting hallucinations, or implementing custom policies for publishing results. For example, you can use context from Amazon Comprehend Medical to identify PHI data and then use the LLM to remove that PHI data.

The following is an example of information from a patient record that includes PHI:

```
Patient name: John Doe
Patient SSN: 123-34-5678
Patient DOB: 01/01/2024
Patient address: 123 Main St, Anytown USA
Exam details: good health. Pulse is 60 bpm. needs to work on diet with BMI of 190
```

The following is an example prompt that includes the Amazon Comprehend Medical results as context:

```
<original_text>
Patient name: John Doe
Patient SSN: 123-34-5678 Patient DOB: 01/01/2024
Patient address: 123 Main St, Anytown USA
Exam details: good health. Pulse is 60 bpm. needs to work on diet with BMI of 190
</original_text>

<comprehend_medical_phi_entities>
<entity>
  <text>John Doe</text>
  <category>PROTECTED_HEALTH_INFORMATION</category>
  <score>0.9967944025993347</score>
  <type>NAME</type>
</entity>
<entity>
  <text>123-34-5678</text>
  <category>PROTECTED_HEALTH_INFORMATION</category>
  <score>0.9998034834861755</score>
  <type>ID</type>
</entity>
<entity>
  <text>01/01/2024</text>
  <category>PROTECTED_HEALTH_INFORMATION</category>
  <score>0.9964448809623718</score>
  <type>DATE</type>
</entity>
</comprehend_medical_phi_entities>

<instructions>
Using the provided original text and the Amazon Comprehend Medical PHI entities detected, please analyze the text to determine if it contains any additional protected health information (PHI) beyond the entities already identified. If additional PHI is found, please list and categorize it. If no additional PHI is found, please state that explicitly.
In addition if PHI is found, generate updated text with the PHI removed. 
</instructions>
```
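
In addition to asking the LLM to remove PHI, the detected entities can drive a deterministic redaction pass before the text leaves your control. The following is a minimal sketch, assuming PHI entities shaped like the `DetectPHI` results shown above; the function name and placeholder are illustrative:

```python
def redact_phi(text, phi_entities, placeholder="[REDACTED]"):
    """Replace each detected PHI span with a placeholder before the text is shared."""
    for entity in phi_entities:
        text = text.replace(entity["Text"], placeholder)
    return text


record = "Patient name: John Doe\nPatient SSN: 123-34-5678\nExam details: good health."
phi = [
    {"Text": "John Doe", "Type": "NAME"},
    {"Text": "123-34-5678", "Type": "ID"},
]
print(redact_phi(record, phi))
```

A deterministic pass like this can serve as a guardrail even if the LLM also checks for additional PHI, because it does not depend on the model's output.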

# Using large language models for healthcare and life science use cases
<a name="llms"></a>

This section describes how you can use large language models (LLMs) for healthcare and life science applications. Some use cases require the generative AI capabilities of a large language model. Even the most state-of-the-art LLMs have advantages and limitations, and the recommendations in this section are designed to help you achieve your target results.

You can use the decision path to determine the appropriate LLM solution for your use case, considering factors such as domain knowledge and available training data. Additionally, this section discusses popular pretrained medical LLMs and best practices for their selection and use. It also discusses the trade-offs between complex, high-performance solutions and simpler, lower-cost approaches.

## Use cases for an LLM
<a name="llm-use-cases"></a>

Amazon Comprehend Medical can perform specific NLP tasks. For more information, see [Use cases for Amazon Comprehend Medical](comprehend-medical.md#comprehend-medical-use-cases).

The logical and generative AI capabilities of an LLM might be required for advanced healthcare and life science use cases, such as the following:
+ Classifying custom medical entities or text categories
+ Answering clinical questions
+ Summarizing medical reports
+ Generating and detecting insights from medical information

## Customization approaches
<a name="llm-customization"></a>

It's critical to understand how LLMs are implemented. LLMs are commonly trained with billions of parameters, on training data from many domains. This training allows the LLM to address most generalized tasks. However, challenges often arise when domain-specific knowledge is required. Examples of domain knowledge in healthcare and life science are clinical codes, medical terminology, and health information that is required to generate accurate answers. Therefore, using the LLM as is (zero-shot prompting without supplementing domain knowledge) for these use cases is likely to produce inaccurate results. There are several popular approaches you can use to overcome this challenge: prompt engineering, Retrieval Augmented Generation (RAG), and fine-tuning.

### Prompt engineering
<a name="llm-customization-prompt-engineering"></a>

*Prompt engineering* is the process where you guide generative AI solutions to create the desired outputs by adjusting the inputs to the LLM. By crafting precise prompts with relevant context, it's possible to guide the model towards completion of specialized healthcare tasks that require reasoning. Effective prompt engineering can significantly improve model performance for healthcare use cases without requiring model modifications. For more information about prompt engineering, see [Implementing advanced prompt engineering with Amazon Bedrock](https://aws.amazon.com/blogs/machine-learning/implementing-advanced-prompt-engineering-with-amazon-bedrock/) (AWS blog post). Few-shot prompting and chain-of-thought prompting are techniques that you can use in prompt engineering.

#### Few-shot prompting
<a name="few-shot-prompting"></a>

Few-shot prompting is a technique where you provide the LLM with a few examples of the desired input-output before asking it to perform a similar task. In healthcare contexts, this approach is particularly valuable for specialized tasks, such as medical entity recognition or clinical note summarization. By including 3–5 high-quality examples in your prompt, you can significantly improve the model's understanding of medical terminology and domain-specific patterns. For an example of few-shot prompting, see [Few-shot prompt engineering and fine-tuning for LLMs in Amazon Bedrock](https://aws.amazon.com/blogs/machine-learning/few-shot-prompt-engineering-and-fine-tuning-for-llms-in-amazon-bedrock/) (AWS blog post).

For example, when you extract medication dosages from clinical notes, you can provide examples of different notation styles that help the model recognize variations in how healthcare professionals document prescriptions. This approach is especially effective when working with standardized documentation formats or when consistent patterns exist in the data.
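A minimal sketch of this pattern in Python follows. The examples and JSON field names are hypothetical; a real prompt should use examples drawn from your own documentation formats.

```python
# Hypothetical input-output examples; replace these with examples from
# your own clinical documentation formats.
FEW_SHOT_EXAMPLES = [
    ("Metformin 500 mg PO BID",
     '{"drug": "metformin", "dose": "500 mg", "route": "oral", "frequency": "twice daily"}'),
    ("lisinopril 10mg daily by mouth",
     '{"drug": "lisinopril", "dose": "10 mg", "route": "oral", "frequency": "daily"}'),
    ("ASA 81 mg po qd",
     '{"drug": "aspirin", "dose": "81 mg", "route": "oral", "frequency": "daily"}'),
]


def build_few_shot_prompt(note):
    """Build a few-shot prompt that shows the LLM several notation styles
    before asking it to extract from a new clinical note."""
    lines = ["Extract the medication order from each clinical note as JSON.", ""]
    for example_note, extraction in FEW_SHOT_EXAMPLES:
        lines.append(f"Note: {example_note}")
        lines.append(f"Extraction: {extraction}")
        lines.append("")
    lines.append(f"Note: {note}")
    lines.append("Extraction:")
    return "\n".join(lines)
```

Because the prompt ends at `Extraction:`, the model's completion is the structured output itself.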

#### Chain-of-thought prompting
<a name="chain-of-thought-prompting"></a>

*Chain-of-thought (CoT) prompting* guides the LLM through a step-by-step reasoning process. This makes it valuable for complex medical decision support and diagnostic reasoning tasks. By explicitly instructing the model to "think step by step" when analyzing clinical scenarios, you can improve its ability to follow medical reasoning protocols and reduce diagnostic errors.

This technique excels when clinical reasoning requires multiple logical steps, such as differential diagnosis or treatment planning. However, this approach has limitations when dealing with highly specialized medical knowledge outside the model's training data or when absolute precision is required for critical care decisions.

In these cases, combining CoT with another approach can yield better results. One option is to combine CoT with self-consistency prompting. For more information, see [Enhance performance of generative language models with self-consistency prompting on Amazon Bedrock](https://aws.amazon.com/blogs/machine-learning/enhance-performance-of-generative-language-models-with-self-consistency-prompting-on-amazon-bedrock/) (AWS blog post). Another option is to combine reasoning frameworks, such as ReAct prompting, with RAG. For more information, see [Develop advanced generative AI chat-based assistants by using RAG and ReAct prompting](https://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/develop-advanced-generative-ai-chat-based-assistants-by-using-rag-and-react-prompting.html) (AWS Prescriptive Guidance).
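The combination of CoT and self-consistency can be sketched as follows. The instruction text and the `Final answer:` convention are illustrative; the majority vote over several sampled completions is the self-consistency step.

```python
from collections import Counter

# Illustrative CoT instruction; adapt the steps to your clinical protocol.
COT_INSTRUCTION = (
    "You are assisting with a differential diagnosis. Think step by step: "
    "list the key findings, enumerate candidate conditions, weigh the "
    "evidence for and against each, and only then give your conclusion on "
    "a line that begins with 'Final answer:'."
)


def extract_final_answer(completion):
    """Pull the conclusion out of a CoT completion."""
    for line in completion.splitlines():
        if line.startswith("Final answer:"):
            return line[len("Final answer:"):].strip()
    return None


def self_consistent_answer(completions):
    """Self-consistency: sample several CoT completions (at a nonzero
    temperature) and keep the most frequent final answer."""
    answers = [a for a in map(extract_final_answer, completions) if a]
    return Counter(answers).most_common(1)[0][0]
```

Each completion is generated independently with the same prompt; agreement across divergent reasoning paths is what makes the voted answer more reliable.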

### Retrieval Augmented Generation
<a name="llm-customization-rag"></a>

*Retrieval Augmented Generation (RAG)* is a generative AI technology in which an LLM references an authoritative data source that is outside of its training data sources before generating a response. A RAG system can retrieve medical ontology information (such as international classifications of diseases, national drug files, and medical subject headings) from a knowledge source. This provides additional context to the LLM to support the medical NLP task.

As discussed in the [Combining Amazon Comprehend Medical with large language models](comprehend-medical-rag.md) section, you can use a RAG approach to retrieve context from Amazon Comprehend Medical. Other common knowledge sources include medical domain data that is stored in a database service, such as Amazon OpenSearch Service, Amazon Kendra, or Amazon Aurora. Extracting information from these knowledge sources can affect retrieval performance, especially with semantic queries that use a vector database.

Another option for storing and retrieving domain-specific knowledge is by using [Amazon Q Business](https://docs.aws.amazon.com/amazonq/latest/qbusiness-ug/what-is.html) in your RAG workflow. Amazon Q Business can index internal document repositories or public-facing web sites (such as [CMS.gov](https://cms.gov/) for ICD-10 data). Amazon Q Business can then extract relevant information from these sources before passing your query to the LLM.

There are multiple ways to build a custom RAG workflow. For example, there are many ways to retrieve data from a knowledge source. For simplicity, we recommend the common retrieval approach of using a vector database, such as Amazon OpenSearch Service, to store knowledge as embeddings. This requires that you use an embedding model, such as a sentence transformer, to generate embeddings for the query and for the knowledge stored in the vector database.

For more information about fully managed and custom RAG approaches, see [Retrieval Augmented Generation options and architectures on AWS](https://docs.aws.amazon.com/prescriptive-guidance/latest/retrieval-augmented-generation-options/introduction.html).
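A minimal sketch of the retrieval step follows. The k-NN query builder is pure Python, and the embedding call uses an Amazon Titan embedding model through the Amazon Bedrock runtime. The vector field name `embedding` and the model ID are assumptions; substitute the ones you use.

```python
import json


def build_knn_query(query_vector, k=5):
    """OpenSearch k-NN query body; assumes the knowledge embeddings are
    stored in a vector field named 'embedding'."""
    return {
        "size": k,
        "query": {"knn": {"embedding": {"vector": query_vector, "k": k}}},
    }


def embed_text(text, region="us-east-1"):
    """Generate an embedding for the query with an Amazon Titan embedding
    model through the Amazon Bedrock runtime."""
    import boto3  # imported here so build_knn_query works without AWS credentials

    client = boto3.client("bedrock-runtime", region_name=region)
    response = client.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text}),
    )
    return json.loads(response["body"].read())["embedding"]
```

You would pass `build_knn_query(embed_text(question))` to the OpenSearch `search` API and then place the retrieved passages in the LLM prompt as context.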

### Fine-tuning
<a name="llm-customization-fine-tuning"></a>

*Fine-tuning* an existing model involves taking an LLM, such as an Amazon Titan, Mistral, or Llama model, and then adapting the model to your custom data. There are various techniques for fine-tuning, most of which involve modifying only a few parameters instead of modifying all of the parameters in the model. This is called *parameter-efficient fine-tuning (PEFT)*. For more information, see [Hugging Face PEFT](https://github.com/huggingface/peft) on GitHub.

The following are two common use cases when you might choose to fine-tune an LLM for a medical NLP task:
+ **Generative task** – Decoder-based models perform generative AI tasks. AI/ML practitioners use ground truth data to fine-tune an existing LLM. For example, you might train the LLM by using [MedQuAD](https://github.com/abachaa/MedQuAD), a public medical question-answering dataset. When you invoke a query to the fine-tuned LLM, you don't need a RAG approach to provide the additional context to the LLM.
+ **Embeddings** – Encoder-based models generate embeddings by transforming text into numerical vectors. These encoder-based models are typically called *embedding models*. A *sentence-transformer model* is a specific type of embedding model that is optimized for sentences. The objective is to generate embeddings from input text. The embeddings are then used for semantic analysis or in retrieval tasks. To fine-tune the embedding model, you must have a corpus of medical knowledge, such as documents, that you can use as training data. Fine-tuning a sentence-transformer model uses pairs of text that are labeled by similarity or sentiment. For more information, see [Training and Finetuning Embedding Models with Sentence Transformers v3](https://huggingface.co/blog/train-sentence-transformers) on Hugging Face.

You can use [Amazon SageMaker Ground Truth](https://docs.aws.amazon.com/sagemaker/latest/dg/sms.html) to build a high-quality, labeled training dataset. You can use the labeled dataset output from Ground Truth to train your own models. You can also use the output as a training dataset for an Amazon SageMaker AI model. For more information about named entity recognition, single label text classification, and multi-label text classification, see [Text labeling with Ground Truth](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-label-text.html) in the Amazon SageMaker AI documentation.

For more information about fine-tuning, see [Fine-tuning large language models in healthcare](fine-tuning.md) in this guide.

## Choosing an LLM
<a name="llm-selection"></a>

[Amazon Bedrock](https://docs.aws.amazon.com/bedrock/latest/userguide/what-is-bedrock.html) is the recommended starting point to evaluate high-performing LLMs. For more information, see [Supported foundation models in Amazon Bedrock](https://docs.aws.amazon.com/bedrock/latest/userguide/models-supported.html). You can use model evaluation jobs in Amazon Bedrock to compare the outputs from multiple models and then choose the model that is best suited for your use case. For more information, see [Choose the best performing model using Amazon Bedrock evaluations](https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation.html) in the Amazon Bedrock documentation.

Some LLMs have limited training on medical domain data. If your use case requires fine-tuning an LLM or an LLM that Amazon Bedrock doesn't support, consider using [Amazon SageMaker AI](https://docs.aws.amazon.com/sagemaker/latest/dg/whatis.html). In SageMaker AI, you can use a fine-tuned LLM or choose a custom LLM that has been trained on medical domain data.

The following table lists popular LLMs that have been trained on medical domain data.


| LLM | Tasks | Knowledge | Architecture | 
| --- |--- |--- |--- |
| [BioBERT](https://github.com/dmis-lab/biobert) | Information retrieval, text classification, and named entity recognition | Abstracts from PubMed, full-text articles from PubMedCentral, and general domain knowledge | Encoder | 
| [ClinicalBERT](https://github.com/kexinhuang12345/clinicalBERT) | Information retrieval, text classification, and named entity recognition | Large, multi-center dataset along with over 3,000,000 patient records from electronic health record (EHR) systems | Encoder | 
| [ClinicalGPT](https://huggingface.co/medicalai/ClinicalGPT-base-zh) | Summarization, question-answering, and text generation | Extensive and diverse medical datasets, including medical records, domain-specific knowledge, and multi-round dialogue consultations | Decoder | 
| [GatorTron-OG](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/clara/models/gatortron_og) | Summarization, question-answering, text generation, and information retrieval | Clinical notes and biomedical literature | Encoder | 
| [Med-BERT](https://github.com/ZhiGroup/Med-BERT) | Information retrieval, text classification, and named entity recognition | Large dataset of medical texts, clinical notes, research papers, and healthcare-related documents | Encoder | 
| [Med-PaLM](https://sites.research.google/med-palm/) | Question-answering for medical purposes | Datasets of medical and biomedical text | Decoder | 
| [medAlpaca](https://github.com/kbressem/medAlpaca) | Question-answering and medical dialogue tasks | A variety of medical texts, encompassing resources such as medical flashcards, wikis, and dialogue datasets | Decoder | 
| [BiomedBERT](https://huggingface.co/microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext) | Information retrieval, text classification, and named entity recognition | Exclusively abstracts from PubMed and full-text articles from PubMedCentral | Encoder | 
| [BioMedLM](https://github.com/stanford-crfm/BioMedLM) | Summarization, question-answering, and text generation | Biomedical literature from PubMed knowledge sources | Decoder | 

The following are best practices for using pretrained medical LLMs:
+ Understand the training data and its relevance to your medical NLP task.
+ Identify the LLM architecture and its purpose. Encoders are appropriate for embeddings and NLP tasks. Decoders are for generation tasks.
+ Evaluate the infrastructure, performance, and cost requirements for hosting the pretrained medical LLM.
+ If fine-tuning is required, ensure accurate ground truth or knowledge for the training data. Make sure that you mask or redact any personally identifiable information (PII) or protected health information (PHI).

Real-world medical NLP tasks might require knowledge or behavior that pretrained LLMs don't cover. If a domain-specific LLM does not meet your evaluation benchmarks, you can fine-tune an LLM with your own dataset, or you can train a new foundation model. Training a new foundation model is an ambitious, and often expensive, undertaking. For most use cases, we recommend fine-tuning an existing model.

When you use or fine-tune a pretrained medical LLM, it's important to address infrastructure, security, and guardrails.

### Infrastructure
<a name="llm-selection-infrastructure"></a>

Compared to using Amazon Bedrock for on-demand or batch inference, hosting pretrained medical LLMs (commonly from Hugging Face) requires significant resources. To host pretrained medical LLMs, it's common to use an Amazon SageMaker AI image that runs on an Amazon Elastic Compute Cloud (Amazon EC2) instance with one or more GPUs, such as ml.g5 instances for accelerated computing or ml.inf2 instances for AWS Inferentia. This is because LLMs consume a large amount of memory and disk space.

### Security and guardrails
<a name="llm-selection-guardrails"></a>

Depending on your business compliance requirements, consider using Amazon Comprehend and Amazon Comprehend Medical to mask or redact personally identifiable information (PII) and protected health information (PHI) from training data. This helps prevent the LLM from using confidential data when it generates responses.

We recommend that you consider and evaluate bias, fairness, and hallucinations in your generative AI applications. Whether you are using a preexisting LLM or fine-tuning one, implement guardrails to prevent harmful responses. *Guardrails* are safeguards that you customize to your generative AI application requirements and responsible AI policies. For example, you can use [Amazon Bedrock Guardrails](https://docs.aws.amazon.com/bedrock/latest/userguide/guardrails.html).

# Fine-tuning large language models in healthcare
<a name="fine-tuning"></a>

The fine-tuning approach described in this section supports compliance with ethical and regulatory guidelines and promotes the responsible use of AI systems in healthcare. It is designed to generate insights that are accurate and private. Generative AI is revolutionizing healthcare delivery, but off-the-shelf models often fall short in clinical environments where accuracy is critical and compliance is non-negotiable. Fine-tuning foundation models with domain-specific data bridges this gap. It helps you create AI systems that speak the language of medicine while adhering to strict regulatory standards. However, the path to successful fine-tuning requires careful navigation of healthcare's unique challenges: protecting sensitive data, justifying AI investments with measurable outcomes, and maintaining clinical relevance in fast-evolving medical landscapes.

When lighter-weight approaches reach their limits, fine-tuning becomes a strategic investment. The expectation is that the gains in accuracy, latency, or operational efficiency will offset the significant compute and engineering costs required. It's important to remember that the pace of progress in foundation models is rapid, so a fine-tuned model's advantage might last only until the next major model release.

This section anchors the discussion in the following two high-impact use cases from AWS healthcare customers:
+ **Clinical decision support systems** – Enhance diagnostic accuracy through models that understand complex patient histories and evolving guidelines. Fine-tuning can help models deeply understand complex patient histories and integrate specialized guidelines. This can potentially reduce model prediction errors. However, you need to weigh these gains against the cost of training on large, sensitive datasets and the infrastructure required for high-stakes clinical applications. Will the improved accuracy and context-awareness justify the investment, especially when new models are released frequently?
+ **Medical document analysis** – Automate the processing of clinical notes, imaging reports, and insurance documents while maintaining Health Insurance Portability and Accountability Act (HIPAA) compliance. Here, fine-tuning may enable the model to handle unique formats, specialized abbreviations, and regulatory requirements more effectively. The payoff is often seen in reduced manual review time and improved compliance. Still, it's essential to assess whether these improvements are substantial enough to warrant the fine-tuning resources. Determine whether prompt engineering and workflow orchestration can meet your needs.

These real-world scenarios illustrate the fine-tuning journey, from initial experimentation to model deployment, while addressing healthcare's unique requirements at every stage.

## Estimating costs and return on investment
<a name="fine-tuning-costs"></a>

The following are cost factors that you must consider when fine-tuning an LLM:
+ **Model size** – Larger models cost more to fine-tune
+ **Dataset size** – The compute costs and time increase with the size of the dataset for fine-tuning
+ **Fine-tuning strategy** – Parameter-efficient methods can reduce costs compared to full parameter updates

When calculating the return on investment (ROI), consider the improvement in your chosen metrics (such as accuracy) multiplied by the volume of requests (how often the model will be used) and the expected duration before the model is surpassed by newer versions.

Also, consider the lifespan of your base LLM. New base models emerge every 6–12 months. If your rare disease detector takes 8 months to fine-tune and validate, you might only get 4 months of superior performance before newer models close the gap.

By calculating the costs, ROI, and potential lifespan for your use case, you can make a data-driven decision. For example, if fine-tuning your clinical decision support model leads to a measurable reduction in diagnostic errors across thousands of cases per year, the investment might quickly pay off. Conversely, if prompt engineering alone brings your document analysis workflow close to your target accuracy, it might be wise to hold off on fine-tuning until the next generation of models arrives.
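The calculation described above can be sketched as a simple back-of-the-envelope function. All inputs are estimates that you supply; the figures in the assertions below are illustrative only.

```python
def fine_tuning_roi(accuracy_gain, value_per_improved_request,
                    requests_per_month, useful_months, fine_tuning_cost):
    """Expected benefit of a fine-tuned model over its useful lifespan,
    minus the cost of producing it. A positive result suggests the
    investment pays off before newer base models close the gap."""
    benefit = (accuracy_gain * value_per_improved_request
               * requests_per_month * useful_months)
    return benefit - fine_tuning_cost
```

For example, a 2-point accuracy gain (`0.02`) worth an estimated $50 per corrected request, at 10,000 requests per month over a 4-month useful lifespan, offsets a $30,000 fine-tuning cost with room to spare; the same gain at one-tenth the request volume does not.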

Fine-tuning isn't one-size-fits-all. If you decide to fine-tune, the right approach depends on your use case, data, and resources.

## Choosing a fine-tuning strategy
<a name="fine-tuning-strategy"></a>

After you've determined that fine-tuning is the right approach for your healthcare use case, the next step is selecting the most appropriate fine-tuning strategy. There are several approaches available. Each has distinct advantages and trade-offs for healthcare applications. The choice between these methods depends on your specific objectives, available data, and resource constraints.

### Training objectives
<a name="fine-tuning-strategy-training-objectives"></a>

[Domain-adaptive pre-training (DAPT)](https://arxiv.org/abs/2504.09687) is an unsupervised method that involves pre-training the model on a large body of domain-specific, unlabeled text (such as millions of medical documents). This approach is well suited for improving the model's ability to understand medical specialty abbreviations and the terminology used by radiologists, neurologists, and other specialized providers. However, DAPT requires vast amounts of data and doesn't address specific task outputs.

[Supervised fine-tuning (SFT)](https://arxiv.org/abs/2506.14681) teaches the model to follow explicit instructions by using structured input-output examples. This approach excels for medical document analysis workflows, such as document summarization or clinical coding. *Instruction tuning* is a common form of SFT where the model is trained on examples that include explicit instructions paired with desired outputs. This enhances the model's ability to understand and follow diverse user prompts. This technique is particularly valuable in healthcare settings because it trains the model with specific clinical examples. The main drawback is that it requires carefully labeled examples. In addition, the fine-tuned model might struggle with edge cases that aren't represented in the examples. For instructions about fine-tuning with Amazon SageMaker JumpStart, see [Instruction fine-tuning for FLAN T5 XL with Amazon SageMaker Jumpstart](https://aws.amazon.com/blogs/machine-learning/instruction-fine-tuning-for-flan-t5-xl-with-amazon-sagemaker-jumpstart/) (AWS blog post).

[Reinforcement learning from human feedback (RLHF)](https://aws.amazon.com/what-is/reinforcement-learning-from-human-feedback/) optimizes model behavior based on expert feedback and preferences. RLHF uses a reward model trained on human preferences, together with methods such as [proximal policy optimization (PPO)](https://docs.aws.amazon.com/nova/latest/userguide/customize-fine-tune-hyperpod-ppo.html) or [direct preference optimization (DPO)](https://docs.aws.amazon.com/nova/latest/userguide/customize-fine-tune-hyperpod-dpo.html), to optimize the model while preventing destructive updates. RLHF is ideal for aligning outputs with clinical guidelines and making sure that recommendations stay within approved protocols. This approach requires significant clinician time for feedback and involves a complex training pipeline. However, RLHF is particularly valuable in healthcare because it helps medical experts shape how AI systems communicate and make recommendations. For example, clinicians can provide feedback to make sure that the model maintains an appropriate bedside manner, knows when to express uncertainty, and stays within clinical guidelines. Techniques such as PPO iteratively optimize model behavior based on expert feedback while constraining parameter updates to preserve core medical knowledge. This allows models to convey complex diagnoses in patient-friendly language while still flagging serious conditions for immediate medical attention. This is crucial for healthcare where both accuracy and communication style matter. For more information about RLHF, see [Fine-tune large language models with reinforcement learning from human or AI feedback](https://aws.amazon.com/blogs/machine-learning/fine-tune-large-language-models-with-reinforcement-learning-from-human-or-ai-feedback/) (AWS blog post).

### Implementation methods
<a name="fine-tuning-strategy-implementation"></a>

A *full parameter update* involves updating all model parameters during training. This approach works best for clinical decision support systems that require deep integration of patient histories, lab results, and evolving guidelines. The drawbacks include high compute cost and risk of overfitting if your dataset isn't large and diverse.

[Parameter-efficient fine-tuning (PEFT)](https://arxiv.org/abs/2312.12148) methods update only a subset of parameters to prevent overfitting or a catastrophic loss of language capabilities. Types include [low-rank adaptation (LoRA)](https://arxiv.org/abs/2106.09685), adapters, and prefix-tuning. PEFT methods offer lower computational cost, faster training, and are great for experiments such as adapting a clinical decision support model to a new hospital's protocols or terminology. The main limitation is potentially reduced performance compared to full parameter updates.
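To make the trade-off concrete, the following sketch shows a typical LoRA configuration using the Hugging Face `peft` package, plus a helper that estimates how many trainable parameters the adapters add. The `target_modules` names vary by model architecture and are an assumption here.

```python
def lora_trainable_params(rank, d_in, d_out, num_matrices):
    """Parameters added by LoRA: each adapted weight matrix gains a
    (rank x d_in) A matrix and a (d_out x rank) B matrix."""
    return num_matrices * rank * (d_in + d_out)


def make_lora_config(rank=16, alpha=32):
    """A typical LoRA configuration for a causal LM."""
    from peft import LoraConfig  # requires the Hugging Face peft package

    return LoraConfig(
        r=rank,
        lora_alpha=alpha,
        target_modules=["q_proj", "v_proj"],  # attention projections; names vary by model
        lora_dropout=0.05,
        task_type="CAUSAL_LM",
    )
```

Adapting only the attention projections of a multi-billion-parameter model at a modest rank typically trains well under 1 percent of the parameters that a full parameter update would touch.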

For more information about fine-tuning methods, see [Advanced fine-tuning methods on Amazon SageMaker AI](https://aws.amazon.com/blogs/machine-learning/advanced-fine-tuning-methods-on-amazon-sagemaker-ai/) (AWS blog post).

## Building a fine-tuning dataset
<a name="fine-tuning-dataset"></a>

The quality and diversity of the fine-tuning dataset is critical to model performance, safety, and bias prevention. The following are three critical areas to consider when building this dataset:
+ Volume based on fine-tuning approach
+ Data annotation from a domain expert
+ Diversity of the dataset

As shown in the following table, the dataset size requirements for fine-tuning vary based on the type of fine-tuning being performed.


| **Fine-tuning strategy** | **Dataset size** | 
| --- |--- |
| Domain-adaptive pre-training | 100,000+ domain texts | 
| Supervised fine-tuning | 10,000+ labeled pairs | 
| Reinforcement learning from human feedback | 1,000+ expert preference pairs | 

You can use [AWS Glue](https://docs.aws.amazon.com/glue/latest/dg/what-is-glue.html), [Amazon EMR](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-what-is-emr.html), and [Amazon SageMaker Data Wrangler](https://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler.html) to automate the data extraction and transformation process to curate a dataset that you own. If you are unable to curate a large enough dataset, you can discover and download datasets directly into your AWS account through [AWS Data Exchange](https://docs.aws.amazon.com/data-exchange/latest/userguide/what-is.html). Consult your legal counsel before using any third-party datasets.

Expert annotators with domain knowledge, such as medical doctors, biologists, and chemists, should be a part of the data curation process in order to incorporate the nuances of medical and biological data into the model output. [Amazon SageMaker Ground Truth](https://docs.aws.amazon.com/sagemaker/latest/dg/sms.html) provides a low-code user interface for experts to annotate the dataset.

A dataset that represents the human population is essential for healthcare and life sciences fine-tuning use cases to prevent bias and reflect real world results. [AWS Glue interactive sessions](https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions-chapter.html) or [Amazon SageMaker notebook instances](https://docs.aws.amazon.com/sagemaker/latest/dg/nbi.html) offer a powerful way to iteratively explore datasets and fine-tune transformations by using Jupyter-compatible notebooks. Interactive sessions enable you to work with a choice of popular integrated development environments (IDEs) in your local environment. Alternatively, you can work with AWS Glue or [Amazon SageMaker Studio](https://docs.aws.amazon.com/sagemaker/latest/dg/studio.html) notebooks through the AWS Management Console.

## Fine-tuning the model
<a name="fine-tuning-implementation"></a>

AWS provides services such as [Amazon SageMaker AI](https://docs.aws.amazon.com/sagemaker/latest/dg/whatis.html) and [Amazon Bedrock](https://docs.aws.amazon.com/bedrock/latest/userguide/what-is-bedrock.html) that are crucial for successful fine-tuning.

SageMaker AI is a fully managed machine learning service that helps developers and data scientists to build, train, and deploy ML models quickly. Three useful features of SageMaker AI for fine-tuning include:
+ [SageMaker Training](https://docs.aws.amazon.com/sagemaker/latest/dg/train-model.html) – A fully managed ML feature that helps you efficiently train a wide range of models at scale
+ [SageMaker JumpStart](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-jumpstart.html) – A capability that is built on top of SageMaker Training jobs to provide pre-trained models, built-in algorithms, and solution templates for ML tasks
+ [SageMaker HyperPod](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod.html) – A purpose-built infrastructure solution for distributed training of foundation models and LLMs

Amazon Bedrock is a fully managed service that provides access to high-performing foundation models through an API, with built-in security, privacy, and scalability features. The service provides the capability to fine-tune several available foundational models. For more information, see [Supported models and Regions for fine-tuning and continued pre-training](https://docs.aws.amazon.com/bedrock/latest/userguide/custom-model-supported.html) in the Amazon Bedrock documentation.

When approaching the fine-tuning process with either service, consider the base model, fine-tuning strategy, and infrastructure.

### Base model choice
<a name="base-model-choice.9ac49dfa-0d0f-5434-9ce4-383d4d601722"></a>

Closed-source models, such as Anthropic Claude and Amazon Nova, deliver strong out-of-the-box performance with managed compliance, but they limit fine-tuning flexibility to provider-supported options, such as the managed customization APIs in Amazon Bedrock. This constrains customizability, particularly for regulated healthcare use cases. In contrast, open-source models, such as Meta Llama, provide full control and flexibility across Amazon SageMaker AI services, making them ideal when you need to customize, audit, or deeply adapt a model to your specific data or workflow requirements.

### Fine-tuning strategy
<a name="fine-tuning-strategy.5c89c2cc-a11b-5bbd-9b85-f487aafd92ed"></a>

Simple instruction tuning can be handled by Amazon Bedrock [model customization](https://docs.aws.amazon.com/bedrock/latest/userguide/custom-models.html) or Amazon SageMaker JumpStart. Complex PEFT approaches, such as LoRA or adapters, require SageMaker Training jobs or the custom fine-tuning features in Amazon Bedrock. Distributed training for very large models is supported by SageMaker HyperPod.

### Infrastructure scale and control
<a name="infrastructure-scale-and-control.cdf2ff70-3099-570b-a6ed-d7ae06a019a0"></a>

Fully managed services, such as Amazon Bedrock, minimize infrastructure management and are ideal for organizations that prioritize ease of use and compliance. Semi-managed options, such as SageMaker JumpStart, offer some flexibility with less complexity. These options are suitable for rapid prototyping or when using pre-built workflows. Full control and customization come with SageMaker Training jobs and HyperPod, though these require more expertise and are best when you need to scale up for large datasets or require custom pipelines.

## Monitoring fine-tuned models
<a name="fine-tuning-monitoring"></a>

In healthcare and life sciences, monitoring LLM fine-tuning requires tracking multiple key performance indicators. Accuracy provides a baseline measurement, but this must be balanced against precision and recall, particularly in applications where misclassifications carry significant consequences. The F1-score helps address class imbalance issues that can be common in medical datasets. For more information, see [Evaluating LLMs for healthcare and life science applications](evaluation.md) in this guide.
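To make the class-imbalance point concrete, the following small sketch computes precision, recall, and F1 from raw confusion-matrix counts:

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Precision, recall, and F1 from confusion-matrix counts.
    F1 is the harmonic mean of precision and recall, so a model that
    misses most cases of a rare condition scores near zero on F1 even
    when its raw accuracy looks high."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, a model that finds only 1 of 10 positive cases has perfect precision but an F1 of roughly 0.18, which surfaces the failure that raw accuracy hides.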

Calibration metrics help you make sure that the model's confidence levels match real-world probabilities. [Fairness metrics](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-measure-data-bias.html) can help you detect potential biases across different patient demographics.

[MLflow](https://docs.aws.amazon.com/sagemaker/latest/dg/mlflow.html) is an open source solution that can help you track fine-tuning experiments. MLflow is natively supported within Amazon SageMaker AI, which helps you to visually compare metrics from training runs. For fine-tuning jobs on Amazon Bedrock, metrics are streamed to Amazon CloudWatch so that you can visualize the metrics in the CloudWatch console.