

# Best practices for Amazon Managed Workflows for Apache Airflow
<a name="best-practices"></a>

This guide describes the best practices we recommend when using Amazon Managed Workflows for Apache Airflow.

**Topics**
+ [Performance tuning for Apache Airflow on Amazon MWAA](best-practices-tuning.md)
+ [Managing Python dependencies in requirements.txt](best-practices-dependencies.md)

# Performance tuning for Apache Airflow on Amazon MWAA
<a name="best-practices-tuning"></a>

This topic describes how to tune the performance of an Amazon Managed Workflows for Apache Airflow environment by adjusting its configuration. For more information, refer to [Using Apache Airflow configuration options on Amazon MWAA](configuring-env-variables.md).

**Contents**
+ [Adding an Apache Airflow configuration option](#best-practices-tuning-console-add)
+ [Apache Airflow scheduler](#best-practices-tuning-scheduler)
  + [Parameters](#best-practices-tuning-scheduler-params)
  + [Limits](#best-practices-tuning-scheduler-limits)
+ [DAG folders](#best-practices-tuning-dag-folders)
  + [Parameters](#best-practices-tuning-dag-folders-params)
+ [DAG files](#best-practices-tuning-dag-files)
  + [Parameters](#best-practices-tuning-dag-files-params)
+ [Tasks](#best-practices-tuning-tasks)
  + [Parameters](#best-practices-tuning-tasks-params)

## Adding an Apache Airflow configuration option
<a name="best-practices-tuning-console-add"></a>

Use the following procedure to add an Airflow configuration option to your environment.

1. Open the [Environments](https://console.aws.amazon.com/mwaa/home#/environments) page on the Amazon MWAA console.

1. Choose an environment.

1. Choose **Edit**.

1. Choose **Next**.

1. Choose **Add custom configuration** in the **Airflow configuration options** pane.

1. Choose a configuration from the dropdown list and enter a value, or type a custom configuration and enter a value.

1. Choose **Add custom configuration** for each configuration you want to add.

1. Choose **Save**.

To learn more, refer to [Using Apache Airflow configuration options on Amazon MWAA](configuring-env-variables.md).
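You can also set configuration options programmatically. For example, the following AWS CLI sketch applies an option with `update-environment`; the environment name and option value are illustrative, and an environment update can take some time to complete:

```shell
# Apply an Airflow configuration option to an existing environment.
# "MyEnvironment" is a placeholder; substitute your environment's name.
aws mwaa update-environment \
    --name MyEnvironment \
    --airflow-configuration-options '{"core.dagbag_import_timeout": "60"}'
```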

## Apache Airflow scheduler
<a name="best-practices-tuning-scheduler"></a>

The Apache Airflow scheduler is a core component of Apache Airflow. An issue with the scheduler can prevent DAGs from being parsed and tasks from being scheduled. For more information about Apache Airflow scheduler tuning, refer to [Fine-tuning your scheduler performance](https://airflow.apache.org/docs/apache-airflow/2.2.2/concepts/scheduler.html#fine-tuning-your-scheduler-performance) in the Apache Airflow documentation.

### Parameters
<a name="best-practices-tuning-scheduler-params"></a>

This section describes the configuration options available for the Apache Airflow scheduler (Apache Airflow v2 and later) and their use cases.

------
#### [ Apache Airflow v3 ]


| Configuration | Use case | 
| --- | --- | 
|  **[celery.sync\_parallelism](https://airflow.apache.org/docs/apache-airflow/3.0.6/configurations-ref.html#parallelism)** The number of processes the Celery Executor uses to sync task state. **Default:** 1  |  You can use this option to prevent queue conflicts by limiting the processes the Celery Executor uses. By default, this option is set to `1` to prevent errors in delivering task logs to CloudWatch Logs. Setting the value to `0` means using the maximum number of processes, but can cause errors when delivering task logs.  | 
|  **[scheduler.scheduler\_idle\_sleep\_time](https://airflow.apache.org/docs/apache-airflow/3.0.6/configurations-ref.html#scheduler-idle-sleep-time)** The number of seconds to wait between consecutive DAG file processing in the scheduler "loop."  **Default:** 1  |  You can use this option to free up CPU usage on the scheduler by **increasing** the time the scheduler sleeps after it's finished retrieving DAG parsing results, finding and queuing tasks, and executing queued tasks in the *Executor*. Increasing this value also increases the interval between runs of the scheduler threads configured in `dag_processor.parsing_processes`, which can reduce the capacity of the schedulers to parse DAGs and increase the time it takes for DAGs to populate in the webserver.  | 
|  **[scheduler.max\_dagruns\_to\_create\_per\_loop](https://airflow.apache.org/docs/apache-airflow/3.0.6/configurations-ref.html#max-dagruns-to-create-per-loop)** The maximum number of DAGs to create *DagRuns* for per scheduler "loop." **Default:** 10  |  You can use this option to free up resources for scheduling tasks by **decreasing** the maximum number of *DagRuns* created per scheduler "loop."  | 
|  **[dag\_processor.parsing\_processes](https://airflow.apache.org/docs/apache-airflow/3.0.6/configurations-ref.html#parsing-processes)** The number of processes the scheduler can run in parallel to parse DAGs. **Default:** `(2 * number of vCPUs) - 1`  |  You can use this option to free up resources by **decreasing** the number of processes the scheduler runs in parallel to parse DAGs. We recommend keeping this number low if DAG parsing is impacting task scheduling. You **must** specify a value that's less than the vCPU count on your environment. To learn more, refer to [Limits](#best-practices-tuning-scheduler-limits).  | 

------
#### [ Apache Airflow v2 ]


| Configuration | Use case | 
| --- | --- | 
|  **[celery.sync\_parallelism](https://airflow.apache.org/docs/apache-airflow/2.10.3/configurations-ref.html#parallelism)** The number of processes the Celery Executor uses to sync task state. **Default:** 1  |  You can use this option to prevent queue conflicts by limiting the processes the Celery Executor uses. By default, this option is set to `1` to prevent errors in delivering task logs to CloudWatch Logs. Setting the value to `0` means using the maximum number of processes, but can cause errors when delivering task logs.  | 
|  **[scheduler.scheduler\_idle\_sleep\_time](https://airflow.apache.org/docs/apache-airflow/2.10.3/configurations-ref.html#scheduler-idle-sleep-time)** The number of seconds to wait between consecutive DAG file processing in the scheduler "loop."  **Default:** 1  |  You can use this option to free up CPU usage on the scheduler by **increasing** the time the scheduler sleeps after it's finished retrieving DAG parsing results, finding and queuing tasks, and executing queued tasks in the *Executor*. Increasing this value also increases the interval between runs of the scheduler threads configured in `scheduler.parsing_processes`, which can reduce the capacity of the schedulers to parse DAGs and increase the time it takes for DAGs to populate in the webserver.  | 
|  **[scheduler.max\_dagruns\_to\_create\_per\_loop](https://airflow.apache.org/docs/apache-airflow/2.10.3/configurations-ref.html#max-dagruns-to-create-per-loop)** The maximum number of DAGs to create *DagRuns* for per scheduler "loop." **Default:** 10  |  You can use this option to free up resources for scheduling tasks by **decreasing** the maximum number of *DagRuns* created per scheduler "loop."  | 
|  **[scheduler.parsing\_processes](https://airflow.apache.org/docs/apache-airflow/2.10.3/configurations-ref.html#parsing-processes)** The number of processes the scheduler can run in parallel to parse DAGs. **Default:** `(2 * number of vCPUs) - 1`  |  You can use this option to free up resources by **decreasing** the number of processes the scheduler runs in parallel to parse DAGs. We recommend keeping this number low if DAG parsing is impacting task scheduling. You **must** specify a value that's less than the vCPU count on your environment. To learn more, refer to [Limits](#best-practices-tuning-scheduler-limits).  | 

------

### Limits
<a name="best-practices-tuning-scheduler-limits"></a>

This section describes the limits to consider when adjusting the default parameters for the scheduler.<a name="scheduler-considerations"></a>

**scheduler.parsing\_processes, scheduler.max\_threads (v2 only)**  
Two threads are allowed per vCPU for an environment class. At least one thread must be reserved for the scheduler for an environment class. If you notice a delay in tasks being scheduled, you might need to increase your [environment class](environment-class.md). For example, a large environment has a 4 vCPU Fargate container instance for its scheduler. This means that a maximum of `7` total threads are available to use for other processes. That is, two threads multiplied by four vCPUs, minus one for the scheduler itself. The value you specify in `scheduler.max_threads` (v2 only) and `scheduler.parsing_processes` must not exceed the number of threads available for an environment class, as listed:  
+ **mw1.small** – Must not exceed `1` thread for other processes. The remaining thread is reserved for the scheduler.
+ **mw1.medium** – Must not exceed `3` threads for other processes. The remaining thread is reserved for the scheduler.
+ **mw1.large** – Must not exceed `7` threads for other processes. The remaining thread is reserved for the scheduler.
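The per-class limits above follow from a simple formula: two threads per vCPU, minus one reserved for the scheduler. A quick sketch (the vCPU counts per environment class are assumptions inferred from the limits listed above):

```python
# Threads available for scheduler.parsing_processes and other processes:
# 2 threads per vCPU, minus 1 thread reserved for the scheduler itself.
# vCPU counts per class are assumed from the limits above.
VCPUS = {"mw1.small": 1, "mw1.medium": 2, "mw1.large": 4}

def available_threads(environment_class: str) -> int:
    """Maximum threads left for processes other than the scheduler."""
    return 2 * VCPUS[environment_class] - 1
```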

## DAG folders
<a name="best-practices-tuning-dag-folders"></a>

The Apache Airflow scheduler continuously scans the DAGs folder on your environment, along with any `plugins.zip` files it contains, and parses any Python (`.py`) files containing “airflow” import statements. Any resulting Python DAG objects are then placed into a *DagBag* for that file, which the scheduler processes to determine what tasks, if any, need to be scheduled. DAG file parsing occurs regardless of whether the files contain any viable DAG objects.

### Parameters
<a name="best-practices-tuning-dag-folders-params"></a>

This section describes the configuration options available for the DAGs folder (Apache Airflow v2 and later) and their use cases.

------
#### [ Apache Airflow v3 ]


| Configuration | Use case | 
| --- | --- | 
|  **[dag\_processor.refresh\_interval](https://airflow.apache.org/docs/apache-airflow/3.0.6/configurations-ref.html#config-dag-processor-refresh-interval)** The number of seconds between scans of the DAGs folder for new files. **Default:** 300 seconds  |  You can use this option to free up resources by **increasing** the interval at which the DAGs folder is parsed. We recommend increasing this value if you experience long parsing times in `total_parse_time` metrics, which might be due to a large number of files in your DAGs folder.  | 
|  **[dag\_processor.min\_file\_process\_interval](https://airflow.apache.org/docs/apache-airflow/3.0.6/configurations-ref.html#min-file-process-interval)** The number of seconds after which the scheduler re-parses a DAG, and after which updates to the DAG are reflected. **Default:** 30 seconds  |  You can use this option to free up resources by **increasing** the number of seconds that the scheduler waits before re-parsing a DAG. For example, if you specify a value of `30`, the DAG file is re-parsed every 30 seconds. We recommend keeping this number high to decrease the CPU usage on your environment.  | 

------
#### [ Apache Airflow v2 ]


| Configuration | Use case | 
| --- | --- | 
|  **[scheduler.dag\_dir\_list\_interval](https://airflow.apache.org/docs/apache-airflow/2.10.3/configurations-ref.html#dag-dir-list-interval)** The number of seconds between scans of the DAGs folder for new files. **Default:** 300 seconds  |  You can use this option to free up resources by **increasing** the interval at which the DAGs folder is parsed. We recommend increasing this value if you experience long parsing times in `total_parse_time` metrics, which might be due to a large number of files in your DAGs folder.  | 
|  **[scheduler.min\_file\_process\_interval](https://airflow.apache.org/docs/apache-airflow/2.10.3/configurations-ref.html#min-file-process-interval)** The number of seconds after which the scheduler re-parses a DAG, and after which updates to the DAG are reflected. **Default:** 30 seconds  |  You can use this option to free up resources by **increasing** the number of seconds that the scheduler waits before re-parsing a DAG. For example, if you specify a value of `30`, the DAG file is re-parsed every 30 seconds. We recommend keeping this number high to decrease the CPU usage on your environment.  | 

------

## DAG files
<a name="best-practices-tuning-dag-files"></a>

As part of the Apache Airflow scheduler loop, individual DAG files are parsed to extract DAG Python objects. In Apache Airflow v2 and later, the scheduler parses a maximum number of files simultaneously, equal to the configured number of [parsing processes](https://airflow.apache.org/docs/apache-airflow/2.10.3/configurations-ref.html#parsing-processes). The number of seconds specified in `scheduler.min_file_process_interval` (v2) or `dag_processor.min_file_process_interval` (v3) must pass before the same file is parsed again.

### Parameters
<a name="best-practices-tuning-dag-files-params"></a>

This section describes the configuration options available for Apache Airflow DAG files (Apache Airflow v2 and later) and their use cases.

------
#### [ Apache Airflow v3 ]


| Configuration | Use case | 
| --- | --- | 
|  **[dag\_processor.dag\_file\_processor\_timeout](https://airflow.apache.org/docs/apache-airflow/3.0.6/configurations-ref.html#dag-file-processor-timeout)** The number of seconds before the *DagFileProcessor* times out processing a DAG file. **Default:** 50 seconds  |  You can use this option to free up resources by **increasing** the time it takes before the *DagFileProcessor* times out. We recommend increasing this value if you experience timeouts in your DAG processing logs that result in no viable DAGs being loaded.  | 
|  **[core.dagbag\_import\_timeout](https://airflow.apache.org/docs/apache-airflow/3.0.6/configurations-ref.html#dagbag-import-timeout)** The number of seconds before importing a Python file times out. **Default:** 30 seconds  |  You can use this option to free up resources by **increasing** the time it takes before the scheduler times out while importing a Python file to extract the DAG objects. This option is processed as part of the scheduler "loop," and must contain a value less than the value specified in `dag_processor.dag_file_processor_timeout`.  | 
|  **[core.min\_serialized\_dag\_update\_interval](https://airflow.apache.org/docs/apache-airflow/3.0.6/configurations-ref.html#min-serialized-dag-update-interval)** The minimum number of seconds after which serialized DAGs in the database are updated. **Default:** 30  |  You can use this option to free up resources by **increasing** the number of seconds after which serialized DAGs in the database are updated. We recommend increasing this value if you have a large number of DAGs, or complex DAGs. Increasing this value reduces the load on the scheduler and the database as DAGs are serialized.   | 
|  **[core.min\_serialized\_dag\_fetch\_interval](https://airflow.apache.org/docs/apache-airflow/3.0.6/configurations-ref.html#min-serialized-dag-fetch-interval)** The number of seconds a serialized DAG is re-fetched from the database when already loaded in the DagBag. **Default:** 10  |  You can use this option to free up resources by **increasing** the number of seconds a serialized DAG is re-fetched. The value must be greater than the value specified in `core.min_serialized_dag_update_interval` to reduce database "write" rates. Increasing this value reduces the load on the webserver and the database as DAGs are serialized.  | 

------
#### [ Apache Airflow v2 ]


| Configuration | Use case | 
| --- | --- | 
|  **[core.dag\_file\_processor\_timeout](https://airflow.apache.org/docs/apache-airflow/2.10.3/configurations-ref.html#dag-file-processor-timeout)** The number of seconds before the *DagFileProcessor* times out processing a DAG file. **Default:** 50 seconds  |  You can use this option to free up resources by **increasing** the time it takes before the *DagFileProcessor* times out. We recommend increasing this value if you experience timeouts in your DAG processing logs that result in no viable DAGs being loaded.  | 
|  **[core.dagbag\_import\_timeout](https://airflow.apache.org/docs/apache-airflow/2.10.3/configurations-ref.html#dagbag-import-timeout)** The number of seconds before importing a Python file times out. **Default:** 30 seconds  |  You can use this option to free up resources by **increasing** the time it takes before the scheduler times out while importing a Python file to extract the DAG objects. This option is processed as part of the scheduler "loop," and must contain a value less than the value specified in `core.dag_file_processor_timeout`.  | 
|  **[core.min\_serialized\_dag\_update\_interval](https://airflow.apache.org/docs/apache-airflow/2.10.3/configurations-ref.html#min-serialized-dag-update-interval)** The minimum number of seconds after which serialized DAGs in the database are updated. **Default:** 30  |  You can use this option to free up resources by **increasing** the number of seconds after which serialized DAGs in the database are updated. We recommend increasing this value if you have a large number of DAGs, or complex DAGs. Increasing this value reduces the load on the scheduler and the database as DAGs are serialized.   | 
|  **[core.min\_serialized\_dag\_fetch\_interval](https://airflow.apache.org/docs/apache-airflow/2.10.3/configurations-ref.html#min-serialized-dag-fetch-interval)** The number of seconds a serialized DAG is re-fetched from the database when already loaded in the DagBag. **Default:** 10  |  You can use this option to free up resources by **increasing** the number of seconds a serialized DAG is re-fetched. The value must be greater than the value specified in `core.min_serialized_dag_update_interval` to reduce database "write" rates. Increasing this value reduces the load on the webserver and the database as DAGs are serialized.  | 

------

## Tasks
<a name="best-practices-tuning-tasks"></a>

The Apache Airflow scheduler and workers are both involved in queuing and de-queuing tasks. The scheduler takes parsed tasks ready to schedule from a **None** status to a **Scheduled** status. The executor, also running on the scheduler container in Fargate, queues those tasks and sets their status to **Queued**. When a worker has capacity, it takes the task from the queue and sets the status to **Running**. The task's status subsequently changes to **Success** or **Failed** based on whether it succeeds or fails.
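The lifecycle above amounts to a small state machine. The following is a simplified sketch; real Apache Airflow defines additional task states, such as `up_for_retry` and `skipped`:

```python
# Simplified task-state transitions from the description above.
TRANSITIONS = {
    "None": {"Scheduled"},             # scheduler picks up a parsed task
    "Scheduled": {"Queued"},           # the executor queues the task
    "Queued": {"Running"},             # a worker with capacity takes it
    "Running": {"Success", "Failed"},  # final outcome of the task
}

def is_valid_transition(src: str, dst: str) -> bool:
    """Return True if a task may move directly from src to dst."""
    return dst in TRANSITIONS.get(src, set())
```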

### Parameters
<a name="best-practices-tuning-tasks-params"></a>

This section describes the configuration options available for Apache Airflow tasks and their use cases.

The default configuration options that Amazon MWAA overrides are shown in *italics*.

------
#### [ Apache Airflow v3 ]


| Configuration | Use case | 
| --- | --- | 
|  **[core.parallelism](https://airflow.apache.org/docs/apache-airflow/3.0.6/configurations-ref.html#parallelism)** The maximum number of task instances that can have a `Running` status. **Default:** Dynamically set based on `(maxWorkers * maxCeleryWorkers) / schedulers * 1.5`.  |  You can use this option to free up resources by **increasing** the number of task instances that can run simultaneously. The value specified must be the number of available workers multiplied by the workers' task density. We recommend changing this value only when you experience a large number of tasks stuck in the “Running” or “Queued” state.  | 
|  **[core.execute\_tasks\_new\_python\_interpreter](https://airflow.apache.org/docs/apache-airflow/3.0.6/configurations-ref.html#execute-tasks-new-python-interpreter)** Determines whether Apache Airflow executes tasks by forking the parent process, or by creating a new Python process. **Default:** `True`  |  When set to `True`, Apache Airflow recognizes changes you make to your plugins because a new Python process is created to execute each task.  | 
|  **[celery.worker\_concurrency](https://airflow.apache.org/docs/apache-airflow-providers-celery/stable/configurations-ref.html#worker-concurrency)** Amazon MWAA overrides the Airflow base install for this option to scale workers as part of its autoscaling component. **Default:** Not applicable  |  *Any value specified for this option is ignored.*  | 
|  **[celery.worker\_autoscale](https://airflow.apache.org/docs/apache-airflow-providers-celery/stable/configurations-ref.html#worker-autoscale)** The task concurrency for workers. **Defaults:** [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/mwaa/latest/userguide/best-practices-tuning.html)  |  You can use this option to free up resources by **reducing** the `maximum` and `minimum` task concurrency of workers. Workers accept up to the `maximum` concurrent tasks configured, regardless of whether there are sufficient resources to do so. If tasks are scheduled without sufficient resources, the tasks immediately fail. We recommend changing this value for resource-intensive tasks by reducing the values to be less than the defaults to allow more capacity per task.  | 

------
#### [ Apache Airflow v2 ]


| Configuration | Use case | 
| --- | --- | 
|  **[core.parallelism](https://airflow.apache.org/docs/apache-airflow/2.10.3/configurations-ref.html#parallelism)** The maximum number of task instances that can have a `Running` status. **Default:** Dynamically set based on `(maxWorkers * maxCeleryWorkers) / schedulers * 1.5`.  |  You can use this option to free up resources by **increasing** the number of task instances that can run simultaneously. The value specified must be the number of available workers multiplied by the workers' task density. We recommend changing this value only when you experience a large number of tasks stuck in the “Running” or “Queued” state.  | 
|  **[core.dag\_concurrency](https://airflow.apache.org/docs/apache-airflow/2.10.3/configurations-ref.html#dag-concurrency)** The number of task instances allowed to run concurrently for each DAG. **Default:** 10000  |  You can use this option to free up resources by **increasing** the number of task instances allowed to run concurrently. For example, if you have one hundred DAGs with ten parallel tasks each, and you want all DAGs to run concurrently, you can calculate the maximum parallelism as the number of available workers multiplied by the workers' task density in `celery.worker_concurrency`, divided by the number of DAGs.  | 
|  **[core.execute\_tasks\_new\_python\_interpreter](https://airflow.apache.org/docs/apache-airflow/2.10.3/configurations-ref.html#execute-tasks-new-python-interpreter)** Determines whether Apache Airflow executes tasks by forking the parent process, or by creating a new Python process. **Default:** `True`  |  When set to `True`, Apache Airflow recognizes changes you make to your plugins because a new Python process is created to execute each task.  | 
|  **[celery.worker\_concurrency](https://airflow.apache.org/docs/apache-airflow-providers-celery/stable/configurations-ref.html#worker-concurrency)** Amazon MWAA overrides the Airflow base install for this option to scale workers as part of its autoscaling component. **Default:** Not applicable  |  *Any value specified for this option is ignored.*  | 
|  **[celery.worker\_autoscale](https://airflow.apache.org/docs/apache-airflow-providers-celery/stable/configurations-ref.html#worker-autoscale)** The task concurrency for workers. **Defaults:** [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/mwaa/latest/userguide/best-practices-tuning.html)  |  You can use this option to free up resources by **reducing** the `maximum` and `minimum` task concurrency of workers. Workers accept up to the `maximum` concurrent tasks configured, regardless of whether there are sufficient resources to do so. If tasks are scheduled without sufficient resources, the tasks immediately fail. We recommend changing this value for resource-intensive tasks by reducing the values to be less than the defaults to allow more capacity per task.  | 

------
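The dynamic default for `core.parallelism` shown in the tables above can be sketched as follows; the environment parameters in the usage line are illustrative:

```python
def default_parallelism(max_workers: int, max_celery_workers: int,
                        schedulers: int) -> int:
    """Sketch of the dynamic default:
    (maxWorkers * maxCeleryWorkers) / schedulers * 1.5."""
    return int(max_workers * max_celery_workers / schedulers * 1.5)

# Example: 10 workers, 5 Celery workers per worker, 2 schedulers.
parallelism = default_parallelism(10, 5, 2)
```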

# Managing Python dependencies in requirements.txt
<a name="best-practices-dependencies"></a>

This topic describes how to install and manage Python dependencies in a `requirements.txt` file for an Amazon Managed Workflows for Apache Airflow environment.

**Contents**
+ [Testing DAGs using the Amazon MWAA CLI utility](#best-practices-dependencies-cli-utility)
+ [Installing Python dependencies using PyPi.org Requirements File Format](#best-practices-dependencies-different-ways)
  + [Option one: Python dependencies from the Python Package Index](#best-practices-dependencies-pip-extras)
  + [Option two: Python wheels (.whl)](#best-practices-dependencies-python-wheels)
    + [Using the `plugins.zip` file on an Amazon S3 bucket](#best-practices-dependencies-python-wheels-s3)
    + [Using a WHL file hosted on a URL](#best-practices-dependencies-python-wheels-url)
    + [Creating WHL files from a DAG](#best-practices-dependencies-python-wheels-dag)
  + [Option three: Python dependencies hosted on a private PyPi/PEP-503 Compliant Repo](#best-practices-dependencies-custom-auth-url)
+ [Enabling logs on the Amazon MWAA console](#best-practices-dependencies-troubleshooting-enable)
+ [Accessing logs on the CloudWatch Logs console](#best-practices-dependencies-troubleshooting-view)
+ [Accessing errors in the Apache Airflow UI](#best-practices-dependencies-troubleshooting-aa)
  + [Log in to Apache Airflow](#airflow-access-and-login)
+ [Example `requirements.txt` scenarios](#best-practices-dependencies-ex-mix-match)

## Testing DAGs using the Amazon MWAA CLI utility
<a name="best-practices-dependencies-cli-utility"></a>
+ The command line interface (CLI) utility replicates an Amazon Managed Workflows for Apache Airflow environment locally.
+ The CLI builds a Docker container image locally that’s similar to an Amazon MWAA production image. You can use this to run a local Apache Airflow environment to develop and test DAGs, custom plugins, and dependencies before deploying to Amazon MWAA.
+ To run the CLI, refer to [aws-mwaa-docker-images](https://github.com/aws/amazon-mwaa-docker-images) on GitHub.

## Installing Python dependencies using PyPi.org Requirements File Format
<a name="best-practices-dependencies-different-ways"></a>

The following section describes the different ways to install Python dependencies according to the PyPi.org [Requirements File Format](https://pip.pypa.io/en/stable/reference/pip_install/#requirements-file-format).

### Option one: Python dependencies from the Python Package Index
<a name="best-practices-dependencies-pip-extras"></a>

The following section describes how to specify Python dependencies from the [Python Package Index](https://pypi.org/) in a `requirements.txt` file.

------
#### [ Apache Airflow v3 ]

1. **Test locally**. Add additional libraries iteratively to find the right combination of packages and their versions, before creating a `requirements.txt` file. To run the Amazon MWAA CLI utility, refer to [aws-mwaa-docker-images](https://github.com/aws/amazon-mwaa-docker-images) on GitHub.

1. **Review the Apache Airflow package extras**. To access a list of the packages installed for Apache Airflow v3 on Amazon MWAA, refer to [aws-mwaa-docker-images `requirements.txt`](https://github.com/aws/amazon-mwaa-docker-images/blob/main/requirements.txt) on the GitHub website.

1. **Add a constraints statement**. Add the constraints file for your Apache Airflow v3 environment at the top of your `requirements.txt` file. Apache Airflow constraints files specify the provider versions available at the time of an Apache Airflow release.

    In the following example, replace *{Airflow-version}* with your Apache Airflow version number, and *{Python-version}* with the version of Python that's compatible with your environment. 

    For information about the version of Python compatible with your Apache Airflow environment, refer to [Apache Airflow Versions](airflow-versions.md#airflow-versions-official). 

   ```
   --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-{Airflow-version}/constraints-{Python-version}.txt"
   ```

    If the constraints file determines that the `xyz==1.0` package is not compatible with other packages in your environment, `pip3 install` fails to prevent incompatible libraries from being installed to your environment. If installation fails for any packages, you can access error logs for each Apache Airflow component (the scheduler, worker, and webserver) in the corresponding log stream on CloudWatch Logs. For more information about log types, refer to [Accessing Airflow logs in Amazon CloudWatch](monitoring-airflow.md). 

1. **Apache Airflow packages**. Add the [package extras](http://airflow.apache.org/docs/apache-airflow/2.5.1/extra-packages-ref.html) and the version (`==`). This helps to prevent packages of the same name, but different version, from being installed on your environment.

   ```
   apache-airflow[package-extra]==2.5.1
   ```

1. **Python libraries**. Add the package name and the version (`==`) in your `requirements.txt` file. This helps to prevent a future breaking update from [PyPi.org](https://pypi.org) from being automatically applied.

   ```
   library == version
   ```  
**Example Boto3 and psycopg2-binary**  

   This example is provided for demonstration purposes. The boto and psycopg2-binary libraries are included with the base install for Apache Airflow v3 and don't need to be specified in a `requirements.txt` file.

   ```
   boto3==1.17.54
   boto==2.49.0
   botocore==1.20.54
   psycopg2-binary==2.8.6
   ```

   If a package is specified without a version, Amazon MWAA installs the latest version of the package from [PyPi.org](https://pypi.org). This version can conflict with other packages in your `requirements.txt`.
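Putting the steps above together, a complete `requirements.txt` might look like the following. The package names and versions are illustrative; always pin versions that your local testing has validated against your environment's constraints file.

```
--constraint "https://raw.githubusercontent.com/apache/airflow/constraints-{Airflow-version}/constraints-{Python-version}.txt"
apache-airflow[celery]==3.0.6
requests==2.31.0
```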

------
#### [ Apache Airflow v2 ]

1. **Test locally**. Add additional libraries iteratively to find the right combination of packages and their versions, before creating a `requirements.txt` file. To run the Amazon MWAA CLI utility, refer to [aws-mwaa-docker-images](https://github.com/aws/amazon-mwaa-docker-images) on GitHub.

1. **Review the Apache Airflow package extras**. To access a list of the packages installed for Apache Airflow v2 on Amazon MWAA, access [aws-mwaa-docker-images `requirements.txt`](https://github.com/aws/amazon-mwaa-docker-images/blob/main/requirements.txt) on the GitHub website.

1. **Add a constraints statement**. Add the constraints file for your Apache Airflow v2 environment at the top of your `requirements.txt` file. Apache Airflow constraints files specify the provider versions available at the time of an Apache Airflow release.

    Beginning with Apache Airflow v2.7.2, your requirements file must include a `--constraint` statement. If you do not provide a constraint, Amazon MWAA will specify one for you to ensure the packages listed in your requirements are compatible with the version of Apache Airflow you are using. 

   In the following example, replace *{Airflow-version}* with your Apache Airflow version number, and *{Python-version}* with the version of Python that's compatible with your environment.

   For information about the version of Python compatible with your Apache Airflow environment, refer to [Apache Airflow Versions](airflow-versions.md#airflow-versions-official).

   ```
   --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-{Airflow-version}/constraints-{Python-version}.txt"
   ```

   If the constraints file determines that the `xyz==1.0` package is not compatible with other packages in your environment, `pip3 install` fails to prevent incompatible libraries from being installed to your environment. If installation fails for any packages, you can access error logs for each Apache Airflow component (the scheduler, worker, and webserver) in the corresponding log stream on CloudWatch Logs. For more information about log types, refer to [Accessing Airflow logs in Amazon CloudWatch](monitoring-airflow.md).

1. **Apache Airflow packages**. Add the [package extras](http://airflow.apache.org/docs/apache-airflow/2.5.1/extra-packages-ref.html) and the version (`==`). This helps to prevent packages of the same name, but a different version, from being installed on your environment.

   ```
   apache-airflow[package-extra]==2.5.1
   ```

1. **Python libraries**. Add the package name and the version (`==`) in your `requirements.txt` file. This helps to prevent a future breaking update from [PyPi.org](https://pypi.org) from being automatically applied.

   ```
   library == version
   ```  
**Example Boto3 and psycopg2-binary**  

   This example is provided for demonstration purposes. The Boto libraries (`boto3`, `boto`, `botocore`) and `psycopg2-binary` are included with the Apache Airflow v2 base install and don't need to be specified in a `requirements.txt` file.

   ```
   boto3==1.17.54
   boto==2.49.0
   botocore==1.20.54
   psycopg2-binary==2.8.6
   ```

   If a package is specified without a version, Amazon MWAA installs the latest version of the package from [PyPi.org](https://pypi.org). This version can conflict with other packages in your `requirements.txt`.
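   Putting the preceding guidance together, a complete `requirements.txt` might look like the following sketch. The constraint URL shown here assumes an Apache Airflow v2.5.1 environment running Python 3.10, and `my-first-library` is a hypothetical package name; substitute the values that match your environment.

   ```
   --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.5.1/constraints-3.10.txt"

   apache-airflow[amazon]==2.5.1
   my-first-library==1.0.0
   ```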

------

### Option two: Python wheels (.whl)
<a name="best-practices-dependencies-python-wheels"></a>

A Python wheel is a package format designed to ship libraries with compiled artifacts. There are several benefits to wheel packages as a method to install dependencies in Amazon MWAA:
+ **Faster installation** – the WHL files are copied to the container as a single ZIP, and then installed locally, without having to download each one.
+ **Fewer conflicts** – You can determine version compatibility for your packages in advance. As a result, there is no need for `pip` to recursively work out compatible versions.
+ **More resilience** – With externally hosted libraries, downstream requirements can change, resulting in version incompatibility between containers on an Amazon MWAA environment. By not depending on an external source for dependencies, every container has the same libraries regardless of when each container is instantiated.

We recommend the following methods to install Python dependencies from a Python wheel archive (`.whl`) in your `requirements.txt`.

**Topics**
+ [Using the `plugins.zip` file on an Amazon S3 bucket](#best-practices-dependencies-python-wheels-s3)
+ [Using a WHL file hosted on a URL](#best-practices-dependencies-python-wheels-url)
+ [Creating a WHL file from a DAG](#best-practices-dependencies-python-wheels-dag)

#### Using the `plugins.zip` file on an Amazon S3 bucket
<a name="best-practices-dependencies-python-wheels-s3"></a>

The Apache Airflow scheduler, workers, and webserver (for Apache Airflow v2.2.2 and later) search for custom plugins during startup on the AWS-managed Fargate container for your environment at `/usr/local/airflow/plugins/*`. This process begins prior to Amazon MWAA's `pip3 install -r requirements.txt` for Python dependencies and Apache Airflow service startup. A `plugins.zip` file can be used for any files that you don't want continuously changed during environment execution, or that you don't want to grant access to users who write DAGs, such as Python library wheel files, certificate PEM files, and configuration YAML files.

The following section describes how to install a wheel that's in the `plugins.zip` file on your Amazon S3 bucket.

1. **Download the necessary WHL files**. You can use [`pip download`](https://pip.pypa.io/en/stable/cli/pip_download/) with your existing `requirements.txt` on the Amazon MWAA [aws-mwaa-docker-images](https://github.com/aws/amazon-mwaa-docker-images) or another [Amazon Linux 2](https://aws.amazon.com/amazon-linux-2) container to resolve and download the necessary Python wheel files.

   ```
   pip3 download -r "$AIRFLOW_HOME/dags/requirements.txt" -d "$AIRFLOW_HOME/plugins"
   cd "$AIRFLOW_HOME/plugins"
   zip "$AIRFLOW_HOME/plugins.zip" *
   ```
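   The `zip -j` step above can also be scripted in Python. The following is a minimal sketch, assuming the wheel files have already been downloaded into a local directory; the function and directory names are illustrative, not part of Amazon MWAA:

   ```python
   import zipfile
   from pathlib import Path

   def zip_wheels(wheel_dir: str, output_zip: str) -> list:
       """Package every .whl file in wheel_dir into a flat zip archive."""
       names = []
       with zipfile.ZipFile(output_zip, "w", zipfile.ZIP_DEFLATED) as zf:
           for whl in sorted(Path(wheel_dir).glob("*.whl")):
               # arcname keeps the archive flat, like `zip -j`
               zf.write(whl, arcname=whl.name)
               names.append(whl.name)
       return names
   ```

   Upload the resulting `plugins.zip` to your Amazon S3 bucket as usual.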

1. **Specify the path in your `requirements.txt`**. Specify the plugins directory at the top of your `requirements.txt` using [`--find-links`](https://pip.pypa.io/en/stable/cli/pip_install/#install-find-links) and instruct `pip` not to install from other sources using [`--no-index`](https://pip.pypa.io/en/stable/cli/pip_install/#install-no-index), as shown in the following code:

   ```
   --find-links /usr/local/airflow/plugins
   --no-index
   ```  
**Example wheel in requirements.txt**  

   The following example assumes you've uploaded the wheel in a `plugins.zip` file at the root of your Amazon S3 bucket. For example:

   ```
   --find-links /usr/local/airflow/plugins
   --no-index
   
   numpy
   ```

   Amazon MWAA fetches the `numpy-1.20.1-cp37-cp37m-manylinux1_x86_64.whl` wheel from the `plugins` folder and installs it on your environment.

#### Using a WHL file hosted on a URL
<a name="best-practices-dependencies-python-wheels-url"></a>

The following section describes how to install a wheel that's hosted on a URL. The URL must either be publicly accessible, or accessible from within the custom Amazon VPC you specified for your Amazon MWAA environment.
+ **Provide a URL**. Provide the URL to a wheel in your `requirements.txt`.  
**Example wheel archive on a public URL**  

  The following example downloads a wheel from a public site.

  ```
  --find-links https://files.pythonhosted.org/packages/
  --no-index
  ```

  Amazon MWAA fetches the wheel from the URL you specified and installs it on your environment.
**Note**  
URLs are not accessible from private webservers installing requirements in Amazon MWAA v2.2.2 and later.

#### Creating a WHL file from a DAG
<a name="best-practices-dependencies-python-wheels-dag"></a>

If you have a private webserver using Apache Airflow v2.2.2 or later and you're unable to install requirements because your environment does not have access to external repositories, you can use the following DAG to take your existing Amazon MWAA requirements and package them on Amazon S3:

```
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.dates import days_ago

S3_BUCKET = 'my-s3-bucket'
S3_KEY = 'backup/plugins_whl.zip'

with DAG(dag_id="create_whl_file", schedule_interval=None, catchup=False, start_date=days_ago(1)) as dag:
    cli_command = BashOperator(
        task_id="bash_command",
        bash_command=f"mkdir /tmp/whls;pip3 download -r /usr/local/airflow/requirements/requirements.txt -d /tmp/whls;zip -j /tmp/plugins.zip /tmp/whls/*;aws s3 cp /tmp/plugins.zip s3://{S3_BUCKET}/{S3_KEY}"
    )
```

After running the DAG, use this new file as your Amazon MWAA `plugins.zip`, optionally packaged together with other plugins. Then, update your `requirements.txt` so that it begins with `--find-links /usr/local/airflow/plugins` and `--no-index`, and do not add a `--constraint` statement.

You can use this method to install the same libraries offline.

### Option three: Python dependencies hosted on a private PyPI/PEP 503 compliant repository
<a name="best-practices-dependencies-custom-auth-url"></a>

The following section describes how to install an Apache Airflow extra that's hosted on a private URL with authentication.

1. Add your user name and password as [Apache Airflow configuration options](configuring-env-variables.md). For example:
   + `foo.user` : `YOUR_USER_NAME`
   + `foo.pass` : `YOUR_PASSWORD`

1. Create your `requirements.txt` file. In the following example, substitute the placeholder URL with your private repository URL. The username and password you added as [Apache Airflow configuration options](configuring-env-variables.md) are referenced as environment variables. For example:

   ```
   --index-url https://${AIRFLOW__FOO__USER}:${AIRFLOW__FOO__PASS}@my.privatepypi.com
   ```

1. Add any additional libraries to your `requirements.txt` file. For example:

   ```
   --index-url https://${AIRFLOW__FOO__USER}:${AIRFLOW__FOO__PASS}@my.privatepypi.com
   my-private-package==1.2.3
   ```
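Apache Airflow exposes configuration options to the environment as variables named `AIRFLOW__{SECTION}__{KEY}`, which is why the `foo.user` and `foo.pass` options above are referenced as `${AIRFLOW__FOO__USER}` and `${AIRFLOW__FOO__PASS}`. The following is a minimal sketch of that naming convention; the helper function is illustrative, not part of any Airflow API:

```python
def airflow_env_var(option: str) -> str:
    """Map a 'section.key' configuration option to its environment variable name."""
    section, key = option.split(".", 1)
    return f"AIRFLOW__{section.upper()}__{key.upper()}"
```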

## Enabling logs on the Amazon MWAA console
<a name="best-practices-dependencies-troubleshooting-enable"></a>

The [execution role](mwaa-create-role.md) for your Amazon MWAA environment needs permission to send logs to CloudWatch Logs. To update the permissions of an execution role, refer to [Amazon MWAA execution role](mwaa-create-role.md).

You can enable Apache Airflow logs at the `INFO`, `WARNING`, `ERROR`, or `CRITICAL` level. When you choose a log level, Amazon MWAA sends logs for that level and all higher levels of severity. For example, if you enable logs at the `INFO` level, Amazon MWAA sends `INFO` logs and `WARNING`, `ERROR`, and `CRITICAL` log levels to CloudWatch Logs. We recommend enabling Apache Airflow logs at the `INFO` level for the scheduler so that you can view the logs generated when the packages in your `requirements.txt` are installed.

![\[This image depicts how to enable logs at the INFO level.\]](http://docs.aws.amazon.com/mwaa/latest/userguide/images/mwaa-console-logs-info.png)


## Accessing logs on the CloudWatch Logs console
<a name="best-practices-dependencies-troubleshooting-view"></a>

You can access Apache Airflow logs for the scheduler scheduling your workflows and parsing your `dags` folder. The following steps describe how to open the log group for the scheduler on the Amazon MWAA console, and access Apache Airflow logs on the CloudWatch Logs console.

**To access logs for a `requirements.txt`**

1. Open the [Environments](https://console.aws.amazon.com/mwaa/home#/environments) page on the Amazon MWAA console.

1. Choose an environment.

1. Choose the **Airflow scheduler log group** on the **Monitoring** pane.

1. Choose the `requirements_install_ip` log in **Log streams**.

1. Refer to the list of packages that were installed on the environment at `/usr/local/airflow/.local/bin`. For example:

   ```
   Collecting appdirs==1.4.4 (from -r /usr/local/airflow/.local/bin (line 1))
   Downloading https://files.pythonhosted.org/packages/3b/00/2344469e2084fb28kjdsfiuyweb47389789vxbmnbjhsdgf5463acd6cf5e3db69324/appdirs-1.4.4-py2.py3-none-any.whl  
   Collecting astroid==2.4.2 (from -r /usr/local/airflow/.local/bin (line 2))
   ```

1. Review the list of packages and check whether any of them encountered an error during installation. If a package failed to install, you see an error similar to the following:

   ```
   2021-03-05T14:34:42.731-07:00
   No matching distribution found for LibraryName==1.0.0 (from -r /usr/local/airflow/.local/bin (line 4))
   No matching distribution found for LibraryName==1.0.0 (from -r /usr/local/airflow/.local/bin (line 4))
   ```
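If the log stream is long, you can scan exported log lines for failed installations programmatically. The following is a minimal sketch, assuming you've copied the log lines into a local file or list; the function name is illustrative, and the marker string matches `pip`'s "No matching distribution found" message shown above:

```python
def find_failed_packages(log_lines):
    """Return the package specs that pip could not resolve."""
    marker = "No matching distribution found for "
    failed = set()
    for line in log_lines:
        if marker in line:
            # Keep only the spec, dropping the '(from -r ...)' suffix.
            spec = line.split(marker, 1)[1].split(" (from", 1)[0].strip()
            failed.add(spec)
    return sorted(failed)
```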

## Accessing errors in the Apache Airflow UI
<a name="best-practices-dependencies-troubleshooting-aa"></a>

You can also check your Apache Airflow UI to identify whether an error is related to another issue. The most common error you can encounter with Apache Airflow on Amazon MWAA is:

```
Broken DAG: No module named x
```

If you find this error in your Apache Airflow UI, you're likely missing a required dependency in your `requirements.txt` file.
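Note that the module name in the error is the *import* name, which can differ from the distribution name you list in `requirements.txt`. For example, `import yaml` is provided by the `PyYAML` distribution. The following sketch illustrates a few well-known cases; the mapping is illustrative, not exhaustive:

```python
# Well-known cases where the import name differs from the PyPI
# distribution name to add to requirements.txt. Illustrative only.
IMPORT_TO_DISTRIBUTION = {
    "yaml": "PyYAML",
    "bs4": "beautifulsoup4",
    "cv2": "opencv-python",
}

def distribution_for(module_name: str) -> str:
    """Suggest the PyPI name to add for a 'No module named x' error."""
    return IMPORT_TO_DISTRIBUTION.get(module_name, module_name)
```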

### Log in to Apache Airflow
<a name="airflow-access-and-login"></a>

You need [Apache Airflow UI access policy: AmazonMWAAWebServerAccess](access-policies.md#web-ui-access) permissions for your AWS account in AWS Identity and Access Management (IAM) to access your Apache Airflow UI.

**To access your Apache Airflow UI**

1. Open the [Environments](https://console.aws.amazon.com/mwaa/home#/environments) page on the Amazon MWAA console.

1. Choose an environment.

1. Choose **Open Airflow UI**.

## Example `requirements.txt` scenarios
<a name="best-practices-dependencies-ex-mix-match"></a>

You can mix and match different formats in your `requirements.txt`. The following example uses a combination of the different ways to install extras.

**Example Extras on PyPi.org and a public URL**  
You need to use the `--index-url` option when specifying packages from PyPi.org, in addition to packages on a public URL, such as custom PEP 503 compliant repo URLs.  

```
aws-batch == 0.6
phoenix-letter >= 0.3

--index-url http://dist.repoze.org/zope2/2.10/simple
zopelib
```