

# Orchestration in AWS Glue
<a name="etl-jobs"></a>

The following sections provide information on orchestration of jobs in AWS Glue.

**Topics**
+ [Starting jobs and crawlers using triggers](trigger-job.md)
+ [Performing complex ETL activities using blueprints and workflows in AWS Glue](orchestrate-using-workflows.md)
+ [Developing blueprints in AWS Glue](orchestrate-using-blueprints.md)

# Starting jobs and crawlers using triggers
<a name="trigger-job"></a>

In AWS Glue, you can create Data Catalog objects called triggers, which you can use to either manually or automatically start one or more crawlers or extract, transform, and load (ETL) jobs. Using triggers, you can design a chain of dependent jobs and crawlers.

**Note**  
You can accomplish the same thing by defining *workflows*. Workflows are preferred for creating complex multi-job ETL operations. For more information, see [Performing complex ETL activities using blueprints and workflows in AWS Glue](orchestrate-using-workflows.md).

**Topics**
+ [AWS Glue triggers](about-triggers.md)
+ [Adding triggers](console-triggers.md)
+ [Activating and deactivating triggers](activate-triggers.md)

# AWS Glue triggers
<a name="about-triggers"></a>

When *fired*, a trigger can start specified jobs and crawlers. A trigger fires on demand, based on a schedule, or based on a combination of events.

**Note**  
Only two crawlers can be activated by a single trigger. If you want to crawl multiple data stores, use multiple sources for each crawler instead of running multiple crawlers simultaneously.

A trigger can exist in one of several states: `CREATED`, `ACTIVATED`, or `DEACTIVATED`, plus transitional states such as `ACTIVATING`. To temporarily stop a trigger from firing, you can deactivate it. You can then reactivate it later.

There are three types of triggers:

**Scheduled**  
A time-based trigger based on `cron`.  
You can create a trigger for a set of jobs or crawlers based on a schedule. You can specify constraints, such as the frequency that the jobs or crawlers run, which days of the week they run, and at what time. These constraints are based on `cron`. When you're setting up a schedule for a trigger, consider the features and limitations of cron. For example, if you choose to run your crawler on day 31 each month, keep in mind that some months don't have 31 days. For more information about cron, see [Time-based schedules for jobs and crawlers](monitor-data-warehouse-schedule.md). 

**Conditional**  
A trigger that fires when one or more previously run jobs or crawlers satisfy a list of conditions.  
When you create a conditional trigger, you specify a list of jobs and a list of crawlers to watch. For each watched job or crawler, you specify a status to watch for, such as succeeded, failed, or timed out. The trigger fires if the watched jobs or crawlers end with the specified statuses. You can configure the trigger to fire when any or all of the watched events occur.  
For example, you could configure a trigger T1 to start job J3 when both job J1 and job J2 successfully complete, and another trigger T2 to start job J4 if either job J1 or job J2 fails.  
The following table lists the job and crawler completion states (events) that triggers watch for.      
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/about-triggers.html)

**On-demand**  
A trigger that fires when you activate it. On-demand triggers never enter the `ACTIVATED` or `DEACTIVATED` state. They always remain in the `CREATED` state.

You can set a flag to activate scheduled and conditional triggers when you create them, so that they are ready to fire as soon as they exist.

**Important**  
Jobs or crawlers that run as a result of other jobs or crawlers completing are referred to as *dependent*. Dependent jobs or crawlers are only started if the job or crawler that completes was started by a trigger. All jobs or crawlers in a dependency chain must be descendants of a single **scheduled** or **on-demand** trigger.

**Passing job parameters with triggers**  
A trigger can pass parameters to the jobs that it starts. Parameters include job arguments, timeout value, security configuration, and more. If the trigger starts multiple jobs, the parameters are passed to each job.

The following are the rules for job arguments passed by a trigger:
+ If the key in the key-value pair matches a default job argument, the passed argument overrides the default argument. If the key doesn’t match a default argument, then the argument is passed as an additional argument to the job.
+ If the key in the key-value pair matches a non-overridable argument, the passed argument is ignored.
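These rules can be sketched as a small merge function. The following is an illustration of the merge behavior only, not AWS Glue's implementation, and the argument names used are hypothetical:

```python
def resolve_job_arguments(default_args, trigger_args, non_overridable_keys):
    """Merge trigger-passed arguments into a job's default arguments.

    - A passed key that matches a default argument overrides the default.
    - A passed key with no matching default is added as an extra argument.
    - A passed key that matches a non-overridable argument is ignored.
    """
    resolved = dict(default_args)
    for key, value in trigger_args.items():
        if key in non_overridable_keys:
            continue  # non-overridable arguments win; the passed value is dropped
        resolved[key] = value  # override a default, or add a new argument
    return resolved


# Hypothetical argument names, for illustration only
defaults = {"--TempDir": "s3://my-bucket/tmp/", "--job-language": "python"}
passed = {"--TempDir": "s3://other-bucket/tmp/", "--extra-flag": "on",
          "--job-language": "scala"}
print(resolve_job_arguments(defaults, passed,
                            non_overridable_keys={"--job-language"}))
```

In this sketch, the trigger's `--TempDir` replaces the default, `--extra-flag` is passed through as an additional argument, and the attempt to override the non-overridable `--job-language` is silently ignored.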

For more information, see [Triggers](aws-glue-api-jobs-trigger.md) in the AWS Glue API.

# Adding triggers
<a name="console-triggers"></a>

You can add a trigger using the AWS Glue console, the AWS Command Line Interface (AWS CLI), or the AWS Glue API.

**To add a trigger (console)**

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/). 

1. In the navigation pane, under **ETL**, choose **Triggers**. Then choose **Add trigger**.

1. Provide the following properties:  
**Name**  
Give your trigger a unique name.  
**Trigger type**  
Specify one of the following:  
   + **Schedule:** The trigger fires at a specific frequency and time.
   + **Job events:** A conditional trigger. The trigger fires when any or all jobs in the list match their designated statuses. For the trigger to fire, the watched jobs must have been started by triggers. For any job you choose, you can only watch one job event (completion status).
   + **On-demand:** The trigger fires when it is activated.

1. Complete the trigger wizard. On the **Review** page, you can activate **Schedule** and **Job events** (conditional) triggers immediately by selecting **Enable trigger on creation**.

**To add a trigger (AWS CLI)**
+ Enter a command similar to the following.

  ```
  aws glue create-trigger --name MyTrigger --type SCHEDULED --schedule "cron(0 12 * * ? *)" --actions CrawlerName=MyCrawler --start-on-creation
  ```

  This command creates a schedule trigger named `MyTrigger`, which runs every day at 12:00 pm (UTC) and starts a crawler named `MyCrawler`. The trigger is created in the activated state.

For more information, see [AWS Glue triggers](about-triggers.md).

# Time-based schedules for jobs and crawlers
<a name="monitor-data-warehouse-schedule"></a>

You can define a time-based schedule for your crawlers and jobs in AWS Glue. The definition of these schedules uses the Unix-like [cron](http://en.wikipedia.org/wiki/Cron) syntax. You specify time in [Coordinated Universal Time (UTC)](http://en.wikipedia.org/wiki/Coordinated_Universal_Time), and the minimum precision for a schedule is 5 minutes.

To learn more about configuring jobs and crawlers to run using a schedule, see [Starting jobs and crawlers using triggers](trigger-job.md).

## Cron expressions
<a name="CronExpressions"></a>

Cron expressions have six required fields, which are separated by white space. 

**Syntax**

```
cron(Minutes Hours Day-of-month Month Day-of-week Year)
```


| **Fields** | **Values** | **Wildcards** | 
| --- | --- | --- | 
|  Minutes  |  0–59  |  , - \* /  | 
|  Hours  |  0–23  |  , - \* /  | 
|  Day-of-month  |  1–31  |  , - \* ? / L W  | 
|  Month  |  1–12 or JAN-DEC  |  , - \* /  | 
|  Day-of-week  |  1–7 or SUN-SAT  |  , - \* ? / L  | 
|  Year  |  1970–2199  |  , - \* /  | 

**Wildcards**
+ The **,** (comma) wildcard includes additional values. In the `Month` field, `JAN,FEB,MAR` would include January, February, and March.
+ The **-** (dash) wildcard specifies ranges. In the `Day-of-month` field, `1-15` would include days 1 through 15 of the specified month.
+ The **\*** (asterisk) wildcard includes all values in the field. In the `Hours` field, **\*** would include every hour.
+ The **/** (forward slash) wildcard specifies increments. In the `Minutes` field, you could enter **1/10** to specify every 10th minute, starting from the first minute of the hour (for example, the 11th, 21st, and 31st minute).
+ The **?** (question mark) wildcard specifies one or another. In the `Day-of-month` field you could enter **7**, and if you didn't care what day of the week the seventh was, you could enter **?** in the `Day-of-week` field.
+ The **L** wildcard in the `Day-of-month` or `Day-of-week` fields specifies the last day of the month or week.
+ The **W** wildcard in the `Day-of-month` field specifies a weekday. For example, `3W` specifies the weekday nearest to the third day of the month.

**Limits**
+ You can't specify the `Day-of-month` and `Day-of-week` fields in the same cron expression. If you specify a value in one of the fields, you must use a **?** (question mark) in the other.
+ Cron expressions that lead to rates faster than 5 minutes are not supported. 
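The structural limits above lend themselves to a quick check. The following sketch (an illustration only, not a full cron validator) verifies that an expression has six fields and that exactly one of `Day-of-month` and `Day-of-week` is the **?** wildcard:

```python
def check_glue_cron(expression):
    """Check two structural rules for a cron() schedule expression.

    This is a minimal sketch, not a full cron parser: it checks only that
    the expression has six whitespace-separated fields and that exactly
    one of Day-of-month and Day-of-week is the '?' wildcard.
    """
    if not (expression.startswith("cron(") and expression.endswith(")")):
        return False
    fields = expression[5:-1].split()
    if len(fields) != 6:
        return False
    day_of_month, day_of_week = fields[2], fields[4]
    # You can't specify both fields; one of them must be '?'
    return (day_of_month == "?") != (day_of_week == "?")


print(check_glue_cron("cron(15 12 * * ? *)"))   # True
print(check_glue_cron("cron(0 18 * * MON *)"))  # False: neither field is '?'
```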

**Examples**  
When creating a schedule, you can use the following sample cron strings.


| Minutes | Hours | Day of month | Month | Day of week | Year | Meaning | 
| --- | --- | --- | --- | --- | --- | --- | 
|  0  |  10  |  \*  |  \*  |  ?  |  \*  |  Run at 10:00 am (UTC) every day  | 
|  15  |  12  |  \*  |  \*  |  ?  |  \*  |  Run at 12:15 pm (UTC) every day  | 
|  0  |  18  |  ?  |  \*  |  MON-FRI  |  \*  |  Run at 6:00 pm (UTC) every Monday through Friday  | 
|  0  |  8  |  1  |  \*  |  ?  |  \*  |  Run at 8:00 am (UTC) on the first day of every month  | 
|  0/15  |  \*  |  \*  |  \*  |  ?  |  \*  |  Run every 15 minutes  | 
|  0/10  |  \*  |  ?  |  \*  |  MON-FRI  |  \*  |  Run every 10 minutes Monday through Friday  | 
|  0/5  |  8-17  |  ?  |  \*  |  MON-FRI  |  \*  |  Run every 5 minutes Monday through Friday between 8:00 am and 5:55 pm (UTC)  | 

For example, to run every day at 12:15 pm (UTC), specify:

```
cron(15 12 * * ? *)
```

# Activating and deactivating triggers
<a name="activate-triggers"></a>

You can activate or deactivate a trigger using the AWS Glue console, the AWS Command Line Interface (AWS CLI), or the AWS Glue API.

**To activate or deactivate a trigger (console)**

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/). 

1. In the navigation pane, under **ETL**, choose **Triggers**.

1. Select the check box next to the desired trigger, and on the **Action** menu choose **Enable trigger** to activate the trigger or **Disable trigger** to deactivate the trigger.

**To activate or deactivate a trigger (AWS CLI)**
+ Enter one of the following commands.

  ```
  aws glue start-trigger --name MyTrigger  
  
  aws glue stop-trigger --name MyTrigger
  ```

  Starting a trigger activates it, and stopping a trigger deactivates it. When you activate an on-demand trigger, it fires immediately.

For more information, see [AWS Glue triggers](about-triggers.md).

# Performing complex ETL activities using blueprints and workflows in AWS Glue
<a name="orchestrate-using-workflows"></a>

Some of your organization's complex extract, transform, and load (ETL) processes might best be implemented by using multiple, dependent AWS Glue jobs and crawlers. Using AWS Glue *workflows*, you can design a complex multi-job, multi-crawler ETL process that AWS Glue can run and track as a single entity. After you create a workflow and specify the jobs, crawlers, and triggers in the workflow, you can run the workflow on demand or on a schedule.

**Topics**
+ [Overview of workflows in AWS Glue](workflows_overview.md)
+ [Creating and building out a workflow manually in AWS Glue](creating_running_workflows.md)
+ [Starting an AWS Glue workflow with an Amazon EventBridge event](starting-workflow-eventbridge.md)
+ [Viewing the EventBridge events that started a workflow](viewing-start-event-info.md)
+ [Running and monitoring a workflow in AWS Glue](running_monitoring_workflow.md)
+ [Stopping a workflow run](workflow-stopping.md)
+ [Repairing and resuming a workflow run](resuming-workflow.md)
+ [Getting and setting workflow run properties in AWS Glue](workflow-run-properties-code.md)
+ [Querying workflows using the AWS Glue API](workflows_api_concepts.md)
+ [Blueprint and workflow restrictions in AWS Glue](blueprint_workflow_restrictions.md)
+ [Troubleshooting blueprint errors in AWS Glue](blueprint_workflow_troubleshoot.md)
+ [Permissions for personas and roles for AWS Glue blueprints](blueprints-personas-permissions.md)

# Overview of workflows in AWS Glue
<a name="workflows_overview"></a>

In AWS Glue, you can use workflows to create and visualize complex extract, transform, and load (ETL) activities involving multiple crawlers, jobs, and triggers. Each workflow manages the execution and monitoring of all its jobs and crawlers. As a workflow runs each component, it records execution progress and status. This provides you with an overview of the larger task and the details of each step. The AWS Glue console provides a visual representation of a workflow as a graph.

You can create a workflow from an AWS Glue blueprint, or you can manually build a workflow a component at a time using the AWS Management Console or the AWS Glue API. For more information about blueprints, see [Overview of blueprints in AWS Glue](blueprints-overview.md).

*Triggers* within workflows can start both jobs and crawlers and can be fired when jobs or crawlers complete. By using triggers, you can create large chains of interdependent jobs and crawlers. In addition to triggers within a workflow that define job and crawler dependencies, each workflow has a *start trigger*. There are three types of start triggers:
+ **Schedule** – The workflow is started according to a schedule that you define. The schedule can be daily, weekly, monthly, and so on, or can be a custom schedule based on a `cron` expression.
+ **On demand** – The workflow is started manually from the AWS Glue console, API, or AWS CLI.
+ **EventBridge event** – The workflow is started upon the occurrence of a single Amazon EventBridge event or a batch of Amazon EventBridge events. With this trigger type, AWS Glue can be an event consumer in an event-driven architecture. Any EventBridge event type can start a workflow. A common use case is the arrival of a new object in an Amazon S3 bucket (the S3 `PutObject` operation). 

  Starting a workflow with a batch of events means waiting until a specified number of events have been received or until a specified amount of time has passed. When you create the EventBridge event trigger, you can optionally specify batch conditions. If you specify batch conditions, you must specify the batch size (number of events), and can optionally specify a batch window (number of seconds). The default and maximum batch window is 900 seconds (15 minutes). The batch condition that is met first starts the workflow. The batch window starts when the first event arrives. If you don't specify batch conditions when creating a trigger, the batch size defaults to 1.

  When the workflow starts, the batch conditions are reset and the event trigger begins watching for the next batch condition to be met to start the workflow again.

  The following table shows how batch size and batch window operate together to trigger a workflow.    
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/workflows_overview.html)

  The `GetWorkflowRun` API operation returns the batch condition that triggered the workflow.
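The batch size and batch window semantics described above can be sketched as a small simulation. This is an illustration under a stated simplification (conditions are evaluated only when an event arrives, whereas a real window would also expire on a timer), not AWS Glue's implementation:

```python
class BatchTrigger:
    """Sketch of the batch-size / batch-window behavior described above.

    The workflow starts when either the batch size is reached or the
    window (in seconds, measured from the first event in the batch)
    elapses, whichever comes first. Both conditions reset after each
    start. Illustration only; conditions are checked only when an event
    arrives, which is a simplification.
    """

    def __init__(self, batch_size=1, batch_window=900):
        self.batch_size = batch_size
        self.batch_window = batch_window
        self.events = []
        self.window_start = None

    def on_event(self, timestamp):
        if self.window_start is None:
            self.window_start = timestamp  # window starts at the first event
        self.events.append(timestamp)
        if (len(self.events) >= self.batch_size
                or timestamp - self.window_start >= self.batch_window):
            self.events = []          # conditions reset once the workflow starts
            self.window_start = None
            return "start workflow"
        return None


# Batch size 3, window 300 seconds: the third event starts the workflow
trigger = BatchTrigger(batch_size=3, batch_window=300)
print([trigger.on_event(t) for t in (0, 10, 20)])
```

With the same trigger, a second event arriving 400 seconds after the first would start the workflow on the window condition instead, even though only two events had arrived.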

Regardless of how a workflow is started, you can specify the maximum number of concurrent workflow runs when you create the workflow.

If an event or batch of events starts a workflow run that eventually fails, that event or batch of events is no longer considered for starting a workflow run. A new workflow run is started only when the next event or batch of events arrives.

**Important**  
Limit the total number of jobs, crawlers, and triggers within a workflow to 100 or less. If you include more than 100, you might get errors when trying to resume or stop workflow runs.

A workflow run will not be started if it would exceed the concurrency limit set for the workflow, even though the event condition is met. It's advisable to adjust workflow concurrency limits based on the expected event volume. AWS Glue does not retry workflow runs that fail due to exceeded concurrency limits. Likewise, it's advisable to adjust concurrency limits for jobs and crawlers within workflows based on expected event volume.

**Workflow run properties**  
To share and manage state throughout a workflow run, you can define default workflow run properties. These properties, which are name/value pairs, are available to all the jobs in the workflow. Using the AWS Glue API, jobs can retrieve the workflow run properties and modify them for jobs that come later in the workflow.
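As a sketch of this pattern, the following function reads a run property, updates it, and writes it back so that later jobs in the workflow see the new value. It uses the shape of the `GetWorkflowRunProperties` and `PutWorkflowRunProperties` API operations, but a stub client stands in for `boto3.client("glue")` so the example is self-contained; the property name is hypothetical, and in a real job you would obtain the workflow name and run ID from the job's arguments.

```python
def bump_processed_count(glue, workflow_name, run_id):
    """Read a run property, increment it, and write it back so that
    jobs later in the workflow see the updated value."""
    response = glue.get_workflow_run_properties(Name=workflow_name, RunId=run_id)
    props = response["RunProperties"]
    # Run properties are name/value string pairs, so convert to int and back
    props["processed_count"] = str(int(props.get("processed_count", "0")) + 1)
    glue.put_workflow_run_properties(Name=workflow_name, RunId=run_id,
                                     RunProperties=props)
    return props


class StubGlueClient:
    """In-memory stand-in for boto3.client('glue'), for illustration only."""

    def __init__(self):
        self.props = {"processed_count": "2"}

    def get_workflow_run_properties(self, Name, RunId):
        return {"RunProperties": dict(self.props)}

    def put_workflow_run_properties(self, Name, RunId, RunProperties):
        self.props = dict(RunProperties)


stub = StubGlueClient()
print(bump_processed_count(stub, "my-workflow", "run-1"))  # {'processed_count': '3'}
```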

**Workflow graph**  
The following image shows the graph of a very basic workflow on the AWS Glue console. Your workflow could have dozens of components.

![\[Console screenshot that shows the Graph tab of a workflow. The graph contains five icons that represent a schedule trigger, two jobs, an event success trigger, and a crawler that updates the schema.\]](http://docs.aws.amazon.com/glue/latest/dg/images/graph-complete-with-tabs.png)


This workflow is started by a schedule trigger, `Month-close1`, which starts two jobs, `De-duplicate` and `Fix phone numbers`. Upon successful completion of both jobs, an event trigger, `Fix/De-dupe succeeded`, starts a crawler, `Update schema`.

**Static and dynamic workflow views**  
For each workflow, there is the notion of *static view* and *dynamic view*. The static view indicates the design of the workflow. The dynamic view is a runtime view that includes the latest run information for each of the jobs and crawlers. Run information includes success status and error details. 

When a workflow is running, the console displays the dynamic view, graphically indicating the jobs that have completed and that are yet to be run. You can also retrieve a dynamic view of a running workflow using the AWS Glue API. For more information, see [Querying workflows using the AWS Glue API](workflows_api_concepts.md).

**See also**  
[Creating a workflow from a blueprint in AWS Glue](creating_workflow_blueprint.md)
[Creating and building out a workflow manually in AWS Glue](creating_running_workflows.md)
[Workflows](aws-glue-api-workflow.md) (for the workflows API)

# Creating and building out a workflow manually in AWS Glue
<a name="creating_running_workflows"></a>

You can use the AWS Glue console to manually create and build out a workflow one node at a time.

A workflow contains jobs, crawlers, and triggers. Before manually creating a workflow, create the jobs and crawlers that the workflow is to include. It's best to specify run-on-demand crawlers for workflows. You can create new triggers while you are building out your workflow, or you can *clone* existing triggers into the workflow. When you clone a trigger, all the catalog objects associated with the trigger—the jobs or crawlers that fire it and the jobs or crawlers that it starts—are added to the workflow.

**Important**  
Limit the total number of jobs, crawlers, and triggers within a workflow to 100 or less. If you include more than 100, you might get errors when trying to resume or stop workflow runs.

You build out your workflow by adding triggers to the workflow graph, and defining the watched events and actions for each trigger. You begin with a *start trigger*, which can be either an on-demand or schedule trigger, and complete the graph by adding event (conditional) triggers.

## Step 1: Create the workflow
<a name="workflow-step1"></a>

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. In the navigation pane, under **ETL**, choose **Workflows**.

1. Choose **Add workflow** and complete the **Add a new ETL workflow** form.

   Any optional default run properties that you add are made available as arguments to all jobs in the workflow. For more information, see [Getting and setting workflow run properties in AWS Glue](workflow-run-properties-code.md).

1. Choose **Add workflow**.

   The new workflow appears in the list on the **Workflows** page.

## Step 2: Add a start trigger
<a name="workflow-step2"></a>

1. On the **Workflows** page, select your new workflow. Then, at the bottom of the page, ensure that the **Graph** tab is selected.

1. Choose **Add trigger**, and in the **Add trigger** dialog box, do one of the following:
   + Choose **Clone existing**, and choose a trigger to clone. Then choose **Add**.

     The trigger appears on the graph, along with the jobs and crawlers that it watches and the jobs and crawlers that it starts.

     If you mistakenly selected the wrong trigger, select the trigger on the graph, and then choose **Remove**.
   + Choose **Add new**, and complete the **Add trigger** form.

     1. For **Trigger type**, select **Schedule**, **On demand**, or **EventBridge event**.

        For trigger type **Schedule**, choose one of the **Frequency** options. Choose **Custom** to enter a `cron` expression.

        For trigger type **EventBridge event**, enter **Number of events** (batch size), and optionally enter **Time delay** (batch window). If you omit **Time delay**, the batch window defaults to 15 minutes. For more information, see [Overview of workflows in AWS Glue](workflows_overview.md).

     1. Choose **Add**.

     The trigger appears on the graph, along with a placeholder node (labeled **Add node**). In the example below, the start trigger is a schedule trigger named `Month-close1`. 

     At this point, the trigger isn't saved yet.  
![\[A graph with two rectangular nodes: a trigger, and a placeholder node. An arrow points from the trigger node to the placeholder node.\]](http://docs.aws.amazon.com/glue/latest/dg/images/graph-start-trigger.png)

1. If you added a new trigger, complete these steps:

   1. Do one of the following:
      + Choose the placeholder node (**Add node**).
      + Ensure that the start trigger is selected, and on the **Action** menu above the graph, choose **Add jobs/crawlers to trigger**.

   1. In the **Add job(s) and crawler(s) to trigger** dialog box, select one or more jobs or crawlers, and then choose **Add**.

      The trigger is saved, and the selected jobs or crawlers appear on the graph with connectors from the trigger.

      If you mistakenly added the wrong jobs or crawlers, you can select either the trigger or a connector and choose **Remove**.

## Step 3: Add more triggers
<a name="workflow-step3"></a>

Continue to build out your workflow by adding more triggers of type **Event**. To zoom in or out, or to enlarge the graph canvas, use the icons to the right of the graph. For each trigger that you want to add, complete the following steps:

**Note**  
There is no action to save the workflow. After you add your last trigger and assign actions to the trigger, the workflow is complete and saved. You can always come back later and add more nodes.

1. Do one of the following:
   + To clone an existing trigger, ensure that no node on the graph is selected, and on the **Action** menu, choose **Add trigger**.
   + To add a new trigger that watches a particular job or crawler on the graph, select the job or crawler node, and then choose the **Add trigger** placeholder node.

     You can add more jobs or crawlers to watch for this trigger in a later step.

1. In the **Add trigger** dialog box, do one of the following:
   + Choose **Add new**, and complete the **Add trigger** form. Then choose **Add**.

     The trigger appears on the graph. You will complete the trigger in a later step.
   + Choose **Clone existing**, and choose a trigger to clone. Then choose **Add**.

     The trigger appears on the graph, along with the jobs and crawlers that it watches and the jobs and crawlers that it starts.

     If you mistakenly chose the wrong trigger, select the trigger on the graph, and then choose **Remove**.

1. If you added a new trigger, complete these steps:

   1. Select the new trigger.

      As the following graph shows, the trigger `De-dupe/fix succeeded` is selected, and placeholder nodes appear for (1) events to watch and (2) actions.  
![\[A graph with many nodes, two of which are placeholder nodes that are called out as numbers 1 and 2.\]](http://docs.aws.amazon.com/glue/latest/dg/images/graph-dual-placeholders.png)

   1. (Optional if the trigger already watches an event and you want to add more jobs or crawlers to watch.) Choose the events-to-watch placeholder node, and in the **Add job(s) and crawler(s) to watch** dialog box, select one or more jobs or crawlers. Choose an event to watch (SUCCEEDED, FAILED, etc.), and choose **Add**.

   1. Ensure that the trigger is selected, and choose the actions placeholder node.

   1. In the **Add job(s) and crawler(s) to trigger** dialog box, select one or more jobs or crawlers, and choose **Add**.

      The selected jobs and crawlers appear on the graph, with connectors from the trigger.

For more information on workflows and blueprints, see the following topics.
+ [Overview of workflows in AWS Glue](workflows_overview.md)
+ [Running and monitoring a workflow in AWS Glue](running_monitoring_workflow.md)
+ [Creating a workflow from a blueprint in AWS Glue](creating_workflow_blueprint.md)

# Starting an AWS Glue workflow with an Amazon EventBridge event
<a name="starting-workflow-eventbridge"></a>

Amazon EventBridge, also known as CloudWatch Events, enables you to automate your AWS services and respond automatically to system events such as application availability issues or resource changes. Events from AWS services are delivered to EventBridge in near real time. You can write simple rules to indicate which events are of interest to you, and what automated actions to take when an event matches a rule.

With EventBridge support, AWS Glue can serve as an event producer and consumer in an event-driven architecture. For workflows, AWS Glue supports any type of EventBridge event as a consumer. The likely most common use case is the arrival of a new object in an Amazon S3 bucket. If you have data arriving in irregular or undefined intervals, you can process this data as close to its arrival as possible.

**Note**  
AWS Glue does not provide guaranteed delivery of EventBridge messages. AWS Glue performs no deduplication if EventBridge delivers duplicate messages. You must manage idempotency based on your use case.  
Be sure to configure EventBridge rules correctly to avoid sending unwanted events.

**Before you begin**  
If you want to start a workflow with Amazon S3 data events, you must ensure that events for the S3 bucket of interest are logged to AWS CloudTrail and EventBridge. To do so, you must create a CloudTrail trail. For more information, see [Creating a trail for your AWS account](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-create-and-update-a-trail.html).

**To start a workflow with an EventBridge event**
**Note**  
In the following commands, replace:  
*<workflow-name>* with the name to assign to the workflow.
*<trigger-name>* with the name to assign to the trigger.
*<bucket-name>* with the name of the Amazon S3 bucket.
*<account-id>* with a valid AWS account ID.
*<region>* with the name of the Region (for example, `us-east-1`).
*<rule-name>* with the name to assign to the EventBridge rule.

1. Ensure that you have AWS Identity and Access Management (IAM) permissions to create and view EventBridge rules and targets. The following is a sample policy that you can attach. You might want to scope it down to put limits on the operations and resources.

------
#### [ JSON ]

****  

   ```
   {
     "Version": "2012-10-17",
     "Statement": [
       {
         "Effect": "Allow",
         "Action": [
           "events:PutRule",
           "events:DisableRule",
           "events:DeleteRule",
           "events:PutTargets",
           "events:RemoveTargets",
           "events:EnableRule",
           "events:List*",
           "events:Describe*"
         ],
         "Resource": "*"
       }
     ]
   }
   ```

------

1. Create an IAM role that the EventBridge service can assume when passing an event to AWS Glue.

   1. On the **Create role** page of the IAM console, choose **AWS Service**. Then choose the service **CloudWatch Events**.

   1. Complete the **Create role** wizard. The wizard automatically attaches the `CloudWatchEventsBuiltInTargetExecutionAccess` and `CloudWatchEventsInvocationAccess` policies.

   1. Attach the following inline policy to the role. This policy allows the EventBridge service to direct events to AWS Glue.

------
#### [ JSON ]

****  

      ```
      {
        "Version": "2012-10-17",
        "Statement": [
          {
            "Effect": "Allow",
            "Action": [
              "glue:notifyEvent"
            ],
            "Resource": [
              "arn:aws:glue:us-east-1:111122223333:workflow/workflow-name"
            ]
          }
        ]
      }
      ```

------

1. Enter the following command to create the workflow.

   See [create-workflow](https://docs.aws.amazon.com/cli/latest/reference/glue/create-workflow.html) in the *AWS CLI Command Reference* for information about additional optional command-line parameters.

   ```
   aws glue create-workflow --name <workflow-name>
   ```

1. Enter the following command to create an EventBridge event trigger for the workflow. This will be the start trigger for the workflow. Replace *<actions>* with the actions to perform (the jobs and crawlers to start).

   See [create-trigger](https://docs.aws.amazon.com/cli/latest/reference/glue/create-trigger.html) in the *AWS CLI Command Reference* for information about how to code the `actions` argument.

   ```
   aws glue create-trigger --workflow-name <workflow-name> --type EVENT --name <trigger-name> --actions <actions>
   ```

   If you want the workflow to be triggered by a batch of events instead of a single EventBridge event, enter the following command instead.

   ```
   aws glue create-trigger --workflow-name <workflow-name> --type EVENT --name <trigger-name> --event-batching-condition BatchSize=<number-of-events>,BatchWindow=<seconds> --actions <actions>
   ```

   For the `event-batching-condition` argument, `BatchSize` is required and `BatchWindow` is optional. If `BatchWindow` is omitted, the window defaults to 900 seconds, which is the maximum window size.  
**Example**  

   The following example creates a trigger that starts the `eventtest` workflow after three EventBridge events have arrived, or five minutes after the first event arrives, whichever comes first.

   ```
   aws glue create-trigger --workflow-name eventtest --type EVENT --name objectArrival --event-batching-condition BatchSize=3,BatchWindow=300 --actions JobName=test1
   ```

1. Create a rule in Amazon EventBridge. 

   1. Create the JSON object for the rule details in your preferred text editor. 

      The following example specifies Amazon S3 as the event source, `PutObject` as the event name, and the bucket name as a request parameter. This rule starts a workflow when a new object arrives in the bucket.

      ```
      {
        "source": [
          "aws.s3"
        ],
        "detail-type": [
          "AWS API Call via CloudTrail"
        ],
        "detail": {
          "eventSource": [
            "s3.amazonaws.com"
          ],
          "eventName": [
            "PutObject"
          ],
          "requestParameters": {
            "bucketName": [
              "<bucket-name>"
            ]
          }
        }
      }
      ```

      To start the workflow when a new object arrives in a folder within the bucket, you can substitute the following code for `requestParameters`.

      ```
          "requestParameters": {
            "bucketName": [
              "<bucket-name>"
            ]
            "key" : [{ "prefix" : "<folder1>/<folder2>/*"}}]
        }
      ```

   1. Use your preferred tool to convert the rule JSON object to an escaped string.

      ```
      {\n  \"source\": [\n    \"aws.s3\"\n  ],\n  \"detail-type\": [\n    \"AWS API Call via CloudTrail\"\n  ],\n  \"detail\": {\n    \"eventSource\": [\n      \"s3.amazonaws.com\"\n    ],\n    \"eventName\": [\n      \"PutObject\"\n    ],\n    \"requestParameters\": {\n      \"bucketName\": [\n        \"<bucket-name>\"\n      ]\n    }\n  }\n}
      ```
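      If you prefer to script this step, the standard `json` module can produce the escaped string: serializing the rule object once yields the JSON text, and serializing that text again escapes the quotation marks. The rule object below repeats the example pattern; the bucket name is a placeholder.

      ```
      import json

      rule = {
          "source": ["aws.s3"],
          "detail-type": ["AWS API Call via CloudTrail"],
          "detail": {
              "eventSource": ["s3.amazonaws.com"],
              "eventName": ["PutObject"],
              "requestParameters": {"bucketName": ["<bucket-name>"]},
          },
      }

      # The first dumps produces the JSON text; the second dumps escapes it.
      # The surrounding quotation marks in the output are the quotes that the
      # EventPattern field value needs anyway.
      escaped = json.dumps(json.dumps(rule, indent=2))
      print(escaped)
      ```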

   1. Run the following command to create a JSON parameter template that you can edit to specify input parameters to a subsequent `put-rule` command. Save the output in a file. In this example, the file is called `ruleCommand`.

      ```
      aws events put-rule --name <rule-name> --generate-cli-skeleton >ruleCommand
      ```

      For more information about the `--generate-cli-skeleton` parameter, see [Generating AWS CLI skeleton and input parameters from a JSON or YAML input file](https://docs.aws.amazon.com/cli/latest/userguide/cli-usage-skeleton.html) in the *AWS Command Line Interface User Guide*.

      The output file should look like the following.

      ```
      {
          "Name": "",
          "ScheduleExpression": "",
          "EventPattern": "",
          "State": "ENABLED",
          "Description": "",
          "RoleArn": "",
          "Tags": [
              {
                  "Key": "",
                  "Value": ""
              }
          ],
          "EventBusName": ""
      }
      ```

   1. Edit the file to remove any optional parameters that you don't need and to specify, at a minimum, the `Name`, `EventPattern`, and `State` parameters. For the `EventPattern` parameter, provide the escaped string for the rule details that you created in a previous step. 

      ```
      {
          "Name": "<rule-name>",
          "EventPattern": "{\n  \"source\": [\n    \"aws.s3\"\n  ],\n  \"detail-type\": [\n    \"AWS API Call via CloudTrail\"\n  ],\n  \"detail\": {\n    \"eventSource\": [\n      \"s3.amazonaws.com\"\n    ],\n    \"eventName\": [\n      \"PutObject\"\n    ],\n    \"requestParameters\": {\n      \"bucketName\": [\n        \"<bucket-name>\"\n      ]\n    }\n  }\n}",
          "State": "DISABLED",
          "Description": "Start an AWS Glue workflow upon new file arrival in an Amazon S3 bucket"
      }
      ```
**Note**  
It is best to leave the rule disabled until you finish building out the workflow.

   1. Enter the following `put-rule` command, which reads input parameters from the file `ruleCommand`.

      ```
      aws events put-rule --name <rule-name> --cli-input-json file://ruleCommand
      ```

      The following output indicates success.

      ```
      {
          "RuleArn": "<rule-arn>"
      }
      ```

1. Enter the following command to attach the rule to a target. The target is the workflow in AWS Glue. Replace *<role-name>* with the role that you created at the beginning of this procedure.

   ```
   aws events put-targets --rule <rule-name> --targets "Id"="1","Arn"="arn:aws:glue:<region>:<account-id>:workflow/<workflow-name>","RoleArn"="arn:aws:iam::<account-id>:role/<role-name>" --region <region>
   ```

   The following output indicates success.

   ```
   {
       "FailedEntryCount": 0,
       "FailedEntries": []
   }
   ```

1. Confirm successful connection of the rule and target by entering the following command.

   ```
   aws events list-rule-names-by-target --target-arn arn:aws:glue:<region>:<account-id>:workflow/<workflow-name>
   ```

   The following output indicates success, where *<rule-name>* is the name of the rule that you created.

   ```
   {
       "RuleNames": [
           "<rule-name>"
       ]
   }
   ```

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. Select the workflow, and verify that the start trigger and its actions (the jobs or crawlers that it starts) appear on the workflow graph. Then continue with the procedure in [Step 3: Add more triggers](creating_running_workflows.md#workflow-step3), or add more components to the workflow by using the AWS Glue API or AWS Command Line Interface.

1. When the workflow is completely specified, enable the rule.

   ```
   aws events enable-rule --name <rule-name>
   ```

   The workflow is now ready to be started by an EventBridge event or event batch.

**See also**  
[https://docs.aws.amazon.com/eventbridge/latest/userguide/what-is-amazon-eventbridge.html](https://docs.aws.amazon.com/eventbridge/latest/userguide/what-is-amazon-eventbridge.html)
[Overview of workflows in AWS Glue](workflows_overview.md)
[Creating and building out a workflow manually in AWS Glue](creating_running_workflows.md)

# Viewing the EventBridge events that started a workflow
<a name="viewing-start-event-info"></a>

You can view the event ID of the Amazon EventBridge event that started your workflow. If your workflow was started by a batch of events, you can view the event IDs of all events in the batch.

For workflows with a batch size greater than one, you can also see which batch condition started the workflow: the arrival of the number of events in the batch size, or the expiration of the batch window.

**To view the EventBridge events that started a workflow (console)**

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. In the navigation pane, choose **Workflows**.

1. Select the workflow. Then at the bottom, choose the **History** tab.

1. Select a workflow run, and then choose **View run details**.

1. On the run details page, locate the **Run properties** field, and look for the **aws:eventIds** key.

   The value for that key is a list of EventBridge event IDs.

**To view the EventBridge events that started a workflow (AWS API)**
+ Include the following code in your Python script.

  ```
  workflow_params = glue_client.get_workflow_run_properties(Name=workflow_name,RunId=workflow_run_id)
  batched_events = workflow_params['aws:eventIds']
  ```

  `batched_events` will be a list of strings, where each string is an event ID.

**See also**  
[https://docs.aws.amazon.com/eventbridge/latest/userguide/what-is-amazon-eventbridge.html](https://docs.aws.amazon.com/eventbridge/latest/userguide/what-is-amazon-eventbridge.html)
[Overview of workflows in AWS Glue](workflows_overview.md)

# Running and monitoring a workflow in AWS Glue
<a name="running_monitoring_workflow"></a>

If the start trigger for a workflow is an on-demand trigger, you can start the workflow from the AWS Glue console. Complete the following steps to run and monitor a workflow. If the workflow fails, you can view the run graph to determine the node that failed. To help troubleshoot, if the workflow was created from a blueprint, you can view the blueprint run to see the blueprint parameter values that were used to create the workflow. For more information, see [Viewing blueprint runs in AWS Glue](viewing_blueprint_runs.md).

You can run and monitor a workflow by using the AWS Glue console, API, or AWS Command Line Interface (AWS CLI).

**To run and monitor a workflow (console)**

1. Open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. In the navigation pane, under **ETL**, choose **Workflows**.

1. Select a workflow. On the **Actions** menu, choose **Run**.

1. Check the **Last run status** column in the workflows list. Choose the refresh button to view ongoing workflow status.

1. While the workflow is running or after it has completed (or failed), view the run details by completing the following steps.

   1. Ensure that the workflow is selected, and choose the **History** tab.

   1. Choose the current or most recent workflow run, and then choose **View run details**.

      The workflow runtime graph shows the current run status.

   1. Choose any node in the graph to view details and status of the node.  
![\[The run graph shows a start trigger, which starts a job. Another trigger watches for job completion. The job node (a rectangle that encloses a clipboard icon and a job name) is selected, and the job details are shown in a pane at the right. The job details include job run ID and status.\]](http://docs.aws.amazon.com/glue/latest/dg/images/workflow-pre-select-resume.png)

**To run and monitor a workflow (AWS CLI)**

1. Enter the following command. Replace *<workflow-name>* with the workflow to run.

   ```
   aws glue start-workflow-run --name <workflow-name>
   ```

   If the workflow is successfully started, the command returns the run ID.

1. View workflow run status by using the `get-workflow-run` command. Supply the workflow name and run ID.

   ```
   aws glue get-workflow-run --name myWorkflow --run-id wr_d2af14217e8eae775ba7b1fc6fc7a42c795aed3cbcd8763f9415452e2dbc8705
   ```

   The following is sample command output.

   ```
   {
       "Run": {
           "Name": "myWorkflow",
           "WorkflowRunId": "wr_d2af14217e8eae775ba7b1fc6fc7a42c795aed3cbcd8763f9415452e2dbc8705",
           "WorkflowRunProperties": {
               "run_state": "COMPLETED",
               "unique_id": "fee63f30-c512-4742-a9b1-7c8183bdaae2"
           },
           "StartedOn": 1578556843.049,
           "CompletedOn": 1578558649.928,
           "Status": "COMPLETED",
           "Statistics": {
               "TotalActions": 11,
               "TimeoutActions": 0,
               "FailedActions": 0,
               "StoppedActions": 0,
               "SucceededActions": 9,
               "RunningActions": 0,
               "ErroredActions": 0
           }
       }
   }
   ```

**See also:**  
[Overview of workflows in AWS Glue](workflows_overview.md)
[Overview of blueprints in AWS Glue](blueprints-overview.md)

# Stopping a workflow run
<a name="workflow-stopping"></a>

You can use the AWS Glue console, AWS Command Line Interface (AWS CLI) or AWS Glue API to stop a workflow run. When you stop a workflow run, all running jobs and crawlers are immediately terminated, and jobs and crawlers that are not yet started never start. It might take up to a minute for all running jobs and crawlers to stop. The workflow run status goes from **Running** to **Stopping**, and when the workflow run is completely stopped, the status goes to **Stopped**.

After the workflow run is stopped, you can view the run graph to see which jobs and crawlers completed and which never started. You can then determine if you must perform any steps to ensure data integrity. Stopping a workflow run causes no automatic rollback operations to be performed.

**To stop a workflow run (console)**

1. Open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. In the navigation pane, under **ETL**, choose **Workflows**.

1. Choose a running workflow, and then choose the **History** tab.

1. Choose the workflow run, and then choose **Stop run**.

   The run status changes to **Stopping**.

1. (Optional) Choose the workflow run, choose **View run details**, and review the run graph.

**To stop a workflow run (AWS CLI)**
+ Enter the following command. Replace *<workflow-name>* with the name of the workflow and *<run-id>* with the run ID of the workflow run to stop.

  ```
  aws glue stop-workflow-run --name <workflow-name> --run-id <run-id>
  ```

  The following is an example of the **stop-workflow-run** command.

  ```
  aws glue stop-workflow-run --name my-workflow --run-id wr_137b88917411d128081069901e4a80595d97f719282094b7f271d09576770354
  ```

# Repairing and resuming a workflow run
<a name="resuming-workflow"></a>

If one or more nodes (jobs or crawlers) in a workflow don't complete successfully, the workflow ran only partially. After you find the root causes and make corrections, you can select one or more nodes to resume the workflow run from, and then resume the run. The selected nodes and all nodes downstream of them are then run.

**Topics**
+ [Resuming a workflow run: How it works](#resume-workflow-howitworks)
+ [Resuming a workflow run](#how-to-resume-workflow)
+ [Notes and limitations for resuming workflow runs](#resume-workflow-notes)

## Resuming a workflow run: How it works
<a name="resume-workflow-howitworks"></a>

Consider the workflow W1 in the following diagram.

![\[Triggers are shown in rectangles and jobs are shown in circles. Trigger T1 at the left starts the workflow by running job J1. Subsequent triggers and jobs exist, but jobs J2 and J3 fail, so downstream triggers and jobs are shown as not run.\]](http://docs.aws.amazon.com/glue/latest/dg/images/workflow_W1.png)


The workflow run proceeds as follows:

1. Trigger T1 starts job J1.

1. Successful completion of J1 fires triggers T2 and T3, which run jobs J2 and J3, respectively.

1. Jobs J2 and J3 fail.

1. Triggers T4 and T5 depend on the successful completion of J2 and J3, so they don't fire, and jobs J4 and J5 don't run. Workflow W1 is only partially run.

Now assume that the issues that caused J2 and J3 to fail are corrected. J2 and J3 are selected as the starting points to resume the workflow run from.

![\[Jobs J2 and J3 are flagged as nodes to be resumed. Downstream triggers and jobs are shown as successfully run.\]](http://docs.aws.amazon.com/glue/latest/dg/images/workflow_W1_resumed.png)


The workflow run resumes as follows:

1. Jobs J2 and J3 run successfully.

1. Triggers T4 and T5 fire.

1. Jobs J4 and J5 run successfully.

The resumed workflow run is tracked as a separate workflow run with a new run ID. When you view the workflow history, you can view the previous run ID for any workflow run. In the example in the following screenshot, the workflow run with run ID `wr_c7a22...` (the second row) had a node that did not complete. The user fixed the problem and resumed the workflow run, which resulted in run ID `wr_a07e55...` (the first row).

![\[A table under the History tab for a workflow contains two rows, one for each workflow run. The first row has both a run ID and previous run ID. The second row has only a run ID. The previous run ID in the first row is the same as the run ID in the 2nd row.\]](http://docs.aws.amazon.com/glue/latest/dg/images/previous-run-id.png)


**Note**  
For the rest of this discussion, the term "resumed workflow run" refers to the workflow run that was created when the previous workflow run was resumed. The "original workflow run" refers to the workflow run that only partially ran and that needed to be resumed.

**Resumed workflow run graph**  
In a resumed workflow run, although only a subset of nodes are run, the run graph is a complete graph. That is, the nodes that didn't run in the resumed workflow are copied from the run graph of the original workflow run. Copied job and crawler nodes that ran in the original workflow run include run details.

Consider again the workflow W1 in the previous diagram. When the workflow run is resumed starting with J2 and J3, the run graph for the resumed workflow run shows all jobs, J1 though J5, and all triggers, T1 through T5. The run details for J1 are copied from the original workflow run.

**Workflow run snapshots**  
When a workflow run is started, AWS Glue takes a snapshot of the workflow design graph at that point in time. That snapshot is used for the duration of the workflow run. If you make changes to any triggers after the run starts, those changes don't affect the current workflow run. Snapshots ensure that workflow runs proceed in a consistent manner.

Snapshots make only triggers immutable. Changes that you make to downstream jobs and crawlers during the workflow run take effect for the current run.

## Resuming a workflow run
<a name="how-to-resume-workflow"></a>

Follow these steps to resume a workflow run. You can resume a workflow run by using the AWS Glue console, API, or AWS Command Line Interface (AWS CLI).

**To resume a workflow run (console)**

1. Open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

   Sign in as a user who has permissions to view workflows and resume workflow runs.
**Note**  
To resume workflow runs, you need the `glue:ResumeWorkflowRun` AWS Identity and Access Management (IAM) permission.

1. In the navigation pane, choose **Workflows**.

1. Select a workflow, and then choose the **History** tab.

1. Select the workflow run that only partially ran, and then choose **View run details**.

1. In the run graph, select the first (or only) node that you want to restart and that you want to resume the workflow run from.

1. In the details pane to the right of the graph, select the **Resume** check box.  
![\[The run graph shows three nodes, including a failed job node. The job details pane at the right includes a Resume check box.\]](http://docs.aws.amazon.com/glue/latest/dg/images/workflow-pre-select-resume.png)

   The node changes color and shows a small resume icon at the upper right.  
![\[The change to the run graph is described in the text. The Resume check box is selected.\]](http://docs.aws.amazon.com/glue/latest/dg/images/workflow-post-select-resume.png)

1. Complete the previous two steps for any additional nodes to restart.

1. Choose **Resume run**.

**To resume a workflow run (AWS CLI)**

1. Ensure that you have the `glue:ResumeWorkflowRun` IAM permission.

1. Retrieve the node IDs for the nodes that you want to restart.

   1.  Run the `get-workflow-run` command for the original workflow run. Supply the workflow name and run ID, and add the `--include-graph` option, as shown in the following example. Get the run ID from the **History** tab on the console, or by running the `get-workflow` command.

      ```
      aws glue get-workflow-run --name cloudtrailtest1 --run-id wr_a07e55f2087afdd415a404403f644a4265278f68b13ba3da08c71924ebe3c3a8 --include-graph
      ```

      The command returns the nodes and edges of the graph as a large JSON object.

   1. Locate the nodes of interest by the `Type` and `Name` properties of the node objects.

      The following is an example node object from the output.

      ```
      {
          "Type": "JOB",
          "Name": "test1_post_failure_4592978",
          "UniqueId": "wnode_d1b2563c503078b153142ee76ce545fe5ceef66e053628a786ddd74a05da86fd",
          "JobDetails": {
              "JobRuns": [
                  {
                      "Id": "jr_690b9f7fc5cb399204bc542c6c956f39934496a5d665a42de891e5b01f59e613",
                      "Attempt": 0,
                      "TriggerName": "test1_aggregate_failure_649b2432",
                      "JobName": "test1_post_failure_4592978",
                      "StartedOn": 1595358275.375,
                      "LastModifiedOn": 1595358298.785,
                      "CompletedOn": 1595358298.785,
                      "JobRunState": "FAILED",
                      "PredecessorRuns": [],
                      "AllocatedCapacity": 0,
                      "ExecutionTime": 16,
                      "Timeout": 2880,
                      "MaxCapacity": 0.0625,
                      "LogGroupName": "/aws-glue/python-jobs"
                  }
              ]
          }
      }
      ```

   1. Get the node ID from the `UniqueId` property of the node object.

1. Run the `resume-workflow-run` command. Provide the workflow name, run ID, and list of node IDs separated by spaces, as shown in the following example.

   ```
   aws glue resume-workflow-run --name cloudtrailtest1 --run-id wr_a07e55f2087afdd415a404403f644a4265278f68b13ba3da08c71924ebe3c3a8 --node-ids wnode_ca1f63e918fb855e063aed2f42ec5762ccf71b80082ae2eb5daeb8052442f2f3  wnode_d1b2563c503078b153142ee76ce545fe5ceef66e053628a786ddd74a05da86fd
   ```

   The command outputs the run ID of the resumed (new) workflow run and a list of nodes that will be started.

   ```
   {
       "RunId": "wr_2ada0d3209a262fc1156e4291134b3bd643491bcfb0ceead30bd3e4efac24de9",
       "NodeIds": [
           "wnode_ca1f63e918fb855e063aed2f42ec5762ccf71b80082ae2eb5daeb8052442f2f3"
       ]
   }
   ```

   Note that although the example `resume-workflow-run` command listed two nodes to restart, the example output indicated that only one node would be restarted. This is because one node was downstream of the other node, and the downstream node would be restarted anyway by the normal flow of the workflow.

## Notes and limitations for resuming workflow runs
<a name="resume-workflow-notes"></a>

Keep the following notes and limitations in mind when resuming workflow runs.
+ You can resume a workflow run only if it's in the `COMPLETED` state.
**Note**  
Even if one or more nodes in a workflow run don't complete, the workflow run state is shown as `COMPLETED`. Be sure to check the run graph to discover any nodes that didn't complete successfully.
+ You can resume a workflow run from any job or crawler node that the original workflow run attempted to run. You can't resume a workflow run from a trigger node.
+ Restarting a node does not reset its state. Any data that was partially processed is not rolled back.
+ You can resume a failed workflow run multiple times. However, a resumed run can be resumed only once more. For additional retries, resume the original failed run instead.
+ If you select two nodes to restart and they're dependent upon each other, the upstream node is run before the downstream node. In fact, selecting the downstream node is redundant, because it will be run according to the normal flow of the workflow.

# Getting and setting workflow run properties in AWS Glue
<a name="workflow-run-properties-code"></a>

Use workflow run properties to share and manage state among the jobs in your AWS Glue workflow. You can set default run properties when you create the workflow. Then, as your jobs run, they can retrieve the run property values and optionally modify them for input to jobs that are later in the workflow. When a job modifies a run property, the new value exists only for the workflow run. The default run properties aren't affected.

If your AWS Glue job is not part of a workflow, these properties will not be set.

The following sample Python code from an extract, transform, and load (ETL) job demonstrates how to get the workflow run properties.

```
import sys
import boto3
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_client = boto3.client("glue")
args = getResolvedOptions(sys.argv, ['JOB_NAME','WORKFLOW_NAME', 'WORKFLOW_RUN_ID'])
workflow_name = args['WORKFLOW_NAME']
workflow_run_id = args['WORKFLOW_RUN_ID']
workflow_params = glue_client.get_workflow_run_properties(Name=workflow_name,
                                        RunId=workflow_run_id)["RunProperties"]

target_database = workflow_params['target_database']
target_s3_location = workflow_params['target_s3_location']
```

The following code continues by setting the `target_format` run property to `'csv'`.

```
workflow_params['target_format'] = 'csv'
glue_client.put_workflow_run_properties(Name=workflow_name, RunId=workflow_run_id, RunProperties=workflow_params)
```
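Conceptually, run properties behave like a dictionary that jobs later in the workflow can read and update: values written during the run shadow the workflow's defaults without changing them. A plain-Python sketch of those merge semantics (the property names are hypothetical):

```
default_props = {"target_database": "analytics", "target_format": "parquet"}
run_overrides = {"target_format": "csv"}

# Properties written during the run take precedence over the defaults.
effective = {**default_props, **run_overrides}
print(effective["target_format"])  # csv
```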

For more information, see the following: 
+ [GetWorkflowRunProperties action (Python: get\_workflow\_run\_properties)](aws-glue-api-workflow.md#aws-glue-api-workflow-GetWorkflowRunProperties)
+ [PutWorkflowRunProperties action (Python: put\_workflow\_run\_properties)](aws-glue-api-workflow.md#aws-glue-api-workflow-PutWorkflowRunProperties)

# Querying workflows using the AWS Glue API
<a name="workflows_api_concepts"></a>

AWS Glue provides a rich API for managing workflows. You can retrieve a static view of a workflow or a dynamic view of a running workflow using the AWS Glue API. For more information, see [Workflows](aws-glue-api-workflow.md).

**Topics**
+ [Querying static views](#workflows_api_concepts_static)
+ [Querying dynamic views](#workflows_api_concepts_dynamic)

## Querying static views
<a name="workflows_api_concepts_static"></a>

Use the `GetWorkflow` API operation to get a static view that indicates the design of a workflow. This operation returns a directed graph consisting of nodes and edges, where a node represents a trigger, a job, or a crawler. Edges define the relationships between nodes. They are represented by connectors (arrows) on the graph in the AWS Glue console. 

You can also use this operation with popular graph-processing libraries such as NetworkX, igraph, JGraphT, and the Java Universal Network/Graph (JUNG) Framework. Because all these libraries represent graphs similarly, minimal transformations are needed.

The static view returned by this API is the most up-to-date view according to the latest definition of triggers associated with the workflow.

### Graph definition
<a name="workflows_api_concepts_static_graph"></a>

A workflow graph G is an ordered pair (N, E), where N is a set of nodes and E is a set of edges. A *node* is a vertex in the graph, identified by a unique number. A node can be of type trigger, job, or crawler. For example: `{name:T1, type:Trigger, uniqueId:1}, {name:J1, type:Job, uniqueId:2}`.

An *edge* is a 2-tuple of the form (`src, dest`), where `src` and `dest` are nodes and there is a directed edge from `src` to `dest`. 

### Example of querying a static view
<a name="workflows_api_concepts_static_example"></a>

Consider a conditional trigger T, which triggers job J2 upon completion of job J1. 

```
J1 ---> T ---> J2
```

Nodes: J1, T, J2 

Edges: (J1, T), (T, J2)
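Because the static view is just nodes and edges keyed by unique IDs, loading it into a graph structure is straightforward. The following sketch builds an adjacency list for the J1 ---> T ---> J2 example; the same dictionaries could be handed to a library such as NetworkX or igraph.

```
nodes = [
    {"name": "J1", "type": "Job", "uniqueId": 1},
    {"name": "T",  "type": "Trigger", "uniqueId": 2},
    {"name": "J2", "type": "Job", "uniqueId": 3},
]
edges = [(1, 2), (2, 3)]

# Adjacency list keyed by uniqueId.
adjacency = {n["uniqueId"]: [] for n in nodes}
for src, dest in edges:
    adjacency[src].append(dest)
```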

## Querying dynamic views
<a name="workflows_api_concepts_dynamic"></a>

Use the `GetWorkflowRun` API operation to get a dynamic view of a running workflow. This operation returns the same static view of the graph along with metadata related to the workflow run.

For a run, nodes representing jobs in the `GetWorkflowRun` call have a list of job runs initiated as part of the latest run of the workflow. You can use this list to display the run status of each job in the graph itself. For downstream dependencies that have not yet run, this field is set to `null`. The graph information shows you the current state of any workflow at any point in time.

The dynamic view returned by this API is based on the static view that was present when the workflow run was started.

*Runtime nodes example:* `{name:T1, type: Trigger, uniqueId:1}`, `{name:J1, type:Job, uniqueId:2, jobDetails:{jobRuns}}`, `{name:C1, type:Crawler, uniqueId:3, crawlerDetails:{crawls}}` 

### Example 1: Dynamic view
<a name="workflows_api_concepts_dynamic_examples"></a>

The following example illustrates a simple two-trigger workflow. 
+ Nodes: t1, j1, t2, j2 
+ Edges: (t1, j1), (j1, t2), (t2, j2)

The `GetWorkflow` response contains the following.

```
{
    Nodes : [
        {
            "type" : Trigger,
            "name" : "t1",
            "uniqueId" : 1
        },
        {
            "type" : Job,
            "name" : "j1",
            "uniqueId" : 2
        },
        {
            "type" : Trigger,
            "name" : "t2",
            "uniqueId" : 3
        },
        {
            "type" : Job,
            "name" : "j2",
            "uniqueId" : 4
        }
    ],
    Edges : [
        {
            "sourceId" : 1,
            "destinationId" : 2
        },
        {
            "sourceId" : 2,
            "destinationId" : 3
        },
        {
            "sourceId" : 3,
            "destinationId" : 4
        }
    ]
}
```

The `GetWorkflowRun` response contains the following.

```
{
    Nodes : [
        {
            "type" : Trigger,
            "name" : "t1",
            "uniqueId" : 1,
            "jobDetails" : null,
            "crawlerDetails" : null
        },
        {
            "type" : Job,
            "name" : "j1",
            "uniqueId" : 2,
            "jobDetails" : [
                {
                    "id" : "jr_12334",
                    "jobRunState" : "SUCCEEDED",
                    "errorMessage" : "error string"
                }
            ],
            "crawlerDetails" : null
        },
        {
            "type" : Trigger,
            "name" : "t2",
            "uniqueId" : 3,
            "jobDetails" : null,
            "crawlerDetails" : null
        },
        {
            "type" : Job,
            "name" : "j2",
            "uniqueId" : 4,
            "jobDetails" : [
                {
                    "id" : "jr_1233sdf4",
                    "jobRunState" : "SUCCEEDED",
                    "errorMessage" : "error string"
                }
            ],
            "crawlerDetails" : null
        }
    ],
    Edges : [
        {
            "sourceId" : 1,
            "destinationId" : 2
        },
        {
            "sourceId" : 2,
            "destinationId" : 3
        },
        {
            "sourceId" : 3,
            "destinationId" : 4
        }
    ]
}
```
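Because job nodes carry their run details, you can derive each job's status directly from the response. The following sketch (with the response abridged to the fields it uses) maps job names to run states, leaving `None` for jobs that haven't run:

```
response = {
    "Nodes": [
        {"type": "Trigger", "name": "t1", "uniqueId": 1, "jobDetails": None},
        {"type": "Job", "name": "j1", "uniqueId": 2,
         "jobDetails": [{"id": "jr_12334", "jobRunState": "SUCCEEDED"}]},
        {"type": "Trigger", "name": "t2", "uniqueId": 3, "jobDetails": None},
        {"type": "Job", "name": "j2", "uniqueId": 4, "jobDetails": None},
    ],
    "Edges": [{"sourceId": 1, "destinationId": 2},
              {"sourceId": 2, "destinationId": 3},
              {"sourceId": 3, "destinationId": 4}],
}

# Map each job to the state of its first run, or None if it hasn't run.
status = {
    n["name"]: (n["jobDetails"][0]["jobRunState"] if n["jobDetails"] else None)
    for n in response["Nodes"] if n["type"] == "Job"
}
print(status)  # {'j1': 'SUCCEEDED', 'j2': None}
```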

### Example 2: Multiple jobs with a conditional trigger
<a name="workflows_api_concepts_dynamic_example_2"></a>

The following example shows a workflow with multiple jobs and a conditional trigger (t3).

```
Consider Flow:
T(t1) ---> J(j1) ---> T(t2) ---> J(j2)
             |                    |
             |                    |
             +------> T(t3) <-----+
                        |
                        |
                      J(j3)

Graph generated:
Nodes: t1, t2, t3, j1, j2, j3
Edges: (t1, j1), (j1, t2), (t2, j2), (j1, t3), (j2, t3), (t3, j3)
```

# Blueprint and workflow restrictions in AWS Glue
<a name="blueprint_workflow_restrictions"></a>

The following are restrictions for blueprints and workflows.

## Blueprint restrictions
<a name="bluprint-restrictions"></a>

Keep the following blueprint restrictions in mind:
+ The blueprint must be registered in the same AWS Region where the Amazon S3 bucket resides.
+ To share blueprints across AWS accounts, you must grant read permissions on the blueprint ZIP archive in Amazon S3. Users who have read permission on a blueprint ZIP archive can register the blueprint in their AWS account and use it. 
+ The set of blueprint parameters is stored as a single JSON object. The maximum length of this object is 128 KB.
+ The maximum uncompressed size of the blueprint ZIP archive is 5 MB. The maximum compressed size is 1 MB.
+ Limit the total number of jobs, crawlers, and triggers within a workflow to 100 or less. If you include more than 100, you might get errors when trying to resume or stop workflow runs.

## Workflow restrictions
<a name="workflow-restrictions"></a>

Keep the following workflow restrictions in mind. Some of these notes apply mainly to users who create workflows manually.
+ The maximum batch size for an Amazon EventBridge event trigger is 100. The maximum window size is 900 seconds (15 minutes).
+ A trigger can be associated with only one workflow.
+ Only one starting trigger (on-demand or schedule) is permitted.
+ If a job or crawler in a workflow is started by a trigger that is outside the workflow, any triggers inside the workflow that depend on job or crawler completion (succeeded or otherwise) do not fire.
+ Similarly, if a job or crawler in a workflow has triggers that depend on job or crawler completion (succeeded or otherwise) both within the workflow and outside the workflow, and if the job or crawler is started from within a workflow, only the triggers inside the workflow fire upon job or crawler completion.

# Troubleshooting blueprint errors in AWS Glue
<a name="blueprint_workflow_troubleshoot"></a>

If you encounter errors when using AWS Glue blueprints, use the following solutions to help you find the source of the problems and fix them.

**Topics**
+ [Error: missing PySpark module](#blueprint-workflow-error-1)
+ [Error: missing blueprint config file](#blueprint-workflow-error-2)
+ [Error: missing imported file](#blueprint-workflow-error-3)
+ [Error: not authorized to perform iamPassRole on resource](#blueprint-workflow-error-4)
+ [Error: invalid cron schedule](#blueprint-workflow-error-5)
+ [Error: a trigger with the same name already exists](#blueprint-workflow-error-6)
+ [Error: workflow with name: foo already exists.](#blueprint-workflow-error-7)
+ [Error: module not found in specified layoutGenerator path](#blueprint-workflow-error-8)
+ [Error: validation error in Connections field](#blueprint-workflow-error-9)

## Error: missing PySpark module
<a name="blueprint-workflow-error-1"></a>

AWS Glue returns the error "Unknown error executing layout generator function ModuleNotFoundError: No module named 'pyspark'".

When you unzip the blueprint archive, its contents can look like either of the following:

```
$ unzip compaction.zip 
Archive:  compaction.zip
   creating: compaction/
  inflating: compaction/blueprint.cfg  
  inflating: compaction/layout.py    
  inflating: compaction/README.md    
  inflating: compaction/compaction.py   
  
$ unzip compaction.zip
Archive:  compaction.zip
  inflating: blueprint.cfg           
  inflating: compaction.py           
  inflating: layout.py               
  inflating: README.md
```

In the first case, all the files related to the blueprint were placed in a folder named `compaction`, which was then compressed into a ZIP file named *compaction.zip*.

In the second case, the files required for the blueprint were not placed in a folder, but were added at the root of the ZIP file *compaction.zip*.

Either format is allowed. However, make sure that `blueprint.cfg` contains the correct path to the function in the script that generates the layout.

**Examples**  
In case 1: `blueprint.cfg` should have `layoutGenerator` as the following:

```
"layoutGenerator": "compaction.layout.generate_layout"
```

In case 2: `blueprint.cfg` should have `layoutGenerator` as the following:

```
"layoutGenerator": "layout.generate_layout"
```

If this path is not specified correctly, you can see the error above. For example, if your archive has the structure described in case 2 but your `layoutGenerator` path follows case 1, you get the above error.
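
Before registering a blueprint, you can sanity-check the archive locally. The following sketch (file names are hypothetical, and it assumes `blueprint.cfg` is valid JSON) verifies that the `layoutGenerator` module path resolves to a `.py` file inside the ZIP:

```
import io
import json
import zipfile

def layout_module_exists(zip_bytes):
    """Return True if the layoutGenerator module path in blueprint.cfg
    resolves to a .py file inside the blueprint ZIP archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        # blueprint.cfg may sit at the root or inside a top-level folder
        cfg_name = next(n for n in zf.namelist() if n.endswith("blueprint.cfg"))
        cfg = json.loads(zf.read(cfg_name))
        # "compaction.layout.generate_layout" -> module file "compaction/layout.py"
        module_path = cfg["layoutGenerator"].rsplit(".", 1)[0]
        return module_path.replace(".", "/") + ".py" in zf.namelist()
```

Running this check before upload catches the case 1/case 2 mismatch described above.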

## Error: missing blueprint config file
<a name="blueprint-workflow-error-2"></a>

AWS Glue returns the error "Unknown error executing layout generator function FileNotFoundError: [Errno 2] No such file or directory: '/tmp/compaction/blueprint.cfg'".

The `blueprint.cfg` file must be placed at the root level of the ZIP archive or within a folder that has the same name as the ZIP archive.

When the blueprint ZIP archive is extracted, `blueprint.cfg` is expected to be found in one of the following paths. If it is not, you get the above error.

```
$ unzip compaction.zip 
Archive:  compaction.zip
   creating: compaction/
  inflating: compaction/blueprint.cfg  
  
$ unzip compaction.zip
Archive:  compaction.zip
  inflating: blueprint.cfg
```

## Error: missing imported file
<a name="blueprint-workflow-error-3"></a>

AWS Glue returns the error "Unknown error executing layout generator function FileNotFoundError: [Errno 2] No such file or directory: 'demo-project/foo.py'".

If your layout generation script reads other files, make sure that you give the full path of each file to be imported. For example, a `Conversion.py` script might be referenced in `Layout.py`. For more information, see [Sample blueprint project](https://docs.aws.amazon.com/glue/latest/dg/developing-blueprints-sample.html).
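
One robust pattern for reading a companion file (a sketch; the file name is hypothetical) is to resolve the path against the layout script itself instead of the process's current working directory:

```
import os

def read_companion_file(filename, base_dir=None):
    """Read a file shipped alongside the layout script, resolving it
    against the script's own directory rather than the current
    working directory."""
    if base_dir is None:
        base_dir = os.path.dirname(os.path.abspath(__file__))
    with open(os.path.join(base_dir, filename)) as f:
        return f.read()
```

This way the read works regardless of where the layout generator function is invoked from.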

## Error: not authorized to perform iamPassRole on resource
<a name="blueprint-workflow-error-4"></a>

AWS Glue returns the error "User: arn:aws:sts::123456789012:assumed-role/AWSGlueServiceRole/GlueSession is not authorized to perform: iam:PassRole on resource: arn:aws:iam::123456789012:role/AWSGlueServiceRole"

If the jobs and crawlers in the workflow assume the same role as the role passed to create workflow from the blueprint, then the blueprint role needs to include the `iam:PassRole` permission on itself.

If the jobs and crawlers in the workflow assume a role other than the role passed to create the entities of the workflow from the blueprint, then the blueprint role needs to include the `iam:PassRole` permission on that other role instead of on the blueprint role.
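
For example, if the workflow's jobs and crawlers assume the blueprint role itself, the blueprint role's policy would contain a self-referencing statement like the following sketch (the account ID and role name are placeholders):

```
{
  "Effect": "Allow",
  "Action": "iam:PassRole",
  "Resource": "arn:aws:iam::111122223333:role/blueprint-role"
}
```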

For more information, see [Permissions for blueprint roles](https://docs.aws.amazon.com/glue/latest/dg/blueprints-personas-permissions.html#blueprints-role-permissions).

## Error: invalid cron schedule
<a name="blueprint-workflow-error-5"></a>

AWS Glue returns the error "The schedule cron(0 0 * * * *) is invalid."

Provide a valid [cron](https://en.wikipedia.org/wiki/Cron) expression. For more information, see [Time-based schedules for jobs and crawlers](https://docs.aws.amazon.com/glue/latest/dg/monitor-data-warehouse-schedule.html).
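
Note that AWS Glue cron expressions have six fields (Minutes, Hours, Day-of-month, Month, Day-of-week, Year), and one of the two day fields must be a question mark (`?`). The following examples are illustrative:

```
cron(15 12 * * ? *)        Runs every day at 12:15 PM UTC
cron(0 10 ? * MON-FRI *)   Runs at 10:00 AM UTC every weekday
cron(0/15 * * * ? *)       Runs every 15 minutes
```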

## Error: a trigger with the same name already exists
<a name="blueprint-workflow-error-6"></a>

AWS Glue returns the error "Trigger with name 'foo\_starting\_trigger' already submitted with different configuration".

A blueprint does not require you to define triggers in the layout script for workflow creation. Trigger creation is managed by the blueprint library based on the dependencies defined between two actions. 

The triggers are named as follows:
+ For the starting trigger in the workflow, the name is `<workflow_name>_starting_trigger`.
+ For a node (job or crawler) in the workflow that depends on the completion of one or more upstream nodes, AWS Glue defines a trigger with the name `<workflow_name>_<node_name>_trigger`.

This error means that a trigger with the same name already exists. You can delete the existing trigger and rerun the workflow creation.
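
The naming scheme above can be sketched as a small helper, which is handy when scanning for leftover triggers after a deleted workflow (this mirrors the documented convention; it is not an AWS Glue API):

```
def generated_trigger_names(workflow_name, node_names):
    """Sketch of the blueprint trigger-naming scheme: one starting
    trigger per workflow, plus one trigger per dependent node."""
    names = ["{}_starting_trigger".format(workflow_name)]
    names += ["{}_{}_trigger".format(workflow_name, node) for node in node_names]
    return names
```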

**Note**  
Deleting a workflow doesn’t delete the nodes within the workflow, so triggers can be left behind after the workflow is gone. As a result, if you create a workflow, delete it, and then try to re-create it with the same name from the same blueprint, you might not receive a 'workflow already exists' error, but you might receive a 'trigger already exists' error.

## Error: workflow with name: foo already exists.
<a name="blueprint-workflow-error-7"></a>

The workflow name must be unique. Try again with a different name.

## Error: module not found in specified layoutGenerator path
<a name="blueprint-workflow-error-8"></a>

AWS Glue returns the error "Unknown error executing layout generator function ModuleNotFoundError: No module named 'crawl\_s3\_locations'".

```
"layoutGenerator": "crawl_s3_locations.layout.generate_layout"
```

For example, if you have the above layoutGenerator path, then when you unzip the blueprint archive, it needs to look like the following:

```
$ unzip crawl_s3_locations.zip 
Archive:  crawl_s3_locations.zip
   creating: crawl_s3_locations/
  inflating: crawl_s3_locations/blueprint.cfg  
  inflating: crawl_s3_locations/layout.py    
  inflating: crawl_s3_locations/README.md
```

If, when you unzip the archive, the contents instead look like the following, you can get the above error.

```
$ unzip crawl_s3_locations.zip
Archive:  crawl_s3_locations.zip
  inflating: blueprint.cfg           
  inflating: layout.py               
  inflating: README.md
```

There is no folder named `crawl_s3_locations`, so when the `layoutGenerator` path refers to the layout file through the `crawl_s3_locations` module, you get the above error.

## Error: validation error in Connections field
<a name="blueprint-workflow-error-9"></a>

AWS Glue returns the error "Unknown error executing layout generator function TypeError: Value ['foo'] for key Connections should be of type <class 'dict'>".

This is a validation error. The `Connections` field in the `Job` class expects a dictionary, but a list of values was provided, causing the error.

```
User input was a list of values:
Connections = ['string']

It should be a dict like the following:
Connections = {'Connections': ['string']}
```
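
A small pre-check in your layout script can catch the wrong shape before the blueprint run fails (a sketch only; the blueprint library's actual validation may differ):

```
def normalize_connections(value):
    """Accept either a bare list of connection names or the expected
    {'Connections': [...]} dict, and return the dict form."""
    if isinstance(value, dict):
        return value
    if isinstance(value, list):
        return {"Connections": value}
    raise TypeError("Connections must be a dict or a list of names")
```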

To avoid these runtime errors while creating a workflow from a blueprint, you can validate the workflow, job, and crawler definitions as outlined in [Testing a blueprint](https://docs.aws.amazon.com/glue/latest/dg/developing-blueprints-testing.html).

Refer to the syntax in [AWS Glue blueprint classes reference](https://docs.aws.amazon.com/glue/latest/dg/developing-blueprints-code-classes.html) for defining the AWS Glue job, crawler, and workflow in the layout script.

# Permissions for personas and roles for AWS Glue blueprints
<a name="blueprints-personas-permissions"></a>

The following are the typical personas and suggested AWS Identity and Access Management (IAM) permissions policies for personas and roles for AWS Glue blueprints.

**Topics**
+ [Blueprint personas](#blueprints-personas)
+ [Permissions for blueprint personas](#blueprints-permssions)
+ [Permissions for blueprint roles](#blueprints-role-permissions)

## Blueprint personas
<a name="blueprints-personas"></a>

The following are the personas typically involved in the lifecycle of AWS Glue blueprints.


| Persona | Description | 
| --- | --- | 
| AWS Glue developer | Develops, tests, and publishes blueprints. | 
| AWS Glue administrator | Registers, maintains, and grants permissions on blueprints. | 
| Data analyst | Runs blueprints to create workflows. | 

For more information, see [Overview of blueprints in AWS Glue](blueprints-overview.md).

## Permissions for blueprint personas
<a name="blueprints-permssions"></a>

The following are the suggested permissions for each blueprint persona.



### AWS Glue developer permissions for blueprints
<a name="bp-persona-dev"></a>

The AWS Glue developer must have write permissions on the Amazon S3 bucket that is used to publish the blueprint. Often, the developer registers the blueprint after uploading it. In that case, the developer needs the permissions listed in [AWS Glue administrator permissions for blueprints](#bp-persona-admin). Additionally, if the developer wants to test the blueprint after it's registered, they also need the permissions listed in [Data analyst permissions for blueprints](#bp-persona-analyst). 

### AWS Glue administrator permissions for blueprints
<a name="bp-persona-admin"></a>

The following policy grants permissions to register, view, and maintain AWS Glue blueprints.

**Important**  
In the following policy, replace *amzn-s3-demo-bucket* and *prefix* with the Amazon S3 bucket name and prefix of the uploaded blueprint ZIP archives to register.

------
#### [ JSON ]

****  

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "glue:CreateBlueprint",
        "glue:UpdateBlueprint",
        "glue:DeleteBlueprint",
        "glue:GetBlueprint",
        "glue:ListBlueprints",
        "glue:BatchGetBlueprints"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject"
      ],
      "Resource": "arn:aws:s3:::amzn-s3-demo-bucket/prefix/*"
    }
  ]
}
```

------

### Data analyst permissions for blueprints
<a name="bp-persona-analyst"></a>

The following policy grants permissions to run blueprints and to view the resulting workflow and workflow components. It also grants `PassRole` on the role that AWS Glue assumes to create the workflow and workflow components.

The policy grants permissions on any resource. If you want to configure fine-grained access to individual blueprints, use the following format for blueprint ARNs:

```
arn:aws:glue:<region>:<account-id>:blueprint/<blueprint-name>
```
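
For example, a policy statement scoped to a single blueprint might look like the following sketch (the Region, account ID, blueprint name, and action list are illustrative):

```
{
  "Effect": "Allow",
  "Action": ["glue:GetBlueprint", "glue:StartBlueprintRun"],
  "Resource": "arn:aws:glue:us-east-1:111122223333:blueprint/my-blueprint"
}
```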

**Important**  
In the following policy, replace *111122223333* with a valid AWS account ID, and replace *role-name* with the name of the role used to run a blueprint. See [Permissions for blueprint roles](#blueprints-role-permissions) for the permissions that this role requires.

------
#### [ JSON ]

****  

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "glue:ListBlueprints",
        "glue:GetBlueprint",
        "glue:StartBlueprintRun",
        "glue:GetBlueprintRun",
        "glue:GetBlueprintRuns",
        "glue:GetCrawler",
        "glue:ListTriggers",
        "glue:ListJobs",
        "glue:BatchGetCrawlers",
        "glue:GetTrigger",
        "glue:BatchGetWorkflows",
        "glue:BatchGetTriggers",
        "glue:BatchGetJobs",
        "glue:BatchGetBlueprints",
        "glue:GetWorkflowRun",
        "glue:GetWorkflowRuns",
        "glue:ListCrawlers",
        "glue:ListWorkflows",
        "glue:GetJob",
        "glue:GetWorkflow",
        "glue:StartWorkflowRun"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "arn:aws:iam::111122223333:role/role-name"
    }
  ]
}
```

------

## Permissions for blueprint roles
<a name="blueprints-role-permissions"></a>

The following are the suggested permissions for the IAM role used to create a workflow from a blueprint. The role must have a trust relationship with `glue.amazonaws.com`.

**Important**  
In the following policy, replace *111122223333* with a valid AWS account ID, and replace *role-name* with the name of the role.

------
#### [ JSON ]

****  

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "glue:CreateJob",
        "glue:GetCrawler",
        "glue:GetTrigger",
        "glue:DeleteCrawler",
        "glue:CreateTrigger",
        "glue:DeleteTrigger",
        "glue:DeleteJob",
        "glue:CreateWorkflow",
        "glue:DeleteWorkflow",
        "glue:GetJob",
        "glue:GetWorkflow",
        "glue:CreateCrawler"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "arn:aws:iam::111122223333:role/role-name"
    }
  ]
}
```

------

**Note**  
If the jobs and crawlers in the workflow assume a role other than this role, this policy must include the `iam:PassRole` permission on that other role instead of on the blueprint role.

# Developing blueprints in AWS Glue
<a name="orchestrate-using-blueprints"></a>

Your organization might have a set of similar ETL use cases that could benefit from being able to parameterize a single workflow to handle them all. To address this need, AWS Glue enables you to define *blueprints*, which you can use to generate workflows. A blueprint accepts parameters, so that from a single blueprint, a data analyst can create different workflows to handle similar ETL use cases. After you create a blueprint, you can reuse it for different departments, teams, and projects.

**Topics**
+ [Overview of blueprints in AWS Glue](blueprints-overview.md)
+ [Developing blueprints in AWS Glue](developing-blueprints.md)
+ [Registering a blueprint in AWS Glue](registering-blueprints.md)
+ [Viewing blueprints in AWS Glue](viewing_blueprints.md)
+ [Updating a blueprint in AWS Glue](updating_blueprints.md)
+ [Creating a workflow from a blueprint in AWS Glue](creating_workflow_blueprint.md)
+ [Viewing blueprint runs in AWS Glue](viewing_blueprint_runs.md)

# Overview of blueprints in AWS Glue
<a name="blueprints-overview"></a>

**Note**  
The blueprints feature is currently unavailable in the following Regions in the AWS Glue console: Asia Pacific (Jakarta) and Middle East (UAE).

AWS Glue blueprints provide a way to create and share AWS Glue workflows. When there is a complex ETL process that could be used for similar use cases, rather than creating an AWS Glue workflow for each use case, you can create a single blueprint. 

The blueprint specifies the jobs and crawlers to include in a workflow, and specifies parameters that the workflow user supplies when they run the blueprint to create a workflow. The use of parameters enables a single blueprint to generate workflows for the various similar use cases. For more information about workflows, see [Overview of workflows in AWS Glue](workflows_overview.md).

The following are example use cases for blueprints:
+ You want to partition an existing dataset. The input parameters to the blueprint are Amazon Simple Storage Service (Amazon S3) source and target paths and a list of partition columns.
+ You want to snapshot an Amazon DynamoDB table into a SQL data store like Amazon Redshift. The input parameters to the blueprint are the DynamoDB table name and an AWS Glue connection, which designates an Amazon Redshift cluster and destination database.
+ You want to convert CSV data in multiple Amazon S3 paths to Parquet. You want the AWS Glue workflow to include a separate crawler and job for each path. The input parameters are the destination database in the AWS Glue Data Catalog and a comma-delimited list of Amazon S3 paths. Note that in this case, the number of crawlers and jobs that the workflow creates is variable.

[![AWS Videos](http://img.youtube.com/vi/s3Bm8ay53Ms/0.jpg)](http://www.youtube.com/watch?v=s3Bm8ay53Ms)


**Blueprint components**  
A blueprint is a ZIP archive that contains the following components:
+ A Python layout generator script

  Contains a function that specifies the workflow *layout*—the crawlers and jobs to create for the workflow, the job and crawler properties, and the dependencies between the jobs and crawlers. The function accepts blueprint parameters and returns a workflow structure (JSON object) that AWS Glue uses to generate the workflow. Because you use a Python script to generate the workflow, you can add your own logic that is suitable for your use cases.
+ A configuration file

  Specifies the fully qualified name of the Python function that generates the workflow layout. Also specifies the names, data types, and other properties of all blueprint parameters used by the script.
+ (Optional) ETL scripts and supporting files

  As an advanced use case, you can parameterize the location of the ETL scripts that your jobs use. You can include job script files in the ZIP archive and specify a blueprint parameter for an Amazon S3 location where the scripts are to be copied to. The layout generator script can copy the ETL scripts to the designated location and specify that location as the job script location property. You can also include any libraries or other supporting files, provided that your script handles them.

![\[Box labeled Blueprint contains two smaller boxes, one labeled Python Script and the other labeled Config File.\]](http://docs.aws.amazon.com/glue/latest/dg/images/blueprint.png)


**Blueprint runs**  
When you create a workflow from a blueprint, AWS Glue runs the blueprint, which starts an asynchronous process to create the workflow and the jobs, crawlers, and triggers that the workflow encapsulates. AWS Glue uses the blueprint run to orchestrate the creation of the workflow and its components. You view the status of the creation process by viewing the blueprint run status. The blueprint run also stores the values that you supplied for the blueprint parameters.

![\[Box labeled Blueprint run contains icons labeled Workflow and Parameter Values.\]](http://docs.aws.amazon.com/glue/latest/dg/images/blueprint-run.png)


You can view blueprint runs using the AWS Glue console or AWS Command Line Interface (AWS CLI). When viewing or troubleshooting a workflow, you can always return to the blueprint run to view the blueprint parameter values that were used to create the workflow.

**Lifecycle of a blueprint**  
Blueprints are developed, tested, registered with AWS Glue, and run to create workflows. There are typically three personas involved in the blueprint lifecycle.


| Persona | Tasks | 
| --- | --- | 
| AWS Glue developer |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/blueprints-overview.html)  | 
| AWS Glue administrator |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/blueprints-overview.html)  | 
| Data analyst |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/blueprints-overview.html)  | 

**See also**  
[Developing blueprints in AWS Glue](developing-blueprints.md)
[Creating a workflow from a blueprint in AWS Glue](creating_workflow_blueprint.md)
[Permissions for personas and roles for AWS Glue blueprints](blueprints-personas-permissions.md)

# Developing blueprints in AWS Glue
<a name="developing-blueprints"></a>

As an AWS Glue developer, you can create and publish blueprints that data analysts can use to generate workflows.

**Topics**
+ [Overview of developing blueprints](developing-blueprints-overview.md)
+ [Prerequisites for developing blueprints](developing-blueprints-prereq.md)
+ [Writing the blueprint code](developing-blueprints-code.md)
+ [Sample blueprint project](developing-blueprints-sample.md)
+ [Testing a blueprint](developing-blueprints-testing.md)
+ [Publishing a blueprint](developing-blueprints-publishing.md)
+ [AWS Glue blueprint classes reference](developing-blueprints-code-classes.md)
+ [Blueprint samples](developing-blueprints-samples.md)

**See also**  
[Overview of blueprints in AWS Glue](blueprints-overview.md)

# Overview of developing blueprints
<a name="developing-blueprints-overview"></a>

The first step in your development process is to identify a common use case that would benefit from a blueprint. A typical use case involves a recurring ETL problem that you believe should be solved in a general manner. Next, design a blueprint that implements the generalized use case, and define the blueprint input parameters that together can define a specific use case from the generalized use case.

A blueprint consists of a project that contains a blueprint parameter configuration file and a script that defines the *layout* of the workflow to generate. The layout defines the jobs and crawlers (or *entities* in blueprint script terminology) to create.

You do not directly specify any triggers in the layout script. Instead you write code to specify the dependencies between the jobs and crawlers that the script creates. AWS Glue generates the triggers based on your dependency specifications. The output of the layout script is a workflow object, which contains specifications for all workflow entities.

You build your workflow object using the following AWS Glue blueprint libraries:
+ `awsglue.blueprint.base_resource` – A library of base resources used by the libraries.
+ `awsglue.blueprint.workflow` – A library for defining a `Workflow` class.
+ `awsglue.blueprint.job` – A library for defining a `Job` class.
+ `awsglue.blueprint.crawler` – A library for defining a `Crawler` class.

The only other libraries that are supported for layout generation are those libraries that are available for the Python shell.

Before publishing your blueprint, you can use methods defined in the blueprint libraries to test the blueprint locally.

When you're ready to make the blueprint available to data analysts, you package the script, the parameter configuration file, and any supporting files, such as additional scripts and libraries, into a single deployable asset. You then upload the asset to Amazon S3 and ask an administrator to register it with AWS Glue.
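
The packaging step can be sketched in Python. This version keeps the top-level project folder inside the archive so that a `layoutGenerator` path such as `myproject.layout.generate_layout` resolves correctly (folder and file names are hypothetical):

```
import os
import zipfile

def package_blueprint(project_dir, zip_path):
    """Zip a blueprint project folder, preserving the top-level
    folder name inside the archive."""
    project_dir = os.path.normpath(project_dir)
    parent = os.path.dirname(project_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for dirpath, _, filenames in os.walk(project_dir):
            for name in filenames:
                full_path = os.path.join(dirpath, name)
                # store paths relative to the project's parent directory
                zf.write(full_path, os.path.relpath(full_path, parent))
```

You can then upload the resulting archive to Amazon S3, for example with `aws s3 cp`.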

For information about more sample blueprint projects, see [Sample blueprint project](developing-blueprints-sample.md) and [Blueprint samples](developing-blueprints-samples.md).

# Prerequisites for developing blueprints
<a name="developing-blueprints-prereq"></a>

To develop blueprints, you should be familiar with using AWS Glue and writing scripts for Apache Spark ETL jobs or Python shell jobs. In addition, you must complete the following setup tasks.
+ Download four AWS Python libraries to use in your blueprint layout scripts.
+ Set up the AWS SDKs.
+ Set up the AWS CLI.

## Download the Python libraries
<a name="prereqs-get-libes"></a>

Download the following libraries from GitHub, and install them into your project:
+ [https://github.com/awslabs/aws-glue-blueprint-libs/tree/master/awsglue/blueprint/base\_resource.py](https://github.com/awslabs/aws-glue-blueprint-libs/tree/master/awsglue/blueprint/base_resource.py)
+ [https://github.com/awslabs/aws-glue-blueprint-libs/tree/master/awsglue/blueprint/workflow.py](https://github.com/awslabs/aws-glue-blueprint-libs/tree/master/awsglue/blueprint/workflow.py)
+ [https://github.com/awslabs/aws-glue-blueprint-libs/tree/master/awsglue/blueprint/crawler.py](https://github.com/awslabs/aws-glue-blueprint-libs/tree/master/awsglue/blueprint/crawler.py)
+ [https://github.com/awslabs/aws-glue-blueprint-libs/tree/master/awsglue/blueprint/job.py](https://github.com/awslabs/aws-glue-blueprint-libs/tree/master/awsglue/blueprint/job.py)

## Set up the AWS Java SDK
<a name="prereqs-java-preview-sdk"></a>

For the AWS Java SDK, you must add a `jar` file that includes the API for blueprints.

1. If you haven't already done so, set up the AWS SDK for Java.
   + For Java 1.x, follow the instructions in [Set up the AWS SDK for Java](https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/setup-install.html) in the *AWS SDK for Java Developer Guide*.
   + For Java 2.x, follow the instructions in [Setting up the AWS SDK for Java 2.x](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/setup.html) in the *AWS SDK for Java 2.x Developer Guide*.

1. Download the client `jar` file that has access to the APIs for blueprints.
   + For Java 1.x: s3://awsglue-custom-blueprints-preview-artifacts/awsglue-java-sdk-preview/AWSGlueJavaClient-1.11.x.jar
   + For Java 2.x: s3://awsglue-custom-blueprints-preview-artifacts/awsglue-java-sdk-v2-preview/AwsJavaSdk-Glue-2.0.jar

1. Add the client `jar` to the front of the Java classpath to override the AWS Glue client provided by the AWS Java SDK.

   ```
   export CLASSPATH=<path-to-preview-client-jar>:$CLASSPATH
   ```

1. (Optional) Test the SDK with the following Java application. The application should output an empty list.

   Replace `accessKey` and `secretKey` with your credentials, and replace `us-east-1` with your Region.

   ```
   import com.amazonaws.auth.AWSCredentials;
   import com.amazonaws.auth.AWSCredentialsProvider;
   import com.amazonaws.auth.AWSStaticCredentialsProvider;
   import com.amazonaws.auth.BasicAWSCredentials;
   import com.amazonaws.services.glue.AWSGlue;
   import com.amazonaws.services.glue.AWSGlueClientBuilder;
   import com.amazonaws.services.glue.model.ListBlueprintsRequest;
   
   public class App{
       public static void main(String[] args) {
           AWSCredentials credentials = new BasicAWSCredentials("accessKey", "secretKey");
           AWSCredentialsProvider provider = new AWSStaticCredentialsProvider(credentials);
           AWSGlue glue = AWSGlueClientBuilder.standard().withCredentials(provider)
                   .withRegion("us-east-1").build();
           ListBlueprintsRequest request = new ListBlueprintsRequest().withMaxResults(2);
           System.out.println(glue.listBlueprints(request));
       }
   }
   ```

## Set up the AWS Python SDK
<a name="prereqs-python-preview-sdk"></a>

The following steps assume that you have Python version 2.7 or later, or version 3.9 or later installed on your computer.

1. Download the following boto3 wheel file. If prompted to open or save, save the file. s3://awsglue-custom-blueprints-preview-artifacts/aws-python-sdk-preview/boto3-1.17.31-py2.py3-none-any.whl

1. Download the following botocore wheel file: s3://awsglue-custom-blueprints-preview-artifacts/aws-python-sdk-preview/botocore-1.20.31-py2.py3-none-any.whl

1. Check your Python version.

   ```
   python --version
   ```

1. Depending on your Python version, enter the following commands (for Linux):
   + For Python 2.7 or later.

     ```
     python3 -m pip install --user virtualenv
     python3 -m virtualenv env
     source env/bin/activate
     ```
   + For Python 3.9 or later.

     ```
     python3 -m venv python-sdk-test
     source python-sdk-test/bin/activate
     ```

1. Install the botocore wheel file.

   ```
   python3 -m pip install <download-directory>/botocore-1.20.31-py2.py3-none-any.whl
   ```

1. Install the boto3 wheel file.

   ```
   python3 -m pip install <download-directory>/boto3-1.17.31-py2.py3-none-any.whl
   ```

1. Configure your credentials and default region in the `~/.aws/credentials` and `~/.aws/config` files. For more information, see [Configuring the AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html) in the *AWS Command Line Interface User Guide*.

1. (Optional) Test your setup. The following commands should return an empty list.

   Replace `us-east-1` with your Region.

   ```
   $ python
   >>> import boto3
   >>> glue = boto3.client('glue', 'us-east-1')
   >>> glue.list_blueprints()
   ```

## Set up the preview AWS CLI
<a name="prereqs-setup-cli"></a>

1. If you haven't already done so, install and/or update the AWS Command Line Interface (AWS CLI) on your computer. The easiest way to do this is with `pip`, the Python installer utility:

   ```
   pip install awscli --upgrade --user
   ```

   You can find complete installation instructions for the AWS CLI here: [Installing the AWS Command Line Interface](https://docs.aws.amazon.com/cli/latest/userguide/installing.html).

1. Download the AWS CLI wheel file from: s3://awsglue-custom-blueprints-preview-artifacts/awscli-preview-build/awscli-1.19.31-py2.py3-none-any.whl

1. Install the AWS CLI wheel file.

   ```
   python3 -m pip install awscli-1.19.31-py2.py3-none-any.whl
   ```

1. Run the `aws configure` command. Configure your AWS credentials (including access key and secret key) and AWS Region. You can find information on configuring the AWS CLI here: [Configuring the AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html).

1. Test the AWS CLI. The following command should return an empty list.

   Replace `us-east-1` with your Region.

   ```
   aws glue list-blueprints --region us-east-1
   ```

# Writing the blueprint code
<a name="developing-blueprints-code"></a>

Each blueprint project that you create must contain at a minimum the following files:
+ A Python layout script that defines the workflow. The script contains a function that defines the entities (jobs and crawlers) in a workflow, and the dependencies between them.
+ A configuration file, `blueprint.cfg`, which defines:
  + The full path of the workflow layout definition function.
  + The parameters that the blueprint accepts.

**Topics**
+ [Creating the blueprint layout script](developing-blueprints-code-layout.md)
+ [Creating the configuration file](developing-blueprints-code-config.md)
+ [Specifying blueprint parameters](developing-blueprints-code-parameters.md)

# Creating the blueprint layout script
<a name="developing-blueprints-code-layout"></a>

The blueprint layout script must include a function that generates the entities in your workflow. You can name this function whatever you like. AWS Glue uses the configuration file to determine the fully qualified name of the function.

Your layout function does the following:
+ (Optional) Instantiates the `Job` class to create `Job` objects, and passes arguments such as `Command` and `Role`. These are job properties that you would specify if you were creating the job using the AWS Glue console or API.
+ (Optional) Instantiates the `Crawler` class to create `Crawler` objects, and passes name, role, and target arguments.
+ To indicate dependencies between the objects (workflow entities), passes the `DependsOn` and `WaitForDependencies` additional arguments to `Job()` and `Crawler()`. These arguments are explained later in this section.
+ Instantiates the `Workflow` class to create the workflow object that is returned to AWS Glue, passing a `Name` argument, an `Entities` argument, and an optional `OnSchedule` argument. The `Entities` argument specifies all of the jobs and crawlers to include in the workflow. To see how to construct an `Entities` object, see the sample project later in this section.
+ Returns the `Workflow` object.

For definitions of the `Job`, `Crawler`, and `Workflow` classes, see [AWS Glue blueprint classes reference](developing-blueprints-code-classes.md).

The layout function must accept the following input arguments.


| Argument | Description | 
| --- | --- | 
| user\_params | Python dictionary of blueprint parameter names and values. For more information, see [Specifying blueprint parameters](developing-blueprints-code-parameters.md). | 
| system\_params | Python dictionary containing two properties: region and accountId. | 

Here is a sample layout generator script in a file named `Layout.py`:

```
import argparse
import sys
import os
import json
from awsglue.blueprint.workflow import *
from awsglue.blueprint.job import *
from awsglue.blueprint.crawler import *


def generate_layout(user_params, system_params):

    etl_job = Job(Name="{}_etl_job".format(user_params['WorkflowName']),
                  Command={
                      "Name": "glueetl",
                      "ScriptLocation": user_params['ScriptLocation'],
                      "PythonVersion": "2"
                  },
                  Role=user_params['PassRole'])
    post_process_job = Job(Name="{}_post_process".format(user_params['WorkflowName']),
                            Command={
                                "Name": "pythonshell",
                                "ScriptLocation": user_params['ScriptLocation'],
                                "PythonVersion": "2"
                            },
                            Role=user_params['PassRole'],
                            DependsOn={
                                etl_job: "SUCCEEDED"
                            },
                            WaitForDependencies="AND")
    sample_workflow = Workflow(Name=user_params['WorkflowName'],
                            Entities=Entities(Jobs=[etl_job, post_process_job]))
    return sample_workflow
```

The sample script imports the required blueprint libraries and includes a `generate_layout` function that generates a workflow with two jobs. This is a very simple script. A more complex script could employ additional logic and parameters to generate a workflow with many jobs and crawlers, or even a variable number of jobs and crawlers.

## Using the DependsOn argument
<a name="developing-blueprints-code-layout-depends-on"></a>

The `DependsOn` argument is a dictionary representation of a dependency that this entity has on other entities within the workflow. It has the following form. 

```
DependsOn = {dependency1 : state, dependency2 : state, ...}
```

The keys in this dictionary represent the object reference, not the name, of the entity, while the values are strings that correspond to the state to watch for. AWS Glue infers the proper triggers. For the valid states, see [Condition Structure](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-jobs-trigger.html#aws-glue-api-jobs-trigger-Condition).

For example, a job might depend on the successful completion of a crawler. If you define a crawler object named `crawler2` as follows:

```
crawler2 = Crawler(Name="my_crawler", ...)
```

Then an object depending on `crawler2` would include a constructor argument such as: 

```
DependsOn = {crawler2 : "SUCCEEDED"}
```

For example:

```
job1 = Job(Name="Job1", ..., DependsOn = {crawler2 : "SUCCEEDED", ...})
```

If `DependsOn` is omitted for an entity, that entity depends on the workflow start trigger.

## Using the WaitForDependencies argument
<a name="developing-blueprints-code-layout-wait-for-dependencies"></a>

The `WaitForDependencies` argument defines whether a job or crawler entity should wait until *all* entities on which it depends complete or until *any* completes.

The allowable values are "`AND`" or "`ANY`".
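The effect of this argument can be modeled in plain Python. The following sketch is purely illustrative (the `is_ready` helper is hypothetical and not part of the blueprint libraries); it shows how `"AND"` and `"ANY"` gate an entity whose `DependsOn` watches two upstream entities:

```python
# Illustrative model of WaitForDependencies semantics; not part of
# the AWS Glue blueprint libraries.
def is_ready(dependency_states, watched_states, wait_for="AND"):
    """Return True if an entity's dependencies allow it to start.

    dependency_states: dict mapping entity name -> current state
    watched_states: dict mapping entity name -> state to watch for
                    (the states you would specify in DependsOn)
    wait_for: "AND" (all must match) or "ANY" (one match is enough)
    """
    matches = [dependency_states.get(name) == state
               for name, state in watched_states.items()]
    return all(matches) if wait_for == "AND" else any(matches)

watched = {"etl_job": "SUCCEEDED", "crawler2": "SUCCEEDED"}

# With "AND", both dependencies must reach the watched state.
print(is_ready({"etl_job": "SUCCEEDED", "crawler2": "RUNNING"}, watched, "AND"))   # False
print(is_ready({"etl_job": "SUCCEEDED", "crawler2": "SUCCEEDED"}, watched, "AND")) # True

# With "ANY", a single match is enough.
print(is_ready({"etl_job": "SUCCEEDED", "crawler2": "RUNNING"}, watched, "ANY"))   # True
```

In a real layout, AWS Glue evaluates these conditions through the triggers that it generates from your `DependsOn` specifications.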

## Using the OnSchedule argument
<a name="developing-blueprints-code-layout-on-schedule"></a>

The `OnSchedule` argument for the `Workflow` class constructor is a `cron` expression that defines the starting trigger definition for a workflow.

If this argument is specified, AWS Glue creates a schedule trigger with the corresponding schedule. If it isn't specified, the starting trigger for the workflow is an on-demand trigger.
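AWS Glue schedule expressions use the six-field form `cron(minutes hours day-of-month month day-of-week year)`. As a minimal sketch, a hypothetical helper like the following can catch an obviously malformed `OnSchedule` value during local testing (it checks only the shape of the expression, not the validity of each field):

```python
# Hypothetical sanity check for an OnSchedule value; not part of the
# AWS Glue blueprint libraries. Glue schedule expressions take the form
# cron(minutes hours day-of-month month day-of-week year).
def looks_like_glue_cron(expression):
    if not (expression.startswith("cron(") and expression.endswith(")")):
        return False
    fields = expression[len("cron("):-1].split()
    return len(fields) == 6

# Fires every day at 12:15 UTC.
print(looks_like_glue_cron("cron(15 12 * * ? *)"))  # True

# Missing the cron(...) wrapper and the year field.
print(looks_like_glue_cron("15 12 * * ?"))          # False
```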

# Creating the configuration file
<a name="developing-blueprints-code-config"></a>

The blueprint configuration file is a required file that defines the script entry point for generating the workflow, and the parameters that the blueprint accepts. The file must be named `blueprint.cfg`.

Here is a sample configuration file.

```
{
    "layoutGenerator": "DemoBlueprintProject.Layout.generate_layout",
    "parameterSpec" : {
           "WorkflowName" : {
                "type": "String",
                "collection": false
           },
           "WorkerType" : {
                "type": "String",
                "collection": false,
                "allowedValues": ["G1.X", "G2.X"],
                "defaultValue": "G1.X"
           },
           "Dpu" : {
                "type" : "Integer",
                "allowedValues" : [2, 4, 6],
                "defaultValue" : 2
           },
           "DynamoDBTableName": {
                "type": "String",
                "collection" : false
           },
           "ScriptLocation" : {
                "type": "String",
                "collection": false
    	}
    }
}
```

The `layoutGenerator` property specifies the fully qualified name of the function in the script that generates the layout.

The `parameterSpec` property specifies the parameters that this blueprint accepts. For more information, see [Specifying blueprint parameters](developing-blueprints-code-parameters.md).

**Important**  
Your configuration file must include the workflow name as a blueprint parameter, or you must generate a unique workflow name in your layout script.

# Specifying blueprint parameters
<a name="developing-blueprints-code-parameters"></a>

The configuration file contains blueprint parameter specifications in a `parameterSpec` JSON object. `parameterSpec` contains one or more parameter objects.

```
"parameterSpec": {
    "<parameter_name>": {
      "type": "<parameter-type>",
      "collection": true|false, 
      "description": "<parameter-description>",
      "defaultValue": "<default value for the parameter if value not specified>"
      "allowedValues": "<list of allowed values>" 
    },
    "<parameter_name>": {    
       ...
    }
  }
```

The following are the rules for coding each parameter object:
+ The parameter name and `type` are mandatory. All other properties are optional.
+ If you specify the `defaultValue` property, the parameter is optional. Otherwise the parameter is mandatory and the data analyst who is creating a workflow from the blueprint must provide a value for it.
+ If you set the `collection` property to `true`, the parameter can take a collection of values. Collections can be of any data type.
+ If you specify `allowedValues`, the AWS Glue console displays a dropdown list of values for the data analyst to choose from when creating a workflow from the blueprint.

The following are the permitted values for `type`:


| Parameter data type | Notes | 
| --- | --- | 
| String | - | 
| Integer | - | 
| Double | - | 
| Boolean | Possible values are true and false. Generates a check box on the Create a workflow from <blueprint> page on the AWS Glue console. | 
| S3Uri | Complete Amazon S3 path, beginning with s3://. Generates a text field and Browse button on the Create a workflow from <blueprint> page. | 
| S3Bucket | Amazon S3 bucket name only. Generates a bucket picker on the Create a workflow from <blueprint> page. | 
| IAMRoleArn | Amazon Resource Name (ARN) of an AWS Identity and Access Management (IAM) role. Generates a role picker on the Create a workflow from <blueprint> page. | 
| IAMRoleName | Name of an IAM role. Generates a role picker on the Create a workflow from <blueprint> page. | 

# Sample blueprint project
<a name="developing-blueprints-sample"></a>

Data format conversion is a frequent extract, transform, and load (ETL) use case. In typical analytic workloads, column-based file formats like Parquet or ORC are preferred over text formats like CSV or JSON. This sample blueprint converts files on Amazon S3 from CSV, JSON, and other formats into Parquet.

This blueprint takes a list of S3 paths defined by a blueprint parameter, converts the data to Parquet format, and writes it to the S3 location specified by another blueprint parameter. The layout script creates a crawler and job for each path. The layout script also uploads the ETL script in `Conversion.py` to an S3 bucket specified by another blueprint parameter. The layout script then specifies the uploaded script as the ETL script for each job. The ZIP archive for the project contains the layout script, the ETL script, and the blueprint configuration file.

For information about more sample blueprint projects, see [Blueprint samples](developing-blueprints-samples.md).

The following is the layout script, in the file `Layout.py`.

```
from awsglue.blueprint.workflow import *
from awsglue.blueprint.job import *
from awsglue.blueprint.crawler import *
import boto3

s3_client = boto3.client('s3')

# Ingesting all the S3 paths as Glue table in parquet format
def generate_layout(user_params, system_params):
    #Always give the full path for the file
    with open("ConversionBlueprint/Conversion.py", "rb") as f:
        s3_client.upload_fileobj(f, user_params['ScriptsBucket'], "Conversion.py")
    etlScriptLocation = "s3://{}/Conversion.py".format(user_params['ScriptsBucket'])    
    crawlers = []
    jobs = []
    workflowName = user_params['WorkflowName']
    for path in user_params['S3Paths']:
      tablePrefix = "source_" 
      crawler = Crawler(Name="{}_crawler".format(workflowName),
                        Role=user_params['PassRole'],
                        DatabaseName=user_params['TargetDatabase'],
                        TablePrefix=tablePrefix,
                        Targets= {"S3Targets": [{"Path": path}]})
      crawlers.append(crawler)
      transform_job = Job(Name="{}_transform_job".format(workflowName),
                         Command={"Name": "glueetl",
                                  "ScriptLocation": etlScriptLocation,
                                  "PythonVersion": "3"},
                         Role=user_params['PassRole'],
                         DefaultArguments={"--database_name": user_params['TargetDatabase'],
                                           "--table_prefix": tablePrefix,
                                           "--region_name": system_params['region'],
                                           "--output_path": user_params['TargetS3Location']},
                         DependsOn={crawler: "SUCCEEDED"},
                         WaitForDependencies="AND")
      jobs.append(transform_job)
    conversion_workflow = Workflow(Name=workflowName, Entities=Entities(Jobs=jobs, Crawlers=crawlers))
    return conversion_workflow
```

The following is the corresponding blueprint configuration file `blueprint.cfg`.

```
{
    "layoutGenerator": "ConversionBlueprint.Layout.generate_layout",
    "parameterSpec" : {
        "WorkflowName" : {
            "type": "String",
            "collection": false,
            "description": "Name for the workflow."
        },
        "S3Paths" : {
            "type": "S3Uri",
            "collection": true,
            "description": "List of Amazon S3 paths for data ingestion."
        },
        "PassRole" : {
            "type": "IAMRoleName",
            "collection": false,
            "description": "Choose an IAM role to be used in running the job/crawler"
        },
        "TargetDatabase": {
            "type": "String",
            "collection" : false,
            "description": "Choose a database in the Data Catalog."
        },
        "TargetS3Location": {
            "type": "S3Uri",
            "collection" : false,
            "description": "Choose an Amazon S3 output path, for example: s3://<target_path>/."
        },
        "ScriptsBucket": {
            "type": "S3Bucket",
            "collection": false,
            "description": "Provide an S3 bucket name (in the same AWS Region) to store the scripts."
        }
    }
}
```

The following script in the file `Conversion.py` is the uploaded ETL script. Note that it preserves the partitioning scheme during conversion. 

```
import sys
from pyspark.sql.functions import *
from pyspark.context import SparkContext
from awsglue.transforms import *
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
import boto3

args = getResolvedOptions(sys.argv, [
    'JOB_NAME',
    'region_name',
    'database_name',
    'table_prefix',
    'output_path'])
databaseName = args['database_name']
tablePrefix = args['table_prefix']
outputPath = args['output_path']

glue = boto3.client('glue', region_name=args['region_name'])

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args['JOB_NAME'], args)

def get_tables(database_name, table_prefix):
    tables = []
    paginator = glue.get_paginator('get_tables')
    for page in paginator.paginate(DatabaseName=database_name, Expression=table_prefix+"*"):
        tables.extend(page['TableList'])
    return tables

for table in get_tables(databaseName, tablePrefix):
    tableName = table['Name']
    partitionList = table['PartitionKeys']
    partitionKeys = []
    for partition in partitionList:
        partitionKeys.append(partition['Name'])

    # Create DynamicFrame from Catalog
    dyf = glue_context.create_dynamic_frame.from_catalog(
        name_space=databaseName,
        table_name=tableName,
        additional_options={
            'useS3ListImplementation': True
        },
        transformation_ctx='dyf'
    )

    # Resolve choice type with make_struct
    dyf = ResolveChoice.apply(
        frame=dyf,
        choice='make_struct',
        transformation_ctx='resolvechoice_' + tableName
    )

    # Drop null fields
    dyf = DropNullFields.apply(
        frame=dyf,
        transformation_ctx="dropnullfields_" + tableName
    )

    # Write DynamicFrame to S3 in glueparquet
    sink = glue_context.getSink(
        connection_type="s3",
        path=outputPath,
        enableUpdateCatalog=True,
        partitionKeys=partitionKeys
    )
    sink.setFormat("glueparquet")

    sink.setCatalogInfo(
        catalogDatabase=databaseName,
        catalogTableName=tableName[len(tablePrefix):]
    )
    sink.writeFrame(dyf)

job.commit()
```

**Note**  
Only two Amazon S3 paths can be supplied as an input to the sample blueprint. This is because AWS Glue triggers are limited to invoking only two crawler actions.

# Testing a blueprint
<a name="developing-blueprints-testing"></a>

While you develop your code, you should perform local testing to verify that the workflow layout is correct.

Local testing doesn't generate AWS Glue jobs, crawlers, or triggers. Instead, you run the layout script locally and use the `to_json()` and `validate()` methods to print objects and find errors. These methods are available in all three classes defined in the libraries. 

There are two ways to handle the `user_params` and `system_params` arguments that AWS Glue passes to your layout function. Your test-bench code can create a dictionary of sample blueprint parameter values and pass that to the layout function as the `user_params` argument. Or, you can remove the references to `user_params` and replace them with hardcoded strings.

If your code makes use of the `region` and `accountId` properties in the `system_params` argument, you can pass in your own dictionary for `system_params`.
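For example, a test bench might pass a placeholder dictionary such as the following (the Region and account ID shown are illustrative values):

```python
# Placeholder system_params for local testing; substitute your own
# Region and account ID. The dictionary that AWS Glue passes at run
# time contains these same two properties.
SYSTEM_PARAMS = {"region": "us-east-1", "accountId": "111122223333"}

# A test bench would then call the layout function directly, for example:
# workflow = generate_layout(user_params=USER_PARAMS, system_params=SYSTEM_PARAMS)
print(sorted(SYSTEM_PARAMS))  # ['accountId', 'region']
```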

**To test a blueprint**

1. Start a Python interpreter in a directory with the libraries, or load the blueprint files and the supplied libraries into your preferred integrated development environment (IDE).

1. Ensure that your code imports the supplied libraries.

1. Add code to your layout function to call `validate()` or `to_json()` on any entity or on the `Workflow` object. For example, if your code creates a `Crawler` object named `mycrawler`, you can call `validate()` as follows.

   ```
   mycrawler.validate()
   ```

   You can print `mycrawler` as follows:

   ```
   print(mycrawler.to_json())
   ```

   If you call `to_json()` on an object, there is no need to also call `validate()`, because `to_json()` calls `validate()`.

   It is most useful to call these methods on the workflow object. Assuming that your script names the workflow object `my_workflow`, validate and print the workflow object as follows.

   ```
   print(my_workflow.to_json())
   ```

   For more information about `to_json()` and `validate()`, see [Class methods](developing-blueprints-code-classes.md#developing-blueprints-code-methods).

   You can also import `pprint` and pretty-print the workflow object, as shown in the example later in this section.

1. Run the code, fix errors, and finally remove any calls to `validate()` or `to_json()`.

**Example**  
The following example shows how to construct a dictionary of sample blueprint parameters and pass it in as the `user_params` argument to layout function `generate_compaction_workflow`. It also shows how to pretty-print the generated workflow object.  

```
from pprint import pprint
from awsglue.blueprint.workflow import *
from awsglue.blueprint.job import *
from awsglue.blueprint.crawler import *
 
USER_PARAMS = {"WorkflowName": "compaction_workflow",
               "ScriptLocation": "s3://amzn-s3-demo-bucket/scripts/threaded-compaction.py",
               "PassRole": "arn:aws:iam::111122223333:role/GlueRole-ETL",
               "DatabaseName": "cloudtrial",
               "TableName": "ct_cloudtrail",
               "CoalesceFactor": 4,
               "MaxThreadWorkers": 200}
 
 
def generate_compaction_workflow(user_params: dict, system_params: dict) -> Workflow:
    compaction_job = Job(Name=f"{user_params['WorkflowName']}_etl_job",
                         Command={"Name": "glueetl",
                                  "ScriptLocation": user_params['ScriptLocation'],
                                  "PythonVersion": "3"},
                         Role="arn:aws:iam::111122223333:role/AWSGlueServiceRoleDefault",
                         DefaultArguments={"DatabaseName": user_params['DatabaseName'],
                                           "TableName": user_params['TableName'],
                                           "CoalesceFactor": user_params['CoalesceFactor'],
                                           "max_thread_workers": user_params['MaxThreadWorkers']})
 
    catalog_target = {"CatalogTargets": [{"DatabaseName": user_params['DatabaseName'], "Tables": [user_params['TableName']]}]}
 
    compacted_files_crawler = Crawler(Name=f"{user_params['WorkflowName']}_post_crawl",
                                      Targets = catalog_target,
                                      Role=user_params['PassRole'],
                                      DependsOn={compaction_job: "SUCCEEDED"},
                                      WaitForDependencies="AND",
                                      SchemaChangePolicy={"DeleteBehavior": "LOG"})
 
    compaction_workflow = Workflow(Name=user_params['WorkflowName'],
                                   Entities=Entities(Jobs=[compaction_job],
                                                     Crawlers=[compacted_files_crawler]))
    return compaction_workflow
 
generated = generate_compaction_workflow(user_params=USER_PARAMS, system_params={})
gen_dict = generated.to_json()
 
pprint(gen_dict)
```

# Publishing a blueprint
<a name="developing-blueprints-publishing"></a>

After you develop a blueprint, you must upload it to Amazon S3. You must have write permissions on the Amazon S3 bucket that you use to publish the blueprint. You must also make sure that the AWS Glue administrator, who will register the blueprint, has read access to the Amazon S3 bucket. For the suggested AWS Identity and Access Management (IAM) permissions policies for personas and roles for AWS Glue blueprints, see [Permissions for personas and roles for AWS Glue blueprints](blueprints-personas-permissions.md).

**To publish a blueprint**

1. Create the necessary scripts, resources, and blueprint configuration file.

1. Add all files to a ZIP archive and upload the ZIP file to Amazon S3. Use an S3 bucket in the same Region as the one in which users will register and run the blueprint.

   You can create a ZIP file from the command line using the following command.

   ```
   zip -r folder.zip folder
   ```

1. Add a bucket policy that grants read permission to the desired AWS account. The following is a sample policy.

------
#### [ JSON ]

****  

   ```
   {
     "Version": "2012-10-17",
     "Statement": [
       {
         "Effect": "Allow",
         "Principal": {
           "AWS": "arn:aws:iam::111122223333:root"
         },
         "Action": "s3:GetObject",
         "Resource": "arn:aws:s3:::my-blueprints/*"
       }
     ]
   }
   ```

------

1. Grant the IAM `s3:GetObject` permission on the Amazon S3 bucket to the AWS Glue administrator or to whoever will be registering blueprints. For a sample policy to grant to administrators, see [AWS Glue administrator permissions for blueprints](blueprints-personas-permissions.md#bp-persona-admin).

After you have completed local testing of your blueprint, you might also want to test it on AWS Glue. To test a blueprint on AWS Glue, you must first register it. You can limit who sees the registered blueprint by using IAM authorization or by using separate testing accounts.

**See also:**  
[Registering a blueprint in AWS Glue](registering-blueprints.md)

# AWS Glue blueprint classes reference
<a name="developing-blueprints-code-classes"></a>

The libraries for AWS Glue blueprints define three classes that you use in your workflow layout script: `Job`, `Crawler`, and `Workflow`.

**Topics**
+ [Job class](#developing-blueprints-code-jobclass)
+ [Crawler class](#developing-blueprints-code-crawlerclass)
+ [Workflow class](#developing-blueprints-code-workflowclass)
+ [Class methods](#developing-blueprints-code-methods)

## Job class
<a name="developing-blueprints-code-jobclass"></a>

The `Job` class represents an AWS Glue ETL job.

**Mandatory constructor arguments**  
The following are mandatory constructor arguments for the `Job` class.


| Argument name | Type | Description | 
| --- | --- | --- | 
| Name | str | Name to assign to the job. AWS Glue adds a randomly generated suffix to the name to distinguish the job from those created by other blueprint runs. | 
| Role | str | Amazon Resource Name (ARN) of the role that the job should assume while executing. | 
| Command | dict | Job command, as specified in the [JobCommand structure](aws-glue-api-jobs-job.md#aws-glue-api-jobs-job-JobCommand) in the API documentation.  | 

**Optional constructor arguments**  
The following are optional constructor arguments for the `Job` class.


| Argument name | Type | Description | 
| --- | --- | --- | 
| DependsOn | dict | List of workflow entities that the job depends on. For more information, see [Using the DependsOn argument](developing-blueprints-code-layout.md#developing-blueprints-code-layout-depends-on). | 
| WaitForDependencies | str | Indicates whether the job should wait until all entities on which it depends complete before executing or until any completes. For more information, see [Using the WaitForDependencies argument](developing-blueprints-code-layout.md#developing-blueprints-code-layout-wait-for-dependencies). Omit if the job depends on only one entity. | 
| (Job properties) | - | Any of the job properties listed in [Job structure](aws-glue-api-jobs-job.md#aws-glue-api-jobs-job-Job) in the AWS Glue API documentation (except CreatedOn and LastModifiedOn). | 

## Crawler class
<a name="developing-blueprints-code-crawlerclass"></a>

The `Crawler` class represents an AWS Glue crawler.

**Mandatory constructor arguments**  
The following are mandatory constructor arguments for the `Crawler` class.


| Argument name | Type | Description | 
| --- | --- | --- | 
| Name | str | Name to assign to the crawler. AWS Glue adds a randomly generated suffix to the name to distinguish the crawler from those created by other blueprint runs. | 
| Role | str | ARN of the role that the crawler should assume while running. | 
| Targets | dict | Collection of targets to crawl. Targets class constructor arguments are defined in the [CrawlerTargets structure](aws-glue-api-crawler-crawling.md#aws-glue-api-crawler-crawling-CrawlerTargets) in the API documentation. All Targets constructor arguments are optional, but you must pass at least one.  | 

**Optional constructor arguments**  
The following are optional constructor arguments for the `Crawler` class.


| Argument name | Type | Description | 
| --- | --- | --- | 
| DependsOn | dict | List of workflow entities that the crawler depends on. For more information, see [Using the DependsOn argument](developing-blueprints-code-layout.md#developing-blueprints-code-layout-depends-on). | 
| WaitForDependencies | str | Indicates whether the crawler should wait until all entities on which it depends complete before running or until any completes. For more information, see [Using the WaitForDependencies argument](developing-blueprints-code-layout.md#developing-blueprints-code-layout-wait-for-dependencies). Omit if the crawler depends on only one entity. | 
| (Crawler properties) | - | Any of the crawler properties listed in [Crawler structure](aws-glue-api-crawler-crawling.md#aws-glue-api-crawler-crawling-Crawler) in the AWS Glue API documentation, with the following exceptions:[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/developing-blueprints-code-classes.html) | 

## Workflow class
<a name="developing-blueprints-code-workflowclass"></a>

The `Workflow` class represents an AWS Glue workflow. The workflow layout script returns a `Workflow` object. AWS Glue creates a workflow based on this object.

**Mandatory constructor arguments**  
The following are mandatory constructor arguments for the `Workflow` class.


| Argument name | Type | Description | 
| --- | --- | --- | 
| Name | str | Name to assign to the workflow. | 
| Entities | Entities | A collection of entities (jobs and crawlers) to include in the workflow. The Entities class constructor accepts a Jobs argument, which is a list of Job objects, and a Crawlers argument, which is a list of Crawler objects. | 

**Optional constructor arguments**  
The following are optional constructor arguments for the `Workflow` class.


| Argument name | Type | Description | 
| --- | --- | --- | 
| Description | str | See [Workflow structure](aws-glue-api-workflow.md#aws-glue-api-workflow-Workflow). | 
| DefaultRunProperties | dict | See [Workflow structure](aws-glue-api-workflow.md#aws-glue-api-workflow-Workflow). | 
| OnSchedule | str | A cron expression. | 

## Class methods
<a name="developing-blueprints-code-methods"></a>

All three classes include the following methods.

**validate()**  
Validates the properties of the object and, if errors are found, outputs a message and exits. Generates no output if there are no errors. For the `Workflow` class, `validate()` calls itself on every entity in the workflow.

**to\_json()**  
Serializes the object to JSON. Also calls `validate()`. For the `Workflow` class, the JSON object includes job and crawler lists, and a list of triggers generated by the job and crawler dependency specifications.

# Blueprint samples
<a name="developing-blueprints-samples"></a>

There are a number of sample blueprint projects available on the [AWS Glue blueprint GitHub repository](https://github.com/awslabs/aws-glue-blueprint-libs/tree/master/samples). These samples are for reference only and are not intended for production use.

The titles of the sample projects are:
+ Compaction: this blueprint creates a job that compacts input files into larger chunks based on desired file size.
+ Conversion: this blueprint converts input files in various standard file formats into Apache Parquet format, which is optimized for analytic workloads.
+ Crawling Amazon S3 locations: this blueprint crawls multiple Amazon S3 locations to add metadata tables to the Data Catalog.
+ Custom connection to Data Catalog: this blueprint accesses data stores using AWS Glue custom connectors, reads the records, and populates the table definitions in the AWS Glue Data Catalog based on the record schema.
+ Encoding: this blueprint converts your non-UTF files into UTF encoded files.
+ Partitioning: this blueprint creates a partitioning job that places output files into partitions based on specific partition keys.
+ Importing Amazon S3 data into a DynamoDB table: this blueprint imports data from Amazon S3 into a DynamoDB table.
+ Standard table to governed: this blueprint imports an AWS Glue Data Catalog table into a Lake Formation table.

# Registering a blueprint in AWS Glue
<a name="registering-blueprints"></a>

After the AWS Glue developer has coded the blueprint and uploaded a ZIP archive to Amazon Simple Storage Service (Amazon S3), an AWS Glue administrator must register the blueprint. Registering the blueprint makes it available for use.

When you register a blueprint, AWS Glue copies the blueprint archive to a reserved Amazon S3 location. You can then delete the archive from the upload location.

To register a blueprint, you need read permissions on the Amazon S3 location that contains the uploaded archive. You also need the AWS Identity and Access Management (IAM) permission `glue:CreateBlueprint`. For the suggested permissions for an AWS Glue administrator who must register, view, and maintain blueprints, see [AWS Glue administrator permissions for blueprints](blueprints-personas-permissions.md#bp-persona-admin).

You can register a blueprint by using the AWS Glue console, AWS Glue API, or AWS Command Line Interface (AWS CLI).

**To register a blueprint (console)**

1. Ensure that you have read permissions (`s3:GetObject`) on the blueprint ZIP archive in Amazon S3.

1. Open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

   Sign in as a user that has permissions to register a blueprint. Switch to the same AWS Region as the Amazon S3 bucket that contains the blueprint ZIP archive.

1. In the navigation pane, choose **blueprints**. Then on the **blueprints** page, choose **Add blueprint**.

1. Enter a blueprint name and optional description.

1. For **ZIP archive location (S3)**, enter the Amazon S3 path of the uploaded blueprint ZIP archive. Include the archive file name in the path and begin the path with `s3://`.

1. (Optional) Add one or more tags.

1. Choose **Add blueprint**.

   The **Blueprints** page returns and shows that the blueprint status is `CREATING`. Choose the refresh button until the status changes to `ACTIVE` or `FAILED`.

1. If the status is `FAILED`, select the blueprint, and on the **Actions** menu, choose **View**.

   The detail page shows the reason for the failure. If the error message is "Unable to access object at location..." or "Access denied on object at location...", review the following requirements:
   + The user that you are signed in as must have read permission on the blueprint ZIP archive in Amazon S3.
   + The Amazon S3 bucket that contains the ZIP archive must have a bucket policy that grants read permission on the object to your AWS account ID. For more information, see [Developing blueprints in AWS Glue](developing-blueprints.md).
   + The Amazon S3 bucket that you're using must be in the same AWS Region that you're signed in to on the console.

1. Ensure that data analysts have permissions on the blueprint.

   The suggested IAM policy for data analysts is shown in [Data analyst permissions for blueprints](blueprints-personas-permissions.md#bp-persona-analyst). This policy grants `glue:GetBlueprint` on any resource. If your policy is more fine-grained at the resource level, then grant data analysts permissions on this newly created resource.
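The archive-path requirement from step 5 can be sanity-checked client-side before you submit the form. A minimal sketch (the helper name is hypothetical, not part of any AWS SDK):

```python
def looks_like_blueprint_path(s3_path: str) -> bool:
    """Rough check that a blueprint location matches the console's
    expectations: an s3:// URI that names the ZIP archive itself,
    not just its bucket or prefix."""
    return s3_path.startswith("s3://") and s3_path.lower().endswith(".zip")
```

For example, `looks_like_blueprint_path("s3://amzn-s3-demo-bucket1/demo/DemoBlueprintProject.zip")` passes, while a path ending in a prefix such as `s3://amzn-s3-demo-bucket1/demo/` does not.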

**To register a blueprint (AWS CLI)**

1. Enter the following command.

   ```
   aws glue create-blueprint --name <blueprint-name> [--description <description>] --blueprint-location s3://<s3-path>/<archive-filename>
   ```

1. Enter the following command to check the blueprint status. Repeat the command until the status changes to `ACTIVE` or `FAILED`.

   ```
   aws glue get-blueprint --name <blueprint-name>
   ```

   If the status is `FAILED` and the error message is "Unable to access object at location..." or "Access denied on object at location...", review the following requirements:
   + The user that you are signed in as must have read permission on the blueprint ZIP archive in Amazon S3.
   + The Amazon S3 bucket containing the ZIP archive must have a bucket policy that grants read permission on the object to your AWS account ID. For more information, see [Publishing a blueprint](developing-blueprints-publishing.md).
   + The Amazon S3 bucket that you're using must be in the same AWS Region that you're signed in to on the console.
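The check-until-terminal loop in step 2 is easy to script. A sketch with the status lookup injected, so it can wrap either the CLI command above or an SDK call (`get_status` is an assumed callable, not an AWS API):

```python
import time

def wait_for_blueprint(get_status, poll_seconds=5, timeout_seconds=300):
    """Poll a status callable until the blueprint reaches a terminal state.

    `get_status` returns the current status string -- for example, a thin
    wrapper that runs `aws glue get-blueprint --name <blueprint-name>` and
    extracts the Status field. Returns the terminal status.
    """
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        status = get_status()
        if status in ("ACTIVE", "FAILED"):
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("blueprint did not reach a terminal state in time")
```

Injecting the lookup keeps the polling logic independent of whether you call the CLI, boto3, or a test stub.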

**See also:**  
[Overview of blueprints in AWS Glue](blueprints-overview.md)

# Viewing blueprints in AWS Glue
<a name="viewing_blueprints"></a>

View a blueprint to review the blueprint description, status, and parameter specifications, and to download the blueprint ZIP archive.

You can view a blueprint by using the AWS Glue console, AWS Glue API, or AWS Command Line Interface (AWS CLI).

**To view a blueprint (console)**

1. Open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. In the navigation pane, choose **Blueprints**.

1. On the **Blueprints** page, select a blueprint. Then on the **Actions** menu, choose **View**.

**To view a blueprint (AWS CLI)**
+ Enter the following command to view just the blueprint name, description, and status. Replace *<blueprint-name>* with the name of the blueprint to view.

  ```
  aws glue get-blueprint --name <blueprint-name>
  ```

  The command output looks something like the following.

  ```
  {
      "Blueprint": {
          "Name": "myDemoBP",
          "CreatedOn": 1587414516.92,
          "LastModifiedOn": 1587428838.671,
          "BlueprintLocation": "s3://amzn-s3-demo-bucket1/demo/DemoBlueprintProject.zip",
          "Status": "ACTIVE"
      }
  }
  ```

  Enter the following command to also view the parameter specifications.

  ```
  aws glue get-blueprint --name <blueprint-name> --include-parameter-spec
  ```

  The command output looks something like the following.

  ```
  {
      "Blueprint": {
          "Name": "myDemoBP",
          "CreatedOn": 1587414516.92,
          "LastModifiedOn": 1587428838.671,
          "ParameterSpec": "{\"WorkflowName\":{\"type\":\"String\",\"collection\":false,\"description\":null,\"defaultValue\":null,\"allowedValues\":null},\"PassRole\":{\"type\":\"String\",\"collection\":false,\"description\":null,\"defaultValue\":null,\"allowedValues\":null},\"DynamoDBTableName\":{\"type\":\"String\",\"collection\":false,\"description\":null,\"defaultValue\":null,\"allowedValues\":null},\"ScriptLocation\":{\"type\":\"String\",\"collection\":false,\"description\":null,\"defaultValue\":null,\"allowedValues\":null}}",
        "BlueprintLocation": "s3://amzn-s3-demo-bucket1/demo/DemoBlueprintProject.zip",
          "Status": "ACTIVE"
      }
  }
  ```

  Add the `--include-blueprint` argument to include a URL in the output that you can paste into your browser to download the blueprint ZIP archive that AWS Glue stored.
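Note that `ParameterSpec` is itself a JSON-encoded string nested inside the JSON response, so tooling has to decode it twice. A short sketch using a trimmed version of the sample output above:

```python
import json

# Trimmed `Blueprint` object as returned by
# `get-blueprint --include-parameter-spec`.
blueprint = {
    "Name": "myDemoBP",
    "ParameterSpec": (
        '{"WorkflowName":{"type":"String","collection":false,'
        '"description":null,"defaultValue":null,"allowedValues":null},'
        '"PassRole":{"type":"String","collection":false,'
        '"description":null,"defaultValue":null,"allowedValues":null}}'
    ),
}

# Second decode: the field holds a JSON document as a string.
params = json.loads(blueprint["ParameterSpec"])

# Parameters with no default must be supplied when creating a workflow.
required = [name for name, spec in params.items()
            if spec.get("defaultValue") is None]
```

Here `required` ends up as `["WorkflowName", "PassRole"]`, the values an analyst must provide when running this blueprint.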

**See also:**  
[Overview of blueprints in AWS Glue](blueprints-overview.md)

# Updating a blueprint in AWS Glue
<a name="updating_blueprints"></a>

You can update a blueprint if you have a revised layout script, a revised set of blueprint parameters, or revised supporting files. Updating a blueprint creates a new version.

Updating a blueprint doesn't affect existing workflows created from the blueprint.

You can update a blueprint by using the AWS Glue console, AWS Glue API, or AWS Command Line Interface (AWS CLI).

The following procedure assumes that the AWS Glue developer has created and uploaded an updated blueprint ZIP archive to Amazon S3.

**To update a blueprint (console)**

1. Ensure that you have read permissions (`s3:GetObject`) on the blueprint ZIP archive in Amazon S3.

1. Open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

   Sign in as a user that has permissions to update a blueprint. Switch to the same AWS Region as the Amazon S3 bucket that contains the blueprint ZIP archive.

1. In the navigation pane, choose **Blueprints**.

1. On the **Blueprints** page, select a blueprint, and on the **Actions** menu, choose **Edit**.

1. On the **Edit a blueprint** page, update the blueprint **Description** or **ZIP archive location (S3)**. Be sure to include the archive file name in the path.

1. Choose **Save**.

   The **Blueprints** page returns and shows that the blueprint status is `UPDATING`. Choose the refresh button until the status changes to `ACTIVE` or `FAILED`.

1. If the status is `FAILED`, select the blueprint, and on the **Actions** menu, choose **View**.

   The detail page shows the reason for the failure. If the error message is "Unable to access object at location..." or "Access denied on object at location...", review the following requirements:
   + The user that you are signed in as must have read permission on the blueprint ZIP archive in Amazon S3.
   + The Amazon S3 bucket that contains the ZIP archive must have a bucket policy that grants read permission on the object to your AWS account ID. For more information, see [Publishing a blueprint](developing-blueprints-publishing.md).
   + The Amazon S3 bucket that you're using must be in the same AWS Region that you're signed in to on the console.
**Note**  
If the update fails, the next blueprint run uses the latest version of the blueprint that was successfully registered or updated.

**To update a blueprint (AWS CLI)**

1. Enter the following command.

   ```
   aws glue update-blueprint --name <blueprint-name> [--description <description>] --blueprint-location s3://<s3-path>/<archive-filename>
   ```

1. Enter the following command to check the blueprint status. Repeat the command until the status changes to `ACTIVE` or `FAILED`.

   ```
   aws glue get-blueprint --name <blueprint-name>
   ```

   If the status is `FAILED` and the error message is "Unable to access object at location..." or "Access denied on object at location...", review the following requirements:
   + The user that you are signed in as must have read permission on the blueprint ZIP archive in Amazon S3.
   + The Amazon S3 bucket containing the ZIP archive must have a bucket policy that grants read permission on the object to your AWS account ID. For more information, see [Publishing a blueprint](developing-blueprints-publishing.md).
   + The Amazon S3 bucket that you're using must be in the same AWS Region that you're signed in to on the console.

**See also:**  
[Overview of blueprints in AWS Glue](blueprints-overview.md)

# Creating a workflow from a blueprint in AWS Glue
<a name="creating_workflow_blueprint"></a>

You can create an AWS Glue workflow manually, adding one component at a time, or you can create a workflow from an AWS Glue [blueprint](blueprints-overview.md). AWS Glue includes blueprints for common use cases. Your AWS Glue developers can create additional blueprints.

**Important**  
Limit the total number of jobs, crawlers, and triggers within a workflow to 100 or fewer. If you include more than 100, you might get errors when trying to resume or stop workflow runs.

When you use a blueprint, you can quickly generate a workflow for a specific use case based on the generalized use case defined by the blueprint. You define the specific use case by providing values for the blueprint parameters. For example, a blueprint that partitions a dataset could have the Amazon S3 source and target paths as parameters.

AWS Glue creates a workflow from a blueprint by *running* the blueprint. The blueprint run saves the parameter values that you supplied, and is used to track the progress and outcome of the creation of the workflow and its components. When troubleshooting a workflow, you can view the blueprint run to determine the blueprint parameter values that were used to create a workflow.
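Continuing the dataset-partitioning example, the parameter values you supply travel to the blueprint run as a single JSON document. A sketch of assembling that document (the parameter names are hypothetical, not from a shipped blueprint; the AWS CLI's `start-blueprint-run --parameters` argument expects such a JSON string):

```python
import json

# Hypothetical parameter values for a blueprint that partitions a dataset.
blueprint_params = {
    "WorkflowName": "partition_sales_data",
    "SourcePath": "s3://amzn-s3-demo-bucket1/raw/sales/",
    "TargetPath": "s3://amzn-s3-demo-bucket1/partitioned/sales/",
}

# Serialize to the single JSON-encoded string the API expects.
parameters_arg = json.dumps(blueprint_params)
```

The blueprint run records this document, which is what you later inspect when troubleshooting how a workflow was created.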

To create and view workflows, you require certain IAM permissions. For a suggested IAM policy, see [Data analyst permissions for blueprints](blueprints-personas-permissions.md#bp-persona-analyst).

You can create a workflow from a blueprint by using the AWS Glue console, AWS Glue API, or AWS Command Line Interface (AWS CLI).

**To create a workflow from a blueprint (console)**

1. Open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

   Sign in as a user that has permissions to create a workflow.

1. In the navigation pane, choose **Blueprints**.

1. Select a blueprint, and on the **Actions** menu, choose **Create workflow**. 

1. On the **Create a workflow from <blueprint-name>** page, enter the following information:  
**Blueprint parameters**  
These vary depending on the blueprint design. For questions about the parameters, ask the blueprint developer. Blueprints typically include a parameter for the workflow name.  
**IAM role**  
The role that AWS Glue assumes to create the workflow and its components. The role must have permissions to create and delete workflows, jobs, crawlers, and triggers. For a suggested policy for the role, see [Permissions for blueprint roles](blueprints-personas-permissions.md#blueprints-role-permissions).

1. Choose **Submit**.

   The **Blueprint Details** page appears, showing a list of blueprint runs at the bottom.

1. In the blueprint runs list, check the topmost blueprint run for workflow creation status. 

   The initial status is `RUNNING`. Choose the refresh button until the status changes to `SUCCEEDED` or `FAILED`.

1. Do one of the following:
   + If the completion status is `SUCCEEDED`, you can go to the **Workflows** page, select the newly created workflow, and run it. Before running the workflow, you can review the design graph.
   + If the completion status is `FAILED`, select the blueprint run, and on the **Actions** menu, choose **View** to see the error message.

For more information on workflows and blueprints, see the following topics.
+ [Overview of workflows in AWS Glue](workflows_overview.md)
+ [Updating a blueprint in AWS Glue](updating_blueprints.md)
+ [Creating and building out a workflow manually in AWS Glue](creating_running_workflows.md)

# Viewing blueprint runs in AWS Glue
<a name="viewing_blueprint_runs"></a>

View a blueprint run to see the following information:
+ Name of the workflow that was created.
+ Blueprint parameter values that were used to create the workflow.
+ Status of the workflow creation operation.

You can view a blueprint run by using the AWS Glue console, AWS Glue API, or AWS Command Line Interface (AWS CLI).

**To view a blueprint run (console)**

1. Open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. In the navigation pane, choose **Blueprints**.

1. On the **Blueprints** page, select a blueprint. Then on the **Actions** menu, choose **View**.

1. At the bottom of the **Blueprint Details** page, select a blueprint run, and on the **Actions** menu, choose **View**.

**To view a blueprint run (AWS CLI)**
+ Enter the following command. Replace *<blueprint-name>* with the name of the blueprint. Replace *<blueprint-run-id>* with the blueprint run ID.

  ```
  aws glue get-blueprint-run --blueprint-name <blueprint-name> --run-id <blueprint-run-id>
  ```

**See also:**  
[Overview of blueprints in AWS Glue](blueprints-overview.md)