

# Notebook job workflows
<a name="create-notebook-auto-run-dag"></a>

Since a notebook job runs your custom code, you can create a pipeline that includes one or more notebook job steps. ML workflows often contain multiple steps, such as a processing step to preprocess data, a training step to build your model, and a model evaluation step, among others. One possible use of notebook jobs is to handle preprocessing—you might have a notebook that performs data transformation or ingestion, an EMR step that performs data cleaning, and another notebook job that performs featurization of your inputs before initiating a training step. A notebook job may require information from previous steps in the pipeline or from user-specified customization as parameters in the input notebook. For examples that show how to pass environment variables and parameters to your notebook and retrieve information from prior steps, see [Pass information to and from your notebook step](create-notebook-auto-run-dag-seq.md).

In another use case, one of your notebook jobs might call another notebook to perform some tasks during your notebook run—in this scenario you need to specify these sourced notebooks as dependencies with your notebook job step. For information about how to call another notebook, see [Invoke another notebook in your notebook job](create-notebook-auto-run-dag-call.md).

To view sample notebooks that demonstrate how to schedule notebook jobs with the SageMaker AI Python SDK, see [notebook job sample notebooks](https://github.com/aws/amazon-sagemaker-examples/tree/main/sagemaker-pipelines/notebook-job-step).

# Pass information to and from your notebook step
<a name="create-notebook-auto-run-dag-seq"></a>

The following sections describe ways to pass information to your notebook as environment variables and parameters.

## Pass environment variables
<a name="create-notebook-auto-run-dag-seq-env-var"></a>

Pass environment variables as a dictionary to the `environment_variables` argument of your `NotebookJobStep`, as shown in the following example:

```
environment_variables = {"RATE": "0.0001", "BATCH_SIZE": "1000"}

notebook_job_step = NotebookJobStep(
    ...
    environment_variables=environment_variables,
    ...
)
```

Inside your notebook, retrieve the environment variables with `os.getenv()`, as shown in the following example:

```
# inside your notebook
import os
print(f"RATE={os.getenv('RATE')}")
```
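Because environment variables arrive as strings, numeric values need explicit conversion inside the notebook. The following is a minimal sketch, assuming the `RATE` and `BATCH_SIZE` variables from the earlier example (the fallback defaults are for local runs and are not supplied by the job):

```python
import os

# environment variables are strings; convert to the numeric types the
# notebook code expects (defaults here are assumed values for local testing)
rate = float(os.getenv("RATE", "0.0001"))
batch_size = int(os.getenv("BATCH_SIZE", "1000"))
print(f"rate={rate}, batch_size={batch_size}")
```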

## Pass parameters
<a name="create-notebook-auto-run-dag-seq-param"></a>

When you pass parameters to a Notebook Job step, you can optionally tag a cell in your Jupyter notebook to indicate where the job applies new parameters or parameter overrides. For instructions about how to tag a cell in your Jupyter notebook, see [Parameterize your notebook](notebook-auto-run-troubleshoot-override.md).

You pass parameters through the Notebook Job step's `parameters` argument, as shown in the following snippet:

```
notebook_job_parameters = {
    "company": "Amazon",
}

notebook_job_step = NotebookJobStep(
    ...
    parameters=notebook_job_parameters,
    ...
)
```

Inside your input notebook, your parameters are applied after the cell tagged with `parameters` or at the beginning of the notebook if you don’t have a tagged cell.

```
# this cell is in your input notebook and is tagged with 'parameters'
# your parameters and parameter overrides are applied after this cell
company='default'
```

```
# in this cell, your parameters are applied
# prints "company is Amazon"
print(f'company is {company}')
```
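Conceptually, the injected parameters simply rebind the names that the tagged cell defines. The following sketch mimics that override order in plain Python, using the `company` parameter from the example above (cell boundaries are shown as comments):

```python
# --- cell tagged 'parameters': defines the default values ---
company = "default"

# --- injected parameters cell (applied immediately after the tagged cell) ---
company = "Amazon"

# --- subsequent cells see the overridden value ---
print(f"company is {company}")  # prints "company is Amazon"
```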

## Retrieve information from a previous step
<a name="create-notebook-auto-run-dag-seq-interstep"></a>

The following discussion explains how to extract data from a previous step and pass it to your Notebook Job step.

**Use `properties` attribute**

You can use the following properties with the previous step's `properties` attribute:
+ `ComputingJobName`—The training job name
+ `ComputingJobStatus`—The training job status
+ `NotebookJobInputLocation`—The input Amazon S3 location
+ `NotebookJobOutputLocationPrefix`—The path to your training job outputs; your output files are located at `{NotebookJobOutputLocationPrefix}/{training-job-name}/output/output.tar.gz`
+ `InputNotebookName`—The input notebook file name
+ `OutputNotebookName`—The output notebook file name (which may not exist in the training job output folder if the job fails)

The following code snippet shows how to extract parameters from the `properties` attribute.

```
notebook_job_step2 = NotebookJobStep(
    ...
    parameters={
        "step1_JobName": notebook_job_step1.properties.ComputingJobName,
        "step1_JobStatus": notebook_job_step1.properties.ComputingJobStatus,
        "step1_NotebookJobInput": notebook_job_step1.properties.NotebookJobInputLocation,
        "step1_NotebookJobOutput": notebook_job_step1.properties.NotebookJobOutputLocationPrefix,
    }
)
```
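Inside the downstream notebook, you can combine these values to locate the previous step's output archive. The following is a minimal sketch; the parameter names match the snippet above, but the values are hypothetical stand-ins for what the pipeline injects at run time:

```python
# hypothetical values injected as notebook parameters by the pipeline
step1_JobName = "my-notebook-job-abc123"
step1_NotebookJobOutput = "s3://amzn-s3-demo-bucket/notebook-job-outputs"

# outputs live under {prefix}/{training-job-name}/output/output.tar.gz
output_archive = f"{step1_NotebookJobOutput}/{step1_JobName}/output/output.tar.gz"
print(output_archive)
```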

**Use JsonGet**

If you want to pass parameters other than the ones previously mentioned and the JSON outputs of your previous step reside in Amazon S3, use `JsonGet`. `JsonGet` is a general mechanism that can directly extract data from JSON files in Amazon S3.

To extract values from a JSON file in Amazon S3 with `JsonGet`, complete the following steps:

1. Upload your JSON file to Amazon S3. If your data is already uploaded to Amazon S3, skip this step. The following example demonstrates uploading a JSON file to Amazon S3.

   ```
   import json
   from sagemaker.s3 import S3Uploader
   
   output = {
       "key1": "value1", 
       "key2": [0,5,10]
   }
               
   json_output = json.dumps(output)
   
   with open("notebook_job_params.json", "w") as file:
       file.write(json_output)
   
   S3Uploader.upload(
       local_path="notebook_job_params.json",
       desired_s3_uri="s3://path/to/bucket"
   )
   ```

1. Provide your S3 URI and the JSON path to the value you want to extract. In the following example, `JsonGet` returns an object representing index 2 of the value associated with key `key2` (`10`).

   ```
   NotebookJobStep(
       ...
       parameters={
           # the key job_key1 returns an object representing the value 10
           "job_key1": JsonGet(
               s3_uri=Join(on="/", values=["s3:/", ..]),
               json_path="key2[2]" # value to reference in that json file
           ),
           "job_key2": "Amazon"
       }
   )
   ```
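To see why `json_path="key2[2]"` yields `10`, the following sketch evaluates the same lookup locally on the JSON document uploaded in step 1 (`JsonGet` itself performs this resolution at pipeline execution time):

```python
import json

# the same document written to notebook_job_params.json in step 1
document = json.loads('{"key1": "value1", "key2": [0, 5, 10]}')

# json_path="key2[2]" selects index 2 of the list under "key2"
value = document["key2"][2]
print(value)  # prints 10
```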

# Invoke another notebook in your notebook job
<a name="create-notebook-auto-run-dag-call"></a>

You can set up a pipeline in which one notebook job calls another notebook. The following example sets up a pipeline with a Notebook Job step whose input notebook calls two other notebooks. The input notebook contains the following lines:

```
%run 'subfolder/notebook_to_call_in_subfolder.ipynb'
%run 'notebook_to_call.ipynb'
```

Pass these notebooks into your `NotebookJobStep` instances with `additional_dependencies`, as shown in the following snippet. Note that the paths provided for the notebooks in `additional_dependencies` are provided from the root location. For information about how SageMaker AI uploads your dependent files and folders to Amazon S3 so you can correctly provide paths to your dependencies, see the description for `additional_dependencies` in [NotebookJobStep](https://sagemaker.readthedocs.io/en/stable/workflows/pipelines/sagemaker.workflow.pipelines.html#sagemaker.workflow.notebook_job_step.NotebookJobStep).

```
input_notebook = "inputs/input_notebook.ipynb"
simple_notebook_path = "inputs/notebook_to_call.ipynb"
folder_with_sub_notebook = "inputs/subfolder"

notebook_job_step = NotebookJobStep(
    image_uri=image_uri,
    kernel_name=kernel_name,
    role=role,
    input_notebook=input_notebook,
    additional_dependencies=[simple_notebook_path, folder_with_sub_notebook],
    tags=tags,
)
```
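Because `%run` resolves paths relative to the running notebook, a quick local check that your dependency layout matches what you pass to `additional_dependencies` can catch path mistakes before you submit the pipeline. The following is a hypothetical pre-flight helper, assuming the `inputs/` layout from the snippet above:

```python
from pathlib import Path

# hypothetical pre-flight check: confirm each notebook referenced by %run
# exists relative to the input notebook's directory ("inputs/" here)
root = Path("inputs")
referenced = [
    "notebook_to_call.ipynb",
    "subfolder/notebook_to_call_in_subfolder.ipynb",
]

missing = [name for name in referenced if not (root / name).exists()]
if missing:
    print(f"missing dependencies: {missing}")
```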