

# Using the JupyterLab IDE in Amazon SageMaker Unified Studio

The JupyterLab page of Amazon SageMaker Unified Studio provides a JupyterLab interactive development environment (IDE) for you to use as you perform data integration, analytics, or machine learning in your projects. Amazon SageMaker Unified Studio notebooks are powered by JupyterLab spaces.

By default, the JupyterLab application comes with the Amazon SageMaker Distribution image. The distribution image includes popular packages such as the following:
+ PyTorch
+ TensorFlow
+ Keras
+ NumPy
+ Pandas
+ Scikit-learn

Amazon SageMaker Unified Studio includes a sample notebook that you can use to get started. You can also choose to create new notebooks for your business use cases.

Amazon SageMaker Unified Studio notebooks include the following key features:
+ Manage configurations to scale the instance vertically if the job being submitted demands it.
+ Access metadata to find out information such as the path to the Amazon S3 bucket where data is being stored.
+ Perform Git operations for version control.
+ Use Amazon Q chat functionality to ask questions and generate code using prompts.
+ Perform code completion using Amazon Q Developer.

**Note**  
The JupyterLab IDE has an idle shutdown feature that shuts down the IDE after it has been idle for 60 minutes. This means that if both the IDE kernel and terminal have been unused for an hour, the IDE stops running. To start using the IDE again after idle shutdown, navigate to the JupyterLab page again and choose **Start** to restart the kernel in the JupyterLab IDE.

# Managing configurations


You can edit your JupyterLab configurations on the JupyterLab page by choosing Configure in the top right corner. A popup appears where you can change the instance type. You can also increase the EBS volume up to 16 GB if allowed by your admin.

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Expand the **Build** menu in the top navigation, then choose **JupyterLab**.

1. Choose the **Configure** button in the top right corner of the page. A popup appears where you can change the instance type and increase the EBS volume.

1. Specify the instance type and EBS volume that you want.

   **Note**  
   After you increase the EBS volume, you cannot decrease it.

# Configuring Spark compute


Amazon SageMaker Unified Studio provides a set of Jupyter magic commands. Magic commands, or magics, enhance the functionality of the IPython environment. For more information about the magics that Amazon SageMaker Unified Studio provides, run `%help` in a notebook.

Compute-specific configurations can be set by using the `%%configure` Jupyter magic. The `%%configure` magic takes a JSON-formatted dictionary. To use the `%%configure` magic, specify the compute name in the `-n` argument. Including `-f` restarts the session to forcefully apply the new configuration. Otherwise, the configuration applies when the next session starts.

For example: `%%configure -n compute_name -f`.

# Library management


You can use the library management widget in JupyterLab to manage the library installations and configurations in your notebook.

To navigate to the library management of a notebook in Amazon SageMaker Unified Studio, complete the following steps:

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Navigate to a project. You can do this by choosing **Browse all projects** from the center menu and then selecting a project, or by creating a new project.

1. From the **Build** menu, choose **JupyterLab**.

1. Navigate to a notebook or create a new one by selecting **File** > **New** > **Notebook**.

1. Choose the library management icon from the notebook navigation bar.  
![\[The Amazon SageMaker Unified Studio JupyterLab library icon.\]](http://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/images/library-icon.png)

The following library configurations are available:

## Jar

+ **Maven artifacts**
+ **S3 paths**
+ **Disk location paths**
+ **Other paths**

## Python

+ **Conda packages**
+ **PyPI packages**
+ **S3 paths**
+ **Disk location paths**
+ **Other paths**

## Adding JupyterLab library configurations




1. Navigate to the JupyterLab library management page.

1. Select the configuration method you would like to add from the left navigation of the library management page.

1. Choose **Add**.

1. Input the URL, package name, coordinates, or other information as the fields indicate.

1. In the left navigation of the library management page, check the box **Apply the change to JupyterLab**.

1. Choose **Save all changes**.

# Compute-specific configuration


Compute-specific configurations are set with the `%%configure` Jupyter magic, as described earlier. The following sections show example configurations for Amazon EMR and AWS Glue compute.

## Configure an EMR Spark session


When working with Amazon EMR on EC2 or EMR Serverless, you can use the `%%configure` command to configure the Spark session creation parameters. Using `conf` settings, you can configure any Spark configuration that's mentioned in the configuration documentation for Apache Spark.

```
%%configure -n compute_name -f 
{ 
    "conf": { 
        "spark.sql.shuffle.partitions": "36"
     }
}
```

## Configure a Glue interactive session


Use the `--` prefix for job run arguments specified for AWS Glue.

```
%%configure -n project.spark.compatibility -f
{
   "--enable-auto-scaling": "true",
   "--enable-glue-datacatalog": "false"
}
```

For more information on job parameters, see Job parameters.

When working with AWS Glue, you can also update the Spark configuration by using `--conf` in the `%%configure` magic. You can configure any Spark configuration that's mentioned in the configuration documentation for Apache Spark.

```
%%configure -n project.spark.compatibility -f 
{ 
    "--conf": "spark.sql.shuffle.partitions=36" 
}
```

# Accessing metadata


You can view metadata for your project in the notebook terminal within Amazon SageMaker Unified Studio. This shows you information such as the `ProjectS3Path`, which is the Amazon S3 bucket where your project data is stored. The project metadata is written to a file named `resource-metadata.json` in the folder `/opt/ml/metadata/`. You can get the metadata by opening a terminal from within the notebook.

1. Navigate to the Code page within the project you want to view metadata for.

1. Choose **File** > **New** > **Terminal**.

1. Enter the following command:

   ```
   cat /opt/ml/metadata/resource-metadata.json
   ```

   The metadata file information then appears in the terminal window.
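
   You can also read the metadata file programmatically from a notebook cell. The following is a minimal sketch; the nested key layout in the sample dictionary is an assumption for illustration, so inspect your own `resource-metadata.json` for the actual structure:

   ```python
   import json

   def find_key(obj, key):
       """Recursively search parsed JSON for the first value stored under `key`."""
       if isinstance(obj, dict):
           if key in obj:
               return obj[key]
           for value in obj.values():
               found = find_key(value, key)
               if found is not None:
                   return found
       elif isinstance(obj, list):
           for item in obj:
               found = find_key(item, key)
               if found is not None:
                   return found
       return None

   # Stand-in for the contents of /opt/ml/metadata/resource-metadata.json;
   # the real file is generated by Amazon SageMaker Unified Studio and its
   # exact layout may differ.
   sample = {
       "AdditionalMetadata": {
           "ProjectS3Path": "s3://example-bucket/example-domain/example-project/dev"
       }
   }

   print(find_key(sample, "ProjectS3Path"))
   # In the JupyterLab IDE you would instead load the real file:
   # with open("/opt/ml/metadata/resource-metadata.json") as f:
   #     metadata = json.load(f)
   ```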

# Performing Git operations


The JupyterLab IDE in Amazon SageMaker Unified Studio is configured with Git and initialized with the project repository when a project is created.

To access Git operations in the Amazon SageMaker Unified Studio management console, navigate to the Code page of your project, then choose the Git button in the JupyterLab IDE left panel.

This opens a panel where you can view commit history and perform Git operations. You can use this Git extension to commit and push files back to the project repository, switch your working branch or create a new one, and manage tags.

To fetch notebooks committed by other users, do a pull from the project repository.

**Note**  
When you create and enable a connection for Git access and the user accesses this connection in the JupyterLab IDE in Amazon SageMaker Unified Studio, the repository is cloned. In other words, a local copy of the repository is created in the Amazon SageMaker Unified Studio project. If the administrator later disables or deletes this Git connection, the local repository remains in the user's IDE, but users can no longer push or pull files to or from it. For more information, see [Git connections in Amazon SageMaker Unified Studio](https://docs.aws.amazon.com/sagemaker-unified-studio/latest/adminguide/git-connections.html). 

# Using the Amazon Q data integration in AWS Glue


Amazon SageMaker Unified Studio supports the [Amazon Q data integration](https://docs.aws.amazon.com/glue/latest/dg/q.html) in AWS Glue. It helps data engineers and ETL developers create data integration jobs using natural language, letting you automate aspects of code authoring.

When using the Amazon Q data integration in AWS Glue in the JupyterLab IDE, you enter comments using natural language instructions, and then the PySpark kernel generates the code on your behalf. You can customize the generated code to meet your own needs.

1. Open a Python notebook, and ensure the kernel is configured to use a PySpark connection.

1. Request a response by writing your instruction as a comment; the comment serves as the prompt that starts Amazon Q processing.

1. If the prompt is AWS Glue related, the data integration generates an AWS Glue job script using PySpark.

1. Alternatively, you can continue to use the default auto-completions from Amazon Q Developer. If a prompt isn't AWS Glue related, Amazon Q Developer uses autocomplete instead.

# Running SQL and Spark code


You can run code against multiple compute resources in one Jupyter notebook, in different programming languages, by using the Jupyter cell magics `%%pyspark`, `%%sql`, and `%%scalaspark`.

For example, to run PySpark code on Spark compute, you can run the following code:

```
%%pyspark compute_name
spark.createDataFrame([('Alice', 1)])
```

The following table represents the supported compute types of each magic:


| Magic | Supported compute types | 
| --- | --- | 
| %%sql | Redshift, Athena, EMR on EC2, EMR Serverless, Glue Interactive Session | 
| %%pyspark | EMR on EC2, EMR Serverless, Glue Interactive Session | 
| %%scalaspark | EMR on EC2, EMR Serverless, Glue Interactive Session | 
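
As another sketch, the following shows a `%%sql` cell run against a configured compute, using the placeholder `compute_name` as in the earlier examples:

```
%%sql compute_name
SELECT 1 AS sample_value
```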

The dropdown available at the top of an active cell allows you to select the connection and compute type. If you make no selection, the code in the cell runs against the compute hosting JupyterLab ("Local Python" / `project.python`). The connection type that you select dictates the compute available. These selections dictate the magic commands generated in the cell and determine where your code runs.

When a new cell is created, it automatically selects the same connection and compute type as the previous cell. To configure the dropdown, go to **Settings** > **Settings editor** > **Connection magics settings**.

# Visualizing results


`%display` is a magic that you can apply to any DataFrame to invoke a visualization for tabular data. Use the visualization to scroll through a DataFrame or the results of a Redshift or Athena query.

There are four different views:
+ **Table**. You can change the sampling method, sample size, and rows per page that are displayed.
+ **Summary**. Each column in the summary tab has a button labeled with the column's name. Choosing one of these buttons opens a sub-tab in the **Column** view for that column.
+ **Column**. For each column selected in the column selector above, a sub-tab appears with more details about the contents of the column.
+ **Plotting**. In the default plotting view, you can change the graph type, axes, value types, and aggregation functions for plotting. By installing an optional supported third-party plotting library in the JupyterLab space (pygwalker, ydata-profiling, or dataprep) and running the display magic, you can visualize your data using the installed library.
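
As a sketch, assuming `df` is a DataFrame already defined in the notebook, you can invoke the visualization like the following (the exact argument form may vary by environment):

```
%display df
```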

## Shared Project Storage


 The JupyterLab visualization widget offers an option to store visualization data in a shared Amazon S3 location within your project bucket. The data is stored using the following structure: 

```
s3://bucket/domain/project/dev/user/{sts_user_identity}/query-result/{data_uuid}/
  
├── dataframe/           # Contains DataFrame in parquet format
├── head/100/            # Sample data (100 rows)
│   ├── metadata.json
│   ├── summary_schema.json
│   └── column_schema/
└── tail/                # Additional sample data
```

## Storage Options


The visualization widget supports two storage modes, controlled by the `--query-storage` parameter:
+ **Cell storage** (`--query-storage cell`): Data is stored locally in the notebook output (the current default behavior)
+ **S3 storage** (`--query-storage s3`): Data is stored in the project's shared S3 bucket for persistence and sharing
  + Choose **Store query result in S3** to store the data in the project's shared S3 bucket.

## Data Access and Security


 When using Amazon S3 storage, the visualization data is accessible to all project members. Data persists beyond individual JupyterLab sessions. No individual user permissions can be set on stored visualizations. You should consider data classification before storing sensitive information. The storage uses the project's default runtime role for access control. 

**Note**  
 The Amazon S3 storage location is shared across the entire project. All project members can access visualization data stored by any team member. 

# Data Sharing Across Compute Environments


Amazon SageMaker Unified Studio provides magic commands to facilitate data sharing across different compute environments. This section outlines three key commands: `%push`, `%pop`, and `%send_to_remote`.

## %push


The `%push` command allows you to upload specified variables to your project's shared S3 storage within Amazon SageMaker Unified Studio.

```
%push <var_name>
%push <var_name1>,<var_name2>
%push -v <var_name>
%push -v <var_name> --namespace <namespace_name>
```

**Key Features:**
+ Supports multiple variable uploads when comma-separated
+ `-v` specifies the variable name (alternative syntax)
+ Optional `--namespace` argument (defaults to the kernel ID)
+ Uploaded variables are accessible to all project members

**Supported Connections:**
+ Local Python connections
+ AWS Glue connections
+ AWS EMR connections

**Supported Language:** Python

## %pop


The `%pop` command enables you to download specified variables from the shared project Amazon S3 storage to your current compute environment.

```
%pop <var_name>
%pop <var_name1>,<var_name2>
%pop -v <var_name>
%pop -v <var_name> --namespace <namespace_name>
```

**Key Features:**
+ Supports multiple variable downloads when comma-separated
+ `-v` specifies the variable name (alternative syntax)
+ Optional `--namespace` argument (defaults to the kernel ID)

**Supported Connections:**
+ Local Python connections
+ AWS Glue connections
+ AWS EMR connections

**Supported Language:** Python
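
Together, `%push` and `%pop` can move a variable between compute environments through the project's shared storage. A minimal sketch, using a hypothetical variable `processed_rows` and a hypothetical namespace `shared_analysis`:

```
# In a cell running on AWS Glue or Amazon EMR compute: upload the variable
%push processed_rows --namespace shared_analysis

# In a later cell running on local Python compute: download it
%pop processed_rows --namespace shared_analysis
```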

## %send\_to\_remote


The `%send_to_remote` command allows you to send a variable from the local kernel to a remote compute environment.

```
%send_to_remote --name <connection_name> --language <language> --local <local_variable_name> --remote <remote_variable_name>
```

**Key Features:**
+ Supports both Python and Scala in remote environments
+ A Python remote supports `dict`, `df`, and `str` data types
+ A Scala remote supports `df` and `str` data types

**Arguments:**
+ `-l` or `--language`: Specifies the connection language
+ `-n` or `--name`: Specifies the connection to be used
+ `--local`: Defines the local variable name
+ `-r` or `--remote`: Defines the remote variable name

**Supported Connections:** local Python connections

**Supported Language:**
+ Python
+ Scala
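
For example, the following sketch sends a local variable to a Python remote; the connection name `compute_name` and the variable names here are placeholders:

```
%send_to_remote --name compute_name --language python --local local_df --remote remote_df
```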

## Security considerations


 Remember that variables uploaded using `%push` are accessible to all project members within your Amazon SageMaker Unified Studio project. Ensure that sensitive data is handled appropriately and in compliance with your organization's data governance policies. 