

# Unified storage in Amazon SageMaker Unified Studio
<a name="smus-admin-storage-guide"></a>

As an Amazon SageMaker Unified Studio administrator, you are responsible for configuring and managing storage options that support your organization's data science and machine learning workflows. This guide provides essential information for setting up, configuring, and managing storage resources within Amazon SageMaker Unified Studio projects.

Amazon SageMaker Unified Studio provides two primary storage implementations for files used in Amazon SageMaker Unified Studio projects:
+ **Amazon S3 storage**: This is the default option, which uses Amazon Simple Storage Service (Amazon S3) for shared storage areas. By default, all project members have read, write, update, and delete access to the shared storage area. Modified files are immediately visible to all project members, and the storage follows a "last write wins" model: when two members save the same file, the later save overwrites the earlier one. Team members must therefore coordinate when working on the same files to avoid overwriting each other's changes.
+ **Git-based storage**: This option provides advanced version control using Git repositories connected through the CodeConnections service to GitHub, GitHub Enterprise Server, GitLab, GitLab Self-Managed, and Bitbucket.

**Topics**
+ [Configuring project storage options](configuring-project-storage.md)
+ [Performance and cost optimization](performance-cost-optimization.md)
+ [Feature comparison matrix](feature-comparison.md)

# Configuring project storage options
<a name="configuring-project-storage"></a>

## Storage type selection guidelines
<a name="storage-type-selection"></a>

Choose S3 storage for teams with limited Git experience, simple projects without complex versioning needs, quick experimentation and ad-hoc analysis, and scenarios requiring maximum regional availability.

Choose Git-based storage for projects requiring strict version control, collaborative development with code reviews, integration with existing development workflows, and cross-project code sharing requirements.

## Amazon S3 storage configuration
<a name="s3-storage-configuration"></a>

S3 storage is the default option and requires minimal configuration. As an administrator, you can enable [S3 bucket versioning](https://docs.aws.amazon.com/AmazonS3/latest/userguide/manage-versioning-examples.html) to provide basic file history tracking for projects that require it.
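
As an alternative to the S3 console, versioning can also be enabled programmatically. The following is a minimal sketch using the AWS SDK for Python (Boto3), assuming your credentials allow `s3:PutBucketVersioning` on the project bucket. The function name `enable_bucket_versioning` and the bucket name in the usage note are illustrative placeholders, not values defined by Amazon SageMaker Unified Studio.

```python
def enable_bucket_versioning(bucket_name, s3_client=None):
    """Enable versioning on a bucket and return the status that S3 reports."""
    if s3_client is None:
        # Imported lazily so the function can be exercised with a stub client.
        import boto3

        s3_client = boto3.client("s3")

    # put_bucket_versioning and get_bucket_versioning are standard S3 client calls.
    s3_client.put_bucket_versioning(
        Bucket=bucket_name,
        VersioningConfiguration={"Status": "Enabled"},
    )
    # "Status" is absent from the response for buckets that were never versioned,
    # so .get() is used instead of direct indexing.
    return s3_client.get_bucket_versioning(Bucket=bucket_name).get("Status")
```

For example, `enable_bucket_versioning("amzn-s3-demo-bucket")` would be called once per project bucket, where the bucket name is a placeholder for the bucket that backs your project's shared storage.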

## Git-based storage configuration
<a name="git-storage-configuration"></a>

For projects requiring advanced version control, you can configure connections to existing Git repositories during project creation and set default branches and branching policies for effective branch management. You can also enable multiple projects to use the same repository when appropriate, which allows code and resources to be shared across projects. Note that Git-based storage availability is limited by the CodeConnections service, which may impose regional limitations on deployment options. For more information, see [CodeConnections](https://docs.aws.amazon.com/general/latest/gr/codeconnections.html).

For storage organization, refer to [Managing storage resources](https://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/managing-storage.html).

# Performance and cost optimization
<a name="performance-cost-optimization"></a>

## File size limitations
<a name="file-size-limitations"></a>

Files larger than 15 MB cannot be uploaded directly to shared folders through the Amazon SageMaker Unified Studio interface in space-based tools like JupyterLab and Code Editor. Large files must first be uploaded to a local folder in JupyterLab, and then copied or moved to the shared folder if needed.
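
A quick size check before uploading can show whether a file needs to go through a local folder first. The following is a minimal sketch, assuming that "15 MB" means 15 × 1024 × 1024 bytes (the exact byte threshold is not stated here). The names `MAX_DIRECT_UPLOAD_BYTES` and `needs_local_staging` are hypothetical helpers, not part of any Amazon SageMaker Unified Studio API.

```python
import os

# Assumption: "15 MB" is interpreted as 15 * 1024 * 1024 bytes.
MAX_DIRECT_UPLOAD_BYTES = 15 * 1024 * 1024


def needs_local_staging(path, limit_bytes=MAX_DIRECT_UPLOAD_BYTES):
    """Return True if the file exceeds the limit and should go to a local folder first."""
    return os.path.getsize(path) > limit_bytes
```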

## Cost management considerations
<a name="cost-management"></a>

Heavy file read/write workloads in shared storage can incur additional S3 access costs, while frequent S3 operations may affect performance for collaborative workflows.

**For space-based tools (like JupyterLab):** Apart from the shared folder, space-based tools such as JupyterLab and Code Editor also have an EBS-based personal folder per user per project. We recommend using this local storage for intermediate and temporary files during development work, as it provides superior performance for frequent file operations. Only move final versions of files that are ready for sharing with other project users to the S3 shared folder. This approach minimizes S3 operations and associated costs while maintaining optimal performance for iterative development work.

**Note**  
This storage strategy applies specifically to space-based tools like JupyterLab and Code Editor that have access to both local EBS storage and shared storage. For web-based tools like Query Editor, intermediate or temporary files are generated during normal operation, but since these tools don't have a dedicated personal folder, all files are saved directly to shared storage. Web-based tools rely entirely on the shared storage for file operations and don't have the option to use local EBS storage for performance optimization.
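
For space-based tools, the "work locally, share only final versions" approach described above can be sketched as a small helper that copies a finished file from the personal folder into the shared folder. This is an illustration only: the function name `publish_to_shared` and the paths in the usage note are hypothetical, and the actual folder locations depend on your space configuration.

```python
import shutil
from pathlib import Path


def publish_to_shared(local_file, shared_dir):
    """Copy a finished file from the personal folder into the shared folder."""
    source = Path(local_file)
    destination_dir = Path(shared_dir)
    destination_dir.mkdir(parents=True, exist_ok=True)

    # copyfile copies contents only and overwrites any existing file with the
    # same name, which matches the "last write wins" behavior of shared storage.
    # The local working copy is left in place.
    return Path(shutil.copyfile(source, destination_dir / source.name))
```

For example, `publish_to_shared("work/model_report.csv", "shared")` would be run once when the report is final, rather than writing every intermediate version to shared storage. Both paths are placeholders.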

# Feature comparison matrix
<a name="feature-comparison"></a>

The following table provides a comprehensive comparison of key features between Git-based and S3 storage options to help you make informed decisions when configuring storage for your Amazon SageMaker Unified Studio projects.


| Feature | Git-based projects | S3-based projects | 
| --- | --- | --- | 
| Audit trail | Full Git commit history tracks all changes including author information, timestamps, and detailed commit messages. Complete audit trail is maintained in the Git repository. | No systematic tracking of file changes or user attribution. Basic file modification timestamps are available, but no detailed change history or commit messages are maintained. | 
| Version history | Complete Git versioning with full commit history, branching, and merging capabilities. Version history is accessible through Git commands in JupyterLab or through the Git provider's web interface. | S3 bucket versioning must be enabled from the S3 console by administrators. When enabled, version history will be available from the S3 console, allowing you to view and restore previous versions of files. | 
| Shared storage | All project members work through the same Git repository. Files must be "Saved to project" or pushed to the repository. | Shared folder (shared_files/) accessible by all project members. Direct file sharing. | 
| Cross-project sharing | Multiple Amazon SageMaker Unified Studio projects can connect to the same Git repository, enabling code and resource sharing across different project teams. | Each project has its own dedicated S3 storage location. Files cannot be directly shared between projects without manual copying. | 
| Regional availability | Limited by the availability of the CodeConnections service. | Available in all Regions where S3 is available. | 
| Change documentation | All changes are documented through Git commit messages that developers write when saving changes. Provides detailed context for each modification. | No built-in mechanism for documenting changes. File modifications occur without requiring or capturing change descriptions. | 
| Setup complexity | Requires Git repository configuration | Minimal configuration required | 