As new data sources and partners arise, a variety of data transfer services can adapt to these changing access patterns. For multi-site environments, S3 File Gateway can transfer data to AWS while retaining an on-site cache for other applications. AWS Transfer Family lets partnering entities, such as contract research organizations (CROs), easily upload study results.
Overview
This Guidance helps you connect life sciences instruments and laboratory system files to the AWS Cloud, either over the internet or through a low-latency direct connection. You can reduce storage costs for infrequently accessed data, or make that data available for high-performance computing workloads such as genomics and imaging, all on AWS.
How it works
This architecture diagram shows how to connect file-based life sciences instruments and laboratory systems to the cloud and provide scalable data access and computing using Amazon Web Services (AWS).
Well-Architected Pillars
The architecture diagram above is an example of a Solution created with Well-Architected best practices in mind. To be fully Well-Architected, you should follow as many Well-Architected best practices as possible.
Operational Excellence
Security
For data protection purposes, we recommend that you protect AWS account credentials and set up individual user accounts with AWS Identity and Access Management (IAM), so that each user is given only the permissions necessary to fulfill their job duties. We also suggest that you enable encryption at rest; the services in this Guidance encrypt data in transit by default.
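To illustrate the least-privilege recommendation above, here is a minimal sketch of an IAM policy that limits an instrument-upload role to writing into a single S3 prefix. The bucket name and prefix are hypothetical placeholders, not part of this Guidance; scope the resources to your own buckets.

```python
import json

# Illustrative least-privilege IAM policy for a lab-data upload role.
# "example-lab-raw-data" and the "incoming/" prefix are placeholders --
# substitute your own bucket and prefix before use.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowInstrumentUploadsOnly",
            "Effect": "Allow",
            "Action": ["s3:PutObject"],
            "Resource": "arn:aws:s3:::example-lab-raw-data/incoming/*",
        },
        {
            "Sid": "AllowListingIncomingPrefixOnly",
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": "arn:aws:s3:::example-lab-raw-data",
            "Condition": {"StringLike": {"s3:prefix": ["incoming/*"]}},
        },
    ],
}

print(json.dumps(policy, indent=2))
```

Attaching a narrowly scoped policy like this to each role, rather than sharing broad credentials, keeps every user and instrument limited to the permissions its job requires.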
Reliability
DataSync uses one or more VPC endpoints so that if an Availability Zone becomes unavailable, the agent can reach another endpoint. DataSync is a scalable service that moves data through sets of agents; tasks and agents can be scaled to match the volume of data being migrated. DataSync logs all events to Amazon CloudWatch, so if a job fails, you can investigate the issue and identify where the task is failing. When tasks complete, post-processing jobs can be initiated to run the next phase of the pipeline. Amazon S3 provides a highly durable storage infrastructure designed for mission-critical and primary data storage.
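The multi-endpoint failover idea can be sketched as follows. This is a model of the reasoning only, not how DataSync is implemented (the service handles endpoint selection itself), and the endpoint IDs and Availability Zone names are hypothetical.

```python
# Sketch: given VPC endpoints in several Availability Zones, pick the
# first endpoint whose AZ is still healthy. Endpoint IDs and AZs below
# are made-up examples.
from typing import Optional

ENDPOINTS = {
    "vpce-0aaa-use1a": "us-east-1a",
    "vpce-0bbb-use1b": "us-east-1b",
}

def pick_endpoint(healthy_azs: set) -> Optional[str]:
    """Return an endpoint ID in a healthy AZ, or None if none is reachable."""
    for endpoint_id, az in ENDPOINTS.items():
        if az in healthy_azs:
            return endpoint_id
    return None

# If us-east-1a is unavailable, traffic shifts to the us-east-1b endpoint.
print(pick_endpoint({"us-east-1b"}))  # → vpce-0bbb-use1b
```

Deploying endpoints across at least two Availability Zones is what makes this failover possible; with a single endpoint, an AZ outage would leave the agent with nothing to fall back to.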
Performance Efficiency
FSx for Lustre storage provides sub-millisecond latencies, up to hundreds of GBs/s of throughput, and millions of IOPS.
Cost Optimization
By using serverless technologies that scale on demand, you pay only for the resources you use. To further optimize cost, you can stop the notebook environments in SageMaker when they are not in use. If you don't intend to use the Amazon QuickSight visualization dashboard, you can choose not to deploy it. Data transfer charges consist of two main components: DataSync, which is charged per GB transferred, and Direct Connect or VPN data transfer. Additionally, cross-Availability Zone charges might apply if VPC endpoints are used.
Sustainability
CloudWatch metrics allow users to make data-driven decisions based on alerts and trends. By extensively using managed services and dynamic scaling, you minimize the environmental impact of the backend services; most components scale automatically and require no ongoing maintenance.
Related content
Building Digitally Connected Labs with AWS
This post discusses the tools, best practices, and partners helping Life Sciences labs take full advantage of the scale and performance of AWS Cloud.
Guidance for a Laboratory Data Mesh on AWS
This Guidance demonstrates how to build a scientific data management system that integrates both laboratory instrument data and software with cloud data governance, data discovery, and bioinformatics pipelines, capturing key metadata events along the way.