

# Architecture overview

This section provides an overview of the architecture of this solution.

 **Topics** 
+  [Architecture diagram](#architecture-diagram) 
+  [Architectural components](#architecture-components) 
+  [Functional components](#functional-overview) 
+  [AWS services](#aws-services-in-this-solution) 

## Architecture diagram


Deploying this solution with the default parameters builds the following components in your AWS account.

![Architecture diagram](http://docs.aws.amazon.com/solutions/latest/deepracer-on-aws/images/architecture-diagram.png)


## Architectural components


1. A user accesses the DeepRacer on AWS user interface through an [Amazon CloudFront](https://aws.amazon.com/cloudfront/) distribution, which delivers static web assets from the UI assets bucket and video streams from simulations.

1. An [Amazon S3](https://aws.amazon.com/s3/) bucket hosts the static web assets that comprise the user interface.

1. An [Amazon Cognito](https://aws.amazon.com/cognito/) user pool manages users and user group membership.

1. An Amazon Cognito identity pool manages federation and rule-based role mapping for users.

1.  [AWS IAM](https://aws.amazon.com/iam/) roles define the permissions and level of access for each user group, enforcing access control and authorization across the system.

1.  [AWS Lambda](https://aws.amazon.com/lambda/) registration hooks run pre- and post-registration actions, such as assigning new users to the racer group and creating the initial admin profile.

1.  [AWS WAF](https://aws.amazon.com/waf/) provides intelligent protection for the API against common attack vectors and allows customers to define custom rules based on individual use cases and usage patterns.

1.  [Amazon API Gateway](https://aws.amazon.com/api-gateway/) routes API requests to their appropriate handler using a defined Smithy model.

1. A single [Amazon DynamoDB](https://aws.amazon.com/dynamodb/) table stores profiles, training jobs, models, evaluation jobs, submissions, and leaderboards.

1. AWS Lambda functions are triggered by requests routed from the API and handle CRUD operations, dispatching of training and evaluation jobs, and other tasks.

1. A global settings handler (AWS Lambda function) reads and writes application-level settings to the configuration.

1. An [AWS AppConfig](https://aws.amazon.com/systems-manager/features/appconfig/) hosted configuration stores application-level settings, such as usage quotas.

1. Model export handlers (AWS Lambda functions) retrieve the asset URL and package assets for use in exporting models from the system.

1. An [Amazon SQS](https://aws.amazon.com/sqs/) dead-letter queue catches failed export jobs from the asset packaging function.

1. A virtual model bucket stores exported models and provides access to them via pre-signed URL.

1. A model import handler (AWS Lambda function) receives requests to import a model onto the system and creates a new import job.

1. A model import queue (Amazon SQS) receives jobs from the model import function and holds them until they are accepted by the dispatcher; a DLQ handles failed jobs.

1. A failed request handler (AWS Lambda function) manages failed requests and updates their status to reflect their current state.

1. An import dispatching function takes a job from the queue and dispatches it to the workflow.

1. A reward function validator (AWS Lambda function) validates and sanitizes the customer-provided reward function code before it is saved to the system.

1. An imported model validator function checks and validates the imported model before it is saved to the system.

1. An imported model assets handler (AWS Lambda function) brings in model assets from the upload bucket.

1. An import completion handler (AWS Lambda function) handles status updates when a job is completed successfully.

1. An upload bucket (Amazon S3) stores uploaded (but not yet imported) assets from the user.

1. An Amazon SQS FIFO queue receives requests for training and evaluation jobs and stores them in FIFO order.

1. A job dispatcher function picks a job off the top of the FIFO queue and dispatches it to the workflow.

1. Workflow functions handle setting up the job, setting status, and other workflow tasks.

1.  [Amazon SageMaker AI training jobs](https://aws.amazon.com/sagemaker/) perform the actual training and evaluation of the model using the reward function and hyperparameters provided.

1.  [Amazon Kinesis Video Streams](https://aws.amazon.com/kinesis/video-streams/) streams simulation video from the training job to the user.

1. A user data bucket stores all user data including trained models, evaluation results, and other assets generated during the DeepRacer workflow.
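The reward function referenced above is customer-provided Python code that scores the agent's behavior during training. As an illustration only (this example is not part of the deployed solution), a minimal centerline-following reward function might look like:

```python
def reward_function(params):
    """Minimal sketch of a DeepRacer reward function that favors
    staying close to the track centerline. The params dict follows
    the DeepRacer input-parameter convention."""
    track_width = params["track_width"]
    distance_from_center = params["distance_from_center"]

    # Reward tiers: highest near the centerline, near zero at the edge.
    if distance_from_center <= 0.1 * track_width:
        return 1.0
    if distance_from_center <= 0.25 * track_width:
        return 0.5
    if distance_from_center <= 0.5 * track_width:
        return 0.1
    return 1e-3  # likely off track
```

The reward shaping strategy (tier widths, penalty for leaving the track) is entirely up to the user; this tiered scheme is one common starting point.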

## Functional components


This solution implements a serverless, microservices-based architecture that enables users to train and evaluate reinforcement learning models for autonomous racing. The architecture is organized around several key functional areas that work together to provide a complete reinforcement learning education platform.

Users access DeepRacer on AWS through a web-based console delivered via Amazon CloudFront, which provides fast, global distribution of the user interface assets. These static web assets are hosted in Amazon S3, ensuring reliable and scalable content delivery to users worldwide. Amazon Cognito manages user authentication and authorization, handling user registration, login, and session management.

When new users register, the system automatically creates user profiles and establishes proper permissions, ensuring a seamless onboarding experience. This authentication layer secures access to the platform while enabling users to maintain their own private workspace for models, training data, and race submissions.

All user interactions with the system flow through Amazon API Gateway, which serves as the central entry point for backend operations. The API Gateway routes requests to appropriate AWS Lambda functions based on the endpoint accessed, providing a clean separation between the user interface and backend processing logic. AWS WAF protects the API layer from common security threats such as bot attacks, DDoS attempts, and malicious traffic patterns.
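As a sketch of this routing pattern (the solution's actual routes come from its Smithy model, so the paths and operation names below are illustrative only), a proxy-style Lambda handler might dispatch requests like this:

```python
import json

# Hypothetical operations; the real solution's API surface is defined
# in a Smithy model and is not reproduced here.
def list_models(event):
    return {"models": []}

def get_leaderboard(event):
    return {"entries": []}

ROUTES = {
    ("GET", "/models"): list_models,
    ("GET", "/leaderboard"): get_leaderboard,
}

def handler(event, context=None):
    """Dispatch an API Gateway proxy event to the matching operation."""
    op = ROUTES.get((event["httpMethod"], event["resource"]))
    if op is None:
        return {"statusCode": 404, "body": json.dumps({"message": "Not found"})}
    return {"statusCode": 200, "body": json.dumps(op(event))}
```

Keeping the route table in one place preserves the clean separation described above: the user interface only knows endpoints, not which function serves them.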

The solution uses a combination of Amazon DynamoDB and Amazon S3 to handle different types of data storage needs. DynamoDB serves as the primary database for structured data including user profiles, model metadata, training job status, leaderboards, and race submissions. Amazon S3 handles file storage for larger assets such as trained model files, training logs, evaluation videos, and other user-generated content.
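A common way to model such a single-table design is to prefix partition and sort keys by entity type; the solution's actual key schema is not documented here, so the item shapes below are hypothetical:

```python
# Hypothetical single-table key schema: all of a user's items share a
# partition key, and the sort key distinguishes entity types.
def profile_item(user_id, display_name):
    return {
        "PK": f"USER#{user_id}",
        "SK": "PROFILE",
        "displayName": display_name,
    }

def model_item(user_id, model_id, status):
    # Models sort under the owning user, so a single Query on PK can
    # return a user's profile together with all of their models.
    return {
        "PK": f"USER#{user_id}",
        "SK": f"MODEL#{model_id}",
        "status": status,
    }
```

This pattern lets one DynamoDB table serve several entity types while keeping related items retrievable in a single query.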

The core reinforcement learning functionality is centered on Amazon SageMaker AI training jobs, which provide the compute resources for running reinforcement learning training and evaluation. When users initiate training jobs, the requests are queued in Amazon SQS to manage demand and ensure fair resource allocation. AWS Step Functions orchestrates the workflow of preparing training environments, monitoring job progress, and handling completion tasks. The system pulls a containerized simulation environment from Amazon ECR, which comprises the DeepRacer virtual simulator built on robotics simulation technology.
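As an illustration of the FIFO queuing step, the parameters for enqueuing a training job could be built as shown below. The queue URL, payload fields, and grouping strategy are assumptions for the sketch, not the solution's actual message contract:

```python
import hashlib
import json

def build_training_job_message(user_id, job):
    """Sketch of SQS FIFO send_message parameters for queuing a job.
    Grouping by user keeps one user's jobs in submission order, and a
    content-based hash deduplicates retried submissions."""
    body = json.dumps(job, sort_keys=True)
    return {
        # Placeholder queue URL; the real queue is created at deploy time.
        "QueueUrl": "https://sqs.us-east-1.amazonaws.com/123456789012/jobs.fifo",
        "MessageBody": body,
        "MessageGroupId": f"user-{user_id}",
        "MessageDeduplicationId": hashlib.sha256(body.encode()).hexdigest(),
    }
```

With SQS FIFO queues, messages sharing a `MessageGroupId` are delivered in order, which is what gives each user a fair, ordered job stream.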

During model training and evaluation, Amazon Kinesis Video Streams captures video from the simulation environment and streams it in real-time to the console. This allows users to watch their models learn and perform, providing immediate visual feedback on training progress and model behavior. The streaming capability delivers an engaging, visual experience that helps users understand how their models are developing and performing on the virtual race track.

Before any user-provided code executes in the system, it passes through validation functions. These examine reward functions and imported models for security issues, ensuring that malicious or harmful code cannot compromise the system. The functions operate within isolated network environments that prevent external communication, providing an additional security boundary.
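A static check in this spirit can be sketched with Python's `ast` module; the specific rules below are illustrative and are not the solution's actual validation logic:

```python
import ast

def validate_reward_function(source):
    """Illustrative static checks on submitted reward function code:
    reject imports and dunder attribute access, and require that a
    reward_function definition exists."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return False, f"syntax error: {exc.msg}"
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            return False, "imports are not allowed"
        if isinstance(node, ast.Attribute) and node.attr.startswith("__"):
            return False, "dunder attribute access is not allowed"
    names = {n.name for n in tree.body if isinstance(n, ast.FunctionDef)}
    if "reward_function" not in names:
        return False, "must define reward_function(params)"
    return True, "ok"
```

Static analysis of this kind complements, rather than replaces, the isolated network environments mentioned above.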

Amazon CloudWatch provides comprehensive monitoring and logging across all system components, collecting metrics, logs, and performance data from Lambda functions, SageMaker instances, API Gateway, and other services. This enables cloud administrators to understand system performance, troubleshoot issues, and optimize resource usage.
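One common pattern for emitting custom metrics from Lambda functions is the CloudWatch Embedded Metric Format (EMF), in which a structured log line is automatically extracted into metrics. Whether this solution uses EMF is not stated, so the namespace, dimension, and metric names below are hypothetical:

```python
import json
import time

def emf_record(namespace, metric_name, value, unit="Count"):
    """Build a CloudWatch Embedded Metric Format (EMF) log line.
    Printing this from a Lambda function lets CloudWatch extract the
    metric directly from the log stream."""
    return json.dumps({
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": namespace,
                "Dimensions": [["Service"]],
                "Metrics": [{"Name": metric_name, "Unit": unit}],
            }],
        },
        "Service": "JobDispatcher",  # hypothetical dimension value
        metric_name: value,
    })
```

EMF avoids synchronous `PutMetricData` calls from hot code paths, since the metric piggybacks on the log line the function writes anyway.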

## AWS services



| AWS service | Function | Description | 
| --- | --- | --- | 
|   [Amazon API Gateway](https://aws.amazon.com/api-gateway/)   |  Core  |  Hosts REST API endpoints in the solution.  | 
|   [AWS CloudFormation](https://aws.amazon.com/cloudformation/)   |  Core  |  Used to deploy the solution.  | 
|   [Amazon CloudFront](https://aws.amazon.com/cloudfront/)   |  Core  |  Serves the web content hosted in Amazon S3.  | 
|   [Amazon Cognito](https://aws.amazon.com/cognito/)   |  Core  |  Handles user management and authentication for the API.  | 
|   [Amazon DynamoDB](https://aws.amazon.com/dynamodb/)   |  Core  |  Stores all user data related to user profiles, models, leaderboards, and submissions in a single table.  | 
|   [Amazon Elastic Container Registry](https://aws.amazon.com/ecr/)   |  Core  |  Stores the Simulation Application (SimApp) image as a public container image, which is used by SageMaker instances to run the DeepRacer simulation application.  | 
|   [Amazon Kinesis Video Streams](https://aws.amazon.com/kinesis/video-streams/)   |  Core  |  Streams videos from SageMaker AI training jobs to the user console, providing real-time visual feedback of model performance.  | 
|   [Amazon S3](https://aws.amazon.com/s3/)   |  Core  |  Hosts static web assets for the user console and stores user-generated artifacts such as model files, training logs, and evaluation videos.  | 
|   [Amazon SageMaker](https://aws.amazon.com/sagemaker/)   |  Core  |  Runs the Simulation Application (SimApp) for training and evaluating DeepRacer models.  | 
|   [Amazon SQS](https://aws.amazon.com/sqs/)   |  Core  |  Provides a first-in-first-out job queue that holds simulation jobs before they are forwarded to the job dispatcher.  | 
|   [AWS Lambda](https://aws.amazon.com/lambda/)   |  Core  |  Powers various functions including API request handling, model validation, reward function validation, job dispatching, and workflow management.  | 
|   [AWS Step Functions](https://aws.amazon.com/step-functions/)   |  Core  |  Manages workflow functions that orchestrate training and evaluation jobs on SageMaker instances.  | 
|   [AWS WAF](https://aws.amazon.com/waf/)   |  Core  |  Provides system protection against bot spam, DDoS attacks, credential stuffing, and other common attack vectors.  | 
|   [Amazon CloudWatch](https://aws.amazon.com/cloudwatch/)   |  Core  |  Provides monitoring and logging capabilities for all components of the DeepRacer on AWS solution.  | 
|   [AWS Identity and Access Management (IAM)](https://aws.amazon.com/iam/)   |  Core  |  Manages access control and permissions for various components of the DeepRacer on AWS solution.  | 
|   [Amazon Virtual Private Cloud (VPC)](https://aws.amazon.com/vpc/)   |  Optional  |  Can be used to provide network isolation for SageMaker AI training jobs for enhanced security.  | 

# AWS Well-Architected design considerations


This solution follows best practices from the [AWS Well-Architected Framework](https://aws.amazon.com/architecture/well-architected/), which helps customers design and operate reliable, secure, efficient, and cost-effective workloads in the cloud.

This section describes how the design principles and best practices of the Well-Architected Framework benefit this solution.

 **Topics** 
+  [Operational excellence](#operational-excellence) 
+  [Security](#security) 
+  [Reliability](#reliability) 
+  [Performance efficiency](#performance-efficiency) 
+  [Cost optimization](#cost-optimization) 
+  [Sustainability](#sustainability) 

## Operational excellence


This section describes how we architected this solution using the principles and best practices of the [operational excellence pillar](https://docs.aws.amazon.com/wellarchitected/latest/operational-excellence-pillar/welcome.html).
+ All resources are defined as infrastructure as code using AWS CloudFormation templates generated from AWS CDK constructs.
+ The solution pushes metrics to Amazon CloudWatch at various stages to provide observability into AWS Lambda functions, Amazon SageMaker, AWS Step Functions, Amazon S3 buckets, and other solution components.

## Security


This section describes how we architected this solution using the principles and best practices of the [security pillar](https://docs.aws.amazon.com/wellarchitected/latest/security-pillar/welcome.html).
+ Amazon Cognito authenticates and authorizes web console users and API requests.
+ All interservice communications use [AWS Identity and Access Management](https://aws.amazon.com/iam/) (IAM) roles with least privilege access, containing only the minimum permissions required.
+ All data storage, including S3 buckets and DynamoDB tables, encrypts data at rest using AWS managed keys.
+ Logging, tracing, and versioning are enabled where applicable for audit and compliance purposes.
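As an illustration of least-privilege scoping (the table name, Region, and account ID below are placeholders, not the solution's actual policies), a Lambda function's role might allow only the DynamoDB actions it needs on the single table:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["dynamodb:GetItem", "dynamodb:PutItem", "dynamodb:Query"],
      "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/ExampleSolutionTable"
    }
  ]
}
```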

## Reliability


This section describes how we architected this solution using the principles and best practices of the [reliability pillar](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/welcome.html).
+ The solution uses AWS serverless services wherever possible (for example, AWS Lambda, Amazon API Gateway, Amazon S3, AWS Step Functions, and Amazon DynamoDB) to ensure high availability and recovery from service failure.
+ Data is stored in DynamoDB and Amazon S3, so it persists in multiple Availability Zones by default.

## Performance efficiency


This section describes how we architected this solution using the principles and best practices of the [performance efficiency pillar](https://docs.aws.amazon.com/wellarchitected/latest/performance-efficiency-pillar/welcome.html).
+ The solution uses a serverless architecture with the ability to scale horizontally as needed.
+ The solution can be launched in any region that supports the AWS services in this solution, which include: AWS Lambda, Amazon API Gateway, Amazon S3, AWS Step Functions, Amazon DynamoDB, and Amazon Cognito.
+ The solution uses managed services throughout to reduce the operational burden of resource provisioning and management.
+ The solution is automatically tested and deployed daily to maintain consistency as AWS services change, and is reviewed by solutions architects and subject matter experts for areas to experiment with and improve.

## Cost optimization


This section describes how we architected this solution using the principles and best practices of the [cost optimization pillar](https://docs.aws.amazon.com/wellarchitected/latest/cost-optimization-pillar/welcome.html).
+ The solution uses a serverless architecture; therefore, you are only charged for what you use.
+ Amazon DynamoDB scales capacity on demand, so you only pay for the capacity you use.
+ Amazon SageMaker allows you to pay only for the compute resources you use, with no upfront expenses.

## Sustainability


This section describes how we architected this solution using the principles and best practices of the [sustainability pillar](https://docs.aws.amazon.com/wellarchitected/latest/sustainability-pillar/sustainability-pillar.html).
+ The solution uses managed, serverless services to minimize the environmental impact of the backend services compared to continually operating on-premises services.
+ Serverless services allow you to scale up or down as needed.