

# 2 – Modernize deployment of the analytics jobs and applications
<a name="design-principle-2"></a>

 **How do you deploy jobs and applications in a controlled and reproducible way?** Modern development practices, such as continuous integration and continuous delivery (CI/CD), help ensure that changes are rolled out in a controlled and repeatable way. 

 Your team should use test automation to verify infrastructure, code changes, and data updates at every stage of your deployment lifecycle. Analytics processing often requires managing complex workflows, including job scheduling, dependencies between jobs, and job monitoring. You also need an orchestration tool for data movement. 
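The core of the orchestration concern above, running jobs in dependency order, can be sketched in a few lines of Python. The job names and dependency graph here are hypothetical; in practice a dedicated orchestrator also handles scheduling, retries, and monitoring.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph: each job maps to the set of jobs it depends on.
jobs = {
    "ingest": set(),
    "clean": {"ingest"},
    "aggregate": {"clean"},
    "report": {"aggregate", "clean"},
}

def run_order(dependencies):
    """Return an execution order that respects every job dependency."""
    return list(TopologicalSorter(dependencies).static_order())

print(run_order(jobs))  # ['ingest', 'clean', 'aggregate', 'report']
```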


|  **ID**  |  **Priority**  |  **Best practice**  | 
| --- | --- | --- | 
|  ☐ BP 2.1   |  Recommended  |  Use version control for job and application changes.  | 
|  ☐ BP 2.2   |  Recommended  |  Create test data and provision a staging environment.  | 
|  ☐ BP 2.3   |  Recommended  |  Test and validate analytics jobs and application deployments.  | 
|  ☐ BP 2.4   |  Recommended  |   Build standard operating procedures for deployment, test, rollback, and backfill tasks.   | 

 For more details, refer to the following information: 
+ Reference architecture: [Deployment Pipeline Reference Architecture](https://pipelines.devops.aws.dev/) 
+ AWS Big Data Blog: [Build, Test and Deploy ETL solutions using AWS Glue and AWS CDK based CI/CD pipelines ](https://aws.amazon.com/blogs/big-data/build-test-and-deploy-etl-solutions-using-aws-glue-and-aws-cdk-based-ci-cd-pipelines/)
+  AWS Big Data Blog: [AWS serverless data analytics pipeline reference architecture](https://aws.amazon.com/blogs/big-data/aws-serverless-data-analytics-pipeline-reference-architecture/) 
+  AWS Whitepaper: [Building a Cloud Operating Model](https://docs.aws.amazon.com/whitepapers/latest/building-cloud-operating-model/building-cloud-operating-model.html) 
+  AWS Big Data Blog: [Build a DataOps platform to break silos between engineers and analysts](https://aws.amazon.com/blogs/big-data/build-a-dataops-platform-to-break-silos-between-engineers-and-analysts/) 

# Best practice 2.1 – Use version control for job and application changes
<a name="best-practice-2.1---use-version-control-for-job-and-application-changes."></a>

 Version control systems track changes and let you revert an analytics system to a previous version if a change causes unintended consequences. Your team should keep both analytics infrastructure as code (IaC) and analytics application logic in version-controlled code repositories. 

## Suggestion 2.1.1 – Use infrastructure as code and version control systems so that a failed deployment can be rolled back to a previous good state
<a name="suggestion-2.1.1---use-infrastructure-as-code-and-version-control-systems-so-that-a-failed-deployment-can-be-rolled-back-to-a-previous-good-state."></a>

 Follow software development best practices when building analytics systems. For example, deploy resources using code templates, such as AWS CloudFormation or HashiCorp Terraform, so that all deployments occur exactly as intended. Use version control systems (for example, code repositories such as GitHub) to hold current and previous versions of your code templates. With these tools, if a new change results in unwanted outcomes, you can easily roll back to the previous code template. 
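As an illustration only, a job defined in a minimal CloudFormation template can be redeployed from any tagged version of the repository to restore a previous good state. The resource name, script location, and parameter below are hypothetical.

```yaml
# Illustrative CloudFormation sketch; names and paths are hypothetical.
AWSTemplateFormatVersion: "2010-09-09"
Description: Example analytics job defined as code so deployments are repeatable.
Parameters:
  GlueJobRoleArn:
    Type: String          # IAM role supplied at deploy time
Resources:
  AnalyticsGlueJob:
    Type: AWS::Glue::Job
    Properties:
      Name: nightly-sales-etl                     # hypothetical job name
      Role: !Ref GlueJobRoleArn
      Command:
        Name: glueetl
        ScriptLocation: s3://my-artifacts-bucket/etl/nightly_sales.py
```

Because the template itself lives in version control, rolling back a bad deployment is a redeploy of the prior commit rather than a manual repair.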

 For more details, refer to the following information: 
+  AWS Whitepaper: [Introduction to DevOps on AWS](https://docs.aws.amazon.com/whitepapers/latest/introduction-devops-aws/infrastructure-as-code.html) 
+  AWS Blog: [Automate building an integrated analytics solution with AWS Analytics Automation Toolkit](https://aws.amazon.com/blogs/big-data/automate-building-an-integrated-analytics-solution-with-aws-analytics-automation-toolkit/) 

 

# Best practice 2.2 – Create test data and provision a staging environment
<a name="best-practice-2.2---create-test-data-and-provision-staging-environment."></a>

 Using a known and unchanging dataset for test purposes helps ensure that when changes are made to the analytics environment or analytics application code, test results can be compared to previous versions. 

 Test datasets that accurately represent real-world data allow the analytics workload developer to verify the outcomes of the analytics job and to compare test results with previous versions. 

 Your organization should use a staging environment for user access testing, and should create logically separated AWS accounts for your development, test, staging, and production environments, depending on your development standards. 

 For more details, refer to the following information: 

+  AWS Whitepaper: [Establishing your best practice AWS environment](https://docs.aws.amazon.com/whitepapers/latest/organizing-your-aws-environment/organizing-your-aws-environment.html) 

## Suggestion 2.2.1 – Use a curated dataset to test application logic and performance improvements
<a name="suggestion-2.2.1---use-a-curated-dataset-to-test-application-logic-and-performance-improvements."></a>

 Analytics projects under development should use the same curated dataset to compare results between tests of different versions of your code. Using the same dataset for all tests lets you demonstrate improvement over time and makes it easier to recognize regressions in your code. 
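One lightweight way to apply a curated dataset is to record the expected job output as a baseline and assert against it on every run. This Python sketch uses hypothetical data and a trivial aggregation in place of a real analytics job.

```python
import csv
import io

# Hypothetical curated "golden" dataset used for every test run.
GOLDEN_CSV = """region,revenue
north,100
south,250
"""

def total_revenue(csv_text):
    """The job logic under test: sum the revenue column."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return sum(int(row["revenue"]) for row in rows)

# Baseline recorded from the last known-good version of the job.
EXPECTED_TOTAL = 350

assert total_revenue(GOLDEN_CSV) == EXPECTED_TOTAL  # regression check
```

Because the dataset never changes, any difference from the baseline points to a change in the job logic rather than in the data.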

 To help control access to sensitive data, your organization should use data masking techniques when copying production data to non-production environments. More information on data minimization techniques can be found in [Security](security.md). 
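A masking step might look like the following sketch, which replaces sensitive fields with a one-way hash before records reach a non-production environment. The field names are hypothetical; managed alternatives such as AWS DMS data masking and Amazon Redshift dynamic data masking are referenced below.

```python
import hashlib

def mask_record(record, sensitive_fields=("email", "phone")):
    """Return a copy of the record with sensitive fields replaced by
    a truncated one-way hash, so raw values never leave production."""
    masked = dict(record)
    for field in sensitive_fields:
        if field in masked:
            digest = hashlib.sha256(masked[field].encode()).hexdigest()[:12]
            masked[field] = f"masked-{digest}"
    return masked

row = {"customer_id": "42", "email": "jane@example.com", "revenue": "120"}
print(mask_record(row))  # email is masked; other fields are unchanged
```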

 For more details, refer to the following information: 
+  AWS Database Blog: [Data Masking using AWS DMS (AWS Data Migration Service)](https://aws.amazon.com/blogs/database/data-masking-using-aws-dms/) 
+ Amazon Redshift Data Masking: [Dynamic data masking (DDM) in Amazon Redshift](https://docs.aws.amazon.com/redshift/latest/dg/t_ddm.html) 

## Suggestion 2.2.2 – Use a random sample of recent data to validate application edge cases and help ensure that regressions have not been introduced
<a name="suggestion-2.2.2---use-a-random-sample-of-recent-data-to-validate-application-edge-cases-and-help-ensure-that-regressions-have-not-been-introduced."></a>

 Use a statistically valid random sample of recent data to confirm that the analytics solution continues to perform under real-world conditions. Using a sample of recent data also allows you to recognize whether your dataset characteristics have shifted, or whether anomalous data has recently been introduced. 
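A seeded random sample keeps this kind of check reproducible between runs. The records in this sketch are synthetic; for real datasets, tools such as SageMaker Data Wrangler provide random and stratified sampling.

```python
import random

# Synthetic stand-in for a table of recent records.
recent_records = [{"id": i, "value": i * 3} for i in range(1_000)]

def sample_recent(records, k, seed=17):
    """Return k records sampled without replacement; the fixed seed
    makes the regression check reproducible between test runs."""
    rng = random.Random(seed)
    return rng.sample(records, k)

sample = sample_recent(recent_records, 50)
print(len(sample))  # 50
```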

For more information, see the AWS Machine Learning Blog: [Create random and stratified samples of data with Amazon SageMaker AI Data Wrangler](https://aws.amazon.com/blogs/machine-learning/create-random-and-stratified-samples-of-data-with-amazon-sagemaker-data-wrangler/). 

# Best practice 2.3 – Test and validate analytics jobs and application deployments
<a name="best-practice-2.3---test-and-validate-analytics-jobs-and-application-deployments."></a>

 Before making changes in production environments, use standard and repeatable automated tests to validate performance and accuracy of results. 

## Suggestion 2.3.1 – Establish separate staging environments to test changes before going live
<a name="suggestion-2.3.1---establish-separate-staging-environments-to-test-changes-before-going-live."></a>

 Use separate environments, such as development, test, and production, to allow feature development to be introduced without disrupting production systems. Test changes for accuracy and performance before changes are deployed into the production environment. 

## Suggestion 2.3.2 – Automate the deployment and testing when infrastructure and application changes are introduced
<a name="suggestion-2.3.2---automate-the-deployment-and-testing-when-infrastructure-and-applications-changes-occur."></a>

The deployment of data pipelines and data infrastructure changes should be an automated process. When code is checked into version control, a CI/CD process should run tests and apply the changes to the staging environment; once the changes are tested and confirmed correct, the process should deploy them to the production environment. 

You can use AWS CodePipeline to define a CI/CD process. 
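As a sketch only, the build stage of such a pipeline could run unit tests and lint templates before deploying to staging. The file paths, stack name, and specific commands in this CodeBuild buildspec are hypothetical.

```yaml
# Illustrative buildspec; paths, stack name, and commands are hypothetical.
version: 0.2
phases:
  install:
    commands:
      - pip install -r requirements.txt
  build:
    commands:
      - pytest tests/                  # unit tests gate every deployment
      - cfn-lint templates/*.yaml      # validate templates before deploying
  post_build:
    commands:
      - aws cloudformation deploy
          --template-file templates/etl.yaml
          --stack-name analytics-staging
```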

 For more details, refer to the following information: 
+  AWS Prescriptive Guidance: [Deploy an AWS Glue job with an AWS CodePipeline CI/CD pipeline](https://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/deploy-an-aws-glue-job-with-an-aws-codepipeline-ci-cd-pipeline.html) 
+  AWS DevOps Blog: [How to unit test and deploy AWS Glue jobs using AWS CodePipeline](https://aws.amazon.com/blogs/devops/how-to-unit-test-and-deploy-aws-glue-jobs-using-aws-codepipeline/) 
+ AWS DevOps Blog: [10 ways to build applications faster with Amazon CodeWhisperer](https://aws.amazon.com/blogs/devops/10-ways-to-build-applications-faster-with-amazon-codewhisperer/) 

 

# Best practice 2.4 – Build standard operating procedures for deployment, test, rollback, and backfill tasks
<a name="best-practice-2.4---build-standard-operating-procedures-for-deployment-test-rollback-and-backfill-tasks."></a>

 Standard operating procedures for deployment, test, rollback, and data backfill tasks allow faster deployments and reduce the number of errors that reach production. Using a standard approach also makes remediation easier if a deployment results in unintended consequences. 

## Suggestion 2.4.1 – Document and use standard operating procedures for implementing changes in your analytics workload
<a name="suggestion-2.4.1---document-and-use-standard-operating-procedures-for-implementing-changes-in-your-analytics-workload."></a>

 Standard operating procedures allow teams to make changes confidently, avoiding repeated mistakes and reducing the chance of human error. 

## Suggestion 2.4.2 – Use automation to perform changes to underlying analytics infrastructure or application logic
<a name="suggestion-2.4.2"></a>

 Automated tests can detect when changes have unintended consequences and can trigger a rollback without human intervention. 