# SEC 10  How do you anticipate, respond to, and recover from incidents?
<a name="sec-10"></a>

Preparation is critical to timely and effective investigation, response to, and recovery from security incidents to help minimize disruption to your organization.

**Topics**
+ [

# SEC10-BP01 Identify key personnel and external resources
](sec_incident_response_identify_personnel.md)
+ [

# SEC10-BP02 Develop incident management plans
](sec_incident_response_develop_management_plans.md)
+ [

# SEC10-BP03 Prepare forensic capabilities
](sec_incident_response_prepare_forensic.md)
+ [

# SEC10-BP04 Automate containment capability
](sec_incident_response_auto_contain.md)
+ [

# SEC10-BP05 Pre-provision access
](sec_incident_response_pre_provision_access.md)
+ [

# SEC10-BP06 Pre-deploy tools
](sec_incident_response_pre_deploy_tools.md)
+ [

# SEC10-BP07 Run game days
](sec_incident_response_run_game_days.md)

# SEC10-BP01 Identify key personnel and external resources
<a name="sec_incident_response_identify_personnel"></a>

 Identify internal and external personnel, resources, and legal obligations that would help your organization respond to an incident. 

When you define your approach to incident response in the cloud, in unison with other teams (such as your legal counsel, leadership, business stakeholders, AWS Support Services, and others), you must identify key personnel, stakeholders, and relevant contacts. To reduce dependency and decrease response time, make sure that your team, specialist security teams, and responders are educated about the services that you use and have opportunities to practice hands-on.

We encourage you to identify external AWS security partners that can provide you with outside expertise and a different perspective to augment your response capabilities. Your trusted security partners can help you identify potential risks or threats that you might not be familiar with.

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Identify key personnel in your organization: Maintain a contact list of personnel within your organization that you would need to involve to respond to and recover from an incident. 
+  Identify external partners: Engage with external partners if necessary that can help you respond to and recover from an incident. 

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [AWS Incident Response Guide](https://docs.aws.amazon.com/whitepapers/latest/aws-security-incident-response-guide/welcome.html) 

 **Related videos:** 
+  [Prepare for and respond to security incidents in your AWS environment ](https://youtu.be/8uiO0Z5meCs)

 **Related examples:** 

# SEC10-BP02 Develop incident management plans
<a name="sec_incident_response_develop_management_plans"></a>

 Create plans to help you respond to, communicate during, and recover from an incident. For example, you can start an incident response plan with the most likely scenarios for your workload and organization. Include how you would communicate and escalate both internally and externally. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>

 An incident management plan is critical to respond, mitigate, and recover from the potential impact of security incidents. An incident management plan is a structured process for identifying, remediating, and responding in a timely matter to security incidents. 

 The cloud has many of the same operational roles and requirements found in an on-premises environment. When creating an incident management plan, it is important to factor response and recovery strategies that best align with your business outcome and compliance requirements. For example, if you are operating workloads in AWS that are FedRAMP compliant in the United States, it’s useful to adhere to [NIST SP 800-61 Computer Security Handling Guide](https://nvlpubs.nist.gov/nistpubs/specialpublications/nist.sp.800-61r2.pdf). Similarly, when operating workloads with European PII (personally identifiable information) data, consider scenarios like how you might protect and respond to issues related to data residency as mandated by [EU General Data Protection Regulation (GDPR) Regulations](https://ec.europa.eu/info/law/law-topic/data-protection/reform/what-does-general-data-protection-regulation-gdpr-govern_en). 

 When building an incident management plan for your workloads operating in AWS, start with the [AWS Shared Responsibility Model](https://aws.amazon.com/compliance/shared-responsibility-model/), for building a defense-in-depth approach towards incident response. In this model, AWS manages security of the cloud, and you are responsible for security in the cloud. This means that you retain control and are responsible for the security controls you choose to implement. The [AWS Security Incident Response Guide](https://docs.aws.amazon.com/whitepapers/latest/aws-security-incident-response-guide/welcome.html) details key concepts and foundational guidance for building a cloud-centric incident management plan.

 An effective incident management plan must be continually iterated upon, remaining current with your cloud operations goal. Consider using the implementation plans detailed below as you create and evolve your incident management plan. 
+  **Educate and train for incident response:** When a deviation from your defined baseline occurs (for example, an erroneous deployment or misconfiguration), you might need to respond and investigate. To successfully do so, you must understand which controls and capabilities you can use for security incident response within your AWS environment, as well as processes you need to consider to prepare, educate, and train your cloud teams participating in an incident response. 
  +  [Playbooks](https://docs.aws.amazon.com/wellarchitected/latest/operational-excellence-pillar/ops_ready_to_support_use_playbooks.html) and [runbooks](https://docs.aws.amazon.com/wellarchitected/latest/operational-excellence-pillar/ops_ready_to_support_use_runbooks.html) are effective mechanisms for building consistency in training how to respond to incidents. Start with building an initial list of frequently run procedures during an incident response, and continue to iterate as you learn or use new procedures. 
  +  Socialize the playbooks and runbooks through scheduled [game days](https://docs.aws.amazon.com/wellarchitected/latest/security-pillar/sec_incident_response_run_game_days.html). During game days, simulate the incident response in a controlled environment so that your team can recall how to respond, and to verify that the teams involved in incident response are well-versed with the workflows. Review the outcomes of the simulated event to identify improvements and determine the need for further training or additional tools. 
  +  Security should be considered everyone’s job. Build collective knowledge of the incident management process by involving all personnel that normally operate your workloads. This includes all aspects of your business: operations, test, development, security, business operations, and business leaders. 
+  **Document the incident management plan:** Document the tools and process to record, act on, communicate the progress of, and provide notifications about active incidents. The goal of the incident management plan is to verify that normal operation is restored as quickly as possible, business impact is minimized, and all concerned parties are kept informed. Examples of incidents include (but are not restricted to) loss or degradation of network connectivity, a non-responsive process or API, a scheduled task not being performed (for example, failed patching), unavailability of application data or service, unplanned service disruption due to security events, credential leakage, or misconfiguration errors. 
  +  Identify the primary owner responsible for incident resolution, such as the workload owner. Have clear guidance on who will run the incident and how communication will be handled. When you have more than one party participating in the incident resolution process, such as an external vendor, consider building a *responsibility (RACI) matrix*, detailing the roles and responsibilities of various teams or people required for incident resolution. 

     A RACI matrix details the following: 
    +  **R:** *Responsible* party that does the work to complete the task. 
    +  **A:** *Accountable* party or stakeholder with final authority over the successful completion of the specific task. 
    +  **C:** *Consulted* party whose opinions are sought, typically as subject matter experts. 
    +  **I:** *Informed* party that is notified of progress, often only on completion of the task or deliverable. 
+  **Categorize incidents:** Defining and categorizing incidents based on severity and impact score allows for a structured approach to triaging and resolving incidents. The following recommendations illustrate an *impact-to-resolution urgency matrix* to quantify an incident. For example, a low-impact, low-urgency incident is considered a low-severity incident. 
  +  **High (H):** Your business is significantly impacted. Critical functions of your application related to AWS resources are unavailable. Reserved for the most critical events affecting production systems. The impact of the incident increases rapidly with remediation being time sensitive. 
  +  **Medium (M):** A business service or application related to AWS resources is moderately impacted and is functioning in a degraded state. Applications that contribute to service level objectives (SLOs) are affected within the service level agreement (SLA) limits. Systems can perform with reduced capability without much financial and reputational impact. 
  +  **Low (L):** Non-critical functions of your business service or application related to AWS resources are impacted. Systems can perform with reduced capability with minimal financial and reputational impact. 
+  **Standardize security controls:** The goal of standardizing security controls is to achieve consistency, traceability, and repeatability regarding operational outcomes. Drive standardization across key activities that are critical for incident response, such as: 
  +  **Identity and access management:** Establish mechanisms for controlling access to your data and managing privileges for both human and machine identities. Extend your own identity and access management to the cloud, using federated security with single sign-on and roles-based privileges to optimize access management. For best practice recommendations and improvement plans to standardize access management, refer to the [identity and access management section](https://docs.aws.amazon.com/wellarchitected/latest/security-pillar/identity-and-access-management.html) of the Security Pillar whitepaper. 
  +  **Vulnerability management:** Establish mechanisms to identify vulnerabilities in your AWS environment that are likely to be used by attackers to compromise and misuse your system. Implement both preventive and detective controls as security mechanisms to respond to and mitigate the potential impact of security incidents. Standardize processes such as threat modeling as part of your infrastructure build and application delivery lifecycle.
  +  **Configuration management:** Define standard configurations and automate procedures for deploying resources in the AWS Cloud. Standardizing both infrastructure and resource provisioning helps mitigate the risk of misconfiguration through erroneous deployments or accidental human misconfigurations. Refer to the [design principles section](https://docs.aws.amazon.com/wellarchitected/latest/operational-excellence-pillar/design-principles.html) of the Operational Excellence Pillar whitepaper for guidance and improvement plans for implementing this control.
  +  **Logging and monitoring for audit control:** Implement mechanisms to monitor your resources for failures, performance degradation, and security issues. Standardizing these controls also provides audit trails of activities that occur in your system, helping timely triage and remediation of issues. Best practices under [SEC04 (“How do you detect and investigate security events?”)](https://docs.aws.amazon.com/wellarchitected/latest/security-pillar/detection.html) provide guidance for implementing this control.
+  **Use automation:** Automation allows timely incident resolution at scale. AWS provides several services to automate within the context of the incident response strategy. Focus on finding an appropriate balance between automation and manual intervention. As you build your incident response in playbooks and runbooks, automate repeatable steps. Use AWS services such as AWS Systems Manager Incident Manager to [resolve IT incidents faster](https://aws.amazon.com/blogs/aws/resolve-it-incidents-faster-with-incident-manager-a-new-capability-of-aws-systems-manager/). Use [developer tools](https://aws.amazon.com/devops/) to provide version control and automate [Amazon Machine Images (AMI)](https://aws.amazon.com/amis/) and Infrastructure as Code (IaC) deployments without human intervention. Where applicable, automate detection and compliance assessment using managed services like Amazon GuardDuty, Amazon Inspector, AWS Security Hub CSPM, AWS Config, and Amazon Macie. Optimize detection capabilities with machine learning like Amazon DevOps Guru to detect abnormal operating patterns issues before they occur. 
+  **Conduct root cause analysis and action lessons learned:** Implement mechanisms to capture lessons learned as part of a post-incident response review. When the root cause of an incident reveals a larger defect, design flaw, misconfiguration, or possibility of recurrence, it is classified as a problem. In such cases, analyze and resolve the problem to minimize disruption of normal operations. 

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [AWS Security Incident Response Guide](https://docs.aws.amazon.com/whitepapers/latest/aws-security-incident-response-guide/welcome.html) 
+ [ NIST: Computer Security Incident Handling Guide ](https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-61r2.pdf)

 **Related videos:** 
+  [Automating Incident Response and Forensics in AWS](https://youtu.be/f_EcwmmXkXk)
+ [ DIY guide to runbooks, incident reports, and incident response ](https://www.youtube.com/watch?v=E1NaYN_fJUo)
+ [ Prepare for and respond to security incidents in your AWS environment ](https://www.youtube.com/watch?v=8uiO0Z5meCs)

 **Related examples:** 
+  [Lab: Incident Response Playbook with Jupyter - AWS IAM](https://www.wellarchitectedlabs.com/Security/300_Incident_Response_Playbook_with_Jupyter-AWS_IAM/README.html) 
+ [ Lab: Incident Response with AWS Console and CLI ](https://wellarchitectedlabs.com/security/300_labs/300_incident_response_with_aws_console_and_cli/)

# SEC10-BP03 Prepare forensic capabilities
<a name="sec_incident_response_prepare_forensic"></a>

 It’s important for your incident responders to understand when and how the forensic investigation fits into your response plan. Your organization should define what evidence is collected and what tools are used in the process. Identify and prepare forensic investigation capabilities that are suitable, including external specialists, tools, and automation. A key decision that you should make upfront is if you will collect data from a live system. Some data, such as the contents of volatile memory or active network connections, will be lost if the system is powered off or rebooted. 

Your response team can combine tools, such as AWS Systems Manager, Amazon EventBridge, and AWS Lambda, to automatically run forensic tools within an operating system and VPC traffic mirroring to obtain a network packet capture, to gather non-persistent evidence. Conduct other activities, such as log analysis or analyzing disk images, in a dedicated security account with customized forensic workstations and tools accessible to your responders.

Routinely ship relevant logs to a data store that provides high durability and integrity. Responders should have access to those logs. AWS offers several tools that can make log investigation easier, such as Amazon Athena, Amazon OpenSearch Service (OpenSearch Service), and Amazon CloudWatch Logs Insights. Additionally, preserve evidence securely using Amazon Simple Storage Service (Amazon S3) Object Lock. This service follows the WORM (write-once- read-many) model and prevents objects from being deleted or overwritten for a defined period. As forensic investigation techniques require specialist training, you might need to engage external specialists.

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Identify forensic capabilities: Research your organization's forensic investigation capabilities, available tools, and external specialists. 
+  [Automating Incident Response and Forensics ](https://youtu.be/f_EcwmmXkXk)

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [How to automate forensic disk collection in AWS](https://aws.amazon.com/blogs/security/how-to-automate-forensic-disk-collection-in-aws/) 

# SEC10-BP04 Automate containment capability
<a name="sec_incident_response_auto_contain"></a>

Automate containment and recovery of an incident to reduce response times and organizational impact. 

Once you create and practice the processes and tools from your playbooks, you can deconstruct the logic into a code-based solution, which can be used as a tool by many responders to automate the response and remove variance or guess-work by your responders. This can speed up the lifecycle of a response. The next goal is to enable this code to be fully automated by being invoked by the alerts or events themselves, rather than by a human responder, to create an event-driven response. These processes should also automatically add relevant data to your security systems. For example, an incident involving traffic from an unwanted IP address can automatically populate an AWS WAF block list or Network Firewall rule group to prevent further activity.

![\[AWS architecture diagram showing WAF WebACL logs processing and IP address blocking flow between accounts.\]](http://docs.aws.amazon.com/wellarchitected/2022-03-31/framework/images/aws-waf-automate-block.png)


*Figure 3: AWS WAF automate blocking of known malicious IP addresses.*

With an event-driven response system, a detective mechanism triggers a responsive mechanism to automatically remediate the event. You can use event-driven response capabilities to reduce the time-to-value between detective mechanisms and responsive mechanisms. To create this event-driven architecture, you can use AWS Lambda, which is a serverless compute service that runs your code in response to events and automatically manages the underlying compute resources for you. For example, assume that you have an AWS account with the AWS CloudTrail service enabled. If CloudTrail is ever disabled (through the `cloudtrail:StopLogging` API call), you can use Amazon EventBridge to monitor for the specific `cloudtrail:StopLogging` event, and invoke a Lambda function to call `cloudtrail:StartLogging` to restart logging. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>

 Automate containment capability. 

## Resources
<a name="resources"></a>

 **Related documents:** 
+ [AWS Incident Response Guide](https://docs.aws.amazon.com/whitepapers/latest/aws-security-incident-response-guide/welcome.html) 

 **Related videos:** 
+  [Prepare for and respond to security incidents in your AWS environment](https://youtu.be/8uiO0Z5meCs) 

# SEC10-BP05 Pre-provision access
<a name="sec_incident_response_pre_provision_access"></a>

Verify that incident responders have the correct access pre-provisioned in AWS to reduce the time needed for investigation through to recovery.

 **Common anti-patterns:** 
+  Using the root account for incident response. 
+  Altering existing accounts. 
+  Manipulating IAM permissions directly when providing just-in-time privilege elevation. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>

AWS recommends reducing or eliminating reliance on long-lived credentials wherever possible, in favor of temporary credentials and *just-in-time* privilege escalation mechanisms. Long-lived credentials are prone to security risk and increase operational overhead. For most management tasks, as well as incident response tasks, we recommend you implement [identity federation](https://aws.amazon.com/identity/federation/) alongside [temporary escalation for administrative access](https://aws.amazon.com/blogs/security/managing-temporary-elevated-access-to-your-aws-environment/). In this model, a user requests elevation to a higher level of privilege (such as an incident response role) and, provided the user is eligible for elevation, a request is sent to an approver. If the request is approved, the user receives a set of temporary [AWS credentials](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html) which can be used to complete their tasks. After these credentials expire, the user must submit a new elevation request.

 We recommend the use of temporary privilege escalation in the majority of incident response scenarios. The correct way to do this is to use the [AWS Security Token Service](https://docs.aws.amazon.com/STS/latest/APIReference/welcome.html) and [session policies](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies.html#policies_session) to scope access. 

 There are scenarios where federated identities are unavailable, such as: 
+  Outage related to a compromised identity provider (IdP). 
+  Misconfiguration or human error causing broken federated access management system. 
+  Malicious activity such as a distributed denial of service (DDoS) event or rendering unavailability of the system. 

 In the preceding cases, there should be emergency *break glass* access configured to allow investigation and timely remediation of incidents. We recommend that you use a [user, group, or role with appropriate permissions](https://docs.aws.amazon.com/IAM/latest/UserGuide/best-practices.html#lock-away-credentials) to perform tasks and access AWS resources. Use the root user only for [tasks that require root user credentials](https://docs.aws.amazon.com/accounts/latest/reference/root-user-tasks.html). To verify that incident responders have the correct level of access to AWS and other relevant systems, we recommend the pre-provisioning of dedicated accounts. The accounts require privileged access, and must be tightly controlled and monitored. The accounts must be built with the fewest privileges required to perform the necessary tasks, and the level of access should be based on the playbooks created as part of the incident management plan. 

 Use purpose-built and dedicated users and roles as a best practice. Temporarily escalating user or role access through the addition of IAM policies both makes it unclear what access users had during the incident, and risks the escalated privileges not being revoked. 

 It is important to remove as many dependencies as possible to verify that access can be gained under the widest possible number of failure scenarios. To support this, create a playbook to verify that incident response users are created as users in a dedicated security account, and not managed through any existing Federation or single sign-on (SSO) solution. Each individual responder must have their own named account. The account configuration must enforce [strong password policy](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_passwords_account-policy.html) and multi-factor authentication (MFA). If the incident response playbooks only require access to the AWS Management Console, the user should not have access keys configured and should be explicitly disallowed from creating access keys. This can be configured with IAM policies or service control policies (SCPs) as mentioned in the AWS Security Best Practices for [AWS Organizations SCPs](https://docs.aws.amazon.com/organizations/latest/userguide/orgs_manage_policies_scps.html). The users should have no privileges other than the ability to assume incident response roles in other accounts. 

 During an incident it might be necessary to grant access to other internal or external individuals to support investigation, remediation, or recovery activities. In this case, use the playbook mechanism mentioned previously, and there must be a process to verify that any additional access is revoked immediately after the incident is complete. 

 To verify that the use of incident response roles can be properly monitored and audited, it is essential that the IAM accounts created for this purpose are not shared between individuals, and that the AWS account root user is not used unless [required for a specific task](https://docs.aws.amazon.com/accounts/latest/reference/root-user-tasks.html). If the root user is required (for example, IAM access to a specific account is unavailable), use a separate process with a playbook available to verify availability of the root user sign-in credentials and MFA token. 

 To configure the IAM policies for the incident response roles, consider using [IAM Access Analyzer](https://docs.aws.amazon.com/IAM/latest/UserGuide/access-analyzer-policy-generation.html) to generate policies based on AWS CloudTrail logs. To do this, grant administrator access to the incident response role on a non-production account and run through your playbooks. Once complete, a policy can be created that allows only the actions taken. This policy can then be applied to all the incident response roles across all accounts. You might wish to create a separate IAM policy for each playbook to allow easier management and auditing. Example playbooks could include response plans for ransomware, data breaches, loss of production access, and other scenarios. 

 Use the incident response accounts to assume dedicated incident response [IAM roles in other AWS accounts](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_common-scenarios_aws-accounts.html). These roles must be configured to only be assumable by users in the security account, and the trust relationship must require that the calling principal has authenticated using MFA. The roles must use tightly-scoped IAM policies to control access. Ensure that all `AssumeRole` requests for these roles are logged in CloudTrail and alerted on, and that any actions taken using these roles are logged. 

 It is strongly recommended that both the IAM accounts and the IAM roles are clearly named to allow them to be easily found in CloudTrail logs. An example of this would be to name the IAM accounts `<USER_ID>-BREAK-GLASS` and the IAM roles `BREAK-GLASS-ROLE`. 

 [CloudTrail](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-user-guide.html) is used to log API activity in your AWS accounts and should be used to [configure alerts on usage of the incident response roles](https://aws.amazon.com/blogs/security/how-to-receive-notifications-when-your-aws-accounts-root-access-keys-are-used/). Refer to the blog post on configuring alerts when root keys are used. The instructions can be modified to configure the [Amazon CloudWatch](https://aws.amazon.com/cloudwatch/) metric filter-to-filter on `AssumeRole` events related to the incident response IAM role: 

```
{ $.eventName = "AssumeRole" && $.requestParameters.roleArn = "<INCIDENT_RESPONSE_ROLE_ARN>" && $.userIdentity.invokedBy NOT EXISTS && $.eventType != "AwsServiceEvent" }
```

 As the incident response roles are likely to have a high level of access, it is important that these alerts go to a wide group and are acted upon promptly. 

 During an incident, it is possible that a responder might require access to systems which are not directly secured by IAM. These could include Amazon Elastic Compute Cloud instances, Amazon Relational Database Service databases, or software-as-a-service (SaaS) platforms. It is strongly recommended that rather than using native protocols such as SSH or RDP, [AWS Systems Manager Session Manager](https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager.html) is used for all administrative access to Amazon EC2 instances. This access can be controlled using IAM, which is secure and audited. It might also be possible to automate parts of your playbooks using [AWS Systems Manager Run Command documents](https://docs.aws.amazon.com/systems-manager/latest/userguide/execute-remote-commands.html), which can reduce user error and improve time to recovery. For access to databases and third-party tools, we recommend storing access credentials in AWS Secrets Manager and granting access to the incident responder roles. 

 Finally, the management of the incident response IAM accounts should be added to your [Joiners, Movers, and Leavers processes](https://docs.aws.amazon.com/wellarchitected/latest/security-pillar/permissions-management.html) and reviewed and tested periodically to verify that only the intended access is allowed. 

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [Managing temporary elevated access to your AWS environment](https://aws.amazon.com/blogs/security/managing-temporary-elevated-access-to-your-aws-environment/) 
+  [AWS Security Incident Response Guide ](https://docs.aws.amazon.com/whitepapers/latest/aws-security-incident-response-guide/welcome.html)
+  [AWS Elastic Disaster Recovery](https://aws.amazon.com/disaster-recovery/) 
+  [AWS Systems Manager Incident Manager](https://docs.aws.amazon.com/incident-manager/latest/userguide/what-is-incident-manager.html) 
+  [Setting an account password policy for IAM users](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_passwords_account-policy.html) 
+  [Using multi-factor authentication (MFA) in AWS](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_mfa.html) 
+ [ Configuring Cross-Account Access with MFA ](https://aws.amazon.com/blogs/security/how-do-i-protect-cross-account-access-using-mfa-2/)
+ [ Using IAM Access Analyzer to generate IAM policies ](https://aws.amazon.com/blogs/security/use-iam-access-analyzer-to-generate-iam-policies-based-on-access-activity-found-in-your-organization-trail/)
+ [ Best Practices for AWS Organizations Service Control Policies in a Multi-Account Environment ](https://aws.amazon.com/blogs/industries/best-practices-for-aws-organizations-service-control-policies-in-a-multi-account-environment/)
+ [ How to Receive Notifications When Your AWS Account’s Root Access Keys Are Used ](https://aws.amazon.com/blogs/security/how-to-receive-notifications-when-your-aws-accounts-root-access-keys-are-used/)
+ [ Create fine-grained session permissions using IAM managed policies ](https://aws.amazon.com/blogs/security/create-fine-grained-session-permissions-using-iam-managed-policies/)

 **Related videos:** 
+ [ Automating Incident Response and Forensics in AWS](https://www.youtube.com/watch?v=f_EcwmmXkXk)
+  [DIY guide to runbooks, incident reports, and incident response](https://youtu.be/E1NaYN_fJUo) 
+ [ Prepare for and respond to security incidents in your AWS environment ](https://www.youtube.com/watch?v=8uiO0Z5meCs)

 **Related examples:** 
+ [ Lab: AWS Account Setup and Root User ](https://www.wellarchitectedlabs.com/security/300_labs/300_incident_response_playbook_with_jupyter-aws_iam/)
+ [ Lab: Incident Response with AWS Console and CLI ](https://wellarchitectedlabs.com/security/300_labs/300_incident_response_with_aws_console_and_cli/)

# SEC10-BP06 Pre-deploy tools
<a name="sec_incident_response_pre_deploy_tools"></a>

 Ensure that security personnel have the right tools pre-deployed into AWS to reduce the time for investigation through to recovery. 

To automate security engineering and operations functions, you can use a comprehensive set of APIs and tools from AWS. You can fully automate identity management, network security, data protection, and monitoring capabilities and deliver them using popular software development methods that you already have in place. When you build security automation, your system can monitor, review, and initiate a response, rather than having people monitor your security position and manually react to events. An effective way to automatically provide searchable and relevant log data across AWS services to your incident responders is to enable [Amazon Detective](https://aws.amazon.com/detective/).

If your incident response teams continue to respond to alerts in the same way, they risk alert fatigue. Over time, the team can become desensitized to alerts and can either make mistakes handling ordinary situations or miss unusual alerts. Automation helps avoid alert fatigue by using functions that process the repetitive and ordinary alerts, leaving humans to handle the sensitive and unique incidents. Integrating anomaly detection systems, such as Amazon GuardDuty, AWS CloudTrail Insights, and Amazon CloudWatch Anomaly Detection, can reduce the burden of common threshold-based alerts.

You can improve manual processes by programmatically automating steps in the process. After you define the remediation pattern to an event, you can decompose that pattern into actionable logic, and write the code to perform that logic. Responders can then execute that code to remediate the issue. Over time, you can automate more and more steps, and ultimately automatically handle whole classes of common incidents.

For tools that execute within the operating system of your Amazon Elastic Compute Cloud (Amazon EC2) instance, you should evaluate using the AWS Systems Manager Run Command, which enables you to remotely and securely administrate instances using an agent that you install on your Amazon EC2 instance operating system. It requires the Systems Manager Agent (SSM Agent), which is installed by default on many Amazon Machine Images (AMIs). Be aware, though, that once an instance has been compromised, no responses from tools or agents running on it should be considered trustworthy.

 **Level of risk exposed if this best practice is not established:** Low 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Pre-deploy tools: Ensure that security personnel have the right tools pre-deployed in AWS so that an appropriate response can be made to an incident. 
  +  [Lab: Incident response with AWS Management Console and CLI ](https://wellarchitectedlabs.com/Security/300_Incident_Response_with_AWS_Console_and_CLI/README.html)
  + [ Incident Response Playbook with Jupyter - AWS IAM ](https://wellarchitectedlabs.com/Security/300_Incident_Response_Playbook_with_Jupyter-AWS_IAM/README.html)
  +  [AWS Security Automation ](https://github.com/awslabs/aws-security-automation)
+  Implement resource tagging: Tag resources with information, such as a code for the resource under investigation, so that you can identify resources during an incident. 
  + [AWS Tagging Strategies ](https://aws.amazon.com/answers/account-management/aws-tagging-strategies/)

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [AWS Incident Response Guide ](https://docs.aws.amazon.com/whitepapers/latest/aws-security-incident-response-guide/welcome.html)

 **Related videos:** 
+  [ DIY guide to runbooks, incident reports, and incident response ](https://youtu.be/E1NaYN_fJUo)

# SEC10-BP07 Run game days
<a name="sec_incident_response_run_game_days"></a>

Game days, also known as simulations or exercises, are internal events that provide a structured opportunity to practice your incident management plans and procedures during a realistic scenario. These events should exercise responders using the same tools and techniques that would be used in a real-world scenario - even mimicking real-world environments. Game days are fundamentally about being prepared and iteratively improving your response capabilities. Some of the reasons you might find value in performing game day activities include: 
+ Validating readiness
+ Developing confidence – learning from simulations and training staff
+ Following compliance or contractual obligations
+ Generating artifacts for accreditation
+ Being agile – incremental improvement
+ Becoming faster and improving tools
+ Refining communication and escalation
+ Developing comfort with the rare and the unexpected

For these reasons, the value derived from participating in a simulation activity increases an organization's effectiveness during stressful events. Developing a simulation activity that is both realistic and beneficial can be a difficult exercise. Although testing your procedures or automation that handles well-understood events has certain advantages, it is just as valuable to participate in creative [Security Incident Response Simulations (SIRS)](https://docs.aws.amazon.com/whitepapers/latest/aws-security-incident-response-guide/security-incident-response-simulations.html) activities to test yourself against the unexpected and continuously improve.

Create custom simulations tailored to your environment, team, and tools. Find an issue and design your simulation around it. This could be something like a leaked credential, a server communicating with unwanted systems, or a misconfiguration that results in unauthorized exposure. Identify engineers who are familiar with your organization to create the scenario and another group to participate. The scenario should be realistic and challenging enough to be valuable. It should include the opportunity to get hands on with logging, notifications, escalations, and executing runbooks or automation. During the simulation, your responders should exercise their technical and organizational skills, and leaders should be involved to build their incident management skills. At the end of the simulation, celebrate the efforts of the team and look for ways to iterate, repeat, and expand into further simulations.

[AWS has created Incident Response Runbook templates](https://github.com/aws-samples/aws-incident-response-playbooks) that you can use not only to prepare your response efforts, but also as a basis for a simulation. When planning, a simulation can be broken into five phases.

**Evidence gathering: **In this phase, a team will get alerts through various means, such as an internal ticketing system, alerts from monitoring tooling, anonymous tips, or even public news. Teams then start to review infrastructure and application logs to determine the source of the compromise. This step should also involve internal escalations and incident leadership. Once identified, teams move on to containing the incident

**Contain the incident: **Teams will have determined there has been an incident and established the source of the compromise. Teams now should take action to contain it, for example, by disabling compromised credentials, isolating a compute resource, or revoking a role’s permission.

**Eradicate the incident: **Now that they’ve contained the incident, teams will work towards mitigating any vulnerabilities in applications or infrastructure configurations that were susceptible to the compromise. This could include rotating all credentials used for a workload, modifying Access Control Lists (ACLs) or changing network configurations.

**Level of risk exposed if this best practice is not established:** Medium

## Implementation guidance
<a name="implementation-guidance"></a>
+  Run [game days](https://wa.aws.amazon.com/wat.concept.gameday.en.html): Run simulated [incident](https://wa.aws.amazon.com/wat.concept.incident.en.html) response [events (game days)](https://wa.aws.amazon.com/wat.concept.event.en.html) for different threats that involve key staff and management. 
+  Capture lessons learned: Lessons learned from running [game days](https://wa.aws.amazon.com/wat.concept.gameday.en.html) should be part of a feedback loop to improve your processes. 

## Resources
<a name="resources"></a>

 **Related documents:** 
+ [AWS Incident Response Guide](https://docs.aws.amazon.com/whitepapers/latest/aws-security-incident-response-guide/welcome.html) 
+ [AWS Elastic Disaster Recovery](https://aws.amazon.com/disaster-recovery/) 

 **Related videos:** 
+ [ DIY guide to runbooks, incident reports, and incident response ](https://youtu.be/E1NaYN_fJUo)