

# Tags for operations and support
<a name="tags-for-operations-and-support"></a>

 An AWS environment will have multiple accounts, resources, and workloads with differing operational requirements. Tags can give operations teams the context and guidance they need to manage your services more effectively. Tags can also make the operational governance of managed resources more transparent. 

 Some of the main factors driving consistent definition of operational tags are: 
+  **Filtering resources during automated infrastructure activities.** For example, when deploying, updating, or deleting resources. Another example is scaling resources down outside business hours to optimize cost. See the [AWS Instance Scheduler](https://aws.amazon.com/solutions/implementations/instance-scheduler/) solution for a working example. 
+  **Identifying isolated or deprecating resources.** Resources that have exceeded their defined lifespan or have been flagged for isolation by internal mechanisms should be appropriately tagged so as to assist support personnel in their investigation. Deprecating resources should be tagged before isolation, archival and deletion. 
+  **Support requirements for a group of resources.** Resources often have different support requirements. For example, these requirements could be negotiated between teams or set as part of an application's criticality. Further guidance on operating models can be found in the [Operational Excellence Pillar](https://docs.aws.amazon.com/wellarchitected/latest/operational-excellence-pillar/operating-model.html). 
+  **Enhance the incident management process.** By tagging resources with tags that offer greater transparency in the incident management process, support teams and engineers, as well as Major Incident Management (MIM) teams, can manage events more effectively. 
+  **Backups.** Tags can also be used to identify how frequently your resources need to be backed up, where the backup copies need to go, and where to restore the backups. See [Prescriptive guidance for Backup and recovery approaches on AWS](https://docs.aws.amazon.com/prescriptive-guidance/latest/backup-recovery/welcome.html). 
+  **Patching.** Patching mutable instances running in AWS is crucial both to your overarching patching strategy and to the patching of zero-day vulnerabilities. Deeper guidance on the wider patching strategy can be found in the [prescriptive guidance](https://docs.aws.amazon.com/prescriptive-guidance/latest/patch-management-hybrid-cloud/welcome.html). Patching of zero-day vulnerabilities is discussed in this [blog](https://aws.amazon.com/blogs/mt/avoid-zero-day-vulnerabilities-same-day-security-patching-aws-systems-manager/). 
+  **Operational observability.** Translating an operational KPI strategy into resource tags helps operations teams track whether targets are being met to support business requirements. Developing a KPI strategy is a separate topic, but it tends to focus either on a business operating in a steady state or on measuring the impact and outcomes of change. The [KPI Dashboards](https://wellarchitectedlabs.com/cost/200_labs/200_cloud_intelligence/cost-usage-report-dashboards/dashboards/2c_kpi_dashboard/) (AWS Well-Architected labs) and the Operations KPI Workshop (an [AWS Enterprise Support proactive service](https://aws.amazon.com/premiumsupport/technology-and-programs/proactive-services/)) both address measuring performance in a steady state. The AWS enterprise strategy blog article [Measuring the Success of Your Transformation](https://aws.amazon.com/blogs/enterprise-strategy/measuring-the-success-of-your-transformation/) explores KPI measurement for a transformation program, such as IT modernization or migrating from on premises to AWS. 

# Automated infrastructure activities
<a name="automated-infrastructure-activities"></a>

 Tags can be used in a wide range of automation activities when managing infrastructure. Use of [AWS Systems Manager](https://docs.aws.amazon.com/systems-manager/index.html), for example, will allow you to manage automations and runbooks on resources specified by the defined key-value pair you create. For managed nodes, you could define a set of tags to track or target nodes by operating system and environment. You could then run an update script for all nodes in a group or review the status of those nodes. [Systems Manager Resources](https://docs.aws.amazon.com/systems-manager/latest/userguide/taggable-resources.html) can also be tagged to further refine and track your automated activities. 
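
 As a minimal sketch of selecting managed nodes by tag, the following Python helper converts tag key-value pairs into the `tag:<key>` target filters that Systems Manager accepts. The tag keys `example-inc:ops:os` and `example-inc:ops:environment` are hypothetical and are not defined elsewhere in this paper. The boto3 call is shown only as a comment because it requires AWS credentials.

```python
from typing import Dict, List


def build_ssm_targets(tags: Dict[str, str]) -> List[Dict[str, object]]:
    """Convert tag key-value pairs into Systems Manager target filters.

    Systems Manager expresses a tag-based target as
    {"Key": "tag:<tag key>", "Values": [<tag value>, ...]}.
    """
    return [
        {"Key": f"tag:{key}", "Values": [value]}
        for key, value in sorted(tags.items())
    ]


# Hypothetical tag keys; substitute the keys defined in your tagging library.
targets = build_ssm_targets({
    "example-inc:ops:os": "amazon-linux-2",
    "example-inc:ops:environment": "dev",
})
print(targets)

# The targets could then be passed to a boto3 call such as:
#   boto3.client("ssm").send_command(
#       Targets=targets,
#       DocumentName="AWS-RunShellScript",
#       Parameters={"commands": ["sudo yum update -y"]},
#   )
```

 Keeping the target construction in one helper means every automation selects nodes with the same tag keys, which makes drift from the tagging library easier to spot.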

 Automating the start and stop lifecycle of environment resources can significantly reduce costs. [Instance Scheduler on AWS](https://aws.amazon.com/solutions/implementations/instance-scheduler/) is an example of a solution that can start and stop Amazon EC2 and Amazon RDS instances when they are not required. For example, developer environments that use Amazon EC2 or Amazon RDS instances and do not need to run on weekends miss out on cost savings if those instances are left running. By analyzing the needs of teams and their environments, and properly tagging these resources to automate their management, you can use your budget effectively. 

 *An example schedule tag used by Instance Scheduler on an Amazon EC2 instance:* 

```
{
    "Tags": [
        {
            "Key": "Schedule",
            "ResourceId": "i-1234567890abcdef8",
            "ResourceType": "instance",
            "Value": "mon-9am-fri-5pm"
        }
    ]
}
```
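
 To show how an automation might consume tags in this shape, the following sketch groups instance IDs by the value of their `Schedule` tag. It is an illustration only and does not reflect how Instance Scheduler is implemented. The second tag entry (`Name`) is a hypothetical addition used to show that unrelated tags are ignored.

```python
from collections import defaultdict
from typing import Dict, List

# Same shape as the example above. The "Name" entry is hypothetical.
describe_tags_response = {
    "Tags": [
        {"Key": "Schedule", "ResourceId": "i-1234567890abcdef8",
         "ResourceType": "instance", "Value": "mon-9am-fri-5pm"},
        {"Key": "Name", "ResourceId": "i-1234567890abcdef8",
         "ResourceType": "instance", "Value": "WEBAPP"},
    ]
}


def instances_by_schedule(response: dict, tag_key: str = "Schedule") -> Dict[str, List[str]]:
    """Group EC2 instance IDs by the value of their schedule tag."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for tag in response.get("Tags", []):
        if tag["Key"] == tag_key and tag["ResourceType"] == "instance":
            grouped[tag["Value"]].append(tag["ResourceId"])
    return dict(grouped)


print(instances_by_schedule(describe_tags_response))
```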

# Workload lifecycle
<a name="workload-lifecycle"></a>

**Review accuracy of supporting operational data.** Make sure that there are periodic reviews of the tags associated with your workload lifecycle, and that the appropriate stakeholders are involved in these reviews. 

 *Table 7 – Review operational tags as part of the workload lifecycle* 


|  Use Case  |  Tag Key  |  Rationale  |  Example Values  | 
| --- | --- | --- | --- | 
|  Account Owner  | example-inc:account-owner:owner  |  The owner of the account and its contained resources.  | ops-center, dev-ops, app-team  | 
|  Account Owner Review  | example-inc:account-owner:review  |  Review of account ownership details being up to date and correct.  | <review date in the correct format defined in your tagging library>  | 
|  Data Owner  | example-inc:data-owner:owner  |  The owner of the data that resides in the account.  | bi-team, logistics, security  | 
|  Data Owner Review  | example-inc:data-owner:review  |  Review of data ownership details being up to date and correct.  | <review date in the correct format defined in your tagging library>  | 
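
 A periodic review can be supported by a small check that flags review tags that are missing or out of date. The following sketch assumes that review dates are stored in ISO 8601 format (`YYYY-MM-DD`) and that reviews are due every 90 days. Both are assumptions for illustration; use the date format and interval defined in your own tagging library.

```python
from datetime import date, timedelta
from typing import Dict, List

REVIEW_TAG_KEYS = (
    "example-inc:account-owner:review",
    "example-inc:data-owner:review",
)


def overdue_reviews(tags: Dict[str, str], today: date, max_age_days: int = 90) -> List[str]:
    """Return review tag keys that are missing, unparseable, or older than max_age_days."""
    overdue = []
    for key in REVIEW_TAG_KEYS:
        try:
            reviewed_on = date.fromisoformat(tags[key])
        except (KeyError, ValueError):
            # A missing or malformed review date is treated as overdue.
            overdue.append(key)
            continue
        if today - reviewed_on > timedelta(days=max_age_days):
            overdue.append(key)
    return overdue


account_tags = {
    "example-inc:account-owner:owner": "ops-center",
    "example-inc:account-owner:review": "2024-01-15",
    "example-inc:data-owner:owner": "bi-team",
    "example-inc:data-owner:review": "2023-06-30",
}
print(overdue_reviews(account_tags, today=date(2024, 3, 1)))
```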

## Assigning tags to suspending accounts before migrating to the suspended OU
<a name="assigning-tags-to-suspending-accounts"></a>

 Before suspending an account and moving it into the suspended OU, as detailed in the [Organizing Your AWS Environment Using Multiple Accounts](https://docs.aws.amazon.com/whitepapers/latest/organizing-your-aws-environment/organizing-your-aws-environment.html) whitepaper, add tags to the account to aid your internal tracing and auditing of the account’s lifecycle. For example, a tag could hold a relative URL or ticket reference in the organization’s ITSM ticketing system that shows the audit trail for the application being suspended. 

 *Table 8 - Add operational tags when workload lifecycle enters new stage* 


|  Use Case  |  Tag Key  |  Rationale  |  Example Values  | 
| --- | --- | --- | --- | 
|  Account Owner  | example-inc:account-owner:owner  |  The owner of the account and its contained resources.  | ops-center, dev-ops, app-team  | 
|  Data Owner  | example-inc:data-owner:owner  |  The owner of the data that resides in the account.  | bi-team, logistics, security  | 
|  Suspended Date  | example-inc:suspension:date  |  The date that the account was suspended  |  <suspended date in the correct format defined in your tagging library>  | 
|  Approval for suspension  | example-inc:suspension:approval  |  The link to the approval of account suspension  | workload/deprecation  | 
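
 The suspension tags in Table 8 could be assembled as follows before the account is moved. This is a sketch under stated assumptions: the date is written in ISO 8601 format, and the account ID in the commented boto3 call is a placeholder. The list-of-`Key`/`Value` shape matches what the AWS Organizations `TagResource` operation expects.

```python
from datetime import date
from typing import Dict, List


def suspension_tags(suspended_on: date, approval_ref: str) -> List[Dict[str, str]]:
    """Build the Table 8 suspension tags as a list of Key/Value pairs."""
    return [
        # ISO 8601 is an assumption; use the format from your tagging library.
        {"Key": "example-inc:suspension:date", "Value": suspended_on.isoformat()},
        {"Key": "example-inc:suspension:approval", "Value": approval_ref},
    ]


tags = suspension_tags(date(2024, 3, 1), "workload/deprecation")
print(tags)

# With boto3, the tags could be attached to the account before it is moved:
#   boto3.client("organizations").tag_resource(
#       ResourceId="111122223333",  # placeholder account ID
#       Tags=tags,
#   )
```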

# Incident management
<a name="incident-management"></a>

 Tags can play a vital part in all phases of incident management, from incident logging, prioritization, investigation, communication, and resolution through to closure. 

 Tags can detail where an incident should be logged, the team or teams that should be informed of the incident, and the defined escalation priority. It’s important to remember that tags are not encrypted, so consider what information you store in them. Also, teams, reporting lines, and responsibilities change within organizations, so consider storing a link to a secure portal where this information can be managed more effectively. These tags don’t need to be exclusive. For instance, the application ID could be used to look up the escalation paths in an IT service management portal. Make sure it is clear in your operational definitions that this tag is being used for multiple purposes. 

 Operational requirement tags can be detailed as well, to help incident managers and operations personnel further refine their objectives in response to an incident or event. 

 Relative links (to the knowledge system base URL) for [runbooks](https://wa.aws.amazon.com/wellarchitected/2020-07-02T19-33-23/wat.concept.runbook.en.html) and [playbooks](https://wa.aws.amazon.com/wellarchitected/2020-07-02T19-33-23/wat.concept.playbook.en.html) can be included as tags to assist the responding teams in identifying the corresponding processes, procedures, and documentation. 

 *Table 9 - Use operational tags to inform incident management* 


|  Use Case  |  Tag Key  |  Rationale  |  Example Values  | 
| --- | --- | --- | --- | 
|  Incident Management  | example-inc:incident-management:escalationlog  |  The system in use by the supporting team to log incidents  | jira, servicenow, zendesk  | 
|  Incident Management  | example-inc:incident-management:escalationpath  |  Path of escalation  | ops-center, dev-ops, app-team  | 
|  Cost Allocation and Incident Management  | example-inc:cost-allocation:CostCenter |  Monitor costs by cost center. This is an example of a dual use tag where the cost center is being used as an application code for incident logging  | 123-\$1  | 
|  Backup Schedule  | example-inc:backup:schedule  |  Backup schedule of the resource  | Daily  | 
|  Playbook / Incident Management  | example-inc:incident-management:playbook  |  Documented playbook  | webapp/incident/playbook  | 
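
 As an illustration of how the tags in Table 9 might be consumed, the following sketch collects the incident-routing details from a resource's tags and resolves the relative playbook link against a knowledge system base URL. The base URL `https://kb.example.com/` and the output field names are hypothetical.

```python
from typing import Dict, Optional

KNOWLEDGE_BASE_URL = "https://kb.example.com/"  # hypothetical base URL
TAG_PREFIX = "example-inc:incident-management:"


def incident_context(tags: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Collect the incident-routing details carried by a resource's tags."""
    playbook = tags.get(TAG_PREFIX + "playbook")
    return {
        "log_in": tags.get(TAG_PREFIX + "escalationlog"),
        "escalate_to": tags.get(TAG_PREFIX + "escalationpath"),
        # The tag stores a relative link; resolve it against the knowledge base.
        "playbook_url": (KNOWLEDGE_BASE_URL + playbook.lstrip("/")) if playbook else None,
    }


resource_tags = {
    TAG_PREFIX + "escalationlog": "jira",
    TAG_PREFIX + "escalationpath": "ops-center",
    TAG_PREFIX + "playbook": "webapp/incident/playbook",
}
print(incident_context(resource_tags))
```

 Storing only the relative path in the tag keeps the tag value short and lets the knowledge system move to a new host without retagging resources.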

# Patching
<a name="patching"></a>

 Organizations can automate their patching strategy for mutable compute environments, and keep mutable instances in line with the defined patch baseline of each application environment, by using AWS Systems Manager Patch Manager and AWS Lambda. A tagging strategy for mutable instances within these environments can be managed by assigning the instances to **Patch Groups** and **Maintenance Windows**. See the following examples for a Dev → Test → Prod split. AWS prescriptive guidance is available for the [patch management of mutable instances](https://docs.aws.amazon.com/prescriptive-guidance/latest/patch-management-hybrid-cloud/welcome.html). 

 *Table 10 - Operational tags can be environment specific* 


|  Development  |  Test  |  Production  | 
| --- | --- | --- | 
|  <pre>{<br />"Tags": [<br />{<br />"Key": "Maintenance Window",<br />"ResourceId": "i-012345678ab9ab111",<br />"ResourceType": "instance",<br />"Value": "cron(30 23 ? * TUE#1 *)"<br />},<br />{<br />"Key": "Name",<br />"ResourceId": "i-012345678ab9ab222",<br />"ResourceType": "instance",<br />"Value": "WEBAPP"<br />},<br />{<br />"Key": "Patch Group",<br />"ResourceId": "i-012345678ab9ab333",<br />"ResourceType": "instance",<br />"Value": "WEBAPP-DEV-AL2"<br />}<br />]<br />}<br /></pre>  |  <pre>{<br />"Tags": [<br />{<br />"Key": "Maintenance Window",<br />"ResourceId": "i-012345678ab9ab444",<br />"ResourceType": "instance",<br />"Value": "cron(30 23 ? * TUE#2 *)"<br />},<br />{<br />"Key": "Name",<br />"ResourceId": "i-012345678ab9ab555",<br />"ResourceType": "instance",<br />"Value": "WEBAPP"<br />},<br />{<br />"Key": "Patch Group",<br />"ResourceId": "i-012345678ab9ab666",<br />"ResourceType": "instance",<br />"Value": "WEBAPP-TEST-AL2"<br />}<br />]<br />}<br /></pre>  |  <pre>{<br />"Tags": [<br />{<br />"Key": "Maintenance Window",<br />"ResourceId": "i-012345678ab9ab777",<br />"ResourceType": "instance",<br />"Value": "cron(30 23 ? * TUE#3 *)"<br />},<br />{<br />"Key": "Name",<br />"ResourceId": "i-012345678ab9ab888",<br />"ResourceType": "instance",<br />"Value": "WEBAPP"<br />},<br />{<br />"Key": "Patch Group",<br />"ResourceId": "i-012345678ab9ab999",<br />"ResourceType": "instance",<br />"Value": "WEBAPP-PROD-AL2"<br />}<br />]<br />}<br /></pre>  | 

 Zero-day vulnerabilities can also be managed by having tags defined to complement your patching strategy. Refer to [Avoid zero-day vulnerabilities with same-day security patching using AWS Systems Manager](https://aws.amazon.com/blogs/mt/avoid-zero-day-vulnerabilities-same-day-security-patching-aws-systems-manager/) for detailed guidance. 
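
 The environment-specific tags in Table 10 follow a regular pattern, so they can be generated rather than typed by hand. The following sketch derives the `Name`, `Patch Group`, and `Maintenance Window` tag values from an application name, environment, and operating system code. The function name and the `PATCH_WEEK` mapping are hypothetical; the mapping mirrors the first, second, and third Tuesday used in the table.

```python
from typing import Dict

# Tuesday of the month on which each environment is patched, following Table 10.
PATCH_WEEK = {"DEV": 1, "TEST": 2, "PROD": 3}


def patch_tags(app: str, environment: str, os_code: str) -> Dict[str, str]:
    """Build the patching tags for one instance in the given environment."""
    env = environment.upper()
    return {
        "Name": app.upper(),
        "Patch Group": f"{app.upper()}-{env}-{os_code.upper()}",
        # 23:30 on the Nth Tuesday of every month.
        "Maintenance Window": f"cron(30 23 ? * TUE#{PATCH_WEEK[env]} *)",
    }


for env in PATCH_WEEK:
    print(patch_tags("webapp", env, "al2"))
```

 Staggering the maintenance windows in this way lets a patch reach development first and production last, which gives each stage time to surface problems before the next one is patched.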

# Operational observability
<a name="operational-observability"></a>

 Observability is required to gain actionable insights into the performance of your environments and help you to detect and investigate problems. It also has a secondary purpose that allows you to define and measure key performance indicators (KPIs) and service level objectives (SLOs) such as uptime. For most organizations, important operations KPIs are mean time to detect (MTTD) and mean time to recover (MTTR) from an incident. 

Context is important throughout observability: as data is collected, the associated tags are gathered with it. Whichever service, application, or application tier you are focusing on, you can then filter and analyze that specific dataset. Tags can be used to automate onboarding to CloudWatch alarms so that the right teams are alerted when metric thresholds are breached. For example, the presence and value of a tag key such as `example-inc:ops:alarm-tag` could indicate that a CloudWatch alarm should be created for the resource. A solution demonstrating this is described in [Use tags to create and maintain Amazon CloudWatch alarms for Amazon EC2 instances](https://aws.amazon.com/blogs/mt/use-tags-to-create-and-maintain-amazon-cloudwatch-alarms-for-amazon-ec2-instances-part-1/).
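
 As a minimal sketch of tag-driven alarm onboarding, the following helper translates the value of the `example-inc:ops:alarm-tag` tag into keyword arguments for the CloudWatch `PutMetricAlarm` operation. The tag value format `<metric>:<comparison operator>:<threshold>` is an assumption made for this example and is not the format used by the linked solution. The statistic, period, and evaluation period values are also illustrative defaults.

```python
from typing import Any, Dict, Optional

ALARM_TAG_KEY = "example-inc:ops:alarm-tag"


def alarm_request(instance_id: str, tags: Dict[str, str]) -> Optional[Dict[str, Any]]:
    """Translate the alarm tag into keyword arguments for PutMetricAlarm.

    Returns None when the instance does not carry the alarm tag.
    """
    value = tags.get(ALARM_TAG_KEY)
    if value is None:
        return None
    # Assumed value format: "<metric>:<comparison operator>:<threshold>".
    metric, operator, threshold = value.split(":")
    return {
        "AlarmName": f"{instance_id}-{metric}",
        "Namespace": "AWS/EC2",
        "MetricName": metric,
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,
        "EvaluationPeriods": 1,
        "ComparisonOperator": operator,
        "Threshold": float(threshold),
    }


request = alarm_request(
    "i-1234567890abcdef8",
    {ALARM_TAG_KEY: "CPUUtilization:GreaterThanThreshold:80"},
)
print(request)

# With boto3, the request could be submitted as:
#   boto3.client("cloudwatch").put_metric_alarm(**request)
```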

 Having too many alarms configured can easily create an alert storm, in which a large number of alarms or notifications rapidly overwhelms operators and reduces their effectiveness while they manually triage and prioritize individual alarms. Tags can provide additional context for the alarms, so that rules can be defined within Amazon EventBridge to help ensure that focus is given to the upstream issue rather than to downstream dependencies. 
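
 The filtering decision behind such a rule can be sketched in plain Python. The example below is not an EventBridge rule. It only illustrates the logic, assuming two hypothetical tag keys: `example-inc:ops:workload` names the workload an alarm belongs to, and `example-inc:ops:depends-on` names the workload it depends on. An alarm is kept only if its declared dependency is not also alarming.

```python
from typing import Dict, List

WORKLOAD_KEY = "example-inc:ops:workload"      # hypothetical tag key
DEPENDS_ON_KEY = "example-inc:ops:depends-on"  # hypothetical tag key


def upstream_alarms(firing: List[Dict[str, str]]) -> List[str]:
    """Return alarming workloads whose declared dependency is not also alarming."""
    alarming = {tags[WORKLOAD_KEY] for tags in firing}
    return sorted(
        tags[WORKLOAD_KEY]
        for tags in firing
        if tags.get(DEPENDS_ON_KEY) not in alarming
    )


# Tags attached to three alarms that are firing at the same time.
firing_alarm_tags = [
    {WORKLOAD_KEY: "database"},
    {WORKLOAD_KEY: "api", DEPENDS_ON_KEY: "database"},
    {WORKLOAD_KEY: "webapp", DEPENDS_ON_KEY: "api"},
]
print(upstream_alarms(firing_alarm_tags))
```

 In this example only the database alarm is surfaced, because the other two workloads depend, directly or indirectly, on a workload that is already alarming.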

 The role of operations alongside DevOps is often overlooked, but for many organizations, central operations teams still provide a critical first response outside of normal business hours. (More details about this model can be found in the [Operational Excellence whitepaper](https://docs.aws.amazon.com/wellarchitected/latest/operational-excellence-pillar/separated-aeo-ieo-with-cent-gov-and-partner.html).) These teams typically do not have the same depth of knowledge as the DevOps team that owns the workload, so the context that tags provide within dashboards and alerts can direct them to the correct runbook for the issue, or initiate an automated runbook (refer to the blog post [Automating Amazon CloudWatch Alarms with AWS Systems Manager](https://aws.amazon.com/blogs/mt/automating-amazon-cloudwatch-alarms-with-aws-systems-manager/)). 