

# Traffic monitoring and troubleshooting
<a name="traffic-monitoring"></a>

This section provides traffic monitoring and troubleshooting instructions for deploying and using the solution. If this information doesn’t help address your issue, [Contact Support](contact-aws-support.md) provides instructions for opening an AWS Support case for this solution.

## Amazon CloudWatch alarms
<a name="amazon-cloudwatch-alarms"></a>

Amazon CloudWatch alarms monitor specific metrics in real time and proactively notify AWS Management Console users when predefined conditions are met. This solution has several CloudWatch alarms to help monitor its health and performance. This section lists each of the solution’s alarms, with details on which metric it tracks and what can invoke it.

These alarms are enabled automatically when the AWS CDK stack is deployed. There are no further actions required to review the alarms.
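
One way to review the alarms outside the console is to read their current state through the CloudWatch `DescribeAlarms` API. The following is a minimal sketch, assuming `boto3` is installed and credentials are configured. The alarm names and the name prefix are hypothetical placeholders, not the names this solution creates.

```python
from typing import Any, Dict, List


def summarize_alarms(response: Dict[str, Any]) -> List[str]:
    """Return 'name: state' lines for the metric alarms in a DescribeAlarms response."""
    return [
        f"{alarm['AlarmName']}: {alarm['StateValue']}"
        for alarm in response.get("MetricAlarms", [])
    ]


def fetch_alarms(name_prefix: str) -> Dict[str, Any]:
    """Call DescribeAlarms for alarms whose names start with name_prefix.

    boto3 is imported lazily so summarize_alarms can be used without it.
    Only the first page of results is returned; paginate for larger sets.
    """
    import boto3

    cloudwatch = boto3.client("cloudwatch")
    return cloudwatch.describe_alarms(AlarmNamePrefix=name_prefix)


# Offline demonstration with a hand-written response of the same shape.
# The alarm names below are made up for illustration.
sample = {
    "MetricAlarms": [
        {"AlarmName": "example-alb-5xx", "StateValue": "OK"},
        {"AlarmName": "example-nat-port-errors", "StateValue": "ALARM"},
    ]
}
print(summarize_alarms(sample))
```

Against a real account, `summarize_alarms(fetch_alarms("<your-stack-prefix>"))` would print the same kind of list for the deployed alarms.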

**Note**  
There are no subscriptions to alarm notifications by default. Add your team’s email alias, paging address, or connection to an operational dashboard to be notified when an alarm changes state.
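
As an illustration of the note above, one possible approach (not necessarily the one your team should adopt) is to route CloudWatch alarm state-change events to an Amazon SNS topic through an Amazon EventBridge rule, and subscribe an email address to that topic. The sketch below assumes `boto3` is installed. The topic name, rule name, and email address are hypothetical, and the rule as written matches every alarm in the account and Region, not only this solution's alarms.

```python
import json
from typing import Any, Dict


def alarm_state_change_pattern(state: str = "ALARM") -> Dict[str, Any]:
    """Build an EventBridge pattern matching alarms that enter the given state."""
    return {
        "source": ["aws.cloudwatch"],
        "detail-type": ["CloudWatch Alarm State Change"],
        "detail": {"state": {"value": [state]}},
    }


def subscribe_email(topic_name: str, email: str, rule_name: str) -> str:
    """Create a topic, subscribe an email address, and route alarm events to it.

    Assumptions: all three arguments are names you choose. The topic's access
    policy must also allow events.amazonaws.com to publish, which this sketch
    does not configure. The recipient must confirm the subscription email.
    """
    import boto3  # imported lazily; requires AWS credentials

    sns = boto3.client("sns")
    events = boto3.client("events")

    topic_arn = sns.create_topic(Name=topic_name)["TopicArn"]
    sns.subscribe(TopicArn=topic_arn, Protocol="email", Endpoint=email)

    events.put_rule(
        Name=rule_name,
        EventPattern=json.dumps(alarm_state_change_pattern()),
        State="ENABLED",
    )
    events.put_targets(Rule=rule_name, Targets=[{"Id": "alarm-topic", "Arn": topic_arn}])
    return topic_arn


# The pattern can be inspected without calling AWS.
print(json.dumps(alarm_state_change_pattern(), indent=2))
```

Adding the topic as an alarm action on each alarm is an alternative; the EventBridge route is shown here because it leaves the deployed alarm definitions untouched.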

The following diagram shows the conceptual relationship between cloud resources created by this solution and pre-configured CloudWatch monitoring alarms.

 **Diagram showing overview of resources and their related CloudWatch alarms** 

![\[aws solution for prebid server cloudwatch alarms\]](http://docs.aws.amazon.com/solutions/latest/prebid-server-deployment-on-aws/images/aws-solution-for-prebid-server-cloudwatch-alarms.png)


Network traffic flow is monitored by ALB, CloudFront, NAT gateway, and AWS WAF alarms. ECS alarms focus on problems related to creating new instances. EFS alarms monitor throughput problems. Glue alarms change state on failures of the periodic AWS Glue job. The customer is responsible for subscribing a notification mechanism, such as email or text message, to these alarms.

# AWS WAF
<a name="aws-waf-2"></a>

## Blocked requests
<a name="blocked-requests"></a>
+ The alarm changes state if there is a large number of blocked requests (greater than 75% of requests are blocked) within 1 minute.
+ This alarm indicates that there is something wrong with the requests passing through the WAF or there could be malicious requests in the traffic.
+ The alarm returns to the `OK` state if the data is within the acceptable threshold for 5 minutes.
+ Metric: `BlockedRequests` > 75%
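
The 75% threshold above is a ratio rather than a raw count. As a rough sketch of the arithmetic, assuming the percentage is computed as blocked requests divided by the sum of blocked and allowed requests in the same minute (the solution's exact alarm expression is not shown in this guide):

```python
def blocked_percentage(blocked: int, allowed: int) -> float:
    """Percentage of requests that were blocked; 0.0 when there was no traffic."""
    total = blocked + allowed
    if total == 0:
        return 0.0
    return 100.0 * blocked / total


def breaches_threshold(blocked: int, allowed: int, threshold_percent: float = 75.0) -> bool:
    """True when the blocked share is strictly greater than the threshold."""
    return blocked_percentage(blocked, allowed) > threshold_percent


# 80 of 100 requests blocked breaches the threshold; exactly 75 of 100 does not.
print(breaches_threshold(80, 20))
print(breaches_threshold(75, 25))
```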

## HTTP flood detected
<a name="http-flood-detected"></a>
+ The alarm changes state if there is an HTTP flood attack detected within a 1-minute period.
+ The alarm returns to the `OK` state if the data is within the acceptable threshold for 5 minutes.
+ If detailed WAF logging is enabled, it will log the HTTP flood requests in the chosen destination. A datapoint will be logged in the CloudWatch metrics for the rule.
+ Metric: `HttpFloodDetected` > 0

## Allowed requests
<a name="allowed-requests"></a>
+ The alarm changes state if there is an anomaly in traffic with a high number of allowed requests within 1 minute.
+ This alarm indicates a spike or burst in traffic.
+ The alarm returns to the `OK` state if the data is within the acceptable threshold for 5 minutes.
+ The alarm is an anomaly alarm and will form the threshold based on the previous history of the metric.
+ Metric: `AllowedRequests` anomaly
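
For readers unfamiliar with anomaly alarms, the sketch below shows the general shape of a CloudWatch anomaly detection alarm request as accepted by `put_metric_alarm`. It is an illustration under stated assumptions, not the definition this solution deploys: the alarm name, the dimension values, the 60-second `Sum` statistic, and the band width of 2 are all hypothetical.

```python
from typing import Any, Dict, List


def anomaly_alarm_params(
    alarm_name: str,
    namespace: str,
    metric_name: str,
    dimensions: List[Dict[str, str]],
    band_width: int = 2,
) -> Dict[str, Any]:
    """Build put_metric_alarm arguments for an upper-band anomaly alarm."""
    return {
        "AlarmName": alarm_name,
        # Fire only when the metric rises above the expected band.
        "ComparisonOperator": "GreaterThanUpperThreshold",
        "EvaluationPeriods": 1,
        # The threshold is the band expression below, not a fixed number.
        "ThresholdMetricId": "band",
        "Metrics": [
            {
                "Id": "m1",
                "ReturnData": True,
                "MetricStat": {
                    "Metric": {
                        "Namespace": namespace,
                        "MetricName": metric_name,
                        "Dimensions": dimensions,
                    },
                    "Period": 60,
                    "Stat": "Sum",
                },
            },
            {
                "Id": "band",
                "Expression": f"ANOMALY_DETECTION_BAND(m1, {band_width})",
                "Label": f"{metric_name} (expected)",
                "ReturnData": True,
            },
        ],
    }


# Placeholder dimension values; substitute those of your own web ACL.
params = anomaly_alarm_params(
    alarm_name="example-waf-allowed-requests-anomaly",
    namespace="AWS/WAFV2",
    metric_name="AllowedRequests",
    dimensions=[{"Name": "WebACL", "Value": "example-web-acl"}],
)
print(params["Metrics"][1]["Expression"])
```

Passing the result as `boto3.client("cloudwatch").put_metric_alarm(**params)` would create such an alarm. A wider band makes the alarm less sensitive to ordinary traffic variation.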

# CloudFront
<a name="cloudfront"></a>

## Alarm: 5xx error rate
<a name="alarm-5xx-error-rate"></a>
+ The alarm changes state if there are any 500 type status codes. Reported as a percentage of total requests within a 1-minute period.
+ This indicates a server failure. Check the CloudWatch logs to get further detail on the cause of the error.
+ The alarm returns to the `OK` state if the error rate is within the acceptable threshold for 5 minutes.
+ Metric: `5xxErrorRate` > 0%

## Alarm: 4xx error rate
<a name="alarm-4xx-error-rate"></a>
+ The alarm changes state if greater than or equal to 1% of requests are 400 type status codes. Reported as a percentage of total requests within a 1-minute period.
+ This indicates a bad request or a possible configuration error. Check the CloudWatch logs to get further detail on the cause of the error.
+ The alarm returns to the `OK` state if the error rate is within the acceptable threshold for 5 minutes.
+ Metric: `4xxErrorRate` > 1%

## Alarm: Requests
<a name="alarm-requests"></a>
+ The alarm changes state if there is an anomaly in traffic with a high number of requests within 1 minute.
+ This indicates a spike or burst in traffic.
+ The alarm returns to the `OK` state if the data is within the acceptable threshold for 5 minutes.
+ The alarm is an anomaly alarm and will form the threshold based on the previous history of the metric.
+ Metric: `Requests` anomaly

# Application Load Balancer (ALB)
<a name="application-load-balancer-alb-2"></a>

## Target HTTP 4xx error rate
<a name="target-http-4xx-error-rate"></a>
+ The alarm changes state if there are `400` type status codes originating from the target (ECS). Reported as a percentage.
+ This indicates a bad request or a possible configuration error. Check the CloudWatch logs to get further detail on the cause of the error.
+ The alarm returns to the `OK` state if the error rate is within the acceptable threshold for 5 minutes.
+ Metric: `HTTPCode_Target_4xxErrorRate` > 1%

## Target HTTP 5xx error rate
<a name="target-http-5xx-error-rate"></a>
+ The alarm changes state if there are `500` type status codes originating from the target (ECS). Reported as a percentage.
+ This indicates a server failure. Check the CloudWatch logs to get further detail on the cause of the error.
+ The alarm returns to the `OK` state if the error rate is within the acceptable threshold for 5 minutes.
+ Metric: `HTTPCode_Target_5xxErrorRate` > 0%

## ALB HTTP 4xx error rate
<a name="alb-http-4xx-error-rate"></a>
+ The alarm changes state if there are `400` type status codes originating from ALB. Reported as a percentage.
+ This indicates a bad request or a possible configuration error. Check the CloudWatch logs to get further detail on the cause of the error.
+ The alarm returns to the `OK` state if the error rate is within the acceptable threshold for 5 minutes.
+ Metric: `HTTPCode_ELB_4xxErrorRate` > 1%

## ALB HTTP 5xx error rate
<a name="alb-http-5xx-error-rate"></a>
+ The alarm changes state if there are `500` type status codes originating from the ALB. Reported as a percentage.
+ This indicates a server failure. Check the CloudWatch logs to get further detail on the cause of the error.
+ The alarm returns to the `OK` state if the data is within the acceptable threshold for 5 minutes.
+ Metric: `HTTPCode_ELB_5xxErrorRate` > 0%

## Target response time (Latency)
<a name="target-reponse-time-latency"></a>
+ The alarm changes state if high latency (greater than 100 ms on average) is reported within a 1-minute period.
+ This could indicate a performance issue or scaling failure from ECS. Check the CloudWatch logs to get further detail on the cause of the error.
+ The alarm returns to the `OK` state if the data is within the acceptable threshold for 5 minutes.
+ Metric: `TargetResponseTime average` > 100 ms
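
When comparing raw datapoints against this threshold, note that CloudWatch reports the ALB `TargetResponseTime` metric in seconds, so 100 ms corresponds to a value of 0.1. A small sketch of that conversion, using made-up sample values:

```python
from typing import Iterable, List

THRESHOLD_MS = 100.0  # threshold documented for this alarm


def seconds_to_ms(seconds: float) -> float:
    """Convert a TargetResponseTime datapoint from seconds to milliseconds."""
    return seconds * 1000.0


def slow_datapoints(averages_seconds: Iterable[float], threshold_ms: float = THRESHOLD_MS) -> List[float]:
    """Return the datapoints, in milliseconds, that exceed the threshold."""
    return [
        seconds_to_ms(value)
        for value in averages_seconds
        if seconds_to_ms(value) > threshold_ms
    ]


# Hypothetical 1-minute averages as CloudWatch would report them (seconds).
print(slow_datapoints([0.0625, 0.25, 0.5]))
```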

## Unhealthy host count
<a name="unhealthy-host-count"></a>
+ The alarm changes state if there is a target that is considered unhealthy within a 1-minute period.
+ The alarm returns to the `OK` state if the data is within the acceptable threshold for 5 minutes.
+ Check the CloudWatch logs to get further detail on the cause of the error.
+ Metric: `UnHealthyHostCount` > 0

# NAT gateway
<a name="nat-gateway"></a>

## Port allocation errors
<a name="port-allocation-errors"></a>
+ The alarm changes state if there is a port allocation error in the NAT gateway.
+ The alarm returns to the `OK` state if the data is within the acceptable threshold for 5 minutes.
+ This can mean that too many concurrent connections are open through the NAT gateway and it caused a port allocation error.
+ Metric: `ErrorPortAllocation` > 0

## Packets dropped count
<a name="packets-dropped-count"></a>
+ The alarm changes state if a value greater than 0.01% is reached within a 1-minute period.
+ This might indicate an ongoing transient issue with the NAT gateway.
+ The alarm returns to the `OK` state if the data is within the acceptable threshold for 5 minutes.
+ If this value exceeds 0.01 percent of the total traffic on the NAT gateway, check the AWS Service Health dashboard.
+ Metric: `PacketsDropCount` > 0.01%
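
Because 0.01% is a very small share, it can help to work the number through. The sketch below assumes the caller supplies the total packet count for the same period. How "total traffic" is derived from the NAT gateway's packet metrics is an assumption left to the reader, and the sample counts are invented.

```python
def drop_percentage(dropped_packets: int, total_packets: int) -> float:
    """Dropped packets as a percentage of total packets; 0.0 when idle."""
    if total_packets <= 0:
        return 0.0
    return 100.0 * dropped_packets / total_packets


def exceeds_drop_threshold(dropped_packets: int, total_packets: int, threshold_percent: float = 0.01) -> bool:
    """True when the drop share is strictly greater than the threshold."""
    return drop_percentage(dropped_packets, total_packets) > threshold_percent


# 150 drops in 1,000,000 packets is 0.015% of traffic; 50 drops is 0.005%.
print(exceeds_drop_threshold(150, 1_000_000))
print(exceeds_drop_threshold(50, 1_000_000))
```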

# Elastic Container Service (ECS)
<a name="elastic-container-service-ecs"></a>

## CPU and memory utilization
<a name="cpu-and-memory-utilization"></a>
+ The alarm changes state if the container CPU utilization or memory utilization exceeds 70% within 1 minute.
+ The solution’s scaling policies target 50% utilization. If these alarms change state, it means the solution’s Auto Scaling is not working.
+ You might need to check if Auto Scaling is turned on or adjust the Auto Scaling settings.
+ The alarm returns to the `OK` state if the CPU utilization and memory utilization are within the acceptable threshold for 5 minutes.
+ Metric: `CPUUtilization` > 70%, `MemoryUtilization` > 70%
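
To check the Auto Scaling settings mentioned above, one option is to read the target tracking policies attached to the ECS service through the Application Auto Scaling API. This is a minimal sketch, assuming `boto3` is installed; the cluster and service names are placeholders, and the sample response is hand-written to mirror the API's shape.

```python
from typing import Any, Dict


def target_tracking_summary(policies_response: Dict[str, Any]) -> Dict[str, float]:
    """Map each predefined metric type to its configured target value."""
    summary: Dict[str, float] = {}
    for policy in policies_response.get("ScalingPolicies", []):
        config = policy.get("TargetTrackingScalingPolicyConfiguration")
        if config is None:
            continue  # not a target tracking policy
        spec = config.get("PredefinedMetricSpecification", {})
        metric_type = spec.get("PredefinedMetricType", "custom")
        summary[metric_type] = config["TargetValue"]
    return summary


def fetch_policies(cluster: str, service: str) -> Dict[str, Any]:
    """Call DescribeScalingPolicies for one ECS service (requires credentials)."""
    import boto3  # imported lazily so the summary helper works offline

    client = boto3.client("application-autoscaling")
    return client.describe_scaling_policies(
        ServiceNamespace="ecs",
        ResourceId=f"service/{cluster}/{service}",
    )


# Offline demonstration with a response of the expected shape.
sample = {
    "ScalingPolicies": [
        {
            "TargetTrackingScalingPolicyConfiguration": {
                "TargetValue": 50.0,
                "PredefinedMetricSpecification": {
                    "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
                },
            }
        },
        {
            "TargetTrackingScalingPolicyConfiguration": {
                "TargetValue": 50.0,
                "PredefinedMetricSpecification": {
                    "PredefinedMetricType": "ECSServiceAverageMemoryUtilization"
                },
            }
        },
    ]
}
print(target_tracking_summary(sample))
```

An empty result for the deployed service would suggest that no target tracking policy is attached, which is worth ruling out before adjusting thresholds.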

# Elastic File System (EFS)
<a name="elastic-file-system-efs"></a>

## Percent of I/O utilization
<a name="percent-of-io-utilization"></a>
+ The alarm changes state if the I/O utilization is consistently equal to or greater than 100% for 1 minute, indicating the need for additional capacity.
+ The alarm returns to the `OK` state if the I/O utilization is within the acceptable threshold for 5 minutes.
+ If this metric is often at 100%, consider moving the application to an EFS file system that uses the Max I/O performance mode.
+ Metric: `PercentIOLimit` >= 100%

# AWS Lake Formation permission errors
<a name="aws-lake-formation-permission-errors"></a>

This solution is configured to use IAM permissions for all AWS Glue Data Catalog resources. If you previously configured your AWS account to use Lake Formation for all new databases and tables before deploying the solution, you might see the following error when the Metric ETL Glue Job attempts to run for the first time:

```
AccessDeniedException: An error occurred (AccessDeniedException) when calling the GetTable operation: Insufficient Lake Formation permission(s)
```

To fix this error without reverting your account-wide permissions back to the default settings, you must grant the Glue Job IAM role permission to access the solution database and table resources.

1. To grant the `MetricsEtlJobRole` **Super** permissions to all tables within the solution database, see [Granting table permissions using the named resource method](https://docs.aws.amazon.com/lake-formation/latest/dg/granting-table-permissions.html) in the *AWS Lake Formation Developer Guide*.

1. Re-run any failed Glue jobs, making sure to pass in the `--object_keys` parameter with the failed parameter values from previous runs.
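
The two steps above can also be scripted. The following sketch assumes `boto3` is installed and uses the Lake Formation `GrantPermissions` and AWS Glue `StartJobRun` APIs. In the API, the console's **Super** permission corresponds to `ALL`. The role ARN, database name, and job name are placeholders that you must look up in your deployment, and the `--object_keys` value should be copied from the failed run rather than constructed.

```python
from typing import Any, Dict


def grant_request(role_arn: str, database_name: str) -> Dict[str, Any]:
    """Build GrantPermissions arguments covering every table in one database."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": role_arn},
        "Resource": {
            "Table": {
                "DatabaseName": database_name,
                "TableWildcard": {},  # all tables in the database
            }
        },
        "Permissions": ["ALL"],  # shown as "Super" in the console
    }


def grant_and_rerun(role_arn: str, database_name: str, job_name: str, object_keys: str) -> str:
    """Grant the permissions, then start a new run of the failed Glue job.

    All four arguments are values from your own deployment; none are
    defaults defined by this guide.
    """
    import boto3  # imported lazily; requires AWS credentials

    boto3.client("lakeformation").grant_permissions(**grant_request(role_arn, database_name))
    run = boto3.client("glue").start_job_run(
        JobName=job_name,
        Arguments={"--object_keys": object_keys},
    )
    return run["JobRunId"]


# The request can be inspected without calling AWS (placeholder values).
request = grant_request(
    "arn:aws:iam::111122223333:role/ExampleMetricsEtlJobRole",
    "example_solution_database",
)
print(request["Resource"])
```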

# Get Further Assistance
<a name="contact-aws-support"></a>

If you have questions or need further assistance with implementing the solution guidance, reach out to your AWS account team to chat with an AWS subject matter expert on this solution guidance.