

# Collect NVIDIA GPU metrics
<a name="CloudWatch-Agent-NVIDIA-GPU"></a>

You can use the CloudWatch agent to collect NVIDIA GPU metrics from Linux servers. To set this up, add an `nvidia_gpu` section inside the `metrics_collected` section of the CloudWatch agent configuration file. For more information, see [Linux section](CloudWatch-Agent-Configuration-File-Details.md#CloudWatch-Agent-Linux-section).

Additionally, the instance must have an NVIDIA driver installed. NVIDIA drivers are pre-installed on some Amazon Machine Images (AMIs). If the driver is not pre-installed, you can install it manually. For more information, see [Install NVIDIA drivers on Linux instances](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/install-nvidia-driver.html).
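
The `nvidia_gpu` section described above might look like the following minimal sketch. It assumes that the section accepts a `measurement` array and a `metrics_collection_interval` value in the same way as other sections under `metrics_collected`. The three metrics chosen and the 60-second interval are illustrative only.

```json
{
  "metrics": {
    "metrics_collected": {
      "nvidia_gpu": {
        "measurement": [
          "utilization_gpu",
          "temperature_gpu",
          "memory_used"
        ],
        "metrics_collection_interval": 60
      }
    }
  }
}
```

Listing only the metrics that you need keeps the number of published metrics small. Check the Linux section reference for the exact set of supported fields before you rely on this layout.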

The following metrics can be collected. All of these metrics are collected with no CloudWatch `Unit`, but you can specify a unit for each metric by adding a parameter to the CloudWatch agent configuration file. For more information, see [Linux section](CloudWatch-Agent-Configuration-File-Details.md#CloudWatch-Agent-Linux-section).


| Metric | Metric name in CloudWatch | Description | 
| --- | --- | --- | 
|  `utilization_gpu` |  `nvidia_smi_utilization_gpu` |  The percentage of time over the past sample period during which one or more kernels on the GPU were running.  | 
|  `temperature_gpu` |  `nvidia_smi_temperature_gpu` |  The core GPU temperature in degrees Celsius.  | 
|  `power_draw` |  `nvidia_smi_power_draw` |  The last measured power draw for the entire board, in watts.  | 
|  `utilization_memory` |  `nvidia_smi_utilization_memory` |  The percentage of time over the past sample period during which global (device) memory was being read or written.  | 
|  `fan_speed` |  `nvidia_smi_fan_speed` |  The percentage of maximum fan speed that the device's fan is currently intended to run at.  | 
|  `memory_total` |  `nvidia_smi_memory_total` |  Reported total memory, in MB.  | 
|  `memory_used` |  `nvidia_smi_memory_used` |  Memory used, in MB.  | 
|  `memory_free` |  `nvidia_smi_memory_free` |  Memory free, in MB.  | 
|  `pcie_link_gen_current` |  `nvidia_smi_pcie_link_gen_current` |  The current link generation.  | 
|  `pcie_link_width_current` |  `nvidia_smi_pcie_link_width_current` |  The current link width.  | 
|  `encoder_stats_session_count` |  `nvidia_smi_encoder_stats_session_count` |  Current number of encoder sessions.  | 
|  `encoder_stats_average_fps` |  `nvidia_smi_encoder_stats_average_fps` |  The moving average of the encode frames per second.  | 
|  `encoder_stats_average_latency` |  `nvidia_smi_encoder_stats_average_latency` |  The moving average of the encode latency in microseconds.  | 
|  `clocks_current_graphics` |  `nvidia_smi_clocks_current_graphics` |  The current frequency of the graphics (shader) clock, in MHz.  | 
|  `clocks_current_sm` |  `nvidia_smi_clocks_current_sm` |  The current frequency of the Streaming Multiprocessor (SM) clock, in MHz.  | 
|  `clocks_current_memory` |  `nvidia_smi_clocks_current_memory` |  The current frequency of the memory clock, in MHz.  | 
|  `clocks_current_video` |  `nvidia_smi_clocks_current_video` |  The current frequency of the video (encoder plus decoder) clocks, in MHz.  | 
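
Because these metrics are published without a CloudWatch `Unit` by default, you might want to assign one explicitly. The following sketch assumes that `measurement` entries in the `nvidia_gpu` section can be written as objects with `name` and `unit` fields, as in other sections of the agent configuration file. The unit choices shown (`Percent`, `Megabytes`) are illustrative and follow the descriptions in the table above.

```json
{
  "metrics": {
    "metrics_collected": {
      "nvidia_gpu": {
        "measurement": [
          {"name": "utilization_gpu", "unit": "Percent"},
          {"name": "memory_used", "unit": "Megabytes"},
          "temperature_gpu"
        ]
      }
    }
  }
}
```

In this sketch, `temperature_gpu` is listed as a plain string, so it would continue to be published with no unit.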

All of these metrics are collected with the following dimensions:


| Dimension | Description | 
| --- | --- | 
|  `index` |  A unique identifier for the GPU on this server. Represents the NVIDIA Management Library (NVML) index of the device.  | 
|  `name` |  The type of GPU. For example, `NVIDIA Tesla A100`.  | 
|  `arch` |  The server architecture.  | 
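
To illustrate how these dimensions identify a single GPU, the following sketch shows a metric query in the shape used by the CloudWatch `GetMetricData` API. Every value here is an assumption for illustration: the `CWAgent` namespace is the agent default and might be overridden in your configuration, the `Id` is arbitrary, and the dimension values (`0`, `NVIDIA Tesla A100`, `x86_64`) are hypothetical. The agent might also attach further dimensions, such as `host`. CloudWatch matches on the complete dimension set, so a real query must include any such dimensions as well.

```json
{
  "Id": "gpu0_utilization",
  "MetricStat": {
    "Metric": {
      "Namespace": "CWAgent",
      "MetricName": "nvidia_smi_utilization_gpu",
      "Dimensions": [
        {"Name": "index", "Value": "0"},
        {"Name": "name", "Value": "NVIDIA Tesla A100"},
        {"Name": "arch", "Value": "x86_64"}
      ]
    },
    "Period": 60,
    "Stat": "Average"
  }
}
```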