Configure the Prometheus receiver to collect NVIDIA GPU metrics

Learn how to configure and activate the component for NVIDIA GPUs.

You can monitor the performance of NVIDIA GPUs by configuring your Kubernetes cluster to send NVIDIA GPU metrics to Splunk Observability Cloud. This solution uses the Prometheus receiver to collect metrics from the NVIDIA DCGM Exporter, which can be installed independently or as part of the NVIDIA GPU Operator.

For more information on these NVIDIA components, see the NVIDIA DCGM Exporter GitHub repository and About the NVIDIA GPU Operator in the NVIDIA documentation. The NVIDIA DCGM Exporter exposes a /metrics endpoint that publishes Prometheus-compatible metrics.
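For example, a scrape of the /metrics endpoint returns samples in the Prometheus exposition format. The following output is illustrative only; the exact labels and values depend on your GPU, exporter version, and configuration:

Text
# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="GPU-3ca2f6af-10d6-30a5-b45b-158fc83e6d33",device="nvidia0",modelName="NVIDIA A10G"} 36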

Complete the following steps to collect metrics from NVIDIA GPUs.

Before you configure the Prometheus receiver to collect metrics from NVIDIA GPUs, install the NVIDIA DCGM Exporter on your Kubernetes cluster, either independently or as part of the NVIDIA GPU Operator. Then complete the following steps:
  1. Install the Splunk Distribution of the OpenTelemetry Collector for Kubernetes using Helm.
  2. Activate the Prometheus receiver for the NVIDIA DCGM Exporter in the Collector configuration by making the following changes to your configuration file:
    1. Add receiver_creator to the receivers section. For more information on using the receiver creator receiver, see Receiver creator receiver. The receiver creator relies on the k8s_observer extension to discover pods; a minimal sketch of that extension appears after these steps.
    2. Add receiver_creator to the metrics pipeline of the service section.
    Example configuration file:
    YAML
    agent:
      config:
        receivers:
          receiver_creator:
            watch_observers: [ k8s_observer ]
            receivers:
              prometheus:
                config:
                  config:
                    scrape_configs:
                      - job_name: gpu-metrics
                        static_configs:
                          - targets:
                              - '`endpoint`:9400'
                rule: type == "pod" && labels["app"] == "nvidia-dcgm-exporter"
        service:
          pipelines:
            metrics/nvidia-gpu-metrics:
              exporters:
                - signalfx
              processors:
                - memory_limiter
                - batch
                - resourcedetection
                - resource
              receivers:
                - receiver_creator
  3. Restart the Splunk Distribution of the OpenTelemetry Collector.
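
The watch_observers setting in the previous example references the k8s_observer extension, which discovers pods through the Kubernetes API. The Helm chart for the Splunk Distribution of the OpenTelemetry Collector configures this observer by default. If you maintain the configuration yourself, the following is a minimal sketch of the extension, assuming the Collector runs with a service account that is allowed to list pods:

YAML
agent:
  config:
    extensions:
      k8s_observer:
        auth_type: serviceAccount
        observe_pods: true
    service:
      extensions:
        - k8s_observer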

Configuration settings

Learn about the configuration settings for the Prometheus receiver.

To view the configuration options for the Prometheus receiver, see Settings.
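
As an illustration, the following sketch shows two settings that are commonly adjusted in scrape_configs: scrape_interval and metrics_path. Both are standard Prometheus scrape options; the values shown are examples only, and the target address is a placeholder:

YAML
prometheus:
  config:
    scrape_configs:
      - job_name: gpu-metrics
        # How often to scrape the endpoint. The Prometheus default is 1m.
        scrape_interval: 10s
        # Path to scrape. Shown for clarity; /metrics is the default.
        metrics_path: /metrics
        static_configs:
          - targets:
              - localhost:9400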

Metrics

Learn about the available monitoring metrics for NVIDIA GPUs.

The following metrics are available for NVIDIA GPUs. These metrics fall under the default metric category.

For more information on these metrics, see metrics-configmap.yaml in the NVIDIA DCGM Exporter GitHub repository.
Metric name | Type | Unit | Description
DCGM_FI_DEV_SM_CLOCK | gauge | MHz | SM clock frequency.
DCGM_FI_DEV_MEM_CLOCK | gauge | MHz | Memory clock frequency.
DCGM_FI_DEV_MEMORY_TEMP | gauge | °C | Memory temperature.
DCGM_FI_DEV_GPU_TEMP | gauge | °C | GPU temperature.
DCGM_FI_DEV_POWER_USAGE | gauge | W | Power draw.
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION | counter | mJ | Total energy consumption since boot.
DCGM_FI_DEV_PCIE_REPLAY_COUNTER | counter | count | Total number of PCIe retries.
DCGM_FI_DEV_GPU_UTIL | gauge | percent | GPU utilization.
DCGM_FI_DEV_MEM_COPY_UTIL | gauge | percent | Memory utilization.
DCGM_FI_DEV_ENC_UTIL | gauge | percent | Encoder utilization.
DCGM_FI_DEV_DEC_UTIL | gauge | percent | Decoder utilization.
DCGM_FI_DEV_FB_FREE | gauge | MiB | Framebuffer memory free.
DCGM_FI_DEV_FB_USED | gauge | MiB | Framebuffer memory used.
DCGM_FI_PROF_PCIE_TX_BYTES | counter | bytes | Number of bytes of active PCIe TX data, including both header and payload.
DCGM_FI_PROF_PCIE_RX_BYTES | counter | bytes | Number of bytes of active PCIe RX data, including both header and payload.
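
The exporter decides which DCGM fields to publish from a CSV-style field list, which metrics-configmap.yaml wraps in a Kubernetes ConfigMap. The following abbreviated sketch shows the general shape; the ConfigMap name, data key, and comments vary by release:

YAML
apiVersion: v1
kind: ConfigMap
metadata:
  name: metrics-config
data:
  metrics: |
    # Format: DCGM field, Prometheus metric type, help text
    DCGM_FI_DEV_SM_CLOCK,  gauge, SM clock frequency (in MHz).
    DCGM_FI_DEV_GPU_TEMP,  gauge, GPU temperature (in C).
    DCGM_FI_DEV_POWER_USAGE, gauge, Power draw (in W).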

Attributes

Learn about the available attributes for NVIDIA GPUs.

The following attributes are available for NVIDIA GPUs.

Attribute name | Type | Description | Example value
app | string | The name of the application attached to the GPU. | nvidia-dcgm-exporter
DCGM_FI_DRIVER_VERSION | string | The version of the NVIDIA driver installed on the system. | 570.124.06
device | string | The identifier for the specific NVIDIA device or GPU instance. | nvidia0
gpu | number | The index number of the GPU within the system. | 0
modelName | string | The commercial model of the NVIDIA GPU. | NVIDIA A10G
UUID | string | A unique identifier assigned to the GPU. | GPU-3ca2f6af-10d6-30a5-b45b-158fc83e6d33

Next steps

Learn how to monitor your AI components after you set up Observability for AI.

After you set up data collection from supported AI components to Splunk Observability Cloud, the data populates built-in experiences that you can use to monitor and troubleshoot your AI components.

The following table describes the tools you can use to monitor and troubleshoot your AI components.
Monitoring tool | Use this tool to | Link to documentation
Built-in navigators | Orient and explore different layers of your AI tech stack. |
Built-in dashboards | Assess service, endpoint, and system health at a glance. |
Splunk Application Performance Monitoring (APM) service map and trace view | View all of your LLM service dependency graphs and user interactions in the service map or trace view. | Monitor LLM services with Splunk APM