Configure the Prometheus receiver to collect Ray cluster metrics

Learn how to configure the Prometheus receiver to collect Ray cluster metrics.

You can monitor the performance of large language model (LLM) applications that run on a Ray cluster by configuring your Ray applications to send metrics to Splunk Observability Cloud. This solution uses the Prometheus receiver to collect metrics from Ray, which exposes a /metrics endpoint that publishes Prometheus-compatible metrics.

Complete the following steps to collect metrics from Ray cluster applications.

To configure the Prometheus receiver to collect metrics from a Ray cluster, you must first deploy Ray on a local machine or on a cloud server.

  1. Deploy the Splunk Distribution of the OpenTelemetry Collector to your host or container platform.
  2. To activate the Prometheus receiver for your Ray cluster manually in the Collector configuration, make the following changes to your configuration file:
    1. Add prometheus/ray to the receivers section. For example:
      YAML
      prometheus/ray:
        config:
          scrape_configs:
            - job_name: ray-metrics
              metrics_path: /metrics
              static_configs:
                - targets: ['localhost:8080']
    2. Add prometheus/ray to the metrics pipeline of the service section. For example:
      YAML
      service:
        pipelines:
          metrics:
            receivers: [prometheus/ray]
  3. Restart the Splunk Distribution of the OpenTelemetry Collector. For a sketch of a complete configuration that combines these changes, see the example after these steps.
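
The following sketch shows how the receiver and the pipeline entry fit together in a complete configuration. It assumes the signalfx exporter that the Splunk Distribution of the OpenTelemetry Collector commonly uses for metrics; the access token and realm values are placeholders, and you should keep any exporters and processors that your existing configuration already defines.

  YAML
  receivers:
    prometheus/ray:
      config:
        scrape_configs:
          - job_name: ray-metrics
            metrics_path: /metrics
            static_configs:
              - targets: ['localhost:8080']
  exporters:
    signalfx:
      access_token: ${SPLUNK_ACCESS_TOKEN}  # placeholder: your access token
      realm: ${SPLUNK_REALM}                # placeholder: your realm, for example us0
  service:
    pipelines:
      metrics:
        receivers: [prometheus/ray]
        exporters: [signalfx]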

Configuration settings

Learn about the configuration settings for the Prometheus receiver.

To view the configuration options for the Prometheus receiver, see Settings.
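
The prometheus receiver accepts standard Prometheus scrape configuration options, so you can adjust settings such as the scrape interval or drop series you don't need. The following sketch reuses the ray-metrics job from the earlier example; the interval and the relabeling rule are illustrative only.

  YAML
  receivers:
    prometheus/ray:
      config:
        scrape_configs:
          - job_name: ray-metrics
            scrape_interval: 30s           # illustrative: how often to scrape the Ray /metrics endpoint
            metrics_path: /metrics
            static_configs:
              - targets: ['localhost:8080']
            metric_relabel_configs:
              - source_labels: [__name__]
                regex: 'ray_.*'            # illustrative: keep only metrics whose names start with ray_
                action: keep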

Metrics

Learn about the available metrics for Ray cluster applications.

The following metrics are available for Ray cluster applications. These metrics fall under the default metric category. For more information on these metrics, see System metrics in the Ray documentation.
Metric name Type Unit Description
ray_actors gauge count The number of actor processes.
ray_cluster_active_nodes gauge count The number of active Ray nodes in the cluster.
ray_component_cpu_percentage gauge percent Total CPU usage of the components on a node.
ray_component_mem_shared_bytes gauge bytes SHM usage of all components of the node. Equivalent to the top command's SHR column.
ray_component_rss_mb gauge megabytes RSS usage of all components on the node.
ray_gcs_placement_group_count gauge count Number of placement groups broken down by state {Registered, Pending, Infeasible}.
ray_gcs_storage_operation_count_total cumulative counter counter Number of operations invoked on Google Cloud storage.
ray_gcs_storage_operation_latency_ms histogram ms Time to invoke an operation on Google Cloud storage.
ray_gcs_task_manager_task_events_dropped gauge count Number of task events dropped per type {PROFILEEVENT, STATUSEVENT}.
ray_grpc_server_req_finished_total cumulative counter counter Number of finished requests in the gRPC server.
ray_grpc_server_req_handling_total cumulative counter counter Number of handling requests in the gRPC server.
ray_grpc_server_req_new_total count count Number of new requests in the gRPC server.
ray_grpc_server_req_process_time_ms histogram ms Request latency in the gRPC server.
ray_internal_num_infeasible_scheduling_classes gauge count The number of unique scheduling classes that are infeasible.
ray_internal_num_processes_skipped_job_mismatch gauge count The total number of cached workers skipped due to job mismatch.
ray_internal_num_processes_skipped_runtime_environment_mismatch gauge count The total number of cached workers skipped due to runtime environment mismatch.
ray_internal_num_processes_started count count The number of Ray worker processes started.
ray_internal_num_processes_started_from_cache gauge count The total number of workers started from a cached worker process.
ray_internal_num_spilled_tasks gauge count The cumulative number of lease requests that this raylet has spilled to other raylets.
ray_node_cpu_count gauge cpu Total CPUs available on a Ray node.
ray_node_cpu_utilization gauge percent Total CPU usage on a Ray node.
ray_node_disk_io_read_speed gauge bytes/s Disk read speed.
ray_node_disk_io_write_count gauge operations Total write operations to disk.
ray_node_disk_io_write_speed gauge bytes/s Disk write speed.
ray_node_disk_read_iops gauge operations/s Disk read IOPS.
ray_node_disk_utilization_percentage gauge percent Total disk utilization (percentage) on a Ray node.
ray_node_disk_write_iops gauge operations/s Disk write IOPS.
ray_node_mem_total gauge bytes Total memory on a Ray node.
ray_node_mem_used gauge bytes Memory usage on a Ray node.
ray_node_network_received gauge bytes Total network received.
ray_node_network_send_speed gauge bytes/s Network send speed.
ray_node_network_sent gauge bytes Total network sent.
ray_object_directory_added_locations gauge location Number of object locations added per second. If this is high, a lot of objects have been added on this node.
ray_object_directory_lookups gauge count Number of object location lookups per second. If this is high, the raylet is waiting on a high number of objects.
ray_object_directory_removed_locations gauge count Number of object locations removed per second. If this is high, a high number of objects have been removed from this node.
ray_object_directory_subscriptions gauge count Number of object location subscriptions. If this is high, the raylet is attempting to pull a high number of objects.
ray_object_directory_updates gauge count Number of object location updates per second. If this is high, the raylet is attempting to pull a high number of objects or the locations of objects are frequently changing (for example, due to many object copies or evictions).
ray_object_manager_bytes gauge bytes Number of bytes pushed or received by type {PushedFromLocalPlasma, PushedFromLocalDisk, Received}.
ray_object_manager_num_pull_requests gauge count Number of active pull requests for objects.
ray_object_manager_received_chunks gauge count Number of object chunks received by type {Total, FailedTotal, FailedCancelled, FailedPlasmaFull}.
ray_object_store_available_memory gauge bytes Amount of memory currently available in the object store.
ray_object_store_fallback_memory gauge bytes Amount of memory in fallback allocations in the filesystem.
ray_object_store_memory gauge bytes Object store memory by various sub-kinds on this node.
ray_object_store_used_memory gauge bytes Used memory in the Ray object store.
ray_object_store_num_local_objects gauge count Number of objects currently in the object store.
ray_pull_manager_active_bundles gauge count Number of active bundle requests.
ray_pull_manager_num_object_pins count count Number of object pin attempts by the pull manager. Can be {Success, Failure}.
ray_pull_manager_requests count count Number of requested bundles per type {Get, Wait, TaskArgs}.
ray_pull_manager_retries_total gauge count Number of cumulative pull retries.
ray_pull_manager_usage_bytes gauge bytes Total byte usage per type {Available, BeingPulled, Pinned}.
ray_push_manager_chunks count count Number of data chunks pushed by the push manager.
ray_push_manager_in_flight_pushes gauge count Number of in-flight object push requests.
ray_resources gauge resource Logical Ray resources by state {AVAILABLE, USED}.
ray_scheduler_failed_worker_startup_total gauge count Number of tasks that fail to be scheduled because workers were not available. Labels are broken down by reason {JobConfigMissing, RegistrationTimedOut, RateLimited}.
ray_scheduler_tasks gauge count Number of tasks waiting for scheduling by state {Cancelled, Executing, Waiting, Dispatched, Received}.
ray_scheduler_unscheduleable_tasks gauge count Number of pending tasks (not schedulable tasks) by reason {Infeasible, WaitingForResources, WaitingForPlasmaMemory, WaitingForRemoteResources, WaitingForWorkers}.
ray_serve_controller_control_loop_duration_s gauge seconds The duration of the last Ray Serve controller control loop.
ray_serve_deployment_queued_queries gauge count The current number of queries to this deployment waiting to be assigned to a replica.
ray_serve_deployment_replica_healthy gauge boolean Tracks whether this deployment replica is healthy. 1 means healthy, 0 means unhealthy.
ray_serve_num_deployment_http_error_requests gauge count The number of non-200 HTTP responses returned by each deployment.
ray_serve_num_http_error_requests gauge count The number of non-200 HTTP responses.
ray_serve_num_http_requests gauge count The number of HTTP requests processed.
ray_spill_manager_objects gauge count Number of local objects by state {Pinned, PendingRestore, PendingSpill}.
ray_spill_manager_objects_bytes gauge bytes Byte size of local objects by state {Pinned, PendingSpill}.
ray_spill_manager_request_total gauge count Number of {spill, restore} requests.
ray_tasks gauge count Number of tasks currently in a particular state.
ray_worker_register_time_ms histogram ms End-to-end latency of registering worker processes.
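
If you need only a subset of these metrics, you can reduce data volume with a filter processor in the Collector. The following sketch is illustrative only: the processor name, the metric name patterns, and the signalfx exporter carried over from the earlier sketch are examples, not requirements.

  YAML
  processors:
    filter/ray:
      metrics:
        include:
          match_type: regexp
          metric_names:
            - ray_node_.*     # example: node-level resource metrics
            - ray_serve_.*    # example: Ray Serve request and replica metrics
  service:
    pipelines:
      metrics:
        receivers: [prometheus/ray]
        processors: [filter/ray]
        exporters: [signalfx]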

Attributes

Learn about the available attributes for Ray clusters.

The following attributes are available for all supported Ray metrics:

  • host_kernel_release
  • host_physical_cpus
  • server.address
  • host_cpu_cores
  • host_kernel_version
  • SessionName
  • net.host.port
  • http.scheme
  • url.scheme
  • service.instance.id
  • server.port
  • k8s.node.name
  • sf_environment
  • node_type
  • Version
  • host_kernel_name
  • host.name
  • deployment.environment
  • host_machine
  • net.host.name
  • k8s.cluster.name
  • os.type
  • host_mem_total
  • host_logical_cpus
  • sf_service
  • host_processor
  • service.name
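
If an attribute you rely on, such as deployment.environment, isn't already set on your Ray metrics, one way to add it is with a resource processor in the Collector. The following is a sketch: the processor name, the environment value, and the signalfx exporter from the earlier sketch are placeholders.

  YAML
  processors:
    resource/ray:
      attributes:
        - key: deployment.environment
          value: ray-llm-prod             # placeholder: use your own environment name
          action: upsert
  service:
    pipelines:
      metrics:
        receivers: [prometheus/ray]
        processors: [resource/ray]
        exporters: [signalfx]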

Next steps

Learn how to monitor your AI components after you set up Observability for AI.

After you set up data collection from supported AI components to Splunk Observability Cloud, the data populates built-in experiences that you can use to monitor and troubleshoot your AI components.

The following table describes the tools you can use to monitor and troubleshoot your AI components.
Monitoring tool Use this tool to Link to documentation
Built-in navigators Orient and explore different layers of your AI tech stack.
Built-in dashboards Assess service, endpoint, and system health at a glance.
Splunk Application Performance Monitoring (APM) service map and trace view View all of your LLM service dependency graphs and user interactions in the service map or trace view. Monitor LLM services with Splunk APM