Configure the Prometheus receiver to collect Ray cluster metrics

Learn how to configure the Prometheus receiver to collect Ray cluster metrics.

You can monitor the performance of large language model (LLM) applications that run on a Ray cluster by configuring your Ray applications to send metrics to Splunk Observability Cloud. This solution uses the Prometheus receiver to collect metrics from Ray, which exposes a /metrics endpoint that publishes Prometheus-compatible metrics.

Complete the following steps to collect metrics from Ray cluster applications.

To configure the Prometheus receiver to collect metrics from a Ray cluster, you must first deploy Ray on a local machine or on a cloud server.

  1. Deploy the Splunk Distribution of the OpenTelemetry Collector to your host or container platform.
  2. To activate the Prometheus receiver for your Ray cluster manually in the Collector configuration, make the following changes to your configuration file:
    1. Add prometheus/ray to the receivers section. For example:
      YAML
      prometheus/ray:
        config:
          scrape_configs:
            - job_name: ray-metrics
              metrics_path: /metrics
              static_configs:
                - targets: ['localhost:8080']
    2. Add prometheus/ray to the metrics pipeline of the service section. For example:
      YAML
      service:
        pipelines:
          metrics:
            receivers: [prometheus/ray]
  3. Restart the Splunk Distribution of the OpenTelemetry Collector. For a sketch of a complete configuration that combines these changes, see the example after these steps.
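
The following sketch shows how the receiver and the pipeline entry fit together in a complete configuration. It assumes the signalfx exporter that the Splunk Distribution of the OpenTelemetry Collector commonly uses for metrics; the access token and realm values are placeholders, and you should keep any exporters and processors that your existing configuration already defines.

  YAML
  receivers:
    prometheus/ray:
      config:
        scrape_configs:
          - job_name: ray-metrics
            metrics_path: /metrics
            static_configs:
              - targets: ['localhost:8080']
  exporters:
    signalfx:
      access_token: ${SPLUNK_ACCESS_TOKEN}  # placeholder: your access token
      realm: ${SPLUNK_REALM}                # placeholder: your realm, for example us0
  service:
    pipelines:
      metrics:
        receivers: [prometheus/ray]
        exporters: [signalfx]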

Configuration settings

Learn about the configuration settings for the Prometheus receiver.

To view the configuration options for the Prometheus receiver, see Settings.
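
The prometheus receiver accepts standard Prometheus scrape configuration options, so you can adjust settings such as the scrape interval or drop series you don't need. The following sketch reuses the ray-metrics job from the earlier example; the interval and the relabeling rule are illustrative only.

  YAML
  receivers:
    prometheus/ray:
      config:
        scrape_configs:
          - job_name: ray-metrics
            scrape_interval: 30s           # illustrative: how often to scrape the Ray /metrics endpoint
            metrics_path: /metrics
            static_configs:
              - targets: ['localhost:8080']
            metric_relabel_configs:
              - source_labels: [__name__]
                regex: 'ray_.*'            # illustrative: keep only metrics whose names start with ray_
                action: keep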

Metrics

Learn about the available metrics for Ray cluster applications.

The following metrics are available for Ray cluster applications. These metrics fall under the default metric category. For more information on these metrics, see System metrics in the Ray documentation.
Metric name Type Unit Description
ray_actors gauge count The number of actor processes.
ray_cluster_active_nodes gauge count The number of active Ray nodes in the cluster.
ray_component_cpu_percentage gauge percent Total CPU usage of the components on a node.
ray_component_mem_shared_bytes gauge bytes SHM usage of all components of the node. Equivalent to the top command's SHR column.
ray_component_rss_mb gauge megabytes RSS usage of all components on the node.
ray_gcs_placement_group_count gauge count Number of placement groups broken down by state {Registered, Pending, Infeasible}.
ray_gcs_storage_operation_count_total cumulative counter counter Number of operations invoked on Google Cloud storage.
ray_gcs_storage_operation_latency_ms histogram ms Time to invoke an operation on Google Cloud storage.
ray_gcs_task_manager_task_events_dropped gauge count Number of task events dropped per type {PROFILEEVENT, STATUSEVENT}.
ray_grpc_server_req_finished_total cumulative counter counter Number of finished requests in the gRPC server.
ray_grpc_server_req_handling_total cumulative counter counter Number of handling requests in the gRPC server.
ray_grpc_server_req_new_total count count Number of new requests in the gRPC server.
ray_grpc_server_req_process_time_ms histogram ms Request latency in the gRPC server.
ray_internal_num_infeasible_scheduling_classes gauge count The number of unique scheduling classes that are infeasible.
ray_internal_num_processes_skipped_job_mismatch gauge count The total number of cached workers skipped due to job mismatch.
ray_internal_num_processes_skipped_runtime_environment_mismatch gauge count The total number of cached workers skipped due to runtime environment mismatch.
ray_internal_num_processes_started count count The number of Ray worker processes started.
ray_internal_num_processes_started_from_cache gauge count The total number of workers started from a cached worker process.
ray_internal_num_spilled_tasks gauge count The cumulative number of lease requests that this raylet has spilled to other raylets.
ray_node_cpu_count gauge cpu Total CPUs available on a Ray node.
ray_node_cpu_utilization gauge percent Total CPU usage on a Ray node.
ray_node_disk_io_read_speed gauge bytes/s Disk read speed.
ray_node_disk_io_write_count gauge operations Total write operations to disk.
ray_node_disk_io_write_speed gauge bytes/s Disk write speed.
ray_node_disk_read_iops gauge operations/s Disk read IOPS.
ray_node_disk_utilization_percentage gauge percent Total disk utilization (percentage) on a Ray node.
ray_node_disk_write_iops gauge operations/s Disk write IOPS.
ray_node_mem_total gauge bytes Total memory on a Ray node.
ray_node_mem_used gauge bytes Memory usage on a Ray node.
ray_node_network_received gauge bytes Total network received.
ray_node_network_send_speed gauge bytes/s Network send speed.
ray_node_network_sent gauge bytes Total network sent.
ray_object_directory_added_locations gauge location Number of object locations added per second. If this is high, a lot of objects have been added on this node.
ray_object_directory_lookups gauge count Number of object location lookups per second. If this is high, the raylet is waiting on a high number of objects.
ray_object_directory_removed_locations gauge count Number of object locations removed per second. If this is high, a high number of objects have been removed from this node.
ray_object_directory_subscriptions gauge count Number of object location subscriptions. If this is high, the raylet is attempting to pull a high number of objects.
ray_object_directory_updates gauge count Number of object location updates per second. If this is high, the raylet is attempting to pull a high number of objects or the locations of objects are frequently changing (for example, due to many object copies or evictions).
ray_object_manager_bytes gauge bytes Number of bytes pushed or received by type {PushedFromLocalPlasma, PushedFromLocalDisk, Received}.
ray_object_manager_num_pull_requests gauge count Number of active pull requests for objects.
ray_object_manager_received_chunks gauge count Number of object chunks received by type {Total, FailedTotal, FailedCancelled, FailedPlasmaFull}.
ray_object_store_available_memory gauge bytes Amount of memory currently available in the object store.
ray_object_store_fallback_memory gauge bytes Amount of memory in fallback allocations in the filesystem.
ray_object_store_memory gauge bytes Object store memory by various sub-kinds on this node.
ray_object_store_used_memory gauge bytes Used memory in the Ray object store.
ray_object_store_num_local_objects gauge count Number of objects currently in the object store.
ray_pull_manager_active_bundles gauge count Number of active bundle requests.
ray_pull_manager_num_object_pins count count Number of object pin attempts by the pull manager. Can be {Success, Failure}.
ray_pull_manager_requests count count Number of requested bundles per type {Get, Wait, TaskArgs}.
ray_pull_manager_retries_total gauge count Number of cumulative pull retries.
ray_pull_manager_usage_bytes gauge bytes Total byte usage per type {Available, BeingPulled, Pinned}.
ray_push_manager_chunks count count Number of data chunks pushed by the push manager.
ray_push_manager_in_flight_pushes gauge count Number of in-flight object push requests.
ray_resources gauge resource Logical Ray resources by state {AVAILABLE, USED}.
ray_scheduler_failed_worker_startup_total gauge count Number of tasks that fail to be scheduled because workers were not available. Labels are broken down by reason {JobConfigMissing, RegistrationTimedOut, RateLimited}.
ray_scheduler_tasks gauge count Number of tasks waiting for scheduling by state {Cancelled, Executing, Waiting, Dispatched, Received}.
ray_scheduler_unscheduleable_tasks gauge count Number of pending tasks (not schedulable tasks) by reason {Infeasible, WaitingForResources, WaitingForPlasmaMemory, WaitingForRemoteResources, WaitingForWorkers}.
ray_serve_controller_control_loop_duration_s gauge seconds The duration of the last Ray Serve controller control loop.
ray_serve_deployment_queued_queries gauge count The current number of queries to this deployment waiting to be assigned to a replica.
ray_serve_deployment_replica_healthy gauge boolean Tracks whether this deployment replica is healthy. 1 means healthy, 0 means unhealthy.
ray_serve_num_deployment_http_error_requests gauge count The number of non-200 HTTP responses returned by each deployment.
ray_serve_num_http_error_requests gauge count The number of non-200 HTTP responses.
ray_serve_num_http_requests gauge count The number of HTTP requests processed.
ray_spill_manager_objects gauge count Number of local objects by state {Pinned, PendingRestore, PendingSpill}.
ray_spill_manager_objects_bytes gauge bytes Byte size of local objects by state {Pinned, PendingSpill}.
ray_spill_manager_request_total gauge count Number of {spill, restore} requests.
ray_tasks gauge count Number of tasks currently in a particular state.
ray_worker_register_time_ms histogram ms End-to-end latency of registering worker processes.
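
If you need only a subset of these metrics, you can reduce data volume with a filter processor in the Collector. The following sketch is illustrative only: the processor name, the metric name patterns, and the signalfx exporter carried over from the earlier sketch are examples, not requirements.

  YAML
  processors:
    filter/ray:
      metrics:
        include:
          match_type: regexp
          metric_names:
            - ray_node_.*     # example: node-level resource metrics
            - ray_serve_.*    # example: Ray Serve request and replica metrics
  service:
    pipelines:
      metrics:
        receivers: [prometheus/ray]
        processors: [filter/ray]
        exporters: [signalfx]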

Attributes

Learn about the available attributes for Ray clusters.

The following attributes are available for all supported Ray metrics:

  • host_kernel_release
  • host_physical_cpus
  • server.address
  • host_cpu_cores
  • host_kernel_version
  • SessionName
  • net.host.port
  • http.scheme
  • url.scheme
  • service.instance.id
  • server.port
  • k8s.node.name
  • sf_environment
  • node_type
  • Version
  • host_kernel_name
  • host.name
  • deployment.environment
  • host_machine
  • net.host.name
  • k8s.cluster.name
  • os.type
  • host_mem_total
  • host_logical_cpus
  • sf_service
  • host_processor
  • service.name
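
If an attribute you rely on, such as deployment.environment, isn't already set on your Ray metrics, one way to add it is with a resource processor in the Collector. The following is a sketch: the processor name, the environment value, and the signalfx exporter from the earlier sketch are placeholders.

  YAML
  processors:
    resource/ray:
      attributes:
        - key: deployment.environment
          value: ray-llm-prod             # placeholder: use your own environment name
          action: upsert
  service:
    pipelines:
      metrics:
        receivers: [prometheus/ray]
        processors: [resource/ray]
        exporters: [signalfx]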

Next steps

Learn how to monitor your AI components after you set up Observability for AI.

After you set up data collection from supported AI components to Splunk Observability Cloud, the data populates built-in experiences that you can use to monitor and troubleshoot your AI components.

The following table describes the tools you can use to monitor and troubleshoot your AI components.
Monitoring tool Use this tool to Link to documentation
Built-in navigators Orient and explore different layers of your AI tech stack.
Built-in dashboards Assess service, endpoint, and system health at a glance.
Splunk Application Performance Monitoring (APM) service map and trace view View all of your LLM service dependency graphs and user interactions in the service map or trace view. Monitor LLM services with Splunk APM