Metrics
Slurm-web can be integrated with Prometheus (or any compatible solution) to manage Slurm metrics.
When this feature is enabled, the Slurm-web agent exports Slurm metrics in the standard OpenMetrics format on the /metrics endpoint. Prometheus (or a compatible solution) is expected to collect these metrics and store them in a time series database. The agent can then query this database so that the frontend can produce charts with historical values.
The first section explains how to enable the metrics feature. For security reasons, access to exported metrics can be restricted to specific hosts. The next section explains how to configure Prometheus to collect Slurm-web metrics. The metrics query settings are then explained, and the last section provides a reference list of all available metrics.
Get Started
The metrics feature is disabled by default. It can be enabled with the following lines in /etc/slurm-web/agent.ini:
[metrics]
enabled=yes
Access Restriction
For security reasons, the Slurm-web agent restricts access to the /metrics endpoint to localhost only. When Prometheus is running on external hosts, you must define the restrict parameter in /etc/slurm-web/agent.ini to allow other networks explicitly. For example:
[metrics]
enabled=yes
restrict=
  192.168.1.0/24
  10.0.0.251/32
In this example, all IP addresses in the 192.168.1.0/24 network (192.168.1.0 to 192.168.1.255) and the single address 10.0.0.251 are permitted to request metrics.
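To check which client addresses a given list of networks would cover, the CIDR values from the example can be evaluated with Python's standard ipaddress module. This is only a minimal sketch of CIDR matching, not the agent's actual implementation; the is_permitted helper is a hypothetical name, and the default localhost access is not modeled.

```python
import ipaddress

# Networks copied from the restrict example above.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("192.168.1.0/24"),
    ipaddress.ip_network("10.0.0.251/32"),
]


def is_permitted(client_ip: str) -> bool:
    """Return True if client_ip belongs to one of the allowed networks.

    Hypothetical helper for illustration only.
    """
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in ALLOWED_NETWORKS)


if __name__ == "__main__":
    for ip in ("192.168.1.42", "10.0.0.251", "10.0.0.252"):
        print(ip, is_permitted(ip))
```

Running the script prints one line per tested address, which makes it easy to validate a restrict value before deploying it.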
Prometheus Integration
Prometheus must be configured to request the /metrics endpoint of the Slurm-web agent. Edit /etc/prometheus/prometheus.yml to add one of the following configuration snippets, depending on your setup:
- Slurm-web agent running as a native service (i.e. with slurm-web-agent.service):
scrape_configs:
  - job_name: slurm
    scrape_interval: 30s
    static_configs:
      - targets: ['localhost:5012']
- Slurm-web agent running on a production HTTP server:
scrape_configs:
  - job_name: slurm
    scrape_interval: 30s
    metrics_path: /agent/metrics
    static_configs:
      - targets: ['localhost:80']
You may need to adjust the target hostname, typically when Prometheus is running on a remote host, and the destination port (for example, 443 for HTTPS).
Check that the Prometheus scraping job is running properly with this command:
$ curl -s 'http://localhost:9090/api/v1/targets?scrapePool=slurm' | jq
This command reports the status of the Prometheus scraping job, for example:
{
  "status": "success",
  "data": {
    "activeTargets": [
      {
        "discoveredLabels": {
          "__address__": "localhost:80",
          "__metrics_path__": "/agent/metrics",
          "__scheme__": "http",
          "__scrape_interval__": "30s",
          "__scrape_timeout__": "10s",
          "job": "slurm"
        },
        "labels": {
          "instance": "localhost:80",
          "job": "slurm"
        },
        "scrapePool": "slurm",
        "scrapeUrl": "http://localhost:80/agent/metrics",
        "globalUrl": "http://localhost:80/agent/metrics",
        "lastError": "", (1)
        "lastScrape": "2024-10-30T12:08:41.494167925+01:00",
        "lastScrapeDuration": 0.107884764,
        "health": "up", (2)
        "scrapeInterval": "30s",
        "scrapeTimeout": "10s"
      }
    ],
    "droppedTargets": []
  }
}
(1) The lastError field must be empty.
(2) The health field must be up.
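The two checks above can also be automated. The following is a minimal sketch, assuming the response has the shape shown in the example; the unhealthy_targets function name is hypothetical, and the sample payload is a trimmed copy of the output above rather than a live response.

```python
import json


def unhealthy_targets(payload: dict) -> list:
    """Return (scrapeUrl, health, lastError) for every target that fails a check.

    A target passes when health is "up" and lastError is empty.
    """
    problems = []
    for target in payload.get("data", {}).get("activeTargets", []):
        health = target.get("health")
        last_error = target.get("lastError", "")
        if health != "up" or last_error:
            problems.append((target.get("scrapeUrl"), health, last_error))
    return problems


# Trimmed sample shaped like the response shown above.
SAMPLE = json.loads("""
{
  "status": "success",
  "data": {
    "activeTargets": [
      {
        "scrapeUrl": "http://localhost:80/agent/metrics",
        "lastError": "",
        "health": "up"
      }
    ],
    "droppedTargets": []
  }
}
""")

if __name__ == "__main__":
    print(unhealthy_targets(SAMPLE))
```

In practice the payload would come from the output of the curl command, for example by saving it to a file and loading it with json.load.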
Query Settings
In order to query the Prometheus database, Slurm-web must know:
- The URL to access the Prometheus HTTP API,
- The name of the Prometheus job that scrapes Slurm-web metrics. This corresponds to the job_name field in /etc/prometheus/prometheus.yml.
By default, Slurm-web uses http://localhost:9090 and slurm respectively. These values can be changed with the following settings in /etc/slurm-web/agent.ini, for example:
[metrics]
enabled=yes
host=https://metrics.company.ltd
job=slurmweb
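To illustrate how these two settings fit together, the sketch below builds a range query URL for the standard Prometheus HTTP API endpoint /api/v1/query_range, filtering on the job label. It is an assumption about what such a request could look like, not a description of the queries Slurm-web actually sends; the range_query_url function name and the timestamps are hypothetical.

```python
from urllib.parse import urlencode


def range_query_url(host: str, job: str, metric: str,
                    start: int, end: int, step: str) -> str:
    """Build a Prometheus range query URL for one metric of a scraping job.

    host and job mirror the host and job settings of the [metrics] section.
    """
    # PromQL selector restricted to the samples collected by the given job.
    query = f'{metric}{{job="{job}"}}'
    params = urlencode(
        {"query": query, "start": start, "end": end, "step": step}
    )
    return f"{host.rstrip('/')}/api/v1/query_range?{params}"


if __name__ == "__main__":
    # Arbitrary one-hour window expressed as Unix timestamps.
    print(
        range_query_url(
            "https://metrics.company.ltd",
            "slurmweb",
            "slurm_nodes_total",
            1730286000,
            1730289600,
            "30s",
        )
    )
```

Requesting the printed URL with curl is a quick way to confirm that the configured host and job values return data before pointing the agent at them.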
Available Metrics
This table describes all metrics exported by Slurm-web:
Metric | Description |
---|---|
slurm_nodes[state] | Number of compute nodes in a given state. Supported states are: idle, mixed, allocated, down, drain and unknown. |
slurm_nodes_total | Total number of compute nodes managed by Slurm. |
slurm_cores[state] | Number of cores of compute nodes in a given state. Supported states are: idle, mixed, allocated, down, drain and unknown. |
slurm_cores_total | Total number of cores on compute nodes managed by Slurm. |
slurm_jobs[state] | Number of jobs in a given state in Slurm controller queue. Supported states are: running, completed, completing, cancelled, pending and unknown. |
slurm_jobs_total | Total number of jobs in Slurm controller queue. |
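As an illustration of how these metrics might be consumed outside Prometheus, the sketch below parses a few lines of text exposition into a dictionary. It assumes that the state is exposed as a label named state, which this page does not confirm, and the sample values are invented for the example. The parser is deliberately simplified: it ignores comment lines and skips samples that carry a timestamp.

```python
import re

# Made-up sample; the "state" label name and the values are assumptions.
SAMPLE = """\
# TYPE slurm_nodes gauge
slurm_nodes{state="idle"} 12.0
slurm_nodes{state="allocated"} 4.0
slurm_nodes_total 16.0
# EOF
"""

# metric name, optional {labels}, then a single value at end of line.
LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?'
    r'\s+(?P<value>\S+)$'
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text: str) -> dict:
    """Map (metric name, sorted label pairs) to the sample value as a float."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/metadata lines
        match = LINE.match(line)
        if match is None:
            continue  # skip lines this simplified parser does not handle
        labels = dict(LABEL.findall(match.group("labels") or ""))
        key = (match.group("name"), tuple(sorted(labels.items())))
        samples[key] = float(match.group("value"))
    return samples


if __name__ == "__main__":
    for key, value in parse_samples(SAMPLE).items():
        print(key, value)
```

For anything beyond a quick check, a dedicated parser such as the one shipped with the Prometheus client libraries is a safer choice than a regular expression.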
Do you want more Slurm metrics exported by Slurm-web? Contact us to tell us about your needs.