Metrics
Slurm-web can be integrated with Prometheus (or any compatible solution) to manage Slurm metrics.
 
When this feature is enabled, the Slurm-web agent exports these metrics in the
standard OpenMetrics format on the /metrics endpoint. They are designed to be
collected by Prometheus (or a compatible solution) and stored in a time series
database. The agent can then query this database, so the frontend can produce
charts with historical values.
The first section explains how to enable the metrics feature. For security reasons, access to exported metrics can be restricted to specific hosts. The next section explains how to configure Prometheus to collect Slurm-web metrics. The metrics query settings are then explained, and the last section provides a reference list of all available metrics.
Get Started
The metrics feature is disabled by default. It can be enabled with the following
lines in /etc/slurm-web/agent.ini:
[metrics]
enabled=yes
Access Restriction
For security reasons, the Slurm-web agent restricts access to the /metrics
endpoint to localhost only. When Prometheus is running on external hosts, you
must define the restrict parameter in /etc/slurm-web/agent.ini to allow other
networks explicitly. For example:
[metrics]
enabled=yes
restrict=
  192.168.1.0/24
  10.0.0.251/32
In this example, all IP addresses in the range 192.168.1.0 to 192.168.1.255, as
well as 10.0.0.251, are permitted to request metrics.
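To make the effect of the CIDR notation concrete, the following minimal Python sketch reads the example configuration above and tests a few client addresses against it. It is only an illustration built on the standard `configparser` and `ipaddress` modules: the helper names `load_restricted_networks` and `is_allowed` are hypothetical, this is not how the Slurm-web agent implements the check, and the default localhost access is not modelled.

```python
import configparser
import ipaddress

# Same content as the agent.ini example above.
SAMPLE_AGENT_INI = """\
[metrics]
enabled=yes
restrict=
  192.168.1.0/24
  10.0.0.251/32
"""


def load_restricted_networks(ini_text):
    """Return the networks listed in the restrict parameter (hypothetical helper)."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    # Indented continuation lines are joined with newlines by configparser,
    # so splitting on whitespace yields one item per network.
    raw = parser.get("metrics", "restrict", fallback="")
    return [ipaddress.ip_network(item) for item in raw.split()]


def is_allowed(client_ip, networks):
    """Tell whether client_ip belongs to at least one of the networks."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in networks)


NETWORKS = load_restricted_networks(SAMPLE_AGENT_INI)
for candidate in ("192.168.1.42", "10.0.0.251", "10.0.0.250"):
    verdict = "allowed" if is_allowed(candidate, NETWORKS) else "denied"
    print(f"{candidate}: {verdict}")
```

A /32 prefix matches exactly one address, which is why 10.0.0.250 is denied while 10.0.0.251 is allowed.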
Prometheus Integration
Prometheus must be configured to request the /metrics endpoint of the Slurm-web
agent. Edit /etc/prometheus/prometheus.yml to add one of the following
configuration snippets, depending on your setup:
- Slurm-web agent running as a native service (i.e. with slurm-web-agent.service):
scrape_configs:
  - job_name: slurm
    scrape_interval: 30s
    static_configs:
      - targets: ['localhost:5012']
- Slurm-web agent running on a production HTTP server:
scrape_configs:
  - job_name: slurm
    scrape_interval: 30s
    metrics_path: /agent/metrics
    static_configs:
      - targets: ['localhost:80']
Note: You may need to adjust the target hostname, typically if Prometheus is running on a remote host, and the destination port (for example 443 for HTTPS).
Check that the Prometheus scraping job is running properly with this command:
$ curl -s 'http://localhost:9090/api/v1/targets?scrapePool=slurm' | jq
This command reports the status of the Prometheus scraping job, for example:
{
  "status": "success",
  "data": {
    "activeTargets": [
      {
        "discoveredLabels": {
          "__address__": "localhost:80",
          "__metrics_path__": "/agent/metrics",
          "__scheme__": "http",
          "__scrape_interval__": "30s",
          "__scrape_timeout__": "10s",
          "job": "slurm"
        },
        "labels": {
          "instance": "localhost:80",
          "job": "slurm"
        },
        "scrapePool": "slurm",
        "scrapeUrl": "http://localhost:80/agent/metrics",
        "globalUrl": "http://localhost:80/agent/metrics",
        "lastError": "", (1)
        "lastScrape": "2024-10-30T12:08:41.494167925+01:00",
        "lastScrapeDuration": 0.107884764,
        "health": "up", (2)
        "scrapeInterval": "30s",
        "scrapeTimeout": "10s"
      }
    ],
    "droppedTargets": []
  }
}
(1) The lastError field must be empty.
(2) The health field must be up.
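The two conditions above can also be checked by a script rather than by reading the JSON by eye. The sketch below is one possible way to do it with the Python standard library, assuming Prometheus listens on http://localhost:9090 and the scrape job is named slurm as in the examples; the function names `fetch_targets` and `find_target_problems` are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def fetch_targets(base_url="http://localhost:9090", scrape_pool="slurm"):
    """Download the targets report from the Prometheus HTTP API."""
    query = urlencode({"scrapePool": scrape_pool})
    with urlopen(f"{base_url}/api/v1/targets?{query}", timeout=10) as response:
        return json.load(response)


def find_target_problems(payload):
    """Return one message per active target that is not healthy.

    A target is considered healthy when its health field is "up" and its
    lastError field is empty. An empty list means everything looks fine.
    """
    targets = payload.get("data", {}).get("activeTargets", [])
    if not targets:
        return ["no active target found"]
    problems = []
    for target in targets:
        health = target.get("health")
        last_error = target.get("lastError", "")
        if health != "up" or last_error:
            problems.append(
                f"{target.get('scrapeUrl')}: health={health} lastError={last_error!r}"
            )
    return problems
```

Calling `find_target_problems(fetch_targets())` on the Prometheus host should return an empty list when the scraping job works as expected.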
Query Settings
In order to query the Prometheus database, Slurm-web must know:
- The URL to access the Prometheus HTTP API,
- The name of the Prometheus job that scrapes Slurm-web metrics. This corresponds to the job_name field in /etc/prometheus/prometheus.yml.
By default, Slurm-web uses http://localhost:9090 and slurm values
respectively. This can be changed with the following settings in
/etc/slurm-web/agent.ini, for example:
[metrics]
enabled=yes
host=https://metrics.company.ltd
job=slurmweb
Available Metrics
This table describes all metrics exported by Slurm-web:
| Metric | Description | 
|---|---|
| slurm_nodes[state] | Number of compute nodes in a given state. Supported states are: idle, mixed, allocated, down, drain and unknown. | 
| slurm_nodes_total | Total number of compute nodes managed by Slurm. | 
| slurm_cores[state] | Number of cores of compute nodes in a given state. Supported states are: idle, mixed, allocated, down, drain and unknown. | 
| slurm_cores_total | Total number of cores on compute nodes managed by Slurm. | 
| slurm_jobs[state] | Number of jobs in a given state in Slurm controller queue. Supported states are: running, completed, completing, cancelled, pending and unknown. | 
| slurm_jobs_total | Total number of jobs in Slurm controller queue. | 

Tip: Do you want more Slurm metrics exported by Slurm-web? Contact us to tell us about your needs.
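To inspect the historical values of a metric by hand, the Prometheus HTTP API can be queried directly with the host and job values from the query settings. The sketch below only builds the URL of a range query on the standard /api/v1/query_range endpoint. The helper name `build_range_query_url` is hypothetical, the job label selector is an assumption based on how Prometheus labels scraped series, and the exact queries issued by the Slurm-web agent may differ.

```python
import time
from urllib.parse import urlencode


def build_range_query_url(host, job, metric, start, end, step="30s"):
    """Build a Prometheus range query URL for one metric of one scrape job.

    host and job mirror the host and job settings of agent.ini. start and
    end are Unix timestamps in seconds.
    """
    # Select only the series collected by the given scrape job.
    query = f'{metric}{{job="{job}"}}'
    params = urlencode({"query": query, "start": start, "end": end, "step": step})
    return f"{host.rstrip('/')}/api/v1/query_range?{params}"


# Total number of nodes over the last hour, with the example settings above.
now = time.time()
print(
    build_range_query_url(
        "https://metrics.company.ltd", "slurmweb", "slurm_nodes_total", now - 3600, now
    )
)
```

The resulting URL can be passed to curl or opened in a browser. Prometheus answers with a JSON document that contains one series of timestamped values per matching label set.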