gres.conf - Slurm configuration file for Generic RESource (GRES) management.
gres.conf is an ASCII file which describes the configuration of Generic
RESource(s) (GRES) on each compute node. If the GRES information in the
slurm.conf file does not fully describe those resources, then a gres.conf file
should be included on each compute node and the slurm controller. The file
will always be located in the same directory as
slurm.conf.
If the GRES information in the slurm.conf file fully describes those resources
(i.e. no "Cores", "File" or "Links"
specification is required for that GRES type or that information is
automatically detected), that information may be omitted from the gres.conf
file and only the configuration information in the slurm.conf file will be
used. The gres.conf file may be omitted completely if the configuration
information in the slurm.conf file fully describes all GRES.
If using the
gres.conf file to describe the resources available to nodes,
the first parameter on the line should be
NodeName. If configuring
Generic Resources without specifying nodes, the first parameter on the line
should be
Name.
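As an illustration (the node name and device path here are hypothetical), the
two line forms look like this:
NodeName=tux0 Name=gpu File=/dev/nvidia0   # first parameter is NodeName
Name=gpu File=/dev/nvidia0                 # first parameter is Name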
Parameter names are case insensitive. Any text following a "#" in the
configuration file is treated as a comment through the end of that line.
Changes to the configuration file take effect upon restart of Slurm daemons,
daemon receipt of the SIGHUP signal, or execution of the command
"scontrol reconfigure" unless otherwise noted.
NOTE: Slurm support for gres/[mps|shard] requires the use of the
select/cons_tres plugin. For more information on how to configure MPS, see
https://slurm.schedmd.com/gres.html#MPS_Management. For more
information on how to configure Sharding, see
https://slurm.schedmd.com/gres.html#Sharding.
For more information on GRES scheduling in general, see
https://slurm.schedmd.com/gres.html.
The overall configuration parameters available include:
- AutoDetect
- The hardware detection mechanisms to enable for automatic
GRES configuration. Currently, the options are:
- nvml
- Automatically detect NVIDIA GPUs. Requires the NVIDIA
Management Library (NVML).
- off
- Do not automatically detect any GPUs. Used to override
other options.
- oneapi
- Automatically detect Intel GPUs. Requires the Intel oneAPI
Level Zero library.
- rsmi
- Automatically detect AMD GPUs. Requires the ROCm System
Management Interface (ROCm SMI) Library.
AutoDetect can be on a line by itself, in which case
it will globally apply to all lines in gres.conf by default. In addition,
AutoDetect can be combined with NodeName to only apply to
certain nodes. Node-specific AutoDetects will trump the global
AutoDetect. A node-specific AutoDetect only needs to be
specified once per node. If specified multiple times for the same nodes,
they must all be the same value. To unset AutoDetect for a node
when a global AutoDetect is set, simply set it to "off"
in a node-specific GRES line. E.g.: NodeName=tux3 AutoDetect=off
Name=gpu File=/dev/nvidia[0-3]. AutoDetect cannot be used with
cloud nodes.
AutoDetect will automatically detect files, cores, links, and any
other hardware. If a parameter such as File, Cores, or
Links is specified when AutoDetect is used, then the
specified values are used to sanity check the auto-detected values. If
there is a mismatch, then the node's state is set to invalid and the node
is drained.
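For example, the following sketch (device paths and the Type value are
illustrative) combines NVML detection with explicit File and Type values,
which Slurm then checks against the auto-detected hardware:
AutoDetect=nvml
Name=gpu Type=tesla File=/dev/nvidia[0-1]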
- Count
- Number of resources of this name/type available on this
node. The default value is set to the number of File values
specified (if any), otherwise the default value is one. A suffix of
"K", "M", "G", "T" or
"P" may be used to multiply the number by 1024, 1048576,
1073741824, 1099511627776 or 1125899906842624, respectively. For example:
"Count=10G".
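For instance (values are illustrative), Count may be given explicitly with a
suffix, or left to default to the number of File entries:
Name=gpu File=/dev/nvidia[0-3]                       # Count defaults to 4
Name=bandwidth Type=lustre Count=4G Flags=CountOnly  # 4G = 4 x 1073741824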
- Cores
- Optionally specify the core index numbers for the specific
cores which can use this resource. For example, it may be strongly
preferable to use specific cores with specific GRES devices (e.g. on a
NUMA architecture). While Slurm can track and assign resources at the CPU
or thread level, the scheduling algorithms it uses to co-allocate GRES
devices with CPUs operate at a socket or NUMA level for job allocations.
Therefore it is not possible to preferentially assign GRES to different
specific CPUs within the same NUMA node or socket, and this option should
generally be used to identify all cores on some socket. However, job step
allocation with --exact does look at cores directly, so more specific core
identification may be useful there.
Multiple cores may be specified using a comma-delimited list or a range may
be specified using a "-" separator (e.g. "0,1,2,3" or
"0-3"). If a job specifies --gres-flags=enforce-binding,
then only the identified cores can be allocated with each generic
resource. This will tend to improve performance of jobs, but delay the
allocation of resources to them. If Cores is specified and a job is not
submitted with the --gres-flags=enforce-binding option, the
identified cores will be preferred for scheduling with each generic
resource.
If --gres-flags=disable-binding is specified, then any core can be
used with the resources, which also increases the speed of Slurm's
scheduling algorithm but can degrade the application performance. The
--gres-flags=disable-binding option is currently required to use
more CPUs than are bound to a GRES (e.g. if a GPU is bound to the CPUs on
one socket, but resources on more than one socket are required to run the
job). If any core can be effectively used with the resources, then do not
specify the cores option for improved speed in the Slurm scheduling
logic. A restart of the slurmctld is needed for changes to the Cores
option to take effect.
NOTE: Since Slurm must be able to perform resource management on
heterogeneous clusters having various processing unit numbering schemes, a
logical core index must be specified instead of the physical core index.
That logical core index might not correspond to your physical core index
number. Core 0 will be the first core on the first socket, while core 1
will be the second core on the first socket. This numbering coincides with
the logical core number (Core L#) seen in "lstopo -l" command
output.
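As a sketch (the core ranges are illustrative and should be taken from
"lstopo -l" output on the node in question), a two-socket node with one GPU
attached to each socket might be described as:
Name=gpu Type=tesla File=/dev/nvidia0 Cores=0-7
Name=gpu Type=tesla File=/dev/nvidia1 Cores=8-15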
- File
- Fully qualified pathname of the device files associated
with a resource. The name can include a numeric range suffix to be
interpreted by Slurm (e.g. File=/dev/nvidia[0-3]).
This field is generally required if enforcement of generic resource
allocations is to be supported (i.e. prevents users from making use of
resources allocated to a different user). Enforcement of the file
allocation relies upon Linux Control Groups (cgroups) and Slurm's
task/cgroup plugin, which will place the allocated files into the job's
cgroup and prevent use of other files. Please see Slurm's Cgroups Guide
for more information: https://slurm.schedmd.com/cgroups.html.
If File is specified then Count must be either set to the
number of file names specified or not set (the default value is the number
of files specified). The exception to this is MPS/Sharding. For either of
these GRES, each GPU would be identified by device file using the
File parameter and Count would specify the number of entries
that would correspond to that GPU. For MPS, typically 100 or some multiple
of 100. For Sharding, typically the maximum number of jobs that could
simultaneously share that GPU (see the illustrative sketch below).
If using a card with Multi-Instance GPU functionality, use
MultipleFiles instead. File and MultipleFiles are
mutually exclusive.
NOTE: File is required for all gpu typed GRES.
NOTE: If you specify the File parameter for a resource on some node,
the option must be specified on all nodes and Slurm will track the
assignment of each specific resource on each node. Otherwise Slurm will
only track a count of allocated resources rather than the state of each
individual device file.
NOTE: Drain a node before changing the count of records with File
parameters (e.g. if you want to add or remove GPUs from a node's
configuration). Failure to do so will result in any job using those GRES
being aborted.
NOTE: When specifying File, Count is limited in size
(currently 1024) for each node.
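For example (the shard count is illustrative and assumes gpu and shard are
listed in GresTypes in slurm.conf), to let each of two GPUs be shared by up
to eight jobs via sharding, each GPU is identified by its device file and
Count gives the number of shards on that device:
Name=gpu File=/dev/nvidia[0-1]
Name=shard Count=8 File=/dev/nvidia0
Name=shard Count=8 File=/dev/nvidia1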
- Flags
- Optional flags that can be specified to change configured
behavior of the GRES.
Allowed values at present are:
- CountOnly
- Do not attempt to load the plugin as this GRES will only be
used to track counts of GRES used. This avoids attempting to load a
non-existent plugin, which can be expensive on filesystems with
high-latency metadata operations for non-existent files.
- one_sharing
- To be used on a shared gres. If using a shared gres (mps)
on top of a sharing gres (gpu), only allow one of the sharing gres to be
used by the shared gres. This is the default for MPS.
NOTE: If a gres has this flag configured it is global, so all other nodes
with that gres will have this flag implied. This flag is not compatible
with all_sharing for a specific gres.
- all_sharing
- To be used on a shared gres. This is the opposite of
one_sharing and can be used to allow all sharing gres (gpu) on a node to
be used for shared gres (mps).
NOTE: If a gres has this flag configured it is global, so all other nodes
with that gres will have this flag implied. This flag is not compatible
with one_sharing for a specific gres.
- nvidia_gpu_env
- Set environment variable CUDA_VISIBLE_DEVICES for
all GPUs on the specified node(s).
- amd_gpu_env
- Set environment variable ROCR_VISIBLE_DEVICES for
all GPUs on the specified node(s).
- intel_gpu_env
- Set environment variable ZE_AFFINITY_MASK for all
GPUs on the specified node(s).
- opencl_env
- Set environment variable GPU_DEVICE_ORDINAL for all
GPUs on the specified node(s).
- no_gpu_env
- Set no GPU-specific environment variables. This is mutually
exclusive to all other environment-related flags.
- If no environment-related flags are specified, then
nvidia_gpu_env, amd_gpu_env, intel_gpu_env, and
opencl_env will be implicitly set by default. If AutoDetect
is used and environment-related flags are not specified, then
AutoDetect=nvml will set nvidia_gpu_env,
AutoDetect=rsmi will set amd_gpu_env, and
AutoDetect=oneapi will set intel_gpu_env. Conversely,
specified environment-related flags will always override
AutoDetect.
Environment-related flags set on one GRES line will be inherited by the GRES
line directly below it if no environment-related flags are specified on
that line and if it is of the same node, name, and type.
Environment-related flags must be the same for GRES of the same node,
name, and type.
Note that there is a known issue with the AMD ROCm runtime where
ROCR_VISIBLE_DEVICES is processed first, and then
CUDA_VISIBLE_DEVICES is processed. To avoid the issues caused by
this, set Flags=amd_gpu_env for AMD GPUs so only
ROCR_VISIBLE_DEVICES is set.
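For example (the node names, GPU type and device paths are hypothetical), the
AMD workaround described above could be applied as follows so that only
ROCR_VISIBLE_DEVICES is set for these GPUs:
NodeName=tux[8-11] Name=gpu Type=mi250 File=/dev/dri/renderD[128-129] Flags=amd_gpu_env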
- Links
- A comma-delimited list of numbers identifying the number of
connections between this device and other devices to allow coscheduling of
better connected devices. This is an ordered list in which the number of
connections this specific device has to device number 0 would be in the
first position, the number of connections it has to device number 1 in the
second position, etc. A -1 indicates the device itself and a 0 indicates
no connection. If specified, then this line can only contain a single GRES
device (i.e. can only contain a single file via File).
This is an optional value and is usually automatically determined if
AutoDetect is enabled. A typical use case would be to identify GPUs
having NVLink connectivity. Note that for GPUs, the minor number assigned
by the OS and used in the device file (i.e. the X in /dev/nvidiaX)
is not necessarily the same as the device number/index. The device number
is created by sorting the GPUs by PCI bus ID and then numbering them
starting from the smallest bus ID. See
https://slurm.schedmd.com/gres.html#GPU_Management.
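As an illustrative sketch (the link counts are made up), two GPUs connected
to each other by NVLink might be described as follows; note that each line
names exactly one device:
Name=gpu Type=tesla File=/dev/nvidia0 Cores=0-7 Links=-1,2
Name=gpu Type=tesla File=/dev/nvidia1 Cores=8-15 Links=2,-1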
- MultipleFiles
- Fully qualified pathname of the device files associated
with a resource. Graphics cards using Multi-Instance GPU (MIG) technology
will present multiple device files that should be managed as a single
generic resource. The file names can be a comma-separated list, or they can
include a numeric range suffix (e.g. MultipleFiles=/dev/nvidia[0-3]).
Drain a node before changing the count of records with the
MultipleFiles parameter, such as when adding or removing GPUs from
a node's configuration. Failure to do so will result in any job using
those GRES being aborted.
When not using GPUs with MIG functionality, use File instead.
MultipleFiles and File are mutually exclusive.
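As a sketch (the Type value and nvidia-caps file numbers are illustrative and
depend on the actual MIG configuration), a single MIG instance combines its
parent GPU device file with its capability device files on one line:
Name=gpu Type=1g.5gb MultipleFiles=/dev/nvidia0,/dev/nvidia-caps/nvidia-cap21,/dev/nvidia-caps/nvidia-cap22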
- Name
- Name of the generic resource. Any desired name may be used.
The name must match a value in GresTypes in slurm.conf. Each
generic resource has an optional plugin which can provide
resource-specific functionality. Generic resources that currently include
an optional plugin are:
- gpu
- Graphics Processing Unit
- mps
- CUDA Multi-Process Service (MPS)
- nic
- Network Interface Card
- shard
- Shards of a gpu
- NodeName
- An optional NodeName specification can be used to permit
one gres.conf file to be used for all compute nodes in a cluster by
specifying the node(s) that each line should apply to. The NodeName
specification can use a Slurm hostlist specification as shown in the
example below.
- Type
- An optional arbitrary string identifying the type of
generic resource. For example, this might be used to identify a specific
model of GPU, which users can then specify in a job request. A restart of
the slurmctld and slurmd daemons is required for changes to
the Type option to take effect.
NOTE: If using autodetect functionality and defining the Type in
your gres.conf file, the Type specified should match or be a substring of
the value that is detected, using an underscore in lieu of any
spaces.
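For illustration (the detected value is hypothetical), if autodetection
reports a GPU whose normalized name is "tesla_t4", either of the following
Type values would pass the check, since each matches or is a substring of
that value:
AutoDetect=nvml
Name=gpu Type=tesla_t4 File=/dev/nvidia0
# Name=gpu Type=t4 File=/dev/nvidia0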
##################################################################
# Slurm's Generic Resource (GRES) configuration file
# Define GPU devices with MPS support, with AutoDetect sanity checking
##################################################################
AutoDetect=nvml
Name=gpu Type=gtx560 File=/dev/nvidia0 COREs=0,1
Name=gpu Type=tesla File=/dev/nvidia1 COREs=2,3
Name=mps Count=100 File=/dev/nvidia0 COREs=0,1
Name=mps Count=100 File=/dev/nvidia1 COREs=2,3
##################################################################
# Slurm's Generic Resource (GRES) configuration file
# Overwrite system defaults and explicitly configure three GPUs
##################################################################
Name=gpu Type=tesla File=/dev/nvidia[0-1] COREs=0,1
# Name=gpu Type=tesla File=/dev/nvidia[2-3] COREs=2,3
# NOTE: nvidia2 device is out of service
Name=gpu Type=tesla File=/dev/nvidia3 COREs=2,3
##################################################################
# Slurm's Generic Resource (GRES) configuration file
# Use a single gres.conf file for all compute nodes - positive method
##################################################################
## Explicitly specify devices on nodes tux0-tux15
# NodeName=tux[0-15] Name=gpu File=/dev/nvidia[0-3]
# NOTE: tux3 nvidia1 device is out of service
NodeName=tux[0-2] Name=gpu File=/dev/nvidia[0-3]
NodeName=tux3 Name=gpu File=/dev/nvidia[0,2-3]
NodeName=tux[4-15] Name=gpu File=/dev/nvidia[0-3]
##################################################################
# Slurm's Generic Resource (GRES) configuration file
# Use NVML to gather GPU configuration information
# for all nodes except one
##################################################################
AutoDetect=nvml
NodeName=tux3 AutoDetect=off Name=gpu File=/dev/nvidia[0-3]
##################################################################
# Slurm's Generic Resource (GRES) configuration file
# Specify some nodes with NVML, some with RSMI, and some with no AutoDetect
##################################################################
NodeName=tux[0-7] AutoDetect=nvml
NodeName=tux[8-11] AutoDetect=rsmi
NodeName=tux[12-15] Name=gpu File=/dev/nvidia[0-3]
##################################################################
# Slurm's Generic Resource (GRES) configuration file
# Define 'bandwidth' GRES to use as a way to limit the
# resource use on these nodes for workflow purposes
##################################################################
NodeName=tux[0-7] Name=bandwidth Type=lustre Count=4G Flags=CountOnly
Copyright (C) 2010 The Regents of the University of California. Produced at
Lawrence Livermore National Laboratory (cf, DISCLAIMER).
Copyright (C) 2010-2022 SchedMD LLC.
This file is part of Slurm, a resource management program. For details, see
<https://slurm.schedmd.com/>.
Slurm is free software; you can redistribute it and/or modify it under the terms
of the GNU General Public License as published by the Free Software
Foundation; either version 2 of the License, or (at your option) any later
version.
Slurm is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR
A PARTICULAR PURPOSE. See the GNU General Public License for more details.
slurm.conf(5)