SLURM User Documentation
SLURM batch job system
SLURM cheatsheet:
https://slurm.schedmd.com/pdfs/summary.pdf
You can submit jobs using a SLURM job script. Below is an example of a simple script:
#!/bin/bash
#SBATCH --time=00:15:00
#SBATCH --mem=10GB
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:1
echo 'Hello, world!'
sleep 30
Resource requests are set with #SBATCH directives. In the example above, --time=00:15:00 sets a time limit of 15 minutes, --mem=10GB assigns 10 GB of RAM to the job, --cpus-per-task=4 assigns 4 CPU cores to the job, and --gres=gpu:1 assigns 1 GPU to the job.
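Inside a running job, Slurm exposes the allocation through environment variables, so a script can match its thread count to the cores it requested. The fragment below is a minimal sketch, assuming a multi-threaded program that honours OMP_NUM_THREADS; the fallback value of 1 is an assumption added so the fragment also runs outside a job.

```shell
#!/bin/bash
#SBATCH --time=00:15:00
#SBATCH --mem=10GB
#SBATCH --cpus-per-task=4

# SLURM_CPUS_PER_TASK is set by Slurm when --cpus-per-task is requested.
# Fall back to 1 (an assumption) so the script also works outside a job.
threads="${SLURM_CPUS_PER_TASK:-1}"
export OMP_NUM_THREADS="$threads"

# SLURM_JOB_ID is set by Slurm inside a job; "none" is only a placeholder.
echo "Job ${SLURM_JOB_ID:-none} using ${OMP_NUM_THREADS} thread(s)"
```

Reading the value from the environment keeps the thread count in sync with the directive if the request is changed later.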
Running the script
To submit the script, run the following on any of the SLURM nodes:
$ sbatch your_script.sh
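If later commands need the job ID, it can be captured at submission time. The sketch below assumes the --parsable option of sbatch, which prints the job ID, optionally followed by a semicolon and the cluster name. The helper name parse_job_id is hypothetical and not part of Slurm.

```shell
#!/bin/bash
# Extract the job ID from `sbatch --parsable` output.
# The output is "<jobid>" or "<jobid>;<cluster>".
# parse_job_id is an illustrative helper, not a Slurm command.
parse_job_id() {
    local parsable_output="$1"
    # Drop everything from the first ';' onwards, if present.
    echo "${parsable_output%%;*}"
}

# Usage on a SLURM node:
#   job_id=$(parse_job_id "$(sbatch --parsable your_script.sh)")
#   echo "Submitted job ${job_id}"
```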
Queues
To see the jobs currently in the queue, run:
$ squeue
Where does the output go?
By default, the output is placed in a file named "slurm-", suffixed with the job ID number and ".out", e.g. slurm-123456.out, in the directory from which the job was submitted. Having the job ID as part of the file name is convenient for troubleshooting.
If your workflow requires it, a different name or location can be specified with the --output directive. Certain replacement symbols can be used in a filename specified this way, such as the job ID number, the job name, or the job array task ID. See the vendor documentation on sbatch for a complete list of replacement symbols and some examples of their use.
Error output will normally appear in the same file as standard output, just as it would if you were typing commands interactively. If you want to send the standard error channel (stderr) to a separate file, use --error.
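The output and error directives can be combined in one job script, as in the sketch below. The job name hello and the logs/ directory are illustrative assumptions. %x (job name) and %j (job ID) are replacement symbols from the sbatch documentation. The logs/ directory is assumed to exist before the job is submitted, since Slurm is not expected to create it.

```shell
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --time=00:05:00
# Standard output goes to e.g. logs/hello-123456.out
#SBATCH --output=logs/%x-%j.out
# Standard error goes to e.g. logs/hello-123456.err
#SBATCH --error=logs/%x-%j.err

echo 'This line goes to the .out file'
echo 'This line goes to the .err file' >&2
```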
Monitoring jobs
Current jobs
By default, squeue shows all the jobs the scheduler is managing at the moment. It runs much faster if you ask only about your own jobs:
$ squeue -u $USER
You can show only running jobs, or only pending jobs:
$ squeue -u <username> -t RUNNING
$ squeue -u <username> -t PENDING
You can show detailed information for a specific job with scontrol:
$ scontrol show job -dd <jobid>
*Do not* run squeue from a script or program at high frequency, e.g., every few seconds. Responding to squeue adds load to Slurm and may interfere with its performance or correct operation.
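When a script does need to wait for a job, a long pause between queries keeps the load low. The function below is a minimal sketch, not a site-recommended tool: the name wait_for_job and the default interval of 60 seconds are assumptions, and the job is treated as finished once it is neither pending nor running.

```shell
#!/bin/bash
# Wait until a job is no longer pending or running, polling slowly.
# wait_for_job and its 60-second default are illustrative assumptions.
wait_for_job() {
    local job_id="$1"
    local interval="${2:-60}"   # seconds between squeue calls

    # -h suppresses the header, -j restricts the query to one job,
    # -t limits the listing to the given states. Empty output (or an
    # error for an unknown job ID) ends the loop.
    while [ -n "$(squeue -h -j "$job_id" -t PENDING,RUNNING 2>/dev/null)" ]; do
        sleep "$interval"
    done
}

# Usage on a SLURM node:
#   wait_for_job 123456        # poll once a minute
#   wait_for_job 123456 300    # poll every five minutes
```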
Cancelling jobs
Use scancel with the job ID to cancel a job:
$ scancel <jobid>
You can also use it to cancel all your jobs, or all your pending jobs:
$ scancel -u $USER
$ scancel -t PENDING -u $USER