SLURM installation on Ubuntu
Preparing for SLURM installation
Munge and SLURM user creation
You want to ensure that the Munge and SLURM users are created on EACH machine (master + nodes) so that the UIDs and GIDs for these users are synced across the cluster. To do so, run the following:
sudo adduser -u 1111 munge --disabled-password --gecos ""
sudo adduser -u 1121 slurm --disabled-password --gecos ""
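To sanity-check that the IDs match, you can compare the output of the following on every machine:
id munge
id slurm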
Passwordless ssh from Master to worker nodes
First we need passwordless SSH between the master and compute nodes. We are still using master as the master node hostname and worker as the worker hostname. On the master:
ssh-keygen
ssh-copy-id admin@worker
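You can confirm passwordless login works before moving on, e.g.:
ssh admin@worker hostname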
Install munge on the master:
sudo apt-get install libmunge-dev libmunge2 munge -y
sudo systemctl enable munge
sudo systemctl start munge
Test munge if you like:
munge -n | unmunge | grep STATUS
Copy the munge key to all the WORKER nodes at /etc/munge/munge.key (and make sure to chown -R munge /etc/munge/munge.key there):
ssh WORKER mkdir -p /etc/munge
scp /etc/munge/munge.key WORKER:/etc/munge/
ssh WORKER chown -R munge:munge /etc/munge/
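If you have several workers, a small loop over the same three commands saves typing. This is just a sketch; it assumes your workers are named worker1 and worker2 and that you run it with sufficient permissions to read /etc/munge/munge.key:
for host in worker1 worker2; do
  ssh $host mkdir -p /etc/munge
  scp /etc/munge/munge.key $host:/etc/munge/
  ssh $host chown -R munge:munge /etc/munge/
done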
Install munge on worker nodes:
sudo apt-get install libmunge-dev libmunge2 munge
sudo systemctl enable munge
sudo systemctl start munge
If you want, you can test munge:
munge -n | unmunge | grep STATUS
Prepare DB for SLURM
These instructions more or less follow this github repo:
https://github.com/mknoxnv/ubuntu-slurm
First we want to clone the repo:
git clone https://github.com/mknoxnv/ubuntu-slurm.git
Install prereqs on BOTH Master and Worker:
sudo apt-get install git gcc make ruby ruby-dev libpam0g-dev libmysqlclient-dev mariadb-server build-essential libssl-dev -y
sudo gem install fpm
Next we set up MariaDB for storing SLURM data:
sudo systemctl enable mysql
sudo systemctl start mysql
sudo mysql -u root
Within mysql:
create database slurm_acct_db;
create user 'slurm'@'localhost';
set password for 'slurm'@'localhost' = password('slurmdbpass');
grant usage on *.* to 'slurm'@'localhost';
grant all privileges on slurm_acct_db.* to 'slurm'@'localhost';
flush privileges;
exit
Ideally you want to change the password to something different from slurmdbpass. This must also be set in the config file ubuntu-slurm/slurmdbd.conf.
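For reference, the database settings in ubuntu-slurm/slurmdbd.conf look roughly like the lines below (values shown are the defaults from this walkthrough; change StoragePass to match whatever password you set in MariaDB):
StorageType=accounting_storage/mysql
StorageUser=slurm
StoragePass=slurmdbpass
StorageLoc=slurm_acct_db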
Install SLURM
Download and install SLURM on Master
Build the SLURM .deb install file
It’s best to check the SchedMD downloads page and use the latest version (right-click the download link to copy its URL for the wget command below). Ideally we’d have a script to scrape the latest version and use it dynamically.
You can use the -j option to specify the number of CPU cores to use for make, e.g. make -j12. htop is a nice package that shows usage stats and quickly shows how many cores you have.
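For example, nproc prints the core count, so you can let make use all of them:
nproc
make -j$(nproc)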
wget https://download.schedmd.com/slurm/slurm-20.11.9.tar.bz2
tar xvjf slurm-20.11.9.tar.bz2
cd slurm-20.11.9
./configure --prefix=/tmp/slurm-build --sysconfdir=/etc/slurm --enable-pam --with-pam_dir=/lib/x86_64-linux-gnu/security/ --without-shared-libslurm
make
make contrib
make install
cd ..
Install SLURM
sudo fpm -s dir -t deb -v 1.0 -n slurm-20.11.9 --prefix=/usr -C /tmp/slurm-build .
sudo dpkg -i slurm-20.11.9_1.0_amd64.deb
Make all the directories we need:
sudo mkdir -p /etc/slurm /etc/slurm/prolog.d /etc/slurm/epilog.d /var/spool/slurm/ctld /var/spool/slurm/d /var/log/slurm
sudo chown slurm /var/spool/slurm/ctld /var/spool/slurm/d /var/log/slurm
Copy slurm control and db services:
sudo cp ubuntu-slurm/slurmdbd.service /etc/systemd/system/
sudo cp ubuntu-slurm/slurmctld.service /etc/systemd/system/
The slurmdbd.conf file should be copied before starting the slurm services:
sudo cp ubuntu-slurm/slurmdbd.conf /etc/slurm/
Start the slurm services:
sudo systemctl daemon-reload
sudo systemctl enable slurmdbd
sudo systemctl start slurmdbd
sudo systemctl enable slurmctld
sudo systemctl start slurmctld
If the master is also going to be a worker/compute node, you should do:
sudo cp ubuntu-slurm/slurmd.service /etc/systemd/system/
sudo systemctl enable slurmd
sudo systemctl start slurmd
Worker nodes
Now install SLURM on worker nodes:
Copy the slurm-20.11.9_1.0_amd64.deb file from the master to the worker nodes. And then on the WORKER node(s):
sudo dpkg -i slurm-20.11.9_1.0_amd64.deb
sudo cp ubuntu-slurm/slurmd.service /etc/systemd/system/
sudo systemctl enable slurmd
sudo systemctl start slurmd
Configuring SLURM
Next we need to set up the configuration file. Copy the default config from the github repo:
cp ubuntu-slurm/slurm.conf /etc/slurm/slurm.conf
Note: for job limits for users, you should add the AccountingStorageEnforce=limits line to the config file (see https://slurm.schedmd.com/resource_limits.html).
Once SLURM is installed on all nodes, we can use the command sudo slurmd -C to print out the machine specs. Then we can copy this line into the config file and modify it slightly: we need to add the number of GPUs in the system and remove the UpTime part at the end. You can get the number of GPUs by running nvidia-smi. Here is an example of a config line:
NodeName=worker1 Gres=gpu:2 CPUs=12 Boards=1 SocketsPerBoard=1 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=128846
Take this line and put it at the bottom of slurm.conf.
Next, set up the gres.conf file. Lines in gres.conf should look like:
NodeName=worker1 Name=gpu File=/dev/nvidia0
NodeName=worker1 Name=gpu File=/dev/nvidia1
NodeName=worker2 Name=gpu File=/dev/nvidia0
If a node has multiple GPUs, keep adding lines for it, incrementing the device number after nvidia (/dev/nvidia0, /dev/nvidia1, and so on).
Gres has more options detailed in the docs:
https://slurm.schedmd.com/slurm.conf.html (near the bottom).
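Optionally, gres.conf lines can also carry a Type so jobs can request a specific GPU model; the model name below is just an illustrative example:
NodeName=worker1 Name=gpu Type=rtx3090 File=/dev/nvidia0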
Finally, we need to copy the .conf files to *all* machines. This includes the slurm.conf file, gres.conf, cgroup.conf, and cgroup_allowed_devices_file.conf. Without these files, things don't seem to work.
sudo cp ubuntu-slurm/cgroup* /etc/slurm/
sudo cp ubuntu-slurm/slurm.conf /etc/slurm/
sudo cp ubuntu-slurm/gres.conf /etc/slurm/
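One way to push these files from the master to every worker is a loop like the sketch below; it assumes workers named worker1 and worker2 and that your account can write to /etc/slurm/ on them (otherwise copy to a temporary directory and sudo mv into place):
for host in worker1 worker2; do
  scp /etc/slurm/slurm.conf /etc/slurm/gres.conf /etc/slurm/cgroup* $host:/etc/slurm/
done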
This directory should also be created on ALL workers:
sudo mkdir -p /var/spool/slurm/d
sudo chown slurm /var/spool/slurm/d
After the conf files have been copied to all workers and the master node, you may want to reboot the computers, or at least restart the slurm services:
Workers:
sudo systemctl restart slurmd
Master:
sudo systemctl restart slurmctld
sudo systemctl restart slurmdbd
sudo systemctl restart slurmd (only if your master node is also part of the slurm queue, which ideally it is not, e.g. daytona-login, watgpu.cs)
Next we just create a cluster:
sudo sacctmgr add cluster compute-cluster
Configure cgroups
cgroups allow SLURM to enforce memory limits on jobs and users. Enable memory cgroups on all workers by editing the grub config:
sudo nano /etc/default/grub
And change the following variable to:
GRUB_CMDLINE_LINUX="cgroup_enable=memory swapaccount=1"
sudo update-grub
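Once the machine has been rebooted (see below), you can confirm the kernel picked up these parameters with:
grep cgroup /proc/cmdline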
Finally, I did one last sudo apt update, sudo apt upgrade, and sudo apt autoremove, then rebooted the computers:
sudo reboot
If you are running cgroups in a configuration where the nodes are LXC containers, then the main host where the LXC container(s) run needs its grub updated to the following config.
Edit /etc/default/grub:
sudo nano /etc/default/grub
## change the following line to:
GRUB_CMDLINE_LINUX="systemd.unified_cgroup_hierarchy=0"=
sudo update-grub
And you have to change the LXC config to NOT use cgroup2, so edit any and all such lines in the lxc configs:
lxc config edit container_name
Change all instances of cgroup2 to just cgroup.
Save and Exit
lxc restart container_name
Partitioning and Priority
Prior to creating partitions, you need to ensure a few things. First, the SelectType variable in /etc/slurm/slurm.conf must be set to select/linear. Next, you also want to add the following two lines to /etc/slurm/slurm.conf:
PreemptType=preempt/partition_prio
PreemptMode=SUSPEND,GANG
Remember to make these changes across ALL the machines, or simply edit and copy the latest slurm.conf from the master to ALL the worker machines using scp.
Once these changes have been made, you can create the partitions.
To create partitions, you need to add the following lines to the /etc/slurm/slurm.conf file on ALL the machines.
PartitionName=VISION Nodes=ALL Default=NO PriorityTier=100 AllowAccounts=vision_group OverSubscribe=FORCE:3
PartitionName=SCHOOL Nodes=ALL Default=YES PriorityTier=1 OverSubscribe=NO
Remember to run systemctl restart slurmctld on the master and systemctl restart slurmd on ALL the worker nodes.
The partitions above are examples from the watgpu cluster. The first one is for the VISION group: only users who are part of "vision_group" (a SLURM account, different from an AD group) can use it, and each member of the vision research group must be added to this SLURM account separately from any AD grouping. The other settings such as PriorityTier are priority rankings: a higher PriorityTier number takes precedence over smaller ones. So in this case, anybody who uses the SCHOOL partition (the default partition in the cluster) can have their jobs suspended if a VISION group user submits a job to the VISION partition.
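As an illustration, a vision_group member would target their partition with something like the following (the job script names are just placeholders), while everyone else lands on SCHOOL by default:
sbatch -p VISION --gres=gpu:1 train_model.sh
sbatch my_job.sh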
Troubleshooting
When in doubt, first try updating software with sudo apt update; sudo apt upgrade -y and rebooting (sudo reboot).
Log files
When in doubt, you can check the log files. The locations are set in the slurm.conf file and are /var/log/slurmd.log and /var/log/slurmctld.log by default. Open them with sudo nano /var/log/slurmctld.log. To go to the bottom of the file, use ctrl+_ then ctrl+v. I also changed the log paths to /var/log/slurm/slurmd.log and so on, and changed the permissions of the folder so it is owned by slurm: sudo chown slurm:slurm /var/log/slurm.
Checking SLURM states
Some helpful commands:
scontrol ping
-- this checks if the controller node can be reached. If this isn't working (i.e. the command returns 'DOWN' and not 'UP'), you might need to allow connections to the slurmctld port (SlurmctldPort in the slurm.conf file, 6817 by default). To allow connections with the firewall, execute:
sudo ufw allow from any to any port 6817
and
sudo ufw reload
Error codes 1:0 and 2:0
If you try to run a job with sbatch and the exit code is 1:0, this is usually a file-writing error. The first thing to check is that the output and error file paths in your .job file are correct. Also check that the .py file you want to run has the correct filepath in your .job file. Then you should go to the logs (/var/log/slurm/slurmctld.log) and see which node the job was trying to run on. Then go to that node and open its logs (/var/log/slurm/slurmd.log) to see what they say. They may say something about the path for the output/error files or the path to the .py file being incorrect.
It could also mean your common storage location is not r/w accessible to all nodes. In the logs, this would show up as something about permissions or being unable to write to the filesystem. Double-check that you can create files in the /storage location on all workers with something like touch testing.txt. If you can't create a file from the worker nodes, you probably have some sort of NFS issue. Go back to the NFS section and make sure everything looks ok. You should be able to create directories/files in /storage from any node with the admin account, and they should show up as owned by the admin user. If not, you may have an issue in your /etc/exports or with your GIDs/UIDs not matching.
If the exit code is 2:0, this can mean there is a problem with the location of the Python executable, or some other error when running the Python script. Double-check that the srun command or Python script works as expected with the Python executable specified in the sbatch job file.
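A minimal job file helps isolate these problems; the paths, partition, and Python check below are placeholders to adapt to your cluster:
#!/bin/bash
#SBATCH --job-name=pathtest
#SBATCH --partition=SCHOOL
#SBATCH --output=/storage/youruser/pathtest_%j.out
#SBATCH --error=/storage/youruser/pathtest_%j.err
hostname
which python3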
If some workers are 'draining', down, or unavailable, you might try:
sudo scontrol update NodeName=worker1 State=RESUME
Node is stuck draining (drng in sinfo)
This has happened due to the memory size in slurm.conf being higher than the actual memory size. Double-check the memory with free -m or sudo slurmd -C and update slurm.conf on all machines in the cluster. Then run
sudo scontrol update NodeName=worker1 State=RESUME
Nodes are not visible upon restart
After restarting the master node, sometimes the workers aren't there. I've found I often have to do
sudo scontrol update NodeName=worker1 State=RESUME
to get them working/available.
Taking a node offline
The best way to take a node offline for maintenance is to drain it:
sudo scontrol update NodeName=worker1 State=DRAIN Reason='Maintenance'
Users can see the reason with
sinfo -R
Testing GPU load
Using watch -n 0.1 nvidia-smi will show the GPU load in real time. You can use this to monitor jobs as they are scheduled to make sure all the GPUs are being utilized.
Setting account options
You may want to limit jobs or submissions. Here is how to set attributes (-1 means no limit):
sudo sacctmgr modify account students set GrpJobs=-1
sudo sacctmgr modify account students set GrpSubmitJobs=-1
sudo sacctmgr modify account students set MaxJobs=-1
sudo sacctmgr modify account students set MaxSubmitJobs=-1
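As a concrete example, to cap a (hypothetical) students account at 10 running jobs and then verify the limit:
sudo sacctmgr modify account students set MaxJobs=10
sacctmgr show assoc account=students format=Account,User,MaxJobs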
Groups and membership
These are the commands to create the group:
sacctmgr add account name=vision_group description="VISION Group"
sacctmgr modify account name=vision_group set GrpNodes=VISION
sacctmgr add association cluster=compute-cluster account=vision_group partition=VISION
And then to create and add the user to the group:
sacctmgr add user name=ldpaniak account=vision_group
You can also do the following to view associations:
sacctmgr show association
Better sacct
This shows all running jobs with the user who is running them.
sacct --format=jobid,jobname,state,exitcode,user,account
More on sacct: https://slurm.schedmd.com/sacct.html
Changing IPs
If the IP addresses of your machines change, you will need to update them in /etc/hosts on all machines and in /etc/exports on the master node. It's best to restart after making these changes.
Node not able to connect to slurmctld
If a node isn't able to connect to the controller (server/master), first check that time is properly synced. Try using the date command to see if the times are synced across the servers.
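For example, compare the clocks on each machine and, if they drift, install an NTP client such as chrony:
date
timedatectl status
sudo apt-get install chrony -y
sudo systemctl enable --now chrony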
Configure resource restricted SSH access
To configure resource-limited SSH access, where users share resources on a single node (such as shared GPU access), you need the pam_slurm_adopt.so shared object file on the worker/compute nodes.
Build and install pam_slurm_adopt.so
Go back to the "head node" where slurm-20.11.9 was installed and do the following:
cd slurm-20.11.9/contribs/pam_slurm_adopt/
make && make install
This will run through the install and create the shared object file. Locate the shared object and copy it to the worker node.
scp slurm-20.11.9/contribs/pam_slurm_adopt/.libs/pam_slurm_adopt.so WORKER1:/lib/x86_64-linux-gnu/security/pam_slurm_adopt.so
Edit /etc/slurm/slurm.conf on ALL nodes (including Daytona-login) and add this line:
PrologFlags=contain
Edit /etc/pam.d/sshd and add this line:
account required /lib/x86_64-linux-gnu/security/pam_slurm_adopt.so
Make sure no instances of pam_systemd.so exist in any of the /etc/pam.d/* files, i.e. run grep systemd /etc/pam.d/* and see if there are lines containing pam_systemd.so; if there are, comment them out with #.
Stop and unmask systemd-logind.service on the nodes:
systemctl stop systemd-logind
systemctl unmask systemd-logind
You may need to run a SLURM reconfigure on all the nodes (including the master node) if it doesn't work right away:
scontrol reconfigure
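To verify the restriction, try to SSH to a worker where you have no running job; pam_slurm_adopt should deny the login. Then allocate a job on that node and SSH should succeed (hostname and gres request are examples):
ssh worker1
salloc -w worker1 --gres=gpu:1
ssh worker1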