You want to ensure that the Munge and Slurm users are created on EACH machine (master + nodes) so that their UIDs and GIDs are synced across the cluster. To do so, run the following on every machine:
sudo adduser -u 1111 munge --disabled-password --gecos ""
sudo adduser -u 1121 slurm --disabled-password --gecos ""
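To confirm the IDs really match, a quick sanity check on each machine with the standard id command:
id munge    # should show uid=1111 on every machine
id slurm    # should show uid=1121 on every machine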
First we need passwordless SSH between the master and compute nodes. We are still using master as the master node hostname and worker as the worker hostname. On the master:
ssh-keygen ssh-copy-id admin@worker
Install and start MUNGE on the master:
sudo apt-get install libmunge-dev libmunge2 munge -y
sudo systemctl enable munge
sudo systemctl start munge
Test munge if you like:
munge -n | unmunge | grep STATUS
Copy the munge key to all the WORKER nodes at /etc/munge/munge.key (make sure it is owned by munge, e.g. chown -R munge:munge /etc/munge):
ssh WORKER mkdir -p /etc/munge
scp /etc/munge/munge.key WORKER:/etc/munge/
ssh WORKER chown -R munge:munge /etc/munge/
Then, on each WORKER node, install and start MUNGE:
sudo apt-get install libmunge-dev libmunge2 munge
sudo systemctl enable munge
sudo systemctl start munge
If you want, you can test munge:
munge -n | unmunge | grep STATUS
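As an optional end-to-end check, you can verify that a credential generated on the master decodes on a worker (this assumes the passwordless SSH set up earlier and uses this guide's hostname placeholders):
munge -n | ssh admin@worker unmunge | grep STATUS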
These instructions more or less follow this github repo: https://github.com/mknoxnv/ubuntu-slurm
First we want to clone the repo:
git clone https://github.com/mknoxnv/ubuntu-slurm.git
Install prereqs on BOTH Master and Worker:
sudo apt-get install git gcc make ruby ruby-dev libpam0g-dev libmysqlclient-dev mariadb-server build-essential libssl-dev -y
sudo gem install fpm
Next we set up MariaDB for storing SLURM data:
sudo systemctl enable mysql
sudo systemctl start mysql
sudo mysql -u root
Within mysql:
create database slurm_acct_db;
create user 'slurm'@'localhost';
set password for 'slurm'@'localhost' = password('slurmdbpass');
grant usage on *.* to 'slurm'@'localhost';
grant all privileges on slurm_acct_db.* to 'slurm'@'localhost';
flush privileges;
exit
Ideally you want to change the password to something other than slurmdbpass. This must also be set in the config file ubuntu-slurm/slurmdbd.conf.
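For reference, the database-related settings in slurmdbd.conf look roughly like the following (a sketch of the relevant lines only, using the default password from above):
StorageType=accounting_storage/mysql
StorageUser=slurm
StoragePass=slurmdbpass
StorageLoc=slurm_acct_db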
You can use the -j option to specify the number of CPU cores to use for make, like make -j12. htop is a nice package that will show usage stats and quickly show how many cores you have. Now download and build SLURM:
wget https://download.schedmd.com/slurm/slurm-20.11.9.tar.bz2
tar xvjf slurm-20.11.9.tar.bz2
cd slurm-20.11.9
./configure --prefix=/tmp/slurm-build --sysconfdir=/etc/slurm --enable-pam --with-pam_dir=/lib/x86_64-linux-gnu/security/ --without-shared-libslurm
make
make contrib
make install
cd ..
Package the build into a .deb with fpm and install it:
sudo fpm -s dir -t deb -v 1.0 -n slurm-20.11.9 --prefix=/usr -C /tmp/slurm-build .
sudo dpkg -i slurm-20.11.9_1.0_amd64.deb
Make all the directories we need:
sudo mkdir -p /etc/slurm /etc/slurm/prolog.d /etc/slurm/epilog.d /var/spool/slurm/ctld /var/spool/slurm/d /var/log/slurm
sudo chown slurm /var/spool/slurm/ctld /var/spool/slurm/d /var/log/slurm
Copy slurm control and db services:
sudo cp ubuntu-slurm/slurmdbd.service /etc/systemd/system/
sudo cp ubuntu-slurm/slurmctld.service /etc/systemd/system/
The slurmdbd.conf file should be copied before starting the slurm services:
sudo cp ubuntu-slurm/slurmdbd.conf /etc/slurm/
Start the slurm services:
sudo systemctl daemon-reload
sudo systemctl enable slurmdbd
sudo systemctl start slurmdbd
sudo systemctl enable slurmctld
sudo systemctl start slurmctld
If the master is also going to be a worker/compute node, you should do:
sudo cp ubuntu-slurm/slurmd.service /etc/systemd/system/
sudo systemctl enable slurmd
sudo systemctl start slurmd
Copy the slurm-20.11.9_1.0_amd64.deb file from the master to the worker nodes. Then, on the WORKER node(s):
sudo dpkg -i slurm-20.11.9_1.0_amd64.deb
sudo cp ubuntu-slurm/slurmd.service /etc/systemd/system/
sudo systemctl enable slurmd
sudo systemctl start slurmd
Next we need to set up the configuration file. Copy the default config from the github repo:
sudo cp ubuntu-slurm/slurm.conf /etc/slurm/slurm.conf
Note: to enforce job limits for users, you should add the AccountingStorageEnforce=limits line to the config file (see https://slurm.schedmd.com/resource_limits.html).
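As a rough sketch (the host and cluster names are assumptions based on this guide, so adjust to your setup), the accounting-related lines in slurm.conf would look something like:
ClusterName=compute-cluster
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=master
AccountingStorageEnforce=limits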
Once SLURM is installed on all nodes, we can use the command sudo slurmd -C to print out the machine specs. Then we can copy this line into the config file and modify it slightly: add the number of GPUs we have in the system and remove the last part, which shows UpTime. You can get the number of GPUs by running nvidia-smi. Here is an example of a config line:
NodeName=worker1 Gres=gpu:2 CPUs=12 Boards=1 SocketsPerBoard=1 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=128846
Take this line and put it at the bottom of slurm.conf.
Next, set up the gres.conf file. Lines in gres.conf should look like:
NodeName=worker1 Name=gpu File=/dev/nvidia0
NodeName=worker1 Name=gpu File=/dev/nvidia1
NodeName=worker2 Name=gpu File=/dev/nvidia0
If a node has multiple GPUs, keep adding lines for that node, incrementing the device number after /dev/nvidia.
Gres has more options detailed in the docs: https://slurm.schedmd.com/slurm.conf.html (near the bottom).
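If a node has several identical GPUs, gres.conf also accepts a device range, so the per-GPU lines above could be collapsed into something like:
NodeName=worker1 Name=gpu File=/dev/nvidia[0-1]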
Finally, we need to copy the .conf files to *all* machines. This includes the slurm.conf, gres.conf, cgroup.conf, and cgroup_allowed_devices_file.conf files. Without these files, things don't seem to work.
sudo cp ubuntu-slurm/cgroup* /etc/slurm/
sudo cp ubuntu-slurm/slurm.conf /etc/slurm/
sudo cp ubuntu-slurm/gres.conf /etc/slurm/
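For reference, the cgroup.conf being copied here typically contains constraint settings along these lines (a sketch of common contents, not necessarily the exact file from the repo):
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes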
This directory should also be created on ALL workers:
sudo mkdir -p /var/spool/slurm/d
sudo chown slurm /var/spool/slurm/d
After the conf files have been copied to all workers and the master node, you may want to reboot the computers, or at least restart the slurm services:
Workers:
sudo systemctl restart slurmd
Master:
sudo systemctl restart slurmctld
sudo systemctl restart slurmdbd
sudo systemctl restart slurmd    (only if your master node is also part of the slurm queue, which ideally it is not, e.g. daytona-login, watgpu.cs)
Next we just create a cluster:
sudo sacctmgr add cluster compute-cluster
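You can confirm the cluster was registered with:
sacctmgr show cluster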
cgroups allow SLURM to enforce memory limits on jobs and users. Set up memory cgroups on all workers with:
sudo nano /etc/default/grub
Change the following variable to:
GRUB_CMDLINE_LINUX="cgroup_enable=memory swapaccount=1"
Then run:
sudo update-grub
Finally, at the end, I did one last sudo apt update, sudo apt upgrade, and sudo apt autoremove, then rebooted the computers:
sudo reboot
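After the reboot, you can check that the kernel picked up the cgroup settings by looking at the boot command line:
grep cgroup_enable /proc/cmdline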
If you are running cgroups in a configuration where the nodes are LXC containers, the main host running the LXC container(s) needs its grub config updated as follows:
Edit /etc/default/grub:
sudo nano /etc/default/grub
Change the following line to:
GRUB_CMDLINE_LINUX="systemd.unified_cgroup_hierarchy=0"
sudo update-grub
You also have to change the LXC config to NOT use cgroup2, so change any and all such lines in the lxc configs:
lxc config edit container_name
Change all instances of cgroup2 to just cgroup, then save and exit.
lxc restart container_name
Prior to creating partitions, you need to ensure a few things. First, the SelectType variable in /etc/slurm/slurm.conf must be set to select/linear. Next, you also want to add the following two lines to /etc/slurm/slurm.conf:
PreemptType=preempt/partition_prio
PreemptMode=SUSPEND,GANG
Remember to make these changes across ALL the machines, or simply edit the latest slurm.conf on the master and copy it to ALL the worker machines using scp.
Once these changes have been made, you can create the partitions by adding the following lines to the /etc/slurm/slurm.conf file on ALL the machines:
PartitionName=VISION Nodes=ALL Default=NO PriorityTier=100 AllowAccounts=vision_group OverSubscribe=FORCE:3
PartitionName=SCHOOL Nodes=ALL Default=YES PriorityTier=1 OverSubscribe=NO
Remember to run systemctl restart slurmctld on the master and systemctl restart slurmd on ALL the worker nodes.
The above partitions are examples from the watgpu cluster. The first one is for the VISION group: only users that are part of "vision_group" (a SLURM account, which is different from an AD group) can use it, and each user in the vision research group must be added to this account separately from the AD grouping. Other settings such as PriorityTier are priority rankings: a larger PriorityTier number takes precedence over smaller ones. So in this case, anybody who uses the partition named SCHOOL (the default partition in the cluster) can have their jobs bumped off and paused if a Vision group user submits a job to the VISION partition.
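For example, a vision user would submit to the high-priority partition with something like the following (job.sh is just a placeholder job script), while everyone else's plain sbatch job.sh lands in the default SCHOOL partition:
sbatch -p VISION -A vision_group job.sh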
When in doubt, first try updating software with sudo apt update; sudo apt upgrade -y and rebooting (sudo reboot).
Logs are written to /var/log/slurmd.log and /var/log/slurmctld.log by default. Open them with sudo nano /var/log/slurmctld.log. To go to the bottom of the file, use ctrl+_ and ctrl+v. I also changed the log paths to /var/log/slurm/slurmd.log and so on, and changed the permissions of the folder so it is owned by slurm: sudo chown slurm:slurm /var/log/slurm.
scontrol ping -- this checks if the controller node can be reached. If this isn't working (i.e. the command returns 'DOWN' and not 'UP'), you might need to allow connections to the slurmctld port (SlurmctldPort in the slurm.conf file). This is set to 6817 in the config file. To allow connections through the firewall, execute:
sudo ufw allow from any to any port 6817
and
sudo ufw reload
If a job submitted with sbatch fails with exit code 1:0, this is usually a file writing error. The first thing to check is that the output and error file paths in your .job file are correct. Also check that the .py file you want to run has the correct filepath in your .job file. Then go to the logs (/var/log/slurm/slurmctld.log) and see which node the job was trying to run on. Then go to that node and open its logs (/var/log/slurm/slurmd.log) to see what it says. It may say something about the path for the output/error files, or that the path to the .py file is incorrect.
It could also mean your common storage location is not r/w accessible to all nodes. In the logs, this would show up as something about permissions and being unable to write to the filesystem. Double-check that you can create files in the /storage location on all workers with something like touch testing.txt. If you can't create a file from the worker nodes, you probably have some sort of NFS issue. Go back to the NFS section and make sure everything looks ok. You should be able to create directories/files in /storage from any node with the admin account and they should show up as owned by the admin user. If not, you may have some issue in your /etc/exports or with your GID/UIDs not matching.
If the exit code is 2:0, this can mean there is some problem with either the location of the python executable, or some other error when running the python script. Double check that the srun or python script is working as expected with the python executable specified in the sbatch job file.
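For reference, a minimal .job file with explicit output/error paths might look like the sketch below; the partition name, /storage paths, and script name are assumptions to adapt to your cluster:
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --partition=SCHOOL
#SBATCH --gres=gpu:1
#SBATCH --output=/storage/USERNAME/test_%j.out
#SBATCH --error=/storage/USERNAME/test_%j.err
/usr/bin/python3 /storage/USERNAME/test.py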
If some workers are 'draining', down, or unavailable, you might try:
sudo scontrol update NodeName=worker1 State=RESUME
Another common cause of nodes ending up drained is that the specs in slurm.conf don't match the actual hardware (shown in sinfo). Re-check the real values with free -m or sudo slurmd -C and update slurm.conf on all machines in the cluster. Then run
sudo scontrol update NodeName=worker1 State=RESUME
to get them working/available.
To take a node offline for maintenance, you can drain it:
sudo scontrol update NodeName=worker1 State=DRAIN Reason='Maintenance'
Users can see the reason with sinfo -R.
watch -n 0.1 nvidia-smi will show the GPU load in real-time. You can use this to monitor jobs as they are scheduled to make sure all the GPUs are being utilized.
To remove job limits from an account (here, the students account), set each limit back to -1:
sudo sacctmgr modify account students set GrpJobs=-1
sudo sacctmgr modify account students set GrpSubmitJobs=-1
sudo sacctmgr modify account students set MaxJobs=-1
sudo sacctmgr modify account students set MaxSubmitJobs=-1
These are the commands to create the group:
sacctmgr add account name=vision_group description="VISION Group"
sacctmgr modify account name=vision_group set GrpNodes=VISION
sacctmgr add association cluster=compute-cluster account=vision_group partition=VISION
And then to create and add the user to the group:
sacctmgr add user name=ldpaniak account=vision_group
You can also do the following to view associations:
sacctmgr show association
The following shows jobs (including running ones) along with the user and account that ran them:
sacct --format=jobid,jobname,state,exitcode,user,account
More on sacct: https://slurm.schedmd.com/sacct.html
Double-check /etc/hosts on all machines and /etc/exports on the master node. It's best to restart after making these changes.
Use the date command to see if the times are synced across the servers.
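For example, from the master you can compare clocks across the machines in one shot (the hostnames here are just this guide's placeholders):
for h in master worker1 worker2; do ssh $h date; done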
To restrict SSH access on the compute nodes to users with a running job, you need the pam_slurm_adopt.so shared object file on the worker/compute nodes. Build it from the SLURM source tree on the master:
cd slurm-20.11.9/contribs/pam_slurm_adopt/
make && make install
This will run through the install and create the shared object file. Locate the shared object and copy it to the worker node(s):
scp slurm-20.11.9/contribs/pam_slurm_adopt/.libs/pam_slurm_adopt.so WORKER1:/lib/x86_64-linux-gnu/security/pam_slurm_adopt.so
Edit /etc/slurm/slurm.conf on ALL nodes including Daytona-login and add this line:
PrologFlags=contain
Edit /etc/pam.d/sshd and add this line:
account required /lib/x86_64-linux-gnu/security/pam_slurm_adopt.so
Make sure no instances of pam_systemd.so exist in any of the /etc/pam.d/* files, i.e. run grep systemd /etc/pam.d/* and see if there are lines that contain pam_systemd.so; if there are, comment them out with #.
Stop and unmask systemd-logind.service on the nodes:
systemctl stop systemd-logind
systemctl unmask systemd-logind
You may need to run a slurm reconfigure on all the nodes (including the master node) if it doesn't work right away:
scontrol reconfigure