SLURM installation on Ubuntu

Preparing for SLURM installation

Munge and SLURM user creation

You want to ensure that the Munge and Slurm users are created on EACH machine (master + nodes) so that their UIDs and GIDs are consistent across the cluster. To do so, run the following:

sudo adduser -u 1111 munge --disabled-password --gecos ""
sudo adduser -u 1121 slurm --disabled-password --gecos ""

Passwordless ssh from Master to worker nodes

First we need passwordless SSH between the master and compute nodes. We are still using master as the master node hostname and worker as the worker hostname. On the master:

ssh-keygen
ssh-copy-id admin@worker

Install munge on the master:

sudo apt-get install libmunge-dev libmunge2 munge -y
sudo systemctl enable munge
sudo systemctl start munge

Test munge if you like: munge -n | unmunge | grep STATUS

Copy the munge key to all the WORKER nodes at /etc/munge/munge.key (be sure to chown -R munge:munge /etc/munge/munge.key afterwards):

ssh WORKER mkdir -p /etc/munge
scp /etc/munge/munge.key WORKER:/etc/munge/
ssh WORKER chown -R munge:munge /etc/munge/

Install munge on worker nodes:

sudo apt-get install libmunge-dev libmunge2 munge
sudo systemctl enable munge
sudo systemctl start munge

If you want, you can test munge: munge -n | unmunge | grep STATUS
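To confirm that the copied key works across nodes, you can also run a cross-node test from the master (replace WORKER with your worker's hostname); STATUS should report Success:

munge -n | ssh WORKER unmunge | grep STATUS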

Prepare DB for SLURM

These instructions more or less follow this github repo: https://github.com/mknoxnv/ubuntu-slurm

First we want to clone the repo:

 git clone https://github.com/mknoxnv/ubuntu-slurm.git 

Install prereqs on BOTH Master and Worker:

sudo apt-get install git gcc make ruby ruby-dev libpam0g-dev libmysqlclient-dev mariadb-server build-essential libssl-dev -y
sudo gem install fpm

Next we set up MariaDB for storing SLURM data:

sudo systemctl enable mysql
sudo systemctl start mysql
sudo mysql -u root

Within mysql:

create database slurm_acct_db;
create user 'slurm'@'localhost';
set password for 'slurm'@'localhost' = password('slurmdbpass');
grant usage on *.* to 'slurm'@'localhost';
grant all privileges on slurm_acct_db.* to 'slurm'@'localhost';
flush privileges;
exit

Ideally you want to change the password to something other than slurmdbpass. The password must also be set in the config file ubuntu-slurm/slurmdbd.conf.
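For reference, the database-related settings in slurmdbd.conf look something like the lines below (the values shown are assumptions; match StoragePass to whatever password you set above):

SlurmUser=slurm
StorageType=accounting_storage/mysql
StorageHost=localhost
StorageUser=slurm
StoragePass=slurmdbpass
StorageLoc=slurm_acct_db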

Install SLURM

Download and install SLURM on Master

Build the SLURM .deb install file

It's best to check the downloads page and use the latest version (right-click the download link to copy its URL for the wget command below). Ideally we'd have a script to scrape the latest version and use that dynamically.
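If you would rather pick up the latest release automatically, a rough sketch (assuming the download page keeps listing tarballs named slurm-<version>.tar.bz2) is:

SLURM_TARBALL=$(curl -s https://download.schedmd.com/slurm/ | grep -oP 'slurm-[0-9.]+\.tar\.bz2' | sort -uV | tail -1)
wget "https://download.schedmd.com/slurm/${SLURM_TARBALL}"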

You can use the -j option to specify the number of CPU cores to use for 'make', like make -j12. htop is a nice package that will show usage stats and quickly show how many cores you have.

wget https://download.schedmd.com/slurm/slurm-20.11.9.tar.bz2
tar xvjf slurm-20.11.9.tar.bz2
cd slurm-20.11.9
./configure --prefix=/tmp/slurm-build --sysconfdir=/etc/slurm --enable-pam --with-pam_dir=/lib/x86_64-linux-gnu/security/ --without-shared-libslurm
make
make contrib
make install
cd ..

Install SLURM

sudo fpm -s dir -t deb -v 1.0 -n slurm-20.11.9 --prefix=/usr -C /tmp/slurm-build .
sudo dpkg -i slurm-20.11.9_1.0_amd64.deb

Make all the directories we need:

sudo mkdir -p /etc/slurm /etc/slurm/prolog.d /etc/slurm/epilog.d /var/spool/slurm/ctld /var/spool/slurm/d /var/log/slurm
sudo chown slurm /var/spool/slurm/ctld /var/spool/slurm/d /var/log/slurm

Copy slurm control and db services:

sudo cp ubuntu-slurm/slurmdbd.service /etc/systemd/system/
sudo cp ubuntu-slurm/slurmctld.service /etc/systemd/system/

The slurmdbd.conf file should be copied before starting the slurm services: sudo cp ubuntu-slurm/slurmdbd.conf /etc/slurm/

Start the slurm services:

sudo systemctl daemon-reload
sudo systemctl enable slurmdbd
sudo systemctl start slurmdbd
sudo systemctl enable slurmctld
sudo systemctl start slurmctld

If the master is also going to be a worker/compute node, you should do:

sudo cp ubuntu-slurm/slurmd.service /etc/systemd/system/
sudo systemctl enable slurmd
sudo systemctl start slurmd

Worker nodes

Now install SLURM on worker nodes:

Copy the slurm-20.11.9_1.0_amd64.deb file from the master to the worker nodes. And then on the WORKER node(s):

root@WORKER:
sudo dpkg -i slurm-20.11.9_1.0_amd64.deb
sudo cp ubuntu-slurm/slurmd.service /etc/systemd/system/
sudo systemctl enable slurmd
sudo systemctl start slurmd

Configuring SLURM

Next we need to set up the configuration file. Copy the default config from the github repo:

sudo cp ubuntu-slurm/slurm.conf /etc/slurm/slurm.conf

Note: to enforce job limits for users, you should add the line AccountingStorageEnforce=limits to the config file (see https://slurm.schedmd.com/resource_limits.html).
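For reference, the accounting-related portion of slurm.conf ends up looking something like this (the controller host name here is an assumption based on this guide's examples):

ClusterName=compute-cluster
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=master
AccountingStorageEnforce=limits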

Once SLURM is installed on all nodes, we can use the command

sudo slurmd -C

to print out the machine specs. Then we can copy this line into the config file and modify it slightly: add the number of GPUs in the system (the Gres=gpu:N part) and remove the trailing UpTime field. You can get the number of GPUs by running nvidia-smi. Here is an example of a config line:

NodeName=worker1 Gres=gpu:2 CPUs=12 Boards=1 SocketsPerBoard=1 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=128846

Take this line and put it at the bottom of slurm.conf.

Next, setup the gres.conf file. Lines in gres.conf should look like:

NodeName=worker1 Name=gpu File=/dev/nvidia0
NodeName=worker1 Name=gpu File=/dev/nvidia1
NodeName=worker2 Name=gpu File=/dev/nvidia0

If a node has multiple GPUs, keep adding a line per GPU for that node, incrementing the number after /dev/nvidia.

Gres has more options detailed in the docs: https://slurm.schedmd.com/slurm.conf.html (near the bottom).

Finally, we need to copy the .conf files to *all* machines: slurm.conf, gres.conf, cgroup.conf, and cgroup_allowed_devices_file.conf. SLURM does not work properly unless these files are present on every node.

sudo cp ubuntu-slurm/cgroup* /etc/slurm/
sudo cp ubuntu-slurm/slurm.conf /etc/slurm/
sudo cp ubuntu-slurm/gres.conf /etc/slurm/

This directory should also be created on ALL workers:

sudo mkdir -p /var/spool/slurm/d
sudo chown slurm /var/spool/slurm/d

After the conf files have been copied to all workers and the master node, you may want to reboot the computers, or at least restart the slurm services:

Workers:

sudo systemctl restart slurmd

Master:

sudo systemctl restart slurmctld
sudo systemctl restart slurmdbd
sudo systemctl restart slurmd  (only if your master node is part of the slurm queue, which ideally it is not, e.g. daytona-login, watgpu.cs)

Next we just create a cluster: sudo sacctmgr add cluster compute-cluster

Configure cgroups

cgroups allow SLURM to enforce memory limits on jobs and users. Enable memory cgroups on all workers by editing the GRUB configuration:

sudo nano /etc/default/grub

Change the following variable to:

GRUB_CMDLINE_LINUX="cgroup_enable=memory swapaccount=1"

Then apply the change:

sudo update-grub

Finally, I did one last sudo apt update, sudo apt upgrade, and sudo apt autoremove, then rebooted the computers: sudo reboot
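On the SLURM side, the cgroup.conf copied to /etc/slurm earlier is what actually turns these constraints on; a minimal cgroup.conf looks roughly like the following sketch (the repo's file is authoritative):

CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes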

If you are running cgroups on a config that requires nodes to be LXC containers, then the main host where the LXC container(s) will run, needs to have the grub updated to the following config:

Edit /etc/default/grub:

sudo nano /etc/default/grub

Change the following line to:

GRUB_CMDLINE_LINUX="systemd.unified_cgroup_hierarchy=0"

Then apply the change:

sudo update-grub

And you have to change the LXC config to NOT use cgroups2, so change any and all lines in the lxc configs:

lxc config edit container_name

change all instances of cgroup2 to just cgroup

Save and Exit

lxc restart container_name

Partitioning and Priority

Prior to creating partitions, you need to ensure a few things. First, the SelectType variable in /etc/slurm/slurm.conf must be set to select/linear. Next, you also want to add the following two lines to /etc/slurm/slurm.conf:

PreemptType=preempt/partition_prio
PreemptMode=SUSPEND,GANG

Remember to make these changes across ALL the machines, or simply edit slurm.conf on the master and copy the latest version to ALL the worker machines using scp. Once these changes have been made, you can create the partitions.

To create partitions, you need to add the following lines to the /etc/slurm/slurm.conf file on ALL the machines.

PartitionName=VISION Nodes=ALL Default=NO PriorityTier=100 AllowAccounts=vision_group OverSubscribe=FORCE:3
PartitionName=SCHOOL Nodes=ALL Default=YES PriorityTier=1 OverSubscribe=NO

Remember to run systemctl restart slurmctld on the master and systemctl restart slurmd on ALL the worker nodes.

The above partitions are examples from the watgpu cluster. The first one is for the VISION group: only users who are part of "vision_group" (a SLURM account, which is different from an AD group) may use it, and each member of the vision research group must be added to this group separately from any AD grouping. "PriorityTier" sets the partition's priority ranking; a partition with a higher PriorityTier takes precedence over those with lower values. So in this case, anybody who uses the SCHOOL partition (the default partition in the cluster) can have their jobs suspended if a VISION group user submits a job to the VISION partition.
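For users, choosing a partition is just the --partition option in their job script. A minimal example targeting the VISION partition might look like this (the script path and resource requests are hypothetical):

#!/bin/bash
#SBATCH --partition=VISION
#SBATCH --gres=gpu:1
#SBATCH --output=/storage/%u/%x_%j.out

srun python train.py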

Troubleshooting

When in doubt, first try updating software with sudo apt update; sudo apt upgrade -y and rebooting (sudo reboot).

Log files

You can also check the log files. Their locations are set in the slurm.conf file and default to /var/log/slurmd.log and /var/log/slurmctld.log. Open them with sudo nano /var/log/slurmctld.log; to jump to the bottom of the file in nano, use ctrl+_ then ctrl+v. I also changed the log paths to /var/log/slurm/slurmd.log and so on, and changed the folder to be owned by slurm: sudo chown slurm:slurm /var/log/slurm.

Checking SLURM states

Some helpful commands:

scontrol ping -- this checks if the controller node can be reached. If this isn't working (i.e. the command returns 'DOWN' and not 'UP'), you might need to allow connections to the slurmctld port (SlurmctldPort in the slurm.conf file), which is set to 6817 in the config file. To allow connections through the firewall, execute:

sudo ufw allow from any to any port 6817

and

sudo ufw reload
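If traffic to the compute nodes is also being filtered, the slurmd port may need the same treatment; the default SlurmdPort is 6818:

sudo ufw allow from any to any port 6818
sudo ufw reload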

Error codes 1:0 and 2:0

If you try to run a job with sbatch and the exit code is 1:0, this is usually a file-writing error. The first thing to check is that the output and error file paths in your .job file are correct. Also check that the path to the .py file you want to run is correct in your .job file. Then you should go to the logs (/var/log/slurm/slurmctld.log) and see which node the job was trying to run on. Then go to that node and open its logs (/var/log/slurm/slurmd.log) to see what they say; there may be something about the path for the output/error files or the path to the .py file being incorrect.

It could also mean your common storage location is not r/w accessible to all nodes. In the logs, this would show up as something about permissions and unable to write to the filesystem. Double-check that you can create files on the /storage location on all workers with something like touch testing.txt. If you can't create a file from the worker nodes, you probably have some sort of NFS issue. Go back to the NFS section and make sure everything looks ok. You should be able to create directories/files in /storage from any node with the admin account and they should show up as owned by the admin user. If not, you may have some issue in your /etc/exports or with your GID/UIDs not matching.

If the exit code is 2:0, this can mean there is some problem with either the location of the python executable, or some other error when running the python script. Double check that the srun or python script is working as expected with the python executable specified in the sbatch job file.
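For both cases it can help to compare against a known-good minimal job file; something like the following sketch (all paths and names here are hypothetical):

#!/bin/bash
#SBATCH --job-name=test
#SBATCH --output=/storage/admin/test_%j.out
#SBATCH --error=/storage/admin/test_%j.err

srun /usr/bin/python3 /storage/admin/test.py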

If some workers are 'draining', down, or unavailable, you might try:

sudo scontrol update NodeName=worker1 State=RESUME

Node is stuck draining (drng from sinfo)

This has happened due to the memory size in slurm.conf being higher than the actual memory size. Double check the memory from free -m or sudo slurmd -C and update slurm.conf on all machines in the cluster. Then run sudo scontrol update NodeName=worker1 State=RESUME

Nodes are not visible upon restart

After restarting the master node, sometimes the workers aren't there. I've found I often have to do sudo scontrol update NodeName=worker1 State=RESUME to get them working/available.

Taking a node offline

The best way to take a node offline for maintenance is to drain it: sudo scontrol update NodeName=worker1 State=DRAIN Reason='Maintenance'

Users can see the reason with sinfo -R

Testing GPU load

Using watch -n 0.1 nvidia-smi will show the GPU load in real-time. You can use this to monitor jobs as they are scheduled to make sure all the GPUs are being utilized.

Setting account options

You may want to limit jobs or submissions. Here is how to set attributes (-1 means no limit):
sudo sacctmgr modify account students set GrpJobs=-1
sudo sacctmgr modify account students set GrpSubmitJobs=-1
sudo sacctmgr modify account students set MaxJobs=-1
sudo sacctmgr modify account students set MaxSubmitJobs=-1
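To verify what the limits ended up as, you can list the account's associations (format field names can vary slightly between SLURM versions):

sacctmgr show assoc account=students format=Account,User,GrpJobs,MaxJobs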

Groups and membership

These are the commands to create the group:

sacctmgr add account name=vision_group description="VISION Group"

sacctmgr modify account name=vision_group set GrpNodes=VISION

sacctmgr add association cluster=compute-cluster account=vision_group partition=VISION

And then to create and add the user to the group:

sacctmgr add user name=ldpaniak account=vision_group

You can also do the following to view associations:

sacctmgr show association

Better sacct

This shows all running jobs with the user who is running them.

sacct --format=jobid,jobname,state,exitcode,user,account

More on sacct in the official documentation: https://slurm.schedmd.com/sacct.html

Changing IPs

If the IP addresses of your machines change, you will need to update these in the file /etc/hosts on all machines and /etc/exports on the master node. It's best to restart after making these changes.
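As a reminder of the format, the /etc/hosts entries look like this (IPs and hostnames below are made-up examples):

192.168.1.10    master
192.168.1.11    worker1
192.168.1.12    worker2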

Node not able to connect to slurmctld

If a node isn't able to connect to the controller (server/master), first check that time is properly synced. Try using the date command to see if the times are synced across the servers.
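A quick way to eyeball this across the cluster (hostnames here are examples) is:

for h in master worker1 worker2; do ssh "$h" date; done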

Configure resource restricted SSH access

To configure resource-limited SSH access, where users share resources on a single node (such as shared GPU access), you need the pam_slurm_adopt.so shared object file on the worker/compute nodes.

Make Install pam_slurm_adopt.so

Go back to the "head node" where slurm-20.11.9 was built and do the following:

cd slurm-20.11.9/contribs/pam_slurm_adopt/
make && make install

This will run through the install and create the shared object file. Locate the shared object and copy it to the worker node(s):

scp slurm-20.11.9/contribs/pam_slurm_adopt/.libs/pam_slurm_adopt.so WORKER1:/lib/x86_64-linux-gnu/security/pam_slurm_adopt.so

Edit /etc/slurm/slurm.conf on ALL nodes including Daytona-login and add this line:

PrologFlags=contain

Edit /etc/pam.d/sshd and add this line:

account    required     /lib/x86_64-linux-gnu/security/pam_slurm_adopt.so

Make sure no instances of pam_systemd.so exist in any of the /etc/pam.d/* files, i.e. run grep systemd /etc/pam.d/* and, if any lines reference pam_systemd.so, comment them out with #.

Stop and mask systemd-logind.service on the nodes:

systemctl stop systemd-logind
systemctl mask systemd-logind

You may need to run a slurm reconfigure on all the nodes (including the master node) if it doesn't work right away:

scontrol reconfigure
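To verify the restriction, try to ssh into a worker where you have no running job; pam_slurm_adopt should deny the login. Then allocate a job on that node and try again (node name and resource request are examples):

salloc -w worker1 --gres=gpu:1
ssh worker1

The ssh session should now succeed and be adopted into the job's cgroup, so it only sees the resources allocated to that job.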