---+ SLURM installation on Ubuntu

---+ Preparing for SLURM installation

---++ Munge and SLURM user creation

Create the Munge and SLURM users on EACH machine (master + nodes) so that their UIDs and GIDs are identical across the cluster:

<verbatim>
sudo adduser -u 1111 munge --disabled-password --gecos ""
sudo adduser -u 1121 slurm --disabled-password --gecos ""
</verbatim>

---++ Passwordless ssh from Master to worker nodes

First we need passwordless SSH from the master to the compute nodes. As before, we use =master= as the master node hostname and =worker= as the worker hostname. On the master:

<verbatim>
ssh-keygen
ssh-copy-id admin@worker
</verbatim>

---++ Install munge on the master

<verbatim>
sudo apt-get install libmunge-dev libmunge2 munge -y
sudo systemctl enable munge
sudo systemctl start munge
</verbatim>

Test munge if you like: =munge -n | unmunge | grep STATUS=

Copy the munge key to all the WORKER nodes at =/etc/munge/munge.key= (make sure it ends up owned by the munge user):

<verbatim>
ssh WORKER mkdir -p /etc/munge
scp /etc/munge/munge.key WORKER:/etc/munge/
ssh WORKER chown -R munge:munge /etc/munge/
</verbatim>

---++ Install munge on worker nodes

<verbatim>
sudo apt-get install libmunge-dev libmunge2 munge
sudo systemctl enable munge
sudo systemctl start munge
</verbatim>

If you want, you can test munge: =munge -n | unmunge | grep STATUS=

---++ Prepare DB for SLURM

These instructions more or less follow this github repo: https://github.com/mknoxnv/ubuntu-slurm

First we want to clone the repo:

<verbatim>
git clone https://github.com/mknoxnv/ubuntu-slurm.git
</verbatim>

Install prereqs on BOTH master and workers:

<verbatim>
sudo apt-get install git gcc make ruby ruby-dev libpam0g-dev libmysqlclient-dev mariadb-server build-essential libssl-dev -y
sudo gem install fpm
</verbatim>

Next we set up MariaDB for storing SLURM data:

<verbatim>
sudo systemctl enable mysql
sudo systemctl start mysql
sudo mysql -u root
</verbatim>

Within mysql:

<verbatim>
create database slurm_acct_db;
create user 'slurm'@'localhost';
set password for 'slurm'@'localhost' = password('slurmdbpass');
grant usage on *.* to 'slurm'@'localhost';
grant all privileges on slurm_acct_db.* to 'slurm'@'localhost';
flush privileges;
exit
</verbatim>

Ideally you want to change the password to something other than =slurmdbpass=. Whatever you choose must also be set in the config file =ubuntu-slurm/slurmdbd.conf=.

---+ Install SLURM

---++ Download and install SLURM on the master

---+++ Build the SLURM .deb install file

Check the SchedMD downloads page and use the latest version (right-click the download link to copy its URL for the wget command). Ideally we'd have a script to scrape the latest version and use that dynamically.

You can use the -j option to specify the number of CPU cores to use for =make=, e.g. =make -j12=. =htop= is a handy package that shows usage stats and quickly tells you how many cores you have.

<verbatim>
wget https://download.schedmd.com/slurm/slurm-20.11.9.tar.bz2
tar xvjf slurm-20.11.9.tar.bz2
cd slurm-20.11.9
./configure --prefix=/tmp/slurm-build --sysconfdir=/etc/slurm --enable-pam --with-pam_dir=/lib/x86_64-linux-gnu/security/ --without-shared-libslurm
make
make contrib
make install
cd ..
</verbatim>

---+++ Install SLURM

<verbatim>
sudo fpm -s dir -t deb -v 1.0 -n slurm-20.11.9 --prefix=/usr -C /tmp/slurm-build .
sudo dpkg -i slurm-20.11.9_1.0_amd64.deb
</verbatim>
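As a quick optional sanity check (assuming the package built and installed cleanly), you can confirm the packaged binaries are on the PATH and report the expected version:

<verbatim>
slurmctld -V
slurmd -V
sinfo --version
</verbatim>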
Make all the directories we need:

<verbatim>
sudo mkdir -p /etc/slurm /etc/slurm/prolog.d /etc/slurm/epilog.d /var/spool/slurm/ctld /var/spool/slurm/d /var/log/slurm
sudo chown slurm /var/spool/slurm/ctld /var/spool/slurm/d /var/log/slurm
</verbatim>

Copy the slurm control and db services:

<verbatim>
sudo cp ubuntu-slurm/slurmdbd.service /etc/systemd/system/
sudo cp ubuntu-slurm/slurmctld.service /etc/systemd/system/
</verbatim>

The slurmdbd.conf file should be copied before starting the slurm services: =sudo cp ubuntu-slurm/slurmdbd.conf /etc/slurm/=

Start the slurm services:

<verbatim>
sudo systemctl daemon-reload
sudo systemctl enable slurmdbd
sudo systemctl start slurmdbd
sudo systemctl enable slurmctld
sudo systemctl start slurmctld
</verbatim>

If the master is also going to be a worker/compute node, you should do:

<verbatim>
sudo cp ubuntu-slurm/slurmd.service /etc/systemd/system/
sudo systemctl enable slurmd
sudo systemctl start slurmd
</verbatim>

---++ Worker nodes

Now install SLURM on the worker nodes. Copy the slurm-20.11.9_1.0_amd64.deb file from the master to the worker nodes, then on each WORKER node:

<verbatim>
sudo dpkg -i slurm-20.11.9_1.0_amd64.deb
sudo cp ubuntu-slurm/slurmd.service /etc/systemd/system/
sudo systemctl enable slurmd
sudo systemctl start slurmd
</verbatim>

---++ Configuring SLURM

Next we need to set up the configuration file. Copy the default config from the github repo: =cp ubuntu-slurm/slurm.conf /etc/slurm/slurm.conf=

Note: to enforce job limits for users, you should add the =AccountingStorageEnforce=limits= line to the config file (see https://slurm.schedmd.com/resource_limits.html).

Once SLURM is installed on all nodes, we can use the command =sudo slurmd -C= to print out the machine specs. We then copy this line into the config file and modify it slightly: add the number of GPUs in the system (you can get it from =nvidia-smi=) and remove the trailing UpTime part. Here is an example of a config line:

=NodeName=worker1 Gres=gpu:2 CPUs=12 Boards=1 SocketsPerBoard=1 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=128846=

Take this line and put it at the bottom of =slurm.conf=.

Next, set up the =gres.conf= file. Lines in =gres.conf= should look like:

<verbatim>
NodeName=worker1 Name=gpu File=/dev/nvidia0
NodeName=worker1 Name=gpu File=/dev/nvidia1
NodeName=worker2 Name=gpu File=/dev/nvidia0
</verbatim>

If a node has multiple GPUs, keep adding lines for it and increment the number after =nvidia=. Gres has more options detailed in the docs (near the bottom): https://slurm.schedmd.com/slurm.conf.html
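To tie the two steps above together, here is what it might look like for a hypothetical worker (the hostname, core counts, memory size, and GPU count below are invented for illustration):

<verbatim>
# Hypothetical output of 'sudo slurmd -C' on worker3 (UpTime line dropped):
#   NodeName=worker3 CPUs=24 Boards=1 SocketsPerBoard=1 CoresPerSocket=12 ThreadsPerCore=2 RealMemory=257600

# Line appended to /etc/slurm/slurm.conf, with Gres added for the 2 GPUs nvidia-smi reports:
NodeName=worker3 Gres=gpu:2 CPUs=24 Boards=1 SocketsPerBoard=1 CoresPerSocket=12 ThreadsPerCore=2 RealMemory=257600

# Matching lines in /etc/slurm/gres.conf:
NodeName=worker3 Name=gpu File=/dev/nvidia0
NodeName=worker3 Name=gpu File=/dev/nvidia1
</verbatim>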
Finally, we need to copy the .conf files to ALL machines: =slurm.conf=, =gres.conf=, =cgroup.conf=, and =cgroup_allowed_devices_file.conf=. Without these files things don't seem to work.

<verbatim>
sudo cp ubuntu-slurm/cgroup* /etc/slurm/
sudo cp ubuntu-slurm/slurm.conf /etc/slurm/
sudo cp ubuntu-slurm/gres.conf /etc/slurm/
</verbatim>

This directory should also be created on ALL workers:

<verbatim>
sudo mkdir -p /var/spool/slurm/d
sudo chown slurm /var/spool/slurm/d
</verbatim>

After the conf files have been copied to all workers and the master node, you may want to reboot the computers, or at least restart the slurm services.

Workers: =sudo systemctl restart slurmd=

Master:

<verbatim>
sudo systemctl restart slurmctld
sudo systemctl restart slurmdbd
sudo systemctl restart slurmd   # only if the master node is part of the slurm queue, which ideally it is not (e.g. daytona-login, watgpu.cs)
</verbatim>

Next we just create a cluster: =sudo sacctmgr add cluster compute-cluster=

---++ Configure cgroups

cgroups let SLURM enforce memory limits on jobs and users. Enable memory cgroups on all workers by editing =/etc/default/grub=:

<verbatim>
sudo nano /etc/default/grub
# change the following variable to:
GRUB_CMDLINE_LINUX="cgroup_enable=memory swapaccount=1"
sudo update-grub
</verbatim>

Finally, I did one last =sudo apt update=, =sudo apt upgrade=, and =sudo apt autoremove=, then rebooted the computers: =sudo reboot=

If your nodes are LXC containers, the main host where the LXC container(s) run needs its grub updated to the following config. Edit =/etc/default/grub=:

<verbatim>
sudo nano /etc/default/grub
# change the following line to:
GRUB_CMDLINE_LINUX="systemd.unified_cgroup_hierarchy=0"
sudo update-grub
</verbatim>

You also have to change the LXC config to NOT use cgroup2, so change any and all such lines in the lxc configs:

<verbatim>
lxc config edit container_name
# change all instances of cgroup2 to just cgroup, then save and exit
lxc restart container_name
</verbatim>

---++ Partitioning and Priority

Prior to creating partitions, you need to ensure a few things. First, the =SelectType= variable in =/etc/slurm/slurm.conf= must be set to =select/linear=. Next, add the following two lines to =/etc/slurm/slurm.conf=:

<verbatim>
PreemptType=preempt/partition_prio
PreemptMode=SUSPEND,GANG
</verbatim>

Remember to make these changes across ALL the machines, or simply edit the latest =slurm.conf= on the master and copy it to ALL the worker machines with =scp=.

Once these changes have been made, you can create the partitions by adding lines like the following to =/etc/slurm/slurm.conf= on ALL the machines:

<verbatim>
PartitionName=VISION Nodes=ALL Default=NO PriorityTier=100 AllowAccounts=vision_group OverSubscribe=FORCE:3
PartitionName=SCHOOL Nodes=ALL Default=YES PriorityTier=1 OverSubscribe=NO
</verbatim>

Remember to run =systemctl restart slurmctld= on the master and =systemctl restart slurmd= on ALL the worker nodes.

The partitions above are examples from the watgpu cluster. The first is for the VISION group: only users who are members of the =vision_group= SLURM account (which is different from an AD group) can use it, and each user in the vision research group must be added to this account separately from their AD grouping. =PriorityTier= sets the priority ranking: a larger PriorityTier value takes precedence over smaller ones. So in this case, anybody using the SCHOOL partition (the default partition in the cluster) can have their jobs preempted and suspended if a VISION group user submits a job to the VISION partition.
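Once everything is restarted, a minimal GPU batch job is a good end-to-end test of the node, gres, and partition setup. This is only a sketch: the partition name, resource sizes, and paths are examples and should be adjusted for your cluster.

<verbatim>
#!/bin/bash
#SBATCH --job-name=gputest
#SBATCH --partition=SCHOOL
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=2
#SBATCH --mem=4G
#SBATCH --time=00:05:00
#SBATCH --output=%x-%j.out

# Report which node we landed on and which GPU was assigned
hostname
nvidia-smi
</verbatim>

Submit it with =sbatch gputest.job= and check the queue with =squeue=; =sinfo= should show both partitions and the expected node states.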
---+ Troubleshooting

When in doubt, first try updating software with =sudo apt update; sudo apt upgrade -y= and rebooting (=sudo reboot=).

---++ Log files

When in doubt, you can check the log files. The locations are set in the slurm.conf file and are =/var/log/slurmd.log= and =/var/log/slurmctld.log= by default. Open them with =sudo nano /var/log/slurmctld.log=. To go to the bottom of the file, use ctrl+_ and ctrl+v.

I also changed the log paths to =/var/log/slurm/slurmd.log= and so on, and changed the permissions of the folder so it is owned by slurm: =sudo chown slurm:slurm /var/log/slurm=.

---++ Checking SLURM states

Some helpful commands:

=scontrol ping= -- this checks if the controller node can be reached. If this isn't working (i.e. the command returns 'DOWN' and not 'UP'), you might need to allow connections to the slurmctld port (=SlurmctldPort= in the slurm.conf file, set to 6817 here). To allow connections through the firewall, execute: =sudo ufw allow from any to any port 6817= and =sudo ufw reload=

---++ Error codes 1:0 and 2:0

If trying to run a job with =sbatch= and the exit code is 1:0, this is usually a file writing error. The first thing to check is that the output and error file paths in the .job file are correct. Also check that the .py file you want to run has the correct filepath in your .job file.

Then you should go to the logs (=/var/log/slurm/slurmctld.log=) and see which node the job was trying to run on. Then go to that node and open its logs (=/var/log/slurm/slurmd.log=) to see what they say. They may say something about the path for the output/error files, or that the path to the .py file is incorrect.

It could also mean your common storage location is not read/write accessible to all nodes. In the logs, this shows up as something about permissions and being unable to write to the filesystem. Double-check that you can create files in the /storage location on all workers with something like =touch testing.txt=. If you can't create a file from the worker nodes, you probably have some sort of NFS issue: go back to the NFS section and make sure everything looks ok. You should be able to create directories/files in /storage from any node with the admin account, and they should show up as owned by the admin user. If not, you may have an issue in your /etc/exports or with your GID/UIDs not matching.

If the exit code is 2:0, there is usually some problem with either the location of the python executable or some other error when running the python script. Double-check that the srun or python script works as expected with the python executable specified in the sbatch job file.

If some workers are 'draining', down, or unavailable, you might try: =sudo scontrol update NodeName=worker1 State=RESUME=

---++ Node is stuck draining (drng from =sinfo=)

This has happened when the memory size in slurm.conf was higher than the actual memory size. Double-check the memory with =free -m= or =sudo slurmd -C= and update slurm.conf on all machines in the cluster. Then run =sudo scontrol update NodeName=worker1 State=RESUME=

---++ Nodes are not visible upon restart

After restarting the master node, sometimes the workers aren't there. I've found I often have to do =sudo scontrol update NodeName=worker1 State=RESUME= to get them working/available.
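If several nodes are down or draining at once, a small loop saves some typing (just a sketch; review the list =sinfo= prints before resuming anything):

<verbatim>
# List unique nodes currently in a down or drain state, then resume each one
for n in $(sinfo -N -h -t down,drain -o "%N" | sort -u); do
    sudo scontrol update NodeName=$n State=RESUME
done
</verbatim>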
---++ Taking a node offline

The best way to take a node offline for maintenance is to drain it:

=sudo scontrol update NodeName=worker1 State=DRAIN Reason='Maintenance'=

Users can see the reason with =sinfo -R=

---++ Testing GPU load

=watch -n 0.1 nvidia-smi= will show the GPU load in real time. You can use this to monitor jobs as they are scheduled to make sure all the GPUs are being utilized.

---++ Setting account options

You may want to limit jobs or submissions. Here is how to set the attributes (-1 means no limit):

<verbatim>
sudo sacctmgr modify account students set GrpJobs=-1
sudo sacctmgr modify account students set GrpSubmitJobs=-1
sudo sacctmgr modify account students set MaxJobs=-1
sudo sacctmgr modify account students set MaxSubmitJobs=-1
</verbatim>

---++ Groups and membership

These are the commands to create the group:

<verbatim>
sacctmgr add account name=vision_group description="VISION Group"
sacctmgr modify account name=vision_group set GrpNodes=VISION
sacctmgr add association cluster=compute-cluster account=vision_group partition=VISION
</verbatim>

And then to create and add a user to the group:

<verbatim>
sacctmgr add user name=ldpaniak account=vision_group
</verbatim>

You can also view associations with:

<verbatim>
sacctmgr show association
</verbatim>

---+ Better sacct

This shows all running jobs along with the user who is running them:

=sacct --format=jobid,jobname,state,exitcode,user,account=

More on sacct [[https://slurm.schedmd.com/sacct.html][here]].

---+ Changing IPs

If the IP addresses of your machines change, you will need to update them in =/etc/hosts= on all machines and in =/etc/exports= on the master node. It's best to restart after making these changes.

---+ Node not able to connect to slurmctld

If a node isn't able to connect to the controller (server/master), first check that time is properly synced. Try using the =date= command to see if the times are synced across the servers.

---+ Configure resource restricted SSH access

To configure resource-limited SSH access, where users share resources on a single node (such as shared GPU access), you need the =pam_slurm_adopt.so= shared object on the worker/compute nodes.

---++ Make and install pam_slurm_adopt.so

Go back to the head node where slurm-20.11.9 was built and do the following:

<verbatim>
cd slurm-20.11.9/contribs/pam_slurm_adopt/
make && make install
</verbatim>

This will run through the install and create the shared object file. Locate the shared object and copy it to the worker node:

<verbatim>
scp slurm-20.11.9/contribs/pam_slurm_adopt/.libs/pam_slurm_adopt.so WORKER1:/lib/x86_64-linux-gnu/security/pam_slurm_adopt.so
</verbatim>

Edit =/etc/slurm/slurm.conf= on ALL nodes, including Daytona-login, and add this line:

<verbatim>
PrologFlags=contain
</verbatim>

Edit =/etc/pam.d/sshd= and add this line:

<verbatim>
account required /lib/x86_64-linux-gnu/security/pam_slurm_adopt.so
</verbatim>

Make sure no instances of =pam_systemd.so= exist in any of the =/etc/pam.d/*= files: run =grep systemd /etc/pam.d/*= and, if any lines reference =pam_systemd.so=, comment them out with =#=.

Stop and mask systemd-logind on the nodes:

<verbatim>
systemctl stop systemd-logind
systemctl mask systemd-logind
</verbatim>

You may need to run a SLURM reconfigure on all the nodes (including the master node) if it doesn't work right away:

<verbatim>
scontrol reconfigure
</verbatim>
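To check that the adoption is actually working (a rough sketch; =worker1= is just an example hostname), SSH to a compute node should be refused while you have no job there and succeed once you do:

<verbatim>
# With no job allocated on worker1, this should be rejected by pam_slurm_adopt:
ssh worker1

# Allocate the node first, then SSH in from another terminal; it should now succeed
# and the shell will be adopted into the job's allocation:
salloc --nodelist=worker1 --time=10:00
ssh worker1
</verbatim>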