--
MichaelHynes - 2015-01-27
Himrod cluster
This is a cluster belonging to Ashraf Aboulnaga and Hans
DeSterck, purchased in May 2014 and running since October 2014
Sysadmin notes
Admin tools
- IMPORTANT: Currently I have a number of tools located under /cscf-adm/src/cluster - the plan is to move all of them to /usr/local/bin in the very near future.
- The tools listed below have already been moved there
add_users
- What:
- Add users from CSV; optionally specify home directory, email address, password and groups (a hedged sketch of this pattern follows this list)
- If the group is admin, the user will get all of the admin groups
- See:
- sync-users to sync all user changes to all of the nodes
- Notes:
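- A minimal sketch of the add-from-CSV pattern (this is NOT the real add_users script; the CSV column order and the useradd/chpasswd calls are assumptions):
#!/bin/bash
# Hypothetical sketch only - assumed CSV columns: userid,home,email,password,groups
# (assumes every column is filled in)
while IFS=, read -r userid home email password groups
do
    useradd -m -d "$home" -c "$email" -G "$groups" "$userid"
    echo "$userid:$password" | chpasswd
done < users.csv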
del-user
- Delete a user on all of the nodes
- Usage:
- Notes:
sync-users
- Sync user accounts, passwords, ssh keys and group settings to all of the nodes (a sketch of the general pattern follows this list)
- Usage:
- sync-users
- This command can be run any time, and more than once, without harm
- Notes:
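- A minimal sketch of the general pattern (this is NOT the real sync-users script; the file list is an assumption):
#!/bin/bash
# Hypothetical sketch only - push the account databases from himrod to every node
# NODES (from /usr/local/bin) defines $NODES, as in the install script below
. NODES
for i in $NODES
do
    rsync -a /etc/passwd /etc/shadow /etc/group /etc/gshadow root@"$i":/etc/
done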
Checking nodes and NFS mounts
- /cscf-adm/src/cluster/fix-mount
- Verifies NFS mounts are working - mounts them if not
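- A sketch of the kind of check fix-mount performs (the mount point below is a placeholder, not necessarily the cluster's real one):
#!/bin/bash
# Hypothetical sketch only - verify an NFS mount point and remount it if missing
MOUNTPOINT=/himrod-shared    # placeholder - substitute the real NFS mount point
if mountpoint -q "$MOUNTPOINT"
then
    echo "$MOUNTPOINT is mounted"
else
    echo "$MOUNTPOINT is NOT mounted - remounting"
    mount "$MOUNTPOINT"      # assumes an /etc/fstab entry exists for it
fi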
Reset node using iDrac
- /cscf-adm/src/IiDrac/reset_node ilom-NODE - where NODE is listed in the table below
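- If the script is unavailable, the reset can usually also be done directly over IPMI, assuming IPMI-over-LAN is enabled on the iDRACs (replace the password placeholder with the real cscf-adm one):
# Hedged alternative - standard ipmitool commands, not the reset_node script itself
ipmitool -I lanplus -H ilom-himrod-1 -U cscf-adm -P 'PASSWORD' chassis power status
ipmitool -I lanplus -H ilom-himrod-1 -U cscf-adm -P 'PASSWORD' chassis power cycle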
ILOM management
- You can do full KVM management of each node from himrod (himrod has to be up)
- This includes powering up, console, booting remote media from your own desktop, etc
- Log onto himrod with X forwarding enabled: ssh -X userid@himrod.cs.uwaterloo.ca
- Start firefox on himrod using these options: firefox --no-remote
- Each node has an ILOM address (listed below); open that URL, for example https://ilom-himrod-1
- Accept the security certificate
- Login userid: cscf-adm - fall 2013 password - see CSCF staff
- We will be creating a himrod user shortly
172.19.128.10 ilom-himrod-storage
172.19.128.11 ilom-himrod-big-1
172.19.128.12 ilom-himrod-big-2
172.19.128.13 ilom-himrod-big-3
172.19.128.14 ilom-himrod-big-4
172.19.128.101 ilom-himrod-1
172.19.128.102 ilom-himrod-2
172.19.128.103 ilom-himrod-3
172.19.128.104 ilom-himrod-4
172.19.128.105 ilom-himrod-5
172.19.128.106 ilom-himrod-6
172.19.128.107 ilom-himrod-7
172.19.128.108 ilom-himrod-8
172.19.128.109 ilom-himrod-9
172.19.128.110 ilom-himrod-10
172.19.128.111 ilom-himrod-11
172.19.128.112 ilom-himrod-12
172.19.128.113 ilom-himrod-13
172.19.128.114 ilom-himrod-14
172.19.128.115 ilom-himrod-15
172.19.128.116 ilom-himrod-16
172.19.128.117 ilom-himrod-17
172.19.128.118 ilom-himrod-18
172.19.128.119 ilom-himrod-19
172.19.128.120 ilom-himrod-20
172.19.128.121 ilom-himrod-21
172.19.128.122 ilom-himrod-22
172.19.128.123 ilom-himrod-23
check-nodes
- Check whether each node is online or not
- Usage:
- Notes:
- This can be used as a common template for performing a task on all nodes (see the sketch below)
- Check to see if all of the nodes are online
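- A minimal sketch of that template (it mirrors the install-mpi loop shown later; the echo is just a placeholder for the real per-node task):
#!/bin/bash
# Hypothetical sketch only - NODES (from /usr/local/bin) defines $NODES
. NODES
for i in $NODES
do
    if ping -c 1 "$i" >/dev/null 2>&1
    then
        echo "$i is up"      # replace with the real per-node task, e.g. ssh root@"$i" ...
    else
        echo "$i is down"
    fi
done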
Packages
Finding packages
- Example: apt-cache search postgres -n
- Search for postgres in package names only (not in the descriptions)
- Example: apt-cache search postgres
- Search for postgres in package names and the full descriptions
Example script to install packages on all of the nodes
- /cscf-adm/src/cluster/install-mpi
- The script first installs the packages listed in the script on himrod and then on the nodes
- The script is only 27 lines long and you will only have to change 2 lines! (see the adaptation note after the script)
- NODES and common_vars are pulled in from the search path - in this case: /usr/local/bin
- (i.e. they do NOT have to be in the current directory)
#!/bin/bash
#
# Mike Gore, 10 Oct 2014
#
# Install openmpi on the nodes and headnode
. common_vars
. NODES
# Update the apt package list and install the Open MPI packages on himrod itself
update_list
update_packages netpipe-openmpi openmpi-bin openmpi-checkpoint openmpi-common openmpi-doc
for i in $NODES
do
if ping -c 1 $i >/dev/null 2>&1
then
# Run the same update steps on the node over ssh via a here-document
cat <<EOF | ssh root@"$i"
. common_vars
. NODES
update_list
update_packages netpipe-openmpi openmpi-bin openmpi-checkpoint openmpi-common openmpi-doc
EOF
else
echo $i is down
fi
done
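For example, to adapt the script for a different package set (the package names here are only an illustration), the two update_packages lines are the only ones that change:
update_packages postgresql postgresql-client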
Disks on nodes
- Each node has disks mounted with names /localdiskN where N is 0 .. 5
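- A quick way to spot-check those mounts on one node (himrod-1 is an example node name):
ssh root@himrod-1 'df -h /localdisk0 /localdisk1 /localdisk2 /localdisk3 /localdisk4 /localdisk5'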
Tools
OPENMPI
- I have installed OpenMPI on himrod and all of the nodes
- As of 10 Oct 2014, only the configuration has not yet been completed
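- Once the configuration is done, a quick smoke test would be to run a trivial command across a couple of nodes with a hostfile (node names and slot counts are assumptions; passwordless ssh between the nodes is assumed to be set up):
# Hypothetical smoke test - standard Open MPI usage, not a cluster-specific script
cat > /tmp/hostfile <<EOF
himrod-1 slots=2
himrod-2 slots=2
EOF
mpirun --hostfile /tmp/hostfile -np 4 hostname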
Limits
I have added the following lines to the /etc/security/limits.conf file for now (until I am given a better idea of the other entries we must set)
(I used the /cscf-adm/src/cluster/sync-users script to update this file on the nodes - I ran it as root)
Note: the nodes may need to be restarted
There is a script that does this restart correctly under /cscf-adm/src/cluster, called restart_all_nodes
(Must be run as root)
# system defaults
* hard cpu unlimited
* hard nproc unlimited
* hard as unlimited
* hard data unlimited
* hard sigpending unlimited
* hard nofile unlimited
* hard msgqueue unlimited
* hard locks unlimited
* hard fsize unlimited
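After a node has been restarted, the new hard limits can be spot-checked per node, for example (himrod-1 is an example node name):
ssh root@himrod-1 'ulimit -Hn; ulimit -Hu'   # hard nofile and nproc limits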