Himrod cluster
This is a cluster belonging to Ashraf Aboulnaga and Hans DeSterck, purchased in May 2014 and running since October 2014.
Overview
| role | count | cpu | memory | disk | interconnect | other |
| himrod.cs | 1 | 2x Intel E5-2670 (8C) | 32GB | Dell RAID (2TB) | 4x 10GbE + 2x GbE | |
| himrod-big-[1-4] | 4 | 2x Intel E5-2670 (8C) | 512GB | 6x 600GB + 1x 200GB | 2x 10GbE + 2x GbE | |
| himrod-[1-23] | 23 | 2x Intel E5-2670 (8C) | 256GB | 6x 600GB + 1x 200GB | 2x 10GbE + 2x GbE | |
| himrod-storage | 1 | 2x Intel E5-2620 (6C) | 64GB | Dell RAID (50TB) | 2x 10GbE + 2x GbE | |
Sysadmin notes
Admin tools
- IMPORTANT: Currently I have a number of tools located under /cscf-adm/src/cluster; the plan is to move all of them to /usr/local/bin in the very near future.
- The tools listed below have already been moved there
add_users file.csv
- What:
- Add users from a CSV file, optionally specifying home directory, email address, password and groups
- If the group is admin, the user will be added to all of the admin groups
- The script will email the user their password and account information
- Usage: ./add_users file.csv
- See:
- sync-users, which syncs all user changes to all of the nodes
- Notes:
- Needs root
- If the user's email address is NOT in uwdir, you must add the full email address to the CSV file
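The CSV layout that add_users expects is not documented on this page. Purely as an illustration, the following dry-run sketch assumes a hypothetical layout of userid,home,email,password,groups (every field after userid optional, groups separated by ';') and prints the useradd commands such a script might run. plan_add_users is a made-up name, not part of the real tool.

```shell
#!/bin/bash
# Hypothetical dry-run sketch, NOT the real add_users script.
# Assumed CSV layout: userid,home,email,password,groups
#   - every field after userid is optional
#   - groups are separated by ';' inside the groups field
# email and password are read but unused here; the real script
# emails the account details to the user.
plan_add_users() {
    local user home email password groups cmd
    # "|| [ -n "$user" ]" also handles a last line with no trailing newline
    while IFS=, read -r user home email password groups || [ -n "$user" ]; do
        [ -z "$user" ] && continue            # skip blank lines
        cmd="useradd -m"                      # -m creates the home directory
        if [ -n "$home" ]; then
            cmd+=" -d $home"
        fi
        if [ -n "$groups" ]; then
            cmd+=" -G ${groups//;/,}"         # useradd -G wants a comma list
        fi
        echo "$cmd $user"
    done < "$1"
}

# Example: plan_add_users file.csv
```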
del-user
- Delete a user on all of the nodes
- Usage:
- Notes:
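The page does not show how del-user works internally. A minimal sketch of the idea, assuming root ssh access to each node (as the install-mpi script below also assumes) and node names passed as arguments: del_user_everywhere is a hypothetical helper, and setting DRY_RUN to any non-empty value makes it print the commands instead of running them.

```shell
#!/bin/bash
# Hypothetical sketch, NOT the real del-user script.
# del_user_everywhere USER NODE...: run "userdel USER" as root on each node.
# With DRY_RUN non-empty, each command is printed instead of executed.
del_user_everywhere() {
    local user=$1 node
    shift
    for node in "$@"; do
        # ${DRY_RUN:+echo} expands to "echo" only when DRY_RUN is non-empty
        ${DRY_RUN:+echo} ssh "root@$node" userdel "$user"
    done
}

# Example (prints the commands only; "olduser" is a placeholder):
#   . NODES; DRY_RUN=1 del_user_everywhere olduser $NODES
```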
sync-users
- Sync user accounts, passwords, SSH keys and group settings to all of the nodes
- Usage:
- sync-users
- This command can be run at any time, and more than once, without harm
- Notes:
Checking nodes and NFS mounts
- /cscf-adm/src/cluster/fix-mount
- Verifies that the NFS mounts are working and mounts them if they are not
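The fix-mount script itself is not reproduced here. A minimal sketch of the same kind of check, assuming the NFS mount points have entries in /etc/fstab (so that a bare "mount DIR" works) and that the util-linux mountpoint command is installed: check_mounts is a hypothetical name, and FIX=1 makes it try to mount anything that is missing (which needs root).

```shell
#!/bin/bash
# Hypothetical sketch, NOT the real fix-mount script.
# check_mounts DIR...: report whether each DIR is currently a mount point.
# With FIX=1, try "mount DIR" (relies on an /etc/fstab entry) for each one
# that is not. Returns the number of directories that were not mounted.
check_mounts() {
    local mp missing=0
    for mp in "$@"; do
        if mountpoint -q "$mp" 2>/dev/null; then
            echo "$mp ok"
        else
            echo "$mp not mounted"
            missing=$((missing + 1))
            if [ "${FIX:-0}" = 1 ]; then
                mount "$mp" && echo "$mp remounted"
            fi
        fi
    done
    return "$missing"
}

# Example ("/home" is only an illustrative mount point):
#   FIX=1 check_mounts /home
```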
check-nodes
- Check whether each node is online
- Usage:
- Notes:
- This script can be used as a common template for performing a task on all nodes
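The source of check-nodes is not shown here, so the following is only a sketch of the "do something on every node" pattern it is described as a template for. It assumes $NODES holds space-separated host names, as in the install-mpi script; node_is_up and check_nodes are hypothetical names, and the 1-second ping timeout (-W 1, iputils ping) is an arbitrary choice.

```shell
#!/bin/bash
# Hypothetical sketch of a check-nodes style loop.

# node_is_up NODE: succeed if NODE answers one ping within 1 second.
node_is_up() {
    ping -c 1 -W 1 "$1" >/dev/null 2>&1
}

# check_nodes NODE...: print "NODE up" or "NODE down" for each node and
# return the number of nodes that are down (0 means all are up).
check_nodes() {
    local down=0 node
    for node in "$@"; do
        if node_is_up "$node"; then
            echo "$node up"
        else
            echo "$node down"
            down=$((down + 1))
        fi
    done
    return "$down"
}

# Example: . NODES; check_nodes $NODES
```

To turn this into a template for another task, replace the body of the "up" branch with the command to run on the node (for example an ssh call).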
ILOM management
- You can do full KVM management of each node (himrod must be up)
- This includes powering up, console access, booting remote media from your own desktop, etc.
- Browse directly to the node's ILOM address, as listed below
- Accept the security certificate
- Login userid: cscf-adm; password: the fall 2013 password (see CSCF staff)
10.0.150.10 ilom-himrod-storage
10.0.150.11 ilom-himrod-big-1
10.0.150.12 ilom-himrod-big-2
10.0.150.13 ilom-himrod-big-3
10.0.150.14 ilom-himrod-big-4
10.0.150.101 ilom-himrod-1
10.0.150.102 ilom-himrod-2
10.0.150.103 ilom-himrod-3
10.0.150.104 ilom-himrod-4
10.0.150.105 ilom-himrod-5
10.0.150.106 ilom-himrod-6
10.0.150.107 ilom-himrod-7
10.0.150.108 ilom-himrod-8
10.0.150.109 ilom-himrod-9
10.0.150.110 ilom-himrod-10
10.0.150.111 ilom-himrod-11
10.0.150.112 ilom-himrod-12
10.0.150.113 ilom-himrod-13
10.0.150.114 ilom-himrod-14
10.0.150.115 ilom-himrod-15
10.0.150.116 ilom-himrod-16
10.0.150.117 ilom-himrod-17
10.0.150.118 ilom-himrod-18
10.0.150.119 ilom-himrod-19
10.0.150.120 ilom-himrod-20
10.0.150.121 ilom-himrod-21
10.0.150.122 ilom-himrod-22
10.0.150.123 ilom-himrod-23
10.0.150.27 ilom-himrod-switch
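The addresses above follow a regular pattern, so a small lookup helper can save scrolling through the list. This is a sketch derived only from the list above; ilom_ip is a hypothetical name. The ipmitool line in the trailing comment additionally assumes that IPMI over LAN is enabled on the management controllers, which this page does not confirm.

```shell
#!/bin/bash
# Hypothetical helper: ilom_ip HOST
# Prints the ILOM address of a Himrod host according to the list above:
#   himrod-storage -> .10, himrod-big-N -> .(10+N), himrod-N -> .(100+N),
#   himrod-switch -> .27. An optional "ilom-" prefix is accepted.
ilom_ip() {
    local host=${1#ilom-} n
    case "$host" in
        himrod-storage) echo 10.0.150.10 ;;
        himrod-switch)  echo 10.0.150.27 ;;
        himrod-big-[1-4])
            n=${host#himrod-big-}
            echo "10.0.150.$((10 + n))" ;;
        himrod-*)
            n=${host#himrod-}
            # only himrod-1 .. himrod-23 exist (no leading zeros)
            if [[ $n =~ ^[1-9][0-9]?$ ]] && (( n <= 23 )); then
                echo "10.0.150.$((100 + n))"
            else
                echo "unknown host: $1" >&2
                return 1
            fi ;;
        *)  echo "unknown host: $1" >&2
            return 1 ;;
    esac
}

# Example, assuming IPMI over LAN is enabled and the password is exported
# in the IPMI_PASSWORD environment variable (read by ipmitool -E):
#   ipmitool -I lanplus -H "$(ilom_ip himrod-7)" -U cscf-adm -E chassis power status
```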
Himrod Admin setup overview SAMBA, DNSMASQ, DHCP, FIREWALL, APACHE, NFS, Imaging tools
- HimrodTools: overview of the setup and installation of services on HIMROD
iDRAC tools and scripts
ipmitools
Imaging a node
PXE booting, recovery and imaging tools
Setup and configuration scripts
Packages
finding packages
- Example: apt-cache search -n postgres
- Searches for postgres in package names only (-n is short for --names-only)
- Example: apt-cache search postgres
- Searches for postgres in package names and full descriptions
example script to install packages on all of the nodes
- /cscf-adm/src/cluster/install-mpi
- The script first installs the packages listed in the script on himrod and then on the nodes
- The script is only 27 lines long and you only have to change 2 lines (the two update_packages lines that list the packages)!
- NODES and common_vars are pulled in from the search path, in this case /usr/local/bin
- (i.e. they do NOT have to be in the current directory)
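#!/bin/bash
#
# Mike Gore, 10 Oct 2014
#
# Install openmpi on the nodes and headnode
# common_vars and NODES are sourced from the search path (/usr/local/bin);
# update_list and update_packages come from common_vars, $NODES from NODES
. common_vars
. NODES
# Install on the headnode first
update_list
update_packages netpipe-openmpi openmpi-bin openmpi-checkpoint openmpi-common openmpi-doc
# Then on every node that answers a ping ($NODES is deliberately unquoted
# so that it splits into one host name per iteration)
for i in $NODES
do
    if ping -c 1 "$i" >/dev/null 2>&1
    then
        # The quoted 'EOF' keeps the commands from being expanded locally
        ssh root@"$i" <<'EOF'
. common_vars
. NODES
update_list
update_packages netpipe-openmpi openmpi-bin openmpi-checkpoint openmpi-common openmpi-doc
EOF
    else
        echo "$i is down"
    fi
done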
Disks on nodes
- Each node has local disks mounted as /localdiskN, where N is 0 to 5
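A minimal sketch for checking the local disks on one node, assuming the util-linux mountpoint command is available. report_local_disks is a hypothetical name, and LOCALDISK_PREFIX is a made-up override (default /localdisk) that exists only so the function can be tried on other machines. It confirms each directory is really a mount point before calling df, because df on a plain directory silently reports the parent file system instead.

```shell
#!/bin/bash
# Hypothetical sketch: report free space on /localdisk0 .. /localdisk5.
report_local_disks() {
    local n d
    for n in 0 1 2 3 4 5; do
        d="${LOCALDISK_PREFIX:-/localdisk}$n"
        if mountpoint -q "$d" 2>/dev/null; then
            # df -P prints one POSIX-format line per file system;
            # field 4 of the data line is the available space in 1K blocks
            echo "$d $(df -P "$d" | awk 'NR==2 {print $4}') KB free"
        else
            echo "$d not mounted"
        fi
    done
}

# Example, run on a node without copying a script there:
#   ssh himrod-1 "$(declare -f report_local_disks); report_local_disks"
```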
Tools
OPENMPI
- I have installed openmpi on himrod and all of the nodes
- As of 10 Oct 2014, everything except the configuration has been completed
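Since the OpenMPI configuration is not finished, the following is only a hedged sketch of one common piece of it: an Open MPI hostfile, which lists one host per line with a slots=N count. The count of 16 assumes one MPI process per physical core (2 x 8 cores per node, from the Overview table) and ignores hyper-threading; make_hostfile, the file name and the program name are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: make_hostfile SLOTS HOST...
# Prints an Open MPI hostfile with one "HOST slots=SLOTS" line per host.
make_hostfile() {
    local slots=$1 host
    shift
    for host in "$@"; do
        printf '%s slots=%s\n' "$host" "$slots"
    done
}

# Example: 16 slots = 2 sockets x 8 cores on each himrod-N node.
#   make_hostfile 16 himrod-{1..23} > ~/himrod-hosts
#   mpirun --hostfile ~/himrod-hosts -np 64 ./my_mpi_program
```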
Limits
I have added the following lines to the /etc/security/limits.conf file for now (until I am given a better idea of which other entries we must set).
(I used the /cscf-adm/src/cluster/sync-users script, run as root, to update this file on the nodes.)
Note: the nodes may need to be restarted.
A script that performs this restart correctly, restart_all_nodes, is located under /cscf-adm/src/cluster
(must be run as root).
# system defaults
* hard cpu unlimited
* hard nproc unlimited
* hard as unlimited
* hard data unlimited
* hard sigpending unlimited
* hard nofile unlimited
* hard msgqueue unlimited
* hard locks unlimited
* hard fsize unlimited
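pam_limits is expected to log and skip any line whose item it does not recognize, so a misspelled item in /etc/security/limits.conf simply has no effect. A small sketch that flags such lines before the file is synced to the nodes: check_limits_items is a hypothetical name, and the item list is taken from the Linux-PAM limits.conf(5) man page. Newer PAM releases accept a few additional items, so extend VALID_ITEMS if needed.

```shell
#!/bin/bash
# Hypothetical checker: check_limits_items FILE
# Prints every non-comment line whose item (3rd field) is not a known
# pam_limits item and returns the number of such lines.
VALID_ITEMS="core data fsize memlock nofile rss stack cpu nproc as maxlogins maxsyslogins priority locks sigpending msgqueue nice rtprio"

check_limits_items() {
    local bad=0 lineno=0 domain kind item value
    while read -r domain kind item value || [ -n "$domain" ]; do
        lineno=$((lineno + 1))
        # skip blank lines and comments
        case "$domain" in ''|\#*) continue ;; esac
        case " $VALID_ITEMS " in
            *" $item "*) ;;                     # known item
            *) echo "line $lineno: unknown item '$item'"
               bad=$((bad + 1)) ;;
        esac
    done < "$1"
    return "$bad"
}

# Example: check_limits_items /etc/security/limits.conf && sync-users
```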