-- MikeGore
Note: cluster is no longer in service (RT#1039321)
- The data for all of novo1, novo2 and novo10, including /u, is now located at:
- root@rapid1:/data/novo-backup
- root@asimov:/backups/research/novo
Bin Ma NOVO Cluster
Hardware Overview
Cluster Description
IBM BladeCenter HS22 with 10 nodes
Intel Xeon Processor E5520 4C 2.26GHz 8MB Cache 1066MHz
IBM System Storage DS3200
IBM 73GB 15K 6Gbps SAS 2.5" SFF Slim-HS HDD
SAS Connectivity Card (CIOv) for IBM BladeCenter
Integrated SAS Mirroring - 2 identical HDDs required
IBM 2TB 7200 Dual Port SATA 3.5'' HS HDD
IBM BC SAS Connectivity Mod
IBM System Storage DS3200 SAS Dual Controller
DS3000 1GB Cache Memory Upgrade
Memory 2GB (1x2GB, 2Rx8, 1.5V) PC3-10600 CL9 ECC DDR3 1333MHz VLP
BNT Layer 2/3 Copper Gb Ethernet Switch Module for IBM BladeCenter
E Series Blade Chassis
MT: 8677
S/N KQZLXPX
Product ID: 86773TUMTM - S/N 8677HC1 - KQZLXPX
Storage1 : IBM System Storage DS3200
Type: 1726-HC2
S/N: 13K1AZW
P/N 13N1972
FRU PN: 39R6545
EC G32895B
UPS
FRU PN: 46M5359
PN: 46M5348
MT 2419
Model: RU1
S/N: 23A2008
Product ID: 24195KX
Warranty
IBM Tech Support
- Support Number is: 1-866-880-2765
Documents
Links
- EDOCS
Notes: purchase, service, configuration settings and other information are saved in EDOCS
UPS documents
UPS is plugged into the 13/15 socket under the floor
Inventory
Blades
Novo BladeCenter Management Interface - Remote Control, Remote Console, Network, Configuration, etc
- All Blade Management tasks and remote console access can be done via this Web Interface
- See: Novo Management Interface NOVOMGT
- Quick Summary:
- https://novo-mgmt.cs.uwaterloo.ca
- Userid: cscfadm; password (the 2017 version is the same as cscf-adm - see the password safe)
- Remote Video KVM requires IE under Windows with Sun Java
- Console:
- Blade Tasks
- Remote Control
- Start Remote Control BUTTON
- (Pick NODE from dropdown to right of Video Icon - on top center toolbar)
- Double-click in the screen to activate the window
- Power:
- Blade Tasks
- Remote Control
- Start Remote Control BUTTON (a new window will pop up - click Continue if you see a security warning)
- (Pick NODE from dropdown to right of Video Icon - on top center toolbar)
- Click on the Power Control icon dropdown and choose Power ON or OFF
- Note: The following will NOT work: Restart, Shutdown, Shutdown with NMI (NMI does not wait for the OS to respond)
- These last options ASSUME there are kernel modules installed that we DO NOT have installed!
Removing and Reinstalling Blades
Notes: The system takes more than three minutes to respond to any power-on or display select after being inserted back into the chassis. This is likely due to the lights-out management processor (LOM) boot-time overhead for the blade
Head node
Compute Nodes
Notes: Each node has an external IP address at this time; however, you should use NOVO1.cs as the preferred connection path
- or -
the Management Interface NOVOMGT, because we may move them to a private network
- NOVO has 10 nodes named NOVO1..NOVO10
- Each node can be managed and installed using the Management Interface
Remote Access
SSH Shell access
- ssh NOVO1.cs
- login as cscf-adm
Root Access
- Login as cscf-adm then sudo bash
Reboot Shutdown or Power cycle nodes - via Management Interface
- See: Novo Management Interface Summary above - or NOVOMGT
Shutdown Cluster with SSH via NOVO1 - the easy way
Assumes NOVO1 is online
Note: This script shuts down all the nodes first, then lastly NOVO1
- ssh cscf-adm@NOVO1.cs - standard password
- sudo bash
- cd /root/utils
- ./shutdown-all
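The authoritative script is /root/utils/shutdown-all on NOVO1; a minimal sketch of what it likely does (assuming passwordless root SSH to the nodes over the comm network) is:

#!/bin/bash
# Sketch only - see /root/utils/shutdown-all for the real script.
# Shut down the compute nodes first, then the head node itself.
for i in $(seq 2 10); do
    ssh "novo${i}-comm" shutdown -h now
done
shutdown -h now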
Shutdown any NODE
Note: the nodes should really be shut down before doing this
- Use SSH or Management Interface to connect to a node (novoX.cs is one of novo1.cs..novo10.cs) using userid cscf-adm
- sudo -s
- shutdown -h -t 90 now (novoX.cs will shut down in 90 seconds)
Startup the Cluster
Note: always start novo1.cs first; after it has fully booted you may start the other nodes
- Connect to NOVO1.cs via Management Interface
- Power on NOVO1.cs via the Management Interface
- Wait until it starts
- Power on the other nodes via the Management Interface
- See: Novo Management Interface Summary above - or NOVOMGT
Monitor and Keyboard access to the nodes
- NOVO1.cs has a monitor attached
Problem Solving
Networking
- /root/utils/check-nodes does a quick ping test
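check-nodes is covered in ClusterTools; a minimal equivalent ping sweep (a sketch, assuming the novoX-comm names defined in the dnsmasq section below) looks like:

#!/bin/bash
# Ping each node once over the comm network and report up/down.
for i in $(seq 1 10); do
    if ping -c 1 -W 2 "novo${i}-comm" > /dev/null 2>&1; then
        echo "novo$i is up"
    else
        echo "novo$i is DOWN"
    fi
done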
Shared filesystems on each NOVO1..NOVO10 node and the head node novo (a quick mount check follows this list)
- /u - user home directories
- /local - local disk
- /tftpboot/pxes/images - Acronis images of nodes
- Optional:
- /opt - shared programs
- /usr/local/bin - common binaries from custom applications
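A quick way to confirm the shared filesystems are mounted on every node (a one-liner sketch, assuming cscf-adm SSH access to each novoX.cs):

for i in $(seq 1 10); do echo "== novo$i =="; ssh novo$i.cs df -h /u /local; done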
Adding users
NOVO is part of the CS Active Directory
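To check that an Active Directory account is visible on the cluster, run on NOVO1 (the userid shown is a placeholder):

getent passwd someuser   # prints the passwd entry if the account resolves
id someuser              # shows the uid/gid mappings for the account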
Software installation and maintenance
- NOVO cluster uses Ubuntu 12.04 LTS 64-bit
- apt-get install
Cluster Tools and Scripts
- See: ClusterTools - outlines many of the scripts found under /root/utils on NOVO1.cs
Backups
- See LinuxLegatoClientSetup
- Note: / is replicated to /backup every night via a crontab entry (sketch after this list)
- etckeeper package is installed - keeps daily snapshot of changes under /etc
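The exact crontab entry lives on NOVO1 (check crontab -l as root); a sketch of the assumed form, using rsync and an assumed 02:00 run time:

# root crontab sketch - nightly replication of / to /backup
# -aHx: archive mode, preserve hard links, stay on one filesystem
0 2 * * * rsync -aHx --delete / /backup/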
Software Issues - anything we need installed
- In order to install software on NOVO1 and make it accessible on the nodes we install under /opt which is shared via NFS to all nodes
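For example, a source build meant for all nodes would be given an /opt prefix on NOVO1 (the package name mytool is hypothetical):

# Build on NOVO1 only; the nodes see the result through their /opt NFS mount
./configure --prefix=/opt/mytool
make
make install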
Unattended Upgrades
- We have installed the unattended-upgrades package to do critical updates automatically
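The package is enabled through the standard APT configuration; the usual stanza in /etc/apt/apt.conf.d/20auto-upgrades is shown below (confirm the values on NOVO1):

APT::Periodic::Update-Package-Lists "1";
APT::Periodic::Unattended-Upgrade "1";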
MATLAB installation on novo and the nodes
- See OpenMPI
- /usr/bin/mpi*
- /opt/mpi-code - test examples
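A minimal smoke test of the MPI install (a sketch; hello.c and the hostfile are assumptions - the real test sources are under /opt/mpi-code):

cd /opt/mpi-code                  # shared via NFS, so all nodes see the binary
mpicc hello.c -o hello            # hello.c assumed to be one of the test examples
mpirun -np 10 --hostfile hostfile ./hello   # hostfile listing novo1-comm..novo10-comm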
Software examples and usage
Running scripts on the nodes
- Example: run a command on a node that displays the hostname
- ssh NOVO1 hostname
- This example runs the command hostname on node NOVO1
- Simple script that knows what node it was run on (see the sketch below)
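A sketch of such a script, keyed on the node's short hostname:

#!/bin/bash
# Branch on which node this script is executing on.
case "$(hostname -s)" in
    novo1)            echo "running on the head node" ;;
    novo[2-9]|novo10) echo "running on compute node $(hostname -s)" ;;
    *)                echo "unexpected host $(hostname -s)" >&2; exit 1 ;;
esac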
Network and Host Configs
Ethernet interfaces
- eth0 External - connects to novo-sw1.cs
- eth1 Internal - connects to novo-sw2.cs
Network routes
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
default dc-csfw1-1-csre 0.0.0.0 UG 100 0 0 eth0
129.97.7.0 * 255.255.255.0 U 0 0 0 eth0
link-local * 255.255.0.0 U 1000 0 0 eth0
192.168.8.0 * 255.255.255.0 U 0 0 0 eth1
NOVO1 /etc/network/interfaces
root@NOVO1:/etc/network# cat interfaces
# ========================================================
# Created automatically on NOVO1.cs by command: /root/utils/./mk-interfaces
# on Wed Aug 8 10:54:23 EDT 2012
# ========================================================
auto eth0
iface eth0 inet static
address 129.97.7.230
netmask 255.255.255.0
broadcast 129.97.7.255
network 129.97.7.0
gateway 129.97.7.1
# mtu 1492
dns-servers 172.19.32.5 172.19.32.6
dns-search cs.uwaterloo.ca uwaterloo.ca
auto eth1
iface eth1 inet static
address 192.168.8.1
netmask 255.255.255.0
broadcast 192.168.8.255
network 192.168.8.0
# gateway 192.168.8.1
# mtu 1492
dns-servers 172.19.32.5 172.19.32.6
dns-search cs.uwaterloo.ca uwaterloo.ca
- See DNSMASQ - provides DHCP, DNS, BOOTP and TFTPBOOT
- The sections below show the files we used for DNSMASQ
hosts.common
127.0.0.1 localhost
127.0.1.1 localhost
# The following lines are desirable for IPv6 capable hosts
::1 localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
#BACKUP
129.97.167.210 backup-0.cs.uwaterloo.ca backup-0.cs backup-0 backup.cs.uwaterloo.ca backup.cs backup
dnsmasq.source.novo1
- documents ALL of the MAC/IP assignments for the system
# Define the NOVO Node LOMs
# We turn all lines in /etc/dnsmasq.conf to comments
# Except the line: conf-file=/etc/dnsmasq.hosts
# All configuration goes here
# ONLY the listen-address line changes when moving this to a new host
# ================================================================
include=dnsmasq.common
# ================================================================
# ONLY the listen-address line changes when moving this to a new host
# listen-address replaced with interface for this system
# listen-address=127.0.0.1,192.168.8.10
# 192.168.8.1
interface=eth1
# 129.97.7.230
except-interface=eth0
no-dhcp-interface=eth0
bind-interfaces
#This program assumes that the dhcp-range is in the following order
# tag,min,max,ttl
dhcp-range=comm,192.168.8.2,192.168.8.254,255.255.255.0,12h
# dhcp-option=option:router,192.168.8.1
# 3 = route
# 6 = DNS
#dhcp-option=comm,3,192.168.8.1
dhcp-option=comm,6,192.168.8.1,172.19.32.5,172.19.32.6,172.19.47.6
dhcp-option=119,cs.uwaterloo.ca
# ==============================================================
# Syntax:
# IP,MAC,NAME[,CNAME]
# subnet=192.168.8 (currently only doing /24 networks)
# ttl=1440m (whatever dnsmasq accepts for ttl)
# ==============================================================
# NOVO10 is on vlan 7
# ETH0
ttl=1440m
subnet=129.97.7
iface=eth0
230,e4:1f:13:7a:b3:cc,novo1.cs.uwaterloo.ca,novo1.cs
213,e4:1f:13:7a:b7:54,novo2.cs.uwaterloo.ca,novo2.cs
245,e4:1f:13:7a:b7:a0,novo10.cs.uwaterloo.ca,novo10.cs
# ==============================================================
# NOVO nodes that are on vlan 170
# ETH0
ttl=1440m
subnet=129.97.171
iface=eth0
225,e4:1f:13:7a:b8:48,novo3.cs.uwaterloo.ca,novo3.cs
224,e4:1f:13:7a:b7:74,novo4.cs.uwaterloo.ca,novo4.cs
223,e4:1f:13:7a:b6:c8,novo5.cs.uwaterloo.ca,novo5.cs
222,e4:1f:13:7a:b7:88,novo6.cs.uwaterloo.ca,novo6.cs
221,e4:1f:13:7a:b6:2c,novo7.cs.uwaterloo.ca,novo7.cs
220,e4:1f:13:7a:b7:1c,novo8.cs.uwaterloo.ca,novo8.cs
219,e4:1f:13:7a:b4:04,novo9.cs.uwaterloo.ca,novo9.cs
# ==============================================================
# COMM and NFS vlan
# Note: IP address match
# ETH1
ttl=1440m
subnet=192.168.8
iface=eth1
1,e4:1f:13:7a:b3:ce,novo1-comm,novo1
2,e4:1f:13:7a:b7:56,novo2-comm,novo2
3,e4:1f:13:7a:b8:4a,novo3-comm,novo3
4,e4:1f:13:7a:b7:76,novo4-comm,novo4
5,e4:1f:13:7a:b6:ca,novo5-comm,novo5
6,e4:1f:13:7a:b7:8a,novo6-comm,novo6
7,e4:1f:13:7a:b6:2e,novo7-comm,novo7
8,e4:1f:13:7a:b7:1e,novo8-comm,novo8
9,e4:1f:13:7a:b4:06,novo9-comm,novo9
10,e4:1f:13:7a:b7:a2,novo10-comm,novo10
NFS exports
root@novo1:~# exportfs -v
/opt 192.168.8.0/24(rw,wdelay,root_squash,no_subtree_check)
/u 192.168.8.0/24(rw,wdelay,root_squash,no_subtree_check)
/local 192.168.8.0/24(rw,wdelay,root_squash,no_subtree_check)
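The corresponding /etc/exports on novo1 would read roughly as follows (reconstructed from the exportfs output above; wdelay is an export default, so it need not appear in the file):

/opt    192.168.8.0/24(rw,root_squash,no_subtree_check)
/u      192.168.8.0/24(rw,root_squash,no_subtree_check)
/local  192.168.8.0/24(rw,root_squash,no_subtree_check)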
Disks
root@novo1:~# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sdc1 66G 36G 26G 59% /
udev 5.9G 4.0K 5.9G 1% /dev
tmpfs 2.4G 1.1M 2.4G 1% /run
none 5.0M 0 5.0M 0% /run/lock
none 5.9G 0 5.9G 0% /run/shm
/dev/sdb1 7.9T 175M 7.5T 1% /local
/dev/sda1 5.0T 894G 3.8T 19% /u
Nodes /etc/fstab
# /etc/fstab: static file system information.
#
# Use 'blkid' to print the universally unique identifier for a
# device; this may be used with UUID= as a more robust way to name devices
# that works even if disks are added and removed. See fstab(5).
#
# <file system> <mount point> <type> <options> <dump> <pass>
proc /proc proc nodev,noexec,nosuid 0 0
/dev/sda1 / ext4 errors=remount-ro 0 1
/dev/sda2 none swap sw 0 0
/dev/sdb1 /local ext4 errors=remount-ro 0 1
#NOVO1.cs.uwaterloo.ca:/u /u nfs rw,sync,hard,intr,bg 0 0
#NOVO1.cs.uwaterloo.ca:/scratch /scratch nfs rw,sync,hard,intr,bg 0 0
192.168.8.1:/u /u nfs rw,sync,soft,intr,bg 0 0
192.168.8.1:/scratch /scratch nfs rw,sync,soft,intr,bg 0 0
#192.168.8.1:/usr/local/bin /usr/local/bin nfs rw,sync,soft,intr,bg 0 0
#192.168.8.1:/opt /opt nfs rw,sync,soft,intr,bg 0 0
Disks and RAID
Diskless Booting PXE Booting
novo1.cs is running a PXE boot server
Imaging / Deploying a node
- Follow the steps in PXE Booting PXE
- See Acronis Imaging Notes CF.Acronis10
- /tftpboot/pxes/images - Acronis images of nodes