-- MikeGore

Bin Ma NOVO Cluster

Hardware Overview

Cluster Description

   IBM BladeCenter HS22 with 10 nodes
   Intel Xeon Processor E5520 4C 2.26GHz 8MB Cache 106
   IBM System Storage DS3200
   IBM 73GB 15K 6Gbps SAS 2.5" SFF Slim-HS HDD
   SAS Connectivity Card (CIOv) for IBM BladeCenter
   Integrated SAS Mirroring - 2 identical HDDs required
   IBM 2TB 7200 Dual Port SATA 3.5'' HS HDD
   IBM BC SAS Connectivity Mod
   IBM System Storage DS3200 SAS Dual Controller
   DS3000 1GB Cache Memory Upgrade
   Memory 2GB (1x2GB, 2Rx8, 1.5V) PC3-10600 CL9 ECC DDR3 1333MHz VLP
   BNT Layer 2/3 Copper Gb Ethernet Switch Module for IBM BladeCenter

   E Series Blade Chassis
   MT: 8677
   S/N KQZLXPX
   Product ID: 86773TUMTM - S/N 8677HC1 - KQZLXPX

   Storage1 : IBM System Storage DS3200
   Type: 1726-HC2
   S/N: 13K1AZW
   P/N 13N1972
   FRU PN: 39R6545
   EC G32895B

   UPS
   FRU PN: 46M5359
   PN: 46M5348
   MT 2419
   Model: RU1
   S/N: 23A2008
   Product ID: 24195KX 
   

Warranty

IBM Tech Support

  • Support Number is: 1-866-880-2765

Documents

Links


EDOCS

Notes: Purchase, service, and configuration settings and other information are saved in EDOCS

UPS documents

The UPS is plugged into the 13/15 socket under the floor

Inventory

Blades

Novo BladeCenter Management Interface - Remote Control, Remote Console, Network, Configuration, etc

  • All blade management tasks and remote console access can be done via this web interface
  • See: Novo Management Interface NOVOMGT
  • Quick Summary:
    • https://novo-mgmt.cs.uwaterloo.ca
      • Userid: cscfadm, password (2011 version, same as cscf-adm; see password safe)
      • Remote Video KVM requires Internet Explorer under Windows with Sun Java
    • Console:
      • Blade Tasks
      • Remote Control
      • Start Remote Control BUTTON
      • (Pick NODE from dropdown to right of Video Icon - on top center toolbar)
        • Double Click in screen to activate the window
    • Power:
      • Blade Tasks
      • Remote Control
      • Start Remote Control BUTTON (a new window will pop up; choose Continue if you see a security warning)
      • (Pick NODE from dropdown to right of Video Icon - on top center toolbar)
        • Click on the Power Control icon dropdown and choose Power ON or OFF
          • Note: The following will NOT work: Restart, Shutdown, Shutdown with NMI (NMI does not wait for the OS to respond)
            • These last options ASSUME there are kernel modules installed that we do NOT have installed

Removing and Reinstalling Blades

Notes: The system takes more than three minutes to respond to any power-on or display-select request after a blade is inserted back in the chassis.
This is likely due to the boot time of the blade's lights-out management (LOM) processor.

Head node

  • NOVO1.cs

Compute Nodes

Notes: Each node has an external IP address at this time. However, you should use NOVO1.cs or the Management Interface (NOVOMGT) as the preferred connection path, because we may move the nodes to a private network.
  • NOVO has 10 nodes named NOVO1..NOVO10
  • Each node can be managed and installed using the Management Interface

Remote Access

SSH Shell access

  • ssh NOVO1.cs login as cscf-adm

Root Access

  • Login as cscf-adm then sudo bash

Reboot Shutdown or Power cycle nodes - via Management Interface

  • See: Novo Management Interface Summary above - or NOVOMGT

Shutdown Cluster with SSH via NOVO1 - the easy way

Assumes NOVO1 is online
Note: This script shuts down all the nodes first, then NOVO1 last
  • ssh cscf-adm@NOVO1.cs - standard password
  • sudo bash
  • cd /root/utils
  • ./shutdown-all
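
The contents of shutdown-all are not shown on this page. A minimal sketch of the order described above (compute nodes first, NOVO1 last) might look like the following. The function names, the HEAD and DRY_RUN variables, and the use of `shutdown -h now` over ssh are assumptions for illustration; /root/utils/NODES is assumed to set $NODES, as it does in the MATLAB script further down this page.

```shell
#!/bin/bash
# Hypothetical sketch, not the real /root/utils/shutdown-all.
HEAD=novo1   # assumed head node name; the comparison below is case-sensitive

# Print the nodes in shutdown order: every node except the head node, then the head node.
shutdown_order() {
    for n in "$@"; do
        [ "$n" = "$HEAD" ] || echo "$n"
    done
    echo "$HEAD"
}

# Shut the nodes down in that order.
# With DRY_RUN=1 (the default here) the commands are only printed.
shutdown_all() {
    for n in $(shutdown_order "$@"); do
        if [ "$n" = "$HEAD" ]; then
            cmd="shutdown -h now"            # head node: run locally
        else
            cmd="ssh $n shutdown -h now"     # compute node: run over ssh
        fi
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo "would run: $cmd"
        else
            $cmd
        fi
    done
}
```

Possible usage as root on NOVO1: `source /root/utils/NODES; shutdown_all $NODES` to preview the commands, then repeat with `DRY_RUN=0` to actually shut down.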

Shutdown any NODE

Note: the nodes should really be shut down before doing this
  • Use SSH or Management Interface to connect to a node (novoX.cs is one of novo1.cs..novo10.cs) using userid cscf-adm
  • sudo -s
  • shutdown -h -t 90 (novoX.cs will shut down in 90 seconds)

Startup the Cluster

Note: Always start novo1.cs first; once it has fully booted, you may start the nodes
  • Connect to NOVO1.cs via Management Interface
  • Power on NOVO1.cs via Management Interface
    • Wait until it starts
    • Power on other nodes via Management interface
      • See: Novo Management Interface Summary above - or NOVOMGT
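
The "wait until it starts" step can be checked from another machine instead of watching the console. The helper below is a sketch only: the name wait_for and its arguments are hypothetical, and it simply retries whatever probe command it is given.

```shell
#!/bin/bash
# Hypothetical helper: retry a probe command until it succeeds or the attempts run out.
# Usage: wait_for ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
wait_for() {
    attempts=$1
    delay=$2
    shift 2
    while [ "$attempts" -gt 0 ]; do
        # Run the probe quietly; success ends the wait.
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        attempts=$((attempts - 1))
        if [ "$attempts" -gt 0 ]; then
            sleep "$delay"
        fi
    done
    return 1
}
```

For example, `wait_for 60 10 ssh -o ConnectTimeout=5 cscf-adm@novo1.cs true && echo "novo1 is up"` would try for about ten minutes. An ssh probe is suggested here rather than ping because a host can answer ping before it has finished booting.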

Monitor and Keyboard access to the nodes

  • NOVO1.cs has a monitor attached

Problem Solving

Networking

  • /root/utils/check-nodes does a quick ping test
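
The check-nodes script itself is not reproduced here. A ping sweep of this kind could be sketched as follows; the function name, the PROBE variable, and the `ping -c 1 -W 2` options are assumptions, and the real script may differ.

```shell
#!/bin/bash
# Hypothetical sketch of a ping sweep; the real /root/utils/check-nodes may differ.
# PROBE is the command used to test one host; override it to test another way.
PROBE=${PROBE:-"ping -c 1 -W 2"}

# Print "<host> up" or "<host> DOWN" for each argument.
# Returns non-zero if any host is down.
check_nodes() {
    status=0
    for host in "$@"; do
        if $PROBE "$host" >/dev/null 2>&1; then
            echo "$host up"
        else
            echo "$host DOWN"
            status=1
        fi
    done
    return $status
}
```

Possible usage on NOVO1: `source /root/utils/NODES; check_nodes $NODES`.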

Shared filesystems on each NOVO1..NOVO10 node and the headnode novo

  • /u - user home directories
  • /local - local disk
  • /tftpboot/pxes/images - Acronis images of nodes
  • Optional:
    • /opt - shared programs
    • /usr/local/bin - common binaries from custom applications
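
To confirm that a node actually has these filesystems mounted, something like the following could be run on it. This is a sketch: the function name is hypothetical, and it only assumes a mounts table in the /proc/mounts layout, where the second field is the mount point.

```shell
#!/bin/bash
# Hypothetical check: print each expected mount point that is missing
# from a mounts table in /proc/mounts format (second field = mount point).
# Usage: missing_mounts MOUNTS_FILE MOUNTPOINT...
missing_mounts() {
    table=$1
    shift
    for mp in "$@"; do
        # awk exits 0 only if some line has this mount point in field 2.
        if ! awk -v mp="$mp" '$2 == mp { found = 1 } END { exit !found }' "$table"; then
            echo "$mp"
        fi
    done
}
```

For example, `missing_mounts /proc/mounts /u /local` prints nothing when both are mounted.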

Adding users

NOVO is part of the CS Active Directory
  • Add users via CS AD

Software installation and maintenance

  • NOVO cluster uses Ubuntu 12.04LTS 64bit
  • apt-get install

Cluster Tools and Scripts

  • See: ClusterTools - outlines many of the scripts found under /root/utils on NOVO1.cs

Backups

  • See LinuxLegatoClientSetup
  • Note: / is replicated to /backup every night via a crontab entry
    • See: /scripts/backup
  • etckeeper package is installed - keeps daily snapshot of changes under /etc

Software Issues - anything we need installed

  • In order to install software on NOVO1 and make it accessible on the nodes we install under /opt which is shared via NFS to all nodes

Unattended Upgrades

  • We have installed the unattended-upgrades package to do critical updates automatically
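
Installing the package is not always enough; the periodic run is switched on by an APT setting. The sketch below checks a configuration file for that setting. The function name is hypothetical, and the file path in the usage note is an assumption about where the setting lives on this system.

```shell
#!/bin/bash
# Hypothetical check: succeed if an APT configuration file turns on the periodic
# unattended-upgrade run, i.e. contains a line such as
#   APT::Periodic::Unattended-Upgrade "1";
# A value of "0" means the run is disabled.
# Usage: unattended_enabled FILE
unattended_enabled() {
    grep -Eq '^[[:space:]]*APT::Periodic::Unattended-Upgrade[[:space:]]+"[1-9][0-9]*";' "$1"
}
```

For example, `unattended_enabled /etc/apt/apt.conf.d/20auto-upgrades && echo enabled` (the setting may instead be in another file under /etc/apt/apt.conf.d).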

MATLAB installation on novo and the nodes

  • See: MatlabInstallation and install the 64bit version of Matlab
  • Install MATLAB on novo under /opt/matlab - this is a shared space that all of the nodes also see via NFS
  • I run the following script as root on novo using bash to create a local link on each node to the shared software
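          #!/bin/bash
          # Create /usr/local/bin/matlab on each node as a link to the shared install
          source /root/utils/NODES
          for i in $NODES
             do echo "$i"
             # ln -s TARGET LINK_NAME: the link lives in /usr/local/bin
             ssh "$i" "ln -s /opt/matlab/bin/matlab /usr/local/bin/matlab"
          done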
       

OpenMPI

  • See OpenMPI
  • /usr/bin/mpi*
  • /opt/mpi-code - test examples
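
To run an MPI job across several nodes, Open MPI needs a hostfile listing the nodes. The helper below is a sketch that writes one; the function name is hypothetical, and the slot count of 4 in the usage note is an assumption based on the quad-core E5520 in the hardware list, not a measured value.

```shell
#!/bin/bash
# Hypothetical helper: print an Open MPI hostfile, one "host slots=N" line per node.
# Usage: make_hostfile SLOTS HOST...
make_hostfile() {
    slots=$1
    shift
    for host in "$@"; do
        echo "$host slots=$slots"
    done
}
```

For example, `make_hostfile 4 novo2 novo3 > hosts` followed by `mpirun --hostfile hosts -np 8 hostname` should print each node name four times if MPI can reach both nodes.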

Software examples and usage

Running scripts on the nodes

  • Example run a command on a node that displays the hostname
    • ssh NOVO1 hostname
      • This example runs the command hostname on node NOVO1
  • Simple script that knows what node it was run on
    • Creates a directory named after the node, changes into it, and then displays its current working directory.
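            #!/bin/bash
            # Work in a per-node subdirectory of the shared data directory
            HOSTN=$(hostname)
            cd /data/sjwlwan || exit 1
      
            # Create the directory for this node if it does not exist yet
            mkdir -p "$HOSTN"
      
            cd "$HOSTN" || exit 1
      
            pwd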
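
The single-node ssh example above extends naturally to every node. The following is a sketch, not one of the existing cluster tools: run_on_nodes and the REMOTE variable are hypothetical names, and /root/utils/NODES is assumed to set $NODES as in the MATLAB script.

```shell
#!/bin/bash
# Hypothetical helper: run one command on every listed node and prefix each
# output line with the node name. REMOTE is the remote-shell command (ssh by default).
REMOTE=${REMOTE:-ssh}

# Usage: run_on_nodes "COMMAND" HOST...
run_on_nodes() {
    cmd=$1
    shift
    for host in "$@"; do
        # stdin is redirected so the remote shell cannot swallow the caller's input
        $REMOTE "$host" "$cmd" </dev/null 2>&1 | sed "s/^/$host: /"
    done
}
```

Possible usage on NOVO1: `source /root/utils/NODES; run_on_nodes hostname $NODES`.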
         

Network and Host Configs

Ethernet interfaces

  • eth0 External - connects to novo-sw1.cs
  • eth1 Internal - connects to novo-sw2.cs

Network routes

   Kernel IP routing table
   Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
   default         dc-csfw1-1-csre 0.0.0.0         UG    100    0        0 eth0
   129.97.7.0      *               255.255.255.0   U     0      0        0 eth0
   link-local      *               255.255.0.0     U     1000   0        0 eth0
   192.168.8.0     *               255.255.255.0   U     0      0        0 eth1
   

NOVO1 /etc/network/interfaces

root@NOVO1:/etc/network# cat interfaces
# ========================================================
# Created automatically on NOVO1.cs by command: /root/utils/./mk-interfaces
# on Wed Aug  8 10:54:23 EDT 2012
# ========================================================
auto eth0
iface eth0 inet static
   address 129.97.7.230
   netmask 255.255.255.0
   broadcast 129.97.7.255
   network 129.97.7.0
   gateway 129.97.7.1
   # mtu 1492
   dns-servers 172.19.32.5 172.19.32.6
   dns-search cs.uwaterloo.ca uwaterloo.ca   
auto eth1
iface eth1 inet static
   address 192.168.8.1
   netmask 255.255.255.0
   broadcast 192.168.8.255
   network 192.168.8.0
   # gateway 192.168.8.1
   # mtu 1492
   dns-servers 172.19.32.5 172.19.32.6
   dns-search cs.uwaterloo.ca uwaterloo.ca
   

DNSMASQ

  • See DNSMASQ - provides DHCP,DNS, BOOTP and TFTPBOOT
  • The sections below show the files we used for DNSMASQ

hosts.common

   127.0.0.1       localhost
   127.0.1.1       localhost

   # The following lines are desirable for IPv6 capable hosts
   ::1     localhost ip6-localhost ip6-loopback
   fe00::0 ip6-localnet
   ff00::0 ip6-mcastprefix
   ff02::1 ip6-allnodes
   ff02::2 ip6-allrouters

   #BACKUP
   129.97.167.210 backup-0.cs.uwaterloo.ca backup-0.cs backup-0 backup.cs.uwaterloo.ca backup.cs backup
   

dnsmasq.source.novo1

Documents all of the MAC/IP assignments for the system

   # Define the NOVO Node LOMs
   # We turn all lines in /etc/dnsmasq.conf to comments
   # Except the line: conf-file=/etc/dnsmasq.hosts

   # All configuration goes here
   # ONLY the listen-address line changes when moving this to a new host

   # ================================================================
   include=dnsmasq.common
   # ================================================================
   # ONLY the listen-address line changes when moving this to a new host

   # listen-address replaced with interface for this system
   # listen-address=127.0.0.1,192.168.8.10

   # 192.168.8.1
   interface=eth1
   # 129.97.7.230
   except-interface=eth0
   no-dhcp-interface=eth0
   bind-interfaces
   #This program assumes that the dhcp-range is in the following order
   # tag,min,max,ttl
   dhcp-range=comm,192.168.8.2,192.168.8.254,255.255.255.0,12h
   # dhcp-option=option:router,192.168.8.1

   # 3 = route
   # 6 = DNS
   #dhcp-option=comm,3,192.168.8.1
   dhcp-option=comm,6,192.168.8.1,172.19.32.5,172.19.32.6,172.19.47.6
   dhcp-option=119,cs.uwaterloo.ca
   # ==============================================================
   # Syntax:
   # IP,MAC,NAME[,CNAME]
   # subnet=192.168.8   (currently only doing /24 networks)
   # ttl=1440m   (whatever dnsmasq accepts for ttl)
   # ==============================================================

   # NOVO10 is on vlan 7
   # ETH0
   ttl=1440m
   subnet=129.97.7
   iface=eth0
   230,e4:1f:13:7a:b3:cc,novo1.cs.uwaterloo.ca,novo1.cs
   213,e4:1f:13:7a:b7:54,novo2.cs.uwaterloo.ca,novo2.cs
   245,e4:1f:13:7a:b7:a0,novo10.cs.uwaterloo.ca,novo10.cs
   # ==============================================================
   # NOVO nodes that are on vlan 170
   # ETH0
   ttl=1440m
   subnet=129.97.171
   iface=eth0
   225,e4:1f:13:7a:b8:48,novo3.cs.uwaterloo.ca,novo3.cs
   224,e4:1f:13:7a:b7:74,novo4.cs.uwaterloo.ca,novo4.cs
   223,e4:1f:13:7a:b6:c8,novo5.cs.uwaterloo.ca,novo5.cs
   222,e4:1f:13:7a:b7:88,novo6.cs.uwaterloo.ca,novo6.cs
   221,e4:1f:13:7a:b6:2c,novo7.cs.uwaterloo.ca,novo7.cs
   220,e4:1f:13:7a:b7:1c,novo8.cs.uwaterloo.ca,novo8.cs
   219,e4:1f:13:7a:b4:04,novo9.cs.uwaterloo.ca,novo9.cs
   # ==============================================================
   # COMM and NFS vlan
   # Note: IP address match
   # ETH1
   ttl=1440m
   subnet=192.168.8
   iface=eth1
   1,e4:1f:13:7a:b3:ce,novo1-comm,novo1
   2,e4:1f:13:7a:b7:56,novo2-comm,novo2
   3,e4:1f:13:7a:b8:4a,novo3-comm,novo3
   4,e4:1f:13:7a:b7:76,novo4-comm,novo4
   5,e4:1f:13:7a:b6:ca,novo5-comm,novo5
   6,e4:1f:13:7a:b7:8a,novo6-comm,novo6
   7,e4:1f:13:7a:b6:2e,novo7-comm,novo7
   8,e4:1f:13:7a:b7:1e,novo8-comm,novo8
   9,e4:1f:13:7a:b4:06,novo9-comm,novo9
   10,e4:1f:13:7a:b7:a2,novo10-comm,novo10
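
The program that expands this source file into dnsmasq configuration is not shown on this page. As an illustration of how the IP,MAC,NAME[,CNAME] records combine with the preceding subnet=, ttl= and iface= lines, the sketch below turns the records for one interface into dnsmasq dhcp-host lines. The function name and the choice of output are assumptions; the real generator may emit different or additional directives and may use the CNAME field, which this sketch ignores.

```shell
#!/bin/bash
# Hypothetical sketch: expand "IP,MAC,NAME[,CNAME]" records into dnsmasq
# dhcp-host lines, using the most recent subnet=, ttl= and iface= settings.
# Only records under the requested iface are printed; comments, blank lines
# and other directives are skipped. Reads the source file on stdin.
# Usage: expand_hosts IFACE < dnsmasq.source.novo1
expand_hosts() {
    awk -F, -v want="$1" '
        { sub(/^[ \t]+/, ""); sub(/[ \t]+$/, "") }
        /^subnet=/ { subnet = substr($0, 8); next }
        /^ttl=/    { ttl = substr($0, 5); next }
        /^iface=/  { iface = substr($0, 7); next }
        iface == want && /^[0-9]+,([0-9a-fA-F][0-9a-fA-F]:)+[0-9a-fA-F][0-9a-fA-F],/ {
            # MAC, full IP (subnet + last octet), host name, lease time
            printf "dhcp-host=%s,%s.%s,%s,%s\n", $2, subnet, $1, $3, ttl
        }
    '
}
```

For example, `expand_hosts eth1 < dnsmasq.source.novo1` would print one dhcp-host line per node on the COMM and NFS network.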
   

NFS exports

   root@novo1:~# exportfs -v
   /opt             192.168.8.0/24(rw,wdelay,root_squash,no_subtree_check)
   /u               192.168.8.0/24(rw,wdelay,root_squash,no_subtree_check)
   /local           192.168.8.0/24(rw,wdelay,root_squash,no_subtree_check)
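
A listing like this can be compared against the paths the nodes expect to mount. The sketch below reads `exportfs -v` output and prints any expected path that is not exported; the function name is hypothetical. It may be worth running for /scratch, which the node fstab on this page mounts from 192.168.8.1 but which does not appear in the listing above.

```shell
#!/bin/bash
# Hypothetical check: read `exportfs -v` output on stdin and print each
# expected path that is not exported. Lines whose first field starts
# with "/" are treated as exported paths.
# Usage: exportfs -v | missing_exports PATH...
missing_exports() {
    exported=$(awk '$1 ~ /^\// { print $1 }')
    for path in "$@"; do
        # -F fixed string, -x whole-line match, -q quiet
        echo "$exported" | grep -Fxq -- "$path" || echo "$path"
    done
}
```

For example, `exportfs -v | missing_exports /u /scratch /opt` on novo1.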
   

Disks

   root@novo1:~# df -h
   Filesystem      Size  Used Avail Use% Mounted on
   /dev/sdc1        66G   36G   26G  59% /
   udev            5.9G  4.0K  5.9G   1% /dev
   tmpfs           2.4G  1.1M  2.4G   1% /run
   none            5.0M     0  5.0M   0% /run/lock
   none            5.9G     0  5.9G   0% /run/shm
   /dev/sdb1       7.9T  175M  7.5T   1% /local
   /dev/sda1       5.0T  894G  3.8T  19% /u
   

Nodes /etc/fstab

   # /etc/fstab: static file system information.
   #
   # Use 'blkid' to print the universally unique identifier for a
   # device; this may be used with UUID= as a more robust way to name devices
   # that works even if disks are added and removed. See fstab(5).
   #
   # <file system> <mount point>   <type>  <options>       <dump>  <pass>
   proc            /proc           proc    nodev,noexec,nosuid 0       0
   /dev/sda1 /               ext4    errors=remount-ro 0       1
   /dev/sda2 none            swap    sw              0       0

   /dev/sdb1 /local          ext4    errors=remount-ro 0       1

   #NOVO1.cs.uwaterloo.ca:/u  /u  nfs  rw,sync,hard,intr,bg  0  0
   #NOVO1.cs.uwaterloo.ca:/scratch /scratch  nfs rw,sync,hard,intr,bg  0 0
   192.168.8.1:/u  /u  nfs  rw,sync,soft,intr,bg  0  0
   192.168.8.1:/scratch /scratch  nfs rw,sync,soft,intr,bg  0 0
   #192.168.8.1:/usr/local/bin /usr/local/bin nfs rw,sync,soft,intr,bg 0 0
   #192.168.8.1:/opt /opt nfs rw,sync,soft,intr,bg 0 0
   

Disks and Raid


Diskless Booting PXE Booting

novo1.cs is running a PXE boot server

  • See PXE

Imaging / Deploying a node

  • Follow the steps in PXE Booting PXE
  • See Acronis Imaging Notes CF.Acronis10
    • /tftpboot/pxes/images - Acronis images of nodes

Topic attachments

   Attachment                      Size      Date                Who       Comment
   NOVO_maintenance_renewal.pdf    1697.0 K  2014-12-11 - 09:46  MikeGore  2014 Renewal Notice
   xdo31_195147_32-3.pdf           5.1 K     2014-12-11 - 09:47  MikeGore  2013 Renewal
   xdo9_201833_10.pdf              19.0 K    2014-12-11 - 09:45  MikeGore  Warranty Extension 2014

Topic revision: r12 - 2016-08-16 - MikeGore
 