Riemann Cluster Notes
General
The Riemann cluster, obtained in late 2009, is an SGI XE340 cluster consisting of nine chassis with two separate machines per chassis, for a total of 18 nodes. The head node is riemann.math with a public network interface. The rest are on a private subnet. Home directories come from NetApp fs01.student.math by means of a gigabit fibre link, also on the private subnet. The private subnet operates on switch math-sw-mc-3-3015-ah06. Each node has two quad-core Intel Nehalem CPUs. The cluster was updated to SLES 11 SP2 with SGI Foundation 2.8 and Performance Suite 1.6 in August 2013. There was an initial three-year contract for full support on hardware and software. The last renewal expires November 2014.
Initial set-up was done by SGI. There is no serious cluster administration or job scheduling suite such as Scali, Platform Manager, Grid Engine, etc. The open-source SystemImager (SI) was used for basic OS imaging of the nodes; this was largely a budgetary decision. With SI it does not seem possible to propagate minor updates or OS configuration changes; one must apply such changes on a node and then inhale a complete image into SI for subsequent complete rebuilding of client nodes. Consequently, it is essential that we record any configuration changes made to the nodes, so that we can either incorporate them into a later image or easily repeat them after a node is reinstalled.
Please be sure to RCS any manual changes to system configuration files. That way we can find things we've changed by searching for RCS directories. Things we do through Yast won't be manageable that way, so we'll have to record them here (or via reference to ST items with details).
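A minimal example of the usual RCS workflow (shown here for /etc/sudoers; the same pattern applies to any configuration file we touch):
- riemann:/etc # mkdir -p RCS
- riemann:/etc # ci -l sudoers
The first check-in prompts for a description, creates RCS/sudoers,v, and leaves the working file in place; re-run 'ci -l sudoers' after each change to record a new revision. To list everything changed this way:
- riemann:~ # find /etc /usr/local -type d -name RCS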
CSCF Machine notes
https://www.cs.uwaterloo.ca/cscf/internal/edocs/machines/riemann.math.uwaterloo.ca/
Network
Roughly: netmask 255.255.0.0; data on 10.20.0.[1-99], BMC on 10.20.0.[101-199], pilatus on 10.20.0.99 (former home directory server), fs01.student.math on 10.20.0.98. All nodes other than the head node use eth0 for both the private network and BMC. The head node uses eth0 for its public IP 129.97.140.166 and for its BMC at 129.97.82.166; eth1 carries its private-network address 10.20.0.18.
Console access
The XE340 has a BMC which can be accessed using IPMI tools. The networked BMC connection piggybacks on the default eth0 port. (A dedicated port is available, but that would consume twice as many network switch ports and we don't think the traffic warrants it.)
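For example, from the head node something like the following should work over the private network, assuming the ipmitool package is installed (10.20.0.105 is just a sample BMC address from the range above, and the IPMI user/password are whatever was configured on the BMCs, not recorded here):
- riemann:~ # ipmitool -I lanplus -H 10.20.0.105 -U <user> -P <password> chassis power status
- riemann:~ # ipmitool -I lanplus -H 10.20.0.105 -U <user> -P <password> sol activate
The second command opens a serial-over-LAN console; the default escape sequence to get back out is '~.'.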
There's a graphical tool for connecting to the IPMIs that you can start with this command:
/bin/sh /opt/SUPERMICRO/IPMIView/IPMIView20.bin
Head Node configuration
- Public interface on eth0 129.97.140.166, private on eth1
- Install third-party application software into /opt so the clients will automatically see it without also having to install it on them.
- Made it an NIS server to provide account information to the client nodes.
- Needs to have uid and gid consistency with pilatus for user accounts.
- Only the head node has a second disk for cloning the system disk. Cron job runs clone script at /usr/local/maintenance/clone_disk
- /etc/sudoers and /etc/group modified to enable members of group wheel for sudo (see the sketch after this list)
- Installed gcc suite.
- Installed gcc-fortran from SLES SDK. ST#72938
- Installed MPFR and GMP development support from SLES SDK. Needed for gmpfrxx, ST#70388
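For the wheel/sudo change above, the relevant entries look roughly like this (a sketch, not a verbatim copy of the files; edit /etc/sudoers with visudo):
- /etc/sudoers: %wheel ALL=(ALL) ALL
- /etc/group: wheel:x:10:cscf-adm (append further admin logins to the member list, comma-separated)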
Add-on software packages that we built:
- ST#70386 pari installed under /opt/pari_gnu
- ST#70387 Bailey's quad double software installed under /opt/qd_gnu
- ST#70388 built gmpfrxx. Installed in /opt for automatic accessibility by the cluster nodes.
Manual Client Node configuration
- NFS mounts for /opt from riemann.math, /home from fs01.student.math, /scratch from pilatus, all on the private network (see the example fstab sketch after this list)
- Installed gcc suite
- Added SLES SDK as installation source
- Installed gcc-fortran, MPFR, GMP (ST#70388, ST#72938)
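A sketch of what the corresponding /etc/fstab entries on a client might look like (the export paths are placeholders only; check the real ones with 'showmount -e <server>'):
- 10.20.0.18:/opt /opt nfs defaults 0 0
- 10.20.0.98:/export/home /home nfs defaults 0 0 (placeholder export path on fs01.student.math)
- 10.20.0.99:/scratch /scratch nfs defaults 0 0 (placeholder export path on pilatus)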
Adding Accounts
Create the account on the riemann head node. Use the same UID and GID that the person has in the Math region.
- riemann:/etc # cat /etc/passwd | grep 1619
- riemann:~ # groupadd -g 1619 gboerke
- riemann:~ # useradd -m -c "gboerke" -u 1619 -g 1619 gboerke
- riemann:~ # passwd gboerke
- Changing password for gboerke.
- New Password:
- Reenter New Password:
- Password changed.
- riemann:~ # chfn -f "Gordon Boerke" gboerke
To become another user
- cscf-adm@riemann:~> sudo -s
- riemann:/home/cscf-adm # ls -la /home/swilson/.ssh
- ls: cannot open directory /home/swilson/.ssh: Permission denied
- riemann:/home/cscf-adm # su swilson
- swilson@riemann:/home/cscf-adm> ls -la /home/swilson/.ssh
Once the account is created on the riemann.math head node, it needs to be propagated to the sub-nodes. We use NIS for account propagation.
Also, create a .forward file pointing to the person's @uwaterloo.ca account. We don't want the cluster to be a mail processing destination.
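For example, for the gboerke account created above (substitute the person's real userid and address):
- riemann:~ # echo "gboerke@uwaterloo.ca" > /home/gboerke/.forward
- riemann:~ # chown gboerke:gboerke /home/gboerke/.forward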
NIS Account Management
Accounts are propagated to all sub-nodes from the head node via NIS. To make the nodes notice a new account, run
YaST2 on the head node and go through the NIS Server set-up, not changing anything, just next-next-next...done. Some details follow.
- riemann:~ # rpm -qa | grep -i NIS
- yast2-nis-client-2.17.7-1.36
- yast2-nis-server-2.17.2-1.53
Use YaST2 to set up the NIS server on the head node riemann. Follow the selections from YaST2. Do not set up the head node as an NIS client (even though the NIS docs suggest doing so).
- Select "Install and set up an NIS Master Server"
- Set the NIS domain name to nisriemann.
- Enable fast map distribution (rpc.ypxfrd).
- Enable password, GECOS, and shell changing.
- "Open Port in Firewall" is already selected; the firewall details show that eth1 (private) has the port open and eth0 (public) doesn't, which is what we want.
- Next page is for server slaves. Don't need any.
- Next page is for which maps to manage. Select the maps: group, passwd, shadow, i.e. just accounts stuff.
- Next page is for query host setup, i.e. who is allowed to ask this NIS server for info. So we want netmask 255.255.0.0 and network 10.20.0.0
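Once the wizard finishes, the compiled maps should appear under /var/yp/nisriemann and ypserv should be running; a quick sanity check:
- riemann:~ # rcypserv status
- riemann:~ # ls /var/yp/nisriemann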
Use YAST2 to set up NIS client on the sub-nodes
- Run YaST2 on the command line, then go to Network Services -> NIS Client
- Set domain as nisriemann, and set server as 10.20.0.18 and then click finish.
The NIS domain name is nisriemann.
- riemann:~ # cat /etc/defaultdomain
- nisriemann
To report the NIS domain name on the client node 10.20.0.5:
- riemann5:~ # ypdomainname
- nisriemann
- riemann5:~ # nisdomainname
- nisriemann
The head node will be running the ypserv daemon and the sub-nodes will be running the ypbind daemon.
Sub-nodes will point to the NIS server
- cat /etc/yp.conf shows ypserver 10.20.0.18
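To confirm that a sub-node's ypbind is up and bound to the right server:
- riemann5:~ # rcypbind status
- riemann5:~ # ypwhich
ypwhich should report 10.20.0.18 (or whatever name it reverse-resolves to).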
To add accounts from /etc/passwd to NIS, run YaST2 again to set up the server. Unlike on shiraz/maroo/vidal, running '/var/yp/make all' or '/var/yp/make' does not seem to work here.
On a node to find all NIS users
- ypcat -d nisriemann passwd
On the head node, because the NIS server is configured to answer only on the private subnet, do
- ypcat -d nisriemann -h 10.20.0.18 passwd
SSH Key Generation
SSH into the head node requires all users to enter their password. From the head node you can then enter any client node without a password. In this example, we set up the keys for cscf-adm on pilatus. Since the home directory is NFS-mounted on riemann, all nodes will have the same 'authorized_keys2' file. This dates back to the use of pilatus as the home directory server, which is no longer the case.
- pilatus:/home/cscf-adm # ls .ssh
- known_hosts
- cscf-adm@pilatus:~/.ssh> ssh-keygen -t dsa
- Generating public/private dsa key pair.
- Enter file in which to save the key (/home/cscf-adm/.ssh/id_dsa):
- Enter passphrase (empty for no passphrase):
- Enter same passphrase again:
- Your identification has been saved in /home/cscf-adm/.ssh/id_dsa.
- Your public key has been saved in /home/cscf-adm/.ssh/id_dsa.pub.
- The key fingerprint is: ...
- cscf-adm@pilatus:~/.ssh> ls
- id_dsa id_dsa.pub known_hosts
- cscf-adm@pilatus:~/.ssh> cat id_dsa.pub >> authorized_keys2
- cscf-adm@pilatus:~/.ssh> ls
- authorized_keys2 id_dsa id_dsa.pub known_hosts
- cscf-adm@pilatus:~/.ssh>
Test the SSH keys by logging into riemann and, from riemann, logging into any client node riemann1 to riemann17. It may ask to allow the connection if this is the first time logging into that client; reply with 'yes'. After this it should not ask for a password.
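A quick check from the head node (it should print the client's hostname without prompting for a password):
- cscf-adm@riemann:~> ssh riemann1 hostname
- riemann1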
iLOM Access
The SuperMicro IPMI BMC comes with an embedded web server. Access the remote management card with any browser at
https://bmc-riemann.math.uwaterloo.ca (temporarily the card is named bmc2-riemann.math.uwaterloo.ca)
From the browser connection you can view sensor information, event logs, power control, network settings, etc.
To access riemann.math via the iLOM KVM interface, select the Remote Control menu tab and then the Launch Console button. Log into riemann, start Firefox, and connect to any of the sub-node iLOM cards via http://10.20.0.101 through http://10.20.0.117.
Don't change the Configuration -> LAN Select setting. If this setting is changed to Enable On-Board or Dedicated, the node must be power-cycled; otherwise sensor readings will be inaccessible.
Problem accessing KVM
The Java installation on the remote machine must be able to download and run jviewer.jnlp. The head node riemann.math does not have the correct Java version, so to access a sub-node, find a machine that does.
From fe105.math
fe105$ firefox &
In the address bar: http://bmc-riemann.math
Log into the IPMI card. Then start a KVM of riemann.
It will ask to download jviewer.jnlp and to select Java Web Start to run it. It will prompt yes/no a couple of times.
After login start a firefox GUI session. In firefox address bar enter the IP of the sub-node IPMI.
Login to the sub-node and start a KVM session of the sub-node.
Show iLOM MAC and IP address
riemann:/opt/SUPERMICRO/IPMIView # ./ipmicfg-linux.x86_64 -m
IP=129.97.82.166 MAC=00:30:48:C9:6C:55
--
RobynLanders - 04 Dec 2009