Restoring Broken Old Master After Failover
Caution: this process is meant to cover most situations, but the output of each command should be reviewed to make sure it makes sense. Not all edge cases of failover and recovery are documented here, so if you find missing parts, fill them in below.
- decide that we're ready to fix the old master. This does NOT need to happen during or right after the emergency of failing over the database; it can wait until there's time.
- if the old master lxc container (eg, 102) is up and you can ssh in from linux.cscf, skip ahead to the "mysql config" section below. Otherwise:
- To avoid some trial-and-error checking, find out how the old master was brought down, since the restore depends in part on what was done. Check the ST for the failover, and possibly check with the staff person who failed over the machine. If there are no hints, work through the steps below in increasing order of complexity.
- you need to know the IAAS lxc host, which you can find by checking the Infrastructure containers list: https://cs.uwaterloo.ca/cscf/internal/infrastructure/inventory/virtual-host-index/
- If that list isn't conclusive, search for your LXC container in inventory and hopefully either the inventory record, or its eDocs page, will list its last known location.
- if the IAAS lxc host is up and you can ssh in from linux.cscf, do so and restart the lxc container, in a similar fashion to:
ubuntu1404-202:~# ssh dc-3558-411
Welcome to Ubuntu 14.04.4 LTS (GNU/Linux 3.19.0-56-generic x86_64)
root@dc-3558-411:~# lxc-info --name mysql-102.cs.uwaterloo.ca
Name: mysql-102.cs.uwaterloo.ca
State: STOPPED
root@dc-3558-411:~# lxc-start -d -n mysql-102.cs.uwaterloo.ca
- if the IAAS lxc host is not up, you can restart it by finding it in inventory and determining its lom-cs{barcode}.cs.uwaterloo.ca hostname. See https://cs.uwaterloo.ca/cscf/internal/infrastructure/inventory/IAAS-hardware-index/ for current data. Last seen 2016-04-04:
Note that the lom-cs* interfaces can only be reached from systems on VLAN 15 (eg. CSCF workstations).
- visit the URL for the lom-cs hostname. Credentials for lom-cs* interfaces can be found in CSCF safe under "LOM".
- once logged in, confirm that the machine is powered off and select the "Power ON / OFF" option to turn it on. About a minute later, the LOM page should auto-update from a thumbnail of a booting machine to a thumbnail of a command prompt. The machine should now be accessible via ssh; proceed with the "IAAS lxc is up" item above.
- if, after restarting the machine, it is not ssh accessible, the ona port might have been shut off.
- search for the IAAS lxc host's regular IP address (i.e., the IP for dc-3558-411, not the IP for the LOM) and find its port. On the port's page, look for its status (enabled or disabled); if disabled, a comment explaining why should be present. If it's something you are comfortable re-enabling, erase the comment and re-enable the port; hopefully ssh to the IAAS lxc host will then succeed and you may proceed with the "IAAS lxc is up" item above.
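The container restart step above (check the state with lxc-info, then lxc-start if it is stopped) can be sketched as a small shell helper. This is only a sketch under stated assumptions: the function names parse_lxc_state and start_if_stopped are hypothetical, and the parsing assumes lxc-info prints a "State:" line as in the transcript above. Review the output by hand before relying on it.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of LXC.

# Read `lxc-info` output on stdin and print the value of its "State:" line.
parse_lxc_state() {
    awk '$1 == "State:" { print $2; exit }'
}

# Start the named container only if lxc-info reports it as STOPPED.
start_if_stopped() {
    container="$1"
    state=$(lxc-info --name "$container" | parse_lxc_state)
    case "$state" in
        STOPPED) lxc-start -d -n "$container" ;;
        RUNNING) echo "$container is already running" ;;
        *)       echo "unexpected state for $container: '$state'" >&2
                 return 1 ;;
    esac
}

# Example (as root on the IAAS lxc host, container name from the transcript):
#   start_if_stopped mysql-102.cs.uwaterloo.ca
```

Any state other than STOPPED or RUNNING is treated as an error so that a frozen or half-started container gets looked at by a person rather than restarted blindly.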
Mysql Config:
- ssh to the old master host.
- Check that the firewall is configured to reject incoming connections from MySQL clients. Ensure MySQL(REJECT) net $FW is NOT commented out at the bottom of /etc/shorewall/rules
- Restart the firewall by running shorewall restart
- edit /etc/mysql/conf.d/server-id.cnf to uncomment read_only
- "unbreak" /usr/sbin/mysqld if it was renamed to 'mysqld-x', possibly with: mv /usr/sbin/mysqld{-x,}
- Running service mysql start should work and result in "mysql start/running, process nnnnnn", not "start: Job failed to start".
- verify mysql is running clean: check /var/log/mysql/error.log for problems, and run mysql from the command-line. If it runs, onward to configuring as a slave.
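The two configuration checks in this section (the shorewall rule is active, and read_only is uncommented) could be automated roughly as follows. This is a hedged sketch, not a tested procedure: the function names are hypothetical, and the patterns assume each setting starts its line apart from leading whitespace, with "#" marking a commented-out line.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical; default paths are the ones named above.

# Succeeds if FILE contains an uncommented "MySQL(REJECT) net $FW" rule.
mysql_reject_active() {
    grep -Eq '^[[:space:]]*MySQL\(REJECT\)[[:space:]]+net[[:space:]]+\$FW' "$1"
}

# Succeeds if FILE contains an uncommented read_only (or read-only) setting.
read_only_active() {
    grep -Eq '^[[:space:]]*read[_-]only([[:space:]=]|$)' "$1"
}

# Report both checks; returns non-zero if either one fails.
check_old_master_config() {
    rules="${1:-/etc/shorewall/rules}"
    cnf="${2:-/etc/mysql/conf.d/server-id.cnf}"
    status=0
    mysql_reject_active "$rules" || {
        echo "MySQL(REJECT) rule not active in $rules" >&2; status=1; }
    read_only_active "$cnf" || {
        echo "read_only not active in $cnf" >&2; status=1; }
    return "$status"
}

# Example (on the old master, before running `service mysql start`):
#   check_old_master_config && echo "config looks ready"
```

The helper only reads the files; restarting shorewall and starting mysql are left as manual steps so their output can be reviewed.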
Configuring as a slave:
- currently this host is operating as a read-only master with no slaves.
- show slave status; should return an empty set.
- show master status; should return a binlog file and position. Note these results.
- The master binlog on this host will have stopped just before that position, hopefully cleanly. You can review the previous binlog file by running: mysqlbinlog /var/log/mysql/binlog.{nnnnnn} | less
- to configure as a slave, we need to find the binlog transaction on the new master which corresponds to the last transaction on this machine.
- If there is a clean match, the new master was caught up to this old master. If there isn't, you might need to be clever.
- Follow the above instructions for re-slaving 106 from 104: "mysql on 106: change master to 104 ( needs file position: run show master status on 104 )"
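As a rough illustration of the steps in this section, the sketch below records the output of show master status and builds a CHANGE MASTER TO statement from a binlog file and position. It is a sketch under assumptions, not the documented procedure: the function names, host name, file name and position are all hypothetical, and replication credentials (MASTER_USER, MASTER_PASSWORD) are omitted and would likely be needed. The position passed in should be the one found by matching transactions on the new master, not simply whatever the new master reports at the moment.

```shell
#!/bin/sh
# Sketch only: helper names, host names, file names and positions are hypothetical.

# Read tab-separated `SHOW MASTER STATUS` output on stdin (as printed by
# `mysql --batch --skip-column-names`) and print "FILE POSITION".
parse_master_status() {
    awk -F '\t' 'NR == 1 { print $1, $2 }'
}

# Print a CHANGE MASTER TO statement for host $1, binlog file $2, position $3.
# Refuses a non-numeric position rather than emitting broken SQL.
change_master_sql() {
    case "$3" in
        ''|*[!0-9]*) echo "position must be numeric: '$3'" >&2
                     return 1 ;;
    esac
    printf "CHANGE MASTER TO MASTER_HOST='%s', MASTER_LOG_FILE='%s', MASTER_LOG_POS=%s;\n" \
        "$1" "$2" "$3"
}

# Example with made-up values:
#   note this host's status:
#     mysql --batch --skip-column-names -e 'SHOW MASTER STATUS' | parse_master_status
#   review the generated statement, then feed it to mysql on the old master:
#     change_master_sql new-master.example binlog.000042 107
#   afterwards:
#     mysql -e 'START SLAVE; SHOW SLAVE STATUS\G'
```

The statement is printed rather than executed so that it can be checked against the binlog review before anything changes on the host.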