Restoring Broken Old Master After Failover

Caution: this process is meant to cover most situations; review the output of each command to make sure it makes sense. We haven't documented all of the edge cases of failover and recovery, so if there are missing parts, they should be filled in below.

  • decide that we're ready to fix the old master. This does NOT need to happen during or right after the emergency of failing over the database; it can wait until there's time.
  • if the old master lxc container (e.g., 102) is up and you can ssh in from linux.cscf, skip ahead to the "Mysql Config" section below. Otherwise:
  • To avoid some trial-and-error checking, find out how the old master was brought down; the restore depends in part on what was done at failover time. Check the ST for the failover, and possibly check with the staff person who failed over the machine. If there are no hints, work through the steps below in increasing order of complexity.

  • you need to know the IAAS lxc host, which you can find by checking the Infrastructure containers list: https://cs.uwaterloo.ca/cscf/internal/infrastructure/inventory/virtual-host-index/
    • If that list isn't conclusive, search for your LXC container in inventory and hopefully either the inventory record, or its eDocs page, will list its last known location.
  • if the IAAS lxc host is up and you can ssh in from linux.cscf, do so and restart the lxc container, in a similar fashion to:
ubuntu1404-202:~# ssh dc-3558-411
Welcome to Ubuntu 14.04.4 LTS (GNU/Linux 3.19.0-56-generic x86_64)

root@dc-3558-411:~# lxc-info --name mysql-102.cs.uwaterloo.ca
Name:           mysql-102.cs.uwaterloo.ca
State:          STOPPED
root@dc-3558-411:~# lxc-start -d -n mysql-102.cs.uwaterloo.ca
  • if the IAAS lxc host is not up, you can restart it through its management interface: find it in inventory and determine its lom-cs{barcode}.cs.uwaterloo.ca hostname.
See https://cs.uwaterloo.ca/cscf/internal/infrastructure/inventory/IAAS-hardware-index/ for current data. Last seen 2016-04-04:
Management interface                  MySQL system  Container hostname
https://lom-cs009701.cs.uwaterloo.ca  mysql-102     dc-3558-411.cloud.cs.uwaterloo.ca
https://lom-cs009728.cs.uwaterloo.ca  mysql-104     mc-3015-411.cloud.cs.uwaterloo.ca
https://lom-cs009732.cs.uwaterloo.ca  mysql-106     m3-3101-411.cloud.cs.uwaterloo.ca
Note that the lom-cs* interfaces can only be reached from systems on VLAN 15 (e.g., CSCF workstations).

    • visit the URL for the lom-cs hostname. Credentials for lom-cs* interfaces can be found in the CSCF safe under "LOM".
    • once logged in, confirm that the machine is powered off and select the "Power ON / OFF" option to turn it on. About a minute later, the LOM page should auto-update from a thumbnail of a booting machine to a thumbnail of a command prompt. The host should now be accessible via ssh; proceed with the "if the IAAS lxc host is up" item above.
  • if the machine is not ssh accessible after restarting, the ona port might have been shut off.
    • search for the IAAS lxc host's regular IP address (i.e., the IP for dc-3558-411, not the IP for the LOM) and find its port. On the port's page, look for its status (enabled or disabled); if disabled, a comment explaining why should be present. If it's something you are comfortable re-enabling, erase the comment and re-enable the port; hopefully ssh to the IAAS lxc host will then succeed and you may proceed with the "if the IAAS lxc host is up" item above.
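
The container restart step in the list above (check the state with lxc-info, then lxc-start only if it is stopped) can be sketched as a pair of small shell helpers. This is only a sketch: the function names container_state and restart_plan are hypothetical and not part of any CSCF tooling, and restart_plan only prints the command it would run rather than running it, so the output can be reviewed first.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not existing CSCF tooling.

# Read `lxc-info --name <container>` output on stdin and print the State value.
container_state() {
    awk '$1 == "State:" { print $2; exit }'
}

# Print the command to run (or a comment explaining why not) for a container.
# $1 = container name, $2 = state as reported by lxc-info
restart_plan() {
    case "$2" in
        STOPPED) echo "lxc-start -d -n $1" ;;
        RUNNING) echo "# $1 is already running; nothing to do" ;;
        *)       echo "# $1 is in state '$2'; investigate before starting" ;;
    esac
}
```

On the IAAS lxc host, one might run: state=$(lxc-info --name mysql-102.cs.uwaterloo.ca | container_state); restart_plan mysql-102.cs.uwaterloo.ca "$state" and then run the printed command by hand once it looks right.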

Mysql Config:

  • ssh to the old master host.
  • Check that the firewall is configured to reject incoming connections from MySQL clients: ensure the line MySQL(REJECT)   net            $FW at the bottom of /etc/shorewall/rules is NOT commented out.
  • Restart firewall by running shorewall restart
  • edit /etc/mysql/conf.d/server-id.cnf to uncomment read_only
  • "unbreak" /usr/sbin/mysqld if it was renamed to 'mysqld-x', possibly with mv /usr/sbin/mysqld{-x,}
  • service mysql start should work and report "mysql start/running, process nnnnnn", not "start: Job failed to start".
  • verify mysql is running clean: check /var/log/mysql/error.log for problems, and run mysql from the command-line. If it runs, onward to configuring as a slave.
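
Before starting mysql, the two configuration checks above (firewall rule active, read_only uncommented) can be verified mechanically. The following is a minimal sketch, assuming the file paths given in the steps above; the function names are hypothetical, and the checks only confirm that the lines are present and uncommented, not that shorewall has been restarted or what value read_only is set to.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical.

# Succeeds (exit 0) if the rules file has an uncommented
# "MySQL(REJECT) net $FW" line. $1 = path to the shorewall rules file.
mysql_reject_enabled() {
    grep -Eq '^[[:space:]]*MySQL\(REJECT\)[[:space:]]+net[[:space:]]+\$FW' "$1"
}

# Succeeds (exit 0) if the config file has an uncommented read_only line.
# Does not inspect the value, only that the line is not commented out.
# $1 = path to the mysql config fragment.
read_only_enabled() {
    grep -Eq '^[[:space:]]*read[_-]only' "$1"
}
```

For example: mysql_reject_enabled /etc/shorewall/rules || echo "MySQL(REJECT) rule is missing or commented out", and likewise read_only_enabled /etc/mysql/conf.d/server-id.cnf.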

Configuring as a slave:

  • currently this host is operating as a read-only master with no slaves.
  • show slave status; should return an empty set.
  • show master status; should return a binlog file and position. Note these results.
  • The master binlog on this host will have stopped just before that position, hopefully cleanly. You can review the previous binlog file by running mysqlbinlog /var/log/mysql/binlog.{nnnnnn} | less
  • to configure as a slave, we need to find the binlog transaction on the new master which corresponds to the last transaction on this machine.
    • If there is a clean match, the new master was caught up to this old master. If there isn't, you might need to be clever.
  • Follow the above instructions for re-slaving 106 from 104: "mysql on 106: change master to 104 ( needs file position: run show master status on 104 )"
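
As a rough illustration of the final step, the helpers below turn a binlog file and position into the CHANGE MASTER TO and START SLAVE statements to run on the old master. This is a sketch under assumptions, not the documented re-slaving procedure: the function names are hypothetical, the master hostname is whatever the new master is (passed as a parameter), and replication credentials (MASTER_USER, MASTER_PASSWORD) are deliberately omitted and must come from the re-slaving instructions referenced above. The coordinates passed in should be the ones you determined correspond to the last transaction on this host.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical.

# Read tab-separated `mysql -N -B -e 'show master status'` output on stdin
# and print "<binlog file> <position>".
master_coords() {
    awk -F'\t' 'NR == 1 { print $1, $2 }'
}

# Print the SQL that points this host at the new master.
# $1 = new master host, $2 = binlog file, $3 = binlog position
change_master_sql() {
    case "$3" in
        ''|*[!0-9]*)
            echo "position must be a non-negative integer" >&2
            return 1 ;;
    esac
    printf "CHANGE MASTER TO MASTER_HOST='%s', MASTER_LOG_FILE='%s', MASTER_LOG_POS=%s;\n" "$1" "$2" "$3"
    printf 'START SLAVE;\n'
}
```

Printing the SQL instead of executing it keeps a review step in the loop: inspect the statements, add the credentials, then paste them into mysql on the old master and confirm with show slave status.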
Topic revision: r1 - 2016-04-07 - DanielAllen
 