MySQLHARestoringBrokenMaster < CF

CF Web>MySQLHAOperationsManual>MySQLHARestoringBrokenMaster (2016-04-07, DanielAllen)
---+++ Restoring Broken Old Master After Failover

Caution: this process is meant to cover most situations, but review the output of every command to make sure it makes sense. We haven't documented all of the edge cases of failover and recovery, so if you find missing parts, fill them in below.
   * Decide that we're ready to fix the old master. This does NOT need to happen during or right after the emergency of failing over the database; it can wait until there's time.
   * If the old master lxc container (e.g., 102) is up and you can ssh in from linux.cscf, skip ahead to the "Mysql Config" section below. Otherwise:
   * To avoid some trial-and-error checking, find out how the old master was brought down, since the restore depends in part on what was done. Check the ST for the failover, and possibly check with the staff person who failed over the machine. If there are no hints, work through the steps below in increasing order of complexity.

   * you need to know the IAAS lxc host, which you can find by checking the Infrastructure containers list: https://cs.uwaterloo.ca/cscf/internal/infrastructure/inventory/virtual-host-index/ 
      * If that list isn't conclusive, search for your LXC container in inventory and hopefully either the inventory record, or its !eDocs page, will list its last known location.
   * if the IAAS lxc host is up and you can ssh in from linux.cscf, do so and restart the lxc container, in a similar fashion to: 
<verbatim>
ubuntu1404-202:~# ssh dc-3558-411
Welcome to Ubuntu 14.04.4 LTS (GNU/Linux 3.19.0-56-generic x86_64)

root@dc-3558-411:~# lxc-info --name mysql-102.cs.uwaterloo.ca
Name:           mysql-102.cs.uwaterloo.ca
State:          STOPPED
root@dc-3558-411:~# lxc-start -d -n mysql-102.cs.uwaterloo.ca
</verbatim>
   * if the IAAS lxc host is not up, you can restart it by: finding it in inventory and determining its =lom-cs{barcode}.cs.uwaterloo.ca= hostname. 
See https://cs.uwaterloo.ca/cscf/internal/infrastructure/inventory/IAAS-hardware-index/ for current data. Last seen 2016-04-04: 
| *Management interface* | *MySQL system* | *IAAS lxc host* |
| https://lom-cs009701.cs.uwaterloo.ca | mysql-102 | dc-3558-411.cloud.cs.uwaterloo.ca |
| https://lom-cs009728.cs.uwaterloo.ca | mysql-104 | mc-3015-411.cloud.cs.uwaterloo.ca |
| https://lom-cs009732.cs.uwaterloo.ca | mysql-106 | m3-3101-411.cloud.cs.uwaterloo.ca |
Note that the lom-cs* interfaces can only be reached from systems on VLAN 15 (eg. CSCF workstations).  

      * visit the URL for the =lom-cs= hostname. Credentials for lom-cs* interfaces can be found in CSCF safe under "LOM".
      * Once logged in, confirm that the machine is powered off and select the "Power ON / OFF" option to turn it on. About a minute later, the LOM page should auto-update from a thumbnail of a booting machine to a thumbnail of a command prompt. The machine should now be accessible via ssh; proceed with the "if the IAAS lxc host is up" item above.
   * If the machine is still not accessible via ssh after restarting, the ona port might have been shut off.
      * Search for the IAAS lxc host's regular IP address (i.e., the IP for dc-3558-411, not the IP for the LOM) and find its port. On the port's page, look for its status (enabled or disabled); if it is disabled, a comment explaining why should be present. If it's something you are comfortable re-enabling, erase the comment and re-enable the port; hopefully ssh to the IAAS lxc host will then succeed and you may proceed with the "if the IAAS lxc host is up" item above.
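
The container restart step above can be sketched as a small shell helper. This is only an illustration under assumptions: the function names =lxc_state= and =start_if_stopped= are hypothetical, and it assumes =lxc-info= prints a =State:= line in the format shown in the transcript above. Run it as root on the IAAS lxc host.

```shell
# Hypothetical helper: read `lxc-info` output on stdin and print the
# value of the "State:" line (e.g. STOPPED or RUNNING).
lxc_state() {
    awk '$1 == "State:" { print $2 }'
}

# Hypothetical helper: start the named container only if it is stopped.
# Uses the same lxc-info / lxc-start invocations as the transcript above.
start_if_stopped() {
    name=$1
    state=$(lxc-info --name "$name" | lxc_state)
    case $state in
        STOPPED) lxc-start -d -n "$name" ;;
        RUNNING) echo "$name is already running" ;;
        *)       echo "unexpected state for $name: '$state'" >&2
                 return 1 ;;
    esac
}

# Example (not executed here):
#   start_if_stopped mysql-102.cs.uwaterloo.ca
```

Checking the state first keeps the helper from calling =lxc-start= on a container that is already running. Any state other than STOPPED or RUNNING is reported for manual review, not guessed at.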

Mysql Config:
   * ssh to the old master host.
   * Check that the firewall is configured to reject incoming connections from MySQL clients. Ensure the =MySQL(REJECT) net $FW= rule is NOT commented out at the bottom of =/etc/shorewall/rules=.
   * Restart firewall by running =shorewall restart= 
   * edit =/etc/mysql/conf.d/server-id.cnf= to uncomment =read_only= 
   * "unbreak" =/usr/sbin/mysqld= if it was renamed to =mysqld-x=, possibly with =mv /usr/sbin/mysqld{-x,}=
   * =service mysql start= should work and result in "mysql start/running, process nnnnnn" not "start: Job failed to start".
   * verify mysql is running clean: check =/var/log/mysql/error.log= for problems, and run =mysql= from the command-line. If it runs, onward to configuring as a slave.
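
The two configuration checks above (the firewall rule and =read_only=) can be verified before starting MySQL. This is a minimal sketch under assumptions: the function names are hypothetical, and it treats any line starting with =#= as commented out. It only reads the files and changes nothing.

```shell
# Hypothetical helper: succeed if the MySQL(REJECT) rule is present in the
# given shorewall rules file and is not commented out.
mysql_reject_active() {
    grep -Eq '^[[:space:]]*MySQL\(REJECT\)[[:space:]]+net[[:space:]]+\$FW' "$1"
}

# Hypothetical helper: succeed if read_only (or read-only) appears
# uncommented in the given MySQL option file.
read_only_active() {
    grep -Eq '^[[:space:]]*read[_-]only([[:space:]]|=|$)' "$1"
}

# Example (not executed here):
#   mysql_reject_active /etc/shorewall/rules || echo "firewall rule missing or commented out"
#   read_only_active /etc/mysql/conf.d/server-id.cnf || echo "read_only is still commented out"
#   [ -e /usr/sbin/mysqld-x ] && echo "mysqld is still renamed to mysqld-x"
```

If either check fails, fix the file by hand as described in the list above, then rerun =shorewall restart= or =service mysql start= as appropriate.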

Configuring as a slave:
   * currently this host is operating as a read-only master with no slaves.
   * =show slave status;= should return an empty set.
   * =show master status;= should return a binlog file and position. Note these results.
   * The master binlog on this host will have stopped just before that position, hopefully cleanly. You can review the previous binlog file by running =mysqlbinlog /var/log/mysql/binlog.{nnnnnn} |less=
   * to configure as a slave, we need to find the binlog transaction on the new master which corresponds to the last transaction on this machine. 
      * If there is a clean match, the new master was caught up to this old master. If there isn't, you might need to be clever.
   * Follow the above instructions for re-slaving 106 from 104: "mysql on 106: change master to 104 ( needs file position: run show master status on 104 )"
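
The status checks and the final =change master= step above can be sketched in shell. This is only an illustration under assumptions: the function names are hypothetical, the host name, binlog file name, and position in the example are placeholders, and the replication credentials (=MASTER_USER=, =MASTER_PASSWORD=) are deliberately left out and must come from the re-slaving instructions. Always confirm the file and position by hand before running the generated statement.

```shell
# Hypothetical helper: read tab-separated `SHOW MASTER STATUS` output on
# stdin (as produced by `mysql -N -B`) and print "<binlog file> <position>".
parse_master_status() {
    awk -F'\t' 'NR == 1 { print $1, $2 }'
}

# Hypothetical helper: print a CHANGE MASTER TO statement for the given
# master host, binlog file, and position. Credentials are NOT included.
change_master_sql() {
    printf "CHANGE MASTER TO MASTER_HOST='%s', MASTER_LOG_FILE='%s', MASTER_LOG_POS=%s;\n" \
        "$1" "$2" "$3"
}

# Example (not executed here; values are placeholders):
#   mysql -e 'SHOW SLAVE STATUS\G'                 # expect an empty set on the old master
#   mysql -N -B -e 'SHOW MASTER STATUS' | parse_master_status
#   change_master_sql mysql-104.cs.uwaterloo.ca binlog.000042 107
```

Printing the statement instead of executing it leaves room to compare the position against the new master's binlog before committing to it.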
Topic revision: r1 - 2016-04-07 - DanielAllen
Information in this area is meant for use by CSCF staff and is not official documentation, but anybody who is interested is welcome to use it if they find it useful.