MySQLHAMasterFailure < CF

CF Web>MySQLHAOperationsManual>MySQLHAMasterFailure (2016-10-25, FraserGunn)
---++ Failure of the Master Server
If you have performed [[#Problem_Diagnosis][problem diagnosis]] and believe the Master Server has failed, these instructions will assist with emergency recovery.

---+++ Can you quickly determine why the Master Server has failed?
If you know why the master server has failed, or a few minutes of diagnosis makes it obvious, consider whether fixing the server will take less than approximately 10 minutes. That is roughly how long it takes to promote one of the slaves to master. Making this assessment requires a few minutes of checking; the following checks are a starting point.
   *  !SSH into the master
      *  Check that / and /data are mounted and not full
         *  If either filesystem is full and cannot be written to, log into the container host and use the logical volume and filesystem tools to extend the partitions and filesystems.
      *  Check whether the mysql daemon is running
      *  Check the firewall rules in /etc/shorewall/rules. On the master, the line reading =MySQL(REJECT)  net $FW= MUST be commented out
      *  Check /var/log/mysql.err and /var/log/mysql.log for the cause of the failure/shutdown
   *  If !SSH to the master fails
      *  Check that the host is receiving power, using LOM
      *  Check that the host is up, using LOM
      *  Check that the host's network port is up, using ONA
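
The on-host checks above can be partly scripted. The following is a minimal sketch, not a tested tool: the function names (=fs_usage_pct=, =reject_rule_active=, =check_master=) and the 95% threshold are illustrative assumptions, while the paths and the rule text come from the checklist. It relies only on the standard =df=, =awk=, =grep= and =mountpoint= utilities.

```shell
#!/bin/sh
# Sketch of the on-master checks; names and threshold are hypothetical.

# Print the usage percentage (digits only) of the filesystem holding $1.
fs_usage_pct() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed if an uncommented MySQL(REJECT) rule is present in rules file $1.
reject_rule_active() {
    grep -Eq '^[[:space:]]*MySQL\(REJECT\)' "$1"
}

# Run the checks and print one WARN line per finding.
check_master() {
    rules=${1:-/etc/shorewall/rules}
    for fs in / /data; do
        mountpoint -q "$fs" || echo "WARN: $fs is not a mount point"
        pct=$(fs_usage_pct "$fs" 2>/dev/null)
        if [ -z "$pct" ]; then
            echo "WARN: cannot read usage of $fs"
        elif [ "$pct" -ge 95 ]; then
            echo "WARN: $fs is ${pct}% full"
        fi
    done
    if reject_rule_active "$rules"; then
        echo "WARN: MySQL(REJECT) is active in $rules; clients are blocked"
    fi
    return 0
}
```

Running =check_master= on the master prints nothing when all of these checks pass. It does not replace reading the mysql logs or checking whether the daemon is running.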

---+++ Recover Master Server without performing failover
If it looks like recovery will be speedy, you can proceed with the repair, leaving the database down while you do so. Rebooting the server will require manually starting mysql, because mysql is deliberately not enabled on boot, to prevent multiple masters from starting up at the same time. =service mysql start= should work.
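
After a manual start it can be useful to wait until the server actually answers before declaring the repair done. The following is a minimal sketch under stated assumptions: =wait_for= is a hypothetical helper name, and the retry count in the usage note is an arbitrary choice. It uses =mysqladmin ping=, which exits with status 0 when the server is running.

```shell
#!/bin/sh
# Hypothetical helper: retry a command once per second until it
# succeeds or the given number of attempts is used up.
wait_for() {
    tries=$1
    shift
    while [ "$tries" -gt 0 ]; do
        "$@" >/dev/null 2>&1 && return 0
        tries=$((tries - 1))
        # Only sleep if another attempt will follow.
        if [ "$tries" -gt 0 ]; then
            sleep 1
        fi
    done
    return 1
}
```

For example, =# service mysql start && wait_for 30 mysqladmin ping= returns success once the server responds, or failure after 30 attempts.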

---+++ Failover: a Slave becomes Master, and Old Master Becomes a Slave
Caution: follow this procedure carefully; there are subtle reasons for each of these operations and for the order in which they are performed.
   *  Decide that failover is appropriate (see above).
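   *  Check that the slaves are equally up to date: on both, run =mysql> show slave status\G= and confirm that =Relay_Master_Log_File:= and =Exec_Master_Log_Pos:= have the same values. (=Relay_Log_Pos:= is an offset into each slave's own relay log and is not directly comparable between slaves.) If they differ, choose the slave that is further ahead to become the new master.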
   * Find the IP address of the new master.
<verbatim>
# for i in mysql-102 mysql-104 mysql-106 ; do host "$i".cs.uwaterloo.ca ; done
mysql-102.cs.uwaterloo.ca has address 172.19.152.27
mysql-104.cs.uwaterloo.ca has address 172.19.154.27
mysql-106.cs.uwaterloo.ca has address 172.19.155.27
</verbatim>
   *  Redirect clients to use the new master via DNS
      * Use Infoblox to change the IP address and PTR record for mysql.cs to the new master.
         * This can be done either in the Infoblox GUI or via the [[CFPrivate.InventoryInfobloxCLIDocs][Infoblox Command Line Interface]].
            * Via the [[https://nsbuild.uwaterloo.ca/ui/][Infoblox GUI]]: search for "mysql.cs" with the search tool in the upper right corner of the interface. Open the record, find the IP address, and update it to the new master's IP address. Then find the PTR record and update it to the new master's IP address as well.
   *  Take down former master and ensure it cannot come back by accident. *This must happen before new master becomes writeable*.
      * If possible: on former master, stop salt and prevent it from restarting by accident. (Bug: The -disabled part doesn't actually disable it.)
<verbatim>
# service salt-minion stop
# mv /etc/init.d/salt-minion{,-disabled}
</verbatim>
      * If possible: on former master change firewall to block all client traffic: 
         * Uncomment =MySQL(REJECT)	net				$FW= on final line of /etc/shorewall/rules and then restart firewall by running =shorewall restart=
      * Of the following 4 options, do only the first one possible:
         * On the former master =# service mysql stop= and =mv /etc/init.d/mysql{,-disabled}= to prevent mysql from restarting as master automatically
         * Stop the lxc container and break it by altering its config file. Run =# lxc-info --name mysql-102.cs.uwaterloo.ca= and verify =State: STOPPED=. If it is not stopped, force a stop with =# lxc-stop --name mysql-102.cs.uwaterloo.ca= and recheck the state. Do not proceed until the former master is verified stopped.
         * Take down the IAAS lxc host and prevent it from starting
         * Stop the world from reaching the host by shutting off its port in ONA (with a comment): search for the former master's port at [[https://istns.uwaterloo.ca/ona/search.php][ONA]] by MAC/IP address: =mysql-102 = 00:16:3e:20:7b:cf/10.15.152.27, mysql-104 = 00:16:3e:86:2b:a9/10.15.154.27, mysql-106 = 00:16:3e:3c:8d:61/10.15.155.27=
   * Once the above is done, run =# service salt-minion stop= on the former slaves to avoid changes being pushed from the salt master during these steps.
   * Firewall rule change: comment out =MySQL(REJECT)	net				$FW= on the new master and then restart the firewall by running =shorewall restart=. At this point clients that have transitioned to the new master (mysql-104 in this example) will find that it works but is read-only.
   * mysql on the other slave: change its master to the new master. This needs the binary log coordinates: run =show master status= on the new master and record the log file name and log position.
      * Replace =MASTER_PASSWORD= with the password recorded on each host in =/etc/mysql/{$host}_slave.cnf= (the passwords are the same for all three hosts).
(Bug: What is the second CHANGE MASTER line for?)
<verbatim>
mysql-106# mysql -e 'stop slave'
mysql-106# mysql -e 'CHANGE MASTER TO MASTER_HOST="mysql-104.cs.uwaterloo.ca",MASTER_USER="mysql-106_slave", MASTER_PASSWORD="{$slave_password}", MASTER_LOG_FILE="binlog.000002", MASTER_LOG_POS=332'
mysql-106# mysql -e 'CHANGE MASTER TO MASTER_HOST="mysql-104.cs.uwaterloo.ca",MASTER_USER="mysql-106_slave", MASTER_LOG_FILE="binlog.000002", MASTER_LOG_POS=332'
mysql-106# mysql -e 'start slave'
</verbatim>
   * mysql on the new master: run =stop slave; reset slave all;= then edit =/etc/mysql/conf.d/server-id.cnf= to comment out =read_only= and run =service mysql restart=. At this point, the new master (mysql-104 in this example) will have active client writes.
   * Typically, the most important client host is the web server, currently cs.uwaterloo.ca.  If it still has the old IP address cached, clear the cache.
      * On the web server: =nscd --invalidate hosts=
   * [[MySQLHANormalOperation][Test normal operation]], especially the example =pt-table-checksum= run on the new master
      * Check that apps are working. This includes CSCF Inventory: https://cs.uwaterloo.ca/cscf/internal/inventory/ and ST.
   * On both the new master and slave (but NOT the old master): =service salt-minion start=
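
The step that repoints the remaining slave depends on copying the log file name and position from =show master status= without a transcription error. The following is a minimal sketch of generating the statement instead of typing it. The function name =build_change_master= is hypothetical, and it assumes the status line was captured with =mysql -N -B -e 'show master status'=, which prints tab-separated fields with the file name first and the position second. The password placeholder is left exactly as in the procedure, to be replaced by hand.

```shell
#!/bin/sh
# Hypothetical helper: build a CHANGE MASTER statement from one
# tab-separated line of `show master status` output.
# Arguments: new master host, replication user, status line.
build_change_master() {
    master_host=$1
    slave_user=$2
    status_line=$3
    log_file=$(printf '%s\n' "$status_line" | cut -f1)
    log_pos=$(printf '%s\n' "$status_line" | cut -f2)
    # Refuse to emit a statement if the position is not a plain number.
    case $log_pos in
        ''|*[!0-9]*)
            echo "bad log position: '$log_pos'" >&2
            return 1
            ;;
    esac
    printf 'CHANGE MASTER TO MASTER_HOST="%s", MASTER_USER="%s", MASTER_PASSWORD="{$slave_password}", MASTER_LOG_FILE="%s", MASTER_LOG_POS=%s\n' \
        "$master_host" "$slave_user" "$log_file" "$log_pos"
}
```

For example, passing =mysql-104.cs.uwaterloo.ca=, =mysql-106_slave= and a status line of =binlog.000002= and =332= separated by a tab prints a statement with the same coordinates as the example in the procedure. Review the printed statement before running it between =stop slave= and =start slave=.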

The normal procedure is that the new master stays master; the former master is repaired as required and returned to service as a slave of the new master. This completes the emergency process. The process of [[MySQL HA Restoring Broken Master]] can be carried out later by trained personnel.
Topic revision: r4 - 2016-10-25 - FraserGunn
Information in this area is meant for use by CSCF staff and is not official documentation, but anybody who is interested is welcome to use it if they find it useful.