Meeting 3 March 2016, 2pm
Attended: drallen a2brenna fhgunn
Agenda:
- Progress on Milestones
- Timeline
- Brief summary since the last meeting
Progress on Milestones:
- Backups set up and tested (due date reset to today). Remaining: testing restore of master? (Is this different from restoring a slave?)
Timeline
- Moving inventory: likely not finished until Friday (18th), with more testing done in the meantime.
- Schedule has slipped further and now shows wrapup around April 7, where:
- Mon 14 Mar - Fri 18 Mar is moving inventory (Fraser and Daniel);
- Mon 21 Mar - Wed 23 Mar is moving the rest of apps (Fraser and Daniel);
- Thu 24 Mar - Tue 29 Mar is benchmarking apps (Lori and Daniel);
- Tue 22 Mar - Tue 29 Mar is finalizing monitoring, tuning, maintenance documentation (Anthony, Fraser, Daniel). Fraser is only available until the 24th (he is here the following week, but realtime lab may interfere). Furthermore, Fraser has new commitments taking priority: spending for the remainder of this FY (must happen before the 31st).
- Tue 29 Mar - Thu 7 Apr is wrapup: fixing remaining issues, writing up lessons learned, proposed plans for student cluster, handoff to Ken.
Brief summary since the last meeting, and upcoming week:
Fraser:
- has just learned that he has new commitments taking priority: spending for the remainder of this FY (must happen before the 31st).
- did backup work with Anthony, including adding staggered backups; remaining:
- editing MySQLHATesting, failover component (the rest is complete). Done!
- deciding which process to use to copy inventory; will work on this later today; Daniel and Anthony to review.
- mysql config merging.
Anthony:
- did slave recovery;
- wrote draft of slave recovery from backup, documented in ST. (we have a checklist for backup; need a checklist for failover)
- There might be a faster method of recovery using pt-table-sync; we might investigate that later. (A third method is recovery from dump.)
- We're agreed we won't use backups to recover the master: instead always recover slave, make slave into master, and recover the old master as a slave.
- Anthony would like to log backup success - not critical for project going into production but nice to have.
- will do shorewall saltifying next.
- slaves: default deny mysql port; allow from 102,104,106,nagios-202,nagios-204,cacti,asgard
- document that users on cscfnet need to go to asgard to connect to mysql slaves
- and will then continue work on manual, including normal operations.
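The slave firewall policy above (default deny on the mysql port, with a short allow list) can be sketched as a simple lookup. This is only an illustrative model, not shorewall or salt syntax; the function name and the ACCEPT/DROP strings are assumptions, and the source names are copied as written in the notes.

```python
# Hypothetical model of the agreed policy for the mysql port on the slaves.
# Source names are taken verbatim from the notes; everything else is assumed.
ALLOWED_MYSQL_SOURCES = frozenset(
    {"102", "104", "106", "nagios-202", "nagios-204", "cacti", "asgard"}
)


def mysql_port_decision(source: str) -> str:
    """Return ACCEPT for allow-listed sources, DROP for everyone else."""
    # Default deny: anything not explicitly listed is dropped.
    return "ACCEPT" if source in ALLOWED_MYSQL_SOURCES else "DROP"
```

Under this model a host on cscfnet that is not in the list is dropped, which is why those users need to go through asgard.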
Daniel tested slave fail/recovery from master. Remaining on that item are Fraser's edits to MySQLHATesting failover, which we will do now.
Restarting the container automatically has a failure mode if the master goes down, because failover has a window where clients can be hitting multiple master servers.
Agree again to set auto_start to no on all three (/var/lib/lxc/container_name/{fstab,rootfs/,config}); sysadmin goes to mysqladmin on the container to restart manually.
Fraser notes: we need to consider the sysadmin who doesn't follow our directions (e.g., performs steps out of order). We want a warning at the top of our operations manual saying that the steps must be followed in order, and to deviate only if you understand the rationale for the order.
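One way to make the ordering concern concrete is a checklist that refuses out-of-order steps. The sketch below is hypothetical: the class and its method names are invented for illustration, and the step names only paraphrase the master recovery order agreed earlier in these notes (recover slave, make slave into master, recover old master as slave). It is not a proposal for the actual tooling.

```python
from typing import List, Optional, Sequence

# Step names paraphrase the agreed master recovery order; they are labels only.
MASTER_RECOVERY_STEPS = (
    "recover slave",
    "promote slave to master",
    "recover old master as slave",
)


class OrderedChecklist:
    """Hypothetical helper: steps may only be completed in the listed order."""

    def __init__(self, steps: Sequence[str]) -> None:
        self.steps = tuple(steps)
        self.done: List[str] = []

    def next_step(self) -> Optional[str]:
        """Return the next expected step, or None when everything is done."""
        if len(self.done) < len(self.steps):
            return self.steps[len(self.done)]
        return None

    def complete(self, step: str) -> None:
        """Mark a step complete; raise if it is not the next expected step."""
        expected = self.next_step()
        if expected is None:
            raise RuntimeError("all steps are already complete")
        if step != expected:
            raise RuntimeError(f"out of order: expected {expected!r}, got {step!r}")
        self.done.append(step)
```

Calling `complete("promote slave to master")` on a fresh checklist raises an error, which is the behaviour the manual's warning is meant to encourage by hand.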
Future work:
- Current architecture doesn't have a clear way to autostart the master. We would like a way to autostart the container; noted for future work (possibly Ceph/Paxos or Raft).
- Inventory cannot handle two hostnames pointing at the same IP address. We could add a second IP to each host.
--
DanielAllen - 2016-03-17