Meeting 3 March 2016, 2pm

Attended: drallen, a2brenna, fhgunn

Agenda:

  • Progress on Milestones
  • Timeline
  • Brief summary since the last meeting

Progress on Milestones:

  • Backups set up and tested (due date reset to today). Remaining: testing restore of the master (is this different from restoring a slave?).

Timeline

  • Moving inventory: likely not finished until Friday (18th), with more testing done in the meantime.
    • The schedule has slipped further and now shows wrapup around April 7, where:
      • Mon 14 Mar - Fri 18 Mar is moving inventory (Fraser and Daniel);
      • Mon 21 Mar - Wed 23 Mar is moving the rest of apps (Fraser and Daniel);
      • Thu 24 Mar - Tue 29 Mar is benchmarking apps (Lori and Daniel);
      • Tue 22 Mar - Tue 29 Mar is finalizing monitoring, tuning, maintenance documentation (Anthony, Fraser, Daniel). Fraser is only available until the 24th (he is here the following week, but the realtime lab may interfere). Furthermore, Fraser has new commitments taking priority: spending for the remainder of this FY (must happen before the 31st).
      • Tue 29 Mar - Thu 7 Apr is wrapup: fixing remaining issues, writing up lessons learned, proposed plans for student cluster, handoff to Ken.

Brief summary since the last meeting, and upcoming week:

Fraser:

  • has just learned that he has new commitments taking priority: spending for the remainder of this FY (must happen before the 31st).
  • did backup work with Anthony, including adding staggered backups; remaining:
    • editing MySQLHATesting: failover component (the rest is complete). Done!
    • deciding which process to use to copy inventory: will work on this later today; Daniel and Anthony to review.
    • mysql config merging.

Anthony:

  • did slave recovery;
  • wrote draft of slave recovery from backup, documented in ST (we have a checklist for backup; we need a checklist for failover).
    • There might be a faster method of recovery using pt-table-sync; we might investigate that later (method 3 is recovery from dump).
    • We agreed we won't use backups to recover the master: instead, always recover the slave, make the slave into the master, and recover the old master as a slave.
    • Anthony would like to log backup success - not critical for project going into production but nice to have.
  • will do shorewall saltifying next.
    • slaves: default deny mysql port; allow from 102,104,106,nagios-202,nagios-204,cacti,asgard
    • document that users on cscfnet need to go to asgard to connect to mysql slaves
  • and will then continue work on manual, including normal operations.
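
The slave firewall policy in the shorewall item above (default deny on the mysql port, allow only the listed sources) can be modelled as a small allow-list check. This is a minimal sketch for discussion, not the actual shorewall or salt configuration: the source names are copied verbatim from the notes, while the function name and the test host name are hypothetical.

```python
# Sketch of the slave firewall policy from the minutes: default deny on the
# mysql port, with an explicit allow-list of sources.
# Source names are copied from the notes; what each name resolves to
# (host, subnet, alias) is not stated there and is not assumed here.
ALLOWED_MYSQL_SOURCES = frozenset(
    ["102", "104", "106", "nagios-202", "nagios-204", "cacti", "asgard"]
)


def mysql_access_allowed(source: str) -> bool:
    """Return True only if `source` is on the allow-list (default deny)."""
    return source in ALLOWED_MYSQL_SOURCES


if __name__ == "__main__":
    # "some-cscfnet-host" is a hypothetical name used only for illustration.
    for src in ("asgard", "cacti", "some-cscfnet-host"):
        verdict = "allow" if mysql_access_allowed(src) else "deny"
        print(f"{src}: {verdict}")
```

Under this model a host on cscfnet that is not itself on the list is denied, which is why the notes say such users need to go to asgard to reach the mysql slaves.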

Daniel tested slave fail/recovery from master. Remaining on that item are Fraser's edits to the MySQLHATesting failover section, which we will do now.

Restarting the container automatically has a failure mode if the master goes down, because failover has a window where clients can be hitting multiple master servers. Agreed again to set auto_start to no on all three ( /var/lib/lxc/container_name/{fstab,rootfs/,config} ); the sysadmin goes to mysqladmin on the container to restart manually.
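
The reasoning behind the manual restart can be illustrated with a small check that a sysadmin (or a future script) might apply before bringing a container back as master. This is only a sketch under stated assumptions: the `Node` record, the node names db1/db2 and the function name are hypothetical, and nothing here reflects an existing tool in the project.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    """Hypothetical view of one MySQL container as seen by the sysadmin."""
    name: str
    running: bool
    is_master: bool  # True if the node is configured to act as master


def safe_to_start_as_master(candidate: str, nodes: List[Node]) -> bool:
    """Return True only if no other running node is already acting as master.

    This encodes the concern from the notes: an automatic restart during
    failover could leave clients hitting more than one master.
    """
    return not any(
        n.running and n.is_master and n.name != candidate for n in nodes
    )


if __name__ == "__main__":
    # db1 went down and db2 was promoted during failover (illustrative names).
    nodes = [
        Node("db1", running=False, is_master=True),
        Node("db2", running=True, is_master=True),
    ]
    print(safe_to_start_as_master("db1", nodes))
```

In the example, db1 must not come back as master while db2 is serving as master; it would first have to be recovered as a slave, in line with the recovery order agreed above.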

Fraser notes: we need to consider the sysadmin who doesn't follow our directions (e.g., steps in proper order). We want a warning at the top of our operations manual saying that the steps have to be followed in order, and to deviate only if you understand the rationale for the order.

Future work:

  • The current architecture doesn't have a clear way to autostart the master. We would like to have a way to autostart the container; noted for Future Work (possibly Ceph/Paxos or Raft).
  • Inventory cannot handle two hostnames pointing at the same IP address. We could add a second IP to each host.

-- DanielAllen - 2016-03-17

Topic revision: r3 - 2016-03-17 - DanielAllen
 