Meeting 18 February 2016, 2pm

Attended: drallen, dlgawley, a2brenna, fhgunn, ldpaniak

Agenda:

  • Progress on Milestones
  • Timeline
  • Brief summary since the last meeting

Progress on Milestones:

  • Testing Failover: Was due 17 Feb. Decided it can be done in parallel; work in progress. Due date reset to tomorrow (19 Feb).
  • Backups set up and tested (due Monday 22 Feb): started by Guoxiang.

Timeline

  • Keeping to the schedule set last week:
    • started moving the first database (inventory) yesterday, Feb. 17

Brief summary since the last meeting

  • Anthony has brought up mysql-106.
  • Networking is not 40Gb yet, but Devon has done the necessary hardware upgrade; at least a week away (?)
    • Do we need contingency plans if the 40Gb connection isn't up? We can first upgrade the slave and test that it works.
    • Anthony to look into what the interfaces on these machines are.
      • What will be the connection to the rest of CS, e.g. www.cs? 1Gb? 10Gb? How many hops?
  • Setting up backups has started but is not complete; Guoxiang has completed his part; Fraser still has the Percona configuration remaining.
    • Fraser will request that we have filesystem backups as well, since the current backup setup doesn't include marmoset.
  • Daniel has set up nagios monitoring for mysql-102. Daniel will add 104 and 106 once Anthony has verified that the MachineNotes are correct.
    • Anthony will put his documentation under the Mysql page
  • Fraser and Daniel are doing a sync of inventory using the same process (and same data transfer) as the stage two sync; this will give us timing for the stage two move. We'll transfer inventory twice; we plan to break and throw away the first copy.
    • Fraser and Anthony have a plan to slave the cluster to the old mysql.cs, which we will test with inventory.
    • If things go very well, Fraser and I may be able to start moving the rest of the databases on Wednesday Feb. 24, though the schedule suggests Fri. Feb 26 - Tues. Mar 1 (CSCF staff meeting). Agreed we will aim for Wednesday.
  • Daniel will set the CNAME TTL to 1 minute, but we can also flush the cache for www.cs and www.math (are there other clients? Daniel to investigate).
  • Anthony will start working on health monitoring; Fraser suggested moving some of the monitoring to the webserver.
  • How do we want to test the database?
    • need lots of additional tests
    • Complete failover from Master-Master (and back)
    • physical failures, such as unplugging networking
    • breaking software raid? dd on disk?
    • contriving a race condition? (writing to both masters at the same time: what does mysql do?)
    • Anthony will do lower-level testing; and Daniel will do application-level testing.
    • We agree to record exactly (briefly) what we tried, and put it into the wiki.
      • our documentation for failover/fixing needs to include the tests we do to make sure we're OK (e.g. before bringing up the old master)
  • (out of scope) Anthony wants to do haproxy auto-failover; this might be very easy. So we shouldn't put much effort into designing our own fancy A-record-based recovery option; this is for after this project is complete.
  • Semi-synchronous replication could be "bolted on the end", except we would want to do lots of testing. So it should be part of the follow-on project, unless we get bored and want to put in semi-synchronous.
  • we discussed project wrap-up procedures and documentation. Keep a log of things you want to put in the "lessons learned."
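
One possible way to keep the agreed test record brief and consistent is sketched below. It is only an illustration: the `TrialRecord` fields, the `to_wiki_bullets` helper, and the placeholder entry are hypothetical and were not decided at the meeting.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TrialRecord:
    """One failover/robustness test, recorded briefly (hypothetical fields)."""
    tester: str    # who ran the test
    what: str      # what was tried
    expected: str  # what we expected to happen
    observed: str  # what actually happened


def to_wiki_bullets(records: List[TrialRecord]) -> str:
    """Render records as nested bullets in the style used in these minutes."""
    lines = []
    for r in records:
        lines.append(f"  • {r.what} ({r.tester})")
        lines.append(f"    • expected: {r.expected}")
        lines.append(f"    • observed: {r.observed}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder entry only; not a real test result.
    example = TrialRecord(
        tester="tester-name",
        what="description of what was tried",
        expected="expected outcome",
        observed="observed outcome",
    )
    print(to_wiki_bullets([example]))
```

Recording expected and observed outcomes separately would make it easier to pull surprises into the "lessons learned" log later.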

Next Meeting

  • Thu 25 Feb - NOTE ROOM CHANGE: DC 2564

-- DanielAllen - 2016-02-17

Topic revision: r5 - 2016-02-19 - DanielAllen
 