Meeting 18 February 2016, 2pm
Attended: drallen, dlgawley, a2brenna, fhgunn, ldpaniak
Agenda:
- Progress on Milestones
- Timeline
- Brief summary since the last meeting
Progress on Milestones:
- Testing Failover: Was due 17 Feb. Decided it can be done in parallel; work in progress. Reset to be due tomorrow (19 Feb).
- Backups set up and tested (Monday 22 Feb): started by Guoxiang.
Timeline
- Keeping to the schedule set last week:
- Started moving the first database (inventory) yesterday, Feb. 17.
Brief summary since the last meeting
- Anthony has brought up mysql-106.
- Networking is not 40Gb yet, but Devon has done the necessary hardware upgrade; at least a week away (?)
- Do we need contingency plans if the 40Gb connection isn't up? We can first upgrade the slave and test it works.
- Anthony to look into what the interfaces on these machines are.
- What will be the connection to the rest of CS, e.g. www.cs? 1Gb? 10Gb? How many hops?
- Backup setup has started but is not complete; Guoxiang has completed his part; Fraser still has the Percona configuration to do.
- Fraser will request that we have filesystem backups as well, since the backup doesn't include marmoset.
- Daniel has set up nagios monitoring for mysql-102. Daniel will add 104 and 106 once Anthony has verified the MachineNotes are correct.
- Anthony will put his documentation under the Mysql page
- Fraser and Daniel are doing a sync of inventory that includes the same process (and same data transfer) as the stage two sync; this will give us timing for the stage two move. We'll transfer inventory twice; the first transfer we plan to break and throw away.
- Fraser and Anthony have a plan to slave the cluster to the old mysql.cs - which we will test with inventory
- If things go very well, Fraser and I may be able to start moving the rest of the databases on Wednesday Feb. 24, though the schedule suggests Fri. Feb 26 - Tues. Mar 1 (CSCF staff meeting). Agreed we will aim for Wednesday.
- Daniel will set the CNAME TTL to 1 minute, but we can also flush the cache for www.cs and www.math (are there other clients? Daniel to investigate).
- Anthony will start working on health monitoring. Fraser suggested moving some of the monitoring to the webserver.
- How do we want to test the database?
- need lots of additional tests
- Complete failover from Master-Master (and back)
- pulling physical things like unplugging networking;
- breaking software raid? dd on disk?
- contriving a race condition? (writing to both masters at the same time: what does mysql do?)
- Anthony will do lower-level testing; Daniel will do application-level testing.
- We agree to record exactly (briefly) what we tried, and put it into the wiki.
- Our documentation for failover/fixing needs to include the tests we do to make sure we're OK (e.g. before bringing up the old master).
- (out of scope) Anthony wants to do haproxy auto-failover; this might be very easy. So we shouldn't put much effort into designing our own fancy A-record-based recovery option. This is for after this project is complete.
- semi-synchronous could be "bolted on the end" - except we would want to do lots of testing. So it should be part of the follow-on project; unless we get bored and want to put in semi-synchronous.
- we discussed project wrap-up procedures and documentation. Keep a log of things you want to put in the "lessons learned."
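
One possible way to keep the agreed test records brief and consistent is sketched below. This is only an illustration: the class name, field names, and example values are assumptions made for this sketch, not an agreed format, and the result text is a placeholder rather than an observed outcome.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class AttemptRecord:
    """One test attempt. All field names here are illustrative assumptions."""
    when: date
    tester: str
    what: str    # exactly what was tried
    result: str  # what happened, briefly

    def to_wiki_line(self) -> str:
        # Render as a plain "- " bullet, matching the style of these notes.
        return f"- {self.when.isoformat()} {self.tester}: {self.what} -> {self.result}"


# Placeholder values only; not a real test result.
record = AttemptRecord(
    when=date(2016, 2, 19),
    tester="tester",
    what="unplugged network cable on one master",
    result="(fill in observed behaviour)",
)
print(record.to_wiki_line())
```

Each entry then pastes into the wiki as a single bullet, and the same lines can feed the "lessons learned" log at wrap-up.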
Next Meeting
- Thu 25 Feb - NOTE ROOM CHANGE: DC 2564
--
DanielAllen - 2016-02-17