Enterprise Mysql Cluster Project Wrapup
Executive Summary
Deliverables
- project plan and timeline
- new 3-node cluster with proper master/slave replication
- mysql.cs databases (list here) are moved to new cluster
- Marmoset is left behind on current mysql.cs
- appropriate resource, latency, and health monitoring of the database services - in progress as of 8 April 2016
- wrapup includes:
- maintenance documentation - in progress as of 8 April 2016
- this document, including plans for potential upcoming work - in progress as of 8 April 2016
Lessons Learned
A formal project plan and planning process were highly helpful for keeping this project on target. We also benefited from having a project manager who was responsible for keeping track of the timeline, individual responsibilities, and coordination. The project manager role was also helpful because this was a project with workers coming from each of CSCF's three groups.
[more to follow]
Stored procedures ...
Potential future work
During the project, we noted many pieces of useful work that we deemed out of scope for this project, but want to consider for later.
A potential future project is replicating the process in order to set up a mysql cluster for student use; or for marmoset. In initial analysis, marmoset could take advantage of an automated failover process; however it's not clear that the Highly Available aspects of this project are necessary for marmoset instead of restore-from-backup.
A separate future project is developing a similar process for setting up a postgres cluster. However, the technical details are quite different. We would encourage that a formal project plan and planning process would be helpful for that project.
None of the following items have ST items, and none of these are seen as high priorities for CSCF for Spring 2016.
- haproxy auto-failover would reduce downtimes from 15+ minutes to sub-minute; this might be very easy to implement, but we judged it out of scope.
- further cleanup of production databases:
- we've done a batch of cleanup on grants/permissions. There may be more that would make sense.
- we should set up a process for regular pruning/cleanup, at least yearly.
- change databases which are stored as myisam into innodb for greater data integrity assurance- the biggest of these is ST, and we believe ST uses the full-text-search properties of myisam.
- automation:
-
service mysqld start
doesn't know whether it's a master or slave. We avoid risk of two active masters ruining the database, by writing the maintenance docs to require disabling mysql on a downed master; it would be more foolproof to have an accurate automated method to tell the master from the slaves.
- monitoring could have additional automation and reporting
- efficient use of
pt-table-sync
- we use 1 of 3 modes of operation. There might be a faster method of recovery that we haven't investigated yet.
- fire drills: consideration of whether we should be having yearly firedrills? for this and many other systems. We recognize that we don't have staff-time to do this regularly, without reallocating priorities. (In essence, server upgrades could be a live fire drill)
- mysql configurations: comparing the configurations of unchanged settings against "best practices" (which might be stock debian package, or stock mysql, or something different).
- we can see good reasons to recommend implementing shorewall across CSCF.
-- DanielAllen - 2016-04-08