Enterprise Mysql Cluster Project Wrapup

Summary

As of Tuesday 12 April 2016, the deliverables are either complete, or very minor steps away from completion. Below are noted deliverables, lessons learned, and potential future work identified by the project team.

Deliverables

  • project plan and timeline
  • new 3-node cluster with proper master/slave replication
  • mysql.cs databases (list here) are moved to new cluster - known as mysql.cs
  • Marmoset is left behind on old mysql.cs - now known as mysql-172.cs
  • appropriate resource, latency, and health monitoring of the database services (95% complete as of Tuesday 12 April)
  • wrapup includes:
    • maintenance documentation (95% complete as of Tuesday 12 April)
    • This document: Enterprise Mysql Cluster Project Wrapup, including plans for potential upcoming work

Lessons Learned

A formal project plan and planning process contributed to keeping this project on target. We also benefited from having a project manager who was responsible for keeping track of timeline, individual responsibilities, STs being updated, and coordination. The project manager role was also helpful because all three of CSCF's groups contributed resources to this project.

Compared to many CSCF projects, this project involved considerably more communication within the team. We had 12 weekly meetings for setting strategy and coordinating work, on top of one-on-one discussion in person and discussion on ST. Focused planning meetings proved very helpful toward project success, and it's the project manager's opinion that we got better at focus as the project progressed. We also set up and followed communication plans for alerting our clients, including the rest of CSCF, elsewhere in CS, and with MFCF.

Early on, we agreed that successful completion would require good maintenance documentation, which we did with "cross-training" between project members to verify that it was complete and sufficiently specific. Writing and reviewing documentation has been a substantial portion of the project and is nearly complete as of Tuesday 12 April.

Throughout this project, CSCF managers encouraged the project members and gave us sufficient time to work on this priority. Without this support, the project would likely have failed to complete in the allotted time. In early April a higher priority for the School required a project member for much of his remaining time, and we were able to re-assign tasks among the other team members to accommodate.

All members of the team reported learning many technical details about mysql and its available toolsets such as Percona. We have also recorded below some areas of useful further investigation that we decided were out of scope for this project. A critical area of discovery: mysqldump does not, by default, dump stored procedures for a database. It will do so with an optional --routines parameter. Investigation does not explain why this is the default. During the database move, not knowing this caused data-loss of stored procedures for the SIGMOD application database, which were restored by the SIGMOD application experts. It would have been good to know about this mysqldump issue in advance, and it is a good thing that we had been in close communication with the SIGMOD application users in advance of the move.

In retrospect, the project design document would have been strengthened by a more comprehensive list of risks and mitigations; we might have uncovered some of the areas that turned out to take unexpected amounts of time to solve later in the project; one example of such would be: "Risk: setting up a new mysql install requires value judgements about configurations that were set years ago for a previous version. Mitigation: spending enough time early in the project to make good choices about configurations to copy or keep the mysql default."

Finally, keeping the project sponsor in the loop throughout the project is important; the Project Manager feels he should have reviewed the project spec with the Sponsor as soon as it was settled in the first few weeks, since the spec was changed from what the Sponsor thought we were planning to do this term. This was resolved in a management meeting a week or so later, but having a clearer picture earlier would have been more effective.

  • Added 2016-04-14: Sponsor Feedback: If I could change one thing about this process, it would be to add one or two project sponsor checkpoints along the way.

Potential future work

During the project, we noted many pieces of useful work that we deemed out of scope for this project, but seem useful for CSCF to consider for later.

A potential future project is replicating the process in order to set up a dedicated mysql cluster for marmoset, which also could benefit from an automated failover process; CS students would find this a visible improvement over the current setup. While a replica of this project, as-is, would be a quick project, it would not solve the problem of automated failover, and experience has shown marmoset will fail outside of business hours. Either version of higher availability would be an improvement over the status-quo for marmoset.

A separate future project is developing a similar process for setting up a postgres cluster. However, the technical details are quite different. We would encourage that a formal project plan and planning process be followed for that project.

None of the following items are seen by the team as high priorities for CSCF for Spring 2016, and none have been turned into ST items yet.

  • fire drills: We would like to see CSCF consider having regular (yearly? termly?) fire drills for this or other systems to test our preparedness for outages. We recognize that we don't have staff-time to do this regularly without specifically making time for them. In practice, server upgrades are the only live fire drill we perform, apart from real fires.
  • mysql configurations: we have assessed and copied relevant mysql server configurations from the prior configurations; we might benefit from comparing the configurations of copied settings against "best practices" (which might be stock debian package, or stock mysql, or something different).
  • further cleanup of production databases:
    • we've done a batch of cleanup on grants/permissions. There may be more that would make sense.
    • we would like to set up a process for manual regular cleanup of permissions, databases, and users, perhaps yearly.
    • where feasible, we would like to change application databases which are stored as myisam into innodb for greater data integrity assurance. We believe that many databases are solely myisam because that was a default.
  • automation:
    • haproxy auto-failover would reduce downtimes from 15+ minutes to sub-minute; this might be very easy to implement, but we judged it out of scope.
    • in our HA setup, we lack an automatic tool to prevent two machines from both being brought up as master, which could lead to diverging databases. To mitigate this risk, the maintenance docs require that when a master goes down, we will disable mysql from restarting. It would be more foolproof to have an accurate automated method so only one master can start. This would be a small utility to write and maintain, though we deemed it out of scope for this project.
  • monitoring and reporting could use further enhancements.
  • efficient use of pt-table-sync - we have tested and documented one of three modes of operation for database re-syncing in case of errors; investigating the other two might be fruitful for recovering databases in certain situations.

-- DanielAllen - 2016-04-08

Topic revision: r6 - 2016-04-14 - DanielAllen
 
This site is powered by the TWiki collaboration platform Powered by PerlCopyright © 2008-2019 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback