Enterprise Mysql Cluster Project Wrapup

Summary

As of Tuesday 12 April 2016, the deliverables are either complete, or very minor steps away from completion.

Deliverables

  • project plan and timeline DONE
  • new 3-node cluster with proper master/slave replication DONE
  • mysql.cs databases (list here) are moved to new cluster DONE
  • Marmoset is left behind on current mysql.cs DONE
  • appropriate resource, latency, and health monitoring of the database services - DONE
  • wrapup includes:
    • maintenance documentation - DONE
    • This document: Enterprise Mysql Cluster Project Wrapup, including plans for potential upcoming work DONE

Lessons Learned

A formal project plan and planning process contributed to keeping this project on target. We also benefited from having a project manager who was responsible for keeping track of timeline, individual responsibilities, STs being updated, and coordination. The project manager role was also helpful because all three of CSCF's groups contributed resources to this project.

Compared to many CSCF projects, this project involved considerable communication within the team. We had 12 weekly meetings for setting strategy and coordinating work. Focused meetings proved very helpful toward project success, and we got better at focus as the project progressed. We also set up and followed communication plans for alerting our clients, including the rest of CSCF, elsewhere in CS, and with MFCF.

Early on, we agreed that successful completion would require good maintenance documentation, which we did with "cross-training" between project members to verify that it was complete and sufficiently specific. Writing and reviewing documentation has been a substantial portion of the project and is nearly complete as of Tuesday 12 April.

Throughout this project, CSCF managers encouraged the project members and gave us sufficient time to work on this priority. Without this support, the project would likely have failed to complete in the allotted time. In early April a higher priority for the School required a project member for much of his remaining time, and we were able to re-assign tasks among the other team members to accommodate.

All members of the team reported learning many technical details about mysql and its available toolsets such as Percona. We have recorded below some areas of useful further investigation that we decided were out of scope for this project. A critical area of discovery: mysqldump does not, by default, dump stored procedures for a database. It will do so with an optional --routines parameter. Investigation does not explain why this is the default versus dumping stored triggers by default. This caused the only data-loss during the database move, for one application database.

In retrospect, the project design document would have been strengthened by a more comprehensive list of risks and mitigations; we might have uncovered some of the areas that we didn't discover and discuss until later in the project, such as: "Risk: setting up a newer mysql requires value judgements about configurations that were set on a previous version. Mitigation: spending enough time to make good choices about configurations to copy or keep the mysql default."

Finally, keeping the project sponsor in the loop throughout the project is important; as Project Manager, I think I should've appraised the Sponsor about the project spec as soon as it was settled in the first few weeks, since the spec was changed from what the Sponsor thought we were planning to do this term.

Potential future work

During the project, we noted many pieces of useful work that we deemed out of scope for this project, but want CSCF to consider for later.

A potential future project is replicating the process in order to set up a dedicated mysql cluster for marmoset, which also could benefit from an automated failover process; CS students would find this a visible improvement over the current setup. While a replica of this project, as-is, would be a quick project, it would not solve the problem of automated failover. Either one would be an improvement over the status-quo for marmoset.

A separate future project is developing a similar process for setting up a postgres cluster. However, the technical details are quite different. We would encourage that a formal project plan and planning process be followed for that project.

None of the following items yet have ST items, and none of these are seen as high priorities for CSCF for Spring 2016.

  • further cleanup of production databases:
    • we've done a batch of cleanup on grants/permissions. There may be more that would make sense.
    • we should set up a process for manual regular cleanup, perhaps yearly.
    • we should change databases which are stored as myisam into innodb for greater data integrity assurance- the biggest of these is ST, and we believe ST uses the full-text-search properties of myisam.
  • automation: * haproxy auto-failover would reduce downtimes from 15+ minutes to sub-minute; this might be very easy to implement, but we judged it out of scope.
    • service mysqld start doesn't know whether it's a master or slave. We avoid risk of two active masters ruining the database, by writing the maintenance docs to require disabling mysql on a downed master; it would be more foolproof to have an accurate automated method to tell the master from the slaves.
    • monitoring could have additional automation and reporting
  • efficient use of pt-table-sync - we use 1 of 3 modes of operation. There might be a faster method of recovery that we haven't investigated yet.
  • fire drills: CSCF consideration of whether we should be having yearly firedrills? for this and many other systems. We recognize that we don't have staff-time to do this regularly without specifically making time for them. (In essence, server upgrades are the only live fire drill we perform, apart from real fires.)
  • mysql configurations: comparing the configurations of unchanged settings against "best practices" (which might be stock debian package, or stock mysql, or something different).
  • we can see good reasons to recommend implementing shorewall policies across CSCF servers, for easy-to-understand firewall filtering under CS control.

-- DanielAllen - 2016-04-08

Edit | Attach | Watch | Print version | History: r6 < r5 < r4 < r3 < r2 | Backlinks | Raw View | Raw edit | More topic actions...
Topic revision: r3 - 2016-04-11 - DanielAllen
 
This site is powered by the TWiki collaboration platform Powered by PerlCopyright © 2008-2025 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback