Meeting: 2016-11-18 DC-2102
Attendance:
Guoxiang Shen, Lori D. Paniak, Nathan Fish
Agenda:
Ceph on "fat" OSD, current obstacles
Discussion:
Review of building OSDs on md/dm manually as outlined in:
https://cs.uwaterloo.ca/cscf/internal/request_debug/UpdateRequest?106956
The Nov 18 script from ldpaniak needs its md and sd device names replaced with UUIDs.
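A sketch of what that replacement could look like; device names here are illustrative, not taken from the ticket:

```shell
# /dev/sdX names can reorder across reboots; pin arrays and members by UUID.

# Print the UUID of an md device (illustrative name):
blkid -s UUID -o value /dev/md0

# Persist array definitions by UUID so assembly does not depend on sdX order;
# mdadm emits "ARRAY ... UUID=..." lines:
mdadm --detail --scan >> /etc/mdadm/mdadm.conf
update-initramfs -u

# Stable per-device paths usable in fstab and OSD setup:
ls -l /dev/disk/by-uuid/
```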
6 GB journals per OSD seen as good initial sizing.
Discussion of failure rates for 3+1 RAID5 vs 6+2 RAID6 OSDs.
Concluded that the estimated data loss probability of <0.01% over 5 years for 3+1 RAID5 is essentially correct.
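A back-of-envelope check of the 3+1 RAID5 figure. All inputs here (annualized failure rate, rebuild window) are assumptions for illustration, not numbers from the meeting, and this models loss of a single RAID set; Ceph replication across OSDs reduces the cluster-wide data loss probability further.

```shell
# Rough model: probability that any drive in a 4-disk set fails over 5 years,
# times the probability a second drive fails before the rebuild completes.
awk 'BEGIN {
  afr     = 0.02          # assumed annualized HDD failure rate (2%)
  drives  = 4             # 3+1 RAID5 set
  rebuild = 24 / 8760     # assumed 24 h rebuild window, in years
  years   = 5
  p = years * drives * afr * (drives - 1) * afr * rebuild
  printf "P(set loss over 5y) ~ %.4f%%\n", p * 100
}'
```

With these assumed inputs the result lands below the 0.01% threshold discussed; a longer rebuild window or higher failure rate pushes it above.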
Decided that extra effort to build system was worth it if it reduced maintenance load in production.
Customization should not increase maintenance load or impede upgrade path for the system.
md-based OSDs remove direct monitoring of hard drives by Ceph. It is essential that HDD health be
monitored by secondary means (e.g. smartd), reported (e.g. via Nagios), and acted on (e.g. drive replacement)
in a timely manner.
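A sketch of a smartd configuration for this kind of secondary monitoring (Debian path /etc/smartd.conf); device names, schedules, and mail target are illustrative:

```
# Monitor all SMART attributes (-a), run a short self-test daily at 02:00
# and a long test on the 1st of each month at 03:00 (-s), mail root on
# failure (-m):
/dev/sda -a -s (S/../.././02|L/../01/./03) -m root

# Or scan and monitor every device not listed explicitly:
DEVICESCAN -a -m root
```

smartd alerts would then feed the reporting layer (e.g. a Nagios check on smartd output or syslog).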
Suggestion by nfish for monthly updates/reboots of DFS systems, done separately per system, to surface HDD spin-down problems and complications from updates.
It will be useful to have a (VM) analogue of the DFS cluster where updates can be applied first.
Suggestion to use existing Ceph cluster on AMD hardware as a backup for the DFS cluster.
To do:
Build md/dm-backed OSD with journal on SSD partition.
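The to-do item might look roughly like the following; this is a sketch under assumed device names (sdb-sde as data disks, sda5 as an SSD journal partition), not the actual script from the ticket:

```shell
# Build the md array backing the OSD (3+1 RAID5, illustrative devices):
mdadm --create /dev/md0 --level=5 --raid-devices=4 \
      /dev/sdb /dev/sdc /dev/sdd /dev/sde

# Jewel-era ceph-disk takes the data device followed by the journal device;
# pre-size the SSD journal partition to the agreed 6 GB:
ceph-disk prepare --cluster ceph /dev/md0 /dev/sda5
ceph-disk activate /dev/md0p1
```

Once built, the array definition should be persisted by UUID (mdadm --detail --scan) rather than by sdX name.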
--
LoriPaniak - 2016-11-21