DFSc Working Group
Meeting Date
Invitees - Attendees
- Anthony, Gouxiang, Lori, Nathan, Nick, Lawrence
Review and accept previous meeting minutes.
Notes:
Proposed Agenda Items
Old business
Action items for this meeting 2022-03-17
- cephadm upgrade on the 902s (ldpaniak/nfish)
- no action
- Lori will look at it and plan to have Nathan rebuild
- pilot step to the full upgrade/rebuild
- high-fidelity test-bed for the real system
- the 902 cluster was down; it has been brought back up - is anything still connected to it?
- plan is to re-install with the same setup as the production cluster and run the upgrade process
- how do we tell what is/has been connecting to it? (see the sketch below)
- Lori will investigate with Nathan
- the whole process needs to happen during the summer and it will take a while
- it seems that we need to break out the /uNs into separate Ceph filesystems
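A minimal sketch of one way to answer "what is still connecting?": list the sessions the active MDS currently holds. It assumes the 902 cluster's admin keyring and the `ceph` CLI are available on the host, that rank 0 is the active MDS, and that the session fields (`client_metadata.hostname`, `inst`) match what recent releases emit; past connections would still have to be dug out of MDS/mon logs.

```python
#!/usr/bin/env python3
"""List clients that currently hold CephFS sessions on the 902 cluster."""
import json
import subprocess


def mds_sessions(target: str = "mds.0") -> list:
    # `ceph tell mds.<rank> session ls` emits a JSON array of session records.
    out = subprocess.run(
        ["ceph", "tell", target, "session", "ls", "--format", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)


if __name__ == "__main__":
    for session in mds_sessions():
        meta = session.get("client_metadata", {})
        # hostname / entity_id / inst are the usual fields; exact names may
        # vary slightly between Ceph releases.
        print(session.get("id"), meta.get("hostname", "?"), session.get("inst", ""))
```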
- Fraser - create a ticket for the plans for local storage option -> https://rt.uwaterloo.ca/Ticket/Display.html?id=1209206
- Anthony will discuss available hardware with Fraser to determine next steps - updated ticket
- no obvious updates
- purchased 8 spinning drives
- Omar to follow up with Fraser
New business
- Access to DFSc performance counters, RT#1211673 dlgawley
- Dave is not here - defer to the next meeting
- Diagnostics a2brenna - also RT#1211673
- debug symbols
- Anthony wants them installed
- Lori wants to know what happens with these debug symbols?
- Anthony: symbolic information for debugging and stack traces; inert on disk, takes space, but is never executed as code
- debug symbols were split into separate packages back when disk space was very limited
- why install them on the servers? Meaningful profiling cannot be done without them on the server itself
- they could maybe be used on another host if all the other code lined up exactly the same
- but a running system cannot be profiled if the debug symbols are not on the host itself
- Lori - what problems are we solving?
- Anthony - has profiled the client side and found no problems, but cannot profile the server side
- Anthony - a sampling profiler is non-invasive, unlike strace (see the profiling sketch below)
- Lori - current software is EOL in 6-8 weeks, is it worth doing on this version?
- Anthony - nothing in the next release's notes seems to indicate major fixes, so we will want to start reviewing
- this ticket has now been resolved
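A minimal sketch of the kind of non-invasive sampling discussed above, assuming `perf` (linux-tools) is installed, the Ceph debug-symbol packages are present so stacks resolve to names, and a single ceph-mds process runs on the host; the daemon name, 99 Hz rate, and 30 s window are illustrative, and the script must run as root.

```python
#!/usr/bin/env python3
"""Sample a Ceph daemon with perf for a short window and print the report."""
import subprocess


def profile(process_name: str = "ceph-mds", seconds: int = 30) -> str:
    # Oldest matching PID; assumes one daemon of this name per host.
    pid = subprocess.run(["pgrep", "-o", process_name],
                         check=True, capture_output=True, text=True).stdout.strip()
    # 99 Hz sampling with call graphs: observes the process from outside,
    # unlike strace, which intercepts every syscall.
    subprocess.run(["perf", "record", "-F", "99", "-g", "-p", pid,
                    "--", "sleep", str(seconds)], check=True)
    return subprocess.run(["perf", "report", "--stdio"],
                          check=True, capture_output=True, text=True).stdout


if __name__ == "__main__":
    print(profile())
```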
- Any plans to deal with cephfs client crashes on teaching systems? - yc2lee/gxshen - RT#1214857
- several teaching systems have crashed in the past couple months
- seems to be a Linux kernel/Ceph issue - not clear whether anyone in those groups is working on it
- eg: https://forum.proxmox.com/threads/bug-kernel-null-pointer-dereference-address-0000000000000402.106067/
- process for now?
- the bug does not exist in kernels 5.11 or earlier
- Anthony: concerned this is due to running the HWE kernels. Should only be running stock kernels
- agreed at this meeting that the stock kernel should be put back onto systems as they are rebooted (see the kernel-check sketch below)
- noted in ticket RT#1217576:
- At today's DFSC meeting, we agreed that Anthony will replace the current HWE kernels with the stock (5.4) kernel when rebooting the infrastructure systems over the next few days.
- If you have any objections - note them here. Otherwise we assume you are all supportive of this plan.
- If there are any noted problems, Anthony assures us that it should be able to be rolled back easily.
- Lori's concern is about potential impacts on performance; she would not support the change if we were not experiencing crashes.
- Anthony has diagnostic data, but may need to rework his diagnostic tools for the kernel change.
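A minimal sketch for auditing which infrastructure hosts still run an HWE kernel versus the stock 5.4 series, assuming passwordless ssh to each host; the host names are hypothetical placeholders.

```python
#!/usr/bin/env python3
"""Report which hosts run the stock 5.4 kernel and which run something newer."""
import subprocess

HOSTS = ["dfs-host-01", "dfs-host-02"]  # hypothetical names; substitute the real list


def kernel_of(host: str) -> str:
    return subprocess.run(["ssh", host, "uname", "-r"],
                          check=True, capture_output=True, text=True).stdout.strip()


if __name__ == "__main__":
    for host in HOSTS:
        release = kernel_of(host)
        flavour = "stock" if release.startswith("5.4.") else "HWE/other"
        print(f"{host}: {release} ({flavour})")
```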
Upgrades
- Lori: will follow from pending upgrades
- Anthony: that is basically the Prometheus data
Start with the 902 systems. Practice upgrades, work out bugs (nfish/a2brenna/ldpaniak)
Server side
- Want to upgrade to Pacific by end of summer at the latest: strays, upgrade path, less OSD spillover from RocksDB (sharding; currently 3, 30, 300 GB...), mclock scheduler, Grafana daemons. Octopus out of support 2022-06-01: early May?
- One MDS problem
- Remove all snapshots?
- in preparation for the upgrade (see the snapshot sketch after this list)
- Anthony: we should ensure we have complete backups before that
- Guoxiang: running low on disk space for index files; backups are slow; the new hardware has problems; he does not know when he can do the full backup (pretty soon?)
- Real downtime with low/no cluster load
- ceph-deploy is deprecated: migrate to cephadm (Docker containerization), then upgrade
- maybe not as deprecated as originally thought
- probably can't use cephadm
- will try on the 902s
- splitting cs-teaching into multiple real filesystems: u0-u19(?)
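A minimal sketch of the snapshot preparation mentioned above, assuming the filesystem is named `cs-teaching` and that its root is mounted at `/cephfs` on the admin host (both assumptions): it blocks new snapshot creation and lists what already exists under the hidden `.snap` directory. Existing snapshots would then be removed one by one (e.g. `rmdir .snap/<name>` on the mount) once the full backup is confirmed.

```python
#!/usr/bin/env python3
"""Pre-upgrade snapshot prep: block new snapshots and list existing ones."""
import os
import subprocess

FS_NAME = "cs-teaching"   # assumed Ceph filesystem name
MOUNTPOINT = "/cephfs"    # assumed mount of the filesystem root on this host


def disable_new_snapshots() -> None:
    # Prevents creation of new snapshots; existing ones are untouched.
    subprocess.run(["ceph", "fs", "set", FS_NAME, "allow_new_snaps", "false"],
                   check=True)


def existing_snapshots() -> list:
    # CephFS exposes snapshots as entries in the hidden .snap directory.
    return sorted(os.listdir(os.path.join(MOUNTPOINT, ".snap")))


if __name__ == "__main__":
    disable_new_snapshots()
    for name in existing_snapshots():
        print(name)
```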
Client side
Scratch(ish) drives on 211 systems (fhgunn)
Upcoming maintenance
- Ongoing failed/flaky hard drive replacement (gxshen). 421 drives are 5 years old. Expect failures.
MDS instances holding strays
- Anthony: what happens to the strays if the machine holding the MDS goes away, crashes, etc.?
- the MDS is a metadata service that is triplicated, so one instance can go away
- the strays will move (see the stray-count sketch below)
- Anthony: why not just turn off the MDS?
- the cluster will automatically start up a new one
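A minimal sketch for watching stray counts per MDS, assuming the admin keyring is available, rank 0 is the active MDS, and the counter is exposed as `mds_cache`/`num_strays` in the perf dump (true of recent releases, but treat the exact key names as an assumption).

```python
#!/usr/bin/env python3
"""Print the stray count reported by an MDS via its perf counters."""
import json
import subprocess


def stray_count(target: str = "mds.0") -> int:
    # `ceph tell mds.<rank> perf dump` returns JSON; the stray gauge lives
    # under mds_cache (key names may differ slightly between releases).
    out = subprocess.run(["ceph", "tell", target, "perf", "dump",
                          "--format", "json"],
                         check=True, capture_output=True, text=True).stdout
    return json.loads(out)["mds_cache"]["num_strays"]


if __name__ == "__main__":
    print(f"mds.0: {stray_count()} strays")
```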
Action items for next meeting
- ldpaniak/nfish: Ceph upgrade on the 902s
- Omar: Fraser - create a ticket for the plans for local storage option -> https://rt.uwaterloo.ca/Ticket/Display.html?id=1209206
- Anthony: Any plans to deal with cephfs client crashes on teaching systems? - yc2lee/gxshen - RT#1214857 (revert to 5.4 kernel)
- Clayton: NFS servers to Ganesha 4.0
- Guoxiang: need to get a full backup before the Ceph upgrade
- Lori: disable/reduce new snapshots in anticipation of Ceph upgrade (?)