DFSc Working Group
Meeting Date
Invitees - Attendees
- Anthony, Gouxiang, Lori, Nathan, Nick, Lawrence
Review and accept previous meeting minutes.
Proposed Agenda Items
Old business
Action items for next meeting 2022-02-15
- Nathan - create an RBD with the --thick-provision option -> Pending
- Clayton - create a ticket to get container made for v4 ganesha node -> Done
- Fraser - create a ticket for the plans for local storage option -> https://rt.uwaterloo.ca/Ticket/Display.html?id=1209206
- Nathan/Lori will update Ceph configuration to no longer allow insecure connections -> Scheduled
- Lori to generate a schedule for upgrading the 42x systems -> Done
Not in the near future
- move one or more home directories to Rados NFS device
- move one or more home directories to NetApp
New business
Increase number of pgs for cs-teaching: nfish
* Currently 256. Going to 512
* Some extra mon load. Those on 422 systems
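A minimal sketch of the pg increase, for reference. The pool name `cs-teaching-data` is a placeholder (the minutes do not give the actual pool name), and the script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/bash
# Sketch only: POOL is a hypothetical name; substitute the real cs-teaching pool.
POOL="${POOL:-cs-teaching-data}"
TARGET_PGS=512

# With DRY_RUN=1 (the default here) commands are printed, not executed.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

pg_increase() {
    run ceph osd pool get "$POOL" pg_num                # expect 256 before the change
    run ceph osd pool set "$POOL" pg_num "$TARGET_PGS"  # raise to 512
    run ceph -s                                         # watch rebalance and mon load
}

pg_increase
```

Running with DRY_RUN=0 on an admin node would issue the commands for real; whether the autoscaler needs adjusting first depends on the cluster's settings and is not covered here.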
Replace failing drive causing data inconsistencies/scrub errors: gxshen
* After pg rebalance (2022-02-23 or later)
* GNOME VFS
* lsof gives funny results
Diagnostics running? Latency per request
@ubuntu2004-002% cat /usr/local/bin/cephfs-trace
#!/bin/bash
bpftrace -e 'kprobe:ceph_mdsc_do_request { @start[tid] = nsecs; } kretprobe:ceph_mdsc_do_request /@start[tid]/ { $duration = nsecs - @start[tid]; printf("CephFS MDS Request by %d comm: %s pid: %d tid: %d dur: %d\n", uid, comm, pid, tid, $duration); delete(@start[tid]); }' | logger
strace: writes on cs-teaching almost always fast. rm -rf under load, across snapshots
- sticky on unlink: unlinkat(4, "url", AT_REMOVEDIR
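One possible way to turn the cephfs-trace output into per-command latency figures. This is a sketch that assumes the log lines keep the printf format from the script above (with `dur` in nanoseconds) and that command names contain no spaces; the function name `summarize_mds_latency` is made up for illustration.

```shell
#!/bin/bash
# Summarise cephfs-trace lines read from stdin: per command name, print the
# request count, mean latency and max latency in milliseconds.
# Assumed line format: "... CephFS MDS Request by UID comm: NAME pid: N tid: N dur: NANOSECONDS"
summarize_mds_latency() {
    awk '
        /CephFS MDS Request/ {
            comm = ""; dur = ""
            # Pick out the values that follow the "comm:" and "dur:" labels.
            for (i = 1; i < NF; i++) {
                if ($i == "comm:") comm = $(i + 1)
                if ($i == "dur:")  dur  = $(i + 1)
            }
            if (comm == "" || dur == "") next
            n[comm]++
            sum[comm] += dur
            if (dur + 0 > max[comm]) max[comm] = dur + 0
        }
        END {
            for (c in n)
                printf "%s count=%d mean_ms=%.3f max_ms=%.3f\n", c, n[c], sum[c] / n[c] / 1e6, max[c] / 1e6
        }
    ' | sort
}
```

Usage would be along the lines of `grep 'CephFS MDS Request' /var/log/syslog | summarize_mds_latency`, assuming the logger output ends up in /var/log/syslog on the client.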
Dynamic MDS allocation
- Need smaller changes in cap allocation
- Spot heating for periods of time
Upgrades
Server side
- Want to upgrade to Pacific by end of summer at the latest. Octopus out of support 2022-06-01: Early May?
  Reasons: strays, upgrade path, less OSD spill from RocksDB (sharding; currently: 3, 30, 300GB...), mclock scheduler, Grafana daemons.
- One MDS problem
- Remove all snapshots?
- Real downtime with low/no cluster load
- ceph-deploy deprecated: migrate to ceph-adm (Docker containerization), then upgrade
- splitting cs-teaching into multiple real filesystems(?)
- ceph-adm upgrade on the 902s
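Background arithmetic for the "3, 30, 300GB" RocksDB figures mentioned above, as a rough sketch. It assumes the stock RocksDB level sizing (256 MiB base level, multiplier of 10), which is not stated in the minutes and may differ from the cluster's actual settings.

```shell
#!/bin/bash
# Assumed RocksDB defaults (not from the minutes): 256 MiB level base, multiplier 10.
# Prints each level's size and the cumulative space needed to hold all levels so far.
rocksdb_level_totals() {
    local base_mb=256 mult=10 levels=4
    local size=$base_mb total=0 lvl
    for ((lvl = 1; lvl <= levels; lvl++)); do
        total=$((total + size))
        echo "L$lvl size_mb=$size cumulative_mb=$total"
        size=$((size * mult))
    done
}

rocksdb_level_totals
```

Under these assumptions the cumulative totals for L2, L3 and L4 come out near 3, 30 and 300 GB, which is where the step sizes for useful DB device capacity come from.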
Client side
Scratch(ish) drives on 211 systems (fhgunn)
Upcoming maintenance
- New upgrade schedule for 42x systems (ldpaniak)
- rebooting these systems seems to be helping after an update
- Reading Week maintenance: Feb 20-27 (nfish/ldpaniak)
- Increase number of pgs for cs-teaching pool. Start on 2022-02-23
- PS: Feb 22 is now a University Holiday
To do
- ceph-adm upgrade on the 902s (ldpaniak/nfish)