DFSc Working Group

Meeting Date

TEAMS: 2022-02-15

Invitees - Attendees

Anthony, Gouxiang, Lori, Nathan, Nick, Lawrence

Review and accept previous meeting minutes.

CsDFScWgMeeting20220202

Proposed Agenda Items

Old business

Action items for next meeting 2022-02-15

Nathan - create an RBD with the --thick-provision option -> Pending
Clayton - create a ticket to get container made for v4 ganesha node -> Done
Fraser - create a ticket for the plans for local storage option -> https://rt.uwaterloo.ca/Ticket/Display.html?id=1209206
Nathan/Lori will update Ceph configuration to no longer allow insecure connections -> Scheduled
Lori to generate a schedule for upgrading the 42x systems -> Done

not near future

move one or more home directories to Rados NFS device
move one or more home directories to NetApp

New business

Increase number of pgs for cs-teaching: nfish

* Currently 256. Going to 512
* Some extra mon load. Those on 422 systems
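The increase from 256 to 512 noted above could be sketched as the following dry run, which only prints the commands rather than running them. The pool name `cs-teaching` is an assumption taken from the maintenance notes (the actual data pool may be named differently), and the function name `pg_increase_plan` is hypothetical. The power-of-two check reflects the usual recommendation for pg_num, not a hard requirement.

```shell
#!/bin/bash
# Dry-run sketch: print (do not execute) the commands for a pg_num increase.
# Pool name and target are illustrative assumptions, not confirmed values.
pg_increase_plan() {
    local pool="$1" target="$2"
    # A power-of-two pg_num is recommended; n & (n-1) is zero only for powers of two.
    if [ "$target" -le 0 ] || [ $(( target & (target - 1) )) -ne 0 ]; then
        echo "target pg_num $target is not a power of two" >&2
        return 1
    fi
    echo "ceph osd pool get $pool pg_num"          # confirm the current value
    echo "ceph osd pool set $pool pg_num $target"  # request the split
    echo "ceph -s"                                 # watch rebalance progress
}

pg_increase_plan cs-teaching 512
```

Reviewing the printed commands before running them by hand keeps the change deliberate, which seems appropriate given the extra mon load expected during the split.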

Replace failing drive causing data inconsistencies/scrub errors: gxshen

* After pg rebalance (2022-02-23 or later)

gvfs on login systems https://rt.uwaterloo.ca/Ticket/Display.html?id=1209858

* GNOME VFS
* lsof gives funny results

Diagnostics running? Latency per request

@ubuntu2004-002% cat /usr/local/bin/cephfs-trace 
#!/bin/bash
# Time each CephFS MDS request: record the start on entry, report the duration on return.
bpftrace -e '
kprobe:ceph_mdsc_do_request { @start[tid] = nsecs; }
kretprobe:ceph_mdsc_do_request /@start[tid]/ {
  $duration = nsecs - @start[tid];
  printf("CephFS MDS Request by %d comm: %s pid: %d tid: %d dur: %d\n", uid, comm, pid, tid, $duration);
  delete(@start[tid]);
}' | logger
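The trace above logs one line per MDS request with its duration in nanoseconds. A minimal sketch of how those lines might be summarized per command follows. The function name `summarize_mds_latency` is hypothetical, and the sketch assumes the logged lines keep the exact `comm:` and `dur:` field layout of the printf format above.

```shell
#!/bin/bash
# Summarize cephfs-trace log lines read on stdin:
# request count, mean latency and max latency (in ms) per command name.
summarize_mds_latency() {
    awk '
    /CephFS MDS Request/ {
        comm = ""; dur = -1
        # Locate the values following the "comm:" and "dur:" labels,
        # so any syslog prefix added by logger is ignored.
        for (i = 1; i < NF; i++) {
            if ($i == "comm:") comm = $(i + 1)
            if ($i == "dur:")  dur  = $(i + 1) + 0
        }
        if (comm == "" || dur < 0) next
        n[comm]++
        sum[comm] += dur
        if (dur > max[comm]) max[comm] = dur
    }
    END {
        for (c in n)
            printf "%s count=%d mean_ms=%.3f max_ms=%.3f\n", c, n[c], sum[c] / n[c] / 1000000, max[c] / 1000000
    }' | sort
}

# Example with made-up sample lines (not real measurements):
printf '%s\n' \
    'CephFS MDS Request by 1000 comm: rm pid: 10 tid: 10 dur: 2000000' \
    'CephFS MDS Request by 1000 comm: rm pid: 10 tid: 10 dur: 4000000' \
    'CephFS MDS Request by 1001 comm: ls pid: 11 tid: 11 dur: 500000' |
    summarize_mds_latency
```

On a login system the input would come from wherever logger writes, for example `grep 'CephFS MDS Request' /var/log/syslog | summarize_mds_latency`, assuming the default Ubuntu syslog location.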

strace: writes on cs-teaching almost always fast. rm -rf under load, across snapshots

sticky on unlink: unlinkat(4, "url", AT_REMOVEDIR)

Dynamic MDS allocation

Need smaller changes in cap allocation
Spot heating for periods of time

Upgrades

Server side

Want to upgrade to Pacific by end of summer at the latest. Reasons: strays, upgrade path, less OSD spill from RocksDB (sharding; currently 3, 30, 300GB...), mclock scheduler, Grafana daemons. Octopus out of support 2022-06-01: Early May?
One MDS problem
Remove all snapshots?
Real downtime with low/no cluster load
ceph-deploy deprecated: migrate to cephadm (Docker containerization), then upgrade
splitting cs-teaching into multiple real filesystems(?)
cephadm upgrade on the 902s
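The 3, 30, 300GB figures mentioned above for RocksDB spill can be roughly reproduced with simple level arithmetic. This is a sketch under assumptions: a 256 MB base level size and a 10x multiplier per level are the stock RocksDB defaults as far as we know, and may not match the tuning on this cluster. The function name `rocksdb_level_sizes` is hypothetical.

```shell
#!/bin/bash
# Illustrative arithmetic only: per-level and cumulative RocksDB level sizes.
# Defaults (256 MB base, x10 per level, 4 levels) are assumed, not measured.
rocksdb_level_sizes() {
    local base_mb="${1:-256}" mult="${2:-10}" levels="${3:-4}"
    local size=$base_mb total=0 lvl
    for (( lvl = 1; lvl <= levels; lvl++ )); do
        total=$(( total + size ))
        echo "L$lvl size_mb=$size cumulative_mb=$total"
        size=$(( size * mult ))
    done
}

rocksdb_level_sizes
```

Under these assumptions the cumulative sizes through levels 2, 3 and 4 come to about 2.8, 28 and 284 GB, which is where the rounded 3/30/300 thresholds come from: a DB device between two thresholds only holds the levels that fit entirely, and the rest spills to the slower device.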

Client side

5.13 kernel on 2004-002,016 at this time
Wait to end of term
5.16 has nowsync default https://www.phoronix.com/scan.php?page=news_item&px=Linux-5.16-Ceph

Scratch(ish) drives on 211 systems (fhgunn)

ZFS sends for sync

Upcoming maintenance

New upgrade schedule for 42x systems (ldpaniak)
- rebooting these systems seems to be helping after an update
Reading Week maintenance: Feb 20-27 (nfish/ldpaniak)
- Increase number of pgs for cs-teaching pool. Start on 2022-02-23
- PS: Feb 22 is now a University Holiday

To do

cephadm upgrade on the 902s (ldpaniak/nfish)

Topic revision: r4 - 2022-02-15 - LoriPaniak

Information in this area is meant for use by CSCF staff and is not official documentation, but anybody who is interested is welcome to use it if they find it useful.
