DFSc Working Group



Meeting Date

  • TEAMS: 2022-02-15

Invitees - Attendees

  • Anthony, Gouxiang, Lori, Nathan, Nick, Lawrence

Review and accept previous meeting minutes.

Proposed Agenda Items

Old business

Action items for next meeting 2022-02-15

  • Nathan - create an RBD with the --thick-provision option -> Pending
  • Clayton - create a ticket to get container made for v4 ganesha node -> Done
  • Fraser - create a ticket for the plans for local storage option -> https://rt.uwaterloo.ca/Ticket/Display.html?id=1209206
  • Nathan/Lori will update Ceph configuration to no longer allow insecure connections -> Scheduled
  • Lori to generate a schedule for upgrading the 42x systems -> Done

Not in the near future

  • move one or more home directories to Rados NFS device
  • move one or more home directories to NetApp

New business

Increase number of pgs for cs-teaching: nfish

  • Currently 256. Going to 512
  • Some extra mon load. Those on 422 systems
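
A minimal sketch of how the 256 -> 512 change could be staged. It only prints the commands rather than running them. The helper name `pg_increase_plan` is hypothetical, and using `cs-teaching` as the literal pool name is an assumption (the actual data pool may be named differently).

```shell
# Print (not run) the commands for raising a pool's PG count.
# The pool name passed in is an assumption; check the real name with "ceph osd pool ls".
pg_increase_plan() {
    pool="$1"
    target="$2"
    # Accept only non-negative integers here; zero is rejected below.
    case "$target" in
        ''|*[!0-9]*) echo "pg_num must be a positive integer" >&2; return 1 ;;
    esac
    # PG counts are conventionally powers of two (256, 512, ...).
    if [ "$target" -le 0 ] || [ $(( target & (target - 1) )) -ne 0 ]; then
        echo "pg_num $target is not a power of two" >&2
        return 1
    fi
    echo "ceph osd pool get $pool pg_num"          # confirm the current value
    echo "ceph osd pool set $pool pg_num $target"  # request the split
    echo "ceph -s"                                 # watch rebalance progress
}
```

For example, `pg_increase_plan cs-teaching 512` prints the three commands for review before anyone runs them during the maintenance window.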

Replace failing drive causing data inconsistencies/scrub errors: gxshen

  • After pg rebalance (2022-02-23 or later)

gvfs on login systems https://rt.uwaterloo.ca/Ticket/Display.html?id=1209858

  • GNOME VFS
  • lsof gives funny results

Diagnostics running? Latency per request

@ubuntu2004-002% cat /usr/local/bin/cephfs-trace 
#!/bin/bash
# Time each ceph_mdsc_do_request call per thread and log the duration (nanoseconds) to syslog.
bpftrace -e 'kprobe:ceph_mdsc_do_request { @start[tid] = nsecs; } kretprobe:ceph_mdsc_do_request /@start[tid]/ { $duration = nsecs - @start[tid]; printf("CephFS MDS Request by %d comm: %s pid: %d tid: %d dur: %d\n", uid, comm, pid, tid, $duration); delete(@start[tid]); }' | logger
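
One possible way to turn the logged lines into a per-command latency summary is sketched below. The function name `cephfs_latency_summary` is hypothetical; it assumes the log lines keep the exact `comm: ... pid: ... dur: ...` layout printed by the script above and that `dur` is in nanoseconds.

```shell
# Summarize per-command request latency from cephfs-trace log lines on stdin.
# Input durations are nanoseconds; output is count, average and maximum in milliseconds.
cephfs_latency_summary() {
    awk '
    /CephFS MDS Request by/ {
        # comm may contain spaces, so cut between the fixed labels.
        s = index($0, " comm: ")
        p = index($0, " pid: ")
        d = index($0, " dur: ")
        if (!s || !p || !d || p < s) next
        comm = substr($0, s + 7, p - (s + 7))
        dur = substr($0, d + 6) + 0
        n[comm]++
        sum[comm] += dur
        if (dur > max[comm]) max[comm] = dur
    }
    END {
        for (c in n)
            printf "%s count=%d avg_ms=%.3f max_ms=%.3f\n", c, n[c], sum[c] / n[c] / 1000000, max[c] / 1000000
    }' | sort
}
```

Usage would be along the lines of `grep 'CephFS MDS Request' /var/log/syslog | cephfs_latency_summary`, where the syslog path is an assumption about where logger output lands on these hosts.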

strace: writes on cs-teaching are almost always fast. The slow case is rm -rf under load, across snapshots

  • sticky on unlink: unlinkat(4, "url", AT_REMOVEDIR
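
To pick out the slow calls from a longer trace, a filter along these lines may help. It is a sketch: the function name `slow_syscalls` and the 0.1 s default threshold are arbitrary choices, and it assumes the trace was taken with strace's `-T` option so each completed call ends in `<seconds>`.

```shell
# Print only the strace -T lines whose syscall took at least THRESHOLD seconds.
# Calls that never returned have no <seconds> suffix and are skipped.
slow_syscalls() {
    threshold="${1:-0.1}"
    awk -v t="$threshold" '
    match($0, /<[0-9]+\.[0-9]+>$/) {
        # Strip the angle brackets and compare numerically.
        secs = substr($0, RSTART + 1, RLENGTH - 2) + 0
        if (secs >= t) print
    }'
}
```

A trace could be captured with `strace -f -T -e trace=unlinkat -o /tmp/rm.trace rm -rf DIR` (path and directory are placeholders) and then filtered with `slow_syscalls 0.5 < /tmp/rm.trace`.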

Dynamic MDS allocation

  • Need smaller changes in cap allocation
  • Spot heating for periods of time

Upgrades

Server side

  • Want to upgrade to Pacific by end of summer at the latest. Octopus out of support 2022-06-01: Early May?
    • Motivation: strays, upgrade path, less OSD spillover from RocksDB (sharding; currently 3, 30, 300 GB, ...), mclock scheduler, Grafana daemons
  • One MDS problem
  • Remove all snapshots?
  • Real downtime with low/no cluster load
  • ceph-deploy is deprecated: migrate to cephadm (Docker containerization), then upgrade
  • splitting cs-teaching into multiple real filesystems(?)
  • cephadm upgrade on the 902s
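
The "3, 30, 300 GB" RocksDB figures above can be read as steps: under that reading, a DB volume only helps up to the largest step it fully covers, and anything beyond that spills to the slow device. The sketch below encodes that simplified step model. It is an assumption drawn from the note above, not a measurement of this cluster, and the function name `effective_db_gb` is hypothetical.

```shell
# Given a DB volume size in whole GB, print the largest of the 3/30/300 GB
# steps it covers (0 if it is smaller than 3 GB). Simplified model only.
effective_db_gb() {
    size_gb="$1"
    best=0
    for step in 3 30 300; do
        if [ "$size_gb" -ge "$step" ]; then
            best="$step"
        fi
    done
    echo "$best"
}
```

Under this model a 60 GB DB volume is effectively a 30 GB one, which is the kind of waste the Pacific sharding change is expected to reduce.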

Client side

Scratch(ish) drives on 211 systems (fhgunn)

  • ZFS sends for sync

Upcoming maintenance

  • New upgrade schedule for 42x systems (ldpaniak)
    • rebooting these systems seems to be helping after an update
  • Reading Week maintenance: Feb 20-27 (nfish/ldpaniak)
    • Increase number of pgs for cs-teaching pool. Start on 2022-02-23
    • PS: Feb 22 is now a University Holiday

To do

  • cephadm upgrade on the 902s (ldpaniak/nfish)
Topic revision: r4 - 2022-02-15 - LoriPaniak
 