DFSc Working Group


Meeting Date

TEAMS: 2022-04-27

Invitees - Attendees

Anthony, Guoxiang, Lori, Nathan, Nick, Lawrence

Review and accept previous meeting minutes.

CsDFScWgMeeting20220215

Notes:

Recent performance https://icinga.cscf.uwaterloo.ca/grafana/d/03EnhXZGz/dfsc-monitoring?orgId=1&from=1649099905385&to=1649141741609
- spike on April 4th: 31k+ IOPS

Proposed Agenda Items

Old business

Action items for this meeting 2022-03-17

ceph-adm upgrade on the 902s (ldpaniak/nfish)
- no action
- Lori will look at it and plan to have Nathan rebuild
- pilot step to the full upgrade/rebuild
- high-fidelity test-bed for the real system
- the 902 cluster was down and has been brought back up - is anything still connected to it?
  - plan is to re-install with the same setup of the production cluster and run the upgrade process
  - how to tell what is/has been connecting to it?
  - Lori will investigate with Nathan
  - the whole process needs to happen during the summer and it will take a while
  - it seems we need to break out the /uNs into separate Ceph filesystems
Fraser - create a ticket for the plans for local storage option -> https://rt.uwaterloo.ca/Ticket/Display.html?id=1209206
- Anthony will discuss available hardware with Fraser to determine next steps - updated ticket
- no obvious updates
- purchased 8 spindled drives
- Omar to follow up with Fraser

New business

Access to DFSc performance counters, RT#1211673 dlgawley
- Dave's not here - leave for next meeting

Diagnostics a2brenna - also RT#1211673
- debug symbols
- Anthony wants them installed
- Lori wants to know what happens with these debug symbols
- Anthony: they are symbolic information used for debugging and for stack traces; they sit inert on disk and take space, but are never executed as code
- they were split into separate packages back when disk space was very limited
- why install them on the servers? No meaningful profiling can be done without them on the server itself
- they could maybe be used on another host if all other code lined up exactly the same
- but a running system cannot be profiled if the debug symbols are not on the host itself
- Lori - what problems are we solving?
- Anthony - has profiled the client side and found no problems, but cannot profile the server side
- Anthony - sampling profiler is non-invasive, unlike strace
- Lori - current software is EOL in 6-8 weeks, is it worth doing on this version?
- Anthony - nothing in the next release notes seems to indicate major fixes, so will want to start reviewing
- this ticket has now been resolved
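
As a rough illustration of why the symbols matter to a sampling profiler: each sample is only an instruction address, and the symbol table is what turns it into a function name. The sketch below is a minimal Python model of that lookup. The addresses and function names are made up for illustration and are not taken from any Ceph binary.

```python
import bisect
from collections import Counter

# Hypothetical symbol table: (start address, function name), sorted by address.
SYMBOLS = [
    (0x1000, "read_request"),
    (0x1400, "lookup_inode"),
    (0x2000, "journal_flush"),
]
STARTS = [start for start, _ in SYMBOLS]


def symbolize(address, have_symbols=True):
    """Map a sampled address to a function name, or to raw hex without symbols."""
    if not have_symbols:
        return hex(address)
    # Find the last symbol whose start address is <= the sampled address.
    index = bisect.bisect_right(STARTS, address) - 1
    if index < 0:
        return hex(address)  # address lies below the first known symbol
    return SYMBOLS[index][1]


def profile(samples, have_symbols=True):
    """Count samples per symbolized location, as a sampling profiler would."""
    return Counter(symbolize(addr, have_symbols) for addr in samples)


# Hypothetical samples taken at regular intervals from a running process.
samples = [0x1004, 0x1410, 0x1420, 0x2008, 0x1408]
with_symbols = profile(samples)
without_symbols = profile(samples, have_symbols=False)
```

With the table available the five samples collapse into three named functions. Without it every sample stays a distinct hex address, which is why profiling the servers without symbols on the host gives little to work with.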
Any plans to deal with cephfs client crashes on teaching systems? - yc2lee/gxshen - RT#1214857
- several teaching systems have crashed in the past couple months
- seems to be a Linux kernel/Ceph issue - not clear if anyone in those groups is working on it
- eg: https://forum.proxmox.com/threads/bug-kernel-null-pointer-dereference-address-0000000000000402.106067/
- process for now?
- the problem does not occur on kernel 5.11 or earlier
- Anthony: concerned this is due to running the HWE kernels. Should only be running stock kernels
- agreed at this meeting that the stock kernel should be put back onto systems as they are rebooted
- noted in ticket RT#1217576:
  - At today's DFSC meeting, we agreed that Anthony will replace the current HWE kernels with the stock (5.4) kernel when rebooting the infrastructure systems over the next few days.
  - If you have any objections - note them here. Otherwise we assume you are all supportive of this plan.
  - If there are any noted problems, Anthony assures us that it should be able to be rolled back easily.
  - Lori's concern is about potential impacts on performance. Would not be supportive of the change if we were not experiencing crashes.
  - Anthony has diagnostic data but may require some re-working of his diagnostic tools based on the kernel change.
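
A minimal sketch of how the kernel reversion could be checked across hosts, assuming the release string has the usual `major.minor...` form. The two version constants come from the notes above (stock is 5.4, crash not seen on 5.11 or earlier). The function names and the sample release strings in the usage note are hypothetical.

```python
import re

STOCK_SERIES = (5, 4)       # stock kernel series named in the ticket note
LAST_GOOD_SERIES = (5, 11)  # per the notes: crash not seen on 5.11 or earlier


def kernel_series(release):
    """Return (major, minor) from a release string such as '5.4.0-109-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def is_stock(release):
    """True if the release belongs to the stock kernel series."""
    return kernel_series(release) == STOCK_SERIES


def possibly_affected(release):
    """True if the release is newer than the last series reported as crash-free."""
    return kernel_series(release) > LAST_GOOD_SERIES
```

On a host, passing the output of `uname -r` (or `platform.release()` in Python) to `is_stock` would show whether the reboot picked up the stock kernel. For example, `is_stock("5.4.0-109-generic")` is true, while `possibly_affected("5.13.0-39-generic")` flags a newer HWE-style release.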

Upgrades

Update on status of the Ceph Dashboard RT# 973431 dlgawley

Lori: will follow from pending upgrades
Anthony: that is basically the Prometheus data

Start with 902 systems. Practice upgrades, work out bugs nfish/a2brenna/ldpaniak

in progress

Server side

Want to upgrade to Pacific by end of summer at the latest: strays, upgrade path, less OSD spill from RocksDB (sharding; currently 3, 30, 300GB...), mclock scheduler, Grafana daemons. Octopus is out of support 2022-06-01: early May?
One MDS problem
Remove all snapshots?
- in preparation for the upgrade
- Anthony: we should ensure we have complete backups before that
- Guoxiang: running low on disk space for index files, backups are slow, new hardware has problems, doesn't know when he can do the full backup (pretty soon?)
Real downtime with low/no cluster load
ceph-deploy deprecated: migrate to ceph-adm (Docker containerization), then upgrade
- maybe not as deprecated as originally thought
- probably can't use ceph-adm
- will try on the 902s
splitting cs-teaching into multiple real filesystems: u0-u19(?)
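
The "3, 30, 300GB" figures in the Pacific item are commonly described as RocksDB level sizes that grow by roughly 10x per level: a DB partition that falls between two steps is only used up to the lower step, and the rest of the metadata spills to the slower device. The sketch below is a simplified model of that arithmetic only. It assumes those three step sizes and ignores all other RocksDB and BlueStore tuning, so it is not a sizing tool.

```python
# Step sizes (GB) taken from the minutes; treated here as fixed assumptions.
LEVEL_STEPS_GB = (3, 30, 300)


def usable_db_capacity_gb(partition_gb, steps_gb=LEVEL_STEPS_GB):
    """Largest step that fits in the partition, or 0 if none fits."""
    usable = 0
    for step in sorted(steps_gb):
        if partition_gb >= step:
            usable = step
    return usable


def wasted_db_capacity_gb(partition_gb, steps_gb=LEVEL_STEPS_GB):
    """Partition space beyond the largest step that fits (unused in this model)."""
    return partition_gb - usable_db_capacity_gb(partition_gb, steps_gb)
```

Under this model a hypothetical 60GB DB partition behaves like a 30GB one, with the other 30GB unused, which is the spill behaviour that sharding in the newer release is expected to reduce.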

Client side

- NFS servers to ganesha 4.0 -> ctucker
- https://launchpad.net/~nfs-ganesha/+archive/ubuntu/libntirpc-4
- https://launchpad.net/~nfs-ganesha/+archive/ubuntu/nfs-ganesha-4
- Follow on from previous work https://rt.uwaterloo.ca/Ticket/Display.html?id=1208856
  - should we wait until after the kernel reversion?

Scratch(ish) drives on 211 systems (fhgunn)

ZFS sends for sync

Upcoming maintenance

Ongoing failed/flaky hard drive replacement gxshen. 421 drives are 5 years old. Expect failures.
- updates in progress

MDS instances holding strays

- Anthony: what happens to the strays if the machine holding the MDS {goes away, crashes, etc}
- MDS is a data service that is triplicated, so one can go away
- the strays will move
- Anthony: why not just turn off the MDS?
- the cluster will automatically start up a new one
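
To watch whether strays actually move when an MDS is stopped, the stray counts could be compared before and after. The sketch below assumes the JSON comes from `ceph daemon mds.<name> perf dump` on each MDS host and that the stray counter is `num_strays` under the `mds_cache` section; both should be verified against the running release. The MDS names and sample numbers are made up.

```python
import json


def stray_count(perf_dump_json):
    """Extract the stray count from one MDS perf dump (JSON text)."""
    perf = json.loads(perf_dump_json)
    # Assumed layout: {"mds_cache": {"num_strays": <int>, ...}, ...}
    return perf["mds_cache"]["num_strays"]


def total_strays(dumps_by_mds):
    """Sum strays across MDS daemons, given {mds_name: perf dump JSON text}."""
    return sum(stray_count(text) for text in dumps_by_mds.values())


# Hypothetical sample data standing in for real perf dump output.
sample_dumps = {
    "mds.a": '{"mds_cache": {"num_strays": 1200}}',
    "mds.b": '{"mds_cache": {"num_strays": 35}}',
}
sample_total = total_strays(sample_dumps)
```

Recording the per-daemon counts before stopping an MDS and again after the replacement starts would show where the strays ended up.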

Action items for next meeting

ldpaniak/nfish: Ceph upgrade on the 902s
Omar: Fraser - create a ticket for the plans for local storage option -> https://rt.uwaterloo.ca/Ticket/Display.html?id=1209206
Anthony: Any plans to deal with cephfs client crashes on teaching systems? - yc2lee/gxshen - RT#1214857 (revert to 5.4 kernel)
Clayton: NFS servers to ganesha 4.0
Guoxiang: need to get a full backup before the Ceph upgrade
Lori: disable/reduce new snapshots in anticipation of Ceph upgrade (?)

Topic revision: r5 - 2022-04-27 - LawrenceFolland

Information in this area is meant for use by CSCF staff and is not official documentation, but anybody who is interested is welcome to use it if they find it useful.
