Linux Working Group
Meeting Date
Invited
- Anthony (group leader), Clayton, Guoxiang, Lori, Fraser, Devon, Nathan, Nick, Todd, Dave, O
Attendees
- Anthony, Guoxiang, Clayton, Lori, Devon, Todd, O, Fraser
Review and accept previous meeting minutes.
Review last meeting's Action Items
New Items
Ceph troubleshooting:
References
- CERN paper (attached below): one bad disk can ruin cluster performance; various tunings; erasure-coding configs
- Annotations: we don't use erasure coding.
- OSDs are underperforming because of configuration; hardware itself is performing as expected
- OSD queues have been observed to be idle most of the time.
- The vast majority of requests from general-use host clients to the MDS layer are file locks/unlocks.
- A high cap count (over 200k) on an MDS leads to poor performance (see the cap-count check below): https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/B7K6B5VXM3I7TODM4GRF3N7S254O5ETY/
- This is a rare occurrence; are the journaling latencies due to large dumps, as opposed to "amortized" dumps?
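One hedged way to check which clients are holding caps: list sessions on an MDS and sort by cap count. The daemon name mds.cs-teaching-a is a placeholder, and the JSON field names (id, num_caps) can vary slightly across Ceph releases.
# List the 10 sessions holding the most caps on one MDS (daemon name is a placeholder).
ceph tell mds.cs-teaching-a session ls | jq -r 'sort_by(-.num_caps) | .[0:10][] | "\(.id)\t\(.num_caps)"'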
Observations
(Possible) Actions (currently waiting on the cluster to heal / on data collection)
- Increase the number of PGs for the (cs-teaching) metadata pool (a2brenna)
- Increase mds_recall_max_caps (to improve cap recall from clients) (nfish)
- Increase objecter_inflight_op_bytes to 10485760000 (see attached CERN paper; sketch of all three changes after this list)
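A possible way to apply the three changes above. The pool name cephfs.cs-teaching.meta, the pg_num target, and the mds_recall_max_caps value are placeholders to verify against the live cluster; only the objecter_inflight_op_bytes value comes from the action item itself.
# Raise PG count on the metadata pool (pool name and target are placeholders).
ceph osd pool set cephfs.cs-teaching.meta pg_num 256
# Raise mds_recall_max_caps so the MDS recalls caps from clients more aggressively (value illustrative).
ceph config set mds mds_recall_max_caps 30000
# Value from the action item above; whether to scope this to mds or client should follow the CERN paper's setup.
ceph config set mds objecter_inflight_op_bytes 10485760000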
Hypotheses
Sick MDS
- cs-teaching/high-load filesystems damage MDS daemons, affecting their ability to recall/process caps (and IO)
- Concentrated load can drive up the cap count on an MDS (demand outstripping recall?), leading to problems; look at cap velocity/acceleration. See daily caps load ticket above
- Single MDS home directory: assign mds.11 on cs-teaching to /u6/ldpaniak (pinning sketch below). Result: consistent, excellent performance, even while other cs-teaching users on the same client machine see very poor performance: https://rt.uwaterloo.ca/Ticket/Display.html?id=1273593#txn-31548531
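For reference, directory-to-MDS pinning is done with an extended attribute on a mounted CephFS path; a minimal sketch, assuming /u6/ldpaniak is visible on a CephFS client mount:
# Pin the /u6/ldpaniak subtree to MDS rank 11, then verify the pin.
setfattr -n ceph.dir.pin -v 11 /u6/ldpaniak
getfattr -n ceph.dir.pin /u6/ldpaniak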
Insufficient parallelization of the MDS-to-OSD workload
Networking latency mimics OSD drive failure
- System-to-system pings around the HS100 ring can show multi-millisecond times (typical high-performance network ping times are around a hundred microseconds); see the ping sweep sketch after this list: https://rt.uwaterloo.ca/Ticket/Display.html?id=1270078#txn-31490589
- As mentioned in CERN paper, a single bad OSD device can "caus[e] small IO requests to take longer than 2s on average"
- Is intermittent high-latency networking impacting the cluster the same way as a (room full of) failing OSD device(s)?
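A hedged sketch for quantifying ring latency; the hostnames hs100-01..hs100-08 are placeholders for the actual ring members.
# Ping each ring member and print ping's min/avg/max/mdev round-trip summary.
for h in hs100-0{1..8}; do
  printf '%s: ' "$h"
  ping -c 20 -q "$h" | tail -1
done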
Can filesystem activity on one client (ubuntu2004-012) be correlated with OSD latency? Timing transcript below; a correlation sketch follows it.
@ubuntu2004-012% date && time tar xf gcc-9.1.0.tar
Tue 21 Feb 2023 11:16:20 PM EST
real 2m48.426s
user 0m0.718s
sys 0m7.031s
@ubuntu2004-012% date && time rm -rf gcc-9.1.0
Tue 21 Feb 2023 11:20:36 PM EST
real 2m28.312s
user 0m0.239s
sys 0m5.214s
@ubuntu2004-012% date && time tar xf gcc-12.2.0.tar
Tue 21 Feb 2023 11:37:14 PM EST
real 3m21.589s
user 0m1.218s
sys 0m9.050s
@ubuntu2004-012% date && time rm -rf gcc-12.2.0
Tue 21 Feb 2023 11:43:25 PM EST
real 1m49.031s
user 0m0.230s
sys 0m6.318s
@ubuntu2004-002% date && time tar xf gcc-9.1.0.tar
Tue 21 Feb 2023 11:31:17 PM EST
real 0m35.288s
user 0m0.492s
sys 0m4.670s
@ubuntu2004-002% date && time rm -rf gcc-9.1.0
Tue 21 Feb 2023 11:35:16 PM EST
real 0m53.688s
user 0m0.139s
sys 0m3.240s
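One hedged way to test the correlation above: sample per-OSD latency while the tar/rm test runs on the client. ceph osd perf is a standard command; the 5 s interval and top-5 cut are arbitrary choices.
# Timestamped samples of per-OSD commit/apply latency, to line up with the tar/rm timings above.
# Run on an admin node during the client test.
while sleep 5; do
  date
  ceph osd perf | tail -n +2 | sort -k2 -nr | head -5
done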
Comments