DFSc Working Group
Meeting Date
Invitees - Attendees
- Anthony, Dave, Fraser, Gouxiang, Lori, Nathan, Nick
Review and accept previous meeting minutes.
Proposed Agenda Items
Old business
global_id reclaim
https://docs.ceph.com/en/latest/security/CVE-2021-20288/
Mitigating this CVE requires upgrading userspace Ceph clients. Samba has been upgraded; only the nfs-ganesha servers still need upgrading.
This is the cause of these warnings:
clients are using insecure global_id reclaim
mons are allowing insecure global_id reclaim
Ganesha PPA: version 3.5+
saltstack formula for ganesha (dlgawley)
If NFS servers are updated, go with toggle to disallow insecure reclaim.
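A minimal sketch of the decision above, assuming the two warning strings quoted earlier appear verbatim in `ceph health detail` output. The helper name `reclaim_status` is hypothetical. The toggle itself is the `auth_allow_insecure_global_id_reclaim` setting described in the CVE page linked above; it is shown only as a comment so that nothing changes on a cluster by accident.

```shell
#!/bin/sh
# Hypothetical helper: reads `ceph health detail` output on stdin and reports
# whether it looks safe to disallow insecure global_id reclaim.
reclaim_status() {
    health=$(cat)
    if printf '%s\n' "$health" | grep -q 'clients are using insecure global_id reclaim'; then
        # Unpatched clients (e.g. nfs-ganesha servers not yet upgraded) are still connected.
        echo "unpatched clients still connected"
    elif printf '%s\n' "$health" | grep -q 'mons are allowing insecure global_id reclaim'; then
        # Only the mon-side warning remains, so the toggle can be flipped.
        echo "safe to disallow"
    else
        echo "nothing to do"
    fi
}

# On a live cluster (not executed here):
#   ceph health detail | reclaim_status
#   ceph config set mon auth_allow_insecure_global_id_reclaim false
```

Flipping the toggle while unpatched clients are still connected would cut them off when they next need to reclaim their global_id, which is why the client-side warning is checked first.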
Update on bug report/patch for Ubuntu 5.11 kernel on ceph append bug
Apparently in 5.11.40
https://tracker.ceph.com/issues/51948
https://rt.uwaterloo.ca/Ticket/Display.html?id=1194773
Roll to 100,000 (a2brenna) EOD 2021-11-23.
fhgunn to test.
If OK, roll to prod thereafter.
Timeline to get client systems back to 5.11 kernel
https://icinga.cscf.uwaterloo.ca/grafana/d/03EnhXZGz/dfsc-monitoring?var-hostname=mc-3015-422.cloud.cs.uwaterloo.ca&orgId=1&from=1623347299875&to=1634149134520
See above...
Start with 004 which is down for BIOS update.
New business
ubuntu2004-012
This is a VM? Why so many old ceph processes? Network issue?
a2brenna: Highest uptime, most threads
kworkers do go away eventually...
Can we track the kworkers back to user activity?
lfolland: reboot systems regularly?
a2brenna: Uptime expectations, not addressing the real issues. Does the machine really have a problem?
ldpaniak: Take 012 out of pool? dlgawley has removed it from round-robin.
Needs further research...
-c hostname; ps uax |grep Nov |grep msgr
ldpaniak@charon:~/Temp$ ./check-ceph-student.sh
ubuntu2004-002
root 8097 0.0 0.0 0 0 ? I< Nov17 0:00 [ceph-msgr]
ldpaniak 143969 0.0 0.0 9492 3376 ? Ss 22:33 0:00 bash -c hostname; ps uax |grep Nov |grep msgr
ubuntu2004-004
root 8175 0.0 0.0 0 0 ? I< Nov07 0:00 [ceph-msgr]
ldpaniak 131204 0.0 0.0 9492 3280 ? Ss 22:33 0:00 bash -c hostname; ps uax |grep Nov |grep msgr
ubuntu2004-008
root 5641 0.0 0.0 0 0 ? I< Nov16 0:00 [ceph-msgr]
ldpaniak 47748 0.0 0.0 9492 3324 ? Ss 22:33 0:00 bash -c hostname; ps uax |grep Nov |grep msgr
ubuntu2004-010
root 4669 0.0 0.0 0 0 ? I< Nov13 0:00 [ceph-msgr]
ldpaniak 90585 0.0 0.0 9492 3316 ? Ss 22:33 0:00 bash -c hostname; ps uax |grep Nov |grep msgr
ubuntu2004-012
root 5068 0.0 0.0 0 0 ? I< Nov01 0:00 [ceph-msgr]
root 6457 0.0 0.0 0 0 ? I Nov20 0:00 [kworker/90:2-ceph-msgr]
root 18180 0.0 0.0 0 0 ? I Nov20 0:00 [kworker/12:0-ceph-msgr]
root 85071 0.0 0.0 0 0 ? I Nov20 0:19 [kworker/91:0-ceph-msgr]
root 107070 0.0 0.0 0 0 ? I Nov20 0:00 [kworker/26:2-ceph-msgr]
root 122108 0.0 0.0 0 0 ? I Nov19 0:00 [kworker/64:0-ceph-msgr]
root 148385 0.0 0.0 0 0 ? I Nov20 0:13 [kworker/16:0-ceph-msgr]
root 181730 0.0 0.0 0 0 ? I Nov20 0:33 [kworker/15:1-ceph-msgr]
ldpaniak 205256 0.0 0.0 9492 3184 ? Ss 22:33 0:00 bash -c hostname; ps uax |grep Nov |grep msgr
root 208581 0.0 0.0 0 0 ? I Nov20 0:00 [kworker/65:0-ceph-msgr]
root 228965 0.0 0.0 0 0 ? I Nov20 0:51 [kworker/108:0-ceph-msgr]
ubuntu2004-014
root 4500 0.0 0.0 0 0 ? I< Nov10 0:00 [ceph-msgr]
ldpaniak 62494 0.0 0.0 9492 3220 ? Ss 22:33 0:00 bash -c hostname; ps uax |grep Nov |grep msgr
ubuntu2004-016
root 3962 0.0 0.0 0 0 ? I< Nov10 0:00 [ceph-msgr]
ldpaniak 98678 0.0 0.0 9492 3452 ? Ss 22:33 0:00 bash -c hostname; ps uax |grep Nov |grep msgr
ldpaniak@charon:~/Temp$
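To compare hosts at a glance, the output above can be reduced to a per-host count of lingering `kworker/...-ceph-msgr` threads. This is a sketch only: the function name `count_msgr_kworkers` is hypothetical, and it assumes the input has the same shape as the paste above (a bare hostname line followed by that host's `ps` lines).

```shell
#!/bin/sh
# Hypothetical helper: reads check-ceph-student.sh style output on stdin and
# prints "<host> <number of kworker ceph-msgr threads>" for each host.
count_msgr_kworkers() {
    awk '
        # A bare hostname line starts a new host section.
        /^ubuntu[0-9]+-[0-9]+$/ {
            host = $0
            if (!(host in count)) { order[++n] = host; count[host] = 0 }
            next
        }
        # Count only kworker threads bound to ceph-msgr, not the base [ceph-msgr] thread.
        /\[kworker\/.*ceph-msgr\]/ { if (host != "") count[host]++ }
        END { for (i = 1; i <= n; i++) print order[i], count[order[i]] }
    '
}

# Usage (not executed here):
#   ./check-ceph-student.sh | count_msgr_kworkers
```

Run periodically, this gives a number that could be graphed alongside uptime to see whether the threads on 012 really do go away or keep accumulating.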
Client churn
Why does the number of clients on cs-teaching change all the time?
ctucker working on ganesha servers?
nfish checking on mon reporting for client (dis)connects.
https://icinga.cscf.uwaterloo.ca/grafana/d/03EnhXZGz/dfsc-monitoring?var-hostname=mc-3015-422.cloud.cs.uwaterloo.ca&orgId=1&from=1636348065232&to=1636658833526
Time keeping on client systems
Please use CSCF NTP servers
a2brenna will investigate
ldpaniak@charon:~/Temp$ ./check-time-student.sh
ubuntu2004-002
synchronised to NTP server (129.97.167.4) at stratum 2
time correct to within 30 ms
polling server every 1024 s
ubuntu2004-004
synchronised to NTP server (129.97.167.12) at stratum 2
time correct to within 43 ms
polling server every 1024 s
ubuntu2004-008
synchronised to NTP server (129.97.167.12) at stratum 2
time correct to within 47 ms
polling server every 1024 s
ubuntu2004-010
synchronised to NTP server (254.173.0.178) at stratum 2
time correct to within 27 ms
polling server every 1024 s
ubuntu2004-012
synchronised to NTP server (254.173.0.178) at stratum 2
time correct to within 32 ms
polling server every 1024 s
ubuntu2004-014
synchronised to NTP server (24.174.107.122) at stratum 2
time correct to within 26 ms
polling server every 1024 s
ubuntu2004-016
synchronised to NTP server (254.173.0.178) at stratum 2
time correct to within 24 ms
polling server every 1024 s
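A sketch for spotting clients that are not syncing to the intended servers. The function name `ntp_outliers` is hypothetical, and the address prefix passed to it is an assumption: the minutes do not list the CSCF NTP servers, so `129.97.` below is only a guess based on the addresses in the output above and should be replaced with the real list.

```shell
#!/bin/sh
# Hypothetical helper: reads check-time-student.sh style output on stdin and
# prints "<host> <ntp server>" for every host whose server address does not
# start with the given prefix.
ntp_outliers() {
    prefix=$1
    awk -v prefix="$prefix" '
        # A bare hostname line starts a new host section.
        /^ubuntu[0-9]+-[0-9]+$/ { host = $0; next }
        /synchronised to NTP server/ {
            # Extract the address found between the parentheses.
            addr = $0
            sub(/^[^(]*\(/, "", addr)
            sub(/\).*$/, "", addr)
            # index() == 1 means the address begins with the prefix.
            if (index(addr, prefix) != 1) print host, addr
        }
    '
}

# Usage (not executed here; the prefix is an assumption, see above):
#   ./check-time-student.sh | ntp_outliers 129.97.
```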
Ceph updates
New point release is out. Hold off on the update for now, since it is end of term. ldpaniak to set a regular update schedule.
Ceph upgrade to Pacific. Best to move to containers. Only one MDS per filesystem per host.
Future configurations to evaluate
UofT export of NFS/ZFS on cluster block. Here, use RBD for backend.
Progressive testing: parts of a filesystem at a time.
Demo NFS server with upcoming hardware on RBD.