Meeting Date

  • TEAMS: 2024-02-07

Invited

Anthony (group leader), Lori, Dave, O, Clayton, Guoxiang, Nathan, Nick, Todd, Ed, Devon

Attendees

  • Anthony (group leader), Dave, Clayton, Guoxiang, Nick, Todd, Devon, Fraser

Review and accept previous meeting minutes.

CsLWGMeeting20240124

Review last meeting's Action Items

Ongoing problems with Inventory and IPAM are hobbling Infrastructure operations

Ongoing problems with Ganesha service RT#1303795

  • Our non-Mac labs use this service to mount kerberized NFS user home directories from DFSc
  • This service (using NFS-Ganesha) has a tendency to hang (i.e., mounts disappear on clients, and the server refuses new connections)
    • Possibly related to whenever DFSc is under load/maintenance
      • New evidence suggests this causality is backwards and that ganesha distress causes blocked operations on CephFS clients (a2brenna)
    • NFS-Ganesha doesn't have much in the way of logging (unless you enable debug logs)
  • The only current solution is to reboot every active server so that clients work again
  • The server was hanging connections on Tues. Jan 23rd. All of our non-Mac labs were unusable (initial report by Carmen Bruni in MC3003)
  • m3-nfs-012 (on m3-3101-203) is the only active server at the moment
  • An end user who tries to log in sees one of the following symptoms:
    • The password box shakes as if their credentials are incorrect OR
    • The machine gets past the credentials screen and hangs (and might eventually time out back to the login screen)
    • We have had users believe the issue exists with their accounts or the lab computer, and not with the home directory service
  • Needs further enhancements to the monitoring service?
    • Devon and Anthony preparing doc for help desk
      • Delayed due to lack of staff time
      • Put on hold. Restarting the server destroys all information useful in actually fixing the problem.
    • More comprehensive monitoring of NFS performance is in the works
      • Latest stable nfs-ganesha has hooks for enhanced monitoring
      • Work has begun on modernizing the deployment to take advantage (a2brenna, dmerner)
    • Unexplained stuck transactions on TEACHING filesystems on linux.student.cs correspond with nfs-ganesha being in a malfunctioning state and are unblocked when the ganesha service is restarted
      • Work to identify the exact cause is ongoing
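
A minimal sketch of how a monitoring check could detect the hang described above, where a mount stops responding rather than returning an error. This is an illustration only and not the group's actual monitoring: the function name `probe_path` and the five-second default timeout are assumptions. The stat call runs in a child process so that a blocked NFS operation cannot stall the checker itself.

```python
import subprocess
import sys


def probe_path(path: str, timeout_s: float = 5.0) -> str:
    """Return 'ok', 'error', or 'hung' for a stat() of path (hypothetical helper)."""
    # Run stat() in a child process so a blocked NFS call cannot stall the caller.
    child = subprocess.Popen(
        [sys.executable, "-c", "import os, sys; os.stat(sys.argv[1])", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        returncode = child.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Best-effort kill. Do not wait again: a child blocked in
        # uninterruptible I/O may not exit until the server recovers.
        child.kill()
        return "hung"
    # A non-zero exit means stat() raised (e.g. the path does not exist).
    return "ok" if returncode == 0 else "error"
```

Pointed at a home directory mount point on a lab client, a check like this would report "hung" when the stat call does not return within the timeout, which distinguishes a stalled home directory service from an account or credentials problem.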

Monitoring Services

  • Number of false alerts is a concern.
    • Important alerts continue to be missed as a result of alert fatigue
  • Container networking will not survive a reboot
    • Delayed due to lack of staff time
  • Possible issue with communication between inventory and webserver: RT#1257072, updates to DNS fields cause a hang
  • Lack of Service Maintenance outside of standard working hours has been more of a problem lately.
    • Management is aware and needs to review this.
      • Will bring this up at upcoming staff meeting (a2brenna)

linux.cscf.uwaterloo.ca

  • New linux.cscf.uwaterloo.ca running Ubuntu 22.04 is almost ready - soft roll out next week
    • Testing has revealed some issues; rollout is delayed until they're resolved.
    • Duo policy discussion to be had with Dave
    • More delays due to security concerns

New DFSc Hardware

  • Potentially arriving end of Feb
  • Plan is to deploy ASAP
  • Investigating what can be cobbled together with existing hardware to handle anticipated load spikes in the next month (CS 136)
  • Prioritizing this will cause delays in other work

Disk Full on dc-3558-203 RT: 1312969

  • Disks filled up around 16:00 on Friday Jan 19th.
  • Space was freed around 16:28
  • Cause was misconfigured postgres database on dc-3558-odyssey-postgres-2004
    • Longstanding issue, see RT: 1266074
    • Has been effectively mitigated by forcibly migrating dc-3558-odyssey-postgres-2004 backing store to a bounded LVM volume
  • Impact unknown
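
A minimal sketch of a disk usage threshold check that could give earlier warning of an incident like this one. The function names and the 90% default threshold are assumptions for illustration and do not describe the existing monitoring configuration.

```python
import shutil


def usage_fraction(path: str) -> float:
    """Return the used fraction (0.0 to 1.0) of the filesystem containing path."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    if usage.total == 0:
        return 0.0  # avoid division by zero on pseudo-filesystems
    return usage.used / usage.total


def is_over_threshold(path: str, threshold: float = 0.9) -> bool:
    """True when the filesystem holding path is at or above the threshold."""
    # The 0.9 default is an assumed value, not a documented policy.
    return usage_fraction(path) >= threshold
```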

Power Outage on 2024-01-28

New Business

2FA policy for linux.cscf

Comments

Topic revision: r4 - 2024-02-07 - ToddLichty
 