Meeting Date
Invited
Anthony (group leader), Lori, Dave, O, Clayton, Guoxiang, Nathan, Nick, Todd, Ed, Devon, Gwen
Attendees
- Anthony, Dave, Clayton, Guoxiang, Nathan, Nick, Ed, Devon, Gwen
Review and accept previous meeting minutes.
CsLWGMeeting20240207
Review last meeting's Action Items
Ongoing problems with Inventory and IPAM are hobbling Infrastructure operations
- Will bring up Management's failure to honour staffing commitments for existing critical services at CSCF Staff meeting
Ongoing problems with Ganesha service RT#1303795
- Production instance of Ganesha now running version 5.6+ stable
- Standby instance now also upgraded
- Numerous independent bugs, including a possible cause of the previously noted tendency to hang
- New monitoring dashboard
- Debug symbols
- This service (using NFS-Ganesha) has a tendency to hang (i.e., mounts disappear on clients, and the server refuses new connections)
- New evidence suggests this causality is backwards and that Ganesha distress causes blocked operations on CephFS clients (a2brenna)
- Further evidence from new Ganesha instance
- The only current solution is to reboot every active server so that clients work again
- Yet another example of important work that went undone until it became urgent... Devon and Anthony up till 2am. Will bring up at CSCF Staff meeting.
- Currently switching servers requires a reboot or remount
- Will switch to m3 tonight
Monitoring Services
- Number of false alerts is a concern.
- Important alerts continue to be missed as a result of alert fatigue
- Container networking will not survive a reboot
- Delayed due to lack of staff time
- Possible issue with communication between inventory and webserver: RT#1257072, updates to DNS fields cause a hang
- Lack of Service Maintenance outside of standard working hours has been more of a problem lately.
- Management is aware and needs to review this.
- Will bring this up at upcoming staff meeting (a2brenna)
linux.cscf.uwaterloo.ca
- New linux.cscf.uwaterloo.ca running Ubuntu 22.04 is almost ready - soft rollout next week
- Testing has revealed some issues, rollout delayed until they're resolved.
- Duo policy discussion to be had with Dave
- No Duo at this time, all CSCF staff have YubiKeys
- More delays due to security concerns
New DFSc Hardware
- No word yet on hardware
- Investigating what can be cobbled together with existing hardware to handle anticipated load spikes in the next month (CS 136)
- Prioritizing this will cause delays in other work
Comments