Linux Working Group
Meeting Date
Invited
Anthony (group leader), Lori, Dave, O, Clayton, Guoxiang, Nathan, Nick, Todd, Ed, Devon
Attendees
- Anthony (group leader), Clayton, Guoxiang, Nathan, Todd, Devon
Review and accept previous meeting minutes.
CsLWGMeeting20240110
Review last meeting's Action Items
CS Mailservers are going away (eventually)
- Mail servers persist due to CS Advisors. Will begin soft shutdown by blocking all hosts other than IST's mail appliance (which forwards the advising mail to mx.cs) and dropping all non-advising mail.
- OS upgrades needed
- csadviso@cs.uwaterloo.ca special forwarding will continue working, consulting with IST and Brad Lushman (a2brenna)
- Nick is looking to replace the forwarding script with Microsoft Power Automate (M365)
- Demo what has been completed to advisors; IST has been notified, advisors need to provide feedback
Ongoing problems with Inventory and IPAM are hobbling Infrastructure operations
- CSCF Management has been put on notice
Ongoing problems with Ganesha service RT#1303795
- needs further enhancements to monitoring service?
- Devon and Anthony preparing doc for help desk
- Delayed due to lack of staff time
- More comprehensive monitoring of NFS performance is in the works (a2brenna, dmerner) ~ January
- Delayed due to lack of staff time
Monitoring Services
- Number of false alerts is a concern.
- Important alerts continue to be missed as a result of alert fatigue
- Container networking will not survive a reboot
- Delayed due to lack of staff time
- Possible issue with communication between inventory and webserver: RT#1257072, updates to DNS fields cause a hang
- Lack of Service Maintenance outside of standard working hours has been more of a problem lately.
- Management is aware and needs to review this.
linux.cscf.uwaterloo.ca
- New linux.cscf.uwaterloo.ca running Ubuntu 22.04 is almost ready - soft roll out next week
- Testing has revealed some issues, rollout delayed until they're resolved.
- Duo policy discussion to be had with Dave
Incremental backups of block devices
- Possible solutions include rsync and borg but neither is ideal
- gxshen to investigate Legato NetWorker backups of block devices
Snapshots are still disabled
- a2brenna to enable snapshots on a file system to test performance - hoping to be done before start of next term
- Delayed due to insufficient staffing
- There are new performance anomalies that make this less likely to happen until they are understood and/or fixed (a2brenna).
- Multiple OSDs in BlueFS spillover, see RT #1312466
- Delayed due to insufficient staffing
- communication should be sent at the beginning of the term to inform users of the current status
- Cancelled due to a lack of any real news to convey
- In the meantime, more frequent backups of select directories (course accounts) are being arranged, see RT #1312411
SE undergrad users to be created under CS-TEACHING (RT?)
- Need management to clarify relationship between SE and CSCF (bring up at next staff meeting)
- Any custom tools / software needed?
New Business
New DFSc Hardware
- Potentially as much as 8 weeks away
- Investigating what can be cobbled together with existing hardware to handle anticipated load spikes
Disk Full on dc-3558-203 RT: 1312969
- Disks filled up around 16:00 on Friday Jan 19th.
- Space was freed around 16:28
- Cause as yet unknown
- Impact unknown
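Since the cause is not yet known, one option is an early warning before a disk fills. The following is a minimal sketch only, not an existing CSCF tool: the function names, the 90% threshold, and the example path are assumptions made for illustration.

```python
import shutil


def used_fraction(total_bytes: int, used_bytes: int) -> float:
    """Return the fraction of a filesystem in use, between 0.0 and 1.0."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return used_bytes / total_bytes


def filesystems_over_threshold(paths, threshold=0.9):
    """Return (path, fraction) pairs for paths whose usage is at or above threshold.

    The 0.9 default is an assumed value, not a site policy.
    """
    over = []
    for path in paths:
        usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
        if usage.total == 0:
            # Skip pseudo-filesystems that report no capacity.
            continue
        fraction = used_fraction(usage.total, usage.used)
        if fraction >= threshold:
            over.append((path, fraction))
    return over


# Example (hypothetical path): filesystems_over_threshold(["/"], threshold=0.9)
```

A check like this could run periodically and feed the existing monitoring service, but it would only be useful if its threshold is tuned to avoid adding to the false-alert problem noted above.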
Power Outage on 2024-01-28
Problems with nfs-files.student.cs.uwaterloo.ca
- Our non-Mac labs use this service to mount kerberized NFS user home directories from DFSc
- This service (using NFS-Ganesha) has a tendency to hang (i.e., mounts disappear on clients, and the server refuses new connections)
- Possibly related to whenever DFSc is under load/maintenance
- NFS-Ganesha doesn't have much in the way of logging (unless you enable debug logs)
- Only current solution is to reboot every active server in order for clients to work again
- Server was hanging connections Tues. Jan 23rd. All of our non-Mac labs were unusable (initial report by Carmen Bruni in MC3003)
- m3-nfs-012 (on m3-3101-203) is the only active server at the moment
- The end user tries to log in and sees the following symptoms:
- The password box shakes as if their credentials are incorrect OR
- The machine gets past the credentials screen and hangs (and might eventually time out back to the login screen)
- We have had users believe the issue exists with their accounts or the lab computer, and not with the home directory service
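Because one symptom noted above is that the server refuses new connections, a simple TCP connect probe might help the help desk tell a home directory service outage apart from an account or lab machine problem. This is a rough sketch under assumptions: the function name is hypothetical, port 2049 is the standard NFS port but has not been confirmed for this service, and a successful connection does not prove that mounts are actually working.

```python
import socket


def tcp_port_responds(host: str, port: int = 2049, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port completes within timeout.

    This detects only the "refuses or hangs new connections" symptom;
    it does not verify that NFS requests are actually being served.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Example: tcp_port_responds("nfs-files.student.cs.uwaterloo.ca")
```

A probe like this is deliberately cheap so that it could be run from a lab machine or from the monitoring host; a fuller check would need to attempt a real kerberized mount.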