Linux Working Group
Meeting Date
Invitees - Attendees
- Adrian, Anthony (group leader), Clayton, Guoxiang, Lori, Fraser, Devon, Nathan, Nick, Todd, Dave, Lawrence, Omar
Review and accept previous meeting minutes.
Proposed Agenda Items
NetApp Retirement - Deadline January 2022
Migrate remaining data
- In General
- Any progress on moving Dan Berry's /opt/csw?
- RSG discuss moving to jerusalem directly (lfolland)
- Configure an NFS share via gateways: RT#1194157 (ctucker)
- Guoxiang - tell us the size of the filesystem
- Nathan - create an appropriate CephFS
- Clayton - create the NFS share
- New storage for apache web logs?
- RT# 1196145 - Will use DFSc since web service already relies on DFSc (for homedirs). Configuration to be done and verified at end of term reboot. (a2brenna, ijmorlan, nfish)
- Anthony feels we're in good shape. Will update the ticket with next steps to be done at term end
- TEACHING - Progress reports on
- Unmounting /oldhome immediately (gxshen). Appears to be done on all general use servers?
- Guoxiang - is this gone now? Yes, done.
- Perform final (end of year backup) of /oldhome data on the Netapp and remove data from Netapp in January (gxshen)
- Guoxiang: did a full backup in May 2020
- /opt/CSCF/packages is also being unmounted (collections of really old packaged software - no longer used).
- Appears to be done on all general use servers. (a2brenna)
- XHIER
- fs-homedirs.student.cs.uwaterloo.ca:/regional/.software/regional
- Seems to be mounted on only one linux.student.cs machine... Can we remove this? (a2brenna)
- RT#1067477
- fs-homedirs.cs.uwaterloo.ca:/regional_core
- Seems to be mounted on only one linux.cs machine... Can we remove this? (a2brenna)
- RT#1067480
- Guoxiang to check with Isaac and Adrian re: Odyssey and web servers
- Are we still on track for moving mail in December?
- lfolland sent notification for moving CS mail to IST on 2021-11-24 - Done
- Do not move mail homes to DFSC until after move to IST
- lfolland to discuss with sdinney. Progress?
- /var/mail (still on NetApp)
- TEACHING is done
- CS-GENERAL is done
- Fraser notes that he is still getting some new mail on CS (can be from CS to CS)
- Adrian points out there are several reasons mail still comes directly
- spammers, Let's Encrypt, other CS mail users, CS systems using mx.cs
- servers to be changed to send mail directly to the destination - has that happened? Not yet. (a2brenna)
- configuration change to postfix required, will need a recipe for other machines that may be set up to do that (arpepper, a2brenna)
- need to change CS hosts to send mail directly, rather than through mx.cs
- do we have a ticket for this?
- Anthony and Adrian to find/create a ticket and discuss
- Anthony has some configurations that may be easily adapted to work
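As a possible starting point for the postfix recipe discussed above, the sketch below shows how a host's relay setting could be inspected and cleared. It assumes the affected hosts currently route outbound mail through mx.cs via Postfix's `relayhost` parameter, which these minutes do not confirm. The helper name `describe_relayhost` is hypothetical.

```shell
# Describe what a Postfix relayhost value means for outbound delivery.
# An empty value means Postfix delivers directly using MX lookups.
describe_relayhost() {
    if [ -z "$1" ]; then
        echo "direct"
    else
        echo "relay via $1"
    fi
}

# On a real host (assumption: Postfix is installed and relayhost is in use):
#   describe_relayhost "$(postconf -h relayhost)"   # show the current setting
#   sudo postconf -e 'relayhost ='                  # clear it: deliver directly
#   sudo postfix reload                             # pick up the change
```

Clearing `relayhost` is only one way to do this; the configurations Anthony mentioned may take a different approach.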
Monitoring
- We now have an Icinga GPU monitoring plugin for NVIDIA-based cards - RT#1198875
- Should we look into support for ATI cards?
- does not appear that we do - possible future development
- Do we have any Icinga agent hosts that we can use to collect some test data?
- basilisk.cs
- if we add it everywhere, then it should start showing up
- Anthony will add it to standard installation process
- Anthony has a script for adding Icinga client-side - down to one step
- get ticket from Launch Wizard
- Reminder - we have a lot of host problems in Icinga that need to be cleaned up to make it more useful
- a lot of these are research, I know
- host groups
- Lawrence to create a ticket for Devon with a big list of Host Groups and contacts - needs to be done manually by Devon
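To show the general shape of an Icinga check like the GPU monitoring plugin mentioned above, here is a minimal threshold check for a GPU temperature reading. It is not the plugin from RT#1198875. The function name `check_gpu_temp` and the thresholds of 80 and 90 are assumptions chosen only for illustration.

```shell
# Nagios/Icinga-style threshold check for a single GPU temperature reading.
# Exit codes follow the plugin convention: 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN.
check_gpu_temp() {
    temp="$1"; warn="$2"; crit="$3"
    # Reject empty or non-numeric readings (e.g. no driver or no GPU present).
    case "$temp" in
        ''|*[!0-9]*)
            echo "UNKNOWN - no numeric GPU temperature reading"
            return 3
            ;;
    esac
    # Performance data after the pipe uses the label=value;warn;crit form.
    if [ "$temp" -ge "$crit" ]; then
        echo "CRITICAL - GPU temperature ${temp}C | temp=${temp};${warn};${crit}"
        return 2
    elif [ "$temp" -ge "$warn" ]; then
        echo "WARNING - GPU temperature ${temp}C | temp=${temp};${warn};${crit}"
        return 1
    fi
    echo "OK - GPU temperature ${temp}C | temp=${temp};${warn};${crit}"
    return 0
}

# On a host with the NVIDIA driver installed (first GPU only):
#   temp=$(nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader,nounits | head -n 1)
#   check_gpu_temp "$temp" 80 90
```

Support for other vendors' cards would need a different data source in place of `nvidia-smi`; the threshold logic could stay the same.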
CS Teaching - slowness of systems - how to address? (All)
- Ceph:
- Gateway systems (NFS/Samba) upgrade status? Schedule denylisting of old client. Was this scheduled? (ctucker, ldpaniak)
- has there been any progress in turning off the old ganesha servers?
- concerns about security and performance
- Lori - wants us to move to the latest supported software versions
- Clayton - decommission old servers as soon as possible
- Memory depletion on login servers. Reserve 10% memory for system/root/ceph use?
- unclear whether this is the cause of the problems or a symptom of them
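To put a number on the 10% reservation idea, the sketch below computes a percentage of total memory from `/proc/meminfo`. The function name `reserve_kb` is hypothetical, and the minutes do not say how such a reservation would be enforced.

```shell
# Print a percentage (default 10) of MemTotal, in kB, from a meminfo-style file.
# The file path is a parameter so the calculation can be tried on sample data.
reserve_kb() {
    meminfo="${1:-/proc/meminfo}"
    percent="${2:-10}"
    awk -v pct="$percent" '$1 == "MemTotal:" { printf "%d\n", $2 * pct / 100 }' "$meminfo"
}

# On a login server:
#   reserve_kb            # 10% of this machine's memory, in kB
#   reserve_kb /proc/meminfo 5
```

One candidate mechanism would be a cgroup limit on user sessions, for example systemd's `MemoryMax=` on `user.slice`. Whether that suits these servers is an open question, given that it is still unclear if memory depletion is a cause or a symptom.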
Other Issues
*-postgres-2004
- How much disk space does it actually need? 500 + 256 GB. Current *-203.cloud.cs.uwaterloo.ca are adequate.
- problem was that one was using 2TB
Avoid rebooting troubled systems (a2brenna)
- eg: ubuntu2004-????.??? last night
- makes it (nearly) impossible to diagnose after a reboot
- If you can access the running OS at all, there are better options
- Urgency is often an illusion
- In the case of machines in the linux.student.cs round-robin, just take them out of the round robin
- Lori - counter-concern - a single slow system may impact all the systems using the same filesystem
- Lawrence - suggestion to have an agreed-upon communications channel to deal with emergency issues (eg: Teams Channel, eg: Emergencies)
- Anthony - has installed a method to analyze crash dumps, if crashed appropriately
- To crash a machine in a useful fashion that generates a dump: 'echo c > /proc/sysrq-trigger'
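Before forcing a crash with the command above, it may be worth confirming that a crash kernel is actually loaded, since otherwise no dump is produced. The sketch below assumes the crash dump method is kdump/kexec based, which the minutes do not state. The function name `crash_dump_ready` is hypothetical.

```shell
# Succeed only if a kexec crash kernel is loaded (the flag file contains "1").
# The sysfs path is a parameter so the check can be exercised on a test file.
crash_dump_ready() {
    flag_file="${1:-/sys/kernel/kexec_crash_loaded}"
    [ -r "$flag_file" ] && [ "$(cat "$flag_file")" = "1" ]
}

# Intended use on the troubled host, as root:
#   if crash_dump_ready; then
#       echo c > /proc/sysrq-trigger    # force a crash that generates a dump
#   else
#       echo "no crash kernel loaded; a forced crash would not produce a dump" >&2
#   fi
```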
Joining new linux hosts to AD
- higher volume of container creation and rebuilds could benefit from more automation and more authorized users
- Clayton has a tool that follows INF standards of nscd and kerberos
- Anthony - interested in the part that generates a keytab file
- Clayton: needs to be run on linux.cscf and handles the appropriate tickets
- Lori: what about net ads join?
Rebooting INF machines
- last day of exams is Dec 23rd
- either Dec 28th (Tues) or 29th (Wed)
- Anthony and Guoxiang will reboot on the 29th
- Lawrence to send out email to SCS everybody: reboot starts at 1pm, expected end time 5pm?
- all CS Teaching and CS General, plus other services
Last meeting Action items
- Anthony/Adrian - work on new postfix recipe to have servers send mail out directly
- Anthony/Guoxiang - initiate warranty - RT#1196085 - under warranty until 2025-03-30
- Devon - collect power data to show Plant Operations
- yes
- still doing 5V deviance
- someone to communicate with PlantOps?
- Lori - Possibly recover 960GB Optane cards from 422 systems?
- probably more use in a database server
- Guoxiang/Lori - create a ticket, or report the ticket number, for Storage option catalogue
- Omar - create ticket(s) for VScode/git workflow
- no RT, but Nick is discussing with faculty
This meeting Action items
- Anthony/Adrian - work on new postfix recipe to have servers send mail out directly
- Anthony/Guoxiang - initiate warranty - RT#1196085 - under warranty until 2025-03-30
- Lawrence to create a ticket for Devon with a big list of Host Groups and contacts - needs to be done manually by Devon
- RT#1194157 - /opt/csw
- Guoxiang - tell us the size of the filesystem
- Nathan - create an appropriate CephFS
- Clayton - create the NFS share
- Devon - create ticket to document power issues to show Plant Operations
- Lori/Nathan - consider whether we can recover 960GB Optane cards from 422 systems
- Guoxiang/Lori - create a ticket, or report the ticket number, for Storage option catalogue (low priority)
- Clayton - document process of adding hosts to AD and move to a generally accessible place - create a ticket
- Lawrence to send out email to SCS everybody about the Dec 29th reboot: starts at 1pm, expected end time 5pm