Linux Working Group



Meeting Date

  • TEAMS: 2021-12-15

Invitees - Attendees

  • Adrian, Anthony (group leader), Clayton, Guoxiang, Lori, Fraser, Devon, Nathan, Nick, Todd, Dave, Lawrence, Omar

Review and accept previous meeting minutes.

Proposed Agenda Items

NetApp Retirement - Deadline January 2022

Migrate remaining data

  • In General
    • Any progress on moving Dan Berry's /opt/csw?
      • RSG discuss moving to jerusalem directly (lfolland)
        • Configure an NFS share via gateways: RT#1194157 (ctucker)
        • Guoxiang - tell us the size of the filesystem
        • Nathan - create an appropriate CephFS
        • Clayton - create the NFS share
    • New storage for apache web logs?
      • RT# 1196145 - Will use DFSc since web service already relies on DFSc (for homedirs). Configuration to be done and verified at end of term reboot. (a2brenna, ijmorlan, nfish)
        • Anthony feels we're in good shape. Will update the ticket with next steps to be done at term end
  • TEACHING - Progress reports on
    • Unmounting /oldhome immediately (gxshen). Appears to be done on all general use servers?
      • Guoxiang - is this gone now? Yes, done.
    • Perform final (end-of-year) backup of /oldhome data on the NetApp and remove the data from the NetApp in January (gxshen)
      • Guoxiang: did a full backup in May 2020
      • /opt/CSCF/packages is also being unmounted (collections of really old packaged software - no longer used).
        • Appears to be done on all general use servers. (a2brenna)

  • XHIER
    • fs-homedirs.student.cs.uwaterloo.ca:/regional/.software/regional
      • Seems to be mounted on only one linux.student.cs machine... Can we remove this? (a2brenna)
      • RT#1067477
    • fs-homedirs.cs.uwaterloo.ca:/regional_core
      • Seems to be mounted on only one linux.cs machine... Can we remove this? (a2brenna)
      • RT#1067480
      • Guoxiang to check with Isaac and Adrian re: Odyssey and web servers

  • Are we still on track for moving mail in December?
    • lfolland sent notification for the move of CS mail to IST on 2021-11-24 - Done
      • Do not move mail homes to DFSC until after move to IST
      • lfolland to discuss with sdinney. Progress?
    • /var/mail (still on NetApp)
      • TEACHING is done
      • CS-GENERAL is done
      • Fraser notes that he is still getting some new mail on CS (can be from CS to CS)
      • Adrian points out there are several reasons mail still comes directly
        • spammers, Let's Encrypt, other CS mail users, CS systems using mx.cs
        • servers to be changed to send mail directly to the destination - has that happened? Not yet. (a2brenna)
          • configuration change to postfix required; will need a recipe for other machines that may be set up to do that (arpepper, a2brenna)
      • need to change CS hosts to send mail directly, rather than through mx.cs
        • do we have a ticket for this?
        • Anthony and Adrian to find/create a ticket and discuss
        • Anthony has some configurations that may be easily adapted to work
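As a starting point for the postfix recipe discussed above, the following is a minimal sketch for checking whether a host currently hands outbound mail to a relay. It assumes the relay is configured through postfix's `relayhost` parameter, which the minutes do not confirm (a transport map is another possibility). `relay_status` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: assumes outbound relaying is set via the postfix "relayhost"
# parameter. relay_status is a hypothetical helper, not an existing tool.
relay_status() {
    # "postconf -h relayhost" prints the parameter value without the name.
    relay="$(postconf -h relayhost)"
    if [ -n "$relay" ]; then
        echo "relays via $relay"
    else
        # An empty relayhost means postfix looks up the destination itself.
        echo "delivers directly"
    fi
}

# Possible change on a host that should deliver directly (commented out,
# to be reviewed before use):
# postconf -e 'relayhost ='
# postfix reload
```

Running `relay_status` on each CS host would give a quick inventory of which machines still need the change.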

Monitoring

  • We now have an Icinga GPU monitoring plugin for nvidia based cards - RT#1198875
    • Should we look into support for ATI cards?
      • does not appear that we do - possible future development
    • Do we have any Icinga agent hosts that we can use to collect some test data?
      • basilisk.cs
      • if we add it everywhere, then it should start showing up
      • Anthony will add it to standard installation process
    • Anthony has a script for adding Icinga client-side - down to one step
      • get ticket from Launch Wizard
  • Reminder - we have a lot of host problems in Icinga that need to be cleaned up to make it more useful
    • a lot of these are research, I know :)
  • host groups
    • Lawrence to create a ticket for Devon with a big list of Host Groups and contacts - needs to be done manually by Devon

CS Teaching - slowness of systems - how to address? (All)

  • Ceph:
    • Gateway systems (NFS/Samba) upgrade status? Schedule denylisting of old client. Was this scheduled? (ctucker, ldpaniak)
      • has there been any progress in turning off the old ganesha servers?
      • concerns about security and performance
      • Lori - wants us to move to the latest supported software versions
      • Clayton - decommission old servers as soon as possible
    • Memory depletion on login servers. Reserve 10% memory for system/root/ceph use?
      • unclear whether this is the cause of the problems or a symptom of them
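To make the 10% reservation concrete, here is a small sketch that computes how much memory would remain for user sessions on a given login server. It only does the arithmetic from `MemTotal` in `/proc/meminfo`; how the limit would actually be enforced was not decided in the meeting. `user_memory_limit_kb` is a hypothetical helper name, and its arguments exist so the reserve percentage and input file can be varied.

```shell
#!/bin/sh
# Sketch only: prints the memory (in kB) left for users after reserving a
# percentage for system/root/ceph. user_memory_limit_kb is a hypothetical name.
user_memory_limit_kb() {
    meminfo="${1:-/proc/meminfo}"   # file to read MemTotal from
    reserve_pct="${2:-10}"          # percent to hold back (default 10)
    awk -v reserve="$reserve_pct" \
        '/^MemTotal:/ { printf "%d\n", int($2 * (100 - reserve) / 100) }' \
        "$meminfo"
}
```

Calling `user_memory_limit_kb` with no arguments reports the figure for the local machine with the default 10% reserve.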

Other Issues

*-postgres-2004

  • How much disk space does it actually need? 500 + 256 GB. Current *-203.cloud.cs.uwaterloo.ca are adequate.
    • problem was that one was using 2TB

Avoid rebooting troubled systems (a2brenna)

  • eg: ubuntu2004-????.??? last night
  • makes it (nearly) impossible to diagnose after a reboot
  • If you can access the running OS at all, there are better options
  • Urgency is often an illusion
  • In the case of machines in the linux.student.cs round-robin, just take them out of the round robin
  • Lori - counter-concern - a single slow system may impact all the systems using the same filesystem
  • Lawrence - suggestion to have an agreed-upon communications channel to deal with emergency issues (eg: Teams Channel, eg: Emergencies)
    • Done
  • Anthony - has installed a method to analyze crash dumps, if crashed appropriately
    • To crash a machine in a useful fashion that generates a dump: 'echo c > /proc/sysrq-trigger'
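A forced crash only yields a dump if a crash kernel is loaded beforehand. The sketch below checks for that before triggering the panic. It assumes a kdump-style setup, which the minutes do not describe in detail; `crash_kernel_loaded` is a hypothetical helper name, and its optional argument exists only so the check can be pointed at a test file.

```shell
#!/bin/sh
# Sketch only: assumes a kdump-style setup. /sys/kernel/kexec_crash_loaded
# reads "1" when a crash kernel is loaded. crash_kernel_loaded is a
# hypothetical helper name.
crash_kernel_loaded() {
    flag_file="${1:-/sys/kernel/kexec_crash_loaded}"
    [ -r "$flag_file" ] && [ "$(cat "$flag_file")" = "1" ]
}

# Example use (commented out on purpose: the echo line panics the machine):
# if crash_kernel_loaded; then
#     echo c > /proc/sysrq-trigger
# else
#     echo "no crash kernel loaded; a forced crash would not produce a dump" >&2
# fi
```

Checking first avoids taking a machine down without getting any diagnostic data in return.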

Joining new linux hosts to AD

  • higher volume of container creation and rebuilds could benefit from more automation and more authorized users
    • Clayton has a tool that follows INF standards of nscd and kerberos
    • Anthony - interested in the part that generates a keytab file
    • Clayton: needs to be run on linux.cscf and handles the appropriate tickets
    • Lori: what about net join ads?

rebooting INF machines

  • last day of exams is Dec 23rd
  • either Dec 28th (Tues) or 29th (Wed)
  • Anthony and Guoxiang will reboot on the 29th
  • Lawrence to send out email to everybody in SCS: start time 1pm, expected end time 5pm?
    • all CS Teaching and CS General, plus other services

Last meeting Action items

  • Anthony/Adrian - work on new postfix recipe to have servers send mail out directly
  • Anthony/Guoxiang - initiate warranty - RT#1196085 - under warranty until 2025-03-30
  • Devon - collect power data to show Plant Operations
    • yes
    • still doing 5V deviance
    • someone to communicate with PlantOps?
  • Lori - Possibly recover 960GB Optane cards from 422 systems?
    • probably more use in a database server
  • Guoxiang/Lori - create a ticket, or report the existing ticket number, for the Storage option catalogue
  • Omar - create ticket(s) for VScode/git workflow
    • no RT, but Nick is discussing with faculty

This meeting Action Items

  • Anthony/Adrian - work on new postfix recipe to have servers send mail out directly
  • Anthony/Guoxiang - initiate warranty - RT#1196085 - under warranty until 2025-03-30
  • Lawrence to create a ticket for Devon with a big list of Host Groups and contacts - needs to be done manually by Devon
  • RT#1194157 - /opt/csw
    • Guoxiang - tell us the size of the filesystem
    • Nathan - create an appropriate CephFS
    • Clayton - create the NFS share
  • Devon - create ticket to document power issues to show Plant Operations
  • Lori/Nathan - consider whether we can recover 960GB Optane cards from 422 systems
  • Guoxiang/Lori - create a ticket, or report the existing ticket number, for the Storage option catalogue (low priority)
  • Clayton - document process of adding hosts to AD and move to a generally accessible place - create a ticket
  • Lawrence to send out email to everybody in SCS: start time 1pm, expected end time 5pm
Topic revision: r8 - 2021-12-16 - LawrenceFolland
 