Linux Working Group



Meeting Date

  • TEAMS: 2021-03-23

Invitees - Attendees

  • Dave, Anthony, Adrian, Clayton, Devon, Fraser, Guoxiang, Lori, Nathan

Review and accept previous meeting minutes.

  • Reviewed and approved.

Agenda Items

Monitoring notifications

  • When you get email notices about system failures, check the "from" address. If it is not icinga.cscf.uwaterloo.ca, the notice is not from the CSCF central monitoring service, and you will need to talk to the "Admin Contact" or service expert of the host the email came from.
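The check described above could be automated along these lines. This is only a minimal sketch: it assumes "from address" means the host part of the sender's address, and the function name `is_central_monitoring` is hypothetical, not part of any CSCF tooling.

```python
from email.utils import parseaddr

# Host named in the minutes as the CSCF central monitoring service.
CENTRAL_MONITOR_HOST = "icinga.cscf.uwaterloo.ca"


def is_central_monitoring(from_header: str) -> bool:
    """Return True if the From header's address is at the central Icinga host.

    Assumption: the notice counts as "central" when the part after "@"
    matches CENTRAL_MONITOR_HOST (case-insensitive).
    """
    _, address = parseaddr(from_header)
    if "@" not in address:
        return False
    domain = address.rpartition("@")[2]
    return domain.lower() == CENTRAL_MONITOR_HOST


if __name__ == "__main__":
    for header in ("Icinga <icinga@icinga.cscf.uwaterloo.ca>",
                   "root@somehost.example.org"):
        source = "central monitoring" if is_central_monitoring(header) else "ask the host's Admin Contact"
        print(f"{header}: {source}")
```

Anything that returns False here would go to the "Admin Contact" or service expert of the sending host, as noted above.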

Private Cloud Nodes disks (partitions) layout

Drive Assignments

  • / (root distribution only drive with empty /var) - minimum 1TB
  • Space for virtual hosts gets its own drive
    • /srv/virtual_services is the disk mount - minimum ~2TB, current standard is greater than 3.5TB
    • /srv/virtual_services/{lxc,libvirt,...} get bind mounted to /var/lib/{lxc,libvirt,...}
  • Should /var/log be its own volume?
    • mixed opinions
  • ZFS - not for base OS setups; use where necessary (where very large volumes, checksumming, incremental snapshots, ... are needed).
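As an illustration of the bind-mount layout above, the following sketch generates /etc/fstab-style bind entries for the two services named in the minutes (lxc and libvirt). The helper name `bind_mount_entries` is hypothetical, and whether the nodes use fstab or another mechanism for these mounts is an assumption.

```python
from typing import Iterable, List


def bind_mount_entries(services: Iterable[str],
                       source_root: str = "/srv/virtual_services",
                       target_root: str = "/var/lib") -> List[str]:
    """Build fstab lines that bind-mount each service directory.

    Each line follows the usual fstab field order:
    <source> <mount point> <type> <options> <dump> <pass>
    """
    return [
        f"{source_root}/{name} {target_root}/{name} none bind 0 0"
        for name in services
    ]


if __name__ == "__main__":
    # Only the services explicitly named in the minutes are listed here.
    for line in bind_mount_entries(["lxc", "libvirt"]):
        print(line)
```

Keeping the data under /srv/virtual_services and bind mounting it into /var/lib leaves the root drive holding only the distribution, in line with the drive assignments above.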

Documenting the amount of minimum "free" space required

These hosts are in a critical state and disk space needs to be dealt with ASAP

  • MC-3015-201.cloud.cs, mysql servers host nodes, MC-3015-211.cloud.cs

Monitoring of disk space

  • It is up to the hardware/service "Admin Contact" to say what is to be monitored and how, then engage Monitoring (Devon) and SaltStack (Anthony, Nathan) via RT to see that it gets implemented.
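A free-space check of the kind an Admin Contact might request could look like the sketch below. The 10% threshold and the function names are placeholders chosen for illustration; the minutes do not state the minimum "free" space required, and this is not the Icinga or SaltStack implementation.

```python
import shutil

# Hypothetical threshold: the required minimum is still to be documented.
MIN_FREE_FRACTION = 0.10


def below_minimum(free_bytes: int, total_bytes: int,
                  min_free_fraction: float = MIN_FREE_FRACTION) -> bool:
    """Return True when the free fraction is under the required minimum."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return free_bytes / total_bytes < min_free_fraction


def check_mount(path: str,
                min_free_fraction: float = MIN_FREE_FRACTION) -> bool:
    """Check the filesystem containing `path` against the threshold."""
    usage = shutil.disk_usage(path)
    return below_minimum(usage.free, usage.total, min_free_fraction)


if __name__ == "__main__":
    for mount in ("/", "/srv/virtual_services"):
        try:
            state = "LOW" if check_mount(mount) else "ok"
        except OSError as exc:
            # The mount may not exist on the machine running this sketch.
            state = f"unavailable ({exc.strerror})"
        print(f"{mount}: {state}")
```

Separating the pure comparison (`below_minimum`) from the filesystem query makes the threshold easy to adjust per host once the Admin Contact has specified it.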

Friday, Mar 19 MC 3015 equipment failures.

  • Tried replacing the batteries in the rack beside the DataSci system. On power-on, the UPS "inverter?" smoked. This also fried the sPDU and one of six equipment power supplies. Details to be documented later this week.

Saturday, Mar 20 power outage post-mortem:

  • Preliminary Shutdown schedule:
    • Friday morning: reduce DNS round-robin of linux.student.cs
  • Saturday @6:00:
    • Shutdown OpenEDx - Ask Todd to inform clients
    • Ugsters (Computer Security and Privacy course) - Fraser to inform client and take care of shutdown
    • Filemaker database server (Grad office and Barb D's courses) - Devon to inform Barb and Grad office.
    • mx.student.cs (i.e. course email services) - this possible 10-hour outage (i.e. > 8 hours) could result in timeout or retry messages for incoming mail.
    • MC linux.student.cs, mc-3015-2*.cloud.cs nodes - Guoxiang to shut down.
    • MC linux.student.cs, mc-3015-4*.cloud.cs nodes (DFSc and SQL) - Guoxiang to shut down.
    • MC VMware Services - Clayton to shut down.
    • Test being able to configure Icinga to deal with this room outage?
  • Saturday @18:00
    • Dave to be on site to power on systems.
    • Mac server - hit power on button (check sticky button)
  • Saturday @20:00
    • Add MC servers back to DNS round-robin of linux.student.cs

Network redundancy no longer exists

  • What can/should CSCF do about this?
    • There is an RT ticket, [[https://rt.uwaterloo.ca/Ticket/Display.html?id=1144540][RT #1144540]], about physics server room redundancy being temporarily disabled two years ago.
    • Is CSCF monitoring of IST-provided services required, i.e. establishing quality-of-service indicators for services IST is providing CS?
Topic revision: r3 - 2021-04-19 - FraserGunn
 