Present from 5.8 onward, possibly on older kernels via backport
Fraser: prudent to rebuild all login servers from scratch. O(hours) to rebuild all student login machines
ctucker: reasonable to replace keytab files, invalidate existing keytabs
Rebuild webservers and other container servers
Appetite for risk on rebuilding servers? Question for management - RT#???
* outstanding question - no response from management
* however, technical consensus is that we are not prepared to do that now
* could do on a round-robin basis (one at a time)
* Lori: could "cloud init" be used to start with a standard image?
* Anthony: probably not - would likely be more work to maintain than it would save
* would eventually like to get to a stateless setup
Lori - setuid binaries in user filesystems? Mount a survey (per term?). Set ACL on snapshots? https://rt.uwaterloo.ca/Ticket/Display.html?id=1213685
* in upcoming Ceph version there is a "root squash" feature - add to to-do list
Run vulnerability check on questionable systems (e.g. 5.4 kernels)
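A minimal sketch of how hosts could be triaged by kernel release before running the vulnerability check. It assumes only what is noted above: the issue is present from 5.8 onward, and older kernels (e.g. 5.4) may still be affected via backport. The function names are hypothetical, and a version comparison alone cannot clear an older kernel; it only sorts hosts into "affected" and "needs checking".

```python
import re

# Assumption taken from the notes: present from 5.8 onward.
FIRST_AFFECTED = (5, 8)


def kernel_version(release):
    """Parse (major, minor) from a release string such as '5.4.0-150-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def classify(release):
    """Return 'affected' for 5.8 and later, 'check' for anything older.

    Older kernels are never reported as safe, because a distribution
    backport may have introduced the issue there as well.
    """
    if kernel_version(release) >= FIRST_AFFECTED:
        return "affected"
    return "check"
```

On a live host the release string can be obtained with `platform.release()` (the same value as `uname -r`) and passed to `classify`.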
Fraser to create a ticket for updating the Graphics Lab machines
New proposed agenda items (include name and desired time)
linux.student.cs load average is much higher and more variable than in the past. (Fraser, 10 minutes)
concern is load average is 10x what it used to be
afternoons and evenings much higher
ratio of load average to number of cores
processes in the "runnable" state, but may be waiting for disk
used to be that more processes were waiting for Ceph, but not so much anymore
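To separate CPU demand from disk wait, one option is to count processes by state. On Linux the load average includes both runnable tasks (state R) and tasks in uninterruptible sleep (state D, typically waiting on I/O such as Ceph). The sketch below reads `/proc/<pid>/stat`; the function names are hypothetical, and it counts processes rather than individual threads, so the totals will not match the load average exactly.

```python
import os
from collections import Counter


def parse_state(stat_line):
    """Return the one-letter state from the contents of /proc/<pid>/stat.

    The command name is wrapped in parentheses and may itself contain
    spaces or ')', so the state is the first field after the last ')'.
    """
    return stat_line.rsplit(")", 1)[1].split()[0]


def count_states(proc_root="/proc"):
    """Count processes by state; 'R' is runnable, 'D' is uninterruptible sleep."""
    counts = Counter()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as 'meminfo'
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                counts[parse_state(handle.read())] += 1
        except (FileNotFoundError, ProcessLookupError, PermissionError):
            continue  # the process exited or is not readable
    return counts
```

A high D count relative to R would suggest the load is dominated by storage waits rather than compute.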
right now on the cs-general general-use hosts, a single user is slamming a machine
Peter van Beek; correctly niced +10; Fraser talked to Peter and moved the (CPU cores and RAM intensive) research work to ugster73*
teaching hosts - load from actual compute jobs / VS Code
Omar - how closely does load average correlate with user response?
Anthony: if load stays below number of cores, should not impact response
Fraser - would see it jump up to 200 then back to 60 - very jumpy
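The load-to-cores ratio discussed above can be sketched as follows. This is an illustration of Anthony's rule of thumb (load below the number of cores should not affect response), not an agreed threshold; the function names are hypothetical and the 64-core figure in the usage note is made up for the example.

```python
import os


def load_per_core(load, cores):
    """Return the load average divided by the number of cores."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load / cores


def is_oversubscribed(load, cores):
    """True when the load average exceeds the core count (ratio above 1)."""
    return load_per_core(load, cores) > 1.0


def current_ratios():
    """Return the 1, 5 and 15 minute load averages of this host, per core."""
    cores = os.cpu_count() or 1
    return tuple(load_per_core(value, cores) for value in os.getloadavg())
```

For example, on a hypothetical 64-core host a load of 200 gives a ratio of 3.125 (oversubscribed), while a load of 60 gives a ratio just under 1.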
Anthony - does Icinga have process usage monitoring?
Devon - used to, with smtp daemon
Anthony - will create a ticket for monitoring processor data - will work with Devon
Prometheus is installed on most systems
exporters provide web data that can be scraped
easy to use and set up
query language is easy to use
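As an illustration of the query language, the sketch below builds a request URL for the Prometheus HTTP API (`/api/v1/query`) that asks for the 1-minute load average per core. It assumes the hosts run node_exporter, which exposes `node_load1` and `node_cpu_seconds_total`; the notes do not say which exporters are deployed, and the server URL is a placeholder.

```python
from urllib.parse import urlencode

# Hypothetical server address; replace with the real Prometheus host.
PROMETHEUS_URL = "http://prometheus.example.org:9090"

# Assumes node_exporter metrics. Counting the per-CPU "idle" series
# yields the number of cores for each instance.
LOAD_PER_CORE = (
    "node_load1 / on (instance) "
    'count by (instance) (node_cpu_seconds_total{mode="idle"})'
)


def instant_query_url(base_url, promql):
    """Build the URL for an instant query against the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"
```

Fetching `instant_query_url(PROMETHEUS_URL, LOAD_PER_CORE)` would return JSON with one value per host, which could feed a graph or an alert on ratios above 1.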
any way to get a breakdown by core? top, htop, cat /proc/loadavg (latter is low impact)
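For the low-impact option, `/proc/loadavg` is a single line with five fields: the 1, 5 and 15 minute load averages, the number of currently runnable scheduling entities over the total, and the most recently assigned PID. It does not give a per-core breakdown. A small parsing sketch, with a hypothetical function name and dictionary keys:

```python
def parse_loadavg(text):
    """Parse the contents of /proc/loadavg into a dictionary."""
    fields = text.split()
    runnable, total = fields[3].split("/")
    return {
        "load1": float(fields[0]),
        "load5": float(fields[1]),
        "load15": float(fields[2]),
        "runnable": int(runnable),  # currently runnable processes and threads
        "total": int(total),        # all scheduling entities on the system
        "last_pid": int(fields[4]),
    }


def read_loadavg(path="/proc/loadavg"):
    """Read and parse the live file on a Linux host."""
    with open(path) as handle:
        return parse_loadavg(handle.read())
```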
Fraser: do we have load average graphs? Devon - yes:
Information in this area is meant for use by CSCF staff and is not official documentation, but anybody who is interested is welcome to use it if they find it useful.