Linux Working Group

Meeting Date

TEAMS: 2022-12-14

Invited

Anthony (group leader), Clayton, Guoxiang, Lori, Fraser, Devon, Nathan, Nick, Todd, Dave, O

Attendees

Anthony, Nathan, Guoxiang, Devon, Fraser, Nick, O, Dave

Review and accept previous meeting minutes.

CsLWGMeeting20221130

Review last meeting's Action Items

suexec-flex https://rt.uwaterloo.ca/Ticket/Display.html?id=950947
- Nathan to discuss with Lori and Issac who should rebuild the package and redistribute the deb patch.
  - Nathan is gathering info from the package authors and will follow up.
- Need documentation for this project and some example test cases to motivate new developers.
- Is there a .deb package at this point that is ready for distribution?
  - nfish is working on it
Separation of TEACHING and GENERAL VLANs: this is a CS legacy separation. Do we believe IST's firewall config is sufficient to maintain it? The alternative is to continue without assuming any separation at layer 2. Is this OK?

- O needs to gather more info and will start with managers. iang and pabuhr are faculty who may be good starting points for a faculty sample.
- Update: after chatting with managers, O suggests continuing without assuming any separation at layer 2.
IST Network Group Collaboration (Anthony)
- ACLs that IST put in place in the past have become legacy artifacts that add overhead to current work: each one has to be negotiated with IST before it can be removed or modified.
- Similarly, rules in the campus firewall also need review.
SALT bug has been reported to Canonical and is being tracked (Anthony)
- A local patch will be available on depot.cs soon; the hope is that it will eventually be consolidated with the main branch.
From this week's CSCF management meeting (O):

   -- The Teaching Environment: Architecture and Performance
            => Performance of the teaching environment is multi-faceted and there are issues:
                -- the file system on student.cs hosts gets hammered by many different approaches
                -- separation of "resources" and relevant assignments to users may improve
                this. How can we achieve this?
                -- more analysis is needed on the client side of what leads to
                ceph processes lingering and causing high load.
                (e.g. https://rt.uwaterloo.ca/Ticket/Display.html?id=1165024)
                (ICINGA: https://icinga.cscf.uwaterloo.ca/grafana/d/03EnhXZGz/dfsc-monitoring?orgId=1&from=1661961364321&to=1670341277030&viewPanel=19)
                (ICINGA: https://icinga.cscf.uwaterloo.ca/grafana/d/03EnhXZGz/dfsc-monitoring?orgId=1&from=1624797845613&to=1670341340759&viewPanel=19)
                => O will push this out to Linux and TOP staff for filtering and report
            => We need to think strategically about "The Teaching Environment" and see what
            evolution may be needed *everywhere* to alleviate these problems. (e.g. Course Accounts
            vs Cloud based back ends and front-ends like OpenEdX)
            => Crashes that need analysis: https://rt.uwaterloo.ca/Ticket/Display.html?id=1214857
            => Load appears to follow users
               https://icinga.cscf.uwaterloo.ca/grafana/d/03EnhXZGz/dfsc-monitoring?orgId=1&from=1661961364000&to=1670907600000&viewPanel=19
            => Look into adding a monitoring group off of the Unix/Linux group

- linux.student.cs machines already log significant information mapping execution to ceph processes.
- By the numbers, most requests that go to the MDSs from student.cs are file lock/unlock operations. These are quick.
- The server side appears to take a long time to respond to certain requests like "mkdir/rmdir".
- Are the cluster drives a bottleneck? Can they be changed to solid state?
- Are data structures in the ceph servers "over-prioritizing" certain requests over others? Are real-time queues growing long?
- Are cluster databases operating as needed?
- More time for analysis is required to discern the causes accurately. It appears to be a very complex basket of problems.
- Ideal equipment: NVMe drives for metadata (already present) along with SSDs for data. Costing and a proposal are needed.
- There are architectural decisions that need to be made with respect to quotas if we move to SSDs.
- O notes that the needs of teaching clients prioritize speed over storage.
- Implementing quotas has the advantage of generally reducing the amount of metadata.
- From experience, quota administration is expensive in some ways, though it can be simplified if managed properly. Relevant tickets:
  - https://rt.uwaterloo.ca/Ticket/Display.html?id=1131076
  - https://rt.uwaterloo.ca/Ticket/Display.html?id=1112506
Monitoring is an important topic: should it be discussed here or in a separate "CS Monitoring Working Group"?
- It appears that monitoring is important enough to be discussed here, and if a special meeting is required, one can be convened.
- Devon is currently point-person for monitoring configurations.
- There appears to be a lack of good "monitoring" etiquette among users. Maybe hold a special kick-off meeting in January? Devon would lead it.

Topic revision: r4 - 2022-12-14 - AnthonyBrennan

Information in this area is meant for use by CSCF staff and is not official documentation, but anybody who is interested is welcome to use it if they find it useful.
