DFSc Working Group
Meeting Date
Invitees - Attendees
- Anthony, Dave, Guoxiang, Lori, Nathan, Nick, Lawrence, Omar
Review and accept previous meeting minutes.
Proposed Agenda Items
Old business
New business
Make cs-teaching acceptable for use
Current performance and availability issues need to be resolved
What are the root causes?
- load?
- not a lot of "load" yet
- type of work? VScode, Seashell
- is it even CS136 work or something else?
- CS246, CS241, and others may be using VScode, but it is not mandated by anyone
- VScode might be relevant - it is one big filesystem
- CS136 is now running on its own MDS (mds10)
- ssh into cs241 was slow
- issues on u0,u1,u2
- Graph of spike on Sunday
- any way to track down a particular user causing problem?
- some users doing rsync, or deleting folders - nothing unusual, really
- options to turn on to aid with tracking?
- how to correlate activity of a client with the activity on the filesystem?
- iotop? run on all servers and keep an eye on it?
- problem is with metadata, not necessarily the filesystem itself
- a lot of moving parts and no single cause
- bimodal distribution of performance, leads to belief that there is an architectural issue with Ceph
- Anthony has been doing a number of statistical analyses
- Anthony suggests setting up a general-use host outside the round-robin for data collection and analysis
  * ubuntu2004-012?
- what happened on Sunday?
- Lori unpinned u6 and u8 and someone did "something" that caused those to halt and drove up CPU load
- Lori restarted those and pinned them, that solved the problem
- mds.9 associated with u0/1/2 - restarted several times
- are all servers and clients updated?
- all general-use hosts have been updated and rebooted
- not sure about NFS/Samba hosts
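Since the notes above point at metadata operations rather than bulk I/O, and at a bimodal distribution of performance, one way to collect comparable numbers on a dedicated host is to time small create/unlink pairs and look at the spread. The following is only a minimal sketch under stated assumptions: the function names, sample count, and probe file names are hypothetical, and it is not the analysis Anthony has been running.

```python
import os
import statistics
import tempfile
import time


def time_metadata_ops(directory, count=200):
    """Time create+unlink pairs in `directory`; return latencies in seconds."""
    latencies = []
    for i in range(count):
        # Hypothetical probe file name; unique per process and iteration.
        path = os.path.join(directory, f"probe-{os.getpid()}-{i}")
        start = time.perf_counter()
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
        os.close(fd)
        os.unlink(path)
        latencies.append(time.perf_counter() - start)
    return latencies


def summarize(latencies):
    """Return min, median, 95th percentile and max of the samples."""
    ordered = sorted(latencies)
    p95_index = min(len(ordered) - 1, int(0.95 * len(ordered)))
    return {
        "min": ordered[0],
        "median": statistics.median(ordered),
        "p95": ordered[p95_index],
        "max": ordered[-1],
    }


if __name__ == "__main__":
    # Assumption: replace the temporary directory with a directory on the
    # filesystem under test (for example, somewhere under a home directory).
    with tempfile.TemporaryDirectory() as target:
        print(summarize(time_metadata_ops(target)))
```

A large gap between the median and the 95th percentile would be consistent with the bimodal behaviour described above. Running the same probe at intervals from the same host would make it easier to line the samples up with MDS restarts or load spikes.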
Options to consider:
- general alternatives:
- fixing/tuning Ceph
- creating a separate set of home directories
- should probably do both in parallel
- transition - only need to worry about people who are actively using the system right now (CS136, etc.)
  * unique number of students in CS136, CS246, CS350 - 1,300 students
  * would be a stop-gap to (hopefully) relieve the current issues
  * home directory space currently sits at 37TB
  * 2/3 of the file space is used by a small number of people
  * quotas would help make other solutions viable
- could be a solution using Ceph, not necessarily CephFS, eg: Rados
- VScode issues
- not necessarily connecting to the same server in the round-robin
- don't use VScode on linux.student.cs and instead use ubuntu2004-002.student.cs
- Improvements to current configuration
- nowsync option: improved performance for link/unlink file operations, e.g. rsync, rm
  * remount current mounts with "nowsync" option
  * Lori did some tinkering, maybe 10% improvement, but no "magic bullet"
  * might be good to try on a test machine, e.g. ubuntu2004-012, then add it back into the round-robin
- additional kernel upgrades 5.13
- Addition/allocation of local storage devices on student login servers
- For CS136 and other target users only
- ingress load balancing
  * same incoming IP address goes to same server
- Move target users back to NetApp
- network connectivity old/new?
- previous NetApp, old home directory volume 13TB
- Investigate RBD/ZFS/NFS replacement of the CephFS layer while preserving the Ceph storage cluster
- If useful, could reprovision Ceph MDS servers as ZFS/NFS gateways
- have Lori and Nathan spin up a 30-40 TB Rados device and format it into a filesystem that can be exported
- monitor performance issues
- Remove HDDs from the Ceph cluster to provision a standalone NFS server
- New hardware for standalone NFS server
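The ingress load-balancing item above (same incoming IP address goes to same server) amounts to source-IP hashing. A minimal sketch of the idea follows. The server pool is hypothetical (only ubuntu2004-002 is named in these notes), and the choice of hash is an assumption, not a description of any particular load balancer.

```python
import hashlib

# Hypothetical pool; the real general-use host list is not given in these notes.
SERVERS = ["ubuntu2004-002", "ubuntu2004-004", "ubuntu2004-006"]


def pick_server(client_ip, servers=SERVERS):
    """Map a client IP to one server deterministically (source-IP hashing)."""
    digest = hashlib.sha256(client_ip.encode("ascii")).digest()
    # Use the first 8 bytes of the digest as an integer and reduce it
    # modulo the pool size to select a server.
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]
```

With this scheme a given client address always lands on the same host as long as the pool is unchanged, which addresses the VScode issue of reconnecting to a different server. Adding or removing a host remaps most clients; a consistent-hashing scheme would reduce that, at the cost of more complexity.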
How to transition
- suggestion to move some (not all) home directories and compare performance
- in particular, propose /u8 (where Anthony's home directory is)
- plan is to move some, not all, home directories
- challenge is that we will have a single point-of-failure in the form of the NFS server
Course-specific action items
CS136
CS246
- Compile-heavy, use of svn? Benefit from local scratch storage?
CS350
- docker-centric. What is requirement for shared filesystem?
Action plans
Immediate
- communicate with ISG/students to use specific linux.student.cs hosts rather than the round-robin name - Nick
- create a Rados block device - Lori and Nathan
  * increase to 40TB
- create NFS bridge - Anthony and Guoxiang
- talk to MFCF about continued use of NetApp - Lawrence/Dave
- consider Ceph tweaks to current environment - Lori/Nathan
- review Ceph data - Lori/Anthony
- kernel updates?
- communicate with faculty - Omar/Lori - forward message to this group
Near future
- move one or more home directories to Rados NFS device
- move one or more home directories to NetApp