DFSc Working Group 
 Meeting Date 
  
 Invitees - Attendees 
 
-  Anthony, Dave, Guoxiang, Lori, Nathan, Nick, Lawrence, Omar
 Review and accept previous meeting minutes. 
  
 Proposed Agenda Items 
 Old business 
 New business 
 Make cs-teaching acceptable for use 
Current performance and availability issues need to be resolved 
 What are the root causes? 
 
-  load?
-  not a lot of "load" yet
-  type of work?  VScode, Seashell
-  is it even CS136 work or something else?
-  CS246, CS241, and others may be using VScode, but it is not mandated by anyone
-  VScode might be relevant - it is one big filesystem
-  CS136 is now running on its own MDS (mds10)
-  ssh into cs241 was slow
-  issues on u0,u1,u2
-  Graph of spike on Sunday  
-  any way to track down a particular user causing problem?
-  some users doing rsync, or deleting folders - nothing unusual, really
-  options to turn on to aid with tracking?
-  how to correlate activity of a client with the activity on the filesystem?
-  iotop? run on all servers and keep an eye on it?
-  problem is with metadata, not necessarily the filesystem itself
-  a lot of moving parts and no single cause
-  bimodal distribution of performance leads to the belief that there is an architectural issue with Ceph
-  Anthony has been doing a number of statistical analyses
-  Anthony suggests setting up a general-use host, not on the round-robin, for data collection and analysis
   * ubuntu2004-012?
-  what happened on Sunday? 
-  Lori unpinned u6 and u8 and someone did "something" that caused those to halt and drove up CPU load
-  Lori restarted those and pinned them, that solved the problem
 
-  mds.9 associated with u0/1/2 - restarted several times
-  are all servers and clients updated? 
-  all general-use hosts have been updated and rebooted
-  not sure about NFS/Samba hosts
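
One way to collect the kind of data discussed above (metadata latency, and whether its distribution really is bimodal) is a small probe run from a dedicated host. The following is a minimal sketch, not the analysis Anthony has been doing: the function names, the probe file naming, and the choice of create + stat + unlink as the timed operation are all assumptions for illustration. It would be pointed at a directory on the filesystem under test; the demo at the bottom only uses a local temporary directory.

```python
import os
import statistics
import tempfile
import time


def probe_metadata_latency(directory, samples=200):
    """Time create + stat + unlink of an empty file; one latency (seconds) per sample."""
    latencies = []
    for i in range(samples):
        # Hypothetical probe file name; unique per process and iteration.
        path = os.path.join(directory, f".mdprobe-{os.getpid()}-{i}")
        start = time.perf_counter()
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
        os.close(fd)
        os.stat(path)
        os.unlink(path)
        latencies.append(time.perf_counter() - start)
    return latencies


def summarize(latencies):
    """Return basic order statistics for a non-empty list of latencies."""
    ordered = sorted(latencies)
    p95_index = min(len(ordered) - 1, int(0.95 * len(ordered)))
    return {
        "count": len(ordered),
        "min": ordered[0],
        "median": statistics.median(ordered),
        "p95": ordered[p95_index],
        "max": ordered[-1],
    }


if __name__ == "__main__":
    # Demo against a local scratch directory, not the real home directories.
    with tempfile.TemporaryDirectory() as scratch:
        print(summarize(probe_metadata_latency(scratch, samples=50)))
```

A large gap between the median and the 95th percentile, or two clusters in a histogram of the raw samples, would be consistent with the bimodal behaviour noted above. Logging the samples with timestamps would also make it possible to line them up against events such as the Sunday spike.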
 
 Options to consider: 
 
-  general alternatives: 
-  fixing/tuning Ceph
-  creating a separate set of home directories
 
-  should probably do both in parallel
-  transition - only need to worry about people who are actively using the system right now (CS136, etc.)
   * unique number of students in CS136, CS246, CS350: 1,300
   * would be a stop-gap to (hopefully) relieve the current issues
   * currently home directory space is sitting at 37TB
      * 2/3 of the file space is used by a small number of people
   * quotas would help make other solutions viable
-  could be a solution using Ceph, not necessarily CephFS, e.g. Rados
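
To support the quota discussion above, it may help to identify which accounts make up the bulk of the home directory space. The sketch below assumes a per-user usage table is already available (for example, exported from an existing usage report); the function name and the numbers in the demo are invented for illustration and are not measurements from this system.

```python
def users_covering(usage_bytes, fraction=2 / 3):
    """Return the smallest set of heaviest users whose combined usage
    reaches `fraction` of the total, heaviest first."""
    total = sum(usage_bytes.values())
    if total == 0:
        return []
    target = fraction * total
    covered = 0
    heavy = []
    # Walk users from largest to smallest until the target share is covered.
    for user, used in sorted(usage_bytes.items(), key=lambda kv: kv[1], reverse=True):
        if covered >= target:
            break
        heavy.append(user)
        covered += used
    return heavy


if __name__ == "__main__":
    # Made-up example data (arbitrary units), not real usage figures.
    example = {"user_a": 60, "user_b": 20, "user_c": 10, "user_d": 10}
    print(users_covering(example))
```

If the resulting list is short, a quota policy or a separate arrangement for those few accounts would shrink the data that has to move in any transition.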
 
 
-  VScode issues 
-  not necessarily connecting to the same server in the round-robin 
-  don't use VScode on linux.student.cs and instead use ubuntu2004-002.student.cs
 
 
 
-  Improvements to current configuration 
-  nowsync option: improved performance for link/unlink file operations, e.g. rsync, rm
   * remount current mounts with "nowsync" option
   * Lori did some tinkering, maybe 10% improvement, but no "magic bullet"
   * might be good to try on a test machine, e.g. ubuntu2004-012, then add back into round-robin
-  additional kernel upgrades (5.13)
 
-  Addition/allocation of local storage devices on student login servers 
-  For CS136 and other target users only
 
-  ingress load balancing
   * same incoming IP address goes to same server
-  Move target users back to NetApp 
-  network connectivity old/new?
-  previous NetApp, old home directory volume: 13TB
 
-  Investigate RBD/ZFS/NFS replacement of CephFS layer but preserve Ceph storage cluster 
-  If useful, could reprovision Ceph MDS servers as ZFS/NFS gateways
-  have Lori and Nathan spin up a 30-40 TB Rados block device and format it as a filesystem that can be exported 
-  monitor performance issues
 
 
-  Remove HDD from Ceph cluster for provisioning of standalone NFS server
-  New hardware for standalone NFS server
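
The ingress load balancing option above ("same incoming IP address goes to same server") amounts to source-IP affinity. A minimal sketch of the idea, assuming a simple hash-modulo scheme: the function name and the backend host names are hypothetical, and a production load balancer would normally provide this behaviour itself rather than through custom code.

```python
import hashlib
import ipaddress


def pick_backend(client_ip, backends):
    """Map a client IP address to one backend deterministically."""
    if not backends:
        raise ValueError("backends must not be empty")
    # Normalise the address so equivalent spellings hash identically.
    packed = ipaddress.ip_address(client_ip).packed
    digest = hashlib.sha256(packed).digest()
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]


if __name__ == "__main__":
    # Hypothetical backend names, for illustration only.
    servers = ["login-a", "login-b", "login-c"]
    print(pick_backend("192.0.2.10", servers))
    print(pick_backend("192.0.2.10", servers))  # same client, same server
```

With this scheme a client keeps landing on the same login server, which would address the VScode reconnection concern. The trade-off is that adding or removing a server changes the mapping for many clients; consistent hashing is the usual way to limit that.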
 How to transition 
 
-  suggestion to move some (not all) home directories and compare performance
-  in particular, propose /u8 (where Anthony's home directory is)
-  plan is to move some, not all, home directories
-  challenge is that we will have a single point of failure in the form of the NFS server
 Course-specific action items 
 CS136 
  
 CS246 
 
-  Compile-heavy, use of svn?  Benefit from local scratch storage?
 CS350 
 
-  Docker-centric.  What is the requirement for a shared filesystem?
 Action plans 
 Immediate 
 
-  communicate with ISG/students to use specific linux.student.cs hosts rather than the round-robin name - Nick
-  create a Rados block device - Lori and Nathan
   * increase to 40TB
-  create NFS bridge - Anthony and Guoxiang
-  talk to MFCF about continued use of NetApp - Lawrence/Dave
-  consider Ceph tweaks to current environment - Lori/Nathan
-  review Ceph data - Lori/Anthony
-  kernel updates?
-  communicate with faculty - Omar/Lori - forward message to this group
 Near future 
 
-  move one or more home directories to Rados NFS device
-  move one or more home directories to NetApp