DFSc Working Group



Meeting Date

  • TEAMS: 2022-01-11

Invitees - Attendees

  • Anthony, Dave, Guoxiang, Lori, Nathan, Nick, Lawrence, Omar

Review and accept previous meeting minutes.

Proposed Agenda Items

Old business

New business

Make cs-teaching acceptable for use

Current performance and availability issues need to be resolved

What are the root causes?

  • load?
  • not a lot of "load" yet
  • type of work? VScode, Seashell
  • is it even CS136 work or something else?
  • CS246, CS241, and others may be using VScode, but it is not mandated by anyone
  • VScode might be relevant - it is one big filesystem
  • CS136 is now running on its own MDS (mds10)
  • ssh into cs241 was slow
  • issues on u0,u1,u2
  • Graph of spike on Sunday
  • any way to track down a particular user causing problem?
  • some users doing rsync, or deleting folders - nothing unusual, really
  • options to turn on to aid with tracking?
  • how to correlate activity of a client with the activity on the filesystem?
  • iotop? run on all servers and keep an eye on it?
  • problem is with metadata, not necessarily the filesystem itself
  • a lot of moving parts and no single cause
  • bimodal distribution of performance, leads to belief that there is an architectural issue with Ceph
  • Anthony has been doing a number of statistical analyses
  • Anthony suggests setting up a general use host not on the round-robin for data collection and analysis
    • ubuntu2004-012?
  • what happened on Sunday?
    • Lori unpinned u6 and u8 and someone did "something" that caused those to halt and drove up CPU load
    • Lori restarted those and pinned them, that solved the problem
  • mds.9 associated with u0/1/2 - restarted several times
  • are all servers and clients updated?
    • all general-use hosts have been updated and rebooted
    • not sure about NFS/Samba hosts
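
Since the discussion above points at metadata rather than bulk I/O, one possible way to collect data on a dedicated host is a small probe that times metadata-only operations. The following is a minimal sketch, not something the group agreed on: the function names, the probe file prefix, and the iteration count are all assumptions made for illustration.

```python
import os
import statistics
import tempfile
import time


def probe_metadata_latency(directory, iterations=200):
    """Time create, stat, and unlink of empty files under `directory`.

    Returns a dict mapping operation name to a list of latencies in seconds.
    The ".mdprobe-" file prefix is a hypothetical naming choice.
    """
    timings = {"create": [], "stat": [], "unlink": []}
    for i in range(iterations):
        path = os.path.join(directory, f".mdprobe-{os.getpid()}-{i}")

        start = time.perf_counter()
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
        os.close(fd)
        timings["create"].append(time.perf_counter() - start)

        start = time.perf_counter()
        os.stat(path)
        timings["stat"].append(time.perf_counter() - start)

        start = time.perf_counter()
        os.unlink(path)
        timings["unlink"].append(time.perf_counter() - start)
    return timings


def summarize(samples):
    """Return (median, max) of a list of latencies, in milliseconds."""
    return (statistics.median(samples) * 1000.0, max(samples) * 1000.0)


if __name__ == "__main__":
    # Assumption: a scratch directory in the default temp location is used
    # here only so the sketch is safe to run anywhere. To measure a shared
    # filesystem, pass dir=<a path on that filesystem> instead.
    with tempfile.TemporaryDirectory() as scratch:
        for op, samples in probe_metadata_latency(scratch).items():
            median_ms, worst_ms = summarize(samples)
            print(f"{op:7s} median={median_ms:.3f} ms  max={worst_ms:.3f} ms")
```

Run periodically with a timestamp, output like this could be lined up against the time of incidents such as the Sunday spike. It only shows what one client experiences and does not identify which user or MDS is responsible.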

Options to consider:

  • general alternatives:
    1. fixing/tuning Ceph
    2. creating a separate set of home directories
    • should probably do both in parallel
    • transition - only need to worry about people who are actively using the system right now (CS136, etc.)
      • unique number of students in CS136, CS246, CS350: 1,300 students
      • would be a stop-gap to (hopefully) relieve the current issues
      • currently home directory space is sitting at 37TB
      • 2/3 of the file space is used by a small number of people
      • quotas would help make other solutions viable
    • could be a solution using Ceph, not necessarily CephFS, eg: Rados

    • VScode issues
      • not necessarily connecting to the same server in the round-robin
        • don't use VScode on linux.student.cs and instead use ubuntu2004-002.student.cs

  • Improvements to current configuration
    • nowsync option: improved performance for link/unlink file operations, e.g. rsync, rm
      • remount current mounts with "nowsync" option
      • Lori did some tinkering, maybe 10% improvement, but no "magic bullet"
      • might be good to try on a test machine, e.g. ubuntu2004-012, then add back into round-robin
    • additional kernel upgrades 5.13

  • Addition/allocation of local storage devices on student login servers
    • For CS136 and other target users only

  • ingress load balancing
    • same incoming IP address goes to same server

  • Move target users back to NetApp
    • network connectivity old/new?
    • previous NetApp, old home directory volume 13TB

  • Investigate RBD/ZFS/NFS replacement of cephfs layer but preserve ceph storage cluster
    • If useful, could reprovision ceph MDS servers for ZFS/NFS gateways
    • have Lori and Nathan spin up a 30-40 TB Rados block device and format it with a filesystem that can be exported
      • monitor performance issues

  • Remove HDD from ceph cluster for provisioning of standalone NFS server

  • New hardware for standalone NFS server
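
The ingress load balancing option above ("same incoming IP address goes to same server") can be illustrated with a hash of the client address. This is a minimal sketch of the idea only: the host names in the pool are hypothetical placeholders, and the choice of SHA-256 with a modulo is an assumption rather than the behaviour of any particular load balancer.

```python
import hashlib
import ipaddress

# Hypothetical pool of login hosts; these names are placeholders.
BACKENDS = ["login-a", "login-b", "login-c"]


def pick_backend(client_ip, backends=BACKENDS):
    """Map a client IP address to one backend, always the same one.

    The address is parsed first so that equivalent spellings of the same
    IPv4/IPv6 address hash identically.
    """
    if not backends:
        raise ValueError("backend list is empty")
    packed = ipaddress.ip_address(client_ip).packed
    digest = hashlib.sha256(packed).digest()
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]


if __name__ == "__main__":
    # Documentation-range addresses, used purely as examples.
    for ip in ("192.0.2.10", "192.0.2.11", "2001:db8::1"):
        print(ip, "->", pick_backend(ip))
```

With this scheme a client that reconnects from the same address lands on the same host, which is the property wanted for the VScode sessions discussed earlier. One known limitation of plain modulo hashing is that adding or removing a host changes the mapping for many clients.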

How to transition

  • suggestion to move some (not all) home directories and compare performance
  • in particular, propose /u8 (where Anthony's home directory is)
  • plan is to move some, not all, home directories
  • challenge is that we will have a single point-of-failure in the form of the NFS server
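
To compare performance between home directories that have been moved and those that have not, latency samples from each side can be reduced to a few robust statistics. The sketch below is an assumption about how such a comparison might be done. The function names are hypothetical and the sample numbers are made up for illustration, not measurements from the meeting.

```python
import statistics


def summarize_latencies(samples_ms):
    """Return count, median, 95th percentile, and max of latency samples."""
    ordered = sorted(samples_ms)
    if len(ordered) < 2:
        raise ValueError("need at least two samples")
    # quantiles(n=20) returns 19 cut points; the last is the 95th percentile.
    cuts = statistics.quantiles(ordered, n=20)
    return {
        "count": len(ordered),
        "median": statistics.median(ordered),
        "p95": cuts[18],
        "max": ordered[-1],
    }


def compare(before_ms, after_ms):
    """Return after-minus-before differences; negative means faster after."""
    before = summarize_latencies(before_ms)
    after = summarize_latencies(after_ms)
    return {key: after[key] - before[key] for key in ("median", "p95", "max")}


if __name__ == "__main__":
    # Made-up illustrative samples in milliseconds.
    on_cephfs = [1.0, 2.0, 3.0, 4.0, 100.0]
    on_nfs = [1.0, 2.0, 3.0, 4.0, 5.0]
    for key, delta in compare(on_cephfs, on_nfs).items():
        print(f"{key}: {delta:+.2f} ms")
```

The median and the tail are reported separately because a bimodal distribution, as described earlier in these minutes, can leave the median unchanged while the slow mode dominates what users notice.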

Course-specific action items

CS136

CS246

  • Compile-heavy, use of svn? Benefit from local scratch storage?

CS350

  • Docker-centric. What is the requirement for a shared filesystem?

Action plans

Immediate

  • communicate with ISG/students to use specific linux.student.cs hosts rather than the round-robin name - Nick
  • create a Rados block device - Lori and Nathan
    • increase to 40TB
  • create NFS bridge - Anthony and Guoxiang
  • talk to MFCF about continued use of NetApp - Lawrence/Dave
  • consider Ceph tweaks to current environment - Lori/Nathan
  • review Ceph data - Lori/Anthony
  • kernel updates?
  • communicate with faculty - Omar/Lori - forward message to this group

Near future

  • move one or more home directories to Rados NFS device
  • move one or more home directories to NetApp
Topic revision: r4 - 2022-01-11 - LawrenceFolland
 