Secure web-based file sharing system with distributed backing storage system
SCS DFS system
SCS Nextcloud system
SCS haproxy system
Project Charter
Distributed_File_System_Growth_-_Google_Docs.pdf
Project Objective: Purpose
The purpose is to provide a multi-platform file sharing/syncing system to meet the needs of the SCS as agreed by K. Salem and CSCF Managers.
Project Scope/Deliverables
The high-level outcomes and results of this project are as follows:
- File sharing/syncing system (FSS) with end-user documentation.
- Distributed File System (DFS) to provide backing storage for the file sharing/syncing system and a versatile, large-capacity storage facility for future CSCF/CS/MFCF services.
Scope Includes/Excludes
Overview
Includes
| Service | Requires | From | Delivers | To |
| 40GbE ring network | - | - | three-building HA Ethernet/RDMA | DFS internal replication and client services (e.g. OwnCloud) |
| DFS | networking | 40GbE ring | HA data storage via filesystem (GlusterFS, CephFS) or block (libgfapi, iSCSI, RBD); see the access sketch below the table | client services |
| FSS (OwnCloud) | storage; networking; database; haproxy; container | DFS; 40GbE ring/10GbE IST; MySQL.cs; CSI | file sync and share; TBD | SCS users and groups |
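As a hedged illustration of the DFS row above, the commands below show roughly how a client service could consume the DFS either as a GlusterFS replica-3 volume or as a CephFS mount; the volume name (dfsvol), hostnames (dfs1-3, mon1), and paths are placeholders for illustration, not the production layout.

    # Gluster option: replica-3 volume spanning the three buildings, mounted on a client service host
    gluster volume create dfsvol replica 3 dfs1:/bricks/b1 dfs2:/bricks/b1 dfs3:/bricks/b1
    gluster volume start dfsvol
    mount -t glusterfs dfs1:/dfsvol /mnt/dfs

    # Ceph option: kernel CephFS mount for a client service (credentials hypothetical)
    mount -t ceph mon1:6789:/ /mnt/dfs -o name=fss,secretfile=/etc/ceph/fss.secret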
- high-speed dedicated storage network
- distributed file system (DFS) with ~300TB usable capacity meeting CSCF requirements as below
- OwnCloud(-like) web-based, multi-platform file share/sync system (FSS) that leverages DFS, existing authentication services, CS database and web proxy systems.
- File share includes versioning, with the ability for users to recover previous versions of deleted, modified, corrupted, or ransomware-encrypted files.
- User data encryption to provide at least (see the first sketch following this list):
- No recognizable user data on backing media (DFS hard drives)
- Encryption from client hardware to FSS server via TLS
- rigorous testing of system for performance metrics and failure modes
- global system monitoring: DFS_Monitoring
- necessary documentation, especially for day-to-day maintenance and tasks that will be handled by the CSI group
- system deployment
- PXE booting cluster
- Salt configuration to deploy the DFS and OwnCloud systems (see the second sketch following this list)
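First sketch: a minimal, hedged illustration of the two encryption layers listed above, assuming LUKS-encrypted data drives on the DFS servers and Nextcloud server-side encryption managed with occ; device names and paths are hypothetical, and client-to-server TLS is handled at the web front end (see the haproxy sketch under Assumptions and Risks).

    # At rest: no recognizable user data on DFS media (device name hypothetical)
    cryptsetup luksFormat /dev/sdb
    cryptsetup luksOpen /dev/sdb dfs_sdb

    # In the FSS: enable Nextcloud server-side encryption of stored user files
    sudo -u www-data php occ app:enable encryption
    sudo -u www-data php occ encryption:enable
    sudo -u www-data php occ encryption:status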
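Second sketch: a rough outline of how the Salt-driven deployment could be invoked once PXE-booted nodes have registered with the master; the target globs and state names (dfs.ceph, fss.nextcloud) are assumptions for illustration, not the actual state tree.

    # Accept newly PXE-booted minions, then apply the DFS and FSS states (state names hypothetical)
    salt-key --accept-all
    salt 'dfs-*' state.apply dfs.ceph
    salt 'fss-*' state.apply fss.nextcloud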
Excludes
- HA NFS service with Kerberos authentication. This service may be demonstrated as a PoC.
Constraints
- Project must be CSCF budget-neutral: the cost of building and maintaining the FSS/DFS must not exceed the cost of the services it supersedes.
Assumptions and Risks
- Completion and testing of 40GbE ring network ST#104790.
- An HA FSS system requires the resources of a CS(CF) TCP/HTTP(S) load balancer (haproxy); see the configuration sketch following this list. ST#107102
- All FSS data is vulnerable to a security breach on the FSS server/container.
- 15U of rack space within 2m cable reach of 40GbE switches in each data center (DC3558, MC3105, M33101).
- 6x 208V 10A power feeds for each 15U of rack space.
- Network/direct connection to attached UPS for each DFS server for monitoring and controlled shutdown on mains loss. ST#107485
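The haproxy dependency noted above is illustrated by the hedged fragment below: TLS terminates at the load balancer and requests are balanced across the FSS backends. Addresses, hostnames, and the certificate path are placeholders, not the production haproxy.cfg.

    frontend fss_https
        bind *:443 ssl crt /etc/haproxy/certs/vault.cs.pem
        default_backend fss_servers

    backend fss_servers
        balance roundrobin
        option httpchk GET /status.php
        server fss1 172.19.0.11:80 check
        server fss2 172.19.0.12:80 check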
Project Members
ldpaniak (project manager), a2brenna, gxshen, nfish / cscflab, dmerner
Project Stakeholders
lfolland, omnafees, dlgawley
Project Sponsor
Ken Salem
ST
Implementation Plan
- [July 1 - August 15, 2016 (year 0)] Purchase and receive server hardware as in ST#105605. The specified hardware has been determined to be acceptable for all DFS options (Ceph and Gluster). Initial hardware: 3x 36-drive servers; 144TB usable per server, 216TB usable per DFS. Complete
- [July - August 2016] Delineate options for acceptable FSS encryption and features. Currently considering OwnCloud/Nextcloud and SeaFile. Encryption options here. Complete - moving ahead with Nextcloud
- [August 15 - September 1, 2016] Install and low-level local testing of DFS hardware: CPU, RAM, HDD, network. See: https://cs.uwaterloo.ca/cscf/internal/request_debug/UpdateRequest?106589 Complete
- [September 1 - December 15 (prev. September 1 -15), 2016] Testing of various DFS software options: performance, hardware/network failure/recovery, maintenance, expansion. Complete - moving ahead with Ceph
- [September 15 - October 15, 2016] Testing of various FSS options: encryption, failure/recovery, client options, connectivity options, features. Complete - Using Nextcloud
- [October 15 - Feb 5 (prev. October 15 - November 15), 2016] Build of production system with chosen DFS and FSS; Salt implementation. Complete
- [December 20, 2016 - January 6, 2017] Decide on what we will call the File Sharing Service. Ideas can be documented here: FileSharingServiceName Complete - vault.cs
- [December 23, 2016 - Feb 15, 2017] Documentation for administrators and users: https://cs.uwaterloo.ca/cscf/internal/request_debug/UpdateRequest?107956 Complete
- [Feb 15, 2017 ] Rollout of FSS service to CSCF beta testers and system testing. Data may be lost from system during changes/tests. Complete
- [Feb 20, 2017+] System goes into general production: Rollout of FSS service to select SCS users. Data in system is secure
- [January - May 2017] Evaluation of FSS/DFS performance and bottlenecks.
- [June 2017 (year 1)] Purchase of incremental hardware to resolve bottlenecks (e.g. PCIe flash cache, single-thread perf MDS servers). Recommend Optane NV flash for Ceph journal and ZFS SLOG/L2ARC devices (see the sketch following this plan): https://www.servethehome.com/intel-optane-ssd-dc-p4800x-3d-xpoint-landed/intel-optane-ssd-dc-p4800x-ceph-storage-performance-as-a-journal/
- [June 2018 (year 2)] Purchase of three additional servers (>36 HDD, >180TB each) to increase DFS capacity to >400TB, or additional drive cabinets for existing servers as performance requirements dictate.
- [June 2019 (year 3)] Evaluation of DFS/FSS project and future direction. Choose second-generation DFS and FSS options.
- [June 2020 (year 4)] Purchase of second-generation DFS hardware (3+ servers) and integration with existing FSS or new FSS option.
- [June 2020 - June 2021] Phase-out and retirement from production of initial (year 0) DFS hardware.
- [June 2022] Purchase of new DFS hardware.
- [June 2022 - June 2023] Phase-out and retirement from production of expansion (year 2) DFS hardware.
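For the year-1 flash purchase above, the commands below sketch how an Optane NVMe device might be attached as ZFS SLOG and L2ARC; the pool name (dfs) and partition names are assumptions, and Ceph journal placement would be handled separately through OSD provisioning.

    # Attach NVMe partitions as synchronous write log (SLOG) and second-level read cache (L2ARC)
    zpool add dfs log /dev/nvme0n1p1
    zpool add dfs cache /dev/nvme0n1p2
    zpool status dfs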
Meeting notes
Detailed Maintenance Documentation
--
LoriPaniak - 2017-02-12