Filesystem decision matrix: Ceph vs Gluster

ceph + dm-crypt + ZFS RAIDZ2 OSD, flash journal (2-replication)

Configuration: Completely tunable OSD count per chassis; better match to CPU than OSD-per-HDD. Reduced peak IOPS: 27 OSDs total vs 108 in the 3-replication option.
Performance:
  1MB seq read (32 files): 1.7GB/s
  4KB random read (32 files) IOPS:
  1MB seq write (32 files): 630MB/s (flash write rate limited)
  4KB random write (32 files) IOPS:
Products: CephFS (kernel client), RBD, object
Features: full data scrub, ZFS data protection/scrub, compression
Support for configuration: Need dm-crypt under ZFS.
Usable capacity: 220TB
Reliability (worst case): <0.01% chance of data loss in 5 years; a single drive failure is fully managed at the ZFS layer.
Project: Ceph FS maturing, features converging.
Monitor/maintenance: No interaction between dm-crypt and Ceph OSD management in nominal operation or on an isolated drive failure.
Recommendation: Move journals to NVMe (Intel P3700/Optane) flash for performance and flash-endurance reasons in the first year of operation.
Summary: Reduced performance but reliable data storage and reads.
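The "<0.01% chance of data loss in 5yrs" figure can be sanity-checked with a back-of-envelope model: a RAIDZ2 vdev survives two concurrent drive failures, so vdev loss needs a third failure before the first resilver completes (and the 2-replication layer above ZFS protects even against whole-vdev loss). The AFR, resilver window, and vdev width below are assumptions, not measured values:

```python
import math

# Hedged back-of-envelope: probability that one RAIDZ2 vdev loses data
# in 5 years. All three parameters are assumptions for illustration.
AFR = 0.02            # assumed 2% annualized drive failure rate
RESILVER_H = 48.0     # assumed resilver window, hours
VDEV_DRIVES = 8       # hypothetical vdev width
YEARS = 5

lam = AFR / 8760.0                       # per-drive failure rate, per hour
p_win = 1 - math.exp(-lam * RESILVER_H)  # P(a given drive fails in the window)

# Expected number of "first failures" in the vdev over 5 years:
first_failures = VDEV_DRIVES * AFR * YEARS
# Given a first failure, P(at least 2 of the remaining drives also fail
# before resilver completes), via the binomial complement:
n = VDEV_DRIVES - 1
p_2more = 1 - (1 - p_win) ** n - n * p_win * (1 - p_win) ** (n - 1)

p_loss = first_failures * p_2more
print(f"~{p_loss:.2e} chance of vdev data loss in {YEARS} years")
```

With these assumptions the result lands several orders of magnitude below the quoted 0.01%, consistent with the matrix.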
Each option is evaluated on: performance, products, features, support for configuration, usable capacity, reliability (worst case), rebuild/resilver, project, and monitor/maintenance.
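Usable capacity in this matrix is driven by two multiplicative factors: the local parity efficiency of each OSD/brick and the Ceph/Gluster replica count. A hedged sketch with hypothetical vdev/array widths:

```python
# Illustrative storage-efficiency comparison for the layouts in this matrix.
# The RAIDZ2 vdev width is hypothetical; only the ratios matter.

def usable_fraction(local_efficiency, replicas):
    """Fraction of raw capacity usable: local parity efficiency / replica count."""
    return local_efficiency / replicas

layouts = {
    "3-replication, OSD-per-HDD": usable_fraction(1.0, 3),
    "RAIDZ2 6+2 vdev (hypothetical width), 2-replication": usable_fraction(6 / 8, 2),
    "md RAID5 3+1, 2-replication": usable_fraction(3 / 4, 2),
}
for name, frac in layouts.items():
    print(f"{name}: {frac:.1%} of raw capacity")
```

The parity-plus-2-replication layouts come out slightly ahead of plain 3-replication, matching the 220-222TB vs 198TB figures below.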
ceph + dm-crypt + XFS (3-replication)

Performance:
  CephFS 1MB seq read (48 files): 3.6-1.8GB/s
  CephFS 1MB seq write (48 files): 1GB/s
  CephFS 4KB random write (single client): 1850 IOPS
  CephFS 4KB random read (single client): 3500 IOPS
Products: CephFS (kernel client), RBD, object
Features: full data scrub
Support for configuration: Fully supported at creation.
Usable capacity: 198TB
Reliability (worst case):
  2-drive failure before OSD migration completes = offline
  Simultaneous (before migration) single-drive failure on each DFS server = data loss
  3-drive failure = data loss
  1-drive failure on top of a building loss = offline
  Estimated probability of data loss in 5 years: 0.1-1%
Rebuild/resilver: Internal network used for rebuild; one drive's worth of data crosses the internal network per drive loss for OSD migration, plus additional traffic on rebuild.
Project: Ceph FS maturing, features converging.
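The "one drive of data over the internal network" figure can be turned into a rough migration-time estimate. Drive size, link speed, and the usable fraction of line rate during rebuild are assumptions, not measured values:

```python
# Rough estimate of OSD-migration time over the internal network after a
# drive loss. All three parameters are assumptions for illustration.
DRIVE_TB = 8.0     # assumed drive capacity, TB
NET_GBPS = 10.0    # assumed internal network link, Gbit/s
NET_EFF = 0.7      # assumed usable fraction of line rate during rebuild

bytes_moved = DRIVE_TB * 1e12
seconds = bytes_moved * 8 / (NET_GBPS * 1e9 * NET_EFF)
print(f"~{seconds / 3600:.1f} h to migrate one drive's data")
```

Even under these optimistic assumptions the cluster spends hours in a degraded state per drive loss, which is why the window before migration completes dominates the worst-case reliability above.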
gluster + ZFS RAIDZ2 + dm-crypt (2-replication + arbiter)

Performance:
  GlusterFS 1MB seq read (12 files): 2.2GB/s
  GlusterFS 1MB seq write (6 files): 1GB/s
Products: GlusterFS (FUSE), libgfapi (block)/iSCSI, Ganesha-NFS
Features: full data scrub, compression, snapshots
Support for configuration: No native ZFS encryption; needs a custom solution under ZFS, with attendant races.
Usable capacity: 216TB
Reliability (worst case):
  3-drive failure on a single DFS server before resilver completes = loss of that building
  Minimum 4-drive failure across the cluster to reach data loss
Rebuild/resilver: ZFS/PCIe traffic only for a pool resilver; a 3-drive pool loss requires migration of the entire pool over the service network.
Project: Currently refactoring; substantial changes coming for small files in DHT2/Gluster 4.
Monitor/maintenance: ZED + smartd; single command for drive replacement.
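The 2-replication + arbiter layout tolerates a brick outage without split-brain because the arbiter supplies a third vote while storing only metadata, not a third copy of the data. A minimal sketch of the quorum rule (a simplification for illustration, not Gluster's actual implementation):

```python
# Illustrative quorum check for a replica-2 + arbiter subvolume: two data
# bricks plus one metadata-only arbiter brick. Writes are allowed only when
# a majority (2 of 3) of bricks is reachable, so a single partition cannot
# produce two divergent "live" copies. Hypothetical helper, not Gluster API.

def writes_allowed(brick_up):
    """brick_up maps brick name -> reachable; a majority of 3 must be up."""
    return sum(brick_up.values()) >= 2

# One data brick down: the arbiter breaks the tie and writes continue.
cluster = {"data-1": True, "data-2": False, "arbiter": True}
print(writes_allowed(cluster))  # True
```

The design trade is capacity for safety: the arbiter costs almost no disk, which is how this option keeps 216TB usable while avoiding replica-2 split-brain.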
ceph + dm-crypt + md 3+1 RAID5 OSD, flash journal (2-replication)

Configuration: Better match of OSD count per chassis to CPU than OSD-per-HDD. Reduced peak IOPS: 27 OSDs total vs 108 in the 3-replication option above.
Performance:
  1MB seq read (32 files): 2.7GB/s
  4KB random read (32 files) IOPS: 6100
  1MB seq write (32 files): 630MB/s (flash write rate limited)
  4KB random write (32 files) IOPS: 2000
Products: CephFS (kernel client), RBD, object
Features: full data scrub
Support for configuration: Need an md layer under Ceph-managed dm-crypt/XFS.
Usable capacity: 222TB
Reliability (worst case): <0.01% chance of data loss in 5 years; a single drive failure is fully managed at the md layer.
Project: Ceph FS maturing, features converging.
Monitor/maintenance: No interaction between dm-crypt and Ceph OSD management in nominal operation or on an isolated drive failure.
Recommendation: Move journals to NVMe (Intel P3700-class) flash for performance and flash-endurance reasons in the first year of operation.
Verdict: Not an acceptable option. On-disk corruption is transmitted to Ceph users; Ceph scrubs detect the corruption but cannot reliably repair it.
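The verdict follows from replication without block checksums: when two replicas disagree, a scrub has no way to decide which copy is authoritative, whereas a layer that stores per-block checksums (as ZFS does in the RAIDZ2 options) can identify and repair the bad copy. A toy illustration of that asymmetry (not Ceph code):

```python
import hashlib

# Why a replica mismatch is repairable only with checksums: with two
# disagreeing copies and no checksum, neither can be trusted; with a
# stored checksum, the good copy is identified. Hypothetical helper.

def repair(copies, stored_checksum):
    """Return the verified copy, or None if corruption cannot be resolved."""
    if stored_checksum is None:
        # 2-replication, no checksum: a mismatch is undecidable.
        return copies[0] if copies[0] == copies[1] else None
    for c in copies:
        if hashlib.sha256(c).digest() == stored_checksum:
            return c
    return None

good, bad = b"payload", b"p4yload"
print(repair([good, bad], None))                           # None: undecidable
print(repair([good, bad], hashlib.sha256(good).digest()))  # b'payload'
```

This is the structural reason the ZFS-backed options are rated reliable while md/XFS beneath Ceph is rejected: the checksum lives below the replication layer, where single-copy repair decisions are made.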
-- LoriPaniak - 2016-11-01
Topic revision: r9 - 2017-06-05 - LoriPaniak
 