Each option below is assessed against the same criteria: performance, products, features, support for configuration, usable capacity, reliability (worst case), rebuild/resilver, project status, and monitoring/maintenance.

Option: ceph + dm-crypt + ZFS RAIDZ2 OSD, flash journal, 2-replication
  Performance: OSD count per chassis is fully tunable to the available CPU (a better match than OSD-per-HDD); reduced peak IOPS, with 27 total OSDs vs 108 in the 3-replication option.
    1MB seq read (32 files): 1.7GB/s
    4KB random read (32 files) IOPS:
    1MB seq write (32 files): 630MB/s (flash write rate limited)
    4KB random write (32 files) IOPS:
  Products: CephFS (kernel), RBD, object
  Features: full data scrub; ZFS data protection/scrub; compression
  Support for configuration: need dm-crypt under ZFS
  Usable capacity: 220TB (see the capacity-arithmetic sketch after this block)
  Reliability (worst case): <0.01% chance of data loss in 5yrs
  Rebuild/resilver: single drive failure fully managed at the ZFS layer
  Project: CephFS maturing, features converging
  Monitor/maintenance: no interaction between dm-crypt and Ceph OSD management in nominal operation or an isolated drive failure. Recommend moving journals to NVMe (Intel P3700/Optane) flash for performance and flash-endurance reasons within the first year of operation.
  Verdict: reduced performance, but reliable data storage and reads.
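For orientation on how the usable-capacity figures relate to the redundancy layouts, a minimal arithmetic sketch in Python follows. The drive count, drive size, and vdev/RAID-set widths are hypothetical placeholders rather than the hardware of this evaluation, so it illustrates the shape of the calculation, not the 220TB/198TB/222TB figures themselves.

```python
# Rough usable-capacity arithmetic for the redundancy layouts compared here.
# The drive count, drive size, vdev width and RAID-set width below are
# HYPOTHETICAL examples, not the actual hardware of this evaluation.

def raidz2_plus_replication(total_drives, drive_tb, vdev_width, replicas):
    """RAIDZ2 vdevs (2 parity drives per vdev) with distributed replication on top."""
    vdevs = total_drives // vdev_width
    data_tb = vdevs * (vdev_width - 2) * drive_tb   # RAIDZ2 costs 2 drives per vdev
    return data_tb / replicas                       # replication divides what is left

def plain_replication(total_drives, drive_tb, replicas):
    """One OSD per HDD, protection purely from replication."""
    return total_drives * drive_tb / replicas

def raid5_plus_replication(total_drives, drive_tb, set_width, replicas):
    """md RAID5 sets (e.g. 3+1) under the OSDs, plus replication on top."""
    sets = total_drives // set_width
    data_tb = sets * (set_width - 1) * drive_tb     # RAID5 costs 1 drive per set
    return data_tb / replicas

if __name__ == "__main__":
    drives, tb = 108, 8   # hypothetical: 108 drives of 8TB each
    print("RAIDZ2 + 2-replication   :", raidz2_plus_replication(drives, tb, 6, 2), "TB")
    print("3-replication            :", plain_replication(drives, tb, 3), "TB")
    print("RAID5 3+1 + 2-replication:", raid5_plus_replication(drives, tb, 4, 2), "TB")
```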
Option: ceph + dm-crypt + XFS, 3-replication
  Performance:
    CephFS 1MB seq read (48 files): 3.6-1.8GB/s
    CephFS 1MB seq write (48 files): 1GB/s
    CephFS 4KB random write (single client): 1850 IOPS
    CephFS 4KB random read (single client): 3500 IOPS
  Products: CephFS (kernel), RBD, object
  Features: full data scrub
  Support for configuration: fully supported at creation
  Usable capacity: 198TB
  Reliability (worst case): 2-drive failure before OSD migration = offline; simultaneous single-drive failure on each DFS server (before migration) = data loss; 3-drive failure = data loss; 1-drive failure combined with loss of a building = offline. Estimated probability of data loss in 5yrs: 0.1-1% (see the loss-probability sketch after this block).
  Rebuild/resilver: rebuild runs over the internal network; roughly one drive's worth of data crosses the internal network per drive loss for OSD migration, plus additional traffic during rebuild.
  Project: CephFS maturing, features converging
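The 5-year loss-probability figures quoted in these rows come from the evaluation itself; the sketch below only illustrates the general shape of such an estimate (a drive failure followed by enough further failures inside the recovery window). The annual failure rate, drive count, and recovery window are assumed placeholder values, not measurements.

```python
# Back-of-the-envelope model for "probability of data loss in 5 years" under
# n-way replication: worst case, data is lost if, after one drive fails, enough
# additional drives fail before recovery completes. The AFR, drive count, and
# recovery window below are ASSUMED placeholders, not measured values.

from math import comb, exp

def p_loss_5yr(drives, afr, recovery_hours, extra_failures):
    """extra_failures = additional concurrent failures needed after the first
    (2 for 3-replication, since 3 overlapping failures can hit all replicas)."""
    years = 5.0
    fail_rate_per_hour = afr / (365 * 24)
    expected_first_failures = drives * afr * years
    # Probability that a given other drive also fails within the recovery window:
    p_overlap = 1 - exp(-fail_rate_per_hour * recovery_hours)
    # Leading-order probability that at least `extra_failures` of the remaining
    # drives do so during one recovery window:
    p_per_incident = comb(drives - 1, extra_failures) * p_overlap ** extra_failures
    return min(1.0, expected_first_failures * p_per_incident)

if __name__ == "__main__":
    # hypothetical: 108 drives, 3% annual failure rate, 24h to re-replicate
    print(f"3-replication, worst case: ~{p_loss_5yr(108, 0.03, 24, 2):.2%}")
```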
Option: gluster + ZFS RAIDZ2 + dm-crypt, 2-replication + arbiter
  Performance:
    GlusterFS 1MB seq read (12 files): 2.2GB/s
    GlusterFS 1MB seq write (6 files): 1GB/s
  Products: GlusterFS (FUSE), libgfapi (block)/iSCSI, Ganesha NFS
  Features: full data scrub; compression; snapshots
  Support for configuration: no native ZFS encryption; needs a custom solution underneath ZFS, with associated setup races
  Usable capacity: 216TB
  Reliability (worst case): 3-drive failure on a single DFS server before resilver = loss of a building; a minimum of 4 drive failures across the cluster before data loss.
  Rebuild/resilver: ZFS/PCIe traffic only for a pool resilver; a 3-drive failure (pool loss) requires migration of the entire pool over the service network (see the rebuild-traffic sketch after this block).
  Project: currently refactoring; changes for small files and substantial changes coming with DHT2 / Gluster 4.
  Monitor/maintenance: ZED + smartd; single command for drive replacement.
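To put the two rebuild paths in perspective (one drive of data over the internal network for a single OSD loss, versus migrating an entire pool over the service network after a pool loss), here is a minimal transfer-time sketch; the drive size, pool size, and usable bandwidth are assumed placeholders, not values from this evaluation.

```python
# Rough rebuild-traffic comparison implied by the rebuild/resilver notes above:
# Ceph re-replicates roughly one drive's worth of data per lost drive, while
# losing a whole ZFS pool under Gluster means migrating the entire pool over
# the service network. Sizes and bandwidths below are ASSUMED placeholders.

def rebuild_hours(data_tb, usable_gbit_per_s):
    bytes_total = data_tb * 1e12
    bytes_per_s = usable_gbit_per_s * 1e9 / 8
    return bytes_total / bytes_per_s / 3600

if __name__ == "__main__":
    # hypothetical: 8TB drive, 40TB pool, 10Gbit/s of usable network headroom
    print(f"single-drive re-replication: {rebuild_hours(8, 10):.1f} h")
    print(f"whole-pool migration:        {rebuild_hours(40, 10):.1f} h")
```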
|
Option: ceph + dm-crypt + md RAID5 (3+1) OSD, flash journal, 2-replication
  Performance: better match of OSD count per chassis to CPU than OSD-per-HDD; reduced peak IOPS, with 27 total OSDs vs 108 in the 3-replication option above.
    1MB seq read (32 files): 2.7GB/s
    4KB random read (32 files) IOPS: 6100
    1MB seq write (32 files): 630MB/s (flash write rate limited)
    4KB random write (32 files) IOPS: 2000
  Products: CephFS (kernel), RBD, object
  Features: full data scrub
  Support for configuration: need an md layer under the Ceph-managed dm-crypt/XFS
  Usable capacity: 222TB
  Reliability (worst case): <0.01% chance of data loss in 5yrs
  Rebuild/resilver: single drive failure fully managed at the md layer
  Project: CephFS maturing, features converging
  Monitor/maintenance: no interaction between dm-crypt and Ceph OSD management in nominal operation or an isolated drive failure. Recommend moving journals to NVMe (Intel P3700-class) flash for performance and flash-endurance reasons within the first year of operation.
  Verdict: not an acceptable option. On-disk corruption is transmitted to Ceph users; Ceph scrubs detect corruption but cannot reliably repair it (illustrated below).
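The verdict above follows from the fact that single-parity RAID cannot tell which block of an inconsistent stripe is wrong, while a checksumming layer (ZFS in the options above) can both detect and pinpoint the corrupted block, and Ceph scrubs can at least detect mismatches by comparing replicas. The toy sketch below illustrates the difference; it is not the actual md, ZFS, or Ceph code.

```python
# Why parity alone is not enough: with single-parity RAID5, a silently corrupted
# data block makes the stripe inconsistent, but parity cannot say WHICH block is
# wrong, so "repair" may regenerate garbage. A per-block checksum (as in ZFS)
# pinpoints the bad block, which parity can then rebuild correctly.
# Toy illustration only, not the actual md/ZFS/Ceph code.

import hashlib
from functools import reduce

def xor_parity(blocks):
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

def checksum(block):
    return hashlib.sha256(block).digest()

stripe = [b"AAAA", b"BBBB", b"CCCC"]        # data blocks
parity = xor_parity(stripe)                  # RAID5-style parity
sums   = [checksum(b) for b in stripe]       # ZFS-style per-block checksums

stripe[1] = b"BBBX"                          # silent on-disk corruption

# Parity-only view: we know the stripe is inconsistent, but not which block.
print("parity mismatch detected:", xor_parity(stripe) != parity)

# Checksum view: identify the bad block, then rebuild it from the others + parity.
bad = next(i for i, b in enumerate(stripe) if checksum(b) != sums[i])
others = [b for i, b in enumerate(stripe) if i != bad]
stripe[bad] = xor_parity(others + [parity])
print("repaired block", bad, "->", stripe[bad])
```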