Dfs-C

This documentation is a draft that has not yet been tested with a full rebuild.

Current Layout

  • 3 servers, in dc, mc, and m3
  • Hardware: Each server has:
    • Ubuntu 18.04.2 Server installed on mdraid1 SSDs
    • 18 12TB 7200rpm hard drives (18 bays empty)
    • 1 750GB NVMe Optane SSD

  • Ceph daemons: Each server has:
    • 1 ceph-mon - all active, quorum
      • In deployments of our size, every host should run a mon

    • 1 ceph-mgr - 1 active, 2 standby
      • Only one mgr is active at a time
      • No significant resource usage
      • All hosts with a mon should have an mgr, for availability

    • 1 ceph-mds - 1 active, 2 standby
      • The number of active MDS daemons is configurable, but if more MDSs fail than there are standbys to replace them, the filesystem will hang for a while
      • Testing with 2 active is in progress
      • ceph fs set <fs_name> max_mds 2

    • 18 ceph-osd daemons (e.g. ceph-osd@28), each using:
      • 1 HDD
      • 40GB DB partition on Optane

    • 1 ceph-osd daemon using a 4GB partition (#128) on the Optane
      • May use more than 1 for performance
      • Alternatively throw more cores at it:
        • osd_op_num_shards_ssd (default 8)
        • osd_op_num_threads_per_shard_ssd (default 2)
        • Defaults are sized for SATA SSDs, not NVMe/Optane; see the ceph.conf sketch after this list

  • Pools and cephfs:
    • CRUSH rule 'fast' that uses only the 'ssd' device class OSDs
    • A matching 'slow' rule for the 'hdd' device class
    • cephfs_metadata pool on 'fast', size = 3, min_size = 2
    • cephfs_data pool on 'slow', size = 3, min_size = 2
    • No settings changed on the cephfs itself currently.
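
The shard/thread options above are set in ceph.conf on the OSD hosts. A minimal sketch of what raising them could look like - the values here are illustrative assumptions, not something we have tested:

[osd]
# defaults are 8 shards x 2 threads for the 'ssd' device class, sized for SATA SSDs
osd_op_num_shards_ssd = 16
osd_op_num_threads_per_shard_ssd = 2

The affected ceph-osd daemons need a restart to pick up a change here.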

Setup Steps

Note the hostnames in the prompts below: some commands must be run on the salt master, and some on one or all of the DFS machines. First, install the ceph packages, their dependencies, and the cluster ssh key:
root@salt-204:~# salt -N dfs-c state.apply dfs-c --state-verbose=False test=True
root@salt-204:~# salt -N dfs-c state.apply dfs-c --state-verbose=False

Only if creating a new cluster:

root@m3-3101-422:~# cd /etc/ceph
root@m3-3101-422:/etc/ceph# ceph-deploy new dc-3558-422 mc-3015-422 m3-3101-422
root@m3-3101-422:/etc/ceph# ceph-deploy mon create-initial
root@m3-3101-422:/etc/ceph# ceph-deploy admin dc-3558-422 mc-3015-422 m3-3101-422
# Only one mgr is active at a time, but for failover, all failure domains should have one
root@m3-3101-422:/etc/ceph# ceph-deploy mgr create dc-3558-422 mc-3015-422 m3-3101-422
# Same is true for mds
root@m3-3101-422:/etc/ceph# ceph-deploy mds create dc-3558-422 mc-3015-422 m3-3101-422 

# Only this command needs to be run on all nodes:
root@m3-3101-422:~# cd /etc/ceph && ceph-deploy gatherkeys dc-3558-422 mc-3015-422 m3-3101-422

# Now configuration - TODO, Salt into ceph.conf?
# Create a new CRUSH rule called 'slow' using all devices of class 'hdd' and each 'host' is a failure domain
root@m3-3101-422:/etc/ceph# ceph osd crush rule create-replicated slow default host hdd
# 'fast' rule for SSD OSDs
root@m3-3101-422:/etc/ceph# ceph osd crush rule create-replicated fast default host ssd
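
As a quick optional check, the new rules can be listed and inspected (standard ceph commands, nothing site-specific):

root@m3-3101-422:/etc/ceph# ceph osd crush rule ls
root@m3-3101-422:/etc/ceph# ceph osd crush rule dump slow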

If instead adding a new node to an existing cluster:

# Where dc-3558-422 is an existing node, and m3-3101-422 is the new one
root@m3-3101-422:~# cd /etc/ceph
root@m3-3101-422:/etc/ceph# ceph-deploy gatherkeys dc-3558-422
root@m3-3101-422:/etc/ceph# ceph-deploy mon create m3-3101-422
root@m3-3101-422:/etc/ceph# ceph-deploy mgr create m3-3101-422
root@m3-3101-422:/etc/ceph# ceph-deploy mds create m3-3101-422
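
To confirm the new mon has actually joined the quorum (optional check, standard ceph commands):

root@m3-3101-422:/etc/ceph# ceph mon stat
root@m3-3101-422:/etc/ceph# ceph quorum_status --format json-pretty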

Now that the control daemons are in place, we add OSDs on each host to actually store data:

root@m3-3101-422:~# cd /etc/ceph
# Create 18 40GB partitions on /dev/nvme0n1 for OSD DBs
# wal.py will make 2GB partitions for WAL only
root@m3-3101-422:/etc/ceph# ~/bin/db.py $(hostname -s)
# Use ceph-deploy to create all 18 HDD OSDs
root@m3-3101-422:/etc/ceph# ~/bin/osd.py $(hostname -s)
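# For reference only: osd.py presumably issues one ceph-deploy call per drive, roughly
# like the following (the /dev/sdc + /dev/nvme0n1p3 pairing is an illustrative assumption):
#   ceph-deploy osd create --data /dev/sdc --block-db /dev/nvme0n1p3 $(hostname -s)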

# Create NVMe OSDs for metadata - everything on one partition.
# Manually create partition - I used ID 128 to separate it from the 1-18 DB partitions
root@m3-3101-422:/etc/ceph# fdisk /dev/nvme0n1
root@m3-3101-422:/etc/ceph# ceph-deploy --ceph-conf /etc/ceph/ceph.conf osd create --data /dev/nvme0n1p128 $(hostname -s)

Confirm status:

root@m3-3101-422:/etc/ceph# ceph status
root@m3-3101-422:/etc/ceph# ceph health detail
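
A couple of further optional checks (standard ceph commands) to confirm all 19 OSDs per host came up with the expected device classes and sizes:

root@m3-3101-422:/etc/ceph# ceph osd tree
root@m3-3101-422:/etc/ceph# ceph osd df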

To create a new cephfs:

# metadata on NVMe OSDs
root@m3-3101-422:/etc/ceph# ceph osd pool create cephfs_metadata 128 replicated fast
# data on HDDs
root@m3-3101-422:/etc/ceph# ceph osd pool create cephfs_data 1024 replicated slow

# Set both to size = 3 (keep 3 copies) and min_size = 2 (stop IO if fewer than 2 copies remain)
root@m3-3101-422:~# ceph osd pool set cephfs_metadata size 3
root@m3-3101-422:~# ceph osd pool set cephfs_metadata min_size 2
root@m3-3101-422:~# ceph osd pool set cephfs_data size 3
root@m3-3101-422:~# ceph osd pool set cephfs_data min_size 2

# Filesystem:
root@m3-3101-422:~# ceph fs new cephfs_fsname cephfs_metadata cephfs_data
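
Optionally, sanity-check the filesystem and its pools (standard ceph commands). If testing 2 active MDS daemons as noted in the layout section, this is also the point to raise max_mds:

root@m3-3101-422:~# ceph fs ls
root@m3-3101-422:~# ceph fs status
root@m3-3101-422:~# ceph osd pool ls detail
# only if running 2 active MDS daemons
root@m3-3101-422:~# ceph fs set cephfs_fsname max_mds 2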

Confirm status again:

root@m3-3101-422:/etc/ceph# ceph status
root@m3-3101-422:/etc/ceph# ceph health detail

Mounting

In /etc/fstab:

10.1.152.122,10.1.154.122,10.1.155.122:/        /mnt/ceph       ceph    defaults,noauto,name=admin,secretfile=/etc/ceph/ceph.admin.secret,mds_namespace=cephfs_bs_db 0 0

  • name = the Cephx username for auth
  • secretfile = path to a file containing the Cephx secret key
  • mds_namespace = the filesystem name
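
For a one-off mount, or to create the secret file referenced above, the equivalent manual commands look roughly like this; "client" here stands for whichever machine does the mounting, and the IPs and fs name are copied from the fstab line above:

# write the admin key into the secret file
root@client:~# ceph auth get-key client.admin > /etc/ceph/ceph.admin.secret
root@client:~# chmod 600 /etc/ceph/ceph.admin.secret
# manual mount with the same options as the fstab entry
root@client:~# mount -t ceph 10.1.152.122,10.1.154.122,10.1.155.122:/ /mnt/ceph -o name=admin,secretfile=/etc/ceph/ceph.admin.secret,mds_namespace=cephfs_bs_db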

Reference documentation

-- NathanFish - 2019-04-16
