Dfs-C

Current Layout

  • 3 servers, one each in the dc, mc, and m3 rooms
  • Hardware: Each server has:
    • Ubuntu 18.04.2 Server installed on mdraid1 SSDs
    • 18 12TB 7200rpm hard drives (18 bays empty)
    • 1 750GB NVMe Optane SSD
    • 1 960GB NVMe Optane SSD
    • Both Optanes in the LVM volume group vg_optane

  • Ceph daemons: Each server has:
    • 1 ceph-mon - all active, in quorum
      • In deployments of our size, every host should run a mon

    • 1 ceph-mgr - 1 active, 2 standby
      • Only one mgr is active at a time
      • No significant resource usage
      • All hosts with a mon should have an mgr, for availability

    • N ceph-mds - 1 active, 1 standby-replay, and 1 standby per filesystem
      • Number of active MDS daemons is configurable
        • multi-mds doesn't seem needed at this point
        • Load is already spread across filesystems, with 1 active MDS per FS

    • 18 ceph-osd bluestore daemons (eg ceph-osd@28) each using:
      • 1 HDD
      • 64GB DB/WAL LV on Optane (eg lv_db_1)

    • 1 ceph-osd daemon using a 64GB LV (lv_osd_1) on the Optane
      • May use more than 1 OSD on the Optane for performance (multiple OSDs would thrash a HDD, but work fine on SSD)
      • Alternatively throw more cores at it (see the config sketch after this list):
        • osd_op_num_shards_ssd (default 8)
        • osd_op_num_threads_per_shard_ssd (default 2)
        • Defaults are set for SATA SSDs, not NVMe/Optane

  • Pools:
    • CRUSH rule 'fast-room' that groups by room and uses only the 'ssd' device class OSDs
      • The Optanes are detected as 'ssd' rather than 'nvme'; one device class will need to be changed manually if we add SATA SSDs (see the device-class sketch after this list).
    • ditto 'slow-room' for 'hdd'
    • CRUSH rule "default" (0) exists but should not be used
    • cephfs metadata pools on 'fast-room', size = 3, min_size = 2
    • cephfs data pools on 'slow-room', size = 3, min_size = 2

  • CephFS:
    • Allow multiple filesystems: ceph fs flag set enable_multiple true
    • On each cephfs: ceph fs set cephfs_cscf-home allow_standby_replay true
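
For reference, a minimal sketch of how the shard/thread options named above could be raised for the Optane OSDs, via the [osd] section of the ceph.conf pushed by Salt (values here are illustrative, not tuned):

[osd]
# the _ssd variants apply to non-rotational OSDs; restart the OSDs to pick up changes
osd_op_num_shards_ssd = 16
osd_op_num_threads_per_shard_ssd = 2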
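
If SATA SSDs are added later, the Optane OSDs could be moved to the 'nvme' device class so the two flash tiers stay distinct. A sketch only - osd.18 stands in for whichever OSD ids live on the Optanes:

# Reassign one Optane OSD from 'ssd' to 'nvme' (repeat per Optane OSD)
ceph osd crush rm-device-class osd.18
ceph osd crush set-device-class nvme osd.18
# 'fast-room' then needs an nvme-based counterpart, and the metadata pools
# moved to it with 'ceph osd pool set <pool> crush_rule <rule>'
ceph osd crush rule create-replicated fast-room-nvme default room nvme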

Setup Steps

Note the hostnames below; some commands must be run on the salt-master, and some on one or all of the DFS machines. Install ceph packages + dependencies, and the cluster ssh key:

root@salt-204:~# salt -N dfs-c state.apply dfs-c --state-verbose=False test=True
root@salt-204:~# salt -N dfs-c state.apply dfs-c --state-verbose=False

Only if creating a new cluster:

root@m3-3101-422:~# cd /etc/ceph
root@m3-3101-422:/etc/ceph# ceph-deploy new dc-3558-422 mc-3015-422 m3-3101-422
root@m3-3101-422:/etc/ceph# ceph-deploy mon create-initial

# Copy admin keys to all hosts 
root@m3-3101-422:/etc/ceph# ceph-deploy admin dc-3558-422 mc-3015-422 m3-3101-422

# Only one mgr is active at a time, but for failover, all failure domains should have one
root@m3-3101-422:/etc/ceph# ceph-deploy mgr create dc-3558-422 mc-3015-422 m3-3101-422

# We need N MDS daemons per host, where N is the number of filesystems
# We use Salt to make them in parallel, and label them -A, -B, etc
root@salt-204:~# salt -N dfs-c cmd.run cwd=/etc/ceph 'ceph-deploy mds create $(hostname -s):$(hostname -s)-A' 

# This also needs to be run on all nodes, to deposit all keyrings in /etc/ceph:
root@salt-204:~# salt -N dfs-c cmd.run cwd=/etc/ceph 'ceph-deploy gatherkeys $(hostname -s)'

# CRUSH configuration
# Define rooms (on any Ceph host)
ceph osd crush add-bucket dc-3558 room
ceph osd crush add-bucket mc-3015 room
ceph osd crush add-bucket m3-3101 room

# Assign rooms to the root "default"
ceph osd crush move dc-3558 root=default
ceph osd crush move mc-3015 root=default
ceph osd crush move m3-3101 root=default

# Move machines into their rooms
ceph osd crush move m3-3101-422 room=m3-3101
ceph osd crush move dc-3558-422 room=dc-3558
ceph osd crush move mc-3015-422 room=mc-3015

# Confirm
ceph osd crush tree

# Define CRUSH rules for HDD and Optanes, using 'room' as the failure domain
ceph osd crush rule create-replicated slow-room default room hdd
ceph osd crush rule create-replicated fast-room default room ssd
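
# Optional: confirm the new rules exist
ceph osd crush rule ls
ceph osd crush rule dump fast-room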

If instead adding a new node to an existing cluster:

# Where dc-3558-422 is an existing node, and m3-3101-422 is the new one
root@m3-3101-422:~# cd /etc/ceph
root@m3-3101-422:/etc/ceph# ceph-deploy gatherkeys dc-3558-422
root@m3-3101-422:/etc/ceph# ceph-deploy mon create m3-3101-422
root@m3-3101-422:/etc/ceph# ceph-deploy mgr create m3-3101-422
root@m3-3101-422:/etc/ceph# ceph-deploy mds create m3-3101-422:m3-3101-422-A

Now that control software is in place, we add OSDs on each host to actually store data:

root@m3-3101-422:~# cd /etc/ceph
# Create 18 64GB DB/WAL LVs on the Optane (eg lv_db_1) for the OSD DBs
# wal.py will make 2GB partitions for WAL only
root@m3-3101-422:/etc/ceph# ~/bin/db.py $(hostname -s)
# Use ceph-deploy to create all 18 HDD OSDs
root@m3-3101-422:/etc/ceph# ~/bin/osd.py $(hostname -s)
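
For reference, osd.py just drives ceph-deploy; per HDD it runs something like the following (the device and DB LV names here are illustrative, assuming ceph-deploy 2.x syntax):

root@m3-3101-422:/etc/ceph# ceph-deploy osd create --data /dev/sdb --block-db vg_optane/lv_db_1 $(hostname -s)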

# Create NVMe OSDs for metadata - no separate DB/WAL device needed
root@salt-204:~# salt -N dfs-c cmd.run cwd=/etc/ceph 'lvcreate -L 64G --name lv_osd_1 vg_optane'
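
# The lvcreate above only carves out the LV; the OSD still has to be created on it.
# A sketch, assuming ceph-deploy 2.x and the lv_osd_1 name from the layout above:
root@salt-204:~# salt -N dfs-c cmd.run cwd=/etc/ceph 'ceph-deploy osd create --data vg_optane/lv_osd_1 $(hostname -s)'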

Confirm status:

root@m3-3101-422:/etc/ceph# ceph status
root@m3-3101-422:/etc/ceph# ceph health detail

To create a new cephfs (replace fsname with the new filesystem's name throughout):

# metadata on NVMe OSDs
root@m3-3101-422:/etc/ceph# ceph osd pool create cephfs_fsname_metadata 32 replicated fast-room
# data on HDDs
root@m3-3101-422:/etc/ceph# ceph osd pool create cephfs_fsname_data 64 replicated slow-room

# This is the default in the /etc/ceph/ceph.conf pushed by Salt, but for reference:
# Set both pools to size = 3 (keep 3 copies) and min_size = 2 (stop all I/O if fewer than 2 copies remain available)
root@m3-3101-422:~# ceph osd pool set cephfs_fsname_metadata size 3
root@m3-3101-422:~# ceph osd pool set cephfs_fsname_metadata min_size 2
root@m3-3101-422:~# ceph osd pool set cephfs_fsname_data size 3
root@m3-3101-422:~# ceph osd pool set cephfs_fsname_data min_size 2

# Filesystem:
root@m3-3101-422:~# ceph fs new cephfs_fsname cephfs_fsname_metadata cephfs_fsname_data
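
As a concrete (hypothetical) example, for the cephfs_bs_db filesystem referenced in the fstab entry below, and assuming its pools follow the fsname naming convention above, the sequence would look like:

root@m3-3101-422:/etc/ceph# ceph osd pool create cephfs_bs_db_metadata 32 replicated fast-room
root@m3-3101-422:/etc/ceph# ceph osd pool create cephfs_bs_db_data 64 replicated slow-room
root@m3-3101-422:/etc/ceph# ceph fs new cephfs_bs_db cephfs_bs_db_metadata cephfs_bs_db_data
root@m3-3101-422:/etc/ceph# ceph fs set cephfs_bs_db allow_standby_replay true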

Confirm status again:

root@m3-3101-422:/etc/ceph# ceph status
root@m3-3101-422:/etc/ceph# ceph health detail

Mounting

In /etc/fstab:

10.1.152.122,10.1.154.122,10.1.155.122:/        /mnt/ceph       ceph    defaults,noauto,name=admin,secretfile=/etc/ceph/ceph.admin.secret,mds_namespace=cephfs_bs_db 0 0

  • name = the Cephx username for auth
  • secretfile = Cephx secret
  • mds_namespace = the filesystem name
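
The secret file referenced above has to exist before mounting. A minimal sketch, run on the client doing the mounting (assumes the client.admin key is used, as in the fstab line):

# Extract the admin key into the secret file, readable by root only
ceph auth get-key client.admin > /etc/ceph/ceph.admin.secret
chmod 600 /etc/ceph/ceph.admin.secret
# noauto means it will not mount at boot; mount it by hand (or via automation):
mount /mnt/ceph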


-- NathanFish - 2019-04-16
