Dfs-C

Current Layout

  • 3 servers, one each in the dc, mc, and m3 machine rooms
  • Hardware: Each server has:
    • Ubuntu 18.04.2 Server installed on mdraid1 SSDs
    • 18 × 12TB 7200rpm hard drives (18 bays empty)
    • 1 750GB NVMe Optane SSD
    • 1 960GB NVMe Optane SSD
    • Both Optanes in LVM vg_optane

  • Ceph daemons: Each server has:
    • 1 ceph-mon - all active, quorum
      • In deployments of our size, all hosts should have a mon

    • 1 ceph-mgr - 1 active, 2 standby
      • Only 1 mgr is active at a time
      • No significant resource usage
      • All hosts with a mon should have an mgr, for availability

    • N ceph-mds - 1 active, 1 standby-replay, and 1 standby per filesystem
      • Number of active MDS is configurable
        • multi-mds doesn't seem needed at this point
        • Load is already spread over 1 MDS per FS

    • 18 ceph-osd bluestore daemons (e.g. ceph-osd@28), each using:
      • 1 HDD
      • 32GiB DB/WAL LV on Optane (e.g. lv_db_1)

    • 1 ceph-osd daemon using a 32GB LV (lv_osd_1) on the Optane
      • May run more than 1 such OSD for performance (this would thrash an HDD, but works on SSD)
      • Alternatively throw more cores at it:
        • osd_op_num_shards_ssd (default 8)
        • osd_op_num_threads_per_shard_ssd (default 2)
        • Defaults are tuned for SATA SSDs, not NVMe/Optane (see the example after this list)

  • Pools:
    • CRUSH rule 'fast-room' that groups by room and uses only the 'ssd' device class OSDs
      • The Optanes are detected as 'ssd' rather than 'nvme'; one device class will need to be changed manually if we add SATA SSDs (see the example after this list).
    • ditto 'slow-room' for 'hdd'
    • CRUSH rule 'default' (0) exists but should not be used
    • cephfs metadata pools on 'fast-room', size = 3, min_size = 2
    • cephfs data pools on 'slow-room', size = 3, min_size = 2

  • CephFS:
    • Allow multiple filesystems: ceph fs flag set enable_multiple true
    • On each cephfs, e.g.: ceph fs set cephfs_cscf-home allow_standby_replay true
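
The device class and OSD shard settings noted above can be adjusted later; the commands and values below are only illustrative examples (osd.54 and the shard values are made up), not what is currently deployed:

# Reclassify an Optane OSD as 'nvme' (only needed if SATA SSDs are ever added)
ceph osd crush rm-device-class osd.54
ceph osd crush set-device-class nvme osd.54

# Example shard/thread settings for the Optane OSDs, in the ceph.conf pushed by Salt
# (the OSD must be restarted for these to take effect)
[osd]
osd_op_num_shards_ssd = 16
osd_op_num_threads_per_shard_ssd = 4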

Setup Steps

Note the hostnames below; some commands must be run on the salt-master, and some on one or all of the DFS machines. Install ceph packages + dependencies, and the cluster ssh key:

root@salt-204:~# salt -N dfs-c state.apply dfs-c --state-verbose=False test=True
root@salt-204:~# salt -N dfs-c state.apply dfs-c --state-verbose=False

Only if creating a new cluster:

root@m3-3101-422:~# cd /etc/ceph
root@m3-3101-422:/etc/ceph# ceph-deploy new dc-3558-422 mc-3015-422 m3-3101-422
root@m3-3101-422:/etc/ceph# ceph-deploy mon create-initial

# Copy admin keys to all hosts 
root@m3-3101-422:/etc/ceph# ceph-deploy admin dc-3558-422 mc-3015-422 m3-3101-422

# Only one mgr is active at a time, but for failover, all failure domains should have one
root@m3-3101-422:/etc/ceph# ceph-deploy mgr create dc-3558-422 mc-3015-422 m3-3101-422

# We need N MDS daemons per host, where N is the number of filesystems
# We use Salt to create them in parallel, and label them -A, -B, etc.
root@salt-204:~# salt -N dfs-c cmd.run cwd=/etc/ceph 'ceph-deploy mds create $(hostname -s):$(hostname -s)-A' 

# This also needs to be run on all nodes, to deposit all keyrings in /etc/ceph:
root@salt-204:~# salt -N dfs-c cmd.run cwd=/etc/ceph 'ceph-deploy gatherkeys $(hostname -s)'

# CRUSH configuration
# Define rooms (on any Ceph host)
ceph osd crush add-bucket dc-3558 room
ceph osd crush add-bucket mc-3015 room
ceph osd crush add-bucket m3-3101 room

# Assign rooms to the root "default"
ceph osd crush move dc-3558 root=default
ceph osd crush move mc-3015 root=default
ceph osd crush move m3-3101 root=default

# Move machines into their rooms
ceph osd crush move m3-3101-422 room=m3-3101
ceph osd crush move dc-3558-422 room=dc-3558
ceph osd crush move mc-3015-422 room=mc-3015

# Confirm
ceph osd crush tree

# Define CRUSH rules for HDD and Optanes, using 'room' as the failure domain
ceph osd crush rule create-replicated slow-room default room hdd
ceph osd crush rule create-replicated fast-room default room ssd
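
# Optionally confirm the new rules (device class and failure domain)
ceph osd crush rule ls
ceph osd crush rule dump slow-room
ceph osd crush rule dump fast-room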

If instead adding a new node to an existing cluster:

# Where dc-3558-422 is an existing node, and m3-3101-422 is the new one
root@m3-3101-422:~# cd /etc/ceph
root@m3-3101-422:/etc/ceph# ceph-deploy gatherkeys dc-3558-422
root@m3-3101-422:/etc/ceph# ceph-deploy mon create m3-3101-422
root@m3-3101-422:/etc/ceph# ceph-deploy mgr create m3-3101-422
root@m3-3101-422:/etc/ceph# ceph-deploy mds create m3-3101-422:m3-3101-422-A
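
If the new node is also in a new room, the CRUSH bucket and placement steps from the section above presumably need to be repeated for it as well, e.g. (reusing the example names from above):

ceph osd crush add-bucket m3-3101 room
ceph osd crush move m3-3101 root=default
ceph osd crush move m3-3101-422 room=m3-3101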

Now that control software is in place, we add OSDs on each host to actually store data:

root@m3-3101-422:~# cd /etc/ceph

# Uses ceph-deploy to create all 18 HDD OSDs
# Will create 18 32GB LVs on vg_optane for OSD DB + WAL
root@m3-3101-422:/etc/ceph# ~/bin/osd.py $(hostname -s)
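
# For reference, a rough single-drive equivalent of what osd.py does (an assumption
# based on the LV naming above; /dev/sdb is an illustrative device name):
root@m3-3101-422:/etc/ceph# lvcreate -L 32G --name lv_db_1 vg_optane
root@m3-3101-422:/etc/ceph# ceph-deploy osd create --data /dev/sdb --block-db vg_optane/lv_db_1 m3-3101-422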

# Create an LV on each Optane VG for the NVMe metadata OSDs - no extra devices needed
root@salt-204:~# salt -N dfs-c cmd.run cwd=/etc/ceph 'lvcreate -L 32G --name lv_osd_1 vg_optane'
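
# The lvcreate above only makes the LV; the OSD itself still has to be created on it.
# A likely follow-up, assuming ceph-volume is used directly (run on every host via Salt):
root@salt-204:~# salt -N dfs-c cmd.run 'ceph-volume lvm create --bluestore --data vg_optane/lv_osd_1'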

Confirm status:

root@m3-3101-422:/etc/ceph# ceph status
root@m3-3101-422:/etc/ceph# ceph health detail

To create a new cephfs:

# metadata on NVMe OSDs ('fsname' is a placeholder for the new filesystem's name)
root@m3-3101-422:/etc/ceph# ceph osd pool create cephfs_fsname_metadata 16 replicated fast-room
# data on HDDs
root@m3-3101-422:/etc/ceph# ceph osd pool create cephfs_fsname_data 64 replicated slow-room

# This is the default in the /etc/ceph/ceph.conf pushed by Salt, but for reference:
# Set both to size = 3 (keep 3 copies) and min_size = 2 (stop IO if fewer than 2 copies remain)
root@m3-3101-422:~# ceph osd pool set cephfs_fsname_metadata size 3
root@m3-3101-422:~# ceph osd pool set cephfs_fsname_metadata min_size 2
root@m3-3101-422:~# ceph osd pool set cephfs_fsname_data size 3
root@m3-3101-422:~# ceph osd pool set cephfs_fsname_data min_size 2

# Filesystem:
root@m3-3101-422:~# ceph fs new cephfs_fsname cephfs_fsname_metadata cephfs_fsname_data
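
# Per-filesystem settings from the layout section apply to the new filesystem too, e.g.:
root@m3-3101-422:~# ceph fs set cephfs_fsname allow_standby_replay true
root@m3-3101-422:~# ceph fs status cephfs_fsname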

Confirm status again:

root@m3-3101-422:/etc/ceph# ceph status
root@m3-3101-422:/etc/ceph# ceph health detail

Mounting

Clients should be running Ceph 14 (Nautilus) for best results. Luminous (12) mostly works but is not recommended. Mimic (13) would probably work but has not been tested.

In /etc/fstab:

10.1.152.122,10.1.154.122,10.1.155.122:/        /mnt/ceph       ceph    defaults,noauto,name=admin,secretfile=/etc/ceph/ceph.admin.secret,mds_namespace=cephfs_bs_db,x-systemd.automount 0 0

  • name = the Cephx username for auth
  • secretfile = path to a file containing the Cephx secret (see the example after this list)
  • mds_namespace = the filesystem name
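
The secretfile can be generated from the client keyring, and a one-off manual mount is a useful test before relying on fstab (client.admin and the values below are simply taken from the fstab example above):

ceph auth get-key client.admin > /etc/ceph/ceph.admin.secret
chmod 600 /etc/ceph/ceph.admin.secret
mount -t ceph 10.1.152.122,10.1.154.122,10.1.155.122:/ /mnt/ceph -o name=admin,secretfile=/etc/ceph/ceph.admin.secret,mds_namespace=cephfs_bs_db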

Reference documentation

Ceph Bugs / Our contributions

Bugs submitted by nfish:

-- NathanFish - 2019-04-16
