Project Charter


Current Layout

  • 3 servers (-422), in dc, mc, and m3
  • Hardware: Each server has:
    • Ubuntu 18.04.3 Server installed on mdraid1 SSDs
    • 18 12TB 7200rpm hard drives (18 bays empty)
    • 1 750GB NVMe Optane SSD
    • 1 960GB NVMe Optane SSD
    • Both Optanes in LVM vg_optane

  • Ceph daemons: Each server has:
    • 1 ceph-mon - all active, quorum
      • In deployments of our size, every host should have a mon

    • 1 ceph-mgr - 1 active, 2 standby
      • Only 1 mgr is active at a time
      • No significant resource usage
      • All hosts with a mon should have an mgr, for availability

    • N ceph-mds - 1 active, 1 standby-replay, and 1 standby, per filesystem
      • Number of active MDS is configurable
        • multi-mds doesn't seem needed at this point
        • Load is already spread over 1 MDS per FS

    • 18 ceph-osd bluestore daemons (eg ceph-osd@28) each using:
      • 1 HDD
      • 32GiB DB/WAL LV on Optane (eg lv_db_1)

    • 2 ceph-osd daemons each using a 128GiB LV (lv_osd_1) on the Optane
      • Multiple daemons can be needed to saturate NVMe
      • Alternatively throw more cores at it:
        • osd_op_num_shards_ssd (default 8)
        • osd_op_num_threads_per_shard_ssd (default 2)
        • Defaults are set for SATA SSDs, not NVMe/Optane

  • Pools:
    • CRUSH rule 'fast-room' that groups by room and uses only the 'ssd' device class OSDs
      • The Optanes are detected as 'ssd' rather than 'nvme'; one type will need to be manually changed if we add SATA SSDs.
    • ditto 'slow-room' for 'hdd'
    • CRUSH rule "default" (0) exists but should not be used
    • cephfs metadata pools on 'fast-room', size = 3, min_size = 2
    • cephfs data pools on 'slow-room', size = 3, min_size = 2

  • CephFS:
    • Allow multiple filesystems: ceph fs flag set enable_multiple true
    • On each cephfs: ceph fs set cephfs_cscf-home allow_standby_replay true
      • standby-replays are currently selected arbitrarily, and can end up on the same box. TODO.
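As a sanity check, the LVs above have to fit inside vg_optane; a quick budget calculation (sizes from the layout above; the raw Optane capacities are nominal vendor GB, so usable space is somewhat less):

```python
GiB = 2**30   # binary GiB, as used by LVM
GB = 10**9    # decimal GB, as used on drive labels

# LVs carved out of vg_optane, per the layout above
db_wal = 18 * 32 * GiB    # one 32 GiB DB/WAL LV per HDD OSD
nvme_osd = 2 * 128 * GiB  # two 128 GiB LVs for the NVMe OSDs
used = db_wal + nvme_osd

raw = (750 + 960) * GB    # nominal capacity of the two Optanes

print(f"allocated: {used / GiB:.0f} GiB of roughly {raw / GiB:.0f} GiB raw")
```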


Map drive to OSD

root@mc-3015-422:~# ceph-volume inventory /dev/sdm

====== Device report /dev/sdm ======

     path                      /dev/sdm
     ceph device               True
     lsm data                  {}
     available                 False
     rejected reasons          Insufficient space (<10 extents) on vgs, locked, LVM detected
     device id                 SEAGATE_ST12000NM0027_ZJV0FE3G0000R8167H79
     removable                 0
     ro                        0
     vendor                    SEAGATE
     model                     ST12000NM0027
     sas address               0x5000c500953f4af1
     rotational                1
     scheduler mode            mq-deadline
     human readable size       10.91 TB
    --- Logical Volume ---
     name                      osd-block-d6d29605-9d11-4787-8ffc-56359a5ece0f
     osd id                    30
     cluster name              ceph
     type                      block
     osd fsid                  d6d29605-9d11-4787-8ffc-56359a5ece0f
     cluster fsid              97ad84d6-3c01-49af-9ca0-a9fe1ff79597
     osdspec affinity          
     block uuid                l8fLhq-t9Pj-XgqS-6EkM-RDYV-Ga0x-hzjYQn

# sdm is used by osd.30
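The same lookup can be scripted; a sketch that pulls the OSD id out of `ceph-volume inventory <dev> --format json` output (the field names below follow a Nautilus-era ceph-volume and are an assumption - verify against your version):

```python
import json

# Sample shaped like `ceph-volume inventory /dev/sdm --format json`
# (assumed Nautilus-era field names)
sample = json.dumps({
    "path": "/dev/sdm",
    "lvs": [{"osd_id": "30", "cluster_name": "ceph"}],
})

def osd_for_device(report_json):
    """Return the OSD id(s) backed by a device, from its inventory report."""
    report = json.loads(report_json)
    return [lv["osd_id"] for lv in report.get("lvs", []) if lv.get("osd_id")]

print(osd_for_device(sample))   # ['30']
```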

Map OSD to drive

root@dc-3558-422:~# lvdisplay | grep -B 5 34FxBi-nOnW-wbok-sYoC-TiSy-aPmo-BHgbo0
  --- Logical volume ---
  LV Path                /dev/ceph-98aa9789-bb28-4ae3-b14d-28918e5d7983/osd-block-42cc657c-50fb-4568-8ac3-b1ab462bf567
  LV Name                osd-block-42cc657c-50fb-4568-8ac3-b1ab462bf567
  VG Name                ceph-98aa9789-bb28-4ae3-b14d-28918e5d7983
  LV UUID                34FxBi-nOnW-wbok-sYoC-TiSy-aPmo-BHgbo0
root@dc-3558-422:~# lvdisplay | grep -B 5 nvUXDJ-hohD-vAW1-LxFt-CcmV-7Ext-mT6U1u
  --- Logical volume ---
  LV Path                /dev/vg_optane/lv_db_23
  LV Name                lv_db_23
  VG Name                vg_optane
  LV UUID                nvUXDJ-hohD-vAW1-LxFt-CcmV-7Ext-mT6U1u

Physical device can be found from the VG associated with the "osd-block" LV above eg. osd-block-42cc657c-50fb-4568-8ac3-b1ab462bf567

root@dc-3558-422:~# pvdisplay |grep -B 3 98aa9789-bb28-4ae3-b14d-28918e5d7983
  --- Physical volume ---
  PV Name               /dev/sdw
  VG Name               ceph-98aa9789-bb28-4ae3-b14d-28918e5d7983
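The chain above (LV UUID from the block.db symlink, to VG, to physical device) can be automated by parsing the display output; a minimal sketch using the captured transcripts:

```python
import re

# Captured from the lvdisplay/pvdisplay transcripts above
lvout = """  --- Logical volume ---
  LV Path                /dev/ceph-98aa9789-bb28-4ae3-b14d-28918e5d7983/osd-block-42cc657c-50fb-4568-8ac3-b1ab462bf567
  LV Name                osd-block-42cc657c-50fb-4568-8ac3-b1ab462bf567
  VG Name                ceph-98aa9789-bb28-4ae3-b14d-28918e5d7983
  LV UUID                34FxBi-nOnW-wbok-sYoC-TiSy-aPmo-BHgbo0
"""
pvout = """  --- Physical volume ---
  PV Name               /dev/sdw
  VG Name               ceph-98aa9789-bb28-4ae3-b14d-28918e5d7983
"""

def stanzas(text, header):
    """Split lvdisplay/pvdisplay output into dicts of field -> value."""
    out = []
    for chunk in text.split(header)[1:]:
        out.append(dict(re.findall(r"^\s{2}(\S.*?)\s{2,}(\S+)$", chunk, re.M)))
    return out

# LV UUID (from the block.db symlink) -> VG -> physical device
lv = next(s for s in stanzas(lvout, "--- Logical volume ---")
          if s.get("LV UUID") == "34FxBi-nOnW-wbok-sYoC-TiSy-aPmo-BHgbo0")
pv = next(s for s in stanzas(pvout, "--- Physical volume ---")
          if s.get("VG Name") == lv["VG Name"])
print(pv["PV Name"])   # /dev/sdw
```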

OSD failure

The cluster will generally heal itself on the failure of an OSD. Hence, the failure of a single OSD is not an emergency situation and can often be left unattended until the next regular reboot of the affected host.

If a restart of the OSD is desired, there are two options for intervention on the affected host:

  • Restart the OSD daemon by numerical ID eg.
    systemctl restart ceph-osd@12.service 
    Confirm the result with systemctl status ceph-osd@12.service or ceph status.

  • Not recommended A more invasive option is:
     ceph-volume lvm activate --all 
    This kicks off a high-bandwidth, high-load reconfiguration which will severely degrade cluster performance, and OSD compactions will be required to reduce the time to resolution.

  • In extreme cases when an OSD cannot be restarted, it may be appropriate to remove the OSD and rebuild it as in the instructions for replacing a failed OSD below.
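To see at a glance which OSDs are down before intervening, the JSON form of ceph osd tree can be filtered; a sketch (the nodes/status fields follow the assumed Nautilus JSON schema; the sample data below is hypothetical):

```python
import json

# Sample shaped like `ceph osd tree -f json` (hypothetical ids/statuses)
sample = json.dumps({"nodes": [
    {"id": 12, "name": "osd.12", "type": "osd", "status": "down"},
    {"id": 13, "name": "osd.13", "type": "osd", "status": "up"},
    {"id": -1, "name": "default", "type": "root"},
]})

def down_osds(tree_json):
    """Return the names of OSD nodes whose status is 'down'."""
    tree = json.loads(tree_json)
    return [n["name"] for n in tree["nodes"]
            if n.get("type") == "osd" and n.get("status") == "down"]

print(down_osds(sample))   # ['osd.12']
```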

Failed drive/OSD replacement

When a hard drive fails, the Ceph system should automatically redistribute data to regain the required redundancy. Immediate attention is generally not required. Due to the loads associated with such a rebuild, it may be prudent to proactively replace failing drives during times of low user activity.

If a drive fails, the associated OSD should be reported down. For this example, osd.61.

On the host with the failed drive:

root@dc-3558-422:~# ceph osd out 61
# Wait until move complete

root@dc-3558-422:~# ceph -w

# Get the block.db LV ID
root@dc-3558-422:~# ls -lh /var/lib/ceph/osd/ceph-61/block.db
lrwxrwxrwx 1 ceph ceph 50 Oct 8 07:15 block.db -> /dev/mapper/nvUXDJ-hohD-vAW1-LxFt-CcmV-7Ext-mT6U1u

root@dc-3558-422:~# lvdisplay | grep -B 4 nvUXDJ-hohD-vAW1-LxFt-CcmV-7Ext-mT6U1u
 --- Logical volume ---
  LV Path                /dev/vg_optane/lv_db_23

root@dc-3558-422:~# ceph osd safe-to-destroy osd.61
OSD(s) 61 are safe to destroy without reducing data durability.

root@dc-3558-422:~# ceph osd set noout
root@dc-3558-422:~# systemctl stop ceph-osd@61

root@dc-3558-422:~# ceph osd destroy 61 --yes-i-really-mean-it

Identify and replace sdw using the usual tools eg.

root@dc-3558-422:~# ledctl locate=/dev/sdw
root@dc-3558-422:~# ledctl locate_off=/dev/sdw

Once the new drive is ready, use the block.db LV identified above:

# Confirm new drive letter
root@dc-3558-422:~# dmesg | tail 
root@dc-3558-422:~# ceph-volume lvm zap  vg_optane/lv_db_23

# Create new OSD, re-using the previous ID & block-db, and enabling dmcrypt.  
root@dc-3558-422:~# ceph-volume lvm create --osd-id 61 --data /dev/sdX --block.db vg_optane/lv_db_23 --dmcrypt 
root@dc-3558-422:~# ceph osd unset noout

PG_DAMAGED, Possible data damage, pg inconsistent

Normal operation of the cluster and its scrubbing functionality will occasionally reveal inconsistencies in placement groups. Detecting these inconsistencies is a strength of the Ceph BlueStore storage backend; if they are addressed in a reasonable timeframe, they should not pose any issues for user performance or data integrity. Such errors are often due to hard drive failures.

When the system presents such a critical warning eg.

# ceph status
    id:     97ad84d6-3c01-49af-9ca0-a9fe1ff79597
    health: HEALTH_ERR
            BlueFS spillover detected on 25 OSD(s)
            1 scrub errors
            Possible data damage: 1 pg inconsistent
            1 daemons have recently crashed

Check for the details and identify the affected pg (7.e below):

# ceph health detail
OSD_SCRUB_ERRORS 1 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
pg 7.e is active+clean+inconsistent+snaptrim_wait, acting [2,23,43]
RECENT_CRASH 1 daemons have recently crashed
osd.2 crashed on host at 2020-12-15 21:08:24.422408Z

One can find out the nature of the inconsistency for the various shards of the pg (a read error below):

# rados list-inconsistent-obj 7.e --format=json-pretty
{
    "epoch": 58062,
    "inconsistents": [
        {
            "object": {
                "name": "1000aa293ca.00000000",
                "nspace": "",
                "locator": "",
                "snap": "head",
                "version": 6029909
            },
            "errors": [],
            "union_shard_errors": [
                "read_error"
            ],
            "selected_object_info": {
                "oid": {
                    "oid": "1000aa293ca.00000000",
                    "key": "",
                    "snapid": -2,
                    "hash": 2873700878,
                    "max": 0,
                    "pool": 7,
                    "namespace": ""
                },
                "version": "17883'6071326",
                "prior_version": "17880'6029909",
                "last_reqid": "osd.2.0:526673",
                "user_version": 6029909,
                "size": 131100,
                "mtime": "2019-12-12 15:36:52.344640",
                "local_mtime": "2019-12-12 15:36:52.351355",
                "lost": 0,
                "flags": [],
                "truncate_seq": 0,
                "truncate_size": 0,
                "data_digest": "0xc18b6723",
                "omap_digest": "0xffffffff",
                "expected_object_size": 0,
                "expected_write_size": 0,
                "alloc_hint_flags": 0,
                "manifest": {
                    "type": 0
                },
                "watchers": {}
            },
            "shards": [
                {
                    "osd": 2,
                    "primary": true,
                    "errors": [
                        "read_error"
                    ],
                    "size": 131100
                },
                {
                    "osd": 23,
                    "primary": false,
                    "errors": [],
                    "size": 131100,
                    "omap_digest": "0xffffffff",
                    "data_digest": "0xc18b6723"
                },
                {
                    "osd": 43,
                    "primary": false,
                    "errors": [],
                    "size": 131100,
                    "omap_digest": "0xffffffff",
                    "data_digest": "0xc18b6723"
                }
            ]
        }
    ]
}

For a "read error", which is likely due to a drive failing to provide data during the scrub procedure, an additional deep scrub of the pg should clear the error within a few hours:

ceph pg deep-scrub 7.e

For more details, refer to the Ceph documentation.
The Ceph pg repair function can be manually invoked:

# ceph pg repair 7.e
instructing pg 7.e on osd.2 to repair
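The list-inconsistent-obj output above can be summarized programmatically to show which shard carries the error; a sketch against a trimmed sample of the same shape:

```python
import json

# Trimmed sample shaped like `rados list-inconsistent-obj 7.e --format=json-pretty`
sample = json.dumps({
    "epoch": 58062,
    "inconsistents": [{
        "object": {"name": "1000aa293ca.00000000"},
        "union_shard_errors": ["read_error"],
        "shards": [
            {"osd": 2, "primary": True, "errors": ["read_error"]},
            {"osd": 23, "primary": False, "errors": []},
            {"osd": 43, "primary": False, "errors": []},
        ],
    }],
})

def bad_shards(report_json):
    """Map object name -> list of (osd, errors) for shards that reported errors."""
    report = json.loads(report_json)
    out = {}
    for inc in report["inconsistents"]:
        bad = [(s["osd"], s["errors"]) for s in inc["shards"] if s["errors"]]
        out[inc["object"]["name"]] = bad
    return out

print(bad_shards(sample))   # {'1000aa293ca.00000000': [(2, ['read_error'])]}
```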


Monthly Maintenance

  • On Tuesdays; dc, mc, and m3 each get one maintenance window per month
  • schedule 30 minutes of flexible downtime for host at icinga.cscf
  • On salt-master eg.
  • salt state.apply --state-verbose=False test=True
  • salt state.apply --state-verbose=False
  • ssh to the server: ssh mc-3015-422
  • apt update && apt full-upgrade
  • ceph osd set noout
  • reboot
  • Confirm HEALTH_OK. If not all OSDs are up, see OSD failure section above
  • Once HEALTH_OK (all OSDs are up), clear noout: ceph osd unset noout
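The "wait for HEALTH_OK, then unset noout" step lends itself to a small polling helper; a sketch, assuming the health.status field of ceph status -f json (Nautilus-era schema):

```python
import json
import subprocess
import time

def cluster_health(run=None):
    """Health status string from `ceph status -f json` (assumed Nautilus schema)."""
    run = run or (lambda: subprocess.check_output(["ceph", "status", "-f", "json"]))
    return json.loads(run())["health"]["status"]

def wait_health_ok(run=None, interval=30, timeout=1800):
    """Poll until HEALTH_OK; only then is it safe to `ceph osd unset noout`."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if cluster_health(run) == "HEALTH_OK":
            return True
        time.sleep(interval)
    return False

# Offline demonstration with canned responses instead of a live cluster:
responses = iter(['{"health": {"status": "HEALTH_WARN"}}',
                  '{"health": {"status": "HEALTH_OK"}}'])
print(wait_health_ok(run=lambda: next(responses), interval=0))   # True
```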

End of Term Maintenance

  • Removal of old account data. Remove "trash" directories created by (re)moving files/folders
  • Update and restart of all client systems
  • Rebalance the cluster:
root@dc-3558-422:~# ceph balancer status
{
    "last_optimize_duration": "",
    "plans": [],
    "mode": "upmap",
    "active": false,
    "optimize_result": "",
    "last_optimize_started": ""
}

root@dc-3558-422:~# ceph balancer on 

ceph status will report "recovery" activity. This may take some time to complete.

Once complete, disable the balancer:

root@m3-3101-422:~# ceph balancer status
{
    "last_optimize_duration": "0:00:00.027298",
    "plans": [],
    "mode": "upmap",
    "active": true,
    "optimize_result": "Unable to find further optimization, or pool(s) pg_num is decreasing, or distribution is already perfect",
    "last_optimize_started": "Fri Aug 21 10:08:00 2020"
}

root@dc-3558-422:~# ceph balancer off 
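Whether the balancer has converged can be read off optimize_result; a sketch using the status output captured above:

```python
import json

# Sample shaped like `ceph balancer status` after convergence (from the transcript above)
sample = json.dumps({
    "last_optimize_duration": "0:00:00.027298",
    "plans": [],
    "mode": "upmap",
    "active": True,
    "optimize_result": "Unable to find further optimization, or pool(s) "
                       "pg_num is decreasing, or distribution is already perfect",
    "last_optimize_started": "Fri Aug 21 10:08:00 2020",
})

def balancer_done(status_json):
    """True when the balancer is active but reports nothing further to optimize."""
    status = json.loads(status_json)
    return status["active"] and "further optimization" in status["optimize_result"]

print(balancer_done(sample))   # True
```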

Manually Adding JSON Support for Graphing

JSON output for ceph fs status isn't supported by default on older versions of Ceph. We use the JSON output to generate filesystem graphs. It can be manually patched into older versions of Ceph by replacing the Python script responsible for the ceph fs status command. These changes will be overwritten if Ceph is ever updated.

On each server, backup /usr/share/ceph/mgr/status/ to a safe place.

Overwrite /usr/share/ceph/mgr/status/ with the following:

Restart the mgr service

systemctl restart ceph-mgr@dc-3558-422.service

After all servers are complete, verify that ceph fs status cephfs_cscf-home -f json provides a JSON formatted output.
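The verification can be scripted; a minimal sketch that simply checks the command output parses as JSON (the sample shape below is hypothetical, not the real ceph fs status schema):

```python
import json
import subprocess

def fs_status_json(fsname, run=None):
    """Return parsed `ceph fs status <fsname> -f json`; raises if not valid JSON."""
    run = run or (lambda: subprocess.check_output(
        ["ceph", "fs", "status", fsname, "-f", "json"]))
    return json.loads(run())

# Offline check with a captured sample (hypothetical, minimal shape):
sample = '{"mdsmap": [], "pools": []}'
print(type(fs_status_json("cephfs_cscf-home", run=lambda: sample)))
```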

Setup Steps

Raw notes:

Note the hostnames below; some commands must be run on the salt-master, and some on one or all of the DFS machines. Install ceph packages + dependencies, and the cluster ssh key:

root@salt-204:~# salt -N dfs-c state.apply dfs-c --state-verbose=False test=True
root@salt-204:~# salt -N dfs-c state.apply dfs-c --state-verbose=False

Only if creating a new cluster:

root@m3-3101-422:~# cd /etc/ceph
root@m3-3101-422:/etc/ceph# ceph-deploy new dc-3558-422 mc-3015-422 m3-3101-422
root@m3-3101-422:/etc/ceph# ceph-deploy mon create-initial

# Copy admin keys to all hosts 
root@m3-3101-422:/etc/ceph# ceph-deploy admin dc-3558-422 mc-3015-422 m3-3101-422

# Only one mgr is active at a time, but for failover, all failure domains should have one
root@m3-3101-422:/etc/ceph# ceph-deploy mgr create dc-3558-422 mc-3015-422 m3-3101-422

# We need N MDS daemons per host, where N is the number of filesystems
# We use Salt to make them in parallel, and label them -A, -B, etc
root@salt-204:~# salt -N dfs-c cmd.run 'ceph-deploy mds create $(hostname -s):$(hostname -s)-A' cwd=/etc/ceph

# This also needs to be run on all nodes, to deposit all keyrings in /etc/ceph:
root@salt-204:~# salt -N dfs-c cmd.run 'ceph-deploy gatherkeys $(hostname -s)' cwd=/etc/ceph

# CRUSH configuration
# Define rooms (on any Ceph host)
ceph osd crush add-bucket dc-3558 room
ceph osd crush add-bucket mc-3015 room
ceph osd crush add-bucket m3-3101 room

# Assign rooms to the root "default"
ceph osd crush move dc-3558 root=default
ceph osd crush move mc-3015 root=default
ceph osd crush move m3-3101 root=default

# Move machines into their rooms
ceph osd crush move m3-3101-422 room=m3-3101
ceph osd crush move dc-3558-422 room=dc-3558
ceph osd crush move mc-3015-422 room=mc-3015

# Confirm
ceph osd crush tree

# Define CRUSH rules for HDD and Optanes, using 'room' as the failure domain
ceph osd crush rule create-replicated slow-room default room hdd
ceph osd crush rule create-replicated fast-room default room ssd
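Why 'room' as the failure domain: each of the 3 replicas lands in a different room, so losing a whole room loses at most one copy and IO continues at min_size = 2. A toy illustration of that invariant (not the CRUSH algorithm itself):

```python
# Toy illustration of the room failure domain (not CRUSH):
# size=3 replicas, one per room, so one room down leaves 2 copies >= min_size=2.
rooms = ["dc-3558", "mc-3015", "m3-3101"]
size, min_size = 3, 2

placement = rooms[:size]                              # one replica per room
surviving = [r for r in placement if r != "mc-3015"]  # a whole room fails

print(len(surviving))              # 2 copies left
print(len(surviving) >= min_size)  # True -> IO continues
```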

If instead adding a new node to an existing cluster:

# Where dc-3558-422 is an existing node, and m3-3101-422 is the new one
root@m3-3101-422:~# cd /etc/ceph
root@m3-3101-422:/etc/ceph# ceph-deploy gatherkeys dc-3558-422
root@m3-3101-422:/etc/ceph# ceph-deploy mon create m3-3101-422
root@m3-3101-422:/etc/ceph# ceph-deploy mgr create m3-3101-422
root@m3-3101-422:/etc/ceph# ceph-deploy mds create m3-3101-422:m3-3101-422-A

Now that control software is in place, we add OSDs on each host to actually store data:

root@m3-3101-422:~# cd /etc/ceph

# Uses ceph-deploy to create all 18 HDD OSDs
# Will create 18 32GB LVs on vg_optane for OSD DB + WAL
root@m3-3101-422:/etc/ceph# ~/bin/ $(hostname -s)

# Create LVs for the NVMe metadata OSDs - no separate DB devices
root@salt-204:~# salt -N dfs-c cmd.run 'lvcreate -L 32G --name lv_osd_1 vg_optane'

Confirm status:

root@m3-3101-422:/etc/ceph# ceph status
root@m3-3101-422:/etc/ceph# ceph health detail

To create a new cephfs:

# metadata on NVMe OSDs
root@m3-3101-422:/etc/ceph# ceph osd pool create cephfs_metadata 16 replicated fast-room
# data on HDDs
root@m3-3101-422:/etc/ceph# ceph osd pool create cephfs_data 64 replicated slow-room

# This is default in the /etc/ceph/ceph.conf pushed by Salt, but for reference:
# Set both to size = 3 (keep 3 copies) and min_size = 2 (stop all IO if fewer than 2 copies remain)
root@m3-3101-422:~# ceph osd pool set cephfs_metadata size 3
root@m3-3101-422:~# ceph osd pool set cephfs_metadata min_size 2
root@m3-3101-422:~# ceph osd pool set cephfs_data size 3
root@m3-3101-422:~# ceph osd pool set cephfs_data min_size 2

# Filesystem:
root@m3-3101-422:~# ceph fs new cephfs_fsname cephfs_fsname_metadata cephfs_fsname_data

Confirm status again:

root@m3-3101-422:/etc/ceph# ceph status
root@m3-3101-422:/etc/ceph# ceph health detail


Clients should be running Ceph 14 (Nautilus) for best results. Luminous (12) mostly works but is not recommended. Mimic (13) would probably work but has not been tested.

In /etc/fstab:

,,        /mnt/ceph       ceph    defaults,noauto,name=admin,secretfile=/etc/ceph/ceph.admin.secret,mds_namespace=cephfs_bs_db,x-systemd.automount 0 0

  • name = the Cephx username for auth
  • secretfile = Cephx secret
  • mds_namespace = the filesystem name
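Assembling such an entry can be scripted; a hypothetical helper (the device field - the mon address list - is site-specific and passed in; the example value below is made up):

```python
def cephfs_fstab_line(device, mountpoint, name, secretfile, fsname):
    """Assemble a CephFS /etc/fstab entry (hypothetical helper).

    `device` is the mon address list plus path, e.g. "mon1,mon2,mon3:/",
    which is site-specific and not shown in this document.
    """
    opts = ",".join([
        "defaults", "noauto",
        f"name={name}",
        f"secretfile={secretfile}",
        f"mds_namespace={fsname}",
        "x-systemd.automount",
    ])
    return f"{device}\t{mountpoint}\tceph\t{opts} 0 0"

print(cephfs_fstab_line("mon1,mon2,mon3:/", "/mnt/ceph", "admin",
                        "/etc/ceph/ceph.admin.secret", "cephfs_bs_db"))
```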

Manual mounting:

root@ubuntu1804-2000:~# mount -t ceph -o name=jimmylin,secretfile=/etc/ceph/ceph.client.jimmylin.secret,mds_namespace=cephfs_jimmylin        /mnt/ceph     

Reference documentation

Ceph Bugs / Our contributions

Bugs submitted by nfish:

-- NathanFish - 2019-04-16
