Project Charter

Dfs-C

See also: https://cs.uwaterloo.ca/twiki/view/CFPrivate/DFSC

Current Layout

  • 3 servers (hostnames ending in -422), one each in dc, mc, and m3
  • Hardware: Each server has:
    • Ubuntu 18.04.3 Server installed on mdraid1 SSDs
    • 18 12TB 7200rpm hard drives (18 bays empty)
    • 1 750GB NVMe Optane SSD
    • 1 960GB NVMe Optane SSD
    • Both Optanes in LVM vg_optane

  • Ceph daemons: Each server has:
    • 1 ceph-mon - all active, quorum
      • In deployments of our size, all hosts should have a mon

    • 1 ceph-mgr - 1 active, 2 standby
      • Only one mgr is active at a time
      • No significant resource usage
      • All hosts with a mon should have an mgr, for availability

    • N ceph-mds - 1 active, 1 standby-replay, and 1 standby, per filesystem
      • Number of active MDS is configurable
        • multi-mds doesn't seem needed at this point
        • Load is already spread over 1 MDS per FS

    • 18 ceph-osd bluestore daemons (eg ceph-osd@28) each using:
      • 1 HDD
      • 32GiB DB/WAL LV on Optane (eg lv_db_1)

    • 2 ceph-osd daemons each using a 128GiB LV (lv_osd_1) on the Optane
      • Multiple daemons can be needed to saturate NVMe
      • Alternatively throw more cores at it:
        • osd_op_num_shards_ssd (default 8)
        • osd_op_num_threads_per_shard_ssd (default 2)
        • Defaults are set for SATA SSDs, not NVMe/Optane

  • Pools:
    • CRUSH rule 'fast-room' that groups by room and uses only the 'ssd' device class OSDs
      • The Optanes are detected as 'ssd' rather than 'nvme'; one type will need to be manually changed if we add SATA SSDs.
    • ditto 'slow-room' for 'hdd'
    • CRUSH rule "default" (0) exists but should not be used
    • cephfs metadata pools on 'fast-room', size = 3, min_size = 2
    • cephfs data pools on 'slow-room', size = 3, min_size = 2

  • CephFS:
    • Allow multiple filesystems: ceph fs flag set enable_multiple true
    • On each cephfs: ceph fs set cephfs_cscf-home allow_standby_replay true
      • standby-replays are currently selected arbitrarily, and can end up on the same box. TODO.
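The pool-to-rule assignments above can be spot-checked per pool. A minimal sketch — the sample output is inlined (assuming it matches the usual `crush_rule: <name>` format); on a live host, pipe the real `ceph osd pool get <pool> crush_rule` output instead:

```shell
# Sketch: confirm which CRUSH rule a pool uses.
# Sample output inlined below; on a live host run:
#   ceph osd pool get <metadata pool> crush_rule
rule=$(awk -F': ' '/^crush_rule/ {print $2}' <<'EOF'
crush_rule: fast-room
EOF
)
echo "$rule"
```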

Troubleshooting

Map drive to OSD

root@mc-3015-422:~# pvdisplay /dev/sdr
  --- Physical volume ---
  PV Name               /dev/sdr
  VG Name               ceph-54aff479-11d1-46b1-b12f-e8c95369a18b
  PV Size               10.91 TiB / not usable 1.00 GiB
  Allocatable           yes (but full)
  PE Size               1.00 GiB
  Total PE              11175
  Free PE               0
  Allocated PE          11175
  PV UUID               VuTr6q-RX70-TDdK-UBXE-OqcO-CVgm-SoJDo1

# Look up LVs in VG named above 
root@mc-3015-422:~# lvdisplay ceph-54aff479-11d1-46b1-b12f-e8c95369a18b
  --- Logical volume ---
  LV Path                /dev/ceph-54aff479-11d1-46b1-b12f-e8c95369a18b/osd-block-ee401215-39bc-477c-a87e-ddbdcc36dd96
  LV Name                osd-block-ee401215-39bc-477c-a87e-ddbdcc36dd96
  VG Name                ceph-54aff479-11d1-46b1-b12f-e8c95369a18b
  LV UUID                02uI5i-UiFM-VnAD-p3H3-PcWc-rmTC-ju6Ivq
  LV Write Access        read/write
  LV Creation host, time mc-3015-422, 2019-06-18 11:48:15 -0400
  LV Status              available
  # open                 1
  LV Size                10.91 TiB
  Current LE             11175
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:54

# Grep for symlinks that match the LV UUID:
root@mc-3015-422:~# file /var/lib/ceph/osd/ceph-*/block | grep 02uI5i-UiFM-VnAD-p3H3-PcWc-rmTC-ju6Ivq
/var/lib/ceph/osd/ceph-35/block: symbolic link to /dev/mapper/02uI5i-UiFM-VnAD-p3H3-PcWc-rmTC-ju6Ivq

# sdr is used by osd.35
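The last lookup step can be condensed into a small text-processing helper. A sketch, assuming the `file` output format matches the transcript above (the sample line is copied from it); on a live host, generate the line with `file /var/lib/ceph/osd/ceph-*/block | grep <LV UUID>`:

```shell
# Sketch: extract the OSD id from a matching "file" output line.
# The sample line mirrors the transcript above.
line='/var/lib/ceph/osd/ceph-35/block: symbolic link to /dev/mapper/02uI5i-UiFM-VnAD-p3H3-PcWc-rmTC-ju6Ivq'
# Pull the number out of the /var/lib/ceph/osd/ceph-<N>/block path
osd_id=$(printf '%s\n' "$line" | sed -n 's|.*/ceph-\([0-9]*\)/block:.*|\1|p')
echo "osd.$osd_id"
```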

Map OSD to drive

 
root@dc-3558-422:~# lvdisplay | grep -B 5 34FxBi-nOnW-wbok-sYoC-TiSy-aPmo-BHgbo0
   
  --- Logical volume ---
  LV Path                /dev/ceph-98aa9789-bb28-4ae3-b14d-28918e5d7983/osd-block-42cc657c-50fb-4568-8ac3-b1ab462bf567
  LV Name                osd-block-42cc657c-50fb-4568-8ac3-b1ab462bf567
  VG Name                ceph-98aa9789-bb28-4ae3-b14d-28918e5d7983
  LV UUID                34FxBi-nOnW-wbok-sYoC-TiSy-aPmo-BHgbo0
root@dc-3558-422:~# lvdisplay | grep -B 5 nvUXDJ-hohD-vAW1-LxFt-CcmV-7Ext-mT6U1u
   
  --- Logical volume ---
  LV Path                /dev/vg_optane/lv_db_23
  LV Name                lv_db_23
  VG Name                vg_optane
  LV UUID                nvUXDJ-hohD-vAW1-LxFt-CcmV-7Ext-mT6U1u

The physical device can be found from the VG associated with the "osd-block" LV above, eg. osd-block-42cc657c-50fb-4568-8ac3-b1ab462bf567:

root@dc-3558-422:~# pvdisplay |grep -B 3 98aa9789-bb28-4ae3-b14d-28918e5d7983
   
  --- Physical volume ---
  PV Name               /dev/sdw
  VG Name               ceph-98aa9789-bb28-4ae3-b14d-28918e5d7983
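The chain above (OSD → block symlink → VG → PV) can be partly scripted. A sketch — the LV path is the sample from the transcript above; on a live host it would come from `readlink /var/lib/ceph/osd/ceph-<N>/block` plus an `lvdisplay` lookup, and the final `pvs` step needs root and LVM:

```shell
# Sketch: given an OSD's backing LV path, recover the VG name.
# Sample LV path copied from the transcript above.
lv_path='/dev/ceph-98aa9789-bb28-4ae3-b14d-28918e5d7983/osd-block-42cc657c-50fb-4568-8ac3-b1ab462bf567'
# Field 3 of /dev/<vg>/<lv> is the VG name
vg=$(printf '%s\n' "$lv_path" | cut -d/ -f3)
echo "$vg"
# Then, on a live host: pvs --noheadings -o pv_name -S vg_name="$vg"
```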

OSD failure

The cluster will generally heal itself on the failure of an OSD. Hence, the failure of a single OSD is not an emergency situation and can often be left unattended until the next regular reboot of the affected host.

If a restart of the OSD is desired, there are two options for intervention on the affected host:

  • Restart OSD daemon by numerical ID eg.
    systemctl restart ceph-osd@12.service 
    Confirm with systemctl status ceph-osd@12.service or ceph status that the OSD came back up.

  • Not recommended: a more invasive restart, which kicks off a high-bandwidth/high-load reconfiguration that will severely degrade cluster performance and require OSD compactions to reduce the time to resolve:
     ceph-volume lvm activate --all 

  • In extreme cases when an OSD cannot be restarted, it may be appropriate to remove the OSD and rebuild it as in the instructions for replacing a failed OSD below.
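To see at a glance which OSDs are down, the `ceph osd tree` output can be filtered. A sketch, assuming the usual column layout (ID CLASS WEIGHT NAME STATUS ...) — the sample is inlined; on a live host, pipe the real command instead of the heredoc:

```shell
# Sketch: list down OSDs from "ceph osd tree"-style output.
# Column layout is an assumption; sample rows are illustrative only.
down=$(awk '$5 == "down" {print $4}' <<'EOF'
ID CLASS WEIGHT   NAME    STATUS REWEIGHT PRI-AFF
 0   hdd 10.91409 osd.0   up     1.00000  1.00000
12   hdd 10.91409 osd.12  down   1.00000  1.00000
EOF
)
echo "$down"
```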

Failed drive/OSD replacement

When a hard drive fails, the Ceph system should automatically redistribute data to regain the required redundancy. Immediate attention is generally not required. Due to the load associated with such a rebuild, it may be prudent to proactively replace failing drives during times of low user activity.

If a drive fails, the associated OSD should be reported down. This example uses osd.61.

On the host with the failed drive:

# Migrate data off the failing drive
root@dc-3558-422:~# ceph osd out 61
# Wait until move complete
root@dc-3558-422:~# ceph -w
root@dc-3558-422:~# systemctl stop ceph-osd@61.service
root@dc-3558-422:~# ls -lh /var/lib/ceph/osd/ceph-61
total 28K
lrwxrwxrwx 1 ceph ceph  50 Oct  8 07:15 block -> /dev/mapper/34FxBi-nOnW-wbok-sYoC-TiSy-aPmo-BHgbo0
lrwxrwxrwx 1 ceph ceph  50 Oct  8 07:15 block.db -> /dev/mapper/nvUXDJ-hohD-vAW1-LxFt-CcmV-7Ext-mT6U1u
...

root@dc-3558-422:~# ceph osd purge 61
purged osd.61

Identify and replace sdw using the usual tools, eg.

root@dc-3558-422:~# ledctl locate=/dev/sdw
root@dc-3558-422:~# ledctl locate_off=/dev/sdw

Once the new drive is ready, using the VG identified above:

root@dc-3558-422:/etc/ceph# ceph-volume lvm zap  vg_optane/lv_db_23
root@dc-3558-422:/etc/ceph# ceph-volume lvm zap ceph-98aa9789-bb28-4ae3-b14d-28918e5d7983/osd-block-42cc657c-50fb-4568-8ac3-b1ab462bf567

# ceph-deploy must be run with keys in $PWD
root@m3-3101-422:/var/lib/ceph/osd# cd /etc/ceph/

root@dc-3558-422:/etc/ceph# ceph-deploy osd create --bluestore --dmcrypt --data ceph-98aa9789-bb28-4ae3-b14d-28918e5d7983/osd-block-42cc657c-50fb-4568-8ac3-b1ab462bf567 --block-db vg_optane/lv_db_23 $(hostname -s)
root@m3-3101-422:~#  ceph osd unset noout
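The drain/stop/purge portion of the procedure above follows one pattern per OSD. A dry-run sketch that only prints the commands (the helper name is a placeholder; the zap and ceph-deploy steps still follow as shown above):

```shell
# Sketch: dry-run checklist for retiring a failed OSD. Prints only;
# remove the echoes to run for real (and wait for "ceph -w" between steps).
replace_osd_plan() {
  osd="$1"
  echo "ceph osd out $osd"
  echo "systemctl stop ceph-osd@$osd.service"
  echo "ceph osd purge $osd"
}
replace_osd_plan 61
```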

PG_DAMAGED, Possible data damage, pg inconsistent

Normal operation of the cluster and its scrubbing functionality will occasionally reveal inconsistencies in placement groups. Identifying these inconsistencies is a strength of the Ceph Bluestore storage backend and should not pose any issues for user performance or data integrity if addressed in a reasonable timeframe. Such errors are often due to hard drive failures.

When the system presents such a critical warning eg.

# ceph status
  cluster:
    id:     97ad84d6-3c01-49af-9ca0-a9fe1ff79597
    health: HEALTH_ERR
            BlueFS spillover detected on 25 OSD(s)
            1 scrub errors
            Possible data damage: 1 pg inconsistent
            1 daemons have recently crashed

Check for the details and identify the affected pg (7.e below):

# ceph health detail
OSD_SCRUB_ERRORS 1 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
pg 7.e is active+clean+inconsistent+snaptrim_wait, acting [2,23,43]
RECENT_CRASH 1 daemons have recently crashed
osd.2 crashed on host dc-3558-422.cloud.cs.uwaterloo.ca at 2020-12-15 21:08:24.422408Z
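The affected pg id can be pulled out of `ceph health detail` mechanically. A sketch — the sample lines are copied from the transcript above; on a live host, pipe the real command output instead of the heredoc:

```shell
# Sketch: extract inconsistent pg ids from "ceph health detail" output.
pgid=$(sed -n 's/^pg \([0-9a-f.]*\) is .*inconsistent.*/\1/p' <<'EOF'
OSD_SCRUB_ERRORS 1 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
pg 7.e is active+clean+inconsistent+snaptrim_wait, acting [2,23,43]
EOF
)
echo "$pgid"
# Then: ceph pg deep-scrub "$pgid"
```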

One can find out the nature of the inconsistency for the various shards of the pg (a read error below):

# rados list-inconsistent-obj 7.e --format=json-pretty
{
    "epoch": 58062,
    "inconsistents": [
        {
            "object": {
                "name": "1000aa293ca.00000000",
                "nspace": "",
                "locator": "",
                "snap": "head",
                "version": 6029909
            },
            "errors": [],
            "union_shard_errors": [
                "read_error"
            ],
            "selected_object_info": {
                "oid": {
                    "oid": "1000aa293ca.00000000",
                    "key": "",
                    "snapid": -2,
                    "hash": 2873700878,
                    "max": 0,
                    "pool": 7,
                    "namespace": ""
                },
                "version": "17883'6071326",
                "prior_version": "17880'6029909",
                "last_reqid": "osd.2.0:526673",
                "user_version": 6029909,
                "size": 131100,
                "mtime": "2019-12-12 15:36:52.344640",
                "local_mtime": "2019-12-12 15:36:52.351355",
                "lost": 0,
                "flags": [
                    "dirty",
                    "data_digest",
                    "omap_digest"
                ],
                "truncate_seq": 0,
                "truncate_size": 0,
                "data_digest": "0xc18b6723",
                "omap_digest": "0xffffffff",
                "expected_object_size": 0,
                "expected_write_size": 0,
                "alloc_hint_flags": 0,
                "manifest": {
                    "type": 0
                },
                "watchers": {}
            },
            "shards": [
                {
                    "osd": 2,
                    "primary": true,
                    "errors": [
                        "read_error"
                    ],
                    "size": 131100
                },
                {
                    "osd": 23,
                    "primary": false,
                    "errors": [],
                    "size": 131100,
                    "omap_digest": "0xffffffff",
                    "data_digest": "0xc18b6723"
                },
                {
                    "osd": 43,
                    "primary": false,
                    "errors": [],
                    "size": 131100,
                    "omap_digest": "0xffffffff",
                    "data_digest": "0xc18b6723"
                }
            ]
        }
    ]
}

For a "read error", which is likely due to a drive failing to provide data during the scrub procedure, an additional deep scrub of the pg should clear the error within a few hours:

ceph pg deep-scrub 7.e

For more details, refer to the Ceph documentation: https://docs.ceph.com/en/latest/rados/operations/pg-repair/

The Ceph pg repair function can be manually invoked:

# ceph pg repair 7.e
instructing pg 7.e on osd.2 to repair

Maintenance

Monthly Maintenance

  • Tuesdays; dc, mc, and m3 are each done once per month
  • Schedule 30 minutes of flexible downtime for the host at icinga.cscf
  • On salt-master eg. salt-204.cscf.uwaterloo.ca:
  • salt mc-3015-422.cloud.cs.uwaterloo.ca state.apply --state-verbose=False test=True
  • salt mc-3015-422.cloud.cs.uwaterloo.ca state.apply --state-verbose=False
  • ssh to the server: ssh mc-3015-422
  • apt update && apt full-upgrade
  • ceph osd set noout
  • reboot
  • Confirm HEALTH_OK. If not all OSDs are up, see OSD failure section above
  • Once HEALTH_OK (all OSDs are up), clear noout: ceph osd unset noout
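The per-host sequence above can be kept as a dry-run checklist. A sketch that only prints the steps (hostname is an example; run the real commands interactively as described above, confirming HEALTH_OK before clearing noout):

```shell
# Sketch: dry-run list of the monthly maintenance steps for one host.
host=mc-3015-422
plan=$(printf '%s\n' \
  "ssh $host" \
  "apt update && apt full-upgrade" \
  "ceph osd set noout" \
  "reboot" \
  "ceph osd unset noout")
echo "$plan"
```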

End of Term Maintenance

  • Remove old account data and the "trash" directories created by (re)moving files/folders
  • Update and restart of all client systems
  • Rebalance the cluster:
root@dc-3558-422:~# ceph balancer status
{
    "last_optimize_duration": "", 
    "plans": [], 
    "mode": "upmap", 
    "active": false, 
    "optimize_result": "", 
    "last_optimize_started": ""
}

root@dc-3558-422:~# ceph balancer on 

ceph status will report "recovery" activity. This may take some time to complete.

Once complete, disable the balancer:

root@m3-3101-422:~# ceph balancer status
{
    "last_optimize_duration": "0:00:00.027298", 
    "plans": [], 
    "mode": "upmap", 
    "active": true, 
    "optimize_result": "Unable to find further optimization, or pool(s) pg_num is decreasing, or distribution is already perfect", 
    "last_optimize_started": "Fri Aug 21 10:08:00 2020"
}

root@dc-3558-422:~# ceph balancer off 
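Whether the balancer is on can be checked mechanically from the JSON above. A sketch — the sample JSON is inlined; on a live host, pipe the real `ceph balancer status` output instead of the heredoc:

```shell
# Sketch: pull the "active" flag out of "ceph balancer status" JSON.
active=$(grep -o '"active": *[a-z]*' <<'EOF' | awk '{print $2}'
{
    "mode": "upmap",
    "active": false,
    "plans": []
}
EOF
)
echo "$active"
```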

Manually Adding JSON Support for Graphing

JSON output isn't supported by default on older versions of Ceph. We use the JSON output to generate filesystem graphs. It can be patched into older versions of Ceph by replacing the Python module responsible for the ceph fs status command. These changes will be overwritten if Ceph is ever updated.

On each server, backup /usr/share/ceph/mgr/status/module.py to a safe place.

Overwrite /usr/share/ceph/mgr/status/module.py with the patched version from: https://github.com/chenerqi/ceph/blob/master/src/pybind/mgr/status/module.py

Restart the mgr service

systemctl restart ceph-mgr@dc-3558-422.service

After all servers are complete, verify that ceph fs status cephfs_cscf-home -f json provides a JSON formatted output.
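The verification step can be made mechanical by feeding the output through a JSON parser. A sketch, assuming python3 is on PATH; the sample document here is a made-up stand-in, not real `ceph fs status` output — on a live host substitute `ceph fs status cephfs_cscf-home -f json`:

```shell
# Sketch: check that a command's output is well-formed JSON.
# "sample" is a placeholder document, not real ceph output.
sample='{"mdsmap": [], "pools": []}'
if printf '%s' "$sample" | python3 -m json.tool > /dev/null 2>&1; then
  verdict="valid JSON"
else
  verdict="invalid JSON"
fi
echo "$verdict"
```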

Setup Steps

Raw notes: https://rt.uwaterloo.ca/Ticket/Display.html?id=969986

Note the hostnames below; some commands must be run on the salt-master, and some on one or all of the DFS machines. Install ceph packages + dependencies, and the cluster ssh key:

root@salt-204:~# salt -N dfs-c state.apply dfs-c --state-verbose=False test=True
root@salt-204:~# salt -N dfs-c state.apply dfs-c --state-verbose=False

Only if creating a new cluster:

root@m3-3101-422:~# cd /etc/ceph
root@m3-3101-422:/etc/ceph# ceph-deploy new dc-3558-422 mc-3015-422 m3-3101-422
root@m3-3101-422:/etc/ceph# ceph-deploy mon create-initial

# Copy admin keys to all hosts 
root@m3-3101-422:/etc/ceph# ceph-deploy admin dc-3558-422 mc-3015-422 m3-3101-422

# Only one mgr is active at a time, but for failover, all failure domains should have one
root@m3-3101-422:/etc/ceph# ceph-deploy mgr create dc-3558-422 mc-3015-422 m3-3101-422

# We need N MDS daemons per host, where N is the number of filesystems
# We use Salt to make them in parallel, and label them -A, -B, etc
root@salt-204:~# salt -N dfs-c cmd.run cwd=/etc/ceph 'ceph-deploy mds create $(hostname -s):$(hostname -s)-A' 

# This also needs to be run on all nodes, to deposit all keyrings in /etc/ceph:
root@salt-204:~# salt -N dfs-c cmd.run cwd=/etc/ceph 'ceph-deploy gatherkeys $(hostname -s)'

# CRUSH configuration
# Define rooms (on any Ceph host)
ceph osd crush add-bucket dc-3558 room
ceph osd crush add-bucket mc-3015 room
ceph osd crush add-bucket m3-3101 room

# Assign rooms to the root "default"
ceph osd crush move dc-3558 root=default
ceph osd crush move mc-3015 root=default
ceph osd crush move m3-3101 root=default

# Move machines into their rooms
ceph osd crush move m3-3101-422 room=m3-3101
ceph osd crush move dc-3558-422 room=dc-3558
ceph osd crush move mc-3015-422 room=mc-3015

# Confirm
ceph osd crush tree

# Define CRUSH rules for HDD and Optanes, using 'room' as the failure domain
ceph osd crush rule create-replicated slow-room default room hdd
ceph osd crush rule create-replicated fast-room default room ssd
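The per-room bucket/move commands above follow one pattern per room. A dry-run sketch that only prints the commands (room and host names are the ones used on this page):

```shell
# Sketch: generate the CRUSH bucket/move commands for each room.
# Echo only - drop the quoting/echo to run for real.
plan=$(for room in dc-3558 mc-3015 m3-3101; do
  echo "ceph osd crush add-bucket $room room"
  echo "ceph osd crush move $room root=default"
  echo "ceph osd crush move ${room}-422 room=$room"
done)
echo "$plan"
```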

If instead adding a new node to an existing cluster:

# Where dc-3558-422 is an existing node, and m3-3101-422 is the new one
root@m3-3101-422:~# cd /etc/ceph
root@m3-3101-422:/etc/ceph# ceph-deploy gatherkeys dc-3558-422
root@m3-3101-422:/etc/ceph# ceph-deploy mon create m3-3101-422
root@m3-3101-422:/etc/ceph# ceph-deploy mgr create m3-3101-422
root@m3-3101-422:/etc/ceph# ceph-deploy mds create m3-3101-422-A

Now that control software is in place, we add OSDs on each host to actually store data:

root@m3-3101-422:~# cd /etc/ceph

# Uses ceph-deploy to create all 18 HDD OSDs
# Will create 18 32GB LVs on vg_optane for OSD DB + WAL
root@m3-3101-422:/etc/ceph# ~/bin/osd.py $(hostname -s)

# Create LVs on vg_optane for the NVMe metadata OSDs - no extra devices
root@salt-204:~# salt -N dfs-c cmd.run cwd=/etc/ceph 'lvcreate -L 32G --name lv_osd_1 vg_optane'

Confirm status:

root@m3-3101-422:/etc/ceph# ceph status
root@m3-3101-422:/etc/ceph# ceph health detail

To create a new cephfs:

# metadata on NVMe OSDs
root@m3-3101-422:/etc/ceph# ceph osd pool create cephfs_metadata 16 replicated fast-room
# data on HDDs
root@m3-3101-422:/etc/ceph# ceph osd pool create cephfs_data 64 replicated slow-room

# This is default in the /etc/ceph/ceph.conf pushed by Salt, but for reference:
# Set both to size = 3 (keep 3 copies) and min_size = 2 (stop all IO if fewer than 2 copies remain)
root@m3-3101-422:~# ceph osd pool set cephfs_metadata size 3
root@m3-3101-422:~# ceph osd pool set cephfs_metadata min_size 2
root@m3-3101-422:~# ceph osd pool set cephfs_data size 3
root@m3-3101-422:~# ceph osd pool set cephfs_data min_size 2

# Filesystem ("fsname" is a placeholder; use the pool names created above):
root@m3-3101-422:~# ceph fs new cephfs_fsname cephfs_fsname_metadata cephfs_fsname_data

Confirm status again:

root@m3-3101-422:/etc/ceph# ceph status
root@m3-3101-422:/etc/ceph# ceph health detail

Mounting

Clients should be running Ceph 14 Nautilus for best results. Luminous (12) mostly works but is not recommended. 13 (Mimic) would probably work ok but has not been tested.

In /etc/fstab:

10.1.152.122,10.1.154.122,10.1.155.122:/        /mnt/ceph       ceph    defaults,noauto,name=admin,secretfile=/etc/ceph/ceph.admin.secret,mds_namespace=cephfs_bs_db,x-systemd.automount 0 0

  • name = the Cephx username for auth
  • secretfile = Cephx secret
  • mds_namespace = the filesystem name
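The option string in the fstab line above is assembled from those three pieces. A sketch using the example values from this page (substitute your own client name, secret file, and filesystem):

```shell
# Sketch: assemble the ceph mount options from their parts.
# Values are the examples from this page, not real credentials.
name=admin
secretfile=/etc/ceph/ceph.admin.secret
fsname=cephfs_bs_db
opts="name=$name,secretfile=$secretfile,mds_namespace=$fsname"
echo "$opts"
# Then: mount -t ceph -o "$opts" <mon addrs>:/ /mnt/ceph
```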

Manual mounting:

root@ubuntu1804-2000:~# mount -t ceph -o name=jimmylin,secretfile=/etc/ceph/ceph.client.jimmylin.secret,mds_namespace=cephfs_jimmylin  10.1.152.122:/        /mnt/ceph     

Reference documentation

Ceph Bugs / Our contributions

Bugs submitted by nfish:

-- NathanFish - 2019-04-16

Topic revision: r31 - 2021-04-15 - DevonMerner
 