The beauty in Ceph’s modularity, replication, and self-healing mechanisms

Prerequisites

  • A running Ceph cluster (Luminous/RHCS 3.0 or later)

Introduction

  • OSD — Object Storage Daemon, the process responsible for reading and writing our data on a disk. There is usually a 1:1 ratio between an OSD process and a disk (each process reads from and writes to one disk only).
  • OSD Backend — Defines how the OSD process interacts with its underlying storage, whether that is an LVM volume, an entire disk, or a partition. Earlier versions used Filestore, which relied on a journal device acting as a persistent write cache and a filesystem partition covering the whole disk for data. Filestore is now deprecated; current versions use Bluestore, which is LVM-based and avoids filesystem overhead by writing data directly to a block device and keeping metadata on a dedicated LVM volume.
  • Objectstore API — A unified API that allows Bluestore and Filestore OSDs to run in the same cluster and share data. Because both backends implement the same API, data can move between them without any problem, even though their implementations are completely different.
  • CRUSH map — Responsible for holding the location of every data component in the cluster. This map tells Ceph how it should treat our data, where it should store it, and what it needs to do when a failure occurs.
  • CRUSH rule — Tells Ceph which protection strategy to use (erasure coding or replication), where to store the data (which devices and servers), and how.
  • CRUSH bucket — A container that virtually groups a set of devices or other buckets (hosts, racks, disks, datacenters, etc.). A few commands for inspecting the CRUSH map, rules, and buckets on your own cluster are sketched right after this list.
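These are standard Ceph commands that should work on any reasonably recent release; the crushmap.bin/crushmap.txt file names are just examples, and the output will of course differ from the cluster used in this post:

$ ceph osd crush rule ls                      # list the CRUSH rules defined in the cluster
$ ceph osd crush rule dump replicated_rule    # show a single rule in detail
$ ceph osd crush tree                         # show the bucket hierarchy (roots, hosts, OSDs)
$ ceph osd getcrushmap -o crushmap.bin        # export the binary CRUSH map
$ crushtool -d crushmap.bin -o crushmap.txt   # decompile it into a readable text file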
$ ceph osd tree
ID  CLASS WEIGHT  TYPE NAME        STATUS REWEIGHT PRI-AFF
-15 0.28738 root destination
-7 0.09579 host osd3
2 hdd 0.04790 osd.2 up 1.00000 1.00000
8 hdd 0.04790 osd.8 up 1.00000 1.00000
-11 0.09579 host osd4
5 hdd 0.04790 osd.5 up 1.00000 1.00000
9 hdd 0.04790 osd.9 up 1.00000 1.00000
-9 0.09579 host osd5
4 hdd 0.04790 osd.4 up 1.00000 1.00000
10 hdd 0.04790 osd.10 up 1.00000 1.00000
-1 0.28738 root default
-3 0.09579 host osd0
1 hdd 0.04790 osd.1 up 1.00000 1.00000
6 hdd 0.04790 osd.6 up 1.00000 1.00000
-13 0.09579 host osd1
0 hdd 0.04790 osd.0 up 1.00000 1.00000
7 hdd 0.04790 osd.7 up 1.00000 1.00000
-5 0.09579 host osd2
3 hdd 0.04790 osd.3 up 1.00000 1.00000
11 hdd 0.04790 osd.11 up 1.00000 1.00000
$ ceph osd pool create bench 128 128
pool 'bench' created
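Before changing anything, it can help to confirm which CRUSH rule and replication size the new pool picked up by default. These are standard commands, run against the bench pool created above:

$ ceph osd pool get bench crush_rule   # which CRUSH rule the pool currently uses
$ ceph osd pool get bench size         # replication factor of the pool
$ ceph osd pool ls detail              # full per-pool settings, including pg_num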
$ ceph osd df -f json-pretty | jq '.nodes[0:6][].pgs'
0
0
0
0
0
0
$ ceph osd df -f json-pretty | jq '.nodes[6:12][].pgs'
83
77
68
92
76
84
$ rados bench -p bench -o 4096 -t 16 200 write --no-cleanup
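The benchmark is run with --no-cleanup so that the objects it writes stay in the pool and give us data to migrate. Once you are finished experimenting, they can be removed with the rados cleanup subcommand (a small sketch, assuming a recent rados client):

$ rados -p bench cleanup    # removes the objects written by 'rados bench ... --no-cleanup'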
$ ceph -s
cluster:
id: 6c701fa4-15b3-4276-b252-c22591ea5410
health: HEALTH_OK

services:
mon: 1 daemons, quorum mon0 (age 53m)
mgr: mon0(active, since 43m)
osd: 12 osds: 12 up (since 49m), 12 in (since 49m)
rgw: 1 daemon active (mon0.rgw0)

task status:

data:
pools: 5 pools, 160 pgs
objects: 68.64k objects, 267 MiB
usage: 25 GiB used, 563 GiB / 588 GiB avail
pgs: 160 active+clean
$ ceph osd crush rule create-replicated replicated_destination destination host hdd
$ ceph osd crush dump -f json-pretty | jq '.rules'
[
{
"rule_id": 0,
"rule_name": "replicated_rule",
"ruleset": 0,
"type": 1,
"min_size": 1,
"max_size": 10,
"steps": [
{
"op": "take",
"item": -1,
"item_name": "default~hdd"
},
{
"op": "chooseleaf_firstn",
"num": 0,
"type": "host"
},
{
"op": "emit"
}
]
},
{
"rule_id": 1,
"rule_name": "replicated_destination",
"ruleset": 1,
"type": 1,
"min_size": 1,
"max_size": 10,
"steps": [
{
"op": "take",
"item": -16,
"item_name": "destination~hdd"
},
{
"op": "chooseleaf_firstn",
"num": 0,
"type": "host"
},
{
"op": "emit"
}
]
}
]
$ ceph osd pool set bench crush_rule replicated_destination
set pool 5 crush_rule to replicated_destination
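Changing the pool's CRUSH rule immediately triggers data movement, so it is worth keeping an eye on recovery while it runs; any of the usual monitoring commands will do (shown here as generic examples):

$ ceph -w                      # stream cluster events, including recovery progress
$ ceph pg ls-by-pool bench     # per-PG state for the bench pool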
$ ceph -s
data:
pools: 5 pools, 160 pgs
objects: 68.64k objects, 267 MiB
usage: 25 GiB used, 563 GiB / 588 GiB avail
pgs: 0.625% pgs not active
133332/205929 objects degraded (64.747%)
70281/205929 objects misplaced (34.129%)
124 active+recovery_wait+undersized+degraded+remapped
32 active+clean
2 active+recovering+undersized+remapped
1 remapped+peering
1 active+recovering+undersized+degraded+remapped
$ ceph tell osd.* injectargs '--osd-max-backfills 10'
osd.0: osd_max_backfills = '10'
osd.1: osd_max_backfills = '10'
osd.2: osd_max_backfills = '10'
osd.3: osd_max_backfills = '10'
osd.4: osd_max_backfills = '10'
osd.5: osd_max_backfills = '10'
osd.6: osd_max_backfills = '10'
osd.7: osd_max_backfills = '10'
osd.8: osd_max_backfills = '10'
osd.9: osd_max_backfills = '10'
osd.10: osd_max_backfills = '10'
osd.11: osd_max_backfills = '10'
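Raising osd-max-backfills trades client performance for recovery speed, so once the cluster is back to HEALTH_OK (as in the next output) it is reasonable to drop it back to its default. On these releases the default is 1, but treat that value as an assumption and check your own configuration:

$ ceph tell osd.* injectargs '--osd-max-backfills 1'   # restore the (assumed) default backfill limit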
$ ceph -s
cluster:
id: 6c701fa4-15b3-4276-b252-c22591ea5410
health: HEALTH_OK

services:
mon: 1 daemons, quorum mon0 (age 2h)
mgr: mon0(active, since 2h)
osd: 12 osds: 12 up (since 2h), 12 in (since 2h)
rgw: 1 daemon active (mon0.rgw0)

task status:

data:
pools: 5 pools, 160 pgs
objects: 68.64k objects, 267 MiB
usage: 26 GiB used, 562 GiB / 588 GiB avail
pgs: 160 active+clean
$ ceph osd df -f json-pretty | jq '.nodes[0:6][].pgs'
81
79
76
84
88
72
$ ceph osd df -f json-pretty | jq '.nodes[6:12][].pgs'
0
0
0
0
0
0
$ ceph mgr module enable balancer
$ ceph balancer on
$ ceph osd set-require-min-compat-client luminous
set require_min_compat_client to luminous
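The upmap mode only works when every connected client speaks the Luminous protocol or newer, which is why require-min-compat-client is raised here. If you are unsure what your clients support, you can check before flipping the switch:

$ ceph features    # shows the release/feature level of connected clients and daemons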
$ ceph balancer mode upmap
$ ceph balancer off && ceph balancer on
$ ceph balancer status
{
"last_optimize_duration": "0:00:00.003862",
"plans": [],
"mode": "upmap",
"active": true,
"optimize_result": "Optimization plan created successfully",
"last_optimize_started": "Wed May 6 15:30:07 2020"
}
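Besides the status output, the balancer can score how even the current distribution is (lower is better); this is a standard command and works per pool as well:

$ ceph balancer eval           # score the cluster-wide PG distribution
$ ceph balancer eval bench     # score the distribution of the bench pool only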
$ ceph osd df -f json-pretty | jq '.nodes[0:6][].pgs'
80
80
80
80
80
80
$ ceph osd crush move osd0 root=destination
moved item id -3 name 'osd0' to location {root=destination} in crush map
$ ceph osd crush move osd1 root=destination
moved item id -13 name 'osd1' to location {root=destination} in crush map
$ ceph osd crush move osd2 root=destination
moved item id -5 name 'osd2' to location {root=destination} in crush map
$ ceph osd tree
ID  CLASS WEIGHT  TYPE NAME        STATUS REWEIGHT PRI-AFF
-15 0.57477 root destination
-3 0.09579 host osd0
0 hdd 0.04790 osd.0 up 1.00000 1.00000
6 hdd 0.04790 osd.6 up 1.00000 1.00000
-13 0.09579 host osd1
3 hdd 0.04790 osd.3 up 1.00000 1.00000
11 hdd 0.04790 osd.11 up 1.00000 1.00000
-5 0.09579 host osd2
4 hdd 0.04790 osd.4 up 1.00000 1.00000
9 hdd 0.04790 osd.9 up 1.00000 1.00000
-11 0.09579 host osd3
2 hdd 0.04790 osd.2 up 1.00000 1.00000
8 hdd 0.04790 osd.8 up 1.00000 1.00000
-7 0.09579 host osd4
5 hdd 0.04790 osd.5 up 1.00000 1.00000
10 hdd 0.04790 osd.10 up 1.00000 1.00000
-9 0.09579 host osd5
1 hdd 0.04790 osd.1 up 1.00000 1.00000
7 hdd 0.04790 osd.7 up 1.00000 1.00000
-1 0 root default
$ ceph osd df -f json-pretty | jq '.nodes[0:12][].pgs'
31
48
34
43
48
37
37
42
40
42
45
33
$ ceph osd df -f json-pretty | jq '.nodes[0:12][].pgs'
39
40
39
40
41
40
39
40
40
41
40
39
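As a final check, the same information can be read without jq; the tree variant of osd df groups utilization and PG counts by CRUSH bucket, which makes the per-host balance easy to see:

$ ceph osd df tree    # utilization and PG counts grouped by the CRUSH hierarchy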

Conclusion

With nothing more than a new CRUSH rule, an entire pool was moved from one set of hosts to another; Ceph recovered the data on its own and returned to HEALTH_OK, and after the hosts were moved between roots the balancer's upmap mode evened out the PG distribution across all OSDs. That, in short, is the beauty of Ceph's modularity, replication, and self-healing mechanisms.

Shon Paz

Sr. Solution Architect, Red Hat