The beauty in Ceph’s modularity, replication, and self-healing mechanisms

Shon Paz
9 min read · May 6, 2020

Ceph is a distributed storage system, and most people treat it as a very complex one, full of components that need to be managed. A lot of hard work has gone into turning Ceph into what it is today: a portable, resilient, performant, self-healing storage system. With Ceph you can easily have millions or even billions of objects moving around the cluster to reach the desired state. The fact that Ceph is a Software-Defined Storage system makes us more flexible with the hardware we choose, the operating system we pick, and even the location the servers are in. For example, we can have a single Ceph cluster running both RHEL and CentOS operating systems in a hybrid way, on different racks or even different geo-locations (not recommended).

Today I want to talk with you guys about the beauty of Ceph's modularity. We'll see how we can move ~70,000 objects between servers running different OSD backends (Filestore and Bluestore, explained below) live, without any "maintenance" window; we just move the data between servers while customers continue working with our Ceph cluster. We'll see how we can throttle the migration process, improve our performance by using some new features, and eventually have our entire data set located on a whole new set of servers in a different location (a rack, a datacenter, or even a region).

Prerequisites

  • A running Ceph cluster (Luminous/RHCS 3.0 or later)

Introduction

First of all, let’s have a short recap on Ceph’s components:

  • OSD — Object Storage Daemon, a process responsible for writing our data to disk, usually with a 1:1 ratio between OSD processes and disks (each process reads from and writes to one disk only).
  • OSD Backend — Determines how the OSD performs I/O against the disk, whether it's an LVM volume, an entire disk, or a partition that the OSD process needs to interact with. In earlier versions Filestore was used (based on a journal device acting as a "persistent write cache", with a filesystem partition on the whole disk for storing the data); in current versions Filestore is deprecated and Bluestore is used instead (based on LVM, avoiding filesystem overhead by writing data to a raw block device and metadata to a dedicated LVM). A quick way to check which backend a given OSD runs is shown right after this list.
  • Objectstore API — A unified API that makes it possible to run both Bluestore and Filestore OSDs in the same cluster, sharing data. Since both backends implement the same API, data can be moved between them without any problem, even though their backend implementations are completely different.
  • CRUSH map — Responsible for holding the location of every data component in the cluster. This map tells Ceph how it should treat our data, where it should store it, and what it needs to do when a failure occurs.
  • CRUSH rule — Tells Ceph which protection strategy to use (EC/Replica), where to store the data (which devices and servers), and how.
  • CRUSH bucket — A container that virtually holds a set of devices or other buckets (hosts, racks, disks, datacenters, etc.).
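
If you want to see which backend a given OSD is actually running, its daemon metadata exposes that. A quick check, using osd.2 from this demo as an example (the exact fields vary slightly by release, but you should see something like "osd_objectstore": "bluestore"):

$ ceph osd metadata 2 | grep osd_objectstore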

Let’s start by looking at our OSD tree, which shows that we have 2 root buckets, meaning two separate virtual containers in this cluster. I have added the destination root bucket with 3 hosts, where each host has 2 OSDs. These hosts contain Bluestore OSDs, while the default hosts contain Filestore OSDs; the Objectstore API is what will let data move between them (we'll double-check that backend split right after the tree output).

$ ceph osd tree
ID  CLASS WEIGHT  TYPE NAME        STATUS REWEIGHT PRI-AFF
-15 0.28738 root destination
-7 0.09579 host osd3
2 hdd 0.04790 osd.2 up 1.00000 1.00000
8 hdd 0.04790 osd.8 up 1.00000 1.00000
-11 0.09579 host osd4
5 hdd 0.04790 osd.5 up 1.00000 1.00000
9 hdd 0.04790 osd.9 up 1.00000 1.00000
-9 0.09579 host osd5
4 hdd 0.04790 osd.4 up 1.00000 1.00000
10 hdd 0.04790 osd.10 up 1.00000 1.00000
-1 0.28738 root default
-3 0.09579 host osd0
1 hdd 0.04790 osd.1 up 1.00000 1.00000
6 hdd 0.04790 osd.6 up 1.00000 1.00000
-13 0.09579 host osd1
0 hdd 0.04790 osd.0 up 1.00000 1.00000
7 hdd 0.04790 osd.7 up 1.00000 1.00000
-5 0.09579 host osd2
3 hdd 0.04790 osd.3 up 1.00000 1.00000
11 hdd 0.04790 osd.11 up 1.00000 1.00000
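
To double-check the Filestore/Bluestore split described above, Ceph can also count OSDs by their reported backend. Assuming a Luminous or newer cluster, this should report six of each:

$ ceph osd count-metadata osd_objectstore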

Let’s create a pool so that our OSDs will have some PGs distributed among them:

$ ceph osd pool create bench 128 128
pool 'bench' created
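
A small side note: on Luminous and later, a freshly created pool with no application tag can raise a POOL_APP_NOT_ENABLED health warning. If that bothers you, tag the pool; the application name here is just a label I chose for this benchmark pool, not something required by rados bench:

$ ceph osd pool application enable bench rbd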

Now let’s look at the first 6 OSDs (the destination hosts' OSDs). We see that they have 0 PGs, which means they won't get any data when we write to the cluster. This happens because the pool's CRUSH rule tells Ceph to store data only on the default OSDs (we'll confirm that right after this output, and inspect the rule itself later):

$ ceph osd df -f json-pretty | jq '.nodes[0:6][].pgs'
0
0
0
0
0
0
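
To confirm which CRUSH rule the pool is currently using (and therefore why the destination OSDs get nothing), you can ask the pool directly; at this point it should still report the default rule:

$ ceph osd pool get bench crush_rule
crush_rule: replicated_rule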

Now let’s look at the second set of 6 OSDs (the default hosts' OSDs). We see that they do have PGs, which means those OSDs will receive the data when we first write to the cluster:

$ ceph osd df -f json-pretty | jq '.nodes[6:12][].pgs'
83
77
68
92
76
84

Let’s write some data to the cluster to fill it up a bit. We’ll use the rados bench tool to write 4KiB objects for 200 seconds, without cleaning up the data after the write phase has finished:

$ rados bench -p bench -o 4096 -t 16 200 write --no-cleanup
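
If you want to sanity-check how many objects the benchmark has written so far, rados df prints a quick per-pool summary that includes the object count:

$ rados df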

Now if we look at the cluster, we see that we have ~70,000 objects written to the cluster (don’t mind the capacity, I had some data before starting this demo):

$ ceph -s
cluster:
id: 6c701fa4-15b3-4276-b252-c22591ea5410
health: HEALTH_OK

services:
mon: 1 daemons, quorum mon0 (age 53m)
mgr: mon0(active, since 43m)
osd: 12 osds: 12 up (since 49m), 12 in (since 49m)
rgw: 1 daemon active (mon0.rgw0)

task status:

data:
pools: 5 pools, 160 pgs
objects: 68.64k objects, 267 MiB
usage: 25 GiB used, 563 GiB / 588 GiB avail
pgs: 160 active+clean

Let’s create a new CRUSH rule that says data should reside under the root bucket called destination, with host as the failure domain and hdd as the device class (all of the devices in this demo are HDDs; the only difference is the root bucket). The pool keeps its default replication factor, which is 3. Eventually, Ceph will use the destination root bucket's resources to satisfy the desired end state.

$ ceph osd crush rule create-replicated replicated_destination destination host hdd
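
For reference, the general form of this command is the following, where the device class argument is optional:

$ ceph osd crush rule create-replicated <rule-name> <root-bucket> <failure-domain-type> [<device-class>]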

Let’s validate that the new CRUSH rule has been created, and compare the two rules. We see that under item_name each rule points at a different root bucket. Of course, creating this CRUSH rule doesn't do anything yet, because the pool still uses the old CRUSH rule:

$ ceph osd crush dump -f json-pretty | jq '.rules'
[
{
"rule_id": 0,
"rule_name": "replicated_rule",
"ruleset": 0,
"type": 1,
"min_size": 1,
"max_size": 10,
"steps": [
{
"op": "take",
"item": -1,
"item_name": "default~hdd"
},
{
"op": "chooseleaf_firstn",
"num": 0,
"type": "host"
},
{
"op": "emit"
}
]
},
{
"rule_id": 1,
"rule_name": "replicated_destination",
"ruleset": 1,
"type": 1,
"min_size": 1,
"max_size": 10,
"steps": [
{
"op": "take",
"item": -16,
"item_name": "destination~hdd"
},
{
"op": "chooseleaf_firstn",
"num": 0,
"type": "host"
},
{
"op": "emit"
}
]
}
]

Let’s change the bench pool’s CRUSH rule. By changing this value we tell Ceph to move all the data from the old servers to the new ones; to be precise, from the Filestore OSDs to the Bluestore ones:

$ ceph osd pool set bench crush_rule replicated_destination
set pool 5 crush_rule to replicated_destination
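
As soon as the rule changes, Ceph starts remapping PGs to the new bucket. If you want a pool-level view of that movement, you can watch the client and recovery I/O rates for just this pool while it happens (the numbers will fluctuate during backfill):

$ ceph osd pool stats bench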

Let’s look at the cluster’s status to see if things have started to move:

$ ceph -s
data:
pools: 5 pools, 160 pgs
objects: 68.64k objects, 267 MiB
usage: 25 GiB used, 563 GiB / 588 GiB avail
pgs: 0.625% pgs not active
133332/205929 objects degraded (64.747%)
70281/205929 objects misplaced (34.129%)
124 active+recovery_wait+undersized+degraded+remapped
32 active+clean
2 active+recovering+undersized+remapped
1 remapped+peering
1 active+recovering+undersized+degraded+remapped

We see that there are objects in degraded and misplaced states, which means Ceph has understood that the data should be moved to the new servers (don't mind the inactive PGs, I caught them in the peering state). Now let's speed things up a bit by telling Ceph to migrate more PGs in parallel from the old servers to the new ones. This value throttles the migration: setting it to 1 keeps the migration slow so customers are barely impacted, while a higher value finishes the migration faster at the cost of more recovery traffic:

$ ceph tell osd.* injectargs '--osd-max-backfills 10'
osd.0: osd_max_backfills = '10'
osd.1: osd_max_backfills = '10'
osd.2: osd_max_backfills = '10'
osd.3: osd_max_backfills = '10'
osd.4: osd_max_backfills = '10'
osd.5: osd_max_backfills = '10'
osd.6: osd_max_backfills = '10'
osd.7: osd_max_backfills = '10'
osd.8: osd_max_backfills = '10'
osd.9: osd_max_backfills = '10'
osd.10: osd_max_backfills = '10'
osd.11: osd_max_backfills = '10'
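
Keep in mind that 10 concurrent backfills per OSD is fairly aggressive. Once the migration is done, or if client latency starts to suffer, it's worth dropping the throttle back to its default of 1 using the same mechanism:

$ ceph tell osd.* injectargs '--osd-max-backfills 1'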

Let’s verify that the data migration process has finished and the new servers are now holding the data:

$ ceph -s
cluster:
id: 6c701fa4-15b3-4276-b252-c22591ea5410
health: HEALTH_OK

services:
mon: 1 daemons, quorum mon0 (age 2h)
mgr: mon0(active, since 2h)
osd: 12 osds: 12 up (since 2h), 12 in (since 2h)
rgw: 1 daemon active (mon0.rgw0)

task status:

data:
pools: 5 pools, 160 pgs
objects: 68.64k objects, 267 MiB
usage: 26 GiB used, 562 GiB / 588 GiB avail
pgs: 160 active+clean

It seems the cluster is in a healthy state; let’s verify that the new OSDs now hold the data:

$ ceph osd df -f json-pretty | jq '.nodes[0:6][].pgs'
81
79
76
84
88
72

Let’s check it for the old servers too:

$ ceph osd df -f json-pretty | jq '.nodes[6:12][].pgs'
0
0
0
0
0
0

Now that we have our data fully migrated, let’s use the balancer feature to create an even distribution of the PGs among the OSDs. By default, CRUSH distributes the PGs between the OSDs so that each PG is backed by a set of OSDs (depending on the protection strategy and the replication factor), but that distribution is not perfectly even. Let's enable the balancer and create a plan:

$ ceph mgr module enable balancer
$ ceph balancer on
$ ceph osd set-require-min-compat-client luminous
set require_min_compat_client to luminous
$ ceph balancer mode upmap
$ ceph balancer off && ceph balancer on
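
Two quick sanity checks are useful around this step. The upmap mode only works when every connected client speaks Luminous or newer, which is exactly what set-require-min-compat-client enforces, and you can see what is currently connected with ceph features. You can also ask the balancer to score the current PG distribution before it starts optimizing (a lower score is better). Both commands are read-only, assuming a Luminous or newer cluster:

$ ceph features
$ ceph balancer eval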

After enabling automatic balancing (the balancer will run whenever it detects imbalanced PGs in the cluster), let’s check the balancer’s status:

$ ceph balancer status
{
"last_optimize_duration": "0:00:00.003862",
"plans": [],
"mode": "upmap",
"active": true,
"optimize_result": "Optimization plan created successfully",
"last_optimize_started": "Wed May 6 15:30:07 2020"
}

After the distribution process has finished, we see that we now have an even number of PGs per OSD. This can improve your performance dramatically, since it prevents imbalanced utilization across the disks and keeps the cluster in harmony:

$ ceph osd df -f json-pretty | jq '.nodes[0:6][].pgs'
80
80
80
80
80
80

Now that we have a perfect new environment, let’s throw the old servers away. NO! The whole point of software-defined storage is to reuse our hardware, so let’s scale out the new environment instead. To do so, I have wiped the old servers, recreated their OSDs as Bluestore, and added them back to the cluster.
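
For reference, tearing down a Filestore OSD and redeploying its disk as Bluestore typically goes through ceph-volume. This is only a rough sketch, the OSD id and device path below are placeholders and not taken from this demo, so adapt it to your own deployment tooling:

$ ceph osd purge <osd-id> --yes-i-really-mean-it     # run after stopping the OSD daemon
$ ceph-volume lvm zap /dev/sdX --destroy             # /dev/sdX is a placeholder device
$ ceph-volume lvm create --bluestore --data /dev/sdX

With the hosts redeployed as Bluestore, let's move them into the destination root bucket: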

$ ceph osd crush move osd0 root=destination
moved item id -3 name 'osd0' to location {root=destination} in crush map
$ ceph osd crush move osd1 root=destination
moved item id -13 name 'osd1' to location {root=destination} in crush map
$ ceph osd crush move osd2 root=destination
moved item id -5 name 'osd2' to location {root=destination} in crush map

After moving the servers, let’s verify they were indeed transferred to the destination root bucket. By moving them into the new root bucket, we let Ceph know that it has new devices to use, so it will try to rebalance the data onto those extra disks:

$ ceph osd tree
ID  CLASS WEIGHT  TYPE NAME        STATUS REWEIGHT PRI-AFF
-15 0.57477 root destination
-3 0.09579 host osd0
0 hdd 0.04790 osd.0 up 1.00000 1.00000
6 hdd 0.04790 osd.6 up 1.00000 1.00000
-13 0.09579 host osd1
3 hdd 0.04790 osd.3 up 1.00000 1.00000
11 hdd 0.04790 osd.11 up 1.00000 1.00000
-5 0.09579 host osd2
4 hdd 0.04790 osd.4 up 1.00000 1.00000
9 hdd 0.04790 osd.9 up 1.00000 1.00000
-11 0.09579 host osd3
2 hdd 0.04790 osd.2 up 1.00000 1.00000
8 hdd 0.04790 osd.8 up 1.00000 1.00000
-7 0.09579 host osd4
5 hdd 0.04790 osd.5 up 1.00000 1.00000
10 hdd 0.04790 osd.10 up 1.00000 1.00000
-9 0.09579 host osd5
1 hdd 0.04790 osd.1 up 1.00000 1.00000
7 hdd 0.04790 osd.7 up 1.00000 1.00000
-1 0 root default

Now let’s verify that all 12 OSDs have PGs:

$ ceph osd df -f json-pretty | jq '.nodes[0:12][].pgs'
31
48
34
43
48
37
37
42
40
42
45
33

Great! But we haven’t finished; let’s take a few minutes and let the balancer do its magic. After the balancer finishes, we see that our PGs are distributed evenly across more devices. This can significantly help your performance: we now have more spindles and more servers, and our workloads are evenly spread across those disks!

$ ceph osd df -f json-pretty | jq '.nodes[0:12][].pgs'
39
40
39
40
41
40
39
40
40
41
40
39

Conclusion

We saw how we can take advantage of Ceph’s portability, replication, and self-healing mechanisms to create a harmonic cluster that moves data between locations, servers, and OSD backends without customers even having to know that we touched their data. We also saw how we can tune performance by using the balancer feature, and how we can throttle the migration process to prevent customer-facing performance issues. I hope this article has convinced you that Ceph is more than just a complex distributed storage system. Hope you have enjoyed this article, see you next time :)

Shon Paz

Sr. Solution Architect, Red Hat