Ceph deployment (building up the cluster)

Here is a simple tutorial to help you with the initial installation and configuration of a Ceph cluster, which provides distributed storage for your systems.
More information about Ceph is available on its web site.

Quick overview of Ceph

Ceph supports three kinds of storage access:
  1. The Ceph filesystem: a filesystem that you can mount remotely using mount.ceph, like any other remote filesystem (NFS, CIFS, ...).
  2. Block storage access: through the so-called RADOS Block Device (RBD) layer, the Ceph cluster presents a block device to the host, on top of which you can put your preferred filesystem (Ext4, XFS, ...) or which you can use as a physical volume in an LVM volume group, for instance.
  3. Object storage access: this implements a REST API compatible with Amazon S3 and OpenStack Swift to store and access files as independent objects.
Ceph consists of several components that have to be configured for the distributed storage to be fully functional:
  1. Ceph monitor (MON): maintains maps of the cluster state. Ceph maintains a history (called an “epoch”) of each state change in the Ceph Monitors, Ceph OSD Daemons, and Placement Groups.
  2. Ceph metadata server (MDS): stores metadata for the Ceph filesystem. Object storage and block storage do not need this component.
  3. Ceph OSD: the daemon that actually stores the data, handles its replication, provides information to the monitors, ... A Ceph Storage Cluster requires at least two Ceph OSD Daemons to achieve an active + clean state when the cluster makes two copies of your data (Ceph makes 2 copies by default, but you can adjust it).

[Figure: Ceph architecture diagram, from the official Ceph architecture presentation]


Implementing Ceph

At the time of writing, the latest version of Ceph was code-named Hammer (version 0.94.3). This tutorial has been tested on Ubuntu 14.04.3 LTS (Trusty).

Preparation & prerequisites

Install 4 machines running Ubuntu Trusty.
The nodes of the Ceph cluster will be named ceph1, ceph2 and ceph3.
The fourth node will be called admin.
Each node has a single network interface. Setting up a Ceph cluster where the nodes have more than one interface is possible, but needs some more tweaking and configuration.
On each machine, you need a non-root account that is allowed to run any command as root through sudo.
I have a user called benoit on each machine that I will use in this tutorial. This is the user the Ubuntu installer asked me to create when installing the machine for the first time, so by default it has all the rights needed to perform sudo actions.
If you want to create a new user manually after installation, here are the steps:

$ sudo useradd [-s <path to a shell>] -m [-d <path to home dir>] -G sudo <username>
$ sudo passwd <username>

By default on Ubuntu, a group called sudo is created, and any member of this group has the right to use sudo on any command.
Once the user is created, log in as this user, which will become your Ceph administration user.

Be sure that from the admin node you can log in to the 3 other nodes using SSH with a non-root account (benoit in my case) and without being prompted for a password.
See my tutorial about SSH passwordless connections on the same site.
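
A quick way to confirm this before going further is to try a non-interactive login to each node. The following is only a sketch: the function name check_ssh_nodes is illustrative, and the node names are the ones assumed in this tutorial. The BatchMode=yes option makes ssh fail instead of prompting for a password.

```shell
# Report, for each node, whether a passwordless SSH login works.
# BatchMode=yes disables password prompts, so a node that still needs
# a password is reported as FAILED instead of blocking the script.
check_ssh_nodes() {
    failed=0
    for node in "$@"; do
        if ssh -n -o BatchMode=yes -o ConnectTimeout=5 "$node" true 2>/dev/null; then
            echo "$node: OK"
        else
            echo "$node: FAILED"
            failed=1
        fi
    done
    return "$failed"
}

# Usage on the admin node (node names from this tutorial):
#   check_ssh_nodes ceph1 ceph2 ceph3
```

The function returns a non-zero status as soon as one node fails, so it can be used as a guard before running ceph-deploy.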

Part 1: ceph-deploy

We will use this nice tool to deploy our Ceph cluster from a single location, our administration node. Be sure that from this node you can open an SSH connection to each node (ceph1, ceph2 and ceph3) with your user and without being prompted for a password; otherwise ceph-deploy actions may fail.
On admin, we are going to add an APT repository to get the latest version of the Ceph software.

benoit@admin:~$ wget -q -O- 'https://git.ceph.com/git/?p=ceph.git;a=blob_plain;f=keys/release.asc' | sudo apt-key add -
benoit@admin:~$ sudo apt-add-repository "deb http://download.ceph.com/debian-hammer $(lsb_release -sc) main"
benoit@admin:~$ sudo apt-get update
benoit@admin:~$ sudo apt-get dist-upgrade
benoit@admin:~$ sudo apt-get install ceph-deploy

This will install ceph-deploy version 1.5.28 on Trusty.
Remark: there are currently no packages for the Ubuntu versions Utopic Unicorn (14.10) and Vivid Vervet (15.04) on the Ceph download server.
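
Note that the distribution code name has to be expanded by the shell before it reaches apt-add-repository: inside single quotes, $(lsb_release -sc) would be passed literally. As a small sketch (the helper name ceph_repo_line is hypothetical, not part of any Ceph tool), the repository line can be built and inspected first:

```shell
# Build the APT source line for a given Ceph release and Ubuntu code name.
ceph_repo_line() {
    release=$1    # e.g. hammer
    codename=$2   # e.g. trusty, as printed by `lsb_release -sc`
    echo "deb http://download.ceph.com/debian-$release $codename main"
}

# Single quotes keep the command substitution literal ...
echo 'codename: $(echo trusty)'
# ... while double quotes let the shell expand it.
echo "codename: $(echo trusty)"

# Usage on the admin node:
#   sudo apt-add-repository "$(ceph_repo_line hammer "$(lsb_release -sc)")"
```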
In your home directory on the admin node, create a directory where ceph-deploy will store a few files, such as the administration authentication keys. I call this directory ceph-deploy, and it is located under /home/benoit in my case.
Move into this directory to perform the remaining operations of this tutorial.

benoit@admin:~$ mkdir ceph-deploy
benoit@admin:~$ cd ceph-deploy

Remark: do not call ceph-deploy with sudo or run it as root if you are logged in as a different user, because it will then not issue the sudo commands needed on the remote hosts.
Remark: to open its SSH connections, ceph-deploy by default issues the SSH commands without a username, so the name of the currently logged-in user is used. You can override this:

  1. by using the “--username <user>” parameter of the ceph-deploy command
  2. by defining a ~/.ssh/config file where the username to use is specified:
Host ceph1
Hostname ceph1
User benoit

See the man page of ssh_config for more information about the parameters you can use.
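
With several nodes, the same stanza has to be repeated for each host. As an illustration only (the function name ssh_config_entries is made up for this example), the entries can be generated like this:

```shell
# Print one ssh_config stanza per node, all using the same login user.
ssh_config_entries() {
    user=$1
    shift
    for node in "$@"; do
        printf 'Host %s\n    Hostname %s\n    User %s\n' "$node" "$node" "$user"
    done
}

# Usage (review the output before appending it to ~/.ssh/config):
#   ssh_config_entries benoit ceph1 ceph2 ceph3 >> ~/.ssh/config
```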

Part 2: setup of the cluster (MON & OSD processes)

We are now ready to set up the Ceph cluster that will run on 3 nodes.
We are going to install a MON process on the nodes named in the ceph-deploy new command below. The MON process (the monitor), as its name indicates, is used to monitor the status of the cluster and all its components. Best practices recommend running more than one instance of it in the cluster for resilience.
The OSD processes are the ones that read and write data. Each OSD process manages its own place to store data, which can be a location on an existing filesystem or a raw disk partition. Each OSD process also manages a so-called journal associated with the data storage itself. This journal can be placed in a completely different location. For instance, one can use “cheap, slow, big disks” (e.g. SATA disks) to store the data but “expensive, fast, small disks” (e.g. SSDs) to store the journal, so that overall performance is not impacted too much by the journaling operations.

1. Cluster initialization, declaring the initial monitor nodes:

benoit@admin:~/ceph-deploy$ ceph-deploy new ceph1 ceph3

You can follow what is done on screen.
Once this operation is complete, you should find the following files in the current directory: ceph.log, ceph.conf and ceph.mon.keyring.
These 2 nodes will be the initial monitor nodes of the cluster.

2. Software installation on the 3 nodes:

benoit@admin:~/ceph-deploy$ ceph-deploy install ceph1 ceph2 ceph3

3. Configuration of the initial monitors – the initial monitor(s) are the nodes that were specified with the ceph-deploy new command

benoit@admin:~/ceph-deploy$ ceph-deploy mon create-initial

Once this process is complete, you should find the following files in the current directory: ceph.client.admin.keyring, ceph.bootstrap-osd.keyring and ceph.bootstrap-mds.keyring.
Remark: in versions of ceph-deploy before 1.1.3, there is no “mon create-initial” command, but two separate steps: “mon create <node name>” and “gatherkeys <node name>”.
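
Before moving on, it can be worth verifying that the files mentioned above really exist in the working directory. A minimal sketch, assuming the file names listed in this tutorial (the function name missing_files is only illustrative):

```shell
# Print the names of expected files that are missing from a directory.
# Returns 0 when everything is present, 1 otherwise.
missing_files() {
    dir=$1
    shift
    missing=0
    for f in "$@"; do
        if [ ! -f "$dir/$f" ]; then
            echo "$f"
            missing=1
        fi
    done
    return "$missing"
}

# Usage from ~/ceph-deploy, after "ceph-deploy new":
#   missing_files . ceph.conf ceph.mon.keyring
# and after "ceph-deploy mon create-initial":
#   missing_files . ceph.client.admin.keyring ceph.bootstrap-osd.keyring ceph.bootstrap-mds.keyring
```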

4. Create the initial OSDs (OSDs are the daemons that manage the physical storage; we need at least 2 of them for redundancy). In our setup, we will create one OSD on each node, each using a partition (/dev/sdb1) of a second physical disk (/dev/sdb) as its location, but this can be another partition or even a file.

benoit@admin:~/ceph-deploy$ ceph-deploy osd prepare ceph1:/dev/sdb1 ceph2:/dev/sdb1 ceph3:/dev/sdb1
benoit@admin:~/ceph-deploy$ ceph-deploy osd activate ceph1:/dev/sdb1 ceph2:/dev/sdb1 ceph3:/dev/sdb1

Remark: because a journal of 5 GB is created on the partition by default, the disk and the partition must be larger than 5 GB. The default form of the command above puts the journal on the partition used for the data, but you can specify a different location for the journal using the following syntax:

benoit@admin:~/ceph-deploy$ ceph-deploy osd prepare ceph1:/dev/sdb1:/dev/sdc1

Here /dev/sdc1 is a partition on another disk, where the journal will be located.
Instead of device files pointing to disk partitions, you can use:

  • directory on the node to store the data
  • file on the node to store the journal
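
The HOST:DATA[:JOURNAL] arguments shown above quickly become long when there are several nodes. The sketch below builds them with a small helper (osd_spec is a hypothetical name) and only prints the resulting command for review; the devices /dev/sdb1 and /dev/sdc1 are the examples used in this tutorial.

```shell
# Build one ceph-deploy OSD argument: HOST:DATA or HOST:DATA:JOURNAL.
osd_spec() {
    host=$1
    data=$2
    journal=${3:-}   # optional: separate journal location
    if [ -n "$journal" ]; then
        echo "$host:$data:$journal"
    else
        echo "$host:$data"
    fi
}

# Print the prepare command for review instead of running it.
specs=""
for host in ceph1 ceph2 ceph3; do
    specs="$specs $(osd_spec "$host" /dev/sdb1 /dev/sdc1)"
done
echo "ceph-deploy osd prepare$specs"
```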
Remark: when I rebooted my nodes, the OSD processes didn't start. I investigated, and it turned out that this was because the filesystem was not mounted by default. So I had to modify /etc/fstab too and add the following line:
/dev/sdb1 /var/lib/ceph/osd/ceph-2 xfs defaults 0 1

This mounts my /dev/sdb1 filesystem at the mount point where Ceph's init script expects to find the OSD information.
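
If you script this change, it helps to make it idempotent so that running it twice does not duplicate the entry. A minimal sketch under that assumption (add_fstab_line is an illustrative name); the file is passed as a parameter so that the function can be tried on a copy first:

```shell
# Append a line to an fstab-style file only if it is not already there.
# The file is a parameter so the function can be tried on a copy first.
add_fstab_line() {
    file=$1
    line=$2
    if grep -qxF -- "$line" "$file" 2>/dev/null; then
        return 0          # already present, nothing to do
    fi
    printf '%s\n' "$line" >> "$file"
}

# Usage, on a scratch copy rather than the live /etc/fstab:
#   cp /etc/fstab /tmp/fstab.new
#   add_fstab_line /tmp/fstab.new '/dev/sdb1 /var/lib/ceph/osd/ceph-2 xfs defaults 0 1'
```

The OSD number in the mount point (ceph-2 here) differs from node to node, so the line has to be adapted on each host.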

5. Deploy the keyring to each node, so when you use Ceph commands on these hosts you don’t need to manually specify keyring and so on:

benoit@admin:~/ceph-deploy$ ceph-deploy admin ceph1 ceph2 ceph3

The necessary files are copied under /etc/ceph on each node.
Once these files are copied, you can use the ceph command to check the status of the cluster: ceph health or ceph status.

benoit@ceph1:~$ ceph health
benoit@ceph1:~$ ceph status
    cluster dc05d0bd-3173-4a20-acad-04beddd749af
     health HEALTH_OK
     monmap e5: 2 mons at {ceph1=,ceph3=}
            election epoch 28, quorum 0,1 ceph1,ceph3
     osdmap e64: 6 osds: 6 up, 6 in
      pgmap v469: 160 pgs, 1 pools, 0 bytes data, 0 objects
            36835 MB used, 112 GB / 155 GB avail
                 160 active+clean

Remark: for this to work with a non-root user, the keyring must be readable by that user. So I had to do:
benoit@ceph1:~$ sudo chown benoit /etc/ceph/ceph.client.admin.keyring
But this file contains a key, so be careful about who you give access to it.
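
After a reboot or a change in the OSDs, the cluster may need some time before it reports HEALTH_OK again. The following sketch polls ceph health until then; it assumes the ceph command and a readable admin keyring are available on the node, and the function name and default values are illustrative only.

```shell
# Poll "ceph health" until it reports HEALTH_OK or the attempts run out.
wait_for_health_ok() {
    attempts=${1:-30}   # number of polls (assumed default)
    delay=${2:-10}      # seconds between polls (assumed default)
    i=0
    while [ "$i" -lt "$attempts" ]; do
        health=$(ceph health 2>/dev/null) || health=""
        case $health in
            HEALTH_OK*) echo "cluster healthy"; return 0 ;;
        esac
        echo "waiting: ${health:-no answer}"
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Usage on a cluster node:
#   wait_for_health_ok 30 10 || echo "cluster still not healthy"
```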

With all the above steps completed, you now have a running Ceph cluster with storage attached to it. This was the first step.
The next step is to add one or more services to access this storage:

  • Adding a Ceph block device
  • Adding a Ceph filesystem
  • Adding a Ceph Object gateway