The CephFS service lets clients connect to the Ceph cluster either through the native kernel driver or through FUSE.
CephFS gives us a filesystem-like space that we can mount on the OS (at a path in Linux or as a virtual disk drive in Windows) and use.
CephFS introduces a few concepts, such as volume and subvolume, which refer to the storage space behind our mount path or drive. So:
Volume: An abstraction for a named CephFS filesystem, including its metadata and data pools.
Subvolume: An independent directory tree within a volume, with its own quota, permissions, and optional isolated namespace, used e.g., for CSI volumes or shares.
Subvolume Group: A higher-level directory abstraction grouping subvolumes to apply shared policies (e.g., file layouts, snapshots across the group).
Now here we go !
First of all, we need to deploy the CephFS service in our cluster. It can be done with:
ceph fs volume create FILESYSTEM_NAME --placement="NUMBER_OF_DAEMONS HOST_NAME_1 HOST_NAME_2 HOST_NAME_3"
In our cluster the command is:
ceph fs volume create vol1 --placement="2 ceph01 ceph02"
But don't use this for production!
It creates the RADOS pools, PG counts, and replication settings on its own, without any regard for your cluster's actual situation.
To keep control over pool creation for CephFS, you need to first create the required pools and then create the CephFS volume.
Create two pools, one for metadata and another for data, with the names you want; size them according to your cluster's PG calculation and failure domain:
ceph osd pool create fs.meta 16 16 replicated rack-ssd --autoscale-mode=warn
ceph osd pool create fs.data 16 16 replicated rack-ssd --autoscale-mode=warn
ceph osd pool application enable fs.meta cephfs
ceph osd pool application enable fs.data cephfs
Now we can create the CephFS volume:
ceph fs new vol1 fs.meta fs.data
Apply the MDS service to make CephFS fully functional:
ceph orch apply mds vol1 --placement="2 ceph01 ceph02"
#Show Volume information
ceph fs volume info vol1
{
"mon_addrs": [
"192.168.11.21:6789",
"192.168.11.23:6789",
"192.168.11.22:6789",
"192.168.11.34:6789",
"192.168.11.33:6789",
"192.168.11.32:6789",
"192.168.11.31:6789"
],
"pools": {
"data": [
{
"avail": 7964125696,
"name": "fs.data",
"used": 0
}
],
"metadata": [
{
"avail": 7964125696,
"name": "fs.meta",
"used": 98304
}
]
}
}
#Show MDS daemon status
ceph orch ps --daemon_type=mds
NAME HOST PORTS STATUS REFRESHED AGE MEM USE MEM LIM VERSION IMAGE ID CONTAINER ID
mds.vol1.ceph01.hpcjlt ceph01 running (65s) 61s ago 65s 15.1M - 19.2.3 aade1b12b8e6 bfc660905dca
mds.vol1.ceph02.ydfzue ceph02 running (63s) 61s ago 63s 12.9M - 19.2.3 aade1b12b8e6 8fc341196912
#Show FS volume status
ceph fs status vol1
vol1 - 1 clients
====
RANK STATE MDS ACTIVITY DNS INOS DIRS CAPS
0 active vol1.ceph01.hpcjlt Reqs: 0 /s 10 13 12 1
POOL TYPE USED AVAIL
fs.meta metadata 96.0k 7595M
fs.data data 0 7595M
STANDBY MDS
vol1.ceph02.ydfzue
MDS version: ceph version 19.2.3 (c92aebb279828e9c3c1f5d24613efca272649e62) squid (stable)
You can now mount this volume on a client. For this, you should copy a keyring and a minimal ceph.conf to the client (cp or scp them from a Ceph mon node).
I used the client.admin keyring for mounting and for access to CephFS and the Ceph cluster.
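A sketch of how that copy might look (run on a mon node; the client hostname client01 is just a placeholder):
#Generate a minimal ceph.conf and export the admin keyring
ceph config generate-minimal-conf > /tmp/ceph.conf
ceph auth get client.admin > /tmp/ceph.client.admin.keyring
#Copy both files into /etc/ceph on the client
scp /tmp/ceph.conf /tmp/ceph.client.admin.keyring root@client01:/etc/ceph/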
Before performing any mount, remember to install the following packages on the client that will perform the mount:
apt install attr ceph-fuse ceph-common acl -y
You can mount this volume (I mean vol1) either with ceph-fuse or with the native kernel client:
sudo mount -t ceph <CLIENT_USER>@06204170-8641-11f0-ae89-000c296e34b9.vol1=/ <MOUNT_DIRECTORY>
OR
sudo ceph-fuse <MOUNT_DIRECTORY> -r / --client_mds_namespace=vol1
When mounting the volume, the " / " path is mounted onto your desired path on the client system (another Linux server).
I'm creating a few directories for mounting and will use them later:
mkdir -p /cephfs/main /cephfs/hot /cephfs/cold
tree /cephfs/
/cephfs/
├── cold
├── hot
└── main
4 directories, 0 files
sudo mount -t ceph admin@06204170-8641-11f0-ae89-000c296e34b9.vol1=/ /cephfs/main
df -h /cephfs/main
Filesystem Size Used Avail Use% Mounted on
admin@06204170-8641-11f0-ae89-000c296e34b9.vol1=/ 7.5G 0 7.5G 0% /cephfs/main
This shows that vol1 on cluster 06204170-8641-11f0-ae89-000c296e34b9 (the Ceph fsid) is mounted at /cephfs/main, and the reported size is the available size of the fs.data pool.
ceph df | egrep 'data|meta'
fs.data 57 16 126 B 1 12 KiB 0 7.4 GiB
fs.meta 58 16 46 KiB 22 228 KiB 0 7.4 GiB
Now, as the owner of this directory (root), you can read and write in it.
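A quick sanity check might look like this (the file name is just an example):
#Write a small test file into the mounted volume and read it back
echo "hello cephfs" | sudo tee /cephfs/main/test.txt
cat /cephfs/main/test.txt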
At this point we want to add another pool, such as an EC pool, to our volume and point some paths at it, so that data is saved in the EC pool instead of the replicated pool. This is very useful when you need to archive data.
In general, EC pools, due to their nature, are more suitable for archival purposes than for data that has a high rate of data changes and partial writes.
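As a rough comparison, with the k=2, m=2 profile used below the raw-space overhead is (k+m)/k = 4/2 = 2x, versus 3x for a typical 3-replica pool, while still tolerating the loss of any two chunks.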
Since we previously created two separate directories named hot and cold, it is clear that we want to mount the first pool, which sits on the SSD disks, on the hot directory and the second, erasure-coded pool on the cold directory.
So first of all we need to create an EC pool. Unlike replicated pools, these pools first need a profile for their CRUSH settings.
#create EC profile
ceph osd erasure-code-profile set sataprofile \
k=2 \
m=2 \
crush-failure-domain=host \
crush-device-class=sata
As we can see, I placed this pool only on SATA-class disks that I had previously added to the cluster (I had also changed the device class of those disks to sata).
As a side note, I want to explain why I can't set the failure domain to rack!
With this number of racks (as we already know, we have 2 racks), the K and M values in this profile would look for 4 racks, not 2. If we set the failure domain to rack, the profile would be inconsistent with the reality of our cluster and would leave the PGs of this pool undersized. So I set the failure domain to host, which suits our cluster well: k+m = 2+2 = 4 now looks for 4 hosts, and that matches both our cluster layout and our disk class.
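To double-check the profile, you can print it back; the output should include at least the values we set (k=2, m=2, crush-failure-domain=host, crush-device-class=sata):
ceph osd erasure-code-profile get sataprofile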
Now create a CRUSH rule based on this EC profile:
ceph osd crush rule create-erasure ecrule sataprofile
and create the EC pool with our profile and CRUSH rule:
ceph osd pool create fs.data.ec 16 16 erasure sataprofile ecrule
!!!Please note that the calculation of the number of PGs in EC pools must still be done correctly, and the number 16 is given here only as an example!!! 🤨
As we know, due to the nature of EC pools and the issue of full object writes and appends, after creating this pool, enable the partial writes option with the following command:
ceph osd pool set fs.data.ec allow_ec_overwrites true
and enable the cephfs application on the pool:
ceph osd pool application enable fs.data.ec cephfs
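As a quick check that both settings took effect, the pool's detail line should show the ec_overwrites flag and the cephfs application:
ceph osd pool ls detail | grep fs.data.ec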
Now we can add the second pool to our volume, vol1:
ceph fs add_data_pool vol1 fs.data.ec
If we look at the volume info, we can see that the new pool has been added:
ceph fs volume info vol1 | jq .pools.data
[
{
"avail": 7963787264,
"name": "fs.data",
"used": 12288
},
{
"avail": 14014265344,
"name": "fs.data.ec",
"used": 0
}
]
Now that we have both pools in this volume, we can use two methods to set different paths for these two pools.
One of these methods (the better one) is subvolumes, and the other is setfattr; we will explain both.
The setfattr command in Linux is used to set extended attributes on files and directories.
Now we need to create two different subvolumes that we can place on two separate pools, fs.data and fs.data.ec.
One point worth noting is a difference between the web GUI and the CLI: when creating a subvolume in the web GUI it is not necessary to create a subvolume group, but from the CLI you must first create a subvolume group and then add your subvolume to it. In both creation commands, you must specify the name of the pool to which data will be written and from which it will be read.
ceph fs subvolumegroup create vol1 subg1 fs.data.ec
ceph fs subvolume create vol1 subvol1 subg1 fs.data.ec
ceph fs subvolumegroup create vol1 subg2 fs.data
ceph fs subvolume create vol1 subvol2 subg2 fs.data
Using the above commands, you can create a subvolume group and a subvolume; but if you need to limit the space of the subvolume, that is, apply a quota, you can do so directly by using --size with a value in bytes, as follows:
ceph fs subvolume create vol1 subvol2 subg2 fs.data --size 1073741824
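If you want to confirm the quota (and see which data pool the subvolume ended up on), ceph fs subvolume info reports both, among other fields:
ceph fs subvolume info vol1 subvol2 subg2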
Now that we have our subvolumes, each connected to a different pool, we can mount each subvolume individually onto the different directories we created on our client.
But before that: just as we mounted the main FS volume with the path / onto the /cephfs/main directory, we now need to find the new paths that were created for the subvolumes and mount them onto the hot and cold directories.
ceph fs subvolume getpath vol1 subvol1 subg1
/volumes/subg1/subvol1/fcd93dcb-5699-400b-af19-7fc98089269a
ceph fs subvolume getpath vol1 subvol2 subg2
/volumes/subg2/subvol2/1448737b-fd7e-44ed-a07e-ba6e2ae83f29
Now, given the new paths we have, we mount them on our client:
sudo mount -t ceph admin@06204170-8641-11f0-ae89-000c296e34b9.vol1=/volumes/subg1/subvol1/fcd93dcb-5699-400b-af19-7fc98089269a /cephfs/cold
sudo mount -t ceph admin@06204170-8641-11f0-ae89-000c296e34b9.vol1=/volumes/subg2/subvol2/1448737b-fd7e-44ed-a07e-ba6e2ae83f29 /cephfs/hot
df -h
Filesystem Size Used Avail Use% Mounted on
admin@06204170-8641-11f0-ae89-000c296e34b9.vol1=/ 92G 1.2G 91G 2% /cephfs/main
admin@06204170-8641-11f0-ae89-000c296e34b9.vol1=/volumes/subg1/subvol1/fcd93dcb-5699-400b-af19-7fc98089269a 92G 1.2G 91G 2% /cephfs/cold
admin@06204170-8641-11f0-ae89-000c296e34b9.vol1=/volumes/subg2/subvol2/1448737b-fd7e-44ed-a07e-ba6e2ae83f29 92G 1.2G 91G 2% /cephfs/hot
As is clear from the df -h output, something interesting has happened: it now shows us the total size of the cluster, without even looking at the device classes the pools are in! 😱
ceph df | grep TOTAL
TOTAL 92 GiB 91 GiB 1.1 GiB 1.1 GiB 1.20
This is obviously misleading, but we can prevent the problem by using the --size switch we used when creating the subvolume, so that users of these file systems see the correct number and are not confused. 😉
Now it's time to figure out how to split paths using setfattr commands without even creating a subvolume!
To do this, I first delete the subvolumes we created.
ceph fs subvolume rm vol1 subvol2 subg2
ceph fs subvolume rm vol1 subvol1 subg1
ceph fs subvolumegroup rm vol1 subg1
ceph fs subvolumegroup rm vol1 subg2
Now we need to mount our main volume onto a main directory. In this case (setfattr) we cannot mount any directory we like, as we could with subvolumes; we need to preserve the tree layout of the FS. For us that is vol1 first, and then under / we can have two other directories, which can be separate paths stored in separate pools.
sudo mount -t ceph admin@06204170-8641-11f0-ae89-000c296e34b9.vol1=/ /cephfs/
Now we can place the two other directories, /cephfs/cold and /cephfs/hot, on different pools.
But since the previous hot and cold directories lived on the local OS disk, they no longer exist under this mount and must actually be recreated inside our file system:
mkdir -p /cephfs/hot /cephfs/cold
setfattr -n ceph.dir.layout.pool -v fs.data /cephfs/hot/
setfattr -n ceph.dir.layout.pool -v fs.data.ec /cephfs/cold/
If the commands are executed successfully, you should not see any output or errors.
Now we can use the getfattr command to find out what the actual status of each directory is on the CephFS Volume side. This command works when we also use subvolumes to separate paths.
getfattr -n ceph.dir.layout /cephfs/hot
getfattr: Removing leading '/' from absolute path names
# file: cephfs/hot
ceph.dir.layout="stripe_unit=4194304 stripe_count=1 object_size=4194304 pool=fs.data"
getfattr -n ceph.dir.layout /cephfs/cold/
getfattr: Removing leading '/' from absolute path names
# file: cephfs/cold/
ceph.dir.layout="stripe_unit=4194304 stripe_count=1 object_size=4194304 pool=fs.data.ec"
As you can see in the output above, the name of each pool is listed separately for each path.
But what about the main path itself?
getfattr -n ceph.dir.layout /cephfs/
getfattr: Removing leading '/' from absolute path names
# file: cephfs/
ceph.dir.layout="stripe_unit=4194304 stripe_count=1 object_size=4194304 pool=fs.data"
As we know, when creating this volume, we first introduced the fs.data pool as the first data pool, and therefore the root path " / " will actually be stored in the fs.data pool.
By using setfattr commands, we can also perform other tasks such as applying quotas, all of which are executed on the client side.
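For example, a quota on one of these directories could be set and read back like this (the 10 GiB value is arbitrary):
#Limit /cephfs/hot to 10 GiB (10737418240 bytes)
setfattr -n ceph.quota.max_bytes -v 10737418240 /cephfs/hot/
getfattr -n ceph.quota.max_bytes /cephfs/hot/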
Now it's time to use the NFS protocol to mount these volumes or subvolumes on clients.
Note that in this case the client no longer communicates directly with the cluster and does not need a keyring or a ceph.conf file; it connects to the cluster over NFS only. Also note that you must select one or more nodes of your choice to carry the NFS traffic. You can choose dedicated nodes (the better way) or share them with your OSD nodes, which is not ideal.
To do this, you must first deploy the NFS service on the cluster so that it listens on the NFS port on one or more nodes. I used nodes ceph01 and ceph02, and nfs-vol1 is just an arbitrary name!
ceph nfs cluster create nfs-vol1 ceph01,ceph02
OR
ceph orch apply nfs nfs-vol1 --placement="2 ceph01 ceph02"
ceph orch ps --daemon_type=nfs
NAME HOST PORTS STATUS REFRESHED AGE MEM USE MEM LIM VERSION IMAGE ID CONTAINER ID
nfs.nfs-vol1.0.0.ceph01.lolnah ceph01 *:2049 running (10s) 3s ago 10s 52.6M - 5.9 aade1b12b8e6 55b2acdb929a
nfs.nfs-vol1.1.0.ceph02.mdnexx ceph02 *:2049 running (5s) 3s ago 5s 19.1M - 5.9 aade1b12b8e6 6bce62e7726e
ceph nfs cluster info nfs-vol1
{
"nfs-vol1": {
"backend": [
{
"hostname": "ceph01",
"ip": "192.168.11.21",
"port": 2049
},
{
"hostname": "ceph02",
"ip": "192.168.11.22",
"port": 2049
}
],
"virtual_ip": null
}
}
Now we need to export our volume or subvolume over NFS. Because NFS-Ganesha is used, we must define a pseudo path, which cannot be " / " itself.
And one other thing! Before we go further, we need to know about the NFS squash options:
no_root_squash → Client’s root user = server’s root user (full privileges). ❌ Very insecure.
root_id_squash → Client’s root UID gets mapped to an anonymous user (like nobody), but other UIDs stay unchanged.
root_squash → Only the client's root is squashed to nobody, other users keep their real IDs. (the traditional NFS default, and safer)
all_squash → All client users (root and non-root) are squashed to nobody. Useful for shared/public exports.
❗❗❗ Note that in the default NFS export creation, squash is set to no_root_squash ❗❗❗
ceph nfs export create cephfs nfs-vol1 /vol1 vol1
{
"bind": "/vol1",
"cluster": "nfs-vol1",
"fs": "vol1",
"mode": "RW",
"path": "/"
}
You may want to make the export read-only, which can be done with the --readonly option.
You may also want to restrict your NFS export to a limited IP or subnet, which can be done with --client_addr.
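Putting those options together, a more restricted export might look like this (the pseudo path /vol1-ro and the subnet are only examples, and --squash is an additional option in recent releases; check ceph nfs export create cephfs --help on your version). The mount command below still refers to the original /vol1 export.
ceph nfs export create cephfs nfs-vol1 /vol1-ro vol1 --readonly --client_addr 192.168.11.0/24 --squash root_squash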
mount -t nfs 192.168.11.21:/vol1 /nfs/
or you can use the ceph02 IP address.
If you want to mount a subvolume over NFS, you need to fetch the subvolume path with ceph fs subvolume getpath and mount it onto your directory on the client side.
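A sketch of that flow, assuming a subvolume like the ones created earlier still exists (the pseudo path /subvol1 is arbitrary and <UUID> stands for the real path component printed by getpath):
ceph fs subvolume getpath vol1 subvol1 subg1
ceph nfs export create cephfs nfs-vol1 /subvol1 vol1 --path=/volumes/subg1/subvol1/<UUID>
mount -t nfs 192.168.11.21:/subvol1 /nfs/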
Regarding users and their permissions in CephFS, there are two concerns: one is the user who mounts the path, and the other is the users who can read or write inside that mounted path. Using Ceph users controlled by cephx, we can create users who are allowed to mount a volume (or subvolume), or allowed to mount it only read-only. The read-only case is self-explanatory! But when a user with full access such as client.admin mounts a volume with its keyring on the client, the local Unix users must still be granted access inside that mount. The situation is easier to understand if we look at it side by side (a small example of creating such CephX users follows this list):
CephX user = the entry gate → decides if the client can mount CephFS and whether it’s read-only or read-write
Linux user = POSIX/ACL rules → decides which local OS users can actually read/write inside the mounted directory
If CephX grants only r → even chmod 777 won’t allow writing
If CephX grants rw → POSIX permissions/ACLs decide the fine-grained access for local users
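For the CephX side, a dedicated (non-admin) user can be created with ceph fs authorize; for example, one read-write user and one read-only user (client.user1 and client.user2 are arbitrary names):
#Read-write access to vol1 starting at the filesystem root
ceph fs authorize vol1 client.user1 / rw
#Read-only access
ceph fs authorize vol1 client.user2 / r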
For more information, visit the link below.
https://docs.ceph.com/en/reef/cephfs/client-auth/#cephfs-client-capabilities