Object layout in Ceph RBD

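When you create an image on a Ceph RBD block device, you can choose the image format with --image-format=1 or --image-format=2. The official documentation describes the two formats as follows: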

  • format 1 - Use the original format for a new rbd image. This format is understood by all versions of librbd and the kernel rbd module, but does not support newer features like cloning.
  • format 2 - Use the second rbd format, which is supported by librbd (but not the kernel rbd module) at this time. This adds support for cloning and is more easily extensible to allow more features in the future.

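In the Ceph release used here, format 1 is the default. It is understood by every version of librbd and by the kernel rbd module, but it does not support newer features such as cloning. Format 2 is supported by librbd but not by the kernel rbd module; it adds cloning and is easier to extend with further features.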

1. Objects of a format 1 image

First, create a format 1 image:

root@ceph1:~# rbd create test_vol --size=10
root@ceph1:~# rbd -p rbd ls
test_vol

Listing the pool with rados shows that a corresponding object named test_vol.rbd has been created:

root@ceph1:~# rados -p rbd ls
test_vol.rbd
rbd_directory

rbd_directory holds the list of images in the pool.
As the following command shows, for a format 1 image the mapping between image name and id is not stored in rbd_directory:

root@ceph1:~# rados -p rbd listomapvals rbd_directory
root@ceph1:~#

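rbd info shows some of the image metadata: size is the image size, order gives the object size (1 << order bytes), block_name_prefix is the name prefix of the data objects, and format is the image format.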

root@ceph1:/var/local/osd/current# rbd info rbd/test_vol
rbd image 'test_vol':
size 10240 kB in 3 objects
order 22 (4096 kB objects)
block_name_prefix: rb.0.12ae.74b0dc51
format: 1
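
As a quick check of the numbers in the rbd info output above, the following Python sketch derives the object size and the object count from order and the image size. The function names are made up for illustration; only the 1 << order relationship and the sample values come from the output.

```python
def object_size(order: int) -> int:
    """Size in bytes of one data object: 1 << order."""
    return 1 << order


def object_count(image_size: int, order: int) -> int:
    """Number of objects needed to cover image_size bytes (ceiling division)."""
    size = object_size(order)
    return (image_size + size - 1) // size


if __name__ == "__main__":
    image_size = 10240 * 1024  # "size 10240 kB" from rbd info
    print(object_size(22) // 1024, "kB per object")  # 4096 kB per object
    print(object_count(image_size, 22), "objects")   # 3 objects
```

A 10 MB image with 4 MB objects needs three objects because the last one only covers the final 2 MB.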

The metadata stored in test_vol.rbd can also be inspected directly:

root@ceph1:~# rados -p rbd stat test_vol.rbd
rbd/test_vol.rbd mtime 1445237108, size 141

root@ceph1:~# rados -p rbd get test_vol.rbd 1.txt
root@ceph1:~# xxd 1.txt
0000000: 3c3c 3c20 5261 646f 7320 426c 6f63 6b20 <<< Rados Block
0000010: 4465 7669 6365 2049 6d61 6765 203e 3e3e Device Image >>>
0000020: 0a00 0000 0000 0000 7262 2e30 2e31 3261 ........rb.0.12a
0000030: 652e 3734 6230 6463 3531 0000 0000 0000 e.74b0dc51......
0000040: 5242 4400 3030 312e 3030 3500 1600 0000 RBD.001.005.....
0000050: 0000 a000 0000 0000 0400 0000 0000 0000 ................
0000060: 0100 0000 0000 0000 0d00 0000 0000 0000 ................
0000070: 0400 0000 0000 0000 0000 a000 0000 0000 ................
0000080: 7465 7374 766f 6c5f 736e 6170 00 testvol_snap.
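
The dump can be decoded field by field. The Python sketch below is one way to do that; the field layout in the comment is an assumption read off the 141 bytes shown above (it happens to account for every byte), and parse_v1_header and DUMP are names invented for this example. The dump evidently contains one snapshot, testvol_snap.

```python
import struct

# Assumed layout (little-endian, no padding), read off the xxd dump:
#   text[40], block_name[24], signature[4], version[8],
#   order, crypt_type, comp_type, unused (1 byte each),
#   image_size u64, snap_seq u64, snap_count u32, reserved u32,
#   snap_names_len u64, then snap_count x (id u64, image_size u64),
#   then the NUL-terminated snapshot names.
HEADER = struct.Struct("<40s24s4s8s4BQQIIQ")

HEX = """
3c3c 3c20 5261 646f 7320 426c 6f63 6b20 4465 7669 6365 2049 6d61 6765 203e 3e3e
0a00 0000 0000 0000 7262 2e30 2e31 3261 652e 3734 6230 6463 3531 0000 0000 0000
5242 4400 3030 312e 3030 3500 1600 0000 0000 a000 0000 0000 0400 0000 0000 0000
0100 0000 0000 0000 0d00 0000 0000 0000 0400 0000 0000 0000 0000 a000 0000 0000
7465 7374 766f 6c5f 736e 6170 00
"""
DUMP = bytes.fromhex("".join(HEX.split()))


def parse_v1_header(raw: bytes) -> dict:
    (_text, block_name, signature, version, order, _crypt, _comp, _unused,
     image_size, snap_seq, snap_count, _reserved, names_len) = HEADER.unpack_from(raw)
    offset = HEADER.size
    snaps = []
    for _ in range(snap_count):
        snap_id, snap_size = struct.unpack_from("<QQ", raw, offset)
        snaps.append({"id": snap_id, "image_size": snap_size})
        offset += 16
    # Snapshot names follow the fixed-size snapshot entries, NUL-separated.
    names = raw[offset:offset + names_len].split(b"\0")
    for snap, name in zip(snaps, names):
        snap["name"] = name.decode()
    return {
        "block_name": block_name.rstrip(b"\0").decode(),
        "signature": signature.rstrip(b"\0").decode(),
        "version": version.rstrip(b"\0").decode(),
        "order": order,
        "image_size": image_size,
        "snap_seq": snap_seq,
        "snapshots": snaps,
    }


if __name__ == "__main__":
    for key, value in parse_v1_header(DUMP).items():
        print(key, "=", value)
```

The decoded block_name, order and image_size agree with what rbd info reports for test_vol.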

After writing data to the image, rados -p rbd ls shows one new data object belonging to it:

root@ceph1:~# rbd map test_vol
root@ceph1:~# rbd showmapped
id pool image snap device
2 volumes test_vol1 - /dev/rbd2
3 rbd myrbd - /dev/rbd3
4 rbd test_vol - /dev/rbd4
root@ceph1:~# dd if=/dev/urandom of=/dev/rbd/rbd/test_vol bs=4M count=1 oflag=direct
1+0 records in
1+0 records out
4194304 bytes (4.2 MB) copied, 1.03652 s, 4.0 MB/s
root@ceph1:~#
root@ceph1:~# rados -p rbd ls
rbd_directory
test_vol.rbd
rb.0.12ae.74b0dc51.000000000000

The data object can also be seen as a file on the OSD:

root@ceph1:/var/local/osd/current# ls -la *head |grep 74b0dc51
-rw-r--r-- 1 root root 4194304 Oct 19 14:26 rb.0.12ae.74b0dc51.000000000000__head_331FC0BA__2

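The object files are named as follows:
{block_name_prefix}.{fragment}__{head|snap_num}_{hash}__{poolid}

  • block_name_prefix: the object name prefix
  • fragment: the object index in hexadecimal, i.e. the image offset divided by the object size; with 4 MB objects, 000000000000 is the first 4 MB of the image
  • head or snap_num: the snapshot version; head means the object belongs to the image itself, a number is the id of a snapshot
  • hash: a hash computed from the full object name, not just from block_name_prefix (the rbd_data objects listed in section 2.3 share a prefix but have different hashes)
  • poolid: the id of the pool the image belongs to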
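
To make the fragment part concrete, here is a small Python sketch that maps a byte offset in the image to the name of the data object holding it. data_object_name is a hypothetical helper; the 12-digit suffix matches the format 1 object above, and the digits parameter is there because the format 2 objects shown later in this article use 16 digits.

```python
def data_object_name(prefix: str, offset: int, order: int = 22, digits: int = 12) -> str:
    """Name of the data object that holds the byte at `offset` of the image."""
    index = offset >> order  # same as offset // (1 << order)
    return f"{prefix}.{index:0{digits}x}"


if __name__ == "__main__":
    # First 4 MB of the format 1 image created above.
    print(data_object_name("rb.0.12ae.74b0dc51", 0))
    # Byte 9 MB of a format 2 image falls into the third object (index 2).
    print(data_object_name("rbd_data.11ee2ae8944a", 9 * 1024 * 1024, digits=16))
```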

2. Objects of a format 2 image

To create a format 2 image, add --image-format=2; the creation steps are not repeated here. Let's look directly at the objects that exist with format 2.
The main RBD objects on the OSDs are:

root@ceph1:~# rados -p rbd ls
rbd_id.myrbdsnapclone1
rbd_header.11ee2ae8944a
rbd_children
rbd_directory
rbd_data.11ee2ae8944a.0000000000000000
rbd_id.myrbd
rbd_data.11ee2ae8944a.0000000000000002
rbd_header.123f238e1f29
rbd_data.11ee2ae8944a.0000000000000001

2.1 rbd_id.{image name}

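On disk, the rbd_id object file is named rbd\uid.{image name}__head_{hash}__{poolid} (the OSD file store writes the underscore in the object name as \u).
For example: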

root@ceph1:/var/local/osd/current# ls *head -la |grep id
-rw-r--r-- 1 root root 16 Oct 19 10:25 rbd\uid.myrbd__head_422A8C36__2

The object contains the image id, which can be read like this:

root@ceph1:/var/local/osd/current# rados -p rbd get rbd_id.myrbd 1.txt
root@ceph1:/var/local/osd/current# xxd 1.txt
0000000: 0c00 0000 3131 6565 3261 6538 3934 3461 ....11ee2ae8944a

Or locate the object file on the OSD and dump it directly:

root@ceph1:/var/local/osd/current# xxd 2.36_head/rbd\\uid.myrbd__head_422A8C36__2 
0000000: 0c00 0000 3131 6565 3261 6538 3934 3461 ....11ee2ae8944a
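
The 16 bytes read as a 4-byte little-endian length (0x0c = 12) followed by the 12 characters of the id. A minimal Python sketch of that decoding, assuming this length-prefixed layout holds in general; decode_string is a name chosen for this example:

```python
import struct


def decode_string(buf: bytes, offset: int = 0):
    """Decode a u32 little-endian length followed by that many bytes.

    Returns the string and the offset just past it.
    """
    (length,) = struct.unpack_from("<I", buf, offset)
    start = offset + 4
    end = start + length
    if end > len(buf):
        raise ValueError("buffer is shorter than the encoded length")
    return buf[start:end].decode("utf-8"), end


if __name__ == "__main__":
    raw = bytes.fromhex("0c000000 3131 6565 3261 6538 3934 3461")
    image_id, consumed = decode_string(raw)
    print(image_id, consumed)  # 11ee2ae8944a 16
```

The same length-prefixed strings show up again in the omap values of the objects described below.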

2.2 rbd_header.{image id}

On disk, the rbd_header object file is named rbd\uheader.{image id}__head_{hash}__{poolid}.
For example:

root@ceph1:/var/local/osd/current# ls *head -la |grep header
-rw-r--r-- 1 root root 0 Oct 19 10:25 rbd\uheader.11ee2ae8944a__head_3FB535C5__2

This object records the metadata of the RBD image: size, order, object_prefix, snap_seq, parent (only for cloned images) and snapshot_{snap id} (one entry per snapshot).
The rbd_header contents are described below for each kind of image.

1) Image without snapshots

  • object_prefix: the name prefix of the data objects
  • order: used to compute the object size; with order 22 the object size is 1 << 22 = 4 MB
  • size: the image size in bytes
  • snap_seq: the snapshot sequence number; 0 when there are no snapshots

root@ceph1:/var/local/osd/current# rados -p rbd listomapvals rbd_header.12812ae8944a
features
value: (8 bytes) :
0000 : 01 00 00 00 00 00 00 00 : ........

object_prefix
value: (25 bytes) :
0000 : 15 00 00 00 72 62 64 5f 64 61 74 61 2e 31 32 38 : ....rbd_data.128
0010 : 31 32 61 65 38 39 34 34 61 : 12ae8944a

order
value: (1 bytes) :
0000 : 16 : .

size
value: (8 bytes) :
0000 : 00 00 a0 00 00 00 00 00 : ........

snap_seq
value: (8 bytes) :
0000 : 00 00 00 00 00 00 00 00 : ........
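
The raw values are easier to read once decoded. The Python sketch below assumes what the dump suggests: features, size and snap_seq are 8-byte little-endian integers, order is a single byte, and object_prefix is a length-prefixed string. decode_header and OMAP are names invented for this example; the bytes are copied from the output above.

```python
import struct

# omap values of rbd_header.12812ae8944a, copied from the listing above.
OMAP = {
    "features": bytes.fromhex("01 00 00 00 00 00 00 00"),
    "object_prefix": bytes.fromhex(
        "15 00 00 00 72 62 64 5f 64 61 74 61 2e 31 32 38"
        "31 32 61 65 38 39 34 34 61"
    ),
    "order": bytes.fromhex("16"),
    "size": bytes.fromhex("00 00 a0 00 00 00 00 00"),
    "snap_seq": bytes.fromhex("00 00 00 00 00 00 00 00"),
}


def decode_header(omap: dict) -> dict:
    """Turn the raw omap values of an rbd_header object into Python values."""
    (prefix_len,) = struct.unpack_from("<I", omap["object_prefix"], 0)
    return {
        "features": struct.unpack("<Q", omap["features"])[0],
        "object_prefix": omap["object_prefix"][4:4 + prefix_len].decode(),
        "order": omap["order"][0],
        "size": struct.unpack("<Q", omap["size"])[0],
        "snap_seq": struct.unpack("<Q", omap["snap_seq"])[0],
    }


if __name__ == "__main__":
    header = decode_header(OMAP)
    print(header)
    print("object size:", 1 << header["order"], "bytes")
```

The decoded size is 10485760 bytes, which is the size of the whole image, not of a single object.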

2) Image with snapshots

  • object_prefix: the name prefix of the data objects
  • order: used to compute the object size; with order 22 the object size is 1 << 22 = 4 MB
  • size: the image size in bytes
  • snap_seq: the snapshot sequence number
  • snapshot_{snap id}: the record of the corresponding snapshot

root@ceph1:/var/local/osd/current# rados -p rbd listomapvals rbd_header.11ee2ae8944a
features
value: (8 bytes) :
0000 : 01 00 00 00 00 00 00 00 : ........

object_prefix
value: (25 bytes) :
0000 : 15 00 00 00 72 62 64 5f 64 61 74 61 2e 31 31 65 : ....rbd_data.11e
0010 : 65 32 61 65 38 39 34 34 61 : e2ae8944a

order
value: (1 bytes) :
0000 : 16 : .

size
value: (8 bytes) :
0000 : 00 00 a0 00 00 00 00 00 : ........

snap_seq
value: (8 bytes) :
0000 : 03 00 00 00 00 00 00 00 : ........

snapshot_0000000000000002
value: (78 bytes) :
0000 : 03 01 48 00 00 00 02 00 00 00 00 00 00 00 09 00 : ..H.............
0010 : 00 00 6d 79 72 62 64 73 6e 61 70 00 00 a0 00 00 : ..myrbdsnap.....
0020 : 00 00 00 01 00 00 00 00 00 00 00 01 01 1c 00 00 : ................
0030 : 00 ff ff ff ff ff ff ff ff 00 00 00 00 fe ff ff : ................
0040 : ff ff ff ff ff 00 00 00 00 00 00 00 00 02 : ..............

snapshot_0000000000000003
value: (79 bytes) :
0000 : 03 01 49 00 00 00 03 00 00 00 00 00 00 00 0a 00 : ..I.............
0010 : 00 00 6d 79 72 62 64 73 6e 61 70 32 00 00 a0 00 : ..myrbdsnap2....
0020 : 00 00 00 00 01 00 00 00 00 00 00 00 01 01 1c 00 : ................
0030 : 00 00 ff ff ff ff ff ff ff ff 00 00 00 00 fe ff : ................
0040 : ff ff ff ff ff ff 00 00 00 00 00 00 00 00 00 : ...............
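
The snapshot_ values can be picked apart in the same way. The layout used in this Python sketch is inferred only from the two dumps above and should be treated as an assumption: a 6-byte header (version, compat version, u32 payload length), then the snapshot id, a length-prefixed name, the image size, a features word, a nested parent record with its own 6-byte header, and a final status byte. decode_snapshot is a made-up name.

```python
import struct

# Value of snapshot_0000000000000002, copied from the listing above.
SNAP_2 = bytes.fromhex(
    "03 01 48 00 00 00 02 00 00 00 00 00 00 00 09 00"
    "00 00 6d 79 72 62 64 73 6e 61 70 00 00 a0 00 00"
    "00 00 00 01 00 00 00 00 00 00 00 01 01 1c 00 00"
    "00 ff ff ff ff ff ff ff ff 00 00 00 00 fe ff ff"
    "ff ff ff ff ff 00 00 00 00 00 00 00 00 02"
)


def decode_snapshot(buf: bytes) -> dict:
    """Decode one snapshot_<id> omap value (layout assumed from the dump)."""
    struct_v, _compat, length = struct.unpack_from("<BBI", buf, 0)
    off = 6
    snap_id, name_len = struct.unpack_from("<QI", buf, off)
    off += 12
    name = buf[off:off + name_len].decode()
    off += name_len
    image_size, features = struct.unpack_from("<QQ", buf, off)
    off += 16
    # Skip the nested parent record using its own length field.
    _pv, _pc, parent_len = struct.unpack_from("<BBI", buf, off)
    off += 6 + parent_len
    status = buf[off]  # assumed to be the protection status byte
    return {
        "struct_v": struct_v,
        "payload_len": length,
        "id": snap_id,
        "name": name,
        "image_size": image_size,
        "features": features,
        "status": status,
    }


if __name__ == "__main__":
    print(decode_snapshot(SNAP_2))
```

Under this reading the last byte is 2 for myrbdsnap, the snapshot that was cloned, and 0 for myrbdsnap2, which would fit a protected/unprotected flag.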

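3) Image cloned from a snapshot

  • object_prefix: the name prefix of the data objects
  • order: used to compute the object size; with order 22 the object size is 1 << 22 = 4 MB
  • size: the image size in bytes
  • snap_seq: the snapshot sequence number; 0 when there are no snapshots
  • parent: identifies the parent of this image (the parent image id, together with its pool id, snapshot id and overlap size)
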
root@ceph1:/var/local/osd/current# rados -p rbd listomapvals rbd_header.123f238e1f29
features
value: (8 bytes) :
0000 : 01 00 00 00 00 00 00 00 : ........

object_prefix
value: (25 bytes) :
0000 : 15 00 00 00 72 62 64 5f 64 61 74 61 2e 31 32 33 : ....rbd_data.123
0010 : 66 32 33 38 65 31 66 32 39 : f238e1f29

order
value: (1 bytes) :
0000 : 16 : .

parent
value: (46 bytes) :
0000 : 01 01 28 00 00 00 02 00 00 00 00 00 00 00 0c 00 : ..(.............
0010 : 00 00 31 31 65 65 32 61 65 38 39 34 34 61 02 00 : ..11ee2ae8944a..
0020 : 00 00 00 00 00 00 00 00 a0 00 00 00 00 00 : ..............

size
value: (8 bytes) :
0000 : 00 00 a0 00 00 00 00 00 : ........

snap_seq
value: (8 bytes) :
0000 : 00 00 00 00 00 00 00 00 : ........
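
The parent value of the clone can be decoded to confirm which snapshot it was created from. This Python sketch assumes the layout visible in the dump above: a 6-byte header, a signed 64-bit pool id, a length-prefixed image id, a 64-bit snapshot id and a 64-bit overlap. decode_parent is a hypothetical helper.

```python
import struct

# "parent" value of rbd_header.123f238e1f29, copied from the listing above.
PARENT = bytes.fromhex(
    "01 01 28 00 00 00 02 00 00 00 00 00 00 00 0c 00"
    "00 00 31 31 65 65 32 61 65 38 39 34 34 61 02 00"
    "00 00 00 00 00 00 00 00 a0 00 00 00 00 00"
)


def decode_parent(buf: bytes) -> dict:
    """Decode the parent record of a cloned image (layout assumed from the dump)."""
    _struct_v, _compat, _length = struct.unpack_from("<BBI", buf, 0)
    pool_id, id_len = struct.unpack_from("<qI", buf, 6)
    off = 6 + 12
    image_id = buf[off:off + id_len].decode()
    off += id_len
    snap_id, overlap = struct.unpack_from("<QQ", buf, off)
    return {
        "pool_id": pool_id,
        "image_id": image_id,
        "snap_id": snap_id,
        "overlap": overlap,
    }


if __name__ == "__main__":
    print(decode_parent(PARENT))
```

The result points at pool 2, image 11ee2ae8944a (myrbd) and snapshot 2 (myrbdsnap), with an overlap equal to the full 10 MB image size.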

2.3 rbd_data.{image id}.{offset}

On disk, rbd_data object files are named rbd\udata.{image id}.{fragment}__{head|snap}_{hash}__{poolid}.
These are the data objects of the RBD image and hold the actual data.

root@ceph2:/var/local/osd/current# ls *head -la|grep data
-rw-r--r-- 1 root root 1048576 Oct 19 10:47 rbd\udata.11ee2ae8944a.0000000000000000__2_88F8F929__2
-rw-r--r-- 1 root root 4194304 Oct 19 11:37 rbd\udata.11ee2ae8944a.0000000000000000__3_88F8F929__2
-rw-r--r-- 1 root root 4194304 Oct 19 11:37 rbd\udata.11ee2ae8944a.0000000000000000__head_88F8F929__2
-rw-r--r-- 1 root root 4194304 Oct 19 11:37 rbd\udata.11ee2ae8944a.0000000000000001__3_0F1C4EFE__2
-rw-r--r-- 1 root root 4194304 Oct 19 11:37 rbd\udata.11ee2ae8944a.0000000000000001__head_0F1C4EFE__2
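
File names like these can be split back into their parts. The Python sketch below handles only the pattern seen in this listing and assumes \u is the only escape that needs undoing; parse_osd_filename is a name chosen for illustration.

```python
def parse_osd_filename(filename: str) -> dict:
    """Split an OSD object file name into object name, snapshot, hash and pool id."""
    # The last two "__" separators delimit {snap}_{hash} and the pool id.
    name, snap_and_hash, pool = filename.rsplit("__", 2)
    snap, obj_hash = snap_and_hash.split("_", 1)
    return {
        "object": name.replace("\\u", "_"),  # undo the \u escape for "_"
        "snap": snap,                        # "head" or a snapshot id
        "hash": obj_hash,
        "pool": int(pool),
    }


if __name__ == "__main__":
    names = [
        r"rbd\udata.11ee2ae8944a.0000000000000000__2_88F8F929__2",
        r"rbd\udata.11ee2ae8944a.0000000000000000__head_88F8F929__2",
        r"rbd\udata.11ee2ae8944a.0000000000000001__head_0F1C4EFE__2",
    ]
    for n in names:
        print(parse_osd_filename(n))
```

Grouping the parsed entries by object name shows which snapshots still hold their own copy of a given object next to the head version.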

2.4 rbd_directory

On disk, the rbd_directory object file is named rbd\udirectory__head_{hash}__{poolid}.

root@ceph1:/var/local/osd/current# ls *head -la|grep directory
-rw-r--r-- 1 root root 8 Oct 19 10:22 rbd\udirectory__head_30A98C1C__2

This object holds a two-way mapping between the name and the id of every image in the pool:

root@ceph1:/var/local/osd/current# rados -p rbd listomapvals rbd_directory
id_11ee2ae8944a
value: (9 bytes) :
0000 : 05 00 00 00 6d 79 72 62 64 : ....myrbd

id_123f238e1f29
value: (19 bytes) :
0000 : 0f 00 00 00 6d 79 72 62 64 73 6e 61 70 63 6c 6f : ....myrbdsnapclo
0010 : 6e 65 31 : ne1

id_12812ae8944a
value: (10 bytes) :
0000 : 06 00 00 00 6d 79 72 62 64 32 : ....myrbd2

name_myrbd
value: (16 bytes) :
0000 : 0c 00 00 00 31 31 65 65 32 61 65 38 39 34 34 61 : ....11ee2ae8944a

name_myrbd2
value: (16 bytes) :
0000 : 0c 00 00 00 31 32 38 31 32 61 65 38 39 34 34 61 : ....12812ae8944a

name_myrbdsnapclone1
value: (16 bytes) :
0000 : 0c 00 00 00 31 32 33 66 32 33 38 65 31 66 32 39 : ....123f238e1f29
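
The two-way mapping can be rebuilt from these entries and checked for consistency. The Python sketch below uses two of the entries above and assumes every value is a length-prefixed string; decode_directory and OMAP are names invented for this example.

```python
import struct


def decode_string(value: bytes) -> str:
    """Decode a 4-byte little-endian length followed by that many bytes."""
    (length,) = struct.unpack_from("<I", value, 0)
    return value[4:4 + length].decode()


def decode_directory(omap: dict):
    """Split rbd_directory omap entries into id->name and name->id maps."""
    id_to_name, name_to_id = {}, {}
    for key, value in omap.items():
        if key.startswith("id_"):
            id_to_name[key[len("id_"):]] = decode_string(value)
        elif key.startswith("name_"):
            name_to_id[key[len("name_"):]] = decode_string(value)
    return id_to_name, name_to_id


# Two of the entries from the listomapvals output above.
OMAP = {
    "id_11ee2ae8944a": bytes.fromhex("05000000 6d79726264"),
    "name_myrbd": bytes.fromhex("0c000000 313165653261653839343461"),
}

if __name__ == "__main__":
    id_to_name, name_to_id = decode_directory(OMAP)
    print(id_to_name)
    print(name_to_id)
    # Every id -> name entry should have a matching name -> id entry.
    print(all(name_to_id.get(name) == image_id
              for image_id, name in id_to_name.items()))
```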

2.5 rbd_children

On disk, the rbd_children object file is named rbd\uchildren__head_{hash}__{poolid}.

root@ceph3:/var/local/osd/current# ls *head -la|grep children 
-rw-r--r-- 1 root root 0 Oct 19 11:19 rbd\uchildren__head_0FA1CACA__2

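It records parent-child relationships so that they can be looked up quickly: the key identifies the parent (pool id, image id and snapshot id) and the value lists the ids of the images cloned from it.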

root@ceph1:/var/local/osd/current# rados -p rbd listomapvals rbd_children 
key: (32 bytes):
0000 : 02 00 00 00 00 00 00 00 0c 00 00 00 31 31 65 65 : ............11ee
0010 : 32 61 65 38 39 34 34 61 02 00 00 00 00 00 00 00 : 2ae8944a........

value: (20 bytes) :
0000 : 01 00 00 00 0c 00 00 00 31 32 33 66 32 33 38 65 : ........123f238e
0010 : 31 66 32 39 : 1f29
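
Both the key and the value can be decoded with a few lines of Python. The layout is an assumption based on the bytes above: the key is a signed 64-bit pool id, a length-prefixed image id and a 64-bit snapshot id; the value is a 32-bit count followed by that many length-prefixed child ids. The function names are made up for this sketch.

```python
import struct

KEY = bytes.fromhex(
    "02 00 00 00 00 00 00 00 0c 00 00 00 31 31 65 65"
    "32 61 65 38 39 34 34 61 02 00 00 00 00 00 00 00"
)
VALUE = bytes.fromhex(
    "01 00 00 00 0c 00 00 00 31 32 33 66 32 33 38 65"
    "31 66 32 39"
)


def decode_children_key(buf: bytes) -> dict:
    """Decode an rbd_children key: parent pool id, image id and snapshot id."""
    pool_id, id_len = struct.unpack_from("<qI", buf, 0)
    image_id = buf[12:12 + id_len].decode()
    (snap_id,) = struct.unpack_from("<Q", buf, 12 + id_len)
    return {"pool_id": pool_id, "image_id": image_id, "snap_id": snap_id}


def decode_children_value(buf: bytes) -> list:
    """Decode an rbd_children value: a counted list of child image ids."""
    (count,) = struct.unpack_from("<I", buf, 0)
    off = 4
    children = []
    for _ in range(count):
        (length,) = struct.unpack_from("<I", buf, off)
        off += 4
        children.append(buf[off:off + length].decode())
        off += length
    return children


if __name__ == "__main__":
    print(decode_children_key(KEY))
    print(decode_children_value(VALUE))
```

Decoded, the entry says that snapshot 2 of image 11ee2ae8944a in pool 2 has one child, 123f238e1f29, which rbd_directory maps to myrbdsnapclone1.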