Analysis of an IO hang caused by failing disks

Recently our production systems ran into IO hangs caused by failing disks. There were two scenarios:

1) In the first scenario, the RAID controller misbehaved while a disk was failing: the RAID management command megacli hung on that machine, and the physical disks on it were intermittently and abnormally busy (the IO volume was small, but svctm reached several thousand ms). As a result, volumes of our block storage service hung, which showed up to users as their volume's io util staying at 100% for a long time.

2) In the second scenario, after a disk failed we removed it from the RAID controller (each disk is configured as a single-disk RAID0 with the default writeback cache policy, so after a disk fails its logical volume has to be deleted from the controller). During the removal, the physical disks on that machine all stalled for a while and the megacli command hung as well, which again left some users' volumes at 100% io util for a long time.

For scenario 1), logs were collected and sent to the vendor, who confirmed that the RAID controller firmware on the production machines was too old and needed to be upgraded.
For scenario 2), once a disk fails the RAID controller enters a protection mode and switches the physical disks' cache policy from writeback to writethrough; when the failed disk is then removed, the data held in the controller cache is flushed back to the disks, and during this flush the physical disk IO does stall for a short time.

In both scenarios the physical disk IO stalled for less than one minute, yet users reported that their volume IO hung for several minutes, considerably longer than the physical disks did. Moreover, this production block storage uses 2 replicas, and some volumes had one replica on the machine with the failed disk, so why didn't that replica get degraded after hanging for so long?

A quick description of how our block storage is put together: on the storage machines, device mapper is used to slice the physical disks into chunks; each chunk is exported as a logical device over iscsi to the host (the physical machine running the virtual machines); on the host, software RAID assembles the two replica chunks into a raid1; one or more raid1 devices are then concatenated with device mapper into a single logical device, which is handed to the virtual machine. The host is therefore the iscsi initiator and the storage machine the iscsi target. The iscsi recovery timeout replacement_timeout is set to 10s.

With the background above we can now analyze the IO hang:
Normally, if the network between the host and the storage machine fails, the iscsi connection enters the iscsi recovery phase; if recovery takes longer than replacement_timeout, that replica returns IO errors to the software RAID, which then marks the member device as faulty, and in the degraded state all RAID reads and writes go only to the remaining healthy member.
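As a side note, the effective recovery timeout of a live session can be checked on the initiator. A small sketch (the sysfs attribute path is assumed from the iscsi_session class, and the exact output format varies with the open-iscsi version):

sudo iscsiadm -m session -P 2 | grep -i "Recovery Timeout"
# or read it straight from sysfs
cat /sys/class/iscsi_session/session*/recovery_tmo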

In both scenarios above the volume IO hang lasted quite a while, which means that during that window no IO error was ever returned to the software RAID to trigger degradation. So why does the IO hang for so long without an IO error being reported?

Reproducing the IO hang caused by a failing disk

The production system is not convenient to debug on, and reliably reproducing a failing disk in a test environment is hard. We asked the disk vendor for an error injection tool, but what they eventually provided only injects IO errors rather than making the disk hang. So, to reproduce the situation reliably, we use device mapper devices: a dm device can be suspended to hang its IO (see the suspend/resume sketch below), and on top of that iscsi and software RAID are used to mimic how the block storage is assembled.
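For readers unfamiliar with the trick, here is a minimal sketch of hanging and releasing IO with dmsetup on a throwaway loop device; the names /tmp/test.img, /dev/loop0 and test_dev are purely illustrative:

# back a linear dm device with a loop device
truncate -s 1G /tmp/test.img
sudo losetup /dev/loop0 /tmp/test.img
sudo dmsetup create test_dev --table "0 $(sudo blockdev --getsz /dev/loop0) linear /dev/loop0 0"

# while suspended, any IO submitted to /dev/mapper/test_dev blocks
sudo dmsetup suspend test_dev
# once resumed, the queued IO is released and completes
sudo dmsetup resume test_dev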

1. Create a device mapper device on each of the two storage machines

nbs@10-165-160-13:~$ sudo dmsetup create test_chunk_0 << EOF
> 0 20972800 linear /dev/sdf 1006696448
> EOF

nbs@10-165-160-14:~$ sudo dmsetup create test_chunk_1 << EOF
> 0 20972800 linear /dev/sdf 1006696448
> EOF

2. Export each chunk device with iet (iscsitarget)

nbs@10-165-160-13:~$ sudo ietadm --op new --tid=1000 --params Name=iqn.test.1
nbs@10-165-160-13:~$ sudo ietadm --op new --tid=1000 --lun=0 --params Path=/dev/mapper/test_chunk_0,Type=blockio
nbs@10-165-160-13:~$ cat /proc/net/iet/volume
tid:1000 name:iqn.test.1
lun:0 state:0 iotype:blockio iomode:wt blocks:20972800 blocksize:512 path:/dev/mapper/test_chunk_0

nbs@10-165-160-14:~$ sudo ietadm --op new --tid=1000 --params Name=iqn.test.2
nbs@10-165-160-14:~$ sudo ietadm --op new --tid=1000 --lun=0 --params Path=/dev/mapper/test_chunk_1,Type=blockio
nbs@10-165-160-14:~$ cat /proc/net/iet/volume
tid:1000 name:iqn.test.2
lun:0 state:0 iotype:blockio iomode:wt blocks:20972800 blocksize:512 path:/dev/mapper/test_chunk_1

3. On the host, connect to the two targets with open-iscsi

nbs@10-165-160-15:~$ sudo iscsiadm -m discovery -t st -p 10.165.160.13:3260
10.165.160.13:3260,1 iqn.test.1
10.165.164.13:3260,1 iqn.test.1
10.165.168.13:3260,1 iqn.test.1
10.165.172.13:3260,1 iqn.test.1
nbs@10-165-160-15:~$ sudo iscsiadm -m node -T iqn.test.1 --op=update --name=node.session.timeo.replacement_timeout --value=10
nbs@10-165-160-15:~$ sudo iscsiadm -m node -T iqn.test.1 -p 10.165.160.13:3260 -l
Logging in to [iface: default, target: iqn.test.1, portal: 10.165.160.13,3260] (multiple)
Login to [iface: default, target: iqn.test.1, portal: 10.165.160.13,3260] successful.


nbs@10-165-160-15:~$ sudo iscsiadm -m discovery -t st -p 10.165.160.14:3260
10.165.160.14:3260,1 iqn.test.2
10.165.164.14:3260,1 iqn.test.2
10.165.168.14:3260,1 iqn.test.2
10.165.172.14:3260,1 iqn.test.2
nbs@10-165-160-15:~$ sudo iscsiadm -m node -T iqn.test.2 --op=update --name=node.session.timeo.replacement_timeout --value=10
nbs@10-165-160-15:~$ sudo iscsiadm -m node -T iqn.test.2 -p 10.165.160.14:3260 -l
Logging in to [iface: default, target: iqn.test.2, portal: 10.165.160.14,3260] (multiple)
Login to [iface: default, target: iqn.test.2, portal: 10.165.160.14,3260] successful.

nbs@10-165-160-15:~$ sudo iscsiadm -m session -P3|grep -A1 Lun
scsi1 Channel 00 Id 0 Lun: 0
Attached scsi disk sdq State: running
--
scsi2 Channel 00 Id 0 Lun: 0
Attached scsi disk sdr State: running

4. On the host, create the software RAID device with mdadm

nbs@10-165-160-15:~$ sudo mdadm --create /dev/md/mdtest1 --run --force --metadata=0.9 --assume-clean --bitmap=/home/nbs/wd/mdtest1-bitmap --level=1 --raid-devices=2 /dev/sdq /dev/sdr
mdadm: array /dev/md/mdtest1 started.
nbs@10-165-160-15:~$ cat /proc/mdstat
Personalities : [raid1]
md127 : active (auto-read-only) raid1 sdr[1] sdq[0]
10486336 blocks [2/2] [UU]
bitmap: 321/321 pages [1284KB], 16KB chunk, file: /home/nbs/wd/mdtest1-bitmap

unused devices: <none>
nbs@10-165-160-15:~$ sudo mdadm --readwrite /dev/md127
nbs@10-165-160-15:~$ cat /proc/mdstat
Personalities : [raid1]
md127 : active raid1 sdr[1] sdq[0]
10486336 blocks [2/2] [UU]
bitmap: 0/321 pages [0KB], 16KB chunk, file: /home/nbs/wd/mdtest1-bitmap

unused devices: <none>

5. Map the software RAID device once more with device mapper, the same way a cloud disk volume is built

nbs@10-165-160-15:~$ sudo dmsetup create nbs_test1 << EOF
> 0 20971520 linear /dev/md127 0
> EOF
nbs@10-165-160-15:~$ sudo dmsetup table nbs_test1
0 20971520 linear 9:127 0

6. Test the logical volume with fio; to avoid generating so much IO that it muddies the analysis, cap the IO rate with fio's rate_iops

The fio config is as follows:

[global]
#write_iops_log=write_iops_log
rw=randwrite
ioengine=libaio
direct=1
iodepth=1
bsrange=4k-4k
rate_iops=10
runtime=18000
time_based
group_reporting

[dm-0]
filename=/dev/dm-0

7. Start fio, and monitor dm-0's IO on the host with iostat

iostat -mxt 1 dm-0|tee dm-0.stats

8. Suspend the chunk device on one of the storage machines

nbs@10-180-0-30:~/wd$ sudo dmsetup suspend test_chunk_0

9. Measuring how long the IO hangs

fio's IO can then be seen to hang, and iostat shows util at 100%.
The chunk device was suspended 10s after fio started, and fio was observed to return to normal at eta 04h:53m:36s (fio's countdown timer), which works out to fio's IO hanging for 6m14s; right at that point the iscsi device on the host reported an IO error.

Timestamps observed with iostat:
1) IO start: 04:50:37 PM
2) IO hang start: 04:50:47 PM
3) IO starts to recover: 04:56:58 PM
4) Since these are SAS disks, the backlog of queued IO keeps util at 100% for a while; back to normal at: 04:57:22 PM
That gives an IO hang of 6m11s, roughly the same as observed with fio.

10. Monitoring resource usage

Network connection state: the connection keeps erroring out and reconnecting

Every 1.0s: sudo ss -apnt |grep 3260                                                                                                                                            

FIN-WAIT-1 0 481 10.180.0.46:59440 10.180.0.30:3260
FIN-WAIT-1 0 481 10.180.0.46:59446 10.180.0.30:3260
ESTAB 0 480 10.180.0.46:59451 10.180.0.30:3260 users:(("iscsid",4237,7))
FIN-WAIT-1 0 481 10.180.0.46:59444 10.180.0.30:3260
FIN-WAIT-1 0 481 10.180.0.46:59442 10.180.0.30:3260
FIN-WAIT-1 0 481 10.180.0.46:59449 10.180.0.30:3260

From some point on, the target's kernel thread sits at 100% CPU

 PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND                                                                                                                                        
3138 root 20 0 0 0 0 R 100 0.0 192:04.28 istd1000

Investigation and analysis

Checking the iscsi-related timeouts

The resource monitoring above shows that while the fault is present the iscsi connection keeps reconnecting: it errors out, closes, and reconnects again. So the first suspicion was the iscsi timeout configuration. On the initiator side (open-iscsi) the relevant timeouts are:

node.conn[0].timeo.auth_timeout = 45
node.conn[0].timeo.login_timeout = 15
node.conn[0].timeo.logout_timeout = 15
node.conn[0].timeo.noop_out_interval = 5
node.conn[0].timeo.noop_out_timeout = 5
node.session.err_timeo.abort_timeout = 15
node.session.err_timeo.host_reset_timeout = 60
node.session.err_timeo.lu_reset_timeout = 30
node.session.err_timeo.tgt_reset_timeout = 30
node.session.timeo.replacement_timeout = 10

The meaning of each of these timeouts is apparent from its name, so they are not described one by one here; see the open-iscsi documentation for details.

Lowering these timeouts and retesting produced the same result: fio still hung for more than 6 minutes. So these parameters are not what drives the hang time.
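For completeness, these timeouts can be lowered with the same iscsiadm --op=update pattern shown earlier for replacement_timeout. A sketch (the target name matches this test setup, the value 5 is just an example, and the new values only take effect on the next login):

for param in "node.conn[0].timeo.noop_out_interval" \
             "node.conn[0].timeo.noop_out_timeout" \
             "node.session.err_timeo.abort_timeout" \
             "node.session.err_timeo.lu_reset_timeout"; do
    sudo iscsiadm -m node -T iqn.test.1 --op=update --name="$param" --value=5
done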

open-iscsi and iscsitarget logs

While the IO was hung, the iscsitarget log looked like this:

Jan 13 16:51:48 10-180-0-30 kernel: [2540033.667744] iscsi_trgt: Abort Task (01) issued on tid:1000 lun:0 by sid:281474997486080 (Function Complete)
Jan 13 16:52:50 10-180-0-30 kernel: [2540095.762670] iscsi_trgt: Logical Unit Reset (05) issued on tid:1000 lun:0 by sid:281474997486080 (Function Complete)
Jan 13 16:53:52 10-180-0-30 kernel: [2540157.745521] iscsi_trgt: Abort Task (01) issued on tid:1000 lun:0 by sid:281474997486080 (Function Complete)
Jan 13 16:54:54 10-180-0-30 kernel: [2540219.840457] iscsi_trgt: Logical Unit Reset (05) issued on tid:1000 lun:0 by sid:281474997486080 (Function Complete)
Jan 13 16:55:56 10-180-0-30 kernel: [2540281.823308] iscsi_trgt: Abort Task (01) issued on tid:1000 lun:0 by sid:281474997486080 (Function Complete)
Jan 13 16:56:58 10-180-0-30 kernel: [2540343.790186] iscsi_trgt: Logical Unit Reset (05) issued on tid:1000 lun:0 by sid:281474997486080 (Function Complete)
Jan 13 17:00:48 10-180-0-30 kernel: [2540574.243189] iscsi_trgt: nop_out_start(915) ignore this request 4b000000
Jan 13 17:00:48 10-180-0-30 kernel: [2540574.244276] iscsi_trgt: cmnd_rx_start(1942) 0 4b000000 -7
Jan 13 17:00:48 10-180-0-30 kernel: [2540574.244999] iscsi_trgt: cmnd_skip_pdu(471) 4b000000 0 0 0

From the target's error log, it is merely receiving and handling the abort task messages sent by the initiator; there is no obvious clue here.
The open-iscsi debug log was also enabled; during the IO hang it is full of reconnect messages, consistent with the network connection monitoring above.

The device timeout

The suspicion all along was a timeout at some layer. Since the iscsi layer gave no obvious clue, attention turned to the device layer: every Linux scsi block device has a timeout (/sys/block/<dev>/device/timeout), which defaults to 30s for devices attached over iscsi, so let's try lowering it.
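Before changing it, the current value can be read back from sysfs (sdu is the iscsi disk used in the runs below; 30 is the stock default for scsi disks):

cat /sys/block/sdu/device/timeout    # expected: 30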

1) Change the iscsi device timeout from the default 30s to 10s

echo 10 > /sys/block/sdu/device/timeout

iostat IO hang start: 11:03:11 AM
iostat IO starts to recover: 11:05:22 AM
iostat IO back to normal: 11:05:31 AM
Total IO hang time: about 2m10s

2) Shorten the scsi timeout further, to 5s

echo 5 > /sys/block/sdw/device/timeout

iostat IO hang start: 04:31:04 PM
iostat IO starts to recover: 04:32:13 PM
iostat IO back to normal: 04:32:19 PM
Total IO hang time: about 1m9s.

These results show that the scsi device timeout determines the IO hang duration: with a smaller timeout the hang is noticeably shorter. But what exactly is the relationship? The scsi debug logs below help answer that.

Enabling scsi debug logging

$ scsi_logging_level -s --error 7 --timeout 7 --all 0
New scsi logging level:
/proc/sys/dev/scsi/logging_level = 63
SCSI_LOG_ERROR=7
SCSI_LOG_TIMEOUT=7
SCSI_LOG_SCAN=0
SCSI_LOG_MLQUEUE=0
SCSI_LOG_MLCOMPLETE=0
SCSI_LOG_LLQUEUE=0
SCSI_LOG_LLCOMPLETE=0
SCSI_LOG_HLQUEUE=0
SCSI_LOG_HLCOMPLETE=0
SCSI_LOG_IOCTL=0

Comparing this with the kernel log on the host, the log shows that after the command abort the scsi error handler takes over:

Jan 17 10:48:38 10-180-0-46 -su: HISTORY: PID=4475 UID=2010 sudo fio fio.cfg 
Jan 17 10:48:45 10-180-0-46 -su: HISTORY: PID=46333 UID=2010 iostat -mxt 1 dm-0 |tee dm-0.stats
Jan 17 10:49:00 10-180-0-46 kernel: [5368505.769870] sd 7:0:0:0: [sdu] scmd ffff883373616200 abort scheduled
Jan 17 10:49:00 10-180-0-46 kernel: [5368505.777865] sd 7:0:0:0: [sdu] aborting command ffff883373616200
Jan 17 10:49:00 10-180-0-46 kernel: [5368505.777978] sd 7:0:0:0: [sdu] scmd ffff883373616200 retry aborted command
Jan 17 10:49:12 10-180-0-46 kernel: [5368517.774336] sd 7:0:0:0: [sdu] scmd ffff883373616200 previous abort failed
Jan 17 10:49:12 10-180-0-46 kernel: [5368517.774342] Waking error handler thread
Jan 17 10:49:12 10-180-0-46 kernel: [5368517.774346] scsi_eh_7: waking up 0/1/1
Jan 17 10:49:12 10-180-0-46 kernel: [5368517.774355] sd 7:0:0:0: scsi_eh_prt_fail_stats: cmds failed: 1, cancel: 0
Jan 17 10:49:12 10-180-0-46 kernel: [5368517.774362] Total of 1 commands on 1 devices require eh work
Jan 17 10:49:12 10-180-0-46 kernel: [5368517.774365] scsi_eh_7: Sending BDR sdev: 0xffff881fb23ff800
Jan 17 10:49:12 10-180-0-46 kernel: [5368517.774565] scsi_eh_done scmd: ffff883373616200 result: 2
Jan 17 10:49:12 10-180-0-46 kernel: [5368517.774573] scsi_send_eh_cmnd: scmd: ffff883373616200, timeleft: 2500
Jan 17 10:49:12 10-180-0-46 kernel: [5368517.774577] scsi_send_eh_cmnd: scsi_eh_completed_normally 2001
Jan 17 10:49:12 10-180-0-46 kernel: [5368517.774579] scsi_eh_tur: scmd ffff883373616200 rtn 2001
Jan 17 10:49:12 10-180-0-46 kernel: [5368517.774643] scsi_eh_done scmd: ffff883373616200 result: 0
Jan 17 10:49:12 10-180-0-46 kernel: [5368517.774649] scsi_send_eh_cmnd: scmd: ffff883373616200, timeleft: 2500
Jan 17 10:49:12 10-180-0-46 kernel: [5368517.774650] scsi_send_eh_cmnd: scsi_eh_completed_normally 2002
Jan 17 10:49:12 10-180-0-46 kernel: [5368517.774653] scsi_eh_tur: scmd ffff883373616200 rtn 2002
Jan 17 10:49:12 10-180-0-46 kernel: [5368517.774658] scsi_eh_7: flush retry cmd: ffff883373616200
Jan 17 10:49:12 10-180-0-46 kernel: [5368517.774665] scsi_eh_7 waking up host to restart
Jan 17 10:49:12 10-180-0-46 kernel: [5368517.774673] scsi_eh_7: sleeping
Jan 17 10:49:24 10-180-0-46 kernel: [5368529.746792] sd 7:0:0:0: [sdu] scmd ffff883373616200 abort scheduled
Jan 17 10:49:24 10-180-0-46 kernel: [5368529.754787] sd 7:0:0:0: [sdu] aborting command ffff883373616200
Jan 17 10:49:24 10-180-0-46 kernel: [5368529.754891] sd 7:0:0:0: [sdu] scmd ffff883373616200 retry aborted command
Jan 17 10:49:35 10-180-0-46 kernel: [5368540.790899] sd 7:0:0:0: [sdu] scmd ffff883373616200 previous abort failed
Jan 17 10:49:35 10-180-0-46 kernel: [5368540.790906] Waking error handler thread
Jan 17 10:49:35 10-180-0-46 kernel: [5368540.790910] scsi_eh_7: waking up 0/1/1
Jan 17 10:49:35 10-180-0-46 kernel: [5368540.790918] sd 7:0:0:0: scsi_eh_prt_fail_stats: cmds failed: 1, cancel: 0
Jan 17 10:49:35 10-180-0-46 kernel: [5368540.790925] Total of 1 commands on 1 devices require eh work
Jan 17 10:49:35 10-180-0-46 kernel: [5368540.790929] scsi_eh_7: Sending BDR sdev: 0xffff881fb23ff800
Jan 17 10:49:35 10-180-0-46 kernel: [5368540.791141] scsi_eh_done scmd: ffff883373616200 result: 2
Jan 17 10:49:35 10-180-0-46 kernel: [5368540.791150] scsi_send_eh_cmnd: scmd: ffff883373616200, timeleft: 2500
Jan 17 10:49:35 10-180-0-46 kernel: [5368540.791155] scsi_send_eh_cmnd: scsi_eh_completed_normally 2001
Jan 17 10:49:35 10-180-0-46 kernel: [5368540.791157] scsi_eh_tur: scmd ffff883373616200 rtn 2001
Jan 17 10:49:35 10-180-0-46 kernel: [5368540.791224] scsi_eh_done scmd: ffff883373616200 result: 0
Jan 17 10:49:35 10-180-0-46 kernel: [5368540.791230] scsi_send_eh_cmnd: scmd: ffff883373616200, timeleft: 2500
Jan 17 10:49:35 10-180-0-46 kernel: [5368540.791233] scsi_send_eh_cmnd: scsi_eh_completed_normally 2002
Jan 17 10:49:35 10-180-0-46 kernel: [5368540.791234] scsi_eh_tur: scmd ffff883373616200 rtn 2002
Jan 17 10:49:35 10-180-0-46 kernel: [5368540.791243] scsi_eh_7: flush retry cmd: ffff883373616200
Jan 17 10:49:35 10-180-0-46 kernel: [5368540.791249] scsi_eh_7 waking up host to restart
Jan 17 10:49:35 10-180-0-46 kernel: [5368540.791257] scsi_eh_7: sleeping
Jan 17 10:49:47 10-180-0-46 kernel: [5368552.763354] sd 7:0:0:0: [sdu] scmd ffff883373616200 abort scheduled
Jan 17 10:49:47 10-180-0-46 kernel: [5368552.771352] sd 7:0:0:0: [sdu] aborting command ffff883373616200
Jan 17 10:49:47 10-180-0-46 kernel: [5368552.771471] sd 7:0:0:0: [sdu] scmd ffff883373616200 retry aborted command
Jan 17 10:49:58 10-180-0-46 kernel: [5368563.807464] sd 7:0:0:0: [sdu] scmd ffff883373616200 previous abort failed
Jan 17 10:49:58 10-180-0-46 kernel: [5368563.807470] Waking error handler thread
Jan 17 10:49:58 10-180-0-46 kernel: [5368563.807480] scsi_eh_7: waking up 0/1/1
Jan 17 10:49:58 10-180-0-46 kernel: [5368563.807488] sd 7:0:0:0: scsi_eh_prt_fail_stats: cmds failed: 1, cancel: 0
Jan 17 10:49:58 10-180-0-46 kernel: [5368563.807494] Total of 1 commands on 1 devices require eh work
Jan 17 10:49:58 10-180-0-46 kernel: [5368563.807497] scsi_eh_7: Sending BDR sdev: 0xffff881fb23ff800
Jan 17 10:49:58 10-180-0-46 kernel: [5368563.807697] scsi_eh_done scmd: ffff883373616200 result: 2
Jan 17 10:49:58 10-180-0-46 kernel: [5368563.807705] scsi_send_eh_cmnd: scmd: ffff883373616200, timeleft: 2500
Jan 17 10:49:58 10-180-0-46 kernel: [5368563.807710] scsi_send_eh_cmnd: scsi_eh_completed_normally 2001
Jan 17 10:49:58 10-180-0-46 kernel: [5368563.807712] scsi_eh_tur: scmd ffff883373616200 rtn 2001
Jan 17 10:49:58 10-180-0-46 kernel: [5368563.807783] scsi_eh_done scmd: ffff883373616200 result: 0
Jan 17 10:49:58 10-180-0-46 kernel: [5368563.807790] scsi_send_eh_cmnd: scmd: ffff883373616200, timeleft: 2500
Jan 17 10:49:58 10-180-0-46 kernel: [5368563.807791] scsi_send_eh_cmnd: scsi_eh_completed_normally 2002
Jan 17 10:49:58 10-180-0-46 kernel: [5368563.807793] scsi_eh_tur: scmd ffff883373616200 rtn 2002
Jan 17 10:49:58 10-180-0-46 kernel: [5368563.807798] scsi_eh_7: flush finish cmd: ffff883373616200
Jan 17 10:49:58 10-180-0-46 kernel: [5368563.807804] sd 7:0:0:0: [sdu] Unhandled error code
Jan 17 10:49:58 10-180-0-46 kernel: [5368563.807806] sd 7:0:0:0: [sdu]
Jan 17 10:49:58 10-180-0-46 kernel: [5368563.807808] Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
Jan 17 10:49:58 10-180-0-46 kernel: [5368563.807809] sd 7:0:0:0: [sdu] CDB:
Jan 17 10:49:58 10-180-0-46 kernel: [5368563.807811] Write(10): 2a 00 00 eb b5 80 00 00 08 00
Jan 17 10:49:58 10-180-0-46 kernel: [5368563.807823] end_request: I/O error, dev sdu, sector 15447424
Jan 17 10:49:58 10-180-0-46 kernel: [5368563.808900] md/raid1:md127: Disk failure on sdu, disabling device.
Jan 17 10:49:58 10-180-0-46 kernel: [5368563.808900] md/raid1:md127: Operation continuing on 1 devices.
Jan 17 10:49:58 10-180-0-46 kernel: [5368563.808906] scsi_eh_7 waking up host to restart
Jan 17 10:49:58 10-180-0-46 kernel: [5368563.808912] scsi_eh_7: sleeping
Jan 17 10:49:58 10-180-0-46 kernel: [5368563.811521] RAID1 conf printout:
Jan 17 10:49:58 10-180-0-46 kernel: [5368563.811523] --- wd:1 rd:2
Jan 17 10:49:58 10-180-0-46 kernel: [5368563.811525] disk 0, wo:1, o:0, dev:sdu
Jan 17 10:49:58 10-180-0-46 kernel: [5368563.811527] disk 1, wo:0, o:1, dev:sdv
Jan 17 10:49:58 10-180-0-46 kernel: [5368563.823554] RAID1 conf printout:
Jan 17 10:49:58 10-180-0-46 kernel: [5368563.823556] --- wd:1 rd:2
Jan 17 10:49:58 10-180-0-46 kernel: [5368563.823558] disk 1, wo:0, o:1, dev:sdv

Logs on the target side:

Jan 17 10:49:00 10-180-0-30 kernel: [2863985.237066] iscsi_trgt: Abort Task (01) issued on tid:1000 lun:0 by sid:12384899046375936 (Function Complete)
Jan 17 10:49:12 10-180-0-30 kernel: [2863997.233531] iscsi_trgt: Logical Unit Reset (05) issued on tid:1000 lun:0 by sid:12384899046375936 (Function Complete)
Jan 17 10:49:24 10-180-0-30 kernel: [2864009.213910] iscsi_trgt: Abort Task (01) issued on tid:1000 lun:0 by sid:12384899046375936 (Function Complete)
Jan 17 10:49:35 10-180-0-30 kernel: [2864020.250035] iscsi_trgt: Logical Unit Reset (05) issued on tid:1000 lun:0 by sid:12384899046375936 (Function Complete)
Jan 17 10:49:47 10-180-0-30 kernel: [2864032.230419] iscsi_trgt: Abort Task (01) issued on tid:1000 lun:0 by sid:12384899046375936 (Function Complete)
Jan 17 10:49:58 10-180-0-30 kernel: [2864043.266527] iscsi_trgt: Logical Unit Reset (05) issued on tid:1000 lun:0 by sid:12384899046375936 (Function Complete)

These logs show several rounds of abort task followed by error handling, but to understand the exact relationship we have to look for clues in the corresponding Linux kernel code.

Using ftrace to trace the key function calls

While reading the Linux kernel source, ftrace was used to trace when the key functions get called.

sudo mount -t debugfs debugfs /sys/kernel/debug

echo 1 > /sys/kernel/debug/tracing/events/scsi/scsi_dispatch_cmd_start/enable
echo 1 > /sys/kernel/debug/tracing/events/scsi/scsi_dispatch_cmd_timeout/enable

The trace can then be read from /sys/kernel/debug/tracing/trace. Since it contains events for every device, filter by the scsi host/channel/id/lun of the device of interest:

grep "host_no=7 channel=0 id=0 lun=0" /sys/kernel/debug/tracing/trace

Trace output during the IO hang window of the test

       <...>-15169 [005] .... 5387221.071608: scsi_dispatch_cmd_start: host_no=7 channel=0 id=0 lun=0 data_sgl=1 prot_sgl=0 prot_op=SCSI_PROT_NORMAL cmnd=(WRITE_10 lba=3300944 txlen=8 protect=0 raw=2a 00 00 32 5e 50 00 00 08 00)
<idle>-0 [003] dNs. 5387226.715800: scsi_dispatch_cmd_timeout: host_no=7 channel=0 id=0 lun=0 data_sgl=1 prot_sgl=0 prot_op=SCSI_PROT_NORMAL cmnd=(WRITE_10 lba=3300944 txlen=8 protect=0 raw=2a 00 00 32 5e 50 00 00 08 00) result=(driver=DRIVER_OK host=DID_OK message=COMMAND_COMPLETE status=SAM_STAT_GOOD)
<idle>-0 [003] d.s. 5387232.718032: scsi_dispatch_cmd_timeout: host_no=7 channel=0 id=0 lun=0 data_sgl=1 prot_sgl=0 prot_op=SCSI_PROT_NORMAL cmnd=(WRITE_10 lba=3300944 txlen=8 protect=0 raw=2a 00 00 32 5e 50 00 00 08 00) result=(driver=DRIVER_OK host=DID_TIME_OUT message=COMMAND_COMPLETE status=SAM_STAT_GOOD)
kworker/3:1H-612 [003] .... 5387232.734040: scsi_dispatch_cmd_start: host_no=7 channel=0 id=0 lun=0 data_sgl=1 prot_sgl=0 prot_op=SCSI_PROT_NORMAL cmnd=(WRITE_10 lba=3300944 txlen=8 protect=0 raw=2a 00 00 32 5e 50 00 00 08 00)
<idle>-0 [003] dNs. 5387238.720266: scsi_dispatch_cmd_timeout: host_no=7 channel=0 id=0 lun=0 data_sgl=1 prot_sgl=0 prot_op=SCSI_PROT_NORMAL cmnd=(WRITE_10 lba=3300944 txlen=8 protect=0 raw=2a 00 00 32 5e 50 00 00 08 00) result=(driver=DRIVER_OK host=DID_OK message=COMMAND_COMPLETE status=SAM_STAT_GOOD)
<idle>-0 [003] dNs. 5387244.722500: scsi_dispatch_cmd_timeout: host_no=7 channel=0 id=0 lun=0 data_sgl=1 prot_sgl=0 prot_op=SCSI_PROT_NORMAL cmnd=(WRITE_10 lba=3300944 txlen=8 protect=0 raw=2a 00 00 32 5e 50 00 00 08 00) result=(driver=DRIVER_OK host=DID_TIME_OUT message=COMMAND_COMPLETE status=SAM_STAT_GOOD)
kworker/5:1H-655 [005] .... 5387244.726522: scsi_dispatch_cmd_start: host_no=7 channel=0 id=0 lun=0 data_sgl=1 prot_sgl=0 prot_op=SCSI_PROT_NORMAL cmnd=(WRITE_10 lba=3300944 txlen=8 protect=0 raw=2a 00 00 32 5e 50 00 00 08 00)
<idle>-0 [005] dNs. 5387250.692735: scsi_dispatch_cmd_timeout: host_no=7 channel=0 id=0 lun=0 data_sgl=1 prot_sgl=0 prot_op=SCSI_PROT_NORMAL cmnd=(WRITE_10 lba=3300944 txlen=8 protect=0 raw=2a 00 00 32 5e 50 00 00 08 00) result=(driver=DRIVER_OK host=DID_OK message=COMMAND_COMPLETE status=SAM_STAT_GOOD)
<idle>-0 [005] dNs. 5387256.710961: scsi_dispatch_cmd_timeout: host_no=7 channel=0 id=0 lun=0 data_sgl=1 prot_sgl=0 prot_op=SCSI_PROT_NORMAL cmnd=(WRITE_10 lba=3300944 txlen=8 protect=0 raw=2a 00 00 32 5e 50 00 00 08 00) result=(driver=DRIVER_OK host=DID_TIME_OUT message=COMMAND_COMPLETE status=SAM_STAT_GOOD)
kworker/3:1H-612 [003] .... 5387256.726970: scsi_dispatch_cmd_start: host_no=7 channel=0 id=0 lun=0 data_sgl=1 prot_sgl=0 prot_op=SCSI_PROT_NORMAL cmnd=(WRITE_10 lba=3300944 txlen=8 protect=0 raw=2a 00 00 32 5e 50 00 00 08 00)
<idle>-0 [003] dNs. 5387262.729203: scsi_dispatch_cmd_timeout: host_no=7 channel=0 id=0 lun=0 data_sgl=1 prot_sgl=0 prot_op=SCSI_PROT_NORMAL cmnd=(WRITE_10 lba=3300944 txlen=8 protect=0 raw=2a 00 00 32 5e 50 00 00 08 00) result=(driver=DRIVER_OK host=DID_OK message=COMMAND_COMPLETE status=SAM_STAT_GOOD)
<idle>-0 [003] dNs. 5387268.731435: scsi_dispatch_cmd_timeout: host_no=7 channel=0 id=0 lun=0 data_sgl=1 prot_sgl=0 prot_op=SCSI_PROT_NORMAL cmnd=(WRITE_10 lba=3300944 txlen=8 protect=0 raw=2a 00 00 32 5e 50 00 00 08 00) result=(driver=DRIVER_OK host=DID_TIME_OUT message=COMMAND_COMPLETE status=SAM_STAT_GOOD)
kworker/3:1H-612 [003] .... 5387268.735459: scsi_dispatch_cmd_start: host_no=7 channel=0 id=0 lun=0 data_sgl=1 prot_sgl=0 prot_op=SCSI_PROT_NORMAL cmnd=(WRITE_10 lba=3300944 txlen=8 protect=0 raw=2a 00 00 32 5e 50 00 00 08 00)
<idle>-0 [003] dNs. 5387274.733669: scsi_dispatch_cmd_timeout: host_no=7 channel=0 id=0 lun=0 data_sgl=1 prot_sgl=0 prot_op=SCSI_PROT_NORMAL cmnd=(WRITE_10 lba=3300944 txlen=8 protect=0 raw=2a 00 00 32 5e 50 00 00 08 00) result=(driver=DRIVER_OK host=DID_OK message=COMMAND_COMPLETE status=SAM_STAT_GOOD)
<idle>-0 [003] dNs. 5387280.735902: scsi_dispatch_cmd_timeout: host_no=7 channel=0 id=0 lun=0 data_sgl=1 prot_sgl=0 prot_op=SCSI_PROT_NORMAL cmnd=(WRITE_10 lba=3300944 txlen=8 protect=0 raw=2a 00 00 32 5e 50 00 00 08 00) result=(driver=DRIVER_OK host=DID_TIME_OUT message=COMMAND_COMPLETE status=SAM_STAT_GOOD)
kworker/3:1H-612 [003] .... 5387280.751909: scsi_dispatch_cmd_start: host_no=7 channel=0 id=0 lun=0 data_sgl=1 prot_sgl=0 prot_op=SCSI_PROT_NORMAL cmnd=(WRITE_10 lba=3300944 txlen=8 protect=0 raw=2a 00 00 32 5e 50 00 00 08 00)
<idle>-0 [003] d.s. 5387286.738133: scsi_dispatch_cmd_timeout: host_no=7 channel=0 id=0 lun=0 data_sgl=1 prot_sgl=0 prot_op=SCSI_PROT_NORMAL cmnd=(WRITE_10 lba=3300944 txlen=8 protect=0 raw=2a 00 00 32 5e 50 00 00 08 00) result=(driver=DRIVER_OK host=DID_OK message=COMMAND_COMPLETE status=SAM_STAT_GOOD)
<idle>-0 [003] dNs. 5387292.740369: scsi_dispatch_cmd_timeout: host_no=7 channel=0 id=0 lun=0 data_sgl=1 prot_sgl=0 prot_op=SCSI_PROT_NORMAL cmnd=(WRITE_10 lba=3300944 txlen=8 protect=0 raw=2a 00 00 32 5e 50 00 00 08 00) result=(driver=DRIVER_OK host=DID_TIME_OUT message=COMMAND_COMPLETE status=SAM_STAT_GOOD)

And the host's scsi kernel log over the same window:

Jan 17 16:00:29 10-180-0-46 bash: HISTORY: PID=3642 UID=0 echo "" > trace
Jan 17 16:00:39 10-180-0-46 -su: HISTORY: PID=4475 UID=2010 sudo fio fio.cfg
Jan 17 16:00:41 10-180-0-46 -su: HISTORY: PID=46333 UID=2010 iostat -mxt 1 dm-0 |tee dm-0.stats
Jan 17 16:01:00 10-180-0-46 kernel: [5387232.718042] sd 7:0:0:0: [sdu] scmd ffff88377bcb56c0 abort scheduled
Jan 17 16:01:00 10-180-0-46 kernel: [5387232.726038] sd 7:0:0:0: [sdu] aborting command ffff88377bcb56c0
Jan 17 16:01:00 10-180-0-46 kernel: [5387232.726148] sd 7:0:0:0: [sdu] scmd ffff88377bcb56c0 retry aborted command
Jan 17 16:01:12 10-180-0-46 kernel: [5387244.722511] sd 7:0:0:0: [sdu] scmd ffff88377bcb56c0 previous abort failed
Jan 17 16:01:12 10-180-0-46 kernel: [5387244.722518] Waking error handler thread
Jan 17 16:01:12 10-180-0-46 kernel: [5387244.722521] scsi_eh_7: waking up 0/1/1
Jan 17 16:01:12 10-180-0-46 kernel: [5387244.722531] sd 7:0:0:0: scsi_eh_prt_fail_stats: cmds failed: 1, cancel: 0
Jan 17 16:01:12 10-180-0-46 kernel: [5387244.722537] Total of 1 commands on 1 devices require eh work
Jan 17 16:01:12 10-180-0-46 kernel: [5387244.722541] scsi_eh_7: Sending BDR sdev: 0xffff881fb23ff800
Jan 17 16:01:12 10-180-0-46 kernel: [5387244.722735] scsi_eh_done scmd: ffff88377bcb56c0 result: 2
Jan 17 16:01:12 10-180-0-46 kernel: [5387244.722743] scsi_send_eh_cmnd: scmd: ffff88377bcb56c0, timeleft: 1250
Jan 17 16:01:12 10-180-0-46 kernel: [5387244.722746] scsi_send_eh_cmnd: scsi_eh_completed_normally 2001
Jan 17 16:01:12 10-180-0-46 kernel: [5387244.722749] scsi_eh_tur: scmd ffff88377bcb56c0 rtn 2001
Jan 17 16:01:12 10-180-0-46 kernel: [5387244.722816] scsi_eh_done scmd: ffff88377bcb56c0 result: 0
Jan 17 16:01:12 10-180-0-46 kernel: [5387244.722822] scsi_send_eh_cmnd: scmd: ffff88377bcb56c0, timeleft: 1250
Jan 17 16:01:12 10-180-0-46 kernel: [5387244.722824] scsi_send_eh_cmnd: scsi_eh_completed_normally 2002
Jan 17 16:01:12 10-180-0-46 kernel: [5387244.722828] scsi_eh_tur: scmd ffff88377bcb56c0 rtn 2002
Jan 17 16:01:12 10-180-0-46 kernel: [5387244.722832] scsi_eh_7: flush retry cmd: ffff88377bcb56c0
Jan 17 16:01:12 10-180-0-46 kernel: [5387244.722840] scsi_eh_7 waking up host to restart
Jan 17 16:01:12 10-180-0-46 kernel: [5387244.722851] scsi_eh_7: sleeping
Jan 17 16:01:24 10-180-0-46 kernel: [5387256.710972] sd 7:0:0:0: [sdu] scmd ffff88377bcb56c0 abort scheduled
Jan 17 16:01:24 10-180-0-46 kernel: [5387256.718969] sd 7:0:0:0: [sdu] aborting command ffff88377bcb56c0
Jan 17 16:01:24 10-180-0-46 kernel: [5387256.719133] sd 7:0:0:0: [sdu] scmd ffff88377bcb56c0 retry aborted command
Jan 17 16:01:36 10-180-0-46 kernel: [5387268.731446] sd 7:0:0:0: [sdu] scmd ffff88377bcb56c0 previous abort failed
Jan 17 16:01:36 10-180-0-46 kernel: [5387268.731452] Waking error handler thread
Jan 17 16:01:36 10-180-0-46 kernel: [5387268.731455] scsi_eh_7: waking up 0/1/1
Jan 17 16:01:36 10-180-0-46 kernel: [5387268.731464] sd 7:0:0:0: scsi_eh_prt_fail_stats: cmds failed: 1, cancel: 0
Jan 17 16:01:36 10-180-0-46 kernel: [5387268.731470] Total of 1 commands on 1 devices require eh work
Jan 17 16:01:36 10-180-0-46 kernel: [5387268.731473] scsi_eh_7: Sending BDR sdev: 0xffff881fb23ff800
Jan 17 16:01:36 10-180-0-46 kernel: [5387268.731668] scsi_eh_done scmd: ffff88377bcb56c0 result: 2
Jan 17 16:01:36 10-180-0-46 kernel: [5387268.731674] scsi_send_eh_cmnd: scmd: ffff88377bcb56c0, timeleft: 1250
Jan 17 16:01:36 10-180-0-46 kernel: [5387268.731676] scsi_send_eh_cmnd: scsi_eh_completed_normally 2001
Jan 17 16:01:36 10-180-0-46 kernel: [5387268.731678] scsi_eh_tur: scmd ffff88377bcb56c0 rtn 2001
Jan 17 16:01:36 10-180-0-46 kernel: [5387268.731742] scsi_eh_done scmd: ffff88377bcb56c0 result: 0
Jan 17 16:01:36 10-180-0-46 kernel: [5387268.731747] scsi_send_eh_cmnd: scmd: ffff88377bcb56c0, timeleft: 1250
Jan 17 16:01:36 10-180-0-46 kernel: [5387268.731748] scsi_send_eh_cmnd: scsi_eh_completed_normally 2002
Jan 17 16:01:36 10-180-0-46 kernel: [5387268.731750] scsi_eh_tur: scmd ffff88377bcb56c0 rtn 2002
Jan 17 16:01:36 10-180-0-46 kernel: [5387268.731754] scsi_eh_7: flush retry cmd: ffff88377bcb56c0
Jan 17 16:01:36 10-180-0-46 kernel: [5387268.731761] scsi_eh_7 waking up host to restart
Jan 17 16:01:36 10-180-0-46 kernel: [5387268.731770] scsi_eh_7: sleeping
Jan 17 16:01:48 10-180-0-46 kernel: [5387280.735913] sd 7:0:0:0: [sdu] scmd ffff88377bcb56c0 abort scheduled
Jan 17 16:01:48 10-180-0-46 kernel: [5387280.743906] sd 7:0:0:0: [sdu] aborting command ffff88377bcb56c0
Jan 17 16:01:48 10-180-0-46 kernel: [5387280.744017] sd 7:0:0:0: [sdu] scmd ffff88377bcb56c0 retry aborted command
Jan 17 16:01:55 10-180-0-46 bash: HISTORY: PID=3642 UID=0 grep "host_no=7 channel=0 id=0 lun=0" trace
Jan 17 16:02:00 10-180-0-46 kernel: [5387292.740380] sd 7:0:0:0: [sdu] scmd ffff88377bcb56c0 previous abort failed
Jan 17 16:02:00 10-180-0-46 kernel: [5387292.740387] Waking error handler thread
Jan 17 16:02:00 10-180-0-46 kernel: [5387292.740391] scsi_eh_7: waking up 0/1/1
Jan 17 16:02:00 10-180-0-46 kernel: [5387292.740400] sd 7:0:0:0: scsi_eh_prt_fail_stats: cmds failed: 1, cancel: 0
Jan 17 16:02:00 10-180-0-46 kernel: [5387292.740405] Total of 1 commands on 1 devices require eh work
Jan 17 16:02:00 10-180-0-46 kernel: [5387292.740408] scsi_eh_7: Sending BDR sdev: 0xffff881fb23ff800
Jan 17 16:02:00 10-180-0-46 kernel: [5387292.740630] scsi_eh_done scmd: ffff88377bcb56c0 result: 2
Jan 17 16:02:00 10-180-0-46 kernel: [5387292.740638] scsi_send_eh_cmnd: scmd: ffff88377bcb56c0, timeleft: 1250
Jan 17 16:02:00 10-180-0-46 kernel: [5387292.740642] scsi_send_eh_cmnd: scsi_eh_completed_normally 2001
Jan 17 16:02:00 10-180-0-46 kernel: [5387292.740644] scsi_eh_tur: scmd ffff88377bcb56c0 rtn 2001
Jan 17 16:02:00 10-180-0-46 kernel: [5387292.740716] scsi_eh_done scmd: ffff88377bcb56c0 result: 0
Jan 17 16:02:00 10-180-0-46 kernel: [5387292.740722] scsi_send_eh_cmnd: scmd: ffff88377bcb56c0, timeleft: 1250
Jan 17 16:02:00 10-180-0-46 kernel: [5387292.740723] scsi_send_eh_cmnd: scsi_eh_completed_normally 2002
Jan 17 16:02:00 10-180-0-46 kernel: [5387292.740727] scsi_eh_tur: scmd ffff88377bcb56c0 rtn 2002
Jan 17 16:02:00 10-180-0-46 kernel: [5387292.740733] scsi_eh_7: flush finish cmd: ffff88377bcb56c0
Jan 17 16:02:00 10-180-0-46 kernel: [5387292.740737] sd 7:0:0:0: [sdu] Unhandled error code
Jan 17 16:02:00 10-180-0-46 kernel: [5387292.740739] sd 7:0:0:0: [sdu]
Jan 17 16:02:00 10-180-0-46 kernel: [5387292.740740] Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
Jan 17 16:02:00 10-180-0-46 kernel: [5387292.740742] sd 7:0:0:0: [sdu] CDB:
Jan 17 16:02:00 10-180-0-46 kernel: [5387292.740747] Write(10): 2a 00 00 32 5e 50 00 00 08 00
Jan 17 16:02:00 10-180-0-46 kernel: [5387292.740755] end_request: I/O error, dev sdu, sector 3300944
Jan 17 16:02:00 10-180-0-46 kernel: [5387292.742331] md/raid1:md127: Disk failure on sdu, disabling device.
Jan 17 16:02:00 10-180-0-46 kernel: [5387292.742331] md/raid1:md127: Operation continuing on 1 devices.
Jan 17 16:02:00 10-180-0-46 kernel: [5387292.742335] scsi_eh_7 waking up host to restart
Jan 17 16:02:00 10-180-0-46 kernel: [5387292.742343] scsi_eh_7: sleeping
Jan 17 16:02:00 10-180-0-46 kernel: [5387292.745838] RAID1 conf printout:
Jan 17 16:02:00 10-180-0-46 kernel: [5387292.745840] --- wd:1 rd:2
Jan 17 16:02:00 10-180-0-46 kernel: [5387292.745842] disk 0, wo:1, o:0, dev:sdu
Jan 17 16:02:00 10-180-0-46 kernel: [5387292.745843] disk 1, wo:0, o:1, dev:sdv
Jan 17 16:02:00 10-180-0-46 kernel: [5387292.756382] RAID1 conf printout:
Jan 17 16:02:00 10-180-0-46 kernel: [5387292.756384] --- wd:1 rd:2
Jan 17 16:02:00 10-180-0-46 kernel: [5387292.756386] disk 1, wo:0, o:1, dev:sdv

The trace shows the scsi command starting at 5387221.071608 and timing out at 5387226.715800, almost exactly 5s later.
The kernel log, however, only prints a message at 5387232.718042, which is the time of the second timeout in the trace; this looks odd at first...

Jan 17 16:01:00 10-180-0-46 kernel: [5387232.718042] sd 7:0:0:0: [sdu] scmd ffff88377bcb56c0 abort scheduled

What can be confirmed is that the trace shows the scsi cmnd being retried 5 times, which the kernel code confirms is SD_MAX_RETRIES, with the final timeout at 5387292.740369, exactly matching the timestamp at which the kernel log reports the IO error.

Jan 17 16:02:00 10-180-0-46 kernel: [5387292.740737] sd 7:0:0:0: [sdu] Unhandled error code
Jan 17 16:02:00 10-180-0-46 kernel: [5387292.740739] sd 7:0:0:0: [sdu]
Jan 17 16:02:00 10-180-0-46 kernel: [5387292.740740] Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK

From the scsi trace, the interval from scsi_dispatch_cmd_start at 5387221.071608 to the last scsi_dispatch_cmd_timeout at 5387292.740369 is 5387292 - 5387221 = 71s, which matches the IO hang time observed with fio and iostat.
So the IO hang time works out to 2*timeout + SD_MAX_RETRIES*(2*timeout); allowing for the up-to-1s round-up of each timeout adds roughly another 2*SD_MAX_RETRIES*1, giving a total of:

2*timeout + SD_MAX_RETRIES*(2*timeout) + 2*SD_MAX_RETRIES*1

So with timeout set to 5s, the total IO hang time works out to 2*5 + 5*(2*5) + 2*5*1 = 70s.
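Plugging the other tested timeouts into the same formula gives a quick sanity check against the measurements above (SD_MAX_RETRIES is 5):

# hang ~= 2*t + SD_MAX_RETRIES*(2*t) + 2*SD_MAX_RETRIES*1 seconds
for t in 5 10 30; do
    echo "timeout=${t}s -> io hang ~ $((2*t + 5*2*t + 2*5*1))s"
done
# timeout=5s  -> 70s   (measured: ~69-71s)
# timeout=10s -> 130s  (measured: ~2m10s)
# timeout=30s -> 370s  (measured: ~6m11s in the original runs with the default 30s)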

Along the way, more iscsi kernel logging can be obtained by setting these module parameters:

echo 1 > /sys/module/libiscsi/parameters/debug_libiscsi_conn
echo 1 > /sys/module/libiscsi/parameters/debug_libiscsi_eh
echo 1 > /sys/module/libiscsi/parameters/debug_libiscsi_session
echo 1 > /sys/module/libiscsi_tcp/parameters/debug_libiscsi_tcp

Since only the error handling path matters here, enabling just one of them is enough:

echo 1 > /sys/module/libiscsi/parameters/debug_libiscsi_eh

So is the estimate above actually correct? Let's verify it against the kernel source.

Analyzing the scsi command handling code in the Linux kernel

scsi device initialization

When the system discovers a new scsi device it registers it, and part of that initialization is registering the device's request handler scsi_request_fn(), its completion handler scsi_softirq_done(), and its timeout handler scsi_times_out().

scsi_alloc_sdev
  ->scsi_alloc_queue                  // result assigned to sdev->request_queue
    ->__scsi_alloc_queue(sdev->host, scsi_request_fn)
      ->blk_init_queue()
        ->blk_init_queue_node()
          ->blk_alloc_queue_node()
            ->setup_timer(&q->timeout, blk_rq_timed_out_timer, (unsigned long) q)   // set up the timeout timer
    ->blk_queue_prep_rq(q, scsi_prep_fn);
    ->blk_queue_softirq_done(q, scsi_softirq_done);
    ->blk_queue_rq_timed_out(q, scsi_times_out);
    ->blk_queue_lld_busy(q, scsi_lld_busy);

scsi cmnd dispatch

Once the block layer is done with an IO it calls q->request_fn to hand it to the device driver layer; for scsi devices that is scsi_request_fn. scsi_request_fn calls scsi_dispatch_cmd, which via hostt->queuecommand places the command on the low-level driver's queue for processing.

scsi_request_fn
  ->blk_peek_request
  ->blk_start_request
    ->blk_add_timer              // arm the request timeout
  ->scsi_dispatch_cmd
    ->cmd->scsi_done = scsi_done
    ->rtn = host->hostt->queuecommand(host, cmd)

scsi cmnd completion handling

When the low-level driver finishes a scsi command, the per-command completion callback registered at dispatch time, scsi_done(), is invoked; it schedules a softirq whose handler eventually calls scsi_softirq_done().

scsi_softirq_done() is the handler that gets called once the LLD indicates the command completed.

scsi_done()
  ->blk_complete_request()
    ->raises a softirq
      ->blk_done_softirq()
        ->scsi_softirq_done()

Setting and handling the scsi cmnd timeout

blk_add_timer sets req->timeout (from q->rq_timeout, which is set in sd_probe() and is the value exposed in /sys/block/<dev>/device/timeout) and adds the request to q->timeout_list, to be handled by the timer function. Note that the expiry is rounded up to the nearest second, which is why in practice the timeout is observed with an error of up to about 1s.

void blk_add_timer(struct request *req)
{
    struct request_queue *q = req->q;
    unsigned long expiry;

    if (!q->rq_timed_out_fn)
        return;

    BUG_ON(!list_empty(&req->timeout_list));

    /*
     * Some LLDs, like scsi, peek at the timeout to prevent a
     * command from being retried forever.
     */
    if (!req->timeout)
        req->timeout = q->rq_timeout;

    req->deadline = jiffies + req->timeout;
    list_add_tail(&req->timeout_list, &q->timeout_list);

    /*
     * If the timer isn't already pending or this timeout is earlier
     * than an existing one, modify the timer. Round up to next nearest
     * second.
     */
    /* Note: the expiry is rounded up to the nearest second, which is why
     * the observed timeout can be off by up to about 1s. */
    expiry = round_jiffies_up(req->deadline);

    if (!timer_pending(&q->timeout) ||
        time_before(expiry, q->timeout.expires))
        mod_timer(&q->timeout, expiry);
}

blk_alloc_queue_node calls setup_timer at initialization time to set up a timer (whose handler is blk_rq_timed_out_timer) and initializes q->timeout_list; when the timer fires, q->timeout_list is walked and each expired req is handed to timeout handling.

void blk_rq_timed_out_timer(unsigned long data)
{
    struct request_queue *q = (struct request_queue *) data;
    unsigned long flags, next = 0;
    struct request *rq, *tmp;
    int next_set = 0;

    spin_lock_irqsave(q->queue_lock, flags);

    list_for_each_entry_safe(rq, tmp, &q->timeout_list, timeout_list) {
        if (time_after_eq(jiffies, rq->deadline)) {
            list_del_init(&rq->timeout_list);

            /*
             * Check if we raced with end io completion
             */
            if (blk_mark_rq_complete(rq))
                continue;
            blk_rq_timed_out(rq);
        } else if (!next_set || time_after(next, rq->deadline)) {
            next = rq->deadline;
            next_set = 1;
        }
    }

    if (next_set)
        mod_timer(&q->timeout, round_jiffies_up(next));

    spin_unlock_irqrestore(q->queue_lock, flags);
}

blk_rq_timed_out_timer
  ->blk_rq_timed_out
    ->q->rq_timed_out_fn, i.e. scsi_times_out
      ->scsi_times_out
        ->scsi_eh_scmd_add
          ->scsi_eh_wakeup
            ->wake_up_process(shost->ehandler)

Since the test showed that a retry only begins after two timeout periods have elapsed, and the next retry comes another two timeout periods later, the logic of scsi_times_out deserves a closer look: before calling scsi_eh_scmd_add it checks whether host->transportt->eh_timed_out(scmd) returned BLK_EH_NOT_HANDLED.
For iscsi, eh_timed_out is iscsi_eh_cmd_timed_out; when iscsi_eh_cmd_timed_out returns BLK_EH_RESET_TIMER, blk_rq_timed_out simply re-arms the timer for the request.

enum blk_eh_timer_return scsi_times_out(struct request *req)
{
    struct scsi_cmnd *scmd = req->special;
    enum blk_eh_timer_return rtn = BLK_EH_NOT_HANDLED;
    struct Scsi_Host *host = scmd->device->host;

    trace_scsi_dispatch_cmd_timeout(scmd);
    scsi_log_completion(scmd, TIMEOUT_ERROR);

    if (host->transportt->eh_timed_out)
        rtn = host->transportt->eh_timed_out(scmd);
    else if (host->hostt->eh_timed_out)
        rtn = host->hostt->eh_timed_out(scmd);

    scmd->result |= DID_TIME_OUT << 16;

    /* Only when rtn == BLK_EH_NOT_HANDLED does scsi_eh_scmd_add get called
     * to wake up the error handler thread. */
    if (unlikely(rtn == BLK_EH_NOT_HANDLED &&
                 !scsi_eh_scmd_add(scmd, SCSI_EH_CANCEL_CMD)))
        rtn = BLK_EH_HANDLED;

    return rtn;
}

Why does each retry only start after 2 timeouts? Reading the kernel code confirms that for an iscsi device, the first time scsi_times_out calls iscsi's iscsi_eh_cmd_timed_out the conditions for waking the scsi error handler thread are not met, so the timer is re-armed and another wait begins; only after the second timeout does the handling satisfy the conditions to wake scsi_eh. Hence every retry costs 2 timeout periods.
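The ftrace excerpt above bears this out: successive scsi_dispatch_cmd_start events for the same WRITE_10 are spaced about 2*timeout apart. A quick check using the timestamps copied from that trace (bc handles the floating-point subtraction):

starts=(5387221.071608 5387232.734040 5387244.726522 5387256.726970 5387268.735459 5387280.751909)
for i in 1 2 3 4 5; do
    echo "re-dispatch $i comes $(echo "${starts[$i]} - ${starts[$((i-1))]}" | bc)s after the previous one"
done
# each gap is ~12s: 2*timeout (5s each) plus up to ~1s of timer round-up per timeout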

Determining the number of retries

First, an outline of the scsi error handler's processing flow:

scsi_error_handler
  ->scsi_unjam_host
    ->scsi_eh_prt_fail_stats
    ->scsi_eh_get_sense
      ->scsi_request_sense
        ->scsi_send_eh_cmnd
          ->scmd->scsi_done = scsi_eh_done
    ->scsi_eh_abort_cmds
      ->scsi_try_to_abort_cmd
        ->hostt->eh_abort_handler      // for iscsi this is iscsi_eh_abort
          ->iscsi_exec_task_mgmt_fn
    ->scsi_eh_flush_done_q

The retry limit is checked in scsi_eh_flush_done_q, and that limit is SD_MAX_RETRIES (its definition is confirmed right after the listing below).

void scsi_eh_flush_done_q(struct list_head *done_q)
{
    struct scsi_cmnd *scmd, *next;

    list_for_each_entry_safe(scmd, next, done_q, eh_entry) {
        list_del_init(&scmd->eh_entry);
        if (scsi_device_online(scmd->device) &&
            !scsi_noretry_cmd(scmd) &&
            (++scmd->retries <= scmd->allowed)) {   /* the retry count is checked here */
            SCSI_LOG_ERROR_RECOVERY(3, printk("%s: flush"
                                              " retry cmd: %p\n",
                                              current->comm,
                                              scmd));
            scsi_queue_insert(scmd, SCSI_MLQUEUE_EH_RETRY);
        } else {
            /*
             * If just we got sense for the device (called
             * scsi_eh_get_sense), scmd->result is already
             * set, do not set DRIVER_TIMEOUT.
             */
            if (!scmd->result)
                scmd->result |= (DRIVER_TIMEOUT << 24);
            SCSI_LOG_ERROR_RECOVERY(3, printk("%s: flush finish"
                                              " cmd: %p\n",
                                              current->comm, scmd));
            scsi_finish_command(scmd);
        }
    }
}
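For reference, scmd->allowed for disk commands comes from SD_MAX_RETRIES in the sd driver, which is what caps the retries at 5. It can be confirmed straight from a kernel source tree (path as in mainline; the value below is what 3.x era kernels define):

grep -rn "define SD_MAX_RETRIES" drivers/scsi/sd.h
# expected: #define SD_MAX_RETRIES       5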

Summary

The analysis above shows that the root cause of the IO hang is the timeout and retry handling of commands in the scsi layer.
The IO hang time works out to 2*timeout + SD_MAX_RETRIES*(2*timeout); adding the roughly 1s round-up error of each timeout timer contributes another 2*SD_MAX_RETRIES*1:

io hang time = 2*timeout + SD_MAX_RETRIES*(2*timeout) + 2*SD_MAX_RETRIES*1