频繁执行iscsiadm导致iscsid不进行target的重连

open-iscsi实现了iscsi的自动重连,当target所在机器重启或者target的网络异常时,open-iscsi都会一直尝试connect,一旦当target重建出来后,open-iscsi就能够重新login成功。但是测试时发现,当频繁执行iscsiadm的命令时,会导致open-iscsi不去进行重连了。

观察到的现象

当target异常时,strace跟踪iscsid进程会发现一直尝试重连,不过一直失败。

1
2
3
4
5
6
$ sudo strace -e connect -p 28022 -t
Process 28022 attached - interrupt to quit
15:19:31 connect(10, {sa_family=AF_INET, sin_port=htons(3260), sin_addr=inet_addr("10.180.0.30")}, 128) = -1 EINPROGRESS (Operation now in progress)
15:19:34 connect(10, {sa_family=AF_INET, sin_port=htons(3260), sin_addr=inet_addr("10.180.0.30")}, 128) = -1 EINPROGRESS (Operation now in progress)
15:19:37 connect(10, {sa_family=AF_INET, sin_port=htons(3260), sin_addr=inet_addr("10.180.0.30")}, 128) = -1 EINPROGRESS (Operation now in progress)
15:19:40 connect(10, {sa_family=AF_INET, sin_port=htons(3260), sin_addr=inet_addr("10.180.0.30")}, 128) = -1 EINPROGRESS (Operation now in progress)

模拟频繁执行iscsiadm的命令

1
$ for i in {1..10000};do sudo iscsiadm -m session -P3 2&> /dev/null;done

这个时候去strace跟踪会发现没有去重连target的信息,反而是一些奇怪的记录。

1
2
3
4
5
17:24:12 connect(9, {sa_family=AF_FILE, path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
17:24:13 connect(9, {sa_family=AF_FILE, path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
17:24:14 connect(9, {sa_family=AF_FILE, path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
17:24:14 connect(9, {sa_family=AF_FILE, path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
17:24:15 connect(9, {sa_family=AF_FILE, path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)

开启iscsid的debug日志

1
sudo /usr/sbin/iscsid -d 8

重复上面的过程,当频繁执行iscsiadm命令时,iscsid日志如下:

1
2
3
4
5
6
7
8
9
10
Feb 22 19:37:26 10-180-0-46 iscsid: poll result 1
Feb 22 19:37:26 10-180-0-46 iscsid: mgmt_ipc_write_rsp: rsp to fd 8
Feb 22 19:37:26 10-180-0-46 iscsid: poll result 1
Feb 22 19:37:26 10-180-0-46 iscsid: mgmt_ipc_write_rsp: rsp to fd 8
Feb 22 19:37:26 10-180-0-46 iscsid: poll result 1
Feb 22 19:37:26 10-180-0-46 iscsid: mgmt_ipc_write_rsp: rsp to fd 8
Feb 22 19:37:26 10-180-0-46 iscsid: poll result 1
Feb 22 19:37:26 10-180-0-46 iscsid: mgmt_ipc_write_rsp: rsp to fd 8
Feb 22 19:37:26 10-180-0-46 iscsid: poll result 1
Feb 22 19:37:26 10-180-0-46 iscsid: mgmt_ipc_write_rsp: rsp to fd 8

一旦停止了iscsiadm的命令,就可以看到下面的日志:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
Feb 22 19:37:26 10-180-0-46 iscsid: poll result 1
Feb 22 19:37:26 10-180-0-46 iscsid: mgmt_ipc_write_rsp: rsp to fd 8
Feb 22 19:37:26 10-180-0-46 iscsid: poll result 1
Feb 22 19:37:26 10-180-0-46 iscsid: mgmt_ipc_write_rsp: rsp to fd 8
Feb 22 19:37:27 10-180-0-46 iscsid: thread 01e8e2e8 wait some more
Feb 22 19:37:28 10-180-0-46 iscsid: thread 01e8e2e8 wait some more
Feb 22 19:37:28 10-180-0-46 iscsid: thread 01e8e2e8 was scheduled at 33742:8, curtime 33752 q_forw 0x68b330 &pend_list 0x68b330
Feb 22 19:37:28 10-180-0-46 iscsid: thread 01e8e2e8 now in actor_list
Feb 22 19:37:28 10-180-0-46 iscsid: exec thread 01e8e2e8 callback
Feb 22 19:37:28 10-180-0-46 iscsid: iscsi_login_eh
Feb 22 19:37:28 10-180-0-46 iscsid: login failed ISCSI_CONN_STATE_XPT_WAIT/R_STAGE_SESSION_REOPEN 8
Feb 22 19:37:28 10-180-0-46 iscsid: re-opening session 11 (reopen_cnt 8)
Feb 22 19:37:28 10-180-0-46 iscsid: thread 01e8e2e8 delete: state 3
Feb 22 19:37:28 10-180-0-46 iscsid: thread 01e8e320 delete: state 3
Feb 22 19:37:28 10-180-0-46 iscsid: get ev context 0x1e921e0
Feb 22 19:37:28 10-180-0-46 iscsid: set TCP recv window size to 524288, actually got 425984
Feb 22 19:37:28 10-180-0-46 iscsid: set TCP send window size to 524288, actually got 425984
Feb 22 19:37:28 10-180-0-46 iscsid: connecting to 10.180.0.30:3260
Feb 22 19:37:28 10-180-0-46 iscsid: sched conn context 0x1e921e0 event 2, tmo 0
Feb 22 19:37:28 10-180-0-46 iscsid: thread 0x1e921e0 schedule: delay 0 state 3
Feb 22 19:37:28 10-180-0-46 iscsid: Setting login timer 0x1e8e2e8 timeout 15
Feb 22 19:37:28 10-180-0-46 iscsid: thread 0x1e8e2e8 schedule: delay 60 state 3
Feb 22 19:37:28 10-180-0-46 iscsid: thread removed
Feb 22 19:37:28 10-180-0-46 iscsid: thread 01e921e0 removed from poll_list
Feb 22 19:37:29 10-180-0-46 kernel: [8511784.201478] connection11:0: detected conn error (1020)
Feb 22 19:37:29 10-180-0-46 iscsid: exec thread 01e921e0 callback
Feb 22 19:37:29 10-180-0-46 iscsid: put ev context 0x1e921e0
Feb 22 19:37:29 10-180-0-46 iscsid: connected local port 34342 to 10.180.0.30:3260
Feb 22 19:37:29 10-180-0-46 iscsid: in kbind_conn
Feb 22 19:37:29 10-180-0-46 iscsid: in __kipc_call
Feb 22 19:37:29 10-180-0-46 iscsid: in kwritev
Feb 22 19:37:29 10-180-0-46 iscsid: in nlpayload_read
Feb 22 19:37:29 10-180-0-46 iscsid: in nlpayload_read
Feb 22 19:37:29 10-180-0-46 iscsid: bound iSCSI connection 11:0 to session 11
Feb 22 19:37:29 10-180-0-46 iscsid: sysfs_attr_get_value: open '/class/iscsi_connection/connection11:0'/'exp_statsn'
Feb 22 19:37:29 10-180-0-46 iscsid: sysfs_attr_get_value: new uncached attribute '/sys/class/iscsi_connection/connection11:0/exp_statsn'
Feb 22 19:37:29 10-180-0-46 iscsid: sysfs_attr_get_value: add to cache '/sys/class/iscsi_connection/connection11:0/exp_statsn'
Feb 22 19:37:29 10-180-0-46 iscsid: sysfs_attr_get_value: cache '/sys/class/iscsi_connection/connection11:0/exp_statsn' with attribute value '440'
Feb 22 19:37:29 10-180-0-46 iscsid: sending login PDU with current stage 0, next stage 1, transit 0x80, isid 0x00023d0b0000 exp_statsn 440
Feb 22 19:37:29 10-180-0-46 iscsid: > InitiatorName=iqn.1993-08.org.debian:01:838ab867906c
Feb 22 19:37:29 10-180-0-46 iscsid: > InitiatorAlias=10-180-0-46
Feb 22 19:37:29 10-180-0-46 iscsid: > TargetName=iqn.test.1
Feb 22 19:37:29 10-180-0-46 iscsid: > SessionType=Normal
Feb 22 19:37:29 10-180-0-46 iscsid: > AuthMethod=CHAP,None
Feb 22 19:37:29 10-180-0-46 iscsid: in ksend_pdu_begin
Feb 22 19:37:29 10-180-0-46 iscsid: send PDU began for hdr 48 bytes and data 144 bytes
Feb 22 19:37:29 10-180-0-46 iscsid: in kwritev
Feb 22 19:37:29 10-180-0-46 iscsid: wrote 48 bytes of PDU header
Feb 22 19:37:29 10-180-0-46 iscsid: in kwritev
Feb 22 19:37:29 10-180-0-46 iscsid: wrote 144 bytes of PDU data
Feb 22 19:37:29 10-180-0-46 iscsid: in ksend_pdu_end
Feb 22 19:37:29 10-180-0-46 iscsid: in __kipc_call
Feb 22 19:37:29 10-180-0-46 iscsid: in kwritev
Feb 22 19:37:29 10-180-0-46 iscsid: in nlpayload_read
Feb 22 19:37:29 10-180-0-46 iscsid: in nlpayload_read
Feb 22 19:37:29 10-180-0-46 iscsid: send PDU finished for conn 11:0
Feb 22 19:37:29 10-180-0-46 iscsid: thread removed
Feb 22 19:37:29 10-180-0-46 iscsid: poll result 1
Feb 22 19:37:29 10-180-0-46 iscsid: in ctldev_handle
Feb 22 19:37:29 10-180-0-46 iscsid: in nl_read
Feb 22 19:37:29 10-180-0-46 iscsid: ctldev_handle got event type 101
Feb 22 19:37:29 10-180-0-46 iscsid: get ev context 0x1e921e0
Feb 22 19:37:29 10-180-0-46 iscsid: message real length is 120 bytes, recv_handle 0x1e92230
Feb 22 19:37:29 10-180-0-46 iscsid: in nlpayload_read
Feb 22 19:37:29 10-180-0-46 iscsid: sched conn context 0x1e921e0 event 1, tmo 0
Feb 22 19:37:29 10-180-0-46 iscsid: thread 0x1e921e0 schedule: delay 0 state 3
Feb 22 19:37:29 10-180-0-46 iscsid: poll result 1
Feb 22 19:37:29 10-180-0-46 iscsid: in ctldev_handle
Feb 22 19:37:29 10-180-0-46 iscsid: in nl_read
Feb 22 19:37:29 10-180-0-46 iscsid: ctldev_handle got event type 102
Feb 22 19:37:29 10-180-0-46 iscsid: get ev context 0x1e942b0
Feb 22 19:37:29 10-180-0-46 iscsid: message real length is 72 bytes, recv_handle 0x1e94300
Feb 22 19:37:29 10-180-0-46 iscsid: in nlpayload_read
Feb 22 19:37:29 10-180-0-46 iscsid: sched conn context 0x1e942b0 event 3, tmo 0
Feb 22 19:37:29 10-180-0-46 iscsid: thread 0x1e942b0 schedule: delay 0 state 3
Feb 22 19:37:29 10-180-0-46 iscsid: exec thread 01e921e0 callback
Feb 22 19:37:29 10-180-0-46 iscsid: in krecv_pdu_begin
Feb 22 19:37:29 10-180-0-46 iscsid: recv PDU began, pdu handle 0x1e92268
Feb 22 19:37:29 10-180-0-46 iscsid: in kread 0 48 0x1e8be58 0x1e92268
Feb 22 19:37:29 10-180-0-46 iscsid: read 48 bytes of PDU header
Feb 22 19:37:29 10-180-0-46 iscsid: read 48 PDU header bytes, opcode 0x23, dlength 0, data 0x1e8bec8, max 8192
Feb 22 19:37:29 10-180-0-46 iscsid: in krecv_pdu_end
Feb 22 19:37:29 10-180-0-46 iscsid: recv PDU finished for pdu handle 0x0x1e92268
Feb 22 19:37:29 10-180-0-46 iscsid: put ev context 0x1e921e0
Feb 22 19:37:29 10-180-0-46 iscsid: login response status 0203
Feb 22 19:37:29 10-180-0-46 iscsid: login_rsp ret (0)
Feb 22 19:37:29 10-180-0-46 iscsid: conn 0 login rejected: initiator error - target not found (02/03)
Feb 22 19:37:29 10-180-0-46 iscsid: iscsi_login_eh
Feb 22 19:37:29 10-180-0-46 iscsid: thread 01e942b0 delete: state 2
Feb 22 19:37:29 10-180-0-46 iscsid: deleting a scheduled/waiting thread!
Feb 22 19:37:29 10-180-0-46 iscsid: put ev context 0x1e942b0
Feb 22 19:37:29 10-180-0-46 iscsid: login failed ISCSI_CONN_STATE_IN_LOGIN/R_STAGE_SESSION_REOPEN 8
Feb 22 19:37:29 10-180-0-46 iscsid: re-opening session 11 (reopen_cnt 8)
Feb 22 19:37:29 10-180-0-46 iscsid: thread 01e8e2e8 delete: state 1
Feb 22 19:37:29 10-180-0-46 iscsid: deleting a scheduled/waiting thread!
Feb 22 19:37:29 10-180-0-46 iscsid: thread 01e8e320 delete: state 3
Feb 22 19:37:29 10-180-0-46 iscsid: disconnecting conn 0x1e8be20, fd 8
Feb 22 19:37:29 10-180-0-46 iscsid: in kstop_conn
Feb 22 19:37:29 10-180-0-46 iscsid: in __kipc_call
Feb 22 19:37:29 10-180-0-46 iscsid: in kwritev
Feb 22 19:37:29 10-180-0-46 iscsid: in nlpayload_read
Feb 22 19:37:29 10-180-0-46 iscsid: in nlpayload_read
Feb 22 19:37:29 10-180-0-46 iscsid: connection 11:0 is stopped for recovery
Feb 22 19:37:29 10-180-0-46 iscsid: Waiting 2 seconds before trying to reconnect.

open-iscsi的部分实现

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
void event_loop(struct iscsi_ipc *ipc, int control_fd, int mgmt_ipc_fd)
{
...........
res = poll(poll_array, POLL_MAX, ACTOR_RESOLUTION);
if (res > 0) {
log_debug(6, "poll result %d", res);
if (poll_array[POLL_CTRL].revents)
ipc->ctldev_handle();

if (poll_array[POLL_IPC].revents)
mgmt_ipc_handle(mgmt_ipc_fd);
} else if (res < 0) {
if (errno == EINTR) {
log_debug(1, "event_loop interrupted");
} else {
log_error("got poll() error (%d), errno (%d), "
"exiting", res, errno);
break;
}
} else
actor_poll();
...........
}

open-iscsi里通过poll的超时来实现的定时器,重连机制就是注册了一个定时器,然后当poll返回0(表示timeout)时,就调用到actor_poll()进而调用之前注册的回调函数。当频繁调用iscsiadm命令时就会触发poll事件,从而一直没有去调用actor_poll()导致的现象就是iscsi没去重连。