Well, let's see.
Before I shut down the iscsitarget service on my SAN server, everything was fine and dandy. From cluster1 (the only active node at the time):
Code:
cps@cluster1:~$ netstat -ntp | grep 192.168.55.11
(No info could be read for "-p": geteuid()=1000 but you should be root.)
tcp 0 0 192.168.55.12:56798 192.168.55.11:3260 ESTABLISHED -
Then I shut down iscsitarget, and almost immediately cluster1 started logging errors about it in syslog:
Code:
Jan 3 08:53:26 cluster1 kernel: [ 338.605211] connection1:0: detected conn error (1020)
Jan 3 08:53:27 cluster1 iscsid: Kernel reported iSCSI connection 1:0 error (1020) state (3)
Jan 3 08:53:30 cluster1 iscsid: connect to 192.168.55.11:3260 failed (Connection refused)
Jan 3 08:54:03 cluster1 iscsid: last message repeated 9 times
Jan 3 08:55:04 cluster1 iscsid: last message repeated 16 times
Jan 3 08:55:27 cluster1 iscsid: last message repeated 6 times
Jan 3 08:55:27 cluster1 kernel: [ 458.856309] session1: session recovery timed out after 120 secs
Jan 3 08:55:27 cluster1 kernel: [ 458.856512] sd 2:0:0:0: [sdb] Unhandled error code
Jan 3 08:55:27 cluster1 kernel: [ 458.856519] sd 2:0:0:0: [sdb] Result: hostbyte=DID_TRANSPORT_FAILFAST driverbyte=DRIVER_OK
Jan 3 08:55:27 cluster1 kernel: [ 458.856550] sd 2:0:0:0: [sdb] CDB: Write(10): 2a 00 00 00 00 7c 00 00 10 00
Jan 3 08:55:27 cluster1 kernel: [ 458.856571] end_request: I/O error, dev sdb, sector 124
Jan 3 08:55:27 cluster1 kernel: [ 458.870106] Buffer I/O error on device sdb5, logical block 0
Jan 3 08:55:27 cluster1 kernel: [ 458.876521] lost page write due to I/O error on sdb5
Jan 3 08:55:27 cluster1 kernel: [ 458.876529] Buffer I/O error on device sdb5, logical block 1
Jan 3 08:55:27 cluster1 kernel: [ 458.881319] lost page write due to I/O error on sdb5
Jan 3 08:55:27 cluster1 kernel: [ 458.881334] sd 2:0:0:0: [sdb] Unhandled error code
Jan 3 08:55:27 cluster1 kernel: [ 458.881336] sd 2:0:0:0: [sdb] Result: hostbyte=DID_TRANSPORT_FAILFAST driverbyte=DRIVER_OK
Jan 3 08:55:27 cluster1 kernel: [ 458.881338] sd 2:0:0:0: [sdb] CDB: Write(10): 2a 00 00 54 00 84 00 00 10 00
Jan 3 08:55:27 cluster1 kernel: [ 458.881344] end_request: I/O error, dev sdb, sector 5505156
Jan 3 08:55:27 cluster1 kernel: [ 458.885717] Buffer I/O error on device sdb5, logical block 688129
Jan 3 08:55:27 cluster1 kernel: [ 458.890340] lost page write due to I/O error on sdb5
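To make the timing in that log explicit: the I/O errors only start once the session recovery timer expires, roughly 120 seconds after the first connection error. That figure matches the default of open-iscsi's node.session.timeo.replacement_timeout setting, though I have not confirmed that this is the value configured on cluster1. Below is a minimal sketch that measures the gap from the kernel uptime stamps; the function name recovery_gap is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: reads syslog lines on stdin and prints the seconds
# elapsed between the first "detected conn error" message and the
# "session recovery timed out" message, based on the kernel [uptime] stamps.
recovery_gap() {
    awk '
        # Return the number between the first "[" and the first "]".
        function kstamp(line,    i, j) {
            i = index(line, "[")
            j = index(line, "]")
            return substr(line, i + 1, j - i - 1) + 0
        }
        /detected conn error/ && !start { start = kstamp($0) }
        /session recovery timed out/    { end = kstamp($0) }
        END { if (start && end) printf "%.1f\n", end - start }
    '
}
```

Running it as `recovery_gap < /var/log/syslog` (log path assumed) on the excerpt above prints 120.3, i.e. 458.856309 minus 338.605211.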
However, Pacemaker doesn't seem to care much about it:
Code:
cps@cluster1:~$ sudo crm_mon -1
============
Last updated: Thu Jan 3 08:57:28 2013
Stack: openais
Current DC: cluster1 - partition WITHOUT quorum
Version: 1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd
2 Nodes configured, 2 expected votes
1 Resources configured.
============
Online: [ cluster1 ]
OFFLINE: [ cluster2 ]
Resource Group: sanos
ip_flotante (ocf::heartbeat:IPaddr2): Started cluster1
san (ocf::heartbeat:iscsi): Started cluster1
sanprobedelay (ocf::heartbeat:Delay): Started cluster1
datapostgres (ocf::heartbeat:Filesystem): Started cluster1
wwwsanos (ocf::heartbeat:Filesystem): Started cluster1
wwwsesion (ocf::heartbeat:Filesystem): Started cluster1
postgres (lsb:postgresql-8.4): Started cluster1
pgbouncer (lsb:pgbouncer): Started cluster1
apache (lsb:apache2): Started cluster1
Failed actions:
pgbouncer_monitor_0 (node=cluster1, call=9, rc=1, status=complete): unknown error
pgbouncer is always complaining, but the cluster normally works fine, so I don't think that's something to worry about.
crm status shows the same thing:
Code:
cps@cluster1:~$ sudo crm status
============
Last updated: Thu Jan 3 08:58:48 2013
Stack: openais
Current DC: cluster1 - partition WITHOUT quorum
Version: 1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd
2 Nodes configured, 2 expected votes
1 Resources configured.
============
Online: [ cluster1 ]
OFFLINE: [ cluster2 ]
Resource Group: sanos
ip_flotante (ocf::heartbeat:IPaddr2): Started cluster1
san (ocf::heartbeat:iscsi): Started cluster1
sanprobedelay (ocf::heartbeat:Delay): Started cluster1
datapostgres (ocf::heartbeat:Filesystem): Started cluster1
wwwsanos (ocf::heartbeat:Filesystem): Started cluster1
wwwsesion (ocf::heartbeat:Filesystem): Started cluster1
postgres (lsb:postgresql-8.4): Started cluster1
pgbouncer (lsb:pgbouncer): Started cluster1
apache (lsb:apache2): Started cluster1
Failed actions:
pgbouncer_monitor_0 (node=cluster1, call=9, rc=1, status=complete): unknown error
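About the pgbouncer_monitor_0 entry that shows up in both outputs: an operation ending in _monitor_0 is Pacemaker's initial probe, and for an lsb: resource that probe runs the init script's status action and interprets the exit code according to the LSB spec (0 means running, 3 means not running). One possible explanation for rc=1, and this is only a guess, is that the pgbouncer init script returns 1 instead of 3 when the service is stopped. A small sketch for checking it by hand; lsb_status_meaning is a hypothetical helper and the init script path is assumed from the resource name.

```shell
#!/bin/sh
# Hypothetical helper: translate the exit code of an LSB init script's
# "status" action into the meaning the LSB spec assigns to it.
lsb_status_meaning() {
    case "$1" in
        0) echo "running" ;;
        1) echo "dead, pid file exists" ;;
        2) echo "dead, lock file exists" ;;
        3) echo "not running" ;;
        4) echo "status unknown" ;;
        *) echo "reserved or application-specific" ;;
    esac
}

# Usage sketch (path is an assumption, run with pgbouncer stopped):
#   sudo /etc/init.d/pgbouncer status; lsb_status_meaning $?
```

If that prints anything other than "not running" while pgbouncer is stopped, the script is probably not LSB-compliant enough for Pacemaker's probe.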
Next I tried the cleanup, to see what happens when I clean up the san resource. After a few seconds it ended up like this:
Code:
cps@cluster1:~$ sudo crm status
============
Last updated: Thu Jan 3 09:01:13 2013
Stack: openais
Current DC: cluster1 - partition WITHOUT quorum
Version: 1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd
2 Nodes configured, 2 expected votes
1 Resources configured.
============
Online: [ cluster1 ]
OFFLINE: [ cluster2 ]
Resource Group: sanos
ip_flotante (ocf::heartbeat:IPaddr2): Started cluster1
san (ocf::heartbeat:iscsi): Started cluster1 FAILED
sanprobedelay (ocf::heartbeat:Delay): Started cluster1
datapostgres (ocf::heartbeat:Filesystem): Started cluster1
wwwsanos (ocf::heartbeat:Filesystem): Started cluster1
wwwsesion (ocf::heartbeat:Filesystem): Started cluster1
postgres (lsb:postgresql-8.4): Started cluster1 (unmanaged) FAILED
pgbouncer (lsb:pgbouncer): Stopped
apache (lsb:apache2): Stopped
Failed actions:
pgbouncer_monitor_0 (node=cluster1, call=9, rc=1, status=complete): unknown error
postgres_stop_0 (node=cluster1, call=24, rc=1, status=complete): unknown error
san_monitor_0 (node=cluster1, call=21, rc=1, status=complete): unknown error
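My reading of that last output, offered as an interpretation rather than a confirmed diagnosis: the cleanup wiped the recorded state of san, so Pacemaker probed it again (san_monitor_0), the probe failed with rc=1 because the target is gone, and the group was then stopped in reverse order. apache and pgbouncer stopped cleanly, but postgres_stop_0 failed, and a failed stop that cannot be resolved by fencing leaves the resource unmanaged and blocks the rest of the group. To pull just the failed operations out of such output, here is a minimal sketch; failed_actions is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: reads "crm status" or "crm_mon -1" output on stdin
# and prints "<operation> rc=<code>" for each line under "Failed actions:".
failed_actions() {
    awk '
        /^[ \t]*Failed actions:/ { in_failed = 1; next }
        in_failed && /rc=[0-9]+/ {
            op = $1                    # e.g. san_monitor_0
            rc = $0
            sub(/^.*rc=/, "", rc)      # keep what follows "rc="
            sub(/[^0-9].*$/, "", rc)   # cut at the first non-digit
            print op, "rc=" rc
        }
    '
}
```

For example, `sudo crm status | failed_actions` on the state above would list pgbouncer_monitor_0, postgres_stop_0 and san_monitor_0, each with rc=1.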