LinuxQuestions.org > Linux - Server
Thread: pacemaker - no checking for iscsi status?
(http://www.linuxquestions.org/questions/linux-server-73/pacemaker-no-checking-for-iscsi-status-4175443813/)

eantoranz 01-02-2013 09:08 AM

pacemaker - no checking for iscsi status?
 
Hi!

I just figured out how to set up an iSCSI resource in Pacemaker. As a first test I shut down the iscsitarget service on the SAN server, expecting Pacemaker to notice within a few seconds that the service was down. So far, though, Pacemaker reports that everything is fine, which is not what should happen.
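
For the record, "shutting down the service" just means stopping it on the SAN box, something like this (the prompt/hostname here is illustrative):

Code:

root@san:~# service iscsitarget stop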

Why is this failure not detected? Thanks in advance.

vishesh 01-03-2013 07:03 AM

Hi

Can you please share the output of the following command:

root# crm status

You can also try running the following command:

root# crm resource cleanup <resource>

If that doesn't work, please share your Pacemaker configuration so that we can help you better.
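
It would also be worth confirming that the resource has a recurring monitor operation defined; Pacemaker only detects a failure when a monitor actually runs. You can inspect the resource definition with:

root# crm configure show <resource>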

Thanks

eantoranz 01-03-2013 07:33 AM

Well, let's see.

Before I shut down the iscsitarget service on my SAN server, everything was fine and dandy. From cluster1 (the only active node at the time):

Code:

cps@cluster1:~$ netstat -ntp | grep 192.168.55.11
(No info could be read for "-p": geteuid()=1000 but you should be root.)
tcp        0      0 192.168.55.12:56798    192.168.55.11:3260      ESTABLISHED -

Then I shut down iscsitarget, and almost immediately messages about it started showing up in syslog on cluster1:

Code:

Jan  3 08:53:26 cluster1 kernel: [  338.605211]  connection1:0: detected conn error (1020)
Jan  3 08:53:27 cluster1 iscsid: Kernel reported iSCSI connection 1:0 error (1020) state (3)
Jan  3 08:53:30 cluster1 iscsid: connect to 192.168.55.11:3260 failed (Connection refused)
Jan  3 08:54:03 cluster1 iscsid: last message repeated 9 times
Jan  3 08:55:04 cluster1 iscsid: last message repeated 16 times
Jan  3 08:55:27 cluster1 iscsid: last message repeated 6 times
Jan  3 08:55:27 cluster1 kernel: [  458.856309]  session1: session recovery timed out after 120 secs
Jan  3 08:55:27 cluster1 kernel: [  458.856512] sd 2:0:0:0: [sdb] Unhandled error code
Jan  3 08:55:27 cluster1 kernel: [  458.856519] sd 2:0:0:0: [sdb] Result: hostbyte=DID_TRANSPORT_FAILFAST driverbyte=DRIVER_OK
Jan  3 08:55:27 cluster1 kernel: [  458.856550] sd 2:0:0:0: [sdb] CDB: Write(10): 2a 00 00 00 00 7c 00 00 10 00
Jan  3 08:55:27 cluster1 kernel: [  458.856571] end_request: I/O error, dev sdb, sector 124
Jan  3 08:55:27 cluster1 kernel: [  458.870106] Buffer I/O error on device sdb5, logical block 0
Jan  3 08:55:27 cluster1 kernel: [  458.876521] lost page write due to I/O error on sdb5
Jan  3 08:55:27 cluster1 kernel: [  458.876529] Buffer I/O error on device sdb5, logical block 1
Jan  3 08:55:27 cluster1 kernel: [  458.881319] lost page write due to I/O error on sdb5
Jan  3 08:55:27 cluster1 kernel: [  458.881334] sd 2:0:0:0: [sdb] Unhandled error code
Jan  3 08:55:27 cluster1 kernel: [  458.881336] sd 2:0:0:0: [sdb] Result: hostbyte=DID_TRANSPORT_FAILFAST driverbyte=DRIVER_OK
Jan  3 08:55:27 cluster1 kernel: [  458.881338] sd 2:0:0:0: [sdb] CDB: Write(10): 2a 00 00 54 00 84 00 00 10 00
Jan  3 08:55:27 cluster1 kernel: [  458.881344] end_request: I/O error, dev sdb, sector 5505156
Jan  3 08:55:27 cluster1 kernel: [  458.885717] Buffer I/O error on device sdb5, logical block 688129
Jan  3 08:55:27 cluster1 kernel: [  458.890340] lost page write due to I/O error on sdb5

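For what it's worth, the initiator's view of the session can also be checked directly with open-iscsi's own tool (the -P 1 flag prints per-session detail, including the connection state):

Code:

cps@cluster1:~$ sudo iscsiadm -m session -P 1
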
Pacemaker, however, doesn't seem to care much about it:

Code:

cps@cluster1:~$ sudo crm_mon -1
============
Last updated: Thu Jan  3 08:57:28 2013
Stack: openais
Current DC: cluster1 - partition WITHOUT quorum
Version: 1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd
2 Nodes configured, 2 expected votes
1 Resources configured.
============

Online: [ cluster1 ]
OFFLINE: [ cluster2 ]

 Resource Group: sanos
    ip_flotante        (ocf::heartbeat:IPaddr2):      Started cluster1
    san        (ocf::heartbeat:iscsi): Started cluster1
    sanprobedelay      (ocf::heartbeat:Delay): Started cluster1
    datapostgres      (ocf::heartbeat:Filesystem):    Started cluster1
    wwwsanos  (ocf::heartbeat:Filesystem):    Started cluster1
    wwwsesion  (ocf::heartbeat:Filesystem):    Started cluster1
    postgres  (lsb:postgresql-8.4):  Started cluster1
    pgbouncer  (lsb:pgbouncer):        Started cluster1
    apache    (lsb:apache2):  Started cluster1

Failed actions:
    pgbouncer_monitor_0 (node=cluster1, call=9, rc=1, status=complete): unknown error

pgbouncer is always complaining (that failed action is left over from its initial probe, the _monitor_0 operation), but the cluster normally works fine, so I don't think that's something to worry about.
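
If I ever want to clear that stale entry, the cleanup suggested above should do it:

Code:

cps@cluster1:~$ sudo crm resource cleanup pgbouncer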

crm status shows the same thing:

Code:

cps@cluster1:~$ sudo crm status
============
Last updated: Thu Jan  3 08:58:48 2013
Stack: openais
Current DC: cluster1 - partition WITHOUT quorum
Version: 1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd
2 Nodes configured, 2 expected votes
1 Resources configured.
============

Online: [ cluster1 ]
OFFLINE: [ cluster2 ]

 Resource Group: sanos
    ip_flotante        (ocf::heartbeat:IPaddr2):      Started cluster1
    san        (ocf::heartbeat:iscsi): Started cluster1
    sanprobedelay      (ocf::heartbeat:Delay): Started cluster1
    datapostgres      (ocf::heartbeat:Filesystem):    Started cluster1
    wwwsanos  (ocf::heartbeat:Filesystem):    Started cluster1
    wwwsesion  (ocf::heartbeat:Filesystem):    Started cluster1
    postgres  (lsb:postgresql-8.4):  Started cluster1
    pgbouncer  (lsb:pgbouncer):        Started cluster1
    apache    (lsb:apache2):  Started cluster1

Failed actions:
    pgbouncer_monitor_0 (node=cluster1, call=9, rc=1, status=complete): unknown error

I'll try the cleanup next. Let's see what happens when I clean up the san resource. After a few seconds it ended up like this:
Code:

cps@cluster1:~$ sudo crm status
============
Last updated: Thu Jan  3 09:01:13 2013
Stack: openais
Current DC: cluster1 - partition WITHOUT quorum
Version: 1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd
2 Nodes configured, 2 expected votes
1 Resources configured.
============

Online: [ cluster1 ]
OFFLINE: [ cluster2 ]

 Resource Group: sanos
    ip_flotante        (ocf::heartbeat:IPaddr2):      Started cluster1
    san        (ocf::heartbeat:iscsi): Started cluster1 FAILED
    sanprobedelay      (ocf::heartbeat:Delay): Started cluster1
    datapostgres      (ocf::heartbeat:Filesystem):    Started cluster1
    wwwsanos  (ocf::heartbeat:Filesystem):    Started cluster1
    wwwsesion  (ocf::heartbeat:Filesystem):    Started cluster1
    postgres  (lsb:postgresql-8.4):  Started cluster1 (unmanaged) FAILED
    pgbouncer  (lsb:pgbouncer):        Stopped
    apache    (lsb:apache2):  Stopped

Failed actions:
    pgbouncer_monitor_0 (node=cluster1, call=9, rc=1, status=complete): unknown error
    postgres_stop_0 (node=cluster1, call=24, rc=1, status=complete): unknown error
    san_monitor_0 (node=cluster1, call=21, rc=1, status=complete): unknown error
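
Interesting: the cleanup forced a fresh probe (that's the san_monitor_0 action in the failed list above), and it was that probe that finally flagged the failure. So it looks like Pacemaker only notices the dead session when a monitor operation actually runs, and my san primitive most likely has no recurring monitor op. If that's the case, the fix should be adding one; a rough sketch (the portal matches my SAN, the target iqn is a placeholder, and the interval/timeout values are just a guess on my part):

Code:

primitive san ocf:heartbeat:iscsi \
        params portal="192.168.55.11:3260" target="<my-target-iqn>" \
        op monitor interval="30s" timeout="40s"

With a recurring monitor in place, the failure should be detected within one interval instead of only after a manual cleanup.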


