LinuxQuestions.org

sosborne 05-04-2006 12:53 PM

"NIM thread blocked" & "Deadman Switch" errors
 
I've been looking for descriptions of these services, but haven't found much information on them yet. I've been regularly getting the following errors:

IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
864D2CE3   0501145606 P S topsvcs        NIM thread blocked
FA723BD9   0501145606 I S topsvcs        Deadman Switch (DMS) close to trigger
864D2CE3   0501145606 P S topsvcs        NIM thread blocked
.
.
.
.
3C81E43F   0501145606 P U topsvcs        Late in sending heartbeat
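
For anyone reading the summary listing above: the TIMESTAMP column is in MMDDhhmmYY form, so 0501145606 is May 1, 2006 at 14:56. A minimal Python sketch for decoding such lines, assuming the whitespace-separated column layout shown here (the function name and the sample text are only illustrative):

```python
from datetime import datetime


def parse_errpt_summary(text):
    """Parse lines of errpt summary output into a list of dictionaries."""
    entries = []
    for line in text.splitlines():
        # Six columns; the description is the remainder of the line.
        fields = line.split(None, 5)
        if len(fields) < 6 or fields[0] == "IDENTIFIER":
            continue  # header, blank line, or elision marker
        ident, stamp, err_type, err_class, resource, description = fields
        if len(stamp) != 10 or not stamp.isdigit():
            continue  # not a MMDDhhmmYY timestamp
        entries.append({
            "identifier": ident,
            "when": datetime.strptime(stamp, "%m%d%H%M%y"),
            "type": err_type,
            "class": err_class,
            "resource": resource,
            "description": description.strip(),
        })
    return entries


SAMPLE = """\
IDENTIFIER TIMESTAMP T C RESOURCE_NAME DESCRIPTION
864D2CE3 0501145606 P S topsvcs NIM thread blocked
FA723BD9 0501145606 I S topsvcs Deadman Switch (DMS) close to trigger
3C81E43F 0501145606 P U topsvcs Late in sending heartbeat
"""

for entry in parse_errpt_summary(SAMPLE):
    print(entry["when"].isoformat(), entry["identifier"], entry["description"])
```

Grouping the parsed entries by time makes it easier to see that all three error types were logged in the same minute.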

Now, from what little I've been able to gather, this sounds like a buffer overrun problem; however, the response to the original poster's questions was unclear.

Does anybody know what I'm looking at here? I'm hoping to understand what is behind the problem, find the solution, and then apply it; but, I'm hoping to learn the 'why's' =).

Any help would be greatly appreciated, thank you.

Steve

Michael AM 05-09-2006 02:26 PM

In this context, NIM is the abbreviation for "Network Interface Module" and should not be confused with NIM, the "Network Installation Manager".

What you should be trying to look at is the output of the command:

lssrc -ls topsvcs

I don't have an HACMP cluster handy at the moment, but among other things it will tell you how well the heartbeats are being passed.

It appears that you have only one network in your cluster configuration. A non-IP network is needed (read: required) to prevent network failures (or NIM failures) from creating a partitioned cluster.

Basically, the function of the deadman switch is to keep track of when the node was last able to tell the other active nodes that it is still active. A message or heartbeat sent over ANY of the networks is enough to satisfy the deadman switch requirement. Losing all networks (note the plural) is not a single point of failure (SPOF), and HACMP is designed to handle a single failure. That it often handles more is a bonus, not design.
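
To make the idea concrete, here is a minimal Python sketch of a deadman-style timer. It is only an illustration of the principle described above, not how HACMP or Topology Services implements it; the class name, the 10-second timeout, and the network names are all hypothetical.

```python
import time


class DeadmanSwitch:
    """Tracks how long it has been since a heartbeat went out on any network."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout = timeout_seconds
        self.clock = clock
        self.last_heartbeat = clock()
        self.last_network = None

    def heartbeat_sent(self, network):
        # A heartbeat over ANY network resets the timer.
        self.last_heartbeat = self.clock()
        self.last_network = network

    def seconds_left(self):
        return self.timeout - (self.clock() - self.last_heartbeat)

    def expired(self):
        return self.seconds_left() <= 0


# Demonstration with a fake clock so the example runs instantly.
now = [0.0]
dms = DeadmanSwitch(timeout_seconds=10, clock=lambda: now[0])

now[0] = 6.0
dms.heartbeat_sent("non_ip_serial")  # the IP network is stuck, the serial link is not
now[0] = 15.0
print("expired at t=15:", dms.expired())
now[0] = 16.0
print("expired at t=16:", dms.expired())
```

With a second network, the heartbeat at t=6 keeps the timer alive even though nothing got out over the first one. With only one network, a blocked NIM leaves nothing to reset the timer.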

The next step, here at least, is a verbose errpt listing for that error identifier (-j selects by identifier; -J selects by error label):

errpt -a -j 864D2CE3

I am hoping there will be more information about which interface is failing.

It also helps to verify that you have the latest fixes installed:

lslpp -L cluster.\*

oslevel -r

etc.

sosborne 05-10-2006 11:03 AM

Thank you for the response. As soon as I can, I'll get the extended error output posted!

sosborne 06-01-2006 11:21 AM

I apologize for taking so long to get back to you, but I've been out of town on a business trip...

Here's the result of errpt -aj....

Code:

---------------------------------------------------------------------------
LABEL:                TS_NIM_ERROR_STUCK_
IDENTIFIER:        864D2CE3

Date/Time:      Mon May  1 14:56:49 ADT
Sequence Number: 9112
Machine Id:      00C0853E4C00
Node Id:        akkotz
Class:          S
Type:            PERM
Resource Name:  topsvcs       

Description
NIM thread blocked

Probable Causes
A thread in a Topology Services Network Interface Module (NIM) process
was blocked
Topology Services NIM process cannot get timely access to CPU

User Causes
Excessive memory consumption is causing high memory contention
Excessive disk I/O is causing high memory contention

        Recommended Actions
        Examine I/O and memory activity on the system
        Reduce load on the system
        Tune virtual memory parameters
        Call IBM Service if problem persists

Failure Causes
Excessive virtual memory activity prevents NIM from making progress
Excessive disk I/O traffic is interfering with paging I/O

        Recommended Actions
        Examine I/O and memory activity on the system
        Reduce load on the system
        Tune virtual memory parameters
        Call IBM Service if problem persists

Detail Data
DETECTING MODULE
rsct,nim_control.C,1.39,5455                 
ERROR ID
6XnGH40l6dJ2/j1T1/w7k.1...................
REFERENCE CODE
                                         
Thread which was blocked
receive thread
Interval in seconds during which process was blocked
        229
Interface name
en1
---------------------------------------------------------------------------
LABEL:                TS_NIM_ERROR_STUCK_
IDENTIFIER:        864D2CE3

Date/Time:      Mon May  1 14:56:49 ADT
Sequence Number: 9110
Machine Id:      00C0853E4C00
Node Id:        akkotz
Class:          S
Type:            PERM
Resource Name:  topsvcs       

Description
NIM thread blocked

Probable Causes
A thread in a Topology Services Network Interface Module (NIM) process
was blocked
Topology Services NIM process cannot get timely access to CPU

User Causes
Excessive memory consumption is causing high memory contention
Excessive disk I/O is causing high memory contention

        Recommended Actions
        Examine I/O and memory activity on the system
        Reduce load on the system
        Tune virtual memory parameters
        Call IBM Service if problem persists

Failure Causes
Excessive virtual memory activity prevents NIM from making progress
Excessive disk I/O traffic is interfering with paging I/O

        Recommended Actions
        Examine I/O and memory activity on the system
        Reduce load on the system
        Tune virtual memory parameters
        Call IBM Service if problem persists

Detail Data
DETECTING MODULE
rsct,nim_control.C,1.39,5455                 
ERROR ID
6XnGH40l6dJ2/1HV./w7k.1...................
REFERENCE CODE
                                         
Thread which was blocked
receive thread
Interval in seconds during which process was blocked
        228
Interface name
en2
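
When there are many of these entries, the interesting fields (sequence number, blocked interval, interface name) can be pulled out with a short script. A minimal Python sketch, assuming the layout shown above, where entries are separated by a line of dashes and each Detail Data value follows its label on the next line (the function name and the trimmed sample are only illustrative):

```python
def blocked_interfaces(report):
    """Return one dict per errpt -a entry with sequence, interval and interface."""
    results = []
    current = {}
    lines = [line.strip() for line in report.splitlines()]
    for index, line in enumerate(lines):
        if line.startswith("-----"):
            # A dashed line starts a new entry.
            if current:
                results.append(current)
            current = {}
            continue
        following = lines[index + 1] if index + 1 < len(lines) else ""
        if line.startswith("Sequence Number:"):
            current["sequence"] = int(line.split(":", 1)[1])
        elif line.startswith("Interval in seconds"):
            current["blocked_seconds"] = int(following)
        elif line == "Interface name":
            current["interface"] = following
    if current:
        results.append(current)
    return results


# Trimmed copy of the two entries posted above.
SAMPLE = """\
---------------------------------------------------------------------------
IDENTIFIER:        864D2CE3
Sequence Number: 9112
Thread which was blocked
receive thread
Interval in seconds during which process was blocked
        229
Interface name
en1
---------------------------------------------------------------------------
IDENTIFIER:        864D2CE3
Sequence Number: 9110
Thread which was blocked
receive thread
Interval in seconds during which process was blocked
        228
Interface name
en2
"""

for entry in blocked_interfaces(SAMPLE):
    print(entry["interface"], "blocked for", entry["blocked_seconds"], "seconds")
```

Seeing both en1 and en2 blocked for nearly the same interval at the same time is easier to spot in this condensed form.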

I ran diagnostics on the card (netstat, etc.), and AIX did not find any problems with the card itself. I don't have the error code, but AIX reported that the problem is either the cable connection to our switch or the port on the switch itself.

Thank you!


All times are GMT -5.