LinuxQuestions.org
05-04-2006, 12:53 PM   #1
sosborne
LQ Newbie
 
Registered: Apr 2006
Posts: 23

Rep: Reputation: 15
"NIM thread blocked" & "Deadman Switch" errors


I've been looking for descriptions of these services, but can't find a lot of information on them, yet. I've been regularly getting the following errors:

IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
864D2CE3   0501145606 P S topsvcs        NIM thread blocked
FA723BD9   0501145606 I S topsvcs        Deadman Switch (DMS) close to trigger
864D2CE3   0501145606 P S topsvcs        NIM thread blocked
.
.
.
.
3C81E43F   0501145606 P U topsvcs        Late in sending heartbeat
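
For a long error log it can help to tally the summary before digging into individual entries. The following is only a minimal sketch, not an AIX tool: tally_errpt and decode_errpt_ts are made-up helper names, and the parsing assumes the six-column summary layout shown above with a TIMESTAMP of the form MMDDhhmmYY.

```shell
# Sketch only: tally_errpt and decode_errpt_ts are hypothetical helpers,
# not part of AIX. Both work on plain text, so they can be tried on a
# saved copy of the errpt summary.

# Count how often each IDENTIFIER appears in errpt summary output (stdin).
# Lines whose first field is not exactly 8 hex digits (the header, the
# "." filler lines) are ignored.
tally_errpt() {
    awk '
        $1 ~ /^[0-9A-F][0-9A-F][0-9A-F][0-9A-F][0-9A-F][0-9A-F][0-9A-F][0-9A-F]$/ {
            count[$1]++
            if (!($1 in desc)) {
                # The description is everything from field 6 onward.
                d = $6
                for (i = 7; i <= NF; i++) d = d " " $i
                desc[$1] = d
            }
        }
        END {
            for (id in count) printf "%d %s %s\n", count[id], id, desc[id]
        }
    ' | sort -k1,1nr -k2,2
}

# Turn an errpt TIMESTAMP (assumed MMDDhhmmYY) into "MM/DD/YY hh:mm".
decode_errpt_ts() {
    printf '%s\n' "$1" | awk '{
        printf "%s/%s/%s %s:%s\n",
            substr($0, 1, 2), substr($0, 3, 2), substr($0, 9, 2),
            substr($0, 5, 2), substr($0, 7, 2)
    }'
}
```

With these defined, something like "errpt | tally_errpt" should list the most frequent identifiers first, and "decode_errpt_ts 0501145606" prints 05/01/06 14:56.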

Now, from what little I've been able to gather, this sounds like a buffer overrun problem; however, the response to the original poster's questions was unclear.

Does anybody know what I'm looking at here? I'm hoping to understand what is behind the problem, find the solution, and then apply it, but I'd also like to learn the whys =).

Any help would be greatly appreciated, thank you.

Steve
 
05-09-2006, 02:26 PM   #2
Michael AM
Member
 
Registered: May 2006
Distribution: AIX 5.3, AIX 6.1, AIX 7.1
Posts: 123

Rep: Reputation: 33
In this context, NIM is the abbreviation for "Network Interface Module" and should not be confused with nim, a.k.a. the "Network Installation Manager".

The first thing to look at is the output of the command:

lssrc -ls topsvcs

I don't have an HACMP cluster handy at the moment, but among other things it will tell you how well the heartbeats are being passed.

It appears that you have only one network in your cluster configuration. A non-IP network is needed (read: required) to prevent network failures (or NIM failures) from creating a partitioned cluster.

Basically, the function of the deadman switch is to keep track of when the node was last able to tell the other active nodes that it is still active. A message or heartbeat sent over ANY of the networks is enough to satisfy the deadman switch requirement. (Losing all networks (note the plural) is not a single point of failure (SPOF), and HACMP is designed to handle a single point of failure. That it often handles more is a bonus, not design.)
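
To make the "ANY of the networks" rule concrete, here is a toy model of it. This is a sketch under stated assumptions, not how HACMP or Topology Services implement the deadman switch: dms_expired, its arguments, and the 10-second limit in the example are all invented for illustration.

```shell
# Sketch only: a toy model of the deadman-switch rule, not HACMP code.
#
# usage: dms_expired NOW TIMEOUT LAST_SENT...
#   NOW        current time in seconds
#   TIMEOUT    seconds allowed since the most recent heartbeat (hypothetical)
#   LAST_SENT  one value per network: when a heartbeat last went out on it
# Returns 0 (true) if the switch would trigger, 1 if it is still satisfied.
dms_expired() {
    now=$1
    timeout=$2
    shift 2
    newest=0
    for sent in "$@"; do
        # A heartbeat on ANY network counts, so only the newest one matters.
        if [ "$sent" -gt "$newest" ]; then
            newest=$sent
        fi
    done
    [ $((now - newest)) -gt "$timeout" ]
}

# Two networks: the IP network last sent 30 s ago, the non-IP network 2 s ago.
if dms_expired 1000 10 970 998; then
    echo "DMS would trigger"
else
    echo "DMS satisfied"
fi
```

The example prints "DMS satisfied" because one network is still getting heartbeats out. With only the first network configured (dms_expired 1000 10 970) the same stall would trip the switch, which is the argument for having a second, non-IP network.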

The next step, for this thread at least, is the verbose errpt output:

errpt -a -j 864D2CE3

I am hoping there will be more information about which interface is failing.
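
Once the detailed output is available, a small filter can reduce each entry to the fields that matter here. This is only a sketch: nim_blocked_summary is a made-up name, and it assumes the detail layout posted later in this thread, where each label line ("Thread which was blocked", "Interval in seconds during which process was blocked", "Interface name") is followed by its value on the next line. Other RSCT levels may format this differently.

```shell
# Sketch only: nim_blocked_summary is a hypothetical helper, not an AIX
# command. It reads detailed errpt output on stdin and prints one line per
# entry with the sequence number, interface, blocked thread, and interval.
#
# usage (assuming -j selects entries by error identifier):
#   errpt -a -j 864D2CE3 | nim_blocked_summary
nim_blocked_summary() {
    awk '
        /^Sequence Number:/ { seq = $3 }

        # Label lines: remember which value the next line carries.
        /^Thread which was blocked/ { want = "thread"; next }
        /^Interval in seconds during which process was blocked/ { want = "secs"; next }
        /^Interface name/ { want = "ifname"; next }

        # Value lines: pick up the value announced by the previous label.
        want == "thread" {
            gsub(/^[ \t]+|[ \t]+$/, "")
            thread = $0
            want = ""
            next
        }
        want == "secs" { secs = $1; want = ""; next }
        want == "ifname" {
            printf "seq=%s if=%s thread=\"%s\" blocked=%ss\n", seq, $1, thread, secs
            want = ""
            next
        }
    '
}
```

The interface name is the last detail field in each entry, so the summary line is printed as soon as it is seen.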

It also helps to verify that you have the latest fixes installed:

lslpp -L cluster.\*

oslevel -r

etc.
 
05-10-2006, 11:03 AM   #3
sosborne
LQ Newbie
 
Registered: Apr 2006
Posts: 23

Original Poster
Rep: Reputation: 15
Thank you for the response. As soon as I can, I'll get the extended error output posted!
 
06-01-2006, 11:21 AM   #4
sosborne
LQ Newbie
 
Registered: Apr 2006
Posts: 23

Original Poster
Rep: Reputation: 15
I apologize for taking so long to get back to you, but I've been out of town on a business trip...

Here's the result of errpt -aj....

Code:
---------------------------------------------------------------------------
LABEL:		TS_NIM_ERROR_STUCK_
IDENTIFIER:	864D2CE3

Date/Time:       Mon May  1 14:56:49 ADT 
Sequence Number: 9112
Machine Id:      00C0853E4C00
Node Id:         akkotz
Class:           S
Type:            PERM
Resource Name:   topsvcs         

Description
NIM thread blocked

Probable Causes
A thread in a Topology Services Network Interface Module (NIM) process
was blocked
Topology Services NIM process cannot get timely access to CPU

User Causes
Excessive memory consumption is causing high memory contention
Excessive disk I/O is causing high memory contention

	Recommended Actions
	Examine I/O and memory activity on the system
	Reduce load on the system
	Tune virtual memory parameters
	Call IBM Service if problem persists

Failure Causes
Excessive virtual memory activity prevents NIM from making progress
Excessive disk I/O traffic is interfering with paging I/O

	Recommended Actions
	Examine I/O and memory activity on the system
	Reduce load on the system
	Tune virtual memory parameters
	Call IBM Service if problem persists

Detail Data
DETECTING MODULE
rsct,nim_control.C,1.39,5455                  
ERROR ID 
6XnGH40l6dJ2/j1T1/w7k.1...................
REFERENCE CODE
                                          
Thread which was blocked
receive thread
Interval in seconds during which process was blocked
         229
Interface name
en1
---------------------------------------------------------------------------
LABEL:		TS_NIM_ERROR_STUCK_
IDENTIFIER:	864D2CE3

Date/Time:       Mon May  1 14:56:49 ADT 
Sequence Number: 9110
Machine Id:      00C0853E4C00
Node Id:         akkotz
Class:           S
Type:            PERM
Resource Name:   topsvcs         

Description
NIM thread blocked

Probable Causes
A thread in a Topology Services Network Interface Module (NIM) process
was blocked
Topology Services NIM process cannot get timely access to CPU

User Causes
Excessive memory consumption is causing high memory contention
Excessive disk I/O is causing high memory contention

	Recommended Actions
	Examine I/O and memory activity on the system
	Reduce load on the system
	Tune virtual memory parameters
	Call IBM Service if problem persists

Failure Causes
Excessive virtual memory activity prevents NIM from making progress
Excessive disk I/O traffic is interfering with paging I/O

	Recommended Actions
	Examine I/O and memory activity on the system
	Reduce load on the system
	Tune virtual memory parameters
	Call IBM Service if problem persists

Detail Data
DETECTING MODULE
rsct,nim_control.C,1.39,5455                  
ERROR ID 
6XnGH40l6dJ2/1HV./w7k.1...................
REFERENCE CODE
                                          
Thread which was blocked
receive thread
Interval in seconds during which process was blocked
         228
Interface name
en2
I ran diagnostics on the card (netstat, etc.), and AIX did not find any problems with the card itself. I don't have the error code, but AIX reported that the problem is either the cable connection to our switch or the port on the switch itself.

Thank you!
 
  


All times are GMT -5.
