SP2 node issue - Auto-join helping or hurting?
Greetings fellow AIX'ers,
I administer an SP2 (AIX 4.3.3 and PSSP 3.2). Recently I typed spmon -d on the control workstation (CWS), and the following was output for one of our nodes:

    ----------------------------------- Frame 1 -----------------------------------
                          Host     Switch   Key     Env   Front Panel      LCD/LED
    Slot Node Type  Power Responds Responds Switch  Error LCD/LED          Flashes
    ---- ---- ----- ----- -------- -------- ------- ----- ---------------- -------
    15   15   thin  on    yes      autojn   N/A     no    LCDs are blank   no

So node 15 has a Switch Responds value of "autojn". What exactly does this mean? The output for other nodes in this frame looks similar to:

    14   14   thin  on    yes      yes      N/A     no    LCDs are blank   no

where a value of "yes" sits in the Switch Responds column, indicating the node is able to speak over the switch (SP Switch).

I went into the "Perspectives" GUI and opened up the node 15 configuration properties, and what was listed under "SWITCH RESPONDS" was "Fenced with Autojoin". I clicked through to "unfence" this node and the following error popped up:

    Unfencing nodes: 15
    Eunfence: 0028-162 Node specified by Node Number 15 is not currently fenced.
    Unable to unfence the following nodes:
    n15 -- SP node number 15 Not Fenced

Can anyone provide an indication of what the "Autojoin" indicator is, and why it does not allow me to unfence this seemingly fenced-off node???

thanks a bunch
zepp |
Autojoin is a feature that was added to PSSP (in one of the 3.x versions, I think) that lets nodes join the switch automatically instead of you having to do it manually.
Check the node itself and see if the worm is running:

    # ps -ef | grep -i worm
    root 19954 1 0 Jul 10 - 3:22 /usr/lpp/ssp/css/fault_service_Worm_RTG_SP -r 0 -b 1 -s 5 -p 3 -a TB3 -t 28

If it isn't running then run the rc.switch that is in /etc/inittab to start it:

    # /usr/lpp/ssp/css/rc.switch

Also, since it shows fenced but replies unfenced, you can try to fence the node:

    # fence 15

followed by:

    # unfenced 15

Sometimes it can get messed up and show fenced/unfenced when nodes have all been brought down/up without having Equiesced the switch first. But in this case it should tell you when trying to unfenced the node to start the switch, which is done with 'Estart'.

The log files are in /var/adm/SPlogs, and the ones that you can look at are rc.switch.log and worm.trace, which should identify problems with the node in question. I would look on the primary node for the SPlogs, too. You can find the primary node (and backup) by typing 'Eprimary'. |
It is 'unfence' though I mistakenly typed 'unfenced' a couple of times.
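
For anyone who wants to spot this condition across a whole frame, here is a minimal sketch that picks out nodes whose Switch Responds column is anything other than "yes". It assumes the data rows look exactly like the ones quoted above (slot, node, type, power, host responds, switch responds, ...); flag_switch is a made-up helper name, not a PSSP command, and the sample rows are just the two lines from the original post.

```shell
#!/bin/sh
# Sketch only: list nodes whose "Switch Responds" column is not "yes".
# Assumed row layout (taken from the rows quoted in this thread):
#   slot node type power host_responds switch_responds ...
# flag_switch is a hypothetical helper name, not part of PSSP.
flag_switch() {
    # Only look at data rows, i.e. lines whose first two fields are numbers;
    # header and separator lines are skipped.
    awk '$1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ && $6 != "yes" {
        printf "node %s: switch_responds=%s\n", $2, $6
    }'
}

# Sample rows copied from the posts above. On a real CWS you would pipe
# the output of spmon -d into flag_switch instead.
sample='14 14 thin on yes yes N/A no LCDs are blank no
15 15 thin on yes autojn N/A no LCDs are blank no'

result=$(printf '%s\n' "$sample" | flag_switch)
printf '%s\n' "$result"
```

Any node it prints is a candidate for the worm check and the log files mentioned above.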
|
thanks for the help screwloose...and sorry about the delay:
I was able to Efence the node, and after typing spmon -d the line now reads:

    15   15   thin  on    yes      no       N/A     no    LCDs are blank   no

So under the Switch Responds column there is now the label: No

How do I get this to go to: Yes?

Thanks.... |
Did you run 'Eunfence 15' on the node after running Efence? Also make sure the worm daemon is running on node15 using 'ps -ef | grep -i worm'. If the worm is not running then run '/usr/lpp/ssp/css/rc.switch'.
Let me know the results. |
    root@CWS:/var/adm/SPlogs# Eunfence 15
    Eunfence: 0028-162 Node specified by Node Number 15 is not currently fenced.
    Unable to unfence the following nodes:
    N15, SP node number 15 Not Fenced

    # ps -ef |grep -i worm
    root 7264 1 0 Jul 17 - 0:00 /usr/lpp/ssp/css/fault_service_Worm_RTG_SP -r 14 -b 1 -s 7 -p 2 -a TB3 -t 28

So the worm is running on node15 and, according to the Eunfence command, it is not fenced!!! In circles we go...any help? |
Be cautious about this one, because it could disrupt the switch network if it is currently working.
Try:

    Eclock -d
    Estart

That resynchronises the switch internal clock between the nodes and then restarts the switch. |
iainr...thanks a bunch...this worked!!!!
Question though...when the switch-responds indicated: "No" <--- is it correct to say that the node was "off" of the switch...that is, communication between that node and say, another node, through the switch, was not possible? thanks.... zepp |
If you had tried to Efence/Eunfence other nodes and had the same problem then you would have known the switch wasn't running. It is possible that you could have run 'Estart' without clocking the switch and the nodes would have rejoined. I have seen this happen where one or two nodes are not on the switch and all other nodes report they are joined. However, if you try to run anything it will respond that the switch isn't running.
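
To pull the thread together, the order of checks people suggested here can be written down as a small sketch. It only prints advice and never runs a PSSP command itself; suggest_step and its two arguments are invented for illustration, and the ordering is just my reading of the replies above, not an official procedure.

```shell
#!/bin/sh
# Sketch of the triage order suggested in this thread.
# suggest_step is a hypothetical helper; it prints advice only.
#   $1 = value in the Switch Responds column (yes, no, autojn)
#   $2 = "up" if fault_service_Worm_RTG_SP shows in ps on the node,
#        anything else if it does not
suggest_step() {
    switch_responds=$1
    worm=$2
    if [ "$switch_responds" = "yes" ]; then
        echo "node is on the switch; nothing to do"
    elif [ "$worm" != "up" ]; then
        echo "start the worm on the node: /usr/lpp/ssp/css/rc.switch"
    elif [ "$switch_responds" = "autojn" ]; then
        echo "try Efence then Eunfence for the node"
    else
        echo "try Estart (Eclock -d first if needed); may disrupt a working switch"
    fi
}

suggest_step autojn up
suggest_step no up
suggest_step no down
```

The last branch is the one that carries the warning given earlier: restarting or reclocking the switch can interrupt nodes that are currently talking over it.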
|