LinuxQuestions.org
09-13-2010, 09:43 AM   #1
gandalf85
Intermittent connectivity issues with ROCKS on a compute cluster


I have a cluster set up with a head node and compute nodes running TORQUE and MOAB. The distro is ROCKS 5.3. I've been having connectivity problems for the past couple of weeks. Every couple of hours the network connectivity seems to just stop working: sometimes it comes back in 10-15 minutes, sometimes I have to reboot the machine. I have Samba set up, and when this happens the network drive I have mounted on my Windows PC stops responding (often causing Windows Explorer to crash) and I can't open a new PuTTY session. If I already have a PuTTY window open, I can still run basic commands like "ls" and "cd", but qstat and pbsnodes don't work. If I'm logged into the head node through PuTTY, I can ssh into one of the compute nodes. Eventually the PuTTY window crashes, though. Also, I can ping the server just fine.
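
Since ping keeps working while new logins and qstat fail, it may help to record which services still accept new TCP connections at the moment things hang. The following is only a minimal sketch of such a probe, not a tested monitoring tool. The port list (22 for ssh, 445 for Samba, 15001 for the TORQUE pbs_server) is an assumption based on common defaults, and `probe`, `log_once` and `watch` are hypothetical helper names.

```python
import socket
import time
from datetime import datetime


def probe(host, port, timeout=3.0):
    """Return True if a new TCP connection to host:port succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name lookup failures.
        return False


def log_once(host, ports, timeout=3.0):
    """Probe each port once and return a single timestamped log line."""
    stamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    results = []
    for port in ports:
        state = "up" if probe(host, port, timeout) else "DOWN"
        results.append("%d=%s" % (port, state))
    return "%s %s %s" % (stamp, host, " ".join(results))


def watch(host, ports, interval=60, rounds=None):
    """Print one log line per interval; rounds=None means run until stopped."""
    done = 0
    while rounds is None or done < rounds:
        print(log_once(host, ports), flush=True)
        done += 1
        if rounds is None or done < rounds:
            time.sleep(interval)
```

Running something like `watch("wantsh01", [22, 445, 15001])` from the Windows side or from a compute node, with the output redirected to a file, would give timestamps to line up against /var/log/messages on the server. The host name is taken from the logs in this post; the ports should be adjusted to whatever the services actually listen on.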

The Samba logs were reporting all sorts of problems:

[2010/09/10 03:51:29, 0] lib/fault.c:fault_report(42)
INTERNAL ERROR: Signal 7 in pid 9816 (3.0.33-3.15.el5_4)
[2010/09/10 03:51:29, 0] smbd/close.c:close_directory(430)
close_directory: Could not get share mode lock for Pao
Please read the Trouble-Shooting section of the Samba3-HOWTO
[2010/09/10 03:51:29, 0] lib/fault.c:fault_report(44)

From: http://www.samba.org/samba/docs/Samba3-HOWTO.pdf
[2010/09/10 03:51:29, 0] lib/fault.c:fault_report(41)
[2010/09/10 03:51:29, 0] lib/fault.c:fault_report(45)
[2010/09/10 03:51:29, 0] lib/fault.c:fault_report(42)
[2010/09/10 03:51:29, 0] lib/util.c:smb_panic(1655)
INTERNAL ERROR: Signal 7 in pid 8475 (3.0.33-3.15.el5_4)
PANIC (pid 9816): internal error
Please read the Trouble-Shooting section of the Samba3-HOWTO
[2010/09/10 03:51:30, 0] lib/util.c:log_stack_trace(1759)
[2010/09/10 03:51:30, 0] lib/fault.c:fault_report(44)
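
The log lines above are interleaved, so it can be easier to pull out just the fault reports and see how many smbd processes died and on which signal. The following is a rough sketch, assuming the "INTERNAL ERROR: Signal N in pid P (version)" format shown above; `summarize_faults` is a hypothetical helper name. For reference, signal 7 on x86 Linux is SIGBUS.

```python
import re
from collections import Counter

# Matches lines such as:
#   INTERNAL ERROR: Signal 7 in pid 9816 (3.0.33-3.15.el5_4)
FAULT_RE = re.compile(r"INTERNAL ERROR: Signal (\d+) in pid (\d+) \(([^)]*)\)")


def summarize_faults(lines):
    """Count smbd fault reports per (signal, pid, version) tuple."""
    counts = Counter()
    for line in lines:
        match = FAULT_RE.search(line)
        if match:
            key = (int(match.group(1)), int(match.group(2)), match.group(3))
            counts[key] += 1
    return counts


# A few lines copied from the log excerpt above.
sample = [
    "[2010/09/10 03:51:29, 0] lib/fault.c:fault_report(42)",
    "INTERNAL ERROR: Signal 7 in pid 9816 (3.0.33-3.15.el5_4)",
    "INTERNAL ERROR: Signal 7 in pid 8475 (3.0.33-3.15.el5_4)",
    "PANIC (pid 9816): internal error",
]

for (sig, pid, version), n in sorted(summarize_faults(sample).items()):
    print("signal %d pid %d smbd %s: %d report(s)" % (sig, pid, version, n))
```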

I turned off Samba but still have the same problems. /var/log/messages contained this:

Sep 10 10:38:02 wantsh01 kernel: bnx2i: iSCSI not supported, dev=eth0
Sep 10 10:38:02 wantsh01 kernel: bnx2i: iSCSI not supported, dev=eth0
Sep 10 10:38:02 wantsh01 kernel: bnx2i: iSCSI not supported, dev=eth1
Sep 10 10:38:02 wantsh01 kernel: bnx2i: iSCSI not supported, dev=eth1

bnx2i is some sort of driver for the Broadcom network card. I updated the Broadcom multi-function drivers and the firmware according to http://h20000.www2.hp.com/bc/docs/su.../c01309558.pdf, but I still have problems. One thing I couldn't get working was the bnx2i iSCSI offload driver: I ran into version issues with the RPMs. I've run memtest and a couple of hardware diagnostic checks and can't find any problems. Here's /var/log/messages from when I rebooted the machine. Note that I hosed the X server somehow, and I'm not really worried about fixing that.

Sep 13 04:49:25 wantsh01 gdm[3930]: Failed to start X server several times in a short time period; disabling display :0
Sep 13 04:49:29 wantsh01 mountd[3527]: Caught signal 15, un-registering and exiting.
Sep 13 04:52:12 wantsh01 kernel: Memory for crash kernel (0x0 to 0x0) notwithin permissible range
Sep 13 04:52:12 wantsh01 kernel: PCI: BIOS Bug: MCFG area at e0000000 is not E820-reserved
Sep 13 04:52:12 wantsh01 kernel: PCI: Not using MMCONFIG.
Sep 13 04:52:13 wantsh01 kernel: intel_rng: FWH not detected
Sep 13 04:52:13 wantsh01 kernel: bnx2i: iSCSI not supported, dev=eth0
Sep 13 04:52:13 wantsh01 kernel: bnx2i: dev eth0 does not support iscsi
Sep 13 04:52:13 wantsh01 kernel: bnx2i: iSCSI not supported, dev=eth1
Sep 13 04:52:13 wantsh01 kernel: bnx2i: dev eth1 does not support iscsi
Sep 13 04:52:13 wantsh01 named[3028]: the working directory is not writable
Sep 13 04:52:19 wantsh01 sshd[3428]: error: Bind to port 22 on 0.0.0.0 failed: Address already in use.
Sep 13 04:52:19 wantsh01 xinetd[3445]: /etc/xinetd.d/RCS is not a regular file. It is being skipped.
Sep 13 04:52:24 wantsh01 smartd[3926]: Problem creating device name scan list
Sep 13 04:52:24 wantsh01 smartd[3926]: Problem creating device name scan list
Sep 13 04:52:24 wantsh01 smartd[3926]: In the system's table of devices NO devices found to scan
Sep 13 04:52:31 wantsh01 gdm[4042]: gdm_slave_xioerror_handler: Fatal X error - Restarting :0
Sep 13 04:52:40 wantsh01 gdm[4188]: gdm_slave_xioerror_handler: Fatal X error - Restarting :0
Sep 13 04:52:49 wantsh01 gdm[4210]: gdm_slave_xioerror_handler: Fatal X error - Restarting :0
Sep 13 04:53:19 wantsh01 gdm[3940]: Failed to start X server several times in a short time period; disabling display :0
Sep 13 04:53:32 wantsh01 dhcpd: receive_packet failed on eth0: Network is down
Sep 13 04:53:33 wantsh01 kernel: bnx2i: dev eth0 does not support iscsi
Sep 13 04:53:33 wantsh01 kernel: bnx2i: iSCSI not supported, dev=eth0
Sep 13 04:53:37 wantsh01 kernel: bnx2i: dev eth1 does not support iscsi
Sep 13 04:53:37 wantsh01 kernel: bnx2i: iSCSI not supported, dev=eth1
Sep 13 04:54:50 wantsh01 snmpd[3379]: c64 32 bit check failed
Sep 13 04:55:20 wantsh01 snmpd[3379]: looks like a 64bit wrap, but prev!=new
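
With this much noise in the boot log (gdm, smartd, snmpd), it can help to split each line into its fields and look only at the entries that mention a network interface. The following is a minimal sketch, assuming the traditional syslog layout seen above ("Mon DD HH:MM:SS host program[pid]: message"); `parse_line`, `count_by_tag` and `interface_lines` are hypothetical helper names.

```python
import re
from collections import Counter

# e.g. "Sep 13 04:52:13 wantsh01 kernel: bnx2i: ..." or "... gdm[3930]: ..."
SYSLOG_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d\d:\d\d:\d\d) (?P<host>\S+) "
    r"(?P<tag>[^:\[\s]+)(?:\[(?P<pid>\d+)\])?: (?P<msg>.*)$"
)


def parse_line(line):
    """Return a dict of syslog fields, or None if the line does not match."""
    match = SYSLOG_RE.match(line.strip())
    return match.groupdict() if match else None


def count_by_tag(lines):
    """Count how many messages each program (kernel, sshd, ...) logged."""
    entries = (parse_line(line) for line in lines)
    return Counter(entry["tag"] for entry in entries if entry)


def interface_lines(lines, iface):
    """Return parsed entries whose message mentions the given interface."""
    entries = (parse_line(line) for line in lines)
    return [entry for entry in entries if entry and iface in entry["msg"]]


# A few lines copied from the log excerpt above.
sample = [
    "Sep 13 04:52:13 wantsh01 kernel: bnx2i: iSCSI not supported, dev=eth0",
    "Sep 13 04:52:19 wantsh01 sshd[3428]: error: Bind to port 22 on 0.0.0.0 failed: Address already in use.",
    "Sep 13 04:53:32 wantsh01 dhcpd: receive_packet failed on eth0: Network is down",
    "Sep 13 04:53:33 wantsh01 kernel: bnx2i: dev eth0 does not support iscsi",
]

for entry in interface_lines(sample, "eth0"):
    print(entry["stamp"], entry["tag"], entry["msg"])
```

Running the same functions over the full /var/log/messages around the time of a hang would show whether the dhcpd "Network is down" message or anything else mentioning eth0 or eth1 appears each time the connection drops.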

Thanks for any help, I'd really appreciate some advice.
 
  


All times are GMT -5.