Connection to RAID being lost

thllgo · 02-23-2010, 09:48 AM

Hello

Problem: About every two weeks our system seems to lose its connection to its RAID. Attempting to do an ls hangs. The system is responsive except for access to the RAID.

I am running RHEL 5 on an SGI 450 IA64. I have two FC connections to a Silkworm 200E Brocade. The Brocade is plugged into an SGI TP9500 RAID. The FC cards are LSIFC949X.

Both the RAID and the Brocade report they are OK.

When we lose connectivity, I see the following messages in the log files. Once this error occurs, I cannot reboot gracefully; I have to power down and power back up. Once the system is powered back up, all seems OK for the next week or two. I'm guessing there is some kind of hiccup in the FC connection to the RAID, but it does not recover.

kernel: mptscsih:ioc2:attempting task abort! (sc=00006011878100)
kernel: sd 2:0:6:4:
kernel: command: Write(10): 2a 00 00 02 b7 b2 00 00 08 00
kernel: mptbase: Initiating ioc2 recovery
kernel: rport 2:0-0: blocked FC remote port time out: saving binding
kernel: rport 1:0-0: blocked FC remote port time out: saving binding
kernel: rport 2:0-1: blocked FC remote port time out: saving binding
kernel: rport 2:0-2: blocked FC remote port time out: saving binding
kernel: rport 2:0-3: blocked FC remote port time out: saving binding
kernel: rport 2:0-4: blocked FC remote port time out: saving binding
kernel: rport 2:0-5: blocked FC remote port time out: saving binding
kernel: rport 2:0-6: blocked FC remote port time out: saving binding
kernel: rport 1:0-1: blocked FC remote port time out: saving binding
kernel: rport 1:0-2: blocked FC remote port time out: saving binding
kernel: rport 1:0-3: blocked FC remote port time out: saving binding
kernel: rport 1:0-4: blocked FC remote port time out: saving binding
kernel: rport 1:0-5: blocked FC remote port time out: saving binding
kernel: rport 1:0-6: blocked FC remote port time out: saving binding
sd 1:0:5:5: SCSI error: return code = 0x00010000
end request I/O error dev sdp sector 167988652
Buffer I/O error, dev sdx4, logical block 0
lost page write due to I/O error on sdx4
...
...
lots more errors like the above on sdp and sdx
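
For anyone wanting to summarize messages like the ones above, here is a minimal Python sketch that tallies the "blocked FC remote port time out" lines per SCSI host. The regular expression, the reading of "rport 2:0-6" as host:channel-index, and the function name count_rport_timeouts are assumptions based only on the log lines shown here, not on any official log format documentation.

```python
import re
from collections import Counter

# Assumed pattern, derived from lines such as:
#   kernel: rport 2:0-6: blocked FC remote port time out: saving binding
# The group names (host, channel, index) are a guess at what the numbers mean.
RPORT_TIMEOUT = re.compile(
    r"rport (?P<host>\d+):(?P<channel>\d+)-(?P<index>\d+): "
    r"blocked FC remote port time out"
)


def count_rport_timeouts(lines):
    """Count remote port timeout messages per host number (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = RPORT_TIMEOUT.search(line)
        if match:
            counts[int(match.group("host"))] += 1
    return counts


if __name__ == "__main__":
    # A few lines copied from the log excerpt above.
    sample = [
        "kernel: mptbase: Initiating ioc2 recovery",
        "kernel: rport 2:0-0: blocked FC remote port time out: saving binding",
        "kernel: rport 1:0-0: blocked FC remote port time out: saving binding",
        "kernel: rport 2:0-1: blocked FC remote port time out: saving binding",
    ]
    for host, n in sorted(count_rport_timeouts(sample).items()):
        print(f"host {host}: {n} remote port timeout(s)")
```

Feeding it the full log file instead of the sample list would show whether both FC connections drop all of their remote ports at the same moment or only one side does.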

TB0ne · 02-23-2010, 11:21 AM

I've seen flaky things like this happen before when my SAN guys were doing 'behind-the-scenes' things. Don't know if that's the case here, though. Are there any copy/mirror jobs, like doing a BCV snapshot, that occur with some frequency?

thllgo · 02-23-2010, 12:07 PM

That was my first thought. Unfortunately, no. It's not on a completely timed basis. Sometimes it's 10 days, sometimes 18 days, and everywhere in between. It does happen at times of heavy writes. From what I've read, what I think is happening is that the FC connection is in heavy use, gets reset, but does not come back quite completely, and my mounted filesystems get hosed.

TB0ne · 02-23-2010, 03:10 PM

Perhaps the firmware on the Brocade needs to be updated...are you on the latest release?

thllgo · 02-23-2010, 04:14 PM

I don't know. I will check it out.