[SOLVED] IBM M1015 (LSI 9211-8i) Drops and Re-Allocates LUNs

MQMan · 04-03-2015, 10:26 AM

I just had a SuperMicro X9SCL server board die on me and it was replaced by an Asus P8B-X server board. Since then I've been having strange issues with my IBM M1015 controller. The controller has been cross-flashed with the LSI 9211-8i firmware running in IT mode. In the SuperMicro board, it had been running for a couple of years with no issues at all.

Basically, the controller will throw a DID_NO_CONNECT error on one or more LUNs. This forces the LUN to be dropped, and the drive is then re-allocated to a different device node, so in effect a drive previously allocated to /dev/sde suddenly becomes /dev/sdi.

Code:

Apr  2 22:49:39 zentyal kernel: [110351.190226] sd 4:0:3:0: [sde] Synchronizing SCSI cache
Apr  2 22:49:39 zentyal kernel: [110351.190253] sd 4:0:3:0: [sde]
Apr  2 22:49:39 zentyal kernel: [110351.190255] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Apr  2 22:49:39 zentyal kernel: [110351.190990] mpt2sas0: removing handle(0x000c), sas_addr(0x4433221105000000)
Apr  2 22:49:39 zentyal kernel: [110351.488933] sd 4:0:5:0: [sdg] Synchronizing SCSI cache
Apr  2 22:49:39 zentyal kernel: [110351.488959] sd 4:0:5:0: [sdg]
Apr  2 22:49:39 zentyal kernel: [110351.488961] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Apr  2 22:49:39 zentyal kernel: [110351.489066] mpt2sas0: removing handle(0x000e), sas_addr(0x4433221106000000)
Apr  2 22:49:42 zentyal kernel: [110354.194294] sd 4:0:6:0: [sdh] Synchronizing SCSI cache
Apr  2 22:49:42 zentyal kernel: [110354.194322] sd 4:0:6:0: [sdh]
Apr  2 22:49:42 zentyal kernel: [110354.194323] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Apr  2 22:49:42 zentyal kernel: [110354.195048] mpt2sas0: removing handle(0x000f), sas_addr(0x4433221107000000)
Apr  2 22:50:14 zentyal kernel: [110386.383123] scsi 4:0:7:0: Direct-Access     ATA      ST2000DL003-9VT1 CC32 PQ: 0 ANSI: 6
Apr  2 22:50:14 zentyal kernel: [110386.383136] scsi 4:0:7:0: SATA: handle(0x000f), sas_addr(0x4433221107000000), phy(7), device_name(0x0000000000000000)
Apr  2 22:50:14 zentyal kernel: [110386.383139] scsi 4:0:7:0: SATA: enclosure_logical_id(0x500605b0047955a0), slot(4)
Apr  2 22:50:14 zentyal kernel: [110386.383350] scsi 4:0:7:0: atapi(n), ncq(y), asyn_notify(n), smart(y), fua(y), sw_preserve(y)
Apr  2 22:50:14 zentyal kernel: [110386.383358] scsi 4:0:7:0: qdepth(32), tagged(1), simple(0), ordered(0), scsi_level(7), cmd_que(1)
Apr  2 22:50:14 zentyal kernel: [110386.383551] sd 4:0:7:0: Attached scsi generic sg4 type 0
Apr  2 22:50:14 zentyal kernel: [110386.384518] sd 4:0:7:0: [sdi] 3907029168 512-byte logical blocks: (2.00 TB/1.81 TiB)
Apr  2 22:50:14 zentyal kernel: [110386.421347] sd 4:0:7:0: [sdi] Write Protect is off
Apr  2 22:50:14 zentyal kernel: [110386.421353] sd 4:0:7:0: [sdi] Mode Sense: 7f 00 10 08
Apr  2 22:50:14 zentyal kernel: [110386.433602] sd 4:0:7:0: [sdi] Write cache: enabled, read cache: enabled, supports DPO and FUA
Apr  2 22:50:14 zentyal kernel: [110386.510942]  sdi: sdi1
Apr  2 22:50:14 zentyal kernel: [110386.609074] sd 4:0:7:0: [sdi] Attached SCSI disk
Apr  2 22:50:17 zentyal kernel: [110389.881697] scsi 4:0:8:0: Direct-Access     ATA      ST2000DL003-9VT1 CC32 PQ: 0 ANSI: 6
Apr  2 22:50:17 zentyal kernel: [110389.881709] scsi 4:0:8:0: SATA: handle(0x000c), sas_addr(0x4433221105000000), phy(5), device_name(0x0000000000000000)
Apr  2 22:50:17 zentyal kernel: [110389.881712] scsi 4:0:8:0: SATA: enclosure_logical_id(0x500605b0047955a0), slot(6)
Apr  2 22:50:17 zentyal kernel: [110389.881908] scsi 4:0:8:0: atapi(n), ncq(y), asyn_notify(n), smart(y), fua(y), sw_preserve(y)
Apr  2 22:50:17 zentyal kernel: [110389.881914] scsi 4:0:8:0: qdepth(32), tagged(1), simple(0), ordered(0), scsi_level(7), cmd_que(1)
Apr  2 22:50:17 zentyal kernel: [110389.882066] sd 4:0:8:0: Attached scsi generic sg6 type 0
Apr  2 22:50:17 zentyal kernel: [110389.884018] sd 4:0:8:0: [sdj] 3907029168 512-byte logical blocks: (2.00 TB/1.81 TiB)
Apr  2 22:50:17 zentyal kernel: [110389.914465] sd 4:0:8:0: [sdj] Write Protect is off
Apr  2 22:50:17 zentyal kernel: [110389.914471] sd 4:0:8:0: [sdj] Mode Sense: 7f 00 10 08
Apr  2 22:50:17 zentyal kernel: [110389.926766] sd 4:0:8:0: [sdj] Write cache: enabled, read cache: enabled, supports DPO and FUA
Apr  2 22:50:18 zentyal kernel: [110390.007076]  sdj: sdj1
Apr  2 22:50:18 zentyal kernel: [110390.113746] sd 4:0:8:0: [sdj] Attached SCSI disk
Apr  2 22:50:19 zentyal kernel: [110391.631101] scsi 4:0:9:0: Direct-Access     ATA      ST2000DL003-9VT1 CC32 PQ: 0 ANSI: 6
Apr  2 22:50:19 zentyal kernel: [110391.631119] scsi 4:0:9:0: SATA: handle(0x000e), sas_addr(0x4433221106000000), phy(6), device_name(0x0000000000000000)
Apr  2 22:50:19 zentyal kernel: [110391.631121] scsi 4:0:9:0: SATA: enclosure_logical_id(0x500605b0047955a0), slot(5)
Apr  2 22:50:19 zentyal kernel: [110391.631260] scsi 4:0:9:0: atapi(n), ncq(y), asyn_notify(n), smart(y), fua(y), sw_preserve(y)
Apr  2 22:50:19 zentyal kernel: [110391.631267] scsi 4:0:9:0: qdepth(32), tagged(1), simple(0), ordered(0), scsi_level(7), cmd_que(1)
Apr  2 22:50:19 zentyal kernel: [110391.631423] sd 4:0:9:0: Attached scsi generic sg7 type 0
Apr  2 22:50:19 zentyal kernel: [110391.632100] sd 4:0:9:0: [sdk] 3907029168 512-byte logical blocks: (2.00 TB/1.81 TiB)
Apr  2 22:50:19 zentyal kernel: [110391.661327] sd 4:0:9:0: [sdk] Write Protect is off
Apr  2 22:50:19 zentyal kernel: [110391.661333] sd 4:0:9:0: [sdk] Mode Sense: 7f 00 10 08
Apr  2 22:50:19 zentyal kernel: [110391.673534] sd 4:0:9:0: [sdk] Write cache: enabled, read cache: enabled, supports DPO and FUA
Apr  2 22:50:19 zentyal kernel: [110391.752466]  sdk: sdk1
Apr  2 22:50:19 zentyal kernel: [110391.850195] sd 4:0:9:0: [sdk] Attached SCSI disk

Obviously this then causes havoc as the system still tries to communicate with /dev/sde.
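
One way to see which physical port sits behind each new device name is to pair the mpt2sas attach lines with the sd lines that follow them. The sketch below is only an illustration in Python, assuming the log format shown above; the function name `attached_devices` is made up for this example, and the regular expressions are written against these particular log lines rather than any documented format.

```python
import re

# mpt2sas line reporting a SATA device with its SAS address and phy number.
SATA_RE = re.compile(
    r"scsi (\d+:\d+:\d+:\d+): SATA: handle\((0x[0-9a-f]+)\), "
    r"sas_addr\((0x[0-9a-f]+)\), phy\((\d+)\)"
)
# sd line announcing the kernel device name for a SCSI address.
ATTACHED_RE = re.compile(r"sd (\d+:\d+:\d+:\d+): \[(sd[a-z]+)\] Attached SCSI disk")


def attached_devices(log_text):
    """Return {sas_addr: {"phy": int, "scsi": str, "dev": str}} for attach events."""
    by_scsi = {}   # SCSI address -> (sas_addr, phy), filled from the SATA lines
    result = {}
    for line in log_text.splitlines():
        m = SATA_RE.search(line)
        if m:
            scsi, _handle, sas_addr, phy = m.groups()
            by_scsi[scsi] = (sas_addr, int(phy))
            continue
        m = ATTACHED_RE.search(line)
        if m and m.group(1) in by_scsi:
            sas_addr, phy = by_scsi[m.group(1)]
            result[sas_addr] = {"phy": phy, "scsi": m.group(1), "dev": m.group(2)}
    return result
```

Fed the text of the kernel log, it returns the phy and current sdX name for each SAS address. In the log above the SAS address in a "removing handle" line reappears in a later attach line, so it is a steadier handle on a drive than the device name.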

Code:

Apr  2 22:59:09 zentyal kernel: [110920.899112] XFS (sde1): metadata I/O error: block 0x1ee2f70 ("xfs_trans_read_buf_map") error 19 numblks 16
Apr  2 22:59:09 zentyal kernel: [110920.899128] XFS (sde1): xfs_imap_to_bp: xfs_trans_read_buf() returned error 19.
Apr  2 22:59:09 zentyal kernel: [110920.899970] XFS (sde1): metadata I/O error: block 0xeda8c0 ("xfs_trans_read_buf_map") error 19 numblks 16
Apr  2 22:59:09 zentyal kernel: [110920.899975] XFS (sde1): xfs_imap_to_bp: xfs_trans_read_buf() returned error 19.
Apr  2 22:59:09 zentyal kernel: [110920.899996] XFS (sde1): metadata I/O error: block 0x1ee2f80 ("xfs_trans_read_buf_map") error 19 numblks 16
Apr  2 22:59:09 zentyal kernel: [110920.899999] XFS (sde1): xfs_imap_to_bp: xfs_trans_read_buf() returned error 19.
Apr  2 22:59:09 zentyal kernel: [110920.900017] XFS (sde1): metadata I/O error: block 0x1ee2f70 ("xfs_trans_read_buf_map") error 19 numblks 16
Apr  2 22:59:09 zentyal kernel: [110920.900020] XFS (sde1): xfs_imap_to_bp: xfs_trans_read_buf() returned error 19.
Apr  2 22:59:09 zentyal kernel: [110920.900054] XFS (sde1): metadata I/O error: block 0x1ee2f70 ("xfs_trans_read_buf_map") error 19 numblks 16
Apr  2 22:59:09 zentyal kernel: [110920.900057] XFS (sde1): xfs_imap_to_bp: xfs_trans_read_buf() returned error 19.

It's not always the same LUNs/disks that this happens on and SMART doesn't report any issues with any of the 7 drives attached to the controller.

This is a Zentyal server 3.5, which is basically Ubuntu Server LTS 14.04.

Any thoughts on this? Could a motherboard swap cause this, or is that just coincidence?

Cheers.

zuikway · 04-03-2015, 12:27 PM

A couple of questions, as I have had problems myself with the 9211. Which firmware version? The latest is 20 http://www.lsi.com/products/host-bus....aspx#tab/tab4

Also, was the previous board and this board using the full 8x PCIe lanes?

I have had MB compatibility issues with some LSI controllers, depending on firmware version.

MQMan · 04-03-2015, 01:10 PM

When I was switching the motherboards, I did flash to the latest (20) firmware. However, I started getting all sorts of weird I/O errors. Not the ones reported here, others. After some Google-fu I found references that the firmware version should match the driver version. As the driver version is 16:

Code:

mpt2sas version 16.100.00.00 loaded

I reflashed back to firmware 16. This resolved the previous errors, but now I started to see this issue occasionally. Because even 16 was higher than I was previously using, I went back to the firmware that had been in the board for the previous 2 years, without issue:

Code:

mpt2sas0: LSISAS2008: FWVersion(14.00.01.00), ChipRevision(0x03), BiosVersion(00.00.00.00)
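
The driver and firmware version strings quoted in this post can be compared mechanically. This is a minimal Python sketch, assuming the two log formats shown here; `major_version` and `versions_match` are hypothetical helper names, and whether the major versions really need to match is the claim being checked, not an established rule.

```python
import re

DRIVER_RE = r"mpt2sas version (\d+)\."    # e.g. "mpt2sas version 16.100.00.00 loaded"
FIRMWARE_RE = r"FWVersion\((\d+)\."       # e.g. "FWVersion(14.00.01.00)"


def major_version(text, pattern):
    """Return the leading version number matched by pattern, or None if absent."""
    m = re.search(pattern, text)
    return int(m.group(1)) if m else None


def versions_match(driver_line, firmware_line):
    """True only if both major versions were found and are equal."""
    driver = major_version(driver_line, DRIVER_RE)
    firmware = major_version(firmware_line, FIRMWARE_RE)
    return driver is not None and driver == firmware
```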

For a while I thought that had fixed it, as I didn't see any other occurrences for around 10 days, where previously the longest I'd gone was 2, maybe 3, days. But it happened again last night, hence the reason for the post.

The previous board was a true x8 link, x8 slot. On this board, the card was originally in an x4 link, x8 slot; now it's in an x16 link, x16 slot, and I've had the issue in both. There isn't an x8 link on the board.

Cheers.

zuikway · 04-03-2015, 01:36 PM

Your info may help me also. I have had to RMA this particular controller several times. It would go for 3-4 months then quit. I have a fan that blows directly on the heat sink. I also removed the heat sink and replaced the sticky heat pad with heat sink compound. This chip gets hot. If it has ever been hit with static, it may appear fine, but a chip damaged by ESD becomes leaky, and even more so when hot.

One other question: this card is PCIe 2 and it may have problems with PCIe 3 slots. I could not see which PCIe version the P8B-X has. Perhaps try a different slot, even a 1X slot. Your board may be failing or having hardware issues, but that is just a guess.

I have had much better luck with the 9207-8i, which is PCIe 3.0, in 3.0 PCIe slots running Debian Jessie.

MQMan · 04-03-2015, 02:28 PM

Both boards are PCIe 2 and in fact both use the same controller, an Intel C202. The card is also PCIe 2.

Before going into this server, the MB was used as an ESXi host with an LSI 9260-8i MegaRAID card in the x4 link, x8 slot with no issues.

I can't use the 1x slot, as it's a closed end 1x form factor, so the card physically won't fit.

Cheers.

zuikway · 04-03-2015, 07:31 PM

I wish I could be of more help. Please advise anything you find out.

One more question, when you updated the flash, did you do step 13 in this link:
https://forums.freenas.org/index.php...s9240-8i.8632/

this requires the number off the green tag:
namely sas2flsh -o -sasadd 500605b*****
Not sure what this does myself.

MQMan · 04-04-2015, 01:23 PM

Quote:

Originally Posted by zuikway

I wish I could be of more help. Please advise anything you find out.

One more question, when you updated the flash, did you do step 13 in this link:
https://forums.freenas.org/index.php...s9240-8i.8632/

this requires the number off the green tag:
namely sas2flsh -o -sasadd 500605b*****
Not sure what this does myself.

If you're cross-flashing to a different controller then the empty.bin wipes everything out, including the embedded SAS ID. That command just restores it back. If you're only flashing between firmware versions, then the ID isn't touched, so that command isn't necessary.

Cheers.

MQMan · 04-15-2015, 12:27 PM

OK, I think I've gotten to the bottom of this and it was (kinda) the motherboard swap that caused it.

I noticed that even though it wasn't always the same LUN/disk that triggered this, it was always one of four, not all seven drives. This led me to switch around the 2 breakout cables on the card to see if the problem moved to other drives.
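
The "one of four" pattern can be checked against the phy numbers in the kernel log. This is a rough Python sketch, assuming phys 0-3 hang off one breakout cable and phys 4-7 off the other; that split is an assumption about how the card is wired, and `group_by_connector` is a name invented for this example.

```python
from collections import defaultdict

PHYS_PER_CONNECTOR = 4  # assumption: phys 0-3 share one breakout cable, 4-7 the other


def group_by_connector(phys):
    """Group phy numbers by the (assumed) connector they belong to."""
    groups = defaultdict(list)
    for phy in phys:
        groups[phy // PHYS_PER_CONNECTOR].append(phy)
    return {connector: sorted(members) for connector, members in groups.items()}


# Phys reported for the three drives that re-attached in the log in the first post.
print(group_by_connector([7, 5, 6]))  # prints {1: [5, 6, 7]}
```

Under that assumption all three drops in the posted log land on the same connector, which is what you'd expect from a seating or cable problem rather than a failing drive.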

Since doing this I haven't seen any issues in almost 2 weeks. So, I'm guessing that either the card or one of the breakout cables wasn't fully seated.

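One other piece that also kinda confirms this is the SMART reports from the drives. They all have a high value for attribute 199, UDMA_CRC_Error_Count, which counts transfer errors on the link between drive and controller and so usually points at cabling or connectors rather than the drives themselves.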
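
To keep an eye on that counter, the raw value can be pulled out of the attribute table that `smartctl -A` prints. This is a small Python sketch, assuming the usual ten-column table layout with the raw value in the last column; `crc_error_count` is a hypothetical helper, not part of smartmontools.

```python
def crc_error_count(smartctl_output):
    """Return the raw value of SMART attribute 199, or None if it is not listed."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with the numeric ID; the raw value is the tenth column.
        if len(fields) >= 10 and fields[0] == "199":
            return int(fields[9])
    return None
```

The raw count is cumulative, so a high number on its own only says there were errors at some point. What matters after reseating the cables is whether it stops increasing between two readings.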

Cheers.