[SOLVED] IBM M1015 (LSI 9211-8i) Drops and Re-Allocates LUNs
I just had a SuperMicro X9SCL server board die on me, and it was replaced by an Asus P8B-X server board. Since then I've been having strange issues with my IBM M1015 controller. The controller has been cross-flashed with the LSI 9211-8i firmware running in IT mode. In the SuperMicro board, it had been running for a couple of years with no issues at all.
Basically, the controller will throw a DID_NOT_CONNECT error on one or more LUNs. The LUN is dropped and the drive is then immediately reallocated to a different one, so in effect a drive previously allocated to /dev/sde suddenly becomes /dev/sdi.
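Since the kernel names move around when a LUN is dropped and re-added, it can help to track drives by their persistent names instead. The following is only a minimal sketch, assuming the usual udev-managed `/dev/disk/by-id` directory is present; the function name `map_stable_ids` is made up for illustration.

```python
import os

def map_stable_ids(by_id_dir="/dev/disk/by-id"):
    """Return {persistent name: current kernel device path} for whole disks."""
    mapping = {}
    if not os.path.isdir(by_id_dir):
        return mapping
    for name in sorted(os.listdir(by_id_dir)):
        link = os.path.join(by_id_dir, name)
        # Skip partition links such as "...-part1" and anything that is not a symlink.
        if "-part" in name or not os.path.islink(link):
            continue
        # Resolve the symlink to whatever /dev/sdX the drive has right now.
        mapping[name] = os.path.realpath(link)
    return mapping

if __name__ == "__main__":
    for stable_id, device in map_stable_ids().items():
        print(f"{stable_id} -> {device}")
```

Running it before and after a drop shows which physical drive moved, for example from /dev/sde to /dev/sdi, because the persistent name stays the same while the resolved path changes.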
When I was switching the motherboards, I flashed to the latest (20) firmware. However, I started getting all sorts of weird I/O errors, different from the ones reported here. After some Google-fu I found references saying that the firmware version should match the driver version. As the driver version is 16:
Code:
mpt2sas version 16.100.00.00 loaded
I reflashed back to firmware 16. This resolved the previous errors, but now I started to see this issue occasionally. Because even 16 was higher than I was previously using, I went back to the firmware that had been in the board for the previous 2 years, without issue:
For a while I thought that had fixed it, as I didn't see any other occurrences for around 10 days, where previously the longest I'd gone was 2, maybe 3, days. But it happened again last night, hence this post.
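The driver/firmware matching rule mentioned above (a rule of thumb found by searching, not a confirmed vendor requirement) can be sketched as a small check. This assumes the driver line looks like the mpt2sas message quoted earlier; the firmware version strings in the usage note are hypothetical placeholders, and both function names are invented for this example.

```python
import re

def major_version(text):
    """Return the leading number of the first dotted version in text, or None."""
    match = re.search(r"(\d+)(?:\.\d+)+", text)
    return int(match.group(1)) if match else None

def versions_match(driver_line, firmware_version):
    """True when driver and firmware share the same major version number."""
    driver = major_version(driver_line)
    firmware = major_version(firmware_version)
    return driver is not None and driver == firmware
```

For instance, `versions_match("mpt2sas version 16.100.00.00 loaded", "16.00.00.00")` compares 16 with 16, while a hypothetical firmware string starting with 20 would not match.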
On the previous board the card was in a true x8 link, x8 slot. On this board it was originally in an x4 link, x8 slot, and it is now in an x16 link, x16 slot; I have had the issue in both. There isn't an x8 link on the board.
Your info may help me also. I have had to RMA this particular controller several times. It would go for 3-4 months, then quit. I have a fan that blows directly on the heat sink. I also removed the heat sink and replaced the sticky heat pad with heat sink compound. This chip gets hot. If it has ever been hit with static, it may appear fine, but a chip damaged by ESD becomes leaky, and even more so when hot.
One other thought: this card is PCIe 2.0 and it may have problems in PCIe 3.0 slots. I could not find which PCIe version the P8B-X has. Perhaps try a different slot, even an x1 slot. Your board may be failing or having hardware issues, but that is just a guess.
I have had much better luck with the 9207-8i, which is PCIe 3.0, in 3.0 PCIe slots running Debian Jessie.
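To see what link the card actually negotiated in a given slot, the kernel exposes PCIe link attributes in sysfs. Below is a rough sketch, assuming a kernel that provides the `current_link_*` and `max_link_*` files; the device address in the usage note is a hypothetical placeholder, and the function names are made up here.

```python
import os

LINK_ATTRS = ("current_link_speed", "max_link_speed",
              "current_link_width", "max_link_width")

def read_link_info(device_dir):
    """Read PCIe link attributes from a sysfs device directory."""
    info = {}
    for attr in LINK_ATTRS:
        try:
            with open(os.path.join(device_dir, attr)) as handle:
                info[attr] = handle.read().strip()
        except OSError:
            # Attribute missing or unreadable on this kernel/device.
            info[attr] = None
    return info

def _leading_number(value):
    """Parse the number at the start of strings like '8' or '5.0 GT/s PCIe'."""
    try:
        return float(value.split()[0])
    except (AttributeError, IndexError, ValueError):
        return None

def is_downtrained(info):
    """True when the negotiated width or speed is below the device maximum."""
    for kind in ("width", "speed"):
        current = _leading_number(info.get(f"current_link_{kind}"))
        maximum = _leading_number(info.get(f"max_link_{kind}"))
        if current is not None and maximum is not None and current < maximum:
            return True
    return False
```

Usage would look something like `is_downtrained(read_link_info("/sys/bus/pci/devices/0000:01:00.0"))`, with the address taken from `lspci` for the controller. An x8 card sitting in an x4 link would be expected to report a current width of 4 against a maximum of 8.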
This requires the number off the green tag, namely:
Code:
sas2flsh -o -sasadd 500605b*****
Not sure what this does myself.
If you're cross-flashing to a different controller then the empty.bin wipes everything out, including the embedded SAS ID. That command just restores it back. If you're only flashing between firmware versions, then the ID isn't touched, so that command isn't necessary.
OK, I think I've gotten to the bottom of this and it was (kinda) the motherboard swap that caused it.
I noticed that even though it wasn't always the same LUN/disk that triggered this, it was always one of the same four drives, never any of the other three out of the seven. This led me to switch around the 2 breakout cables on the card to see if the problem moved to other drives.
Since doing this I haven't seen any issues in almost 2 weeks. So, I'm guessing that either the card or one of the breakout cables wasn't fully seated.
One other piece of evidence that also kinda confirms this is the SMART reports from the drives. They all have a high value for attribute 199: UDMA_CRC_Error_Count.
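For anyone wanting to keep an eye on that attribute, here is a small sketch that pulls the raw value of attribute 199 out of the attribute table printed by `smartctl -A`. It assumes the standard ten-column table layout; the function name is invented for this example.

```python
def crc_error_count(smartctl_output):
    """Return the raw value of SMART attribute 199, or None if it is absent."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with the numeric ID; the raw value is the 10th column.
        if len(fields) >= 10 and fields[0] == "199":
            try:
                return int(fields[9])
            except ValueError:
                return None
    return None
```

The counter is cumulative, so after reseating a cable the thing to watch is whether the number keeps climbing, not whether it is non-zero.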