Linux - Enterprise: This forum is for all items relating to using Linux in the Enterprise.
I am relatively new to the whole FC / storage arena.
I have a Linux server (HP DL580) which had exactly 1 dual-port QLogic HBA installed. Each port registers as a separate card / entity in the kernel, so as far as the OS is concerned, it's got two HBAs. (If you go into /sys/class/fc_host, you see host0/ and host1/ .)
I was able to get the qlogic drivers working, multipath configured on top of the SCSI LUNs presented from the SAN, etc. The host was connected direct-attached to ports on an HP EVA SAN setup.
We took the server down and installed a second dual-port QLogic HBA. The reason is that we're moving to a Brocade fabric-based setup, and the storage admins want me to see whether I can view a Fibre Channel tape drive they connected to the fabric.
So after power-on, the OS does indeed see two new HBA ports. (I can tell because in /sys/class/fc_host there are now entries for host2/ and host3/, alongside the previous host0/ and host1/.) Additionally, I see a new entry in /sys/class/fc_remote_ports/, which is rport-2:0-0. In /sys/class/fc_transport, I see target2:0:0.
Not fully understanding what was going on, and not knowing if I was supposed to, I went ahead and did a:
echo "- - -" > scan in /sys/class/scsi_host/host2/ and host3/,
thinking perhaps the tape drive would show up as a new /dev/sd? device. It did not.
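For reference, the rescan described above can be wrapped in a small helper. This is a sketch only, assuming the sysfs layout shown in this thread; the sysfs root is a parameter so the function can be exercised against a scratch tree rather than a live system.

```shell
# A sketch of the rescan described above. "- - -" is the wildcard for
# channel/target/LUN (scan everything); writing 1 to issue_lip asks the
# FC host port to re-probe the fabric. Pass /sys on a live host (as root).
rescan_fc() {
    sysfs="$1"
    for scan in "$sysfs"/class/scsi_host/host*/scan; do
        [ -w "$scan" ] && echo "- - -" > "$scan"
    done
    for lip in "$sysfs"/class/fc_host/host*/issue_lip; do
        [ -w "$lip" ] && echo 1 > "$lip"
    done
}
# e.g.  rescan_fc /sys
```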
Here's what I don't quite have my head around. How do I know if I can successfully see the tape drive they attached? How do I know if I will be able to successfully talk to any devices attached to the fabric in the future? What exactly do the entries I see in the directories fc_transport/ and fc_remote_ports/ correspond to?
For starters, what OS / version is this host? And, out of curiosity, what did you use to set up multipathing?
Quote:
How do I know if I can successfully see the tape drive they attached?
Check the contents of /proc/scsi/scsi. In addition to the SCSI info there, of particular interest may be the vendor and model fields. If re-scanning the SCSI bus does not work, you might try restarting the host.
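A quick way to scan that listing by eye is to filter for the interesting lines. A minimal sketch; it takes the file as an argument so it can be run against a saved copy of /proc/scsi/scsi just as easily as the live one:

```shell
# Sketch: show only the Host / Vendor / Model / Type lines from a
# /proc/scsi/scsi-style listing, so a newly attached device stands out.
scsi_summary() {
    grep -E '^Host:|Vendor:|Type:' "$1"
}
# On a live host:
#   scsi_summary /proc/scsi/scsi
```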
Quote:
How do I know if I will be able to successfully talk to any devices attached to the fabric in the future?
If the device's WWN/WWID is presented to the OS, you will be able to successfully talk to it. (Unless you have zoning at the switch level and/or access controls at the storage device level that would prevent it.)
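To check which WWPNs the OS can actually see, you can walk the fc_remote_ports entries mentioned earlier and print each one's port_name. A sketch, with the sysfs root as a parameter for testability; use /sys on a live host:

```shell
# Sketch: list each FC remote port and its WWPN, for matching against
# whatever the storage admins say they zoned/presented to this host.
list_wwpns() {
    sysfs="$1"
    for p in "$sysfs"/class/fc_remote_ports/rport-*/port_name; do
        [ -r "$p" ] || continue
        d=${p%/port_name}                       # directory, e.g. .../rport-2:0-0
        printf '%s %s\n' "${d##*/}" "$(cat "$p")"
    done
}
# e.g.  list_wwpns /sys
```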
The /proc/scsi/scsi check may not apply if you're using a different OS or version.
Dunno for sure. But my WAG (wild-ass guess) is that these will match up with the SCSI host, channel, and ID (as visible, for example, in /proc/scsi/scsi).
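If that guess is right, a name like target2:0:0 decodes as host 2, channel 0, target ID 0, and LUNs under it then show up as 2:0:0:&lt;lun&gt;. A quick decode sketch:

```shell
# Decode a /sys/class/fc_transport target name into host:channel:id.
t="target2:0:0"
IFS=: read host chan id <<EOF
${t#target}
EOF
echo "host=$host channel=$chan id=$id"   # prints: host=2 channel=0 id=0
```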
Thank you for the assistance. Before you posted, I took the time and went into /sys/class/fc_remote_ports, and was able to match up a WWPN from the tape drive to the port_name file in one of the rport-?:?-? directories. Good sign.
Then I used your advice and delved into /proc/scsi/scsi, and yep, the new tape drive shows up at the bottom of the list. (I recognize the Vendor: and Model: fields.) My knowledge is limited enough that I'm still confused why a scan of the scsi bus didn't end up producing an additional file in the form of a /dev/sd? device file. I'm wondering if udev would need to be tweaked to actually create the device file for it.
To answer your other questions, this is a RHEL 5.3 box. I used dm-multipath to handle multipathing to the SAN LUNs.
I will check out your kbase article - I think I may have read that last night.
Quote:
Originally Posted by larold
My knowledge is limited enough that I'm still confused why a scan of the scsi bus didn't end up producing an additional file in the form of a /dev/sd? device file. I'm wondering if udev would need to be tweaked to actually create the device file for it.
I don't know the answer, but I can tell you that (for whatever reason) SCSI bus scans haven't always worked for me. In those situations I've made a practice of rebooting. It's not very convenient, but these are one-off situations.
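For what it's worth, one likely explanation for the missing /dev/sd? node: a tape drive is a Sequential-Access SCSI device handled by the st driver, so even a successful scan produces /dev/st0 and /dev/nst0 (plus a /dev/sg? passthrough node), not a /dev/sd? disk node. A sketch that checks a /proc/scsi/scsi-style listing for tape entries:

```shell
# Tape drives report Type: Sequential-Access and are driven by st, not
# sd -- look for /dev/st* and /dev/nst* rather than /dev/sd?.
has_tape() {
    grep -q 'Sequential-Access' "$1"
}
# On a live host:
#   modprobe st                                  # load the tape driver if needed
#   has_tape /proc/scsi/scsi && ls -l /dev/st* /dev/nst*
```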
Quote:
Originally Posted by larold
To answer your other questions, this is a RHEL 5.3 box. I used dm-multipath to handle multipathing to the san luns.
Good. I've been really happy with DM Multipath in my environment -- so far.
Quote:
Good. I've been really happy with DM Multipath in my environment -- so far.
This is slightly off the post-topic.
I was also happy with dm-mp for quite a while. Something happened a couple of weeks ago that you should be aware of and keep an eye out for. Here's what happened to me.
For unknown reasons, we had all paths to a specific LUN fail. (Through BOTH HBAs.) In the syslogs, multipath reported it saw the LUNs go down, and we saw the usual messages about paths being failed. 15 seconds later, we saw syslog messages (kernel / driver, I believe) stating that connectivity was back. However... no multipath message. Also, the paths were still marked as 'failed'. We only noticed this because of an Oracle write backlog, and our Oracle consultants saying "Hey - we did a vmstat on one of the LUNs and see some really weird things."
Turns out that sometime during those 15 seconds, multipathd died. It wasn't around to mark the paths as available again, so as far as I can tell Oracle was never able to write to the LUN. A couple of days later I was made aware of the problem, and a 'multipath -l' showed me that all paths to the LUN were still marked as 'failed'. *OOPS*.
I have since put in a Nagios check to ensure multipathd is running.
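A hypothetical sketch of that kind of check (the function name and messages are mine, not a standard Nagios plugin): exit 0 with an OK line when multipathd has a pid, exit 2 with a CRITICAL line when it doesn't, which is the convention Nagios checks follow.

```shell
# Hypothetical Nagios-style liveness check for multipathd. Relies on
# pidof(8), which is present on RHEL; exit codes follow the Nagios
# convention (0 = OK, 2 = CRITICAL).
check_multipathd() {
    if pidof multipathd >/dev/null 2>&1; then
        echo "OK: multipathd is running"
        return 0
    else
        echo "CRITICAL: multipathd is not running"
        return 2
    fi
}
```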
Maybe there's some piece of the puzzle I don't understand, but it was very scary. I have a hunch something about the LUN crash itself freaked multipathd out, causing it to die.
Like you, I'm doing some regular polling to ensure that multipath sees exactly the number of LUNs, and paths to each LUN, that I'd expect. In my case, I am doing this on each host (at the host level), since that is what I am concerned with most. That approach might not scale well for 400 nodes, but it works fine for 8 of them.
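One simple version of that polling is to count failed paths in `multipath -l` output (the RHEL 5-era format shows [failed][faulty] per path). A sketch; it reads stdin so it can be fed a saved capture as easily as live output:

```shell
# Sketch: count lines marked 'failed' in multipath -l output.
failed_path_count() {
    grep -c 'failed'
}
# On a live host:
#   multipath -l | failed_path_count
```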