How to correctly identify/relate a Disk on 3ware Controller

HiFish · 04-10-2021, 09:54 AM

Hello Everyone,
I am currently going through all the steps to follow some advice on how to reassemble a broken softraid array and have run into an unexpected puzzle: I have collected detailed information from mdadm on each of the /dev/sd[...] devices and have also collected SMART info with smartctl. Now I am stumped, because I cannot relate those two datasets to each other. I have gotten this far:

smartctl -t -d 3ware,0 /dev/twa0

This gives me detailed info on the disk connected to the first port of the controller, including a serial number, but not which /dev/sdX it is.

lshw -short

This will list all 8 Disks as "2TB 9550SX-12 DISK" without serial numbers.

Code:

/0/100/1c/0/2/0.1.0      /dev/sdb   disk       2TB 9550SX-12  DISK
/0/100/1c/0/2/0.2.0      /dev/sdc   disk       2TB 9550SX-12  DISK
/0/100/1c/0/2/0.3.0      /dev/sdd   disk       2TB 9550SX-12  DISK
/0/100/1c/0/2/0.4.0      /dev/sde   disk       2TB 9550SX-12  DISK
/0/100/1c/0/2/0.5.0      /dev/sdf   disk       2TB 9550SX-12  DISK
/0/100/1c/0/2/0.6.0      /dev/sdg   disk       2TB 9550SX-12  DISK
/0/100/1c/0/2/0.7.0      /dev/sdh   disk       2TB 9550SX-12  DISK
/0/100/1c/0/2/0.8.0      /dev/sdi   disk       2TB 9550SX-12  DISK

Now this might lead one to suspect that the drive on the first port is /dev/sdb, but how can I verify this?

/sbin/hdparm -i /dev/sdb just returns

Code:

/dev/sdb:
 HDIO_DRIVE_CMD(identify) failed: Invalid argument
 HDIO_GET_IDENTITY failed: Invalid argument

and /sbin/udevadm info --query=property --name=sdb does return some information but not the serial number of the disk.

Code:

DEVLINKS=/dev/disk/by-id/scsi-1AMCC_01840939000000000000 /dev/disk/by-path/pci-0000:02:02.0-scsi-0:0:1:0
DEVNAME=/dev/sdb
DEVPATH=/devices/pci0000:00/0000:00:1c.0/0000:01:00.0/0000:02:02.0/host0/target0:0:1/0:0:1:0/block/sdb
DEVTYPE=disk
ID_BUS=scsi
ID_FS_LABEL=ServhoernchenX5:multiverse
ID_FS_LABEL_ENC=ServhoernchenX5:multiverse
ID_FS_TYPE=linux_raid_member
ID_FS_USAGE=raid
ID_FS_UUID=7321c640-335b-eec5-3b12-db8267ca35c5
ID_FS_UUID_ENC=7321c640-335b-eec5-3b12-db8267ca35c5
ID_FS_UUID_SUB=d28d0061-9ba7-8e66-6d6b-2a66f2edae09
ID_FS_UUID_SUB_ENC=d28d0061-9ba7-8e66-6d6b-2a66f2edae09
ID_FS_VERSION=1.2
ID_MODEL=9550SX-12_DISK
ID_MODEL_ENC=9550SX-12\x20\x20DISK\x20
ID_PATH=pci-0000:02:02.0-scsi-0:0:1:0
ID_PATH_TAG=pci-0000_02_02_0-scsi-0_0_1_0
ID_REVISION=3.04
ID_SCSI=1
ID_SCSI_SERIAL=01840939000000000000
ID_SERIAL=1AMCC_01840939000000000000
ID_SERIAL_SHORT=AMCC_01840939000000000000
ID_TYPE=disk
ID_VENDOR=AMCC
ID_VENDOR_ENC=AMCC\x20\x20\x20\x20
MAJOR=8
MINOR=16
SUBSYSTEM=block
TAGS=:systemd:
USEC_INITIALIZED=5211612

Sadly the ancient hardware is the opposite of accessible and hotplug-capable, so I am scared to figure this out by manually disconnecting the disks. A reboot might also scramble the letters and confuse my efforts to reassemble the array even more.

Does anybody have a suggestion on how to correctly identify the disk with the serial WD-WCC300067052 that is connected to port ID 4 of a 3ware 9550SX-12 controller?

Greetings and thank you for reading all of this :-).

Edit: I almost forgot: uname -r gives 4.4.0-78-generic

Ser Olmy · 04-10-2021, 01:50 PM

Quote:

Originally Posted by HiFish

I am currently going through all the steps to follow some advice on how to reassemble a broken softraid array and have run into an unexpected puzzle: I have collected detailed information from mdadm on each of the /dev/sd[...] devices and have also collected SMART info with smartctl. Now I am stumped, because I cannot relate those two datasets to each other. I have gotten this far:

smartctl -t -d 3ware,0 /dev/twa0

This gives me detailed info on the disk connected to the first port of the controller, including a serial number, but not which /dev/sdX it is.

Because it probably isn't. At all.

The 3Ware 9550SX-12 is a hardware RAID controller. What makes you think this is a software RAID?

If the RAID array had been working properly, all drives would most likely belong to the same array, which would be partitioned into one or more logical drives. These in turn would turn up as "/dev/sd<something>".

If this was once a functioning 3Ware RAID setup, it would seem that the metadata has been corrupted and all drives are seen as separate, non-RAID logical drives (single-drive arrays, if you will), which is why all drives are identified as "2TB 9550SX-12 DISK".

On the other hand, if this is/was indeed a software RAID assembled from 8 non-RAID drives attached to a perfectly good RAID controller, the sysadmin who installed it should never again be allowed within shouting distance of a production server.

What exactly happened to this system? Why do you need to identify a specific drive in what seems to be a fundamentally broken setup?

Quote:

Originally Posted by HiFish

Code:

/0/100/1c/0/2/0.1.0      /dev/sdb   disk       2TB 9550SX-12  DISK
/0/100/1c/0/2/0.2.0      /dev/sdc   disk       2TB 9550SX-12  DISK
/0/100/1c/0/2/0.3.0      /dev/sdd   disk       2TB 9550SX-12  DISK
/0/100/1c/0/2/0.4.0      /dev/sde   disk       2TB 9550SX-12  DISK
/0/100/1c/0/2/0.5.0      /dev/sdf   disk       2TB 9550SX-12  DISK
/0/100/1c/0/2/0.6.0      /dev/sdg   disk       2TB 9550SX-12  DISK
/0/100/1c/0/2/0.7.0      /dev/sdh   disk       2TB 9550SX-12  DISK
/0/100/1c/0/2/0.8.0      /dev/sdi   disk       2TB 9550SX-12  DISK

Now this might lead one to suspect that the drive on the first port is /dev/sdb, but how can I verify this?

There's no reason why the drive connected to the first physical port would be registered by the kernel as the first logical drive. It might be, and it might not.

You need to access the 3Ware management (3DM2) web page, which will show you exactly how the drives are configured. If the management software isn't installed (which is inexcusable), you can download it from the support pages of whichever company happens to own the rights to the 3Ware/AMCC/LSI series of controllers. Currently, that would be Broadcom. Look for the "latest available codeset" (Note: FTP link, which does not work in some modern browsers, because Google and the Mozilla Corporation always know what's best for us.)

HiFish · 04-10-2021, 02:42 PM

Thank you for your reply. I left out some background info which I thought was not relevant to the question, but I will gladly elaborate:
The system in question is my private media storage system (not a production server), which I set up from surplus hardware many years ago with the aim of longevity over performance. The controller is a PCI-X card in an old dual-CPU workstation board. Hardware controllers are great for professional environments, but I already suffered a full data loss many years ago when an old LSI MegaRaid controller went bust and I could find no identical controller because it had been out of production for ages, and the newer ones I got would not recognize the array. So in this case I decided to use the first 8 ports in JBOD mode and run a soft RAID 6 on them for "long term" storage, and a hardware RAID 10 of old 500 GB drives on the last 4 ports for the OS. I even vaguely remember fiddling with the web interface a long time ago, but there were also good CLI tools for this controller back then. All was mostly well until some unexpected multiple power outages a few weeks ago (see this thread).

EDIT: here is the output from tw-cli /c0 show

Code:

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-10   OK             -       -       16K     931.303   ON     OFF
u1    JBOD      OK             -       -       -       1863.02   OFF    OFF
u2    JBOD      OK             -       -       -       1863.02   OFF    OFF
u3    JBOD      OK             -       -       -       1863.02   OFF    OFF
u4    JBOD      OK             -       -       -       1863.02   OFF    OFF
u5    JBOD      OK             -       -       -       1863.02   OFF    OFF
u6    JBOD      OK             -       -       -       1863.02   OFF    OFF
u7    JBOD      OK             -       -       -       1863.02   OFF    OFF
u8    JBOD      OK             -       -       -       1863.02   OFF    OFF

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     OK               u1     1.82 TB     3907029168    WD-WMC301855828
p1     OK               u2     1.82 TB     3907029168    WD-WMC301840939
p2     OK               u3     1.82 TB     3907029168    WD-WMC301841171
p3     OK               u4     1.82 TB     3907029168    WD-WCC300089632
p4     OK               u5     1.82 TB     3907029168    WD-WCC300067052
p5     OK               u6     1.82 TB     3907029168    WD-WMC301907607
p6     OK               u7     1.82 TB     3907029168    WD-WMC301841150
p7     OK               u8     1.82 TB     3907029168    WD-WCC300088760
p8     OK               u0     465.76 GB   976773168     WD-WXC1A6372531
p9     OK               u0     465.76 GB   976773168     WD-WXD1A63N3860
p10    OK               u0     465.76 GB   976773168     S0MUJ1EQ105307
p11    OK               u0     465.76 GB   976773168     S0MUJ1MPC02006
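
One detail in the data posted so far may help relate the two datasets. The udevadm output for /dev/sdb in the first post reports ID_SCSI_SERIAL=01840939000000000000, and the first eight digits of that value are also the last eight characters of WD-WMC301840939, the drive on port p1 in the table above. If the controller derives the serial of a JBOD unit from the serial of its drive (an assumption, not something confirmed here from the 3ware documentation), the udev serial of each /dev/sdX can be matched against the tw-cli port list. A minimal sketch of that comparison, where match_scsi_serial is a hypothetical helper name and the eight-character window is a guess based on this single example:

```shell
#!/bin/sh
# match_scsi_serial SCSI_SERIAL
# Reads "port serial" pairs on stdin (for example "p1 WD-WMC301840939")
# and prints every pair whose drive serial ends with the first eight
# characters of SCSI_SERIAL.
# ASSUMPTION: the unit serial is the tail of the drive serial padded
# with zeros. This is inferred from one sample only.
match_scsi_serial() {
    awk -v key="$(printf '%s' "$1" | cut -c1-8)" '
        length($2) >= length(key) &&
        substr($2, length($2) - length(key) + 1) == key { print $1, $2 }
    '
}
```

On the live system the serial for each device could be taken from udevadm info --query=property --name=sdX (the ID_SCSI_SERIAL line), repeated for sdb through sdi. If every device matches exactly one port, the mapping is probably trustworthy. Note that for sdb the match is p1, not p0, so the result should be checked for all eight devices before anything is reassembled.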

Ser Olmy · 04-10-2021, 04:55 PM

Quote:

Originally Posted by HiFish

Hardware controllers are great for professional environments, but I already suffered a full data loss many years ago when an old LSI MegaRaid controller went bust and I could find no identical controller because it had been out of production for ages, and the newer ones I got would not recognize the array.

In such cases, the RAID can be assembled using device-mapper or in some cases even mdadm. The RAID stripe format itself is standardized, and in many cases even the metadata format is documented.

Worst case scenario, you'll have to find the correct parameters using trial-and-error. For a real-world example, see the Linus Tech Tips video about their server crash.

TL;DR: Recovering from a broken controller is no more or less complex than re-assembling a software RAID with corrupt metadata.

Quote:

Originally Posted by HiFish

So in this case I decided to use the first 8 ports in JBOD mode and run a soft RAID 6 on them for "long term" storage, and a hardware RAID 10 of old 500 GB drives on the last 4 ports for the OS.

I see. And now, after a few power outages, the software RAID component has failed.

If you search on this and other forums, you'll find plenty of threads where people are trying to recover their broken software RAIDs. You'll find very few threads about failed hardware RAID controllers, despite them being widely used in the enterprise world.

There's a very simple reason for this: Both setups use SAS or SATA controllers, which can indeed fail, but hardware RAID controllers are more advanced/expensive and tend to fail less often. Also, a hardware RAID controller has built-in logic to handle unusual timeouts and devices that completely lock the SAS/SATA bus, while a standard non-RAID controller expects the OS driver to deal with any errors that may occur.

You have 8 drives connected to a RAID controller, and you use the JBOD mode for each drive. A single-drive JBOD array with one logical drive is not the same as a passthrough drive. There's metadata involved. In other words, if the controller fails and you hook the drives up to a standard SATA controller, you'll probably still have to do some detective work in order to actually access the data.

On top of this you've built a software RAID, meaning the CPU has to generate the parity blocks and send them over the PCI-X bus while the CPU on the 3Ware controller sits idle.

The result is that the RAID controller will not be able to handle timeouts or hangs involving these 8 drives, as there's no redundancy and no write cache, and the md driver has no idea why reads or writes may be timing out, giving you the worst of both worlds. That's why you now have a broken software RAID and a controller that claims the drives are all fine.

Quote:

Originally Posted by HiFish

EDIT: here is the output from tw-cli /c0 show

Code:

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-10   OK             -       -       16K     931.303   ON     OFF
u1    JBOD      OK             -       -       -       1863.02   OFF    OFF
u2    JBOD      OK             -       -       -       1863.02   OFF    OFF
u3    JBOD      OK             -       -       -       1863.02   OFF    OFF
u4    JBOD      OK             -       -       -       1863.02   OFF    OFF
u5    JBOD      OK             -       -       -       1863.02   OFF    OFF
u6    JBOD      OK             -       -       -       1863.02   OFF    OFF
u7    JBOD      OK             -       -       -       1863.02   OFF    OFF
u8    JBOD      OK             -       -       -       1863.02   OFF    OFF

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     OK               u1     1.82 TB     3907029168    WD-WMC301855828
p1     OK               u2     1.82 TB     3907029168    WD-WMC301840939
p2     OK               u3     1.82 TB     3907029168    WD-WMC301841171
p3     OK               u4     1.82 TB     3907029168    WD-WCC300089632
p4     OK               u5     1.82 TB     3907029168    WD-WCC300067052
p5     OK               u6     1.82 TB     3907029168    WD-WMC301907607
p6     OK               u7     1.82 TB     3907029168    WD-WMC301841150
p7     OK               u8     1.82 TB     3907029168    WD-WCC300088760
p8     OK               u0     465.76 GB   976773168     WD-WXC1A6372531
p9     OK               u0     465.76 GB   976773168     WD-WXD1A63N3860
p10    OK               u0     465.76 GB   976773168     S0MUJ1EQ105307
p11    OK               u0     465.76 GB   976773168     S0MUJ1MPC02006

Right, so the first unit/logical drive (u0) is the RAID 10 array, while u1-u8 are the drives connected to ports 0-7. That should make u0 /dev/sda and u1-u8 /dev/sdb-/dev/sdi, since all units and drives seem to be active. That's assuming the 3Ware kernel driver enumerates drives according to unit numbers.
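
The mapping this assumption predicts can be written out from the tw-cli port table, so that it can be compared against whatever the system actually reports. A minimal sketch, assuming unit uN appears as SCSI target N and target N receives the (N+1)th sd letter; expected_nodes is a hypothetical helper name, and the output is a prediction to verify, not a confirmed mapping:

```shell
#!/bin/sh
# expected_nodes: read the tw-cli port table on stdin and print
# "port unit serial expected-node" for every port line.
# ASSUMPTION: unit uN is SCSI target N, and target N gets the
# (N+1)th sd letter (u0 -> sda, u1 -> sdb, ...).
expected_nodes() {
    awk '$1 ~ /^p[0-9]+$/ && $3 ~ /^u[0-9]+$/ {
        n = substr($3, 2) + 0                          # unit number
        letter = substr("abcdefghijklmnopqrstuvwxyz", n + 1, 1)
        printf "%s %s %s /dev/sd%s\n", $1, $3, $NF, letter
    }'
}
```

Feeding it the port table from the quoted output (for example by piping the tw-cli output into expected_nodes) lists each drive serial next to the device node it would have if the enumeration really is linear. The four RAID 10 ports all map to /dev/sda, since they share unit u0.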

HiFish · 04-10-2021, 06:35 PM

Thank you for your explanation. Since this is a glorified NAS with CPUs that sit idle most of the time, I was never worried about performance, but I take your point about the robustness of this solution versus a professional full hardware RAID with battery-backed cache.
I am now a bit unsure about whether this is a clean pass-through or some borked meta-drive. I certainly never configured any JBOD arrays, like you can on some controllers. On boot this controller shows all non-configured drives as JBOD, if I remember correctly.
Sadly the data from smartctl about SMART errors on 2 drives seems to conflict with the output of mdadm, which, based on the assumption of linear enumeration, shows problems for not exactly those 2 drives. This is why I am looking for a way to confirm which drive letter currently corresponds to which disk. I will work through the documentation for tw-cli. Maybe there is a way to temporarily turn a disk off or put it to sleep, to see which letter corresponds to it without actually unplugging anything?

Ser Olmy · 04-11-2021, 02:05 AM

It's been a while since I used 3Ware controllers, but some management software provides a function called "identify drive".

However, that's only useful if the drive is mounted in an enclosure with activity lights, or if there's a LED on the drive itself (extremely unlikely with modern drives).

As for the "JBOD" mode (which is not an official term), it is entirely possible that 3Ware chose to call passthrough drives "JBOD." In that case, since there's no metadata, there will be no way to add more drives to an existing JBOD array. The documentation should be able to confirm this.

HiFish · 04-11-2021, 02:44 PM

I know this feature from some of the huge storage units at work. I suppose the 3ware controller would even support it if I had a proper hotplug bay with indicator LEDs, but that's not the issue. I can already physically identify which drive sits at which port of the controller, because tw-cli shows this very nicely with serial numbers. What I can't figure out is how to relate this to the enumerated drive letters. I need those because mdadm will only work or give details in the context of drive letters. I am still baffled that this is even an issue and still hold out hope that I have just overlooked something obvious.

Ser Olmy · 04-11-2021, 03:15 PM

No, it's a real problem alright. The md driver deals with device nodes, and it's up to the sysadmin to figure out how these map to real devices. Which is another reason why creating an md RAID on top of virtual devices that all have the exact same name ("2TB 9550SX-12 DISK") might not be such a great idea.

I assume you have checked the kernel boot log for messages from the driver/kernel as it enumerates the logical drives?

HiFish · 04-12-2021, 02:54 PM

Well, sort of, assuming you are referring to the output of dmesg. I am equally unsure whether the IDs given there are the "right" or "wrong" IDs, which I could then compare to the smartctl output. Could you shed some light on this? The fun starts at 4.224010. To be fair, 8 times "2TB 9550SX-12 DISK" is not much worse than 8 times "WDC WD20EFRX-68AX9N0" (which is the model name), or do you think this would show the serial instead if the drive were attached to a regular SATA port?

Code:

...
[    3.868959] 3w-9xxx: scsi0: Found a 3ware 9000 Storage Controller at 0xe3a00000, IRQ: 16.
[    3.904302] e1000e 0000:00:19.0 eth0: (PCI Express:2.5GT/s:Width x1) 00:15:17:b2:96:c1
[    3.926071] e1000e 0000:00:19.0 eth0: Intel(R) PRO/1000 Network Connection
[    3.947160] e1000e 0000:00:19.0 eth0: MAC: 7, PHY: 6, PBA No: 0070FF-0FF
[    3.969548] e1000e 0000:00:19.0 rename3: renamed from eth0
[    4.188064] clocksource: Switched to clocksource tsc
[    4.224010] 3w-9xxx: scsi0: Firmware FE9X 3.04.01.011, BIOS BE9X 3.04.00.002, Ports: 12.
[    4.246828] scsi 0:0:0:0: Direct-Access     AMCC     9550SX-12  DISK  3.04 PQ: 0 ANSI: 3
[    4.270903] scsi 0:0:1:0: Direct-Access     AMCC     9550SX-12  DISK  3.04 PQ: 0 ANSI: 3
[    4.293931] scsi 0:0:2:0: Direct-Access     AMCC     9550SX-12  DISK  3.04 PQ: 0 ANSI: 3
[    4.317418] scsi 0:0:3:0: Direct-Access     AMCC     9550SX-12  DISK  3.04 PQ: 0 ANSI: 3
[    4.340187] scsi 0:0:4:0: Direct-Access     AMCC     9550SX-12  DISK  3.04 PQ: 0 ANSI: 3
[    4.362679] scsi 0:0:5:0: Direct-Access     AMCC     9550SX-12  DISK  3.04 PQ: 0 ANSI: 3
[    4.384124] scsi 0:0:6:0: Direct-Access     AMCC     9550SX-12  DISK  3.04 PQ: 0 ANSI: 3
[    4.405212] scsi 0:0:7:0: Direct-Access     AMCC     9550SX-12  DISK  3.04 PQ: 0 ANSI: 3
[    4.426830] scsi 0:0:8:0: Direct-Access     AMCC     9550SX-12  DISK  3.04 PQ: 0 ANSI: 3
[    4.450400] sd 0:0:0:0: Attached scsi generic sg0 type 0
[    4.450546] sd 0:0:0:0: [sda] 1953083392 512-byte logical blocks: (1000 GB/931 GiB)
[    4.450941] sd 0:0:0:0: [sda] Write Protect is off
[    4.450943] sd 0:0:0:0: [sda] Mode Sense: 23 00 00 00
[    4.470373] sd 0:0:0:0: [sda] Write cache: enabled, read cache: disabled, doesn't support DPO or FUA
[    4.530894] sd 0:0:1:0: Attached scsi generic sg1 type 0
[    4.531032] sd 0:0:1:0: [sdb] 3907029168 512-byte logical blocks: (2.00 TB/1.82 TiB)
[    4.550650] sd 0:0:1:0: [sdb] Write Protect is off
[    4.550652] sd 0:0:1:0: [sdb] Mode Sense: 23 00 00 00
[    4.551084] sd 0:0:1:0: [sdb] Write cache: disabled, read cache: disabled, doesn't support DPO or FUA
[    4.604993]  sda: sda1 sda2 < sda5 >
[    4.631241] sd 0:0:0:0: [sda] Attached SCSI disk
[    4.631580] sd 0:0:2:0: Attached scsi generic sg2 type 0
[    4.631817] sd 0:0:3:0: Attached scsi generic sg3 type 0
[    4.632051] sd 0:0:4:0: Attached scsi generic sg4 type 0
[    4.632273] sd 0:0:5:0: Attached scsi generic sg5 type 0
[    4.632484] sd 0:0:6:0: Attached scsi generic sg6 type 0
[    4.632696] sd 0:0:7:0: Attached scsi generic sg7 type 0
[    4.632909] sd 0:0:8:0: Attached scsi generic sg8 type 0
[    4.650603] sd 0:0:2:0: [sdc] 3907029168 512-byte logical blocks: (2.00 TB/1.82 TiB)
[    4.650826] sd 0:0:4:0: [sde] 3907029168 512-byte logical blocks: (2.00 TB/1.82 TiB)
[    4.669968] sd 0:0:6:0: [sdg] 3907029168 512-byte logical blocks: (2.00 TB/1.82 TiB)
[    4.669972] sd 0:0:5:0: [sdf] 3907029168 512-byte logical blocks: (2.00 TB/1.82 TiB)
[    4.669977] sd 0:0:3:0: [sdd] 3907029168 512-byte logical blocks: (2.00 TB/1.82 TiB)
[    4.669979] sd 0:0:7:0: [sdh] 3907029168 512-byte logical blocks: (2.00 TB/1.82 TiB)
[    4.669992] sd 0:0:8:0: [sdi] 3907029168 512-byte logical blocks: (2.00 TB/1.82 TiB)
[    4.669999] sd 0:0:4:0: [sde] Write Protect is off
[    4.670001] sd 0:0:2:0: [sdc] Write Protect is off
[    4.670002] sd 0:0:4:0: [sde] Mode Sense: 23 00 00 00
[    4.670005] sd 0:0:2:0: [sdc] Mode Sense: 23 00 00 00
[    4.670178] sd 0:0:6:0: [sdg] Write Protect is off
[    4.670180] sd 0:0:6:0: [sdg] Mode Sense: 23 00 00 00
[    4.670276] sd 0:0:5:0: [sdf] Write Protect is off
[    4.670278] sd 0:0:5:0: [sdf] Mode Sense: 23 00 00 00
[    4.670415] sd 0:0:3:0: [sdd] Write Protect is off
[    4.670417] sd 0:0:3:0: [sdd] Mode Sense: 23 00 00 00
[    4.707637] sd 0:0:7:0: [sdh] Write Protect is off
[    4.707639] sd 0:0:7:0: [sdh] Mode Sense: 23 00 00 00
[    4.707644] sd 0:0:1:0: [sdb] Attached SCSI disk
[    4.707657] sd 0:0:8:0: [sdi] Write Protect is off
[    4.707660] sd 0:0:8:0: [sdi] Mode Sense: 23 00 00 00
[    4.707948] sd 0:0:2:0: [sdc] Write cache: disabled, read cache: disabled, doesn't support DPO or FUA
[    4.708076] sd 0:0:4:0: [sde] Write cache: disabled, read cache: disabled, doesn't support DPO or FUA
[    4.744772] sd 0:0:6:0: [sdg] Write cache: disabled, read cache: disabled, doesn't support DPO or FUA
[    4.744780] sd 0:0:5:0: [sdf] Write cache: disabled, read cache: disabled, doesn't support DPO or FUA
[    4.744939] sd 0:0:3:0: [sdd] Write cache: disabled, read cache: disabled, doesn't support DPO or FUA
[    4.744979] sd 0:0:8:0: [sdi] Write cache: disabled, read cache: disabled, doesn't support DPO or FUA
[    4.745783] sd 0:0:7:0: [sdh] Write cache: disabled, read cache: disabled, doesn't support DPO or FUA
[    4.905866]  sdh:
[    4.906264] sd 0:0:2:0: [sdc] Attached SCSI disk
[    4.934437] sd 0:0:4:0: [sde] Attached SCSI disk
[    4.934440] sd 0:0:5:0: [sdf] Attached SCSI disk
[    4.934454] sd 0:0:6:0: [sdg] Attached SCSI disk
[    4.961221] sd 0:0:8:0: [sdi] Attached SCSI disk
[    4.961642] sd 0:0:7:0: [sdh] Attached SCSI disk
[    4.986818] sd 0:0:3:0: [sdd] Attached SCSI disk
[    5.178030] random: nonblocking pool is initialized
[    5.197857] md: bind<sdb>
[    5.213547] md: bind<sdc>
[    5.225331] md: bind<sdh>
[    5.236878] md: bind<sde>
[    5.248250] md: bind<sdd>
[    5.259491] md: bind<sdg>
[    5.271042] md: bind<sdi>
[    5.282213] md: bind<sdf>
[   95.322313] md: linear personality registered for level -1
[   95.335577] md: multipath personality registered for level -4
[   95.349329] md: raid0 personality registered for level 0
[   95.364025] md: raid1 personality registered for level 1
[   95.448020] raid6: mmxx1    gen()  3281 MB/s
[   95.524020] raid6: mmxx2    gen()  3680 MB/s
[   95.600020] raid6: sse1x1   gen()  2247 MB/s
[   95.676017] raid6: sse1x2   gen()  2875 MB/s
[   95.752016] raid6: sse2x1   gen()  4171 MB/s
[   95.828004] raid6: sse2x1   xor()  4386 MB/s
[   95.904006] raid6: sse2x2   gen()  4975 MB/s
[   95.980008] raid6: sse2x2   xor()  5210 MB/s
[   95.988787] raid6: using algorithm sse2x2 gen() 4975 MB/s
[   95.997833] raid6: .... xor() 5210 MB/s, rmw enabled
[   96.007082] raid6: using ssse3x1 recovery algorithm
[   96.020916] xor: measuring software checksum speed
[   96.068007]    pIII_sse  :  8911.000 MB/sec
[   96.116003]    prefetch64-sse: 10035.000 MB/sec
[   96.125641] xor: using function: prefetch64-sse (10035.000 MB/sec)
[   96.139538] async_tx: api initialized (async)
[   96.171739] md: raid6 personality registered for level 6
[   96.182518] md: raid5 personality registered for level 5
[   96.192992] md: raid4 personality registered for level 4
[   96.212288] md: raid10 personality registered for level 10
...

Ser Olmy · 04-13-2021, 07:38 AM

It seems the driver is indeed enumerating drives according to unit number (the unit number becomes the virtual SCSI ID). I'd expect the drive attached to port 0 to then become /dev/sdb and so on.
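
The SCSI-address-to-device-node half of that mapping can be read straight from the kernel log rather than by eye. A small sketch, assuming the log lines have the same "Attached SCSI disk" format as the dmesg excerpt above; scsi_to_dev is a hypothetical helper name:

```shell
#!/bin/sh
# scsi_to_dev: read kernel log text on stdin and print one
# "host:channel:target:lun device" pair per attached disk,
# sorted by SCSI address.
scsi_to_dev() {
    sed -n 's/.*sd \([0-9]*:[0-9]*:[0-9]*:[0-9]*\): \[\(sd[a-z]*\)\] Attached SCSI disk.*/\1 \2/p' |
        sort -u
}
```

It would be used as dmesg | scsi_to_dev. The same address also appears in the by-path link from the udevadm output in the first post (pci-0000:02:02.0-scsi-0:0:1:0 for sdb), which should stay tied to the SCSI address across reboots even if the sd letters change.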

The easiest way to verify this is to run mdadm --examine on /dev/sdb through /dev/sdi and note the device UUIDs, power down and disconnect port 7, then boot the system and check that a) /dev/sdi is missing, and b) mdadm --examine returns the same UUIDs for the remaining devices.

I'd then repeat the process for ports 6 to 1, disconnecting one additional port at a time. I could then be absolutely certain that the "JBOD" drives u1-u8 do indeed correspond with device nodes /dev/sdb through /dev/sdi.
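
The before/after comparison in this procedure can be scripted so that nothing has to be compared by eye. A rough sketch, assuming mdadm --examine (the per-member query) prints a "Device UUID" line, as it does for version 1.2 metadata; device_uuid, snapshot and compare_snapshots are hypothetical helper names:

```shell
#!/bin/sh
# device_uuid: read `mdadm --examine` output on stdin and print the
# value of its "Device UUID" line.
device_uuid() {
    awk -F' : ' '/Device UUID/ { gsub(/[ \t]/, "", $2); print $2 }'
}

# snapshot DEV...: print "device uuid" for each member device.
# Needs root and the real disks; only defined here, never called.
snapshot() {
    for dev in "$@"; do
        printf '%s %s\n' "$dev" "$(mdadm --examine "$dev" | device_uuid)"
    done
}

# compare_snapshots BEFORE AFTER: both files hold "device uuid" lines.
# Reports UUIDs that vanished or that now sit under another node.
compare_snapshots() {
    awk 'FILENAME == ARGV[1] { after[$2] = $1; next }
         !($2 in after)      { print "missing", $1, $2; next }
         after[$2] != $1     { print "moved", $1, "->", after[$2], $2 }' "$2" "$1"
}
```

A run might look like snapshot /dev/sd[b-i] > before.txt before powering down, snapshot /dev/sd[b-h] > after.txt after the reboot, and then compare_snapshots before.txt after.txt. Exactly one "missing" line and no "moved" lines would support the assumed mapping for the disconnected port.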

(Personally, I would instead dump each drive to a file with dd, using an external USB drive or somesuch for storage, and then work on the image files. And once I had the software RAID up and running from those files, I'd copy the data back onto a hardware RAID set.)
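
For the imaging approach, here is a dry-run sketch that only prints the commands it would run, so they can be reviewed first. The target directory /mnt/backup is a placeholder, image_cmds is a hypothetical helper name, and the target needs roughly 2 TB of free space per drive:

```shell
#!/bin/sh
# image_cmds DEST DEV...: print (do not run) one dd command per drive
# plus the losetup command that would expose the image read-only.
# ASSUMPTION: DEST is a mounted filesystem with enough free space.
image_cmds() {
    dest=$1
    shift
    for dev in "$@"; do
        name=${dev##*/}          # /dev/sdb -> sdb
        printf 'dd if=%s of=%s/%s.img bs=1M status=progress\n' "$dev" "$dest" "$name"
        printf 'losetup --find --show --read-only %s/%s.img\n' "$dest" "$name"
    done
}
```

For example, image_cmds /mnt/backup /dev/sdb /dev/sdc prints four lines. The loop devices that losetup reports can then be given to mdadm in place of the /dev/sdX nodes, which keeps the original disks untouched during experiments.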

computersavvy · 04-14-2021, 11:50 AM

I did not note that you had tried the smartctl command on each of the /dev devices. While I do not know if the 3ware controller will allow direct access via, for example "smartctl -a /dev/sdc", it certainly seems worth a try, since smartctl does give the serial numbers.

HiFish · 04-14-2021, 03:08 PM

Oh, thank you for pointing out this was not clear. In hindsight I did not clearly state in the first post that this controller does not allow this. The full output is:

Code:

smartctl 6.6 2016-05-31 r4324 [i686-linux-4.4.0-78-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

Smartctl open device: /dev/sdb failed: AMCC/3ware controller, please try adding '-d 3ware,N',
you may need to replace /dev/sdb with /dev/twlN, /dev/twaN or /dev/tweN

computersavvy · 04-14-2021, 07:39 PM

That seems to indicate that simply changing the N on those devices should reflect the slot the device is mounted in. 0 would be slot 1, 1 would be slot 2, etc. Have you tried that?
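
Stepping through N can be scripted. A sketch, assuming smartctl -i prints a "Serial Number:" line for each port and that the controller node is /dev/twa0 as in the first post; serial_from_smartctl and list_port_serials are hypothetical helper names, and the upper bound of 11 comes from the 12 ports reported in the kernel log:

```shell
#!/bin/sh
# serial_from_smartctl: read `smartctl -i` output on stdin and print
# the value of its "Serial Number:" line.
serial_from_smartctl() {
    awk -F': *' '/^Serial Number:/ { print $2 }'
}

# list_port_serials [CONTROLLER]: query ports 0..11 behind the 3ware
# controller node (default /dev/twa0) and print "port N: serial".
# Needs root and the real controller; only defined here, never called.
list_port_serials() {
    ctrl=${1:-/dev/twa0}
    port=0
    while [ "$port" -le 11 ]; do
        printf 'port %s: %s\n' "$port" \
            "$(smartctl -i -d "3ware,$port" "$ctrl" | serial_from_smartctl)"
        port=$((port + 1))
    done
}
```

The result should agree with the Serial column of the tw-cli port table. Like that table, it ties serial numbers to controller ports, not to /dev/sdX names.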

While I realize it will not directly relate to /dev/sdb etc., smartctl should at least give you the model and serial # of the device in that slot. The serial # should also be on the physical device, so you can identify the slot that way.
In your first post you referred to "smartctl -t -d 3ware,0 /dev/twa0" as being for the first slot, and you apparently got the info for the rest of the devices as well. The smartctl info may also point to a bad device.

As far as UUIDs go the earlier suggestion to try comparing UUID to the slot it is in may also be of benefit.

One issue of concern to me is that you did not say what raid level you were using. Raid 5 with one failure or raid 6 with 2 failures should be able to come back online in spite of the failure. Raid 0, 1, and 10 all differ in how a failure is handled. Recovery depends on the raid level and the type of failure.

One final thought. Are these slots in the enclosure hot swappable? If they are, then you could pull all the devices except the first then power the system on. AFAIK most modern drives can be plugged in with power already on as long as it is a firm plug in and not allowing an intermittent contact. Plugging the drives in one at a time would enable you to monitor which device name is assigned for each one as it is added. Linux assigns drive names in the order seen, not necessarily by the location. This then could allow you to compare /dev/sdc, etc., to the serial number seen and the slot it is located in for future reference.