Seems I spoke too soon. About 20% into the recovery process, sdb1 failed again, and this time sdb2 in md1 failed as well. It looks like the whole sdb drive is busted.
Code:
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sdb1[2](F) sda1[0]
1464710976 blocks super 1.2 [2/1] [U_]
md1 : active raid1 sda2[0] sdb2[1](F)
24006528 blocks super 1.2 [2/1] [U_]
md2 : active raid1 sdb3[1] sda3[0]
1441268544 blocks super 1.2 [2/2] [UU]
md3 : active raid1 sdc1[0] sdd1[1]
2930133824 blocks super 1.2 [2/2] [UU]
md4 : active raid1 sdf2[1] sde2[0]
2929939264 blocks super 1.2 [2/2] [UU]
unused devices: <none>
I'll have to replace the drive. Now the tricky part is, how do I know which physical hard drive is sdb? Is there a way to tell?
Now you know why RAID array drives should be clearly labeled...
Assuming these are SATA drives, sdb is (most likely) the drive connected to the SATA port with the second lowest number that's in use.
Since it's no longer part of the array, it will be the only inactive drive. If the drives have on-board activity LEDs (few do these days), you should be able to tell by just looking.
You could try spinning the drive down with hdparm -Y. You should be able to hear it power down.
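Another way to match sdb to a physical drive, assuming udev populates /dev/disk/by-id on this box (it does on most current distros), is to look up the model and serial number and compare them with the printed label on each drive. A rough sketch; the function name and the optional directory argument are made up for illustration:

```shell
#!/bin/sh
# Sketch: list the persistent names (usually model + serial) that point at a
# kernel device such as "sdb". The second argument overrides the directory so
# the function can be tried against a scratch directory first.
ids_for_disk() {
    disk=$1                              # e.g. "sdb" (no /dev/ prefix)
    dir=${2:-/dev/disk/by-id}
    for link in "$dir"/*; do
        [ -L "$link" ] || continue       # only symlinks are of interest
        target=$(readlink "$link")       # e.g. "../../sdb"
        if [ "${target##*/}" = "$disk" ]; then
            echo "${link##*/}"           # whole-disk names only, no -partN
        fi
    done
}
```

Running `ids_for_disk sdb` should print names of the form ata-MODEL_SERIAL, and the serial part is what to look for on the drive label.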
Yup, lesson learned indeed.
I'll try the hdparm on Monday. Is there a way to power it back up, as I might need to toggle it a few times to find the right one - there are 6 drives in that box :S
Also, before I power down the drive and replace it, I'll need to remove sdb1, sdb2 and sdb3 from md0, md1 and md2. Do I just do that normally, as in:
root@lia:~# smartctl -a /dev/sdb
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-29-generic] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
Vendor: /0:0:1:0
Product:
User Capacity: 600 332 565 813 390 450 bytes [600 PB]
Logical block size: 774843950 bytes
scsiModePageOffset: response length too short, resp_len=47 offset=50 bd_len=46
>> Terminate command early due to bad response to IEC mode page
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.
Bus error
But, I also get the following on sda:
Code:
root@lia:~# smartctl -a /dev/sda
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-29-generic] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
Vendor: /0:0:0:0
Product:
User Capacity: 600 332 565 813 390 450 bytes [600 PB]
Logical block size: 774843950 bytes
scsiModePageOffset: response length too short, resp_len=47 offset=50 bd_len=46
>> Terminate command early due to bad response to IEC mode page
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.
Bus error
What the heck....? Is sda failing now as well?
cat /proc/mdstat still shows:
Code:
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sdb1[2](F) sda1[0]
1464710976 blocks super 1.2 [2/1] [U_]
md1 : active raid1 sda2[0] sdb2[1](F)
24006528 blocks super 1.2 [2/1] [U_]
md2 : active raid1 sdb3[1] sda3[0]
1441268544 blocks super 1.2 [2/2] [UU]
md3 : active raid1 sdc1[0] sdd1[1]
2930133824 blocks super 1.2 [2/2] [UU]
md4 : active raid1 sdf2[1] sde2[0]
2929939264 blocks super 1.2 [2/2] [UU]
unused devices: <none>
Indicating that only sdb failed, with 2 out of the 3 partitions down so far.
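For anyone who wants to pull the failed members out of /proc/mdstat without eyeballing it, here is a small sketch. It only relies on the (F) marker visible in the listing above; the function name is made up.

```shell
#!/bin/sh
# Sketch: print "array member" for every component flagged (F) in
# mdstat-style text read from standard input.
failed_members() {
    awk '$2 == ":" {
        for (i = 3; i <= NF; i++)
            if ($i ~ /\(F\)$/) {
                m = $i
                sub(/\[.*/, "", m)   # "sdb1[2](F)" -> "sdb1"
                print $1, m
            }
    }'
}
```

Used as `failed_members < /proc/mdstat`; for the listing above it would print "md0 sdb1" and "md1 sdb2".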
The faulty drive may be blocking the controller. An emergency reboot may be in order here.
You also need to check the S.M.A.R.T. status of all remaining drives asap.
(For instance, are you sure the rebuild failure was caused by a write error on /dev/sdb, and not a read error on /dev/sda?)
Normal reboot console command? Or is there another way to do an emergency reboot?
The other drives:
sdc has 0 pending sectors.
sdd has 24 pending sectors, and shows "Error 244 occurred at disk power-on lifetime: 8689 hours (362 days + 1 hours)"
sde has 0 pending sectors, but also shows "Error 51 occurred at disk power-on lifetime: 8009 hours (333 days + 17 hours)"
sdf has 0 pending sectors.
This spells crisis to me :/ Of the 6 drives, 3 seem to be busted, one in each array - and I have no idea what's going on with sda.
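To collect the pending-sector counts in one go rather than drive by drive, something like the following could work. It is a sketch that assumes the usual ten-column attribute table printed by `smartctl -A` for ATA drives; the function names are made up, and drives where smartctl bails out (as sda and sdb do above) will simply show up as "unknown".

```shell
#!/bin/sh
# Sketch: extract the raw Current_Pending_Sector value from `smartctl -A`
# output on standard input. Prints nothing if the attribute is absent.
pending_sectors() {
    awk '$2 == "Current_Pending_Sector" { print $10 }'
}

# Loop over the six drives named in this thread. Needs root and
# smartmontools, so it is a function rather than something run on load.
report_pending() {
    for d in sda sdb sdc sdd sde sdf; do
        n=$(smartctl -A "/dev/$d" 2>/dev/null | pending_sectors)
        echo "$d: ${n:-unknown}"
    done
}
```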
The reboot command (or the 3-fingered salute) should be used, if possible. Only when that fails should one resort to alternate strategies involving the SysRq key or the power button.
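Since the box is remote, the SysRq route can also be taken from a root shell through /proc/sysrq-trigger instead of the keyboard. A sketch of the usual sync / remount read-only / reboot sequence follows; the function name, the optional path argument and the 2-second pause are assumptions for illustration. With a hung controller there is no guarantee the sync actually completes.

```shell
#!/bin/sh
# Sketch: emergency sync, remount read-only, then reboot via magic SysRq.
# The first argument defaults to the real kernel file; pass another path to
# rehearse the sequence harmlessly. Must be run as root on a real system.
sysrq_reboot() {
    trigger=${1:-/proc/sysrq-trigger}
    pause=${2:-2}                  # seconds between steps (assumed value)
    for key in s u b; do           # s = sync, u = remount ro, b = reboot
        echo "$key" > "$trigger"
        sleep "$pause"
    done
}
```

Calling `sysrq_reboot` with no arguments reboots the machine immediately after the last step, without a clean shutdown, so it really is a last resort.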
Have you been checking these arrays regularly? I run
at least weekly. Also, one should always monitor the S.M.A.R.T. status of all drives with smartd.
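The exact command referred to above did not survive in this copy of the thread. One common way to scrub md arrays, which may or may not be what was meant, is to write "check" to each array's sync_action file in sysfs. A sketch, with a made-up function name and an optional root directory argument for trying it out safely:

```shell
#!/bin/sh
# Sketch: request a consistency check on every md array that is currently
# idle. The argument overrides the sysfs location for dry runs.
start_md_checks() {
    root=${1:-/sys/block}
    for f in "$root"/md*/md/sync_action; do
        [ -w "$f" ] || continue            # skip if missing or not writable
        if [ "$(cat "$f")" = "idle" ]; then
            echo check > "$f"
            md=${f%/md/sync_action}
            echo "check requested: ${md##*/}"
        fi
    done
}
```

Progress then shows up in /proc/mdstat, and a weekly cron job calling `start_md_checks` as root would match the schedule mentioned above.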
Do I need to remove any drives before rebooting? The server is offsite, and I'm accessing it remotely at the moment.
EDIT: Just lost remote connection. Server is still up as it's still routing traffic, but I can't access it via SSH anymore.
It would be really great if someone could unplug the drive causing these bus errors, but the problem is we don't know with 100% certainty that /dev/sdb is the culprit (although it's more than likely). Also, the drives aren't labeled.
Does this server have built-in remote access functionality, or do you have to rely on the OS?
Edit: I guess you need the OS, parts of which are probably spewing "oops" messages at the console right now.
See my edit. I'll have to drive in and shutdown -h, then locate sdb, disconnect it, and start her back up. Anything else I need to know before going in? (if the server doesn't come back up I won't have internet access from the premises... talk about a double-crisis)
Make sure to bring a live CD (like, say, System Rescue CD) in case the system fails to boot. You could even set up an emergency NAT router with a CD/DVD like that.
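As a rough idea of what the emergency NAT router would involve once the live CD is booted, here is a sketch using iptables masquerading. The interface names eth0 (uplink) and eth1 (LAN) are placeholders, the function name is made up, and by default the commands are only printed, not executed. Clients would still need addresses and DNS from somewhere.

```shell
#!/bin/sh
# Sketch: minimal NAT router setup. Arguments are the WAN and LAN interface
# names (hypothetical defaults). With RUN unset the commands are only echoed;
# set RUN to an empty string to execute them for real as root.
nat_router() {
    wan=${1:-eth0}
    lan=${2:-eth1}
    run=${RUN-echo}
    $run sysctl -w net.ipv4.ip_forward=1
    $run iptables -t nat -A POSTROUTING -o "$wan" -j MASQUERADE
    $run iptables -A FORWARD -i "$lan" -o "$wan" -j ACCEPT
    $run iptables -A FORWARD -i "$wan" -o "$lan" -m state --state ESTABLISHED,RELATED -j ACCEPT
}
```

`nat_router eth0 eth1` prints the four commands for review; `RUN="" nat_router eth0 eth1` would apply them.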
Will do. If possible, I'll still try to remove sdb1,2,3 from md0,1,2 before shutting down and removing the drive. Right?
Also, any idea why we're seeing errors on 3 drives instead of 1 (refer to post #37)? Normally I'd suspect a RAID controller, but this is software RAID.
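On removing the sdb partitions before pulling the drive, a sketch of how that is usually done with mdadm is below. The array/partition pairs are taken from the /proc/mdstat output earlier in the thread, the function name is made up, and by default the commands are only printed. Note that sdb3 is still an active member of md2, so failing it leaves sda3 as the only copy of that array.

```shell
#!/bin/sh
# Sketch: mark each partition of the failing disk faulty and remove it from
# its array. With RUN unset the mdadm commands are only echoed; set RUN to
# an empty string to execute them for real as root.
drop_disk_from_arrays() {
    run=${RUN-echo}
    for pair in md0:sdb1 md1:sdb2 md2:sdb3; do
        md=${pair%%:*}
        part=${pair##*:}
        $run mdadm "/dev/$md" --fail "/dev/$part" --remove "/dev/$part"
    done
}
```

Run `drop_disk_from_arrays` first to review the three commands, then `RUN="" drop_disk_from_arrays` to apply them.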