LinuxQuestions.org
Old 11-15-2013, 11:04 AM   #31
reano
Member
 
Registered: Nov 2013
Posts: 39

Original Poster
Rep: Reputation: Disabled

Seems I spoke too soon. About 20% into the recovery process sdb1 failed again, and this time sdb2 in md1 also failed. Seems the whole sdb drive is busted.

Code:
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sdb1[2](F) sda1[0]
      1464710976 blocks super 1.2 [2/1] [U_]

md1 : active raid1 sda2[0] sdb2[1](F)
      24006528 blocks super 1.2 [2/1] [U_]

md2 : active raid1 sdb3[1] sda3[0]
      1441268544 blocks super 1.2 [2/2] [UU]

md3 : active raid1 sdc1[0] sdd1[1]
      2930133824 blocks super 1.2 [2/2] [UU]

md4 : active raid1 sdf2[1] sde2[0]
      2929939264 blocks super 1.2 [2/2] [UU]

unused devices: <none>
I'll have to replace the drive. Now the tricky part is, how do I know which physical hard drive is sdb? Is there a way to tell?

Meh.

Last edited by reano; 11-15-2013 at 11:12 AM.
 
Old 11-15-2013, 11:17 AM   #32
Ser Olmy
Senior Member
 
Registered: Jan 2012
Distribution: Slackware
Posts: 2,225

Rep: Reputation: Disabled
Quote:
Originally Posted by reano View Post
I'll have to replace the drive. Now the tricky part is, how do I know which physical hard drive is sdb? Is there a way to tell?
Now you know why RAID array drives should be clearly labeled...

Assuming these are SATA drives, sdb is (most likely) the drive connected to the SATA port with the second lowest number that's in use.

Since it's no longer part of the array, it will be the only inactive drive. If the drives have on-board activity LEDs (few do these days), you should be able to tell by just looking.

You could try spinning the drive down with hdparm -Y. You should be able to hear it power down.
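Another option that avoids spinning drives up and down is to match the serial number, which is usually printed on the drive's label. A minimal sketch, assuming udev has populated /dev/disk/by-id on this system (the function name and the BY_ID_DIR override are made up for illustration, and a drive that is misbehaving this badly may not report sensible data):

```shell
# Print the /dev/disk/by-id link names that point at a given kernel device.
# The link names normally embed the drive model and serial number.
# BY_ID_DIR is a hypothetical override so the function can be tried on a scratch directory.
serial_links_for() {
    dev="$1"                                  # kernel name without /dev/, e.g. sdb
    dir="${BY_ID_DIR:-/dev/disk/by-id}"
    for link in "$dir"/*; do
        [ -L "$link" ] || continue            # skip anything that is not a symlink
        target=$(readlink "$link")            # e.g. ../../sdb
        [ "${target##*/}" = "$dev" ] && basename "$link"
    done
    return 0
}

serial_links_for sdb
```

Partition links (ending in -part1 and so on) are skipped because they resolve to sdb1, sdb2, etc. rather than sdb itself.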
 
Old 11-15-2013, 11:27 AM   #33
reano
Member
 
Registered: Nov 2013
Posts: 39

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by Ser Olmy View Post
Now you know why RAID array drives should be clearly labeled...
Yup, lesson learned indeed.

I'll try the hdparm on Monday. Is there a way to power it back up, as I might need to toggle it a few times to find the right one - there are 6 drives in that box :S

Also, before I power down the drive and replace it, I'll need to remove sdb1, sdb2 and sdb3 from md0, md1 and md2. Do I just do that normally, as in:

Code:
mdadm --manage /dev/md0 --fail /dev/sdb1
mdadm --manage /dev/md0 --remove /dev/sdb1

mdadm --manage /dev/md1 --fail /dev/sdb2
mdadm --manage /dev/md1 --remove /dev/sdb2


mdadm --manage /dev/md2 --fail /dev/sdb3
mdadm --manage /dev/md2 --remove /dev/sdb3
Or is there another way to go about it?

Last edited by reano; 11-15-2013 at 11:30 AM.
 
Old 11-15-2013, 11:33 AM   #34
Ser Olmy
Senior Member
 
Registered: Jan 2012
Distribution: Slackware
Posts: 2,225

Rep: Reputation: Disabled
No, that's how you do it; first "--fail", then "--remove".
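As a sketch, the same sequence can be wrapped so that "--remove" is only attempted after "--fail" succeeded. The array/member pairs are the ones from the /proc/mdstat output earlier in the thread; the function name and the MDADM_RUN switch are made up for illustration. By default it only prints the commands:

```shell
# Dry run by default: prints the mdadm commands instead of running them.
# Set MDADM_RUN=1 (hypothetical switch) to actually execute them as root.
detach_member() {
    array="$1"
    member="$2"
    for action in --fail --remove; do
        if [ "${MDADM_RUN:-0}" = 1 ]; then
            # Stop at the first error so --remove never runs after a failed --fail.
            mdadm --manage "$array" "$action" "$member" || return 1
        else
            echo "mdadm --manage $array $action $member"
        fi
    done
}

detach_member /dev/md0 /dev/sdb1
detach_member /dev/md1 /dev/sdb2
detach_member /dev/md2 /dev/sdb3
```

Note that sdb3 is still listed as an active member of md2 ([UU]), so for that one the "--fail" step is what actually takes it out of service.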

(And any kind of disk access should wake a sleeping drive, like running fdisk or parted, or dd'ing a few blocks to /dev/null.)

Last edited by Ser Olmy; 11-15-2013 at 11:35 AM.
 
Old 11-15-2013, 11:38 AM   #35
reano
Member
 
Registered: Nov 2013
Posts: 39

Original Poster
Rep: Reputation: Disabled
Now something very concerning started happening.
I wanted to install a package using apt-get. I got the following error:

Code:
root@lia:~# apt-get install gdisk
-bash: /usr/bin/apt-get: Input/output error
So then I did:

Code:
root@lia:~# smartctl -a /dev/sdb

smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-29-generic] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

Vendor:               /0:0:1:0
Product:
User Capacity:        600*332*565*813*390*450 bytes [600 PB]
Logical block size:   774843950 bytes
scsiModePageOffset: response length too short, resp_len=47 offset=50 bd_len=46
>> Terminate command early due to bad response to IEC mode page
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.
Bus error
But, I also get the following on sda:

Code:
root@lia:~# smartctl -a /dev/sda

smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-29-generic] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

Vendor:               /0:0:0:0
Product:
User Capacity:        600*332*565*813*390*450 bytes [600 PB]
Logical block size:   774843950 bytes
scsiModePageOffset: response length too short, resp_len=47 offset=50 bd_len=46
>> Terminate command early due to bad response to IEC mode page
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.
Bus error
What the heck....? Is sda failing now as well?

cat /proc/mdstat still shows:

Code:
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sdb1[2](F) sda1[0]
      1464710976 blocks super 1.2 [2/1] [U_]

md1 : active raid1 sda2[0] sdb2[1](F)
      24006528 blocks super 1.2 [2/1] [U_]

md2 : active raid1 sdb3[1] sda3[0]
      1441268544 blocks super 1.2 [2/2] [UU]

md3 : active raid1 sdc1[0] sdd1[1]
      2930133824 blocks super 1.2 [2/2] [UU]

md4 : active raid1 sdf2[1] sde2[0]
      2929939264 blocks super 1.2 [2/2] [UU]

unused devices: <none>
Indicating that only sdb failed, with 2 out of the 3 partitions down so far.
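For what it's worth, that reading of /proc/mdstat can be automated. A rough sketch, assuming the layout shown above (an "mdN :" line followed by a status line ending in [UU] or [U_]); the function name is made up:

```shell
# List degraded md arrays and any members flagged (F).
# Reads /proc/mdstat by default, or a file given as the first argument.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ {
            name = $1
            failed = ""
            # Member fields look like sdb1[2](F); keep the ones marked failed.
            for (i = 3; i <= NF; i++)
                if ($i ~ /\(F\)/) { m = $i; sub(/\[.*/, "", m); failed = failed " " m }
        }
        /\[[U_]+\]/ && name != "" {
            # An underscore in the [U_] map means a missing or failed member.
            if ($NF ~ /_/) print name ":" failed
            name = ""
        }
    ' "${1:-/proc/mdstat}"
}
```

Run as "degraded_arrays" on the server, it should report md0 and md1 with sdb1 and sdb2 for the output above, and stay silent about md2, md3 and md4.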

Last edited by reano; 11-15-2013 at 11:40 AM.
 
Old 11-15-2013, 11:41 AM   #36
Ser Olmy
Senior Member
 
Registered: Jan 2012
Distribution: Slackware
Posts: 2,225

Rep: Reputation: Disabled
The faulty drive may be blocking the controller. An emergency reboot may be in order here.

You also need to check the S.M.A.R.T. status of all remaining drives asap.

(For instance, are you sure the rebuild failure was caused by a write error on /dev/sdb, and not a read error on /dev/sda?)

Last edited by Ser Olmy; 11-15-2013 at 11:42 AM.
 
Old 11-15-2013, 11:48 AM   #37
reano
Member
 
Registered: Nov 2013
Posts: 39

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by Ser Olmy View Post
The faulty drive may be blocking the controller. An emergency reboot may be in order here.

You also need to check the S.M.A.R.T. status of all remaining drives asap.

(For instance, are you sure the rebuild failure was caused by a write error on /dev/sdb, and not a read error on /dev/sda?)
Normal reboot console command? Or is there another way to do an emergency reboot?

The other drives:

sdc has 0 pending sectors.
sdd has 24 pending sectors, and shows "Error 244 occurred at disk power-on lifetime: 8689 hours (362 days + 1 hours)"
sde has 0 pending sectors, but also shows "Error 51 occurred at disk power-on lifetime: 8009 hours (333 days + 17 hours)"
sdf has 0 pending sectors.

This spells crisis to me :/ Of the 6 drives, 3 seem to be busted, one on each array - and I have no idea what's going on with sda.
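In case it helps anyone repeating this, the pending-sector counts can be pulled out of "smartctl -A" in one pass. A sketch, assuming the usual ATA attribute table where attribute 197 is the pending-sector count and the raw value is the tenth column (the function names are made up):

```shell
# Extract the raw value of SMART attribute 197 (pending sectors)
# from "smartctl -A" output on stdin.
pending_sectors() {
    awk '$1 == 197 { print $10 }'
}

# Report the count for each device given; needs root and smartmontools.
report_pending() {
    for dev in "$@"; do
        count=$(smartctl -A "$dev" | pending_sectors)
        echo "$dev: ${count:-unknown}"
    done
}
```

Usage would be something like "report_pending /dev/sdc /dev/sdd /dev/sde /dev/sdf". A drive that answers with garbage, as sda and sdb do above, would show up as "unknown".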

Last edited by reano; 11-15-2013 at 11:52 AM.
 
Old 11-15-2013, 11:55 AM   #38
Ser Olmy
Senior Member
 
Registered: Jan 2012
Distribution: Slackware
Posts: 2,225

Rep: Reputation: Disabled
The reboot command (or the 3-fingered salute) should be used, if possible. Only when that fails should one resort to alternate strategies involving the SysRq key or the power button.
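Over a remote shell, the SysRq fallback can be driven through /proc/sysrq-trigger instead of the keyboard. The sketch below only prints the steps (sync, remount read-only, reboot); the function name is made up, and the final step reboots immediately without a clean shutdown, so treat it strictly as a last resort:

```shell
# Print the SysRq sequence for a last-resort reboot; nothing is executed here.
#   s = emergency sync, u = remount filesystems read-only, b = reboot at once
emergency_reboot_steps() {
    trigger=/proc/sysrq-trigger
    for key in s u b; do
        echo "echo $key > $trigger"
    done
}

emergency_reboot_steps
```

The printed lines would be run one at a time as root, ideally with a second or two between them so the sync and remount have a chance to finish.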

Have you been checking these arrays regularly? I run
Code:
echo check > /sys/devices/virtual/block/<md device>/md/sync_action
at least once a week. Also, one should always monitor the S.M.A.R.T. status of all drives with smartd.
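To cover every array in one go, something along these lines could be run from cron. It is a sketch, assuming the arrays appear as /sys/block/md*/md/sync_action on this kernel; the function name and the SYS_BLOCK override are made up for illustration:

```shell
# Start a consistency check on every md array that is currently idle.
# SYS_BLOCK is a hypothetical override so the function can be tried on a scratch tree.
start_checks() {
    root="${SYS_BLOCK:-/sys/block}"
    for action in "$root"/md*/md/sync_action; do
        [ -w "$action" ] || continue          # no md arrays, or not running as root
        # Leave arrays alone while they are already resyncing or recovering.
        if [ "$(cat "$action")" = idle ]; then
            echo check > "$action"
            echo "started check on $(basename "${action%/md/sync_action}")"
        fi
    done
    return 0
}
```

Skipping non-idle arrays is a design choice here: an array that is in the middle of a rebuild is better left to finish first.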
 
Old 11-15-2013, 11:56 AM   #39
reano
Member
 
Registered: Nov 2013
Posts: 39

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by Ser Olmy View Post
The reboot command (or the 3-fingered salute) should be used, if possible. Only when that fails should one resort to alternate strategies involving the SysRq key or the power button.

Have you been checking these arrays regularly? I run
Code:
echo check > /sys/devices/virtual/block/<md device>/md/sync_action
at least weekly. Also, one should always monitor the S.M.A.R.T. status of all drives with smartd.
Do I need to remove any drives before rebooting? The server is offsite, and I'm accessing it remotely at the moment.
EDIT: Just lost remote connection. Server is still up as it's still routing traffic, but I can't access it via SSH anymore.

Last edited by reano; 11-15-2013 at 12:01 PM.
 
Old 11-15-2013, 12:02 PM   #40
Ser Olmy
Senior Member
 
Registered: Jan 2012
Distribution: Slackware
Posts: 2,225

Rep: Reputation: Disabled
It would have been really great if someone could unplug the drive causing these bus errors, but the problem is we don't know with 100 % certainty that /dev/sdb is the culprit (although it's more than likely). Also, the drives aren't labeled.

Does this server have built-in remote access functionality, or do you have to rely on the OS?

Edit: I guess you need the OS, parts of which are probably spewing "oops" messages at the console right now.

Last edited by Ser Olmy; 11-15-2013 at 12:03 PM.
 
Old 11-15-2013, 12:04 PM   #41
reano
Member
 
Registered: Nov 2013
Posts: 39

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by Ser Olmy View Post
It would have been really great if someone could unplug the drive causing these bus errors, but the problem is we don't know with 100 % certainty that /dev/sdb is the culprit (although it's more than likely). Also, the drives aren't labeled.

Does this server have built-in remote access functionality, or do you have to rely on the OS?

Edit: I guess you need the OS, parts of which are probably spewing "oops" messages at the console right now.
See my edit. I'll have to drive in and shutdown -h, then locate sdb, disconnect it, and start her back up. Anything else I need to know before going in? (if the server doesn't come back up I won't have internet access from the premises... talk about a double-crisis)
 
Old 11-15-2013, 12:07 PM   #42
Ser Olmy
Senior Member
 
Registered: Jan 2012
Distribution: Slackware
Posts: 2,225

Rep: Reputation: Disabled
Make sure to bring a live CD (like, say, System Rescue CD) in case the system fails to boot. You could even set up an emergency NAT router with a CD/DVD like that.
 
Old 11-15-2013, 12:09 PM   #43
reano
Member
 
Registered: Nov 2013
Posts: 39

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by Ser Olmy View Post
Make sure to bring a live CD (like, say, System Rescue CD) in case the system fails to boot. You could even set up an emergency NAT router with a CD/DVD like that.
Will do. If possible, I'll still try to remove sdb1,2,3 from md0,1,2 before shutting down and removing the drive. Right?
 
Old 11-15-2013, 12:10 PM   #44
Ser Olmy
Senior Member
 
Registered: Jan 2012
Distribution: Slackware
Posts: 2,225

Rep: Reputation: Disabled
Sounds like a plan.
 
Old 11-15-2013, 12:14 PM   #45
reano
Member
 
Registered: Nov 2013
Posts: 39

Original Poster
Rep: Reputation: Disabled
Also, any idea why we're seeing errors on 3 drives instead of 1 (refer to post #37)? Normally I'd suspect a RAID controller, but this is software RAID.

Last edited by reano; 11-15-2013 at 12:15 PM.
 
  


All times are GMT -5.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
identi.ca: @linuxquestions
Facebook: linuxquestions Google+: linuxquestions
Open Source Consulting | Domain Registration