LinuxQuestions.org
Old 07-06-2011, 04:54 AM   #1
ananthkadalur
Member
 
Registered: Mar 2011
Posts: 38

Rep: Reputation: 0
md device fail in raid


Hi everybody, I am working as a Linux admin. I have configured RAID1 with the partitions below.
1. sda & sdb
HD partitions    RAID device    Size      Mount point
sda1 + sdb1      /dev/md0       500MB     /boot
sda2 + sdb2      /dev/md1       50GB      /
sda3 + sdb3      /dev/md2       8GB       swap
sda5 + sdb5      /dev/md3       500GB     /backup
sda6 + sdb6      /dev/md4       300GB     /data1
Around 130GB of free space remains on each disk (sda and sdb).

2. sdc & sdd
sdc1 + sdd1      /dev/md5       1000GB    /repo

I am using CentOS 5.5, and md0, md1 and md2 were configured during the installation itself. I tested booting from each disk (sda or sdb) on its own and it boots fine, so I am confident the system will still be able to boot from the remaining disk if either sda or sdb fails. Everything works, but whenever I copy data to an md device and then run cat /proc/mdstat, the RAID shows one of the partitions as failed. It is always a partition on sdb, such as sdb5 or sdb6, whichever one I am copying to, and it happens only when I copy large amounts of data (10GB-50GB). Each time I rebuild the array, and after the rebuild it works fine again. I also swapped sdd with sdb and sdb with sdd; after re-adding those disks it worked fine, but the problem still continues, i.e. if I try to copy, the partition on sdb fails again. This does not happen with md5 (sdc1 and sdd1). I don't know what the problem could be, but it is very urgent and I hope someone can help me with it.
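The rebuild I do each time is basically just removing the failed partition from the array and adding it back; the commands below are only an example (sdb5 dropping out of md3), the exact partition varies each time:
Code:
# example: re-add sdb5 to md3 after it has been marked faulty
mdadm /dev/md3 --remove /dev/sdb5
mdadm /dev/md3 --add /dev/sdb5
cat /proc/mdstat     # watch the recovery progress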

Regards
Ananth
 
Old 07-07-2011, 05:17 PM   #2
macemoneta
Senior Member
 
Registered: Jan 2005
Location: Manalapan, NJ
Distribution: Fedora x86 and x86_64, Debian PPC and ARM, Android
Posts: 4,593
Blog Entries: 2

Rep: Reputation: 344
Check /var/log/messages to see what the problem is. It sounds like you are having some write failures. Make sure you have TLER (time limited error recovery) enabled on the drives:
Code:
smartctl -l scterc /dev/sda
smartctl -l scterc /dev/sdb
If it's disabled, set it to 7 seconds:
Code:
smartctl -l scterc,70,70 /dev/sda
smartctl -l scterc,70,70 /dev/sdb
Some more information on TLER here.

You may also want to force a check on the arrays (and schedule it weekly), to ensure they are properly synced:
Code:
echo check > /sys/block/md0/md/sync_action
echo check > /sys/block/md1/md/sync_action
echo check > /sys/block/md2/md/sync_action
echo check > /sys/block/md3/md/sync_action
echo check > /sys/block/md4/md/sync_action
There's more information at the RAID Wiki.
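If you do schedule the check weekly, a cron entry along these lines should work (the file name and the Sunday 01:00 time are only examples):
Code:
# /etc/cron.d/raid-check -- example: check every md array once a week
0 1 * * 0 root for f in /sys/block/md*/md/sync_action; do echo check > "$f"; done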

Last edited by macemoneta; 07-07-2011 at 05:18 PM.
 
1 member found this post helpful.
Old 07-08-2011, 05:38 AM   #3
ananthkadalur
Member
 
Registered: Mar 2011
Posts: 38

Original Poster
Rep: Reputation: 0
md device fail in raid

Hi Sir, thank you very much for your quick response. As you suggested, I checked with "smartctl -l scterc"; please see the output below:
[root@testbhim CentOS]# /usr/sbin/smartctl -l scterc /dev/sdd
smartctl version 5.38 [i686-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=======> INVALID ARGUMENT TO -l: scterc
=======> VALID ARGUMENTS ARE: error, selftest, selective, directory, background, scttemp[sts|hist] <=======

Use smartctl -h to get a usage summary
The output above is the same for all of sda, sdb, sdc and sdd.

Is there any problem if I leave free space on sda and sdb? There is still around 130GB of free space on each of those disks, and that free space is not in any RAID array.

Please see the output of fdisk:
[root@testbhim CentOS]# /sbin/fdisk -l /dev/sdd

Disk /dev/sdd: 1000.2 GB, 1000203804160 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Device Boot Start End Blocks Id System
/dev/sdd1 1 121601 976760001 fd Linux raid autodetect

[root@testbhim CentOS]# /sbin/fdisk -l /dev/sda

Disk /dev/sda: 1000.2 GB, 1000203804160 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Device Boot Start End Blocks Id System
/dev/sda1 * 1 64 514048+ fd Linux raid autodetect
/dev/sda2 65 6438 51199155 fd Linux raid autodetect
/dev/sda3 6439 7458 8193150 fd Linux raid autodetect
/dev/sda4 7459 121601 916853647+ 5 Extended
/dev/sda5 7459 68247 488287611 fd Linux raid autodetect
/dev/sda6 68248 104721 292977373+ fd Linux raid autodetect

As you suggested, I ran "echo check > /sys/block/md5/md/sync_action" and it started resyncing md5. If I schedule this weekly for all md devices at night, will there be any problem while it is resyncing? Our daily backup script also runs at night. Or, if there is no problem, can I schedule it during the day instead?
Can we copy large amounts of data while the RAID array is rebuilding or resyncing?
Please see the df output:
[root@testbhim CentOS]# df -hT
Filesystem Type Size Used Avail Use% Mounted on
/dev/md1 ext3 48G 7.1G 38G 16% /
/dev/md0 ext3 487M 24M 438M 5% /boot
tmpfs tmpfs 945M 0 945M 0% /dev/shm
/dev/md3 ext3 459G 291G 145G 67% /backup
/dev/md5 ext3 917G 23G 848G 3% /repo
md4 was mounted under md5; it has been unmounted now because I added a new sdb and the rebuild is currently in progress. When I ran smartctl -H /dev/sdb the status was FAILED, while sda, sdc and sdd all PASSED, so I replaced sdb with a new disk.

Please have a look at the mdstat output:
[root@testbhim CentOS]# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
513984 blocks [2/2] [UU]

md2 : active raid1 sdb3[2] sda3[0]
8193024 blocks [2/1] [U_]
resync=DELAYED

md3 : active raid1 sdb5[2] sda5[0]
488287488 blocks [2/1] [U_]
[===================>.] recovery = 99.9% (488132736/488287488) finish=0.0min speed=60982K/sec

md4 : active raid1 sdb6[2] sda6[0]
292977280 blocks [2/1] [U_]
resync=DELAYED

md5 : active raid1 sdd1[1] sdc1[0]
976759936 blocks [2/2] [UU]
[================>....] resync = 80.6% (787464128/976759936) finish=36.4min speed=86508K/sec

md1 : active raid1 sdb2[1] sda2[0]
51199040 blocks [2/2] [UU]

Please have a look at the fstab:
/dev/md1 / ext3 defaults 1 1
/dev/md0 /boot ext3 defaults 1 2
tmpfs /dev/shm tmpfs defaults 0 0
devpts /dev/pts devpts gid=5,mode=620 0 0
sysfs /sys sysfs defaults 0 0
proc /proc proc defaults 0 0
/dev/md2 swap swap defaults 0 0
/dev/md3 /backup ext3 defaults 0 0
#/dev/md4 /repo/base ext3 defaults 0 0
/dev/md5 /repo ext3 defaults 0 0

Please see the mdadm.conf:
ARRAY /dev/md1 level=raid1 num-devices=2 metadata=0.90 UUID=57786f03:0a32e8bb:b9fab770:aa72d2d0
ARRAY /dev/md5 level=raid1 num-devices=2 metadata=0.90 UUID=82268afc:1cb20e19:1afcb25e:2cdc61d8
ARRAY /dev/md4 level=raid1 num-devices=2 metadata=0.90 UUID=582b1f12:df2a3d2c:877e2383:d74df3bc
ARRAY /dev/md3 level=raid1 num-devices=2 metadata=0.90 UUID=32a7b35b:3de5bc15:90e539c8:1d9a30ed
ARRAY /dev/md2 level=raid1 num-devices=2 metadata=0.90 UUID=4ebc42df:3289c3a4:9a3bf106:d1005657
ARRAY /dev/md0 level=raid1 num-devices=2 metadata=0.90 UUID=cbfaf74f:c313a55f:3bda6e4d:f8e91bad

After rebuilding sdb from sda, should we overwrite or append the scan output to /etc/mdadm.conf by issuing the following commands, or is it okay even if I do not add anything to /etc/mdadm.conf?
#mdadm --detail --scan > /etc/mdadm.conf
#mdadm --examine --scan >> /etc/mdadm.conf
etc....?


Please see the /var/log/messages output:
[root@testbhim CentOS]# tail -n 100 /var/log/messages
Jul 8 12:48:30 testbhim kernel: [<c0435f3b>] kthread+0xc0/0xed
Jul 8 12:48:30 testbhim kernel: [<c0435e7b>] kthread+0x0/0xed
Jul 8 12:48:30 testbhim kernel: [<c0405c53>] kernel_thread_helper+0x7/0x10
Jul 8 12:48:30 testbhim kernel: =======================
Jul 8 12:48:30 testbhim kernel: INFO: task md3_resync:4369 blocked for more than 120 seconds.
Jul 8 12:48:30 testbhim kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 8 12:48:30 testbhim kernel: md3_resync D 000000D0 3440 4369 19 4373 4362 (L-TLB)
Jul 8 12:48:30 testbhim kernel: f2d5bec0 00000046 2c44f342 000000d0 00000064 00000000 00000000 0000000a
Jul 8 12:48:30 testbhim kernel: f7041aa0 2c450af5 000000d0 000017b3 00000001 f7041bac c1f00944 f79bd900
Jul 8 12:48:30 testbhim kernel: 00000000 c1f012e4 f7b7668c f79f55c8 f2d5bf80 f7d20000 c0425e9b c0667726
Jul 8 12:48:30 testbhim kernel: Call Trace:
Jul 8 12:48:30 testbhim kernel: [<c0425e9b>] printk+0x18/0x8e
Jul 8 12:48:30 testbhim kernel: [<c05ab8cf>] md_do_sync+0x1fe/0x966
Jul 8 12:48:30 testbhim kernel: [<c041ee80>] enqueue_task+0x29/0x39
Jul 8 12:48:30 testbhim kernel: [<c041eeda>] __activate_task+0x4a/0x59
Jul 8 12:48:30 testbhim kernel: [<c041f79d>] try_to_wake_up+0x3e8/0x3f2
Jul 8 12:48:30 testbhim kernel: [<c061c770>] schedule+0x9cc/0xa55
Jul 8 12:48:30 testbhim kernel: [<c0435fff>] autoremove_wake_function+0x0/0x2d
Jul 8 12:48:30 testbhim kernel: [<c05ac321>] md_thread+0xdf/0xf5
Jul 8 12:48:30 testbhim kernel: [<c041eb45>] complete+0x2b/0x3d
Jul 8 12:48:30 testbhim kernel: [<c05ac242>] md_thread+0x0/0xf5
Jul 8 12:48:30 testbhim kernel: [<c0435f3b>] kthread+0xc0/0xed
Jul 8 12:48:30 testbhim kernel: [<c0435e7b>] kthread+0x0/0xed
Jul 8 12:48:30 testbhim kernel: [<c0405c53>] kernel_thread_helper+0x7/0x10
Jul 8 12:48:30 testbhim kernel: =======================
Jul 8 12:48:30 testbhim kernel: INFO: task md4_resync:4373 blocked for more than 120 seconds.
Jul 8 12:48:30 testbhim kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 8 12:48:30 testbhim kernel: md4_resync D 000000D0 3492 4373 19 4369 (L-TLB)
Jul 8 12:48:30 testbhim kernel: f2cc1ec0 00000046 2c44c9de 000000d0 00000064 00000096 f2f50f80 00000009
Jul 8 12:48:30 testbhim kernel: f70bdaa0 2c44f342 000000d0 00002964 00000001 f70bdbac c1f00944 f79bd900
Jul 8 12:48:30 testbhim kernel: 00000003 c06b5b98 f7f7d0cc f7d201c8 f2cc1f80 f79f5600 c0425e9b c0667726
Jul 8 12:48:30 testbhim kernel: Call Trace:
Jul 8 12:48:30 testbhim kernel: [<c0425e9b>] printk+0x18/0x8e
Jul 8 12:48:30 testbhim kernel: [<c05ab8cf>] md_do_sync+0x1fe/0x966
Jul 8 12:48:30 testbhim kernel: [<c041ee80>] enqueue_task+0x29/0x39
Jul 8 12:48:30 testbhim kernel: [<c041eeda>] __activate_task+0x4a/0x59
Jul 8 12:48:30 testbhim kernel: [<c041f79d>] try_to_wake_up+0x3e8/0x3f2
Jul 8 12:48:30 testbhim kernel: [<c061c770>] schedule+0x9cc/0xa55
Jul 8 12:48:30 testbhim kernel: [<c0435fff>] autoremove_wake_function+0x0/0x2d
Jul 8 12:48:30 testbhim kernel: [<c05ac321>] md_thread+0xdf/0xf5
Jul 8 12:48:30 testbhim kernel: [<c041eb45>] complete+0x2b/0x3d
Jul 8 12:48:30 testbhim kernel: [<c05ac242>] md_thread+0x0/0xf5
Jul 8 12:48:30 testbhim kernel: [<c0435f3b>] kthread+0xc0/0xed
Jul 8 12:48:30 testbhim kernel: [<c0435e7b>] kthread+0x0/0xed
Jul 8 12:48:30 testbhim kernel: [<c0405c53>] kernel_thread_helper+0x7/0x10
Jul 8 12:48:30 testbhim kernel: =======================
Jul 8 12:50:30 testbhim kernel: INFO: task md2_resync:4362 blocked for more than 120 seconds.
Jul 8 12:50:30 testbhim kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 8 12:50:30 testbhim kernel: md2_resync D 000000D0 3440 4362 19 4369 4359 (L-TLB)
Jul 8 12:50:30 testbhim kernel: f2f50ec0 00000046 2c450af5 000000d0 00000064 00000000 00000000 0000000a
Jul 8 12:50:30 testbhim kernel: f773a550 2c451947 000000d0 00000e52 00000001 f773a65c c1f00944 f79bd900
Jul 8 12:50:30 testbhim kernel: 00000003 c1f012e4 f7b7658c f7d201c8 f2f50f80 f78df200 c0425e9b ffffffff
Jul 8 12:50:30 testbhim kernel: Call Trace:
Jul 8 12:50:30 testbhim kernel: [<c0425e9b>] printk+0x18/0x8e
Jul 8 12:50:30 testbhim kernel: [<c05ab8cf>] md_do_sync+0x1fe/0x966
Jul 8 12:50:30 testbhim kernel: [<c041ee80>] enqueue_task+0x29/0x39
Jul 8 12:50:30 testbhim kernel: [<c041eeda>] __activate_task+0x4a/0x59
Jul 8 12:50:30 testbhim kernel: [<c041f79d>] try_to_wake_up+0x3e8/0x3f2
Jul 8 12:50:30 testbhim kernel: [<c061c770>] schedule+0x9cc/0xa55
Jul 8 12:50:30 testbhim kernel: [<c0435fff>] autoremove_wake_function+0x0/0x2d
Jul 8 12:50:30 testbhim kernel: [<c05ac321>] md_thread+0xdf/0xf5
Jul 8 12:50:30 testbhim kernel: [<c041eb45>] complete+0x2b/0x3d
Jul 8 12:50:30 testbhim kernel: [<c05ac242>] md_thread+0x0/0xf5
Jul 8 12:50:30 testbhim kernel: [<c0435f3b>] kthread+0xc0/0xed
Jul 8 12:50:30 testbhim kernel: [<c0435e7b>] kthread+0x0/0xed
Jul 8 12:50:30 testbhim kernel: [<c0405c53>] kernel_thread_helper+0x7/0x10
Jul 8 12:50:30 testbhim kernel: =======================
Jul 8 12:51:25 testbhim kernel: md: md1: sync done.
Jul 8 12:51:26 testbhim kernel: md: delaying resync of md2 until md3 has finished resync (they share one or more physical units)
Jul 8 12:51:26 testbhim kernel: md: syncing RAID array md3
Jul 8 12:51:26 testbhim kernel: md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
Jul 8 12:51:26 testbhim kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reconstruction.
Jul 8 12:51:26 testbhim kernel: md: using 128k window, over a total of 488287488 blocks.
Jul 8 12:51:26 testbhim kernel: RAID1 conf printout:
Jul 8 12:51:26 testbhim kernel: --- wd:2 rd:2
Jul 8 12:51:26 testbhim kernel: disk 0, wo:0, o:1, dev:sda2
Jul 8 12:51:26 testbhim kernel: disk 1, wo:0, o:1, dev:sdb2
Jul 8 12:51:26 testbhim kernel: md: delaying resync of md4 until md3 has finished resync (they share one or more physical units)
Jul 8 12:52:51 testbhim kernel: md: syncing RAID array md5
Jul 8 12:52:51 testbhim kernel: md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
Jul 8 12:52:51 testbhim kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reconstruction.
Jul 8 12:52:51 testbhim kernel: md: using 128k window, over a total of 976759936 blocks.
Jul 8 13:18:28 testbhim scim-bridge: The lockfile is destroied
Jul 8 13:18:28 testbhim scim-bridge: Cleanup, done. Exitting...
Jul 8 14:49:51 testbhim avahi-daemon[3272]: Invalid legacy unicast query packet.
Jul 8 14:49:51 testbhim avahi-daemon[3272]: Received response from host 192.168.0.95 with invalid source port 2673 on interface 'eth0.0'
Jul 8 14:49:51 testbhim avahi-daemon[3272]: Invalid legacy unicast query packet.
Jul 8 14:49:51 testbhim avahi-daemon[3272]: Invalid legacy unicast query packet.
Jul 8 14:49:51 testbhim avahi-daemon[3272]: Received response from host 192.168.0.95 with invalid source port 2673 on interface 'eth0.0'
Jul 8 14:49:55 testbhim last message repeated 4 times
Jul 8 14:58:55 testbhim kernel: md: md3: sync done.
Jul 8 14:58:55 testbhim kernel: md: syncing RAID array md4
Jul 8 14:58:55 testbhim kernel: md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
Jul 8 14:58:55 testbhim kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reconstruction.
Jul 8 14:58:55 testbhim kernel: md: using 128k window, over a total of 292977280 blocks.
Jul 8 14:58:55 testbhim kernel: RAID1 conf printout:
Jul 8 14:58:55 testbhim kernel: --- wd:2 rd:2
Jul 8 14:58:55 testbhim kernel: disk 0, wo:0, o:1, dev:sda5
Jul 8 14:58:55 testbhim kernel: disk 1, wo:0, o:1, dev:sdb5
Jul 8 14:58:55 testbhim kernel: md: delaying resync of md2 until md4 has finished resync (they share one or more physical units)
[root@testbhim CentOS]#





Last edited by ananthkadalur; 07-08-2011 at 06:07 AM. Reason: A small doubt
 
Old 07-08-2011, 09:23 AM   #4
macemoneta
Senior Member
 
Registered: Jan 2005
Location: Manalapan, NJ
Distribution: Fedora x86 and x86_64, Debian PPC and ARM, Android
Posts: 4,593
Blog Entries: 2

Rep: Reputation: 344
Quote:
Originally Posted by ananthkadalur View Post
[root@testbhim CentOS]# /usr/sbin/smartctl -l scterc /dev/sdd
smartctl version 5.38 [i686-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=======> INVALID ARGUMENT TO -l: scterc
=======> VALID ARGUMENTS ARE: error, selftest, selective, directory, background, scttemp[sts|hist] <=======

Use smartctl -h to get a usage summary
The output above is the same for all of sda, sdb, sdc and sdd.
That version of smartctl (5.38) doesn't recognize the scterc option, so it cannot query or set TLER; you would need a newer smartmontools release for that. Otherwise, contact your drive manufacturer to see if they have a proprietary utility to enable TLER. If the drives don't support TLER at all, they may not be suitable for use with RAID.

Quote:
Originally Posted by ananthkadalur View Post
Is there any problem if I leave free space on sda and sdb? There is still around 130GB of free space on each of those disks, and that free space is not in any RAID array.
That's not a problem.

Quote:
Originally Posted by ananthkadalur View Post
As you suggested, I ran "echo check > /sys/block/md5/md/sync_action" and it started resyncing md5. If I schedule this weekly for all md devices at night, will there be any problem while it is resyncing? Our daily backup script also runs at night. Or, if there is no problem, can I schedule it during the day instead?
Can we copy large amounts of data while the RAID array is rebuilding or resyncing?
You can run the check any time. It will yield to other I/O.
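If you're concerned about the check overlapping with the nightly backup, you can also look at (or temporarily lower) the md resync speed limits; the 50000 KB/sec figure below is only an example:
Code:
cat /proc/sys/dev/raid/speed_limit_min    # guaranteed floor, KB/sec per device
cat /proc/sys/dev/raid/speed_limit_max    # ceiling, KB/sec per device
echo 50000 > /proc/sys/dev/raid/speed_limit_max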

Quote:
Originally Posted by ananthkadalur View Post
When I ran smartctl -H /dev/sdb the status was FAILED, while sda, sdc and sdd all PASSED, so I replaced sdb with a new disk.
Do you still have a problem?

Quote:
Originally Posted by ananthkadalur View Post
After rebuilding sdb from sda, should we overwrite or append the scan output to /etc/mdadm.conf by issuing the following commands, or is it okay even if I do not add anything to /etc/mdadm.conf?
You need to update mdadm.conf any time you replace a drive.
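One common way to do that is to regenerate the ARRAY lines from the running arrays using the --detail --scan form you listed, keeping a backup of the old file first, for example:
Code:
cp /etc/mdadm.conf /etc/mdadm.conf.bak
mdadm --detail --scan > /etc/mdadm.conf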

Quote:
Originally Posted by ananthkadalur View Post
Please see the /var/log/messages output:
[root@testbhim CentOS]# tail -n 100 /var/log/messages
Jul 8 12:48:30 testbhim kernel: INFO: task md3_resync:4369 blocked for more than 120 seconds.
This could be the result of the failing drive or lack of TLER.
 
  

