LinuxQuestions.org
Old 10-20-2022, 02:35 AM   #1
anctop
Member
 
Registered: Feb 2010
Posts: 101

Rep: Reputation: Disabled
RAID-1 rebuild failure


Hi,

Our system has two 1TB HDDs (/dev/sda, /dev/sdb), on which a 150GB RAID-1 array (/dev/md3) mirrors the partitions "/dev/sda3" and "/dev/sdb3".

We recently bought two 4TB HDDs as replacements.

The new disks were tested thoroughly with the vendor's utility before deployment.
They support SCT ERC, but the "Read" and "Write" values must be set to "70" again after every reboot.
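
Because the setting does not survive a reboot, it has to be reapplied at boot. A minimal sketch, assuming smartmontools is installed and that the "Read" and "Write" values are the ones passed to smartctl's "-l scterc" option (units of 100 ms, so 70 means 7.0 seconds). The helper name set_erc and the DRY_RUN variable are hypothetical, not something from this thread:

```shell
#!/bin/sh
# Reapply SCT ERC timeouts after boot (hypothetical helper).
# DRY_RUN=1 (the default) only prints the command; set DRY_RUN=0 to run it as root.
DRY_RUN="${DRY_RUN:-1}"

set_erc() {
    dev="$1"
    deciseconds="${2:-70}"   # 70 x 100 ms = 7.0 s, used for both read and write
    cmd="smartctl -l scterc,${deciseconds},${deciseconds} ${dev}"
    if [ "$DRY_RUN" = "1" ]; then
        echo "$cmd"
    else
        $cmd
    fi
}

# Example, e.g. from a boot script:
#   set_erc /dev/sda; set_erc /dev/sdb
```

Running it with the default DRY_RUN=1 first shows exactly what would be executed.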

On each new disk, a 500GB partition has been created so that the array can be extended.

The first step of replacement involves:
  1. "mdadm --fail /dev/md3 /dev/sdb3";
  2. "mdadm --remove /dev/md3 /dev/sdb3", then the array becomes "clean,degraded";
  3. power down the system and replace the 1TB /dev/sdb with one of the new 4TB disks;
  4. power up the system; the new disk is detected correctly as "/dev/sdb";
  5. "mdadm --add /dev/md3 /dev/sdb3", and the rebuild starts.
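
The mdadm steps in the list above can be sketched as a dry-run script. This is only an illustration under assumptions: it uses the explicit "--manage" form of the commands, and the helper names (run, retire_member, add_member) and the DRY_RUN variable are hypothetical:

```shell
#!/bin/sh
# Dry-run sketch of the member-replacement steps.
# run() only prints by default; set DRY_RUN=0 to execute as root.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = "1" ]; then echo "$*"; else "$@"; fi
}

# Steps 1-2: mark the old member as failed, then remove it from the array.
retire_member() {
    array="$1"; member="$2"
    run mdadm --manage "$array" --fail "$member"
    run mdadm --manage "$array" --remove "$member"
}

# Step 5: add the partition on the new disk; md then starts the rebuild.
add_member() {
    array="$1"; member="$2"
    run mdadm --manage "$array" --add "$member"
    run cat /proc/mdstat   # shows the rebuild progress
}

# Example:
#   retire_member /dev/md3 /dev/sdb3
#   (power down, swap the disk, power up)
#   add_member /dev/md3 /dev/sdb3
```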

The rebuild ran smoothly at first but died at about 98% progress with a burst of I/O errors:
Code:
> ata2.00: exception Emask 0x0 SAct 0x1400 SErr 0x0 action 0x0
> ata2.00: irq_stat 0x40000008
> ata2.00: failed command: READ FPDMA QUEUED
> ata2.00: cmd 60/08:50:50:4b:7e/00:00:12:00:00/40 tag 10 ncq dma 4096 in
>          res 41/40:00:50:4b:7e/00:00:12:00:00/40 Emask 0x409 (media error) <F>
> ata2.00: status: { DRDY ERR }
> ata2.00: error: { UNC }
> ata2.00: configured for UDMA/133
> sd 1:0:0:0: [sda] tag#10 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=3s
> sd 1:0:0:0: [sda] tag#10 Sense Key : Medium Error [current]
> sd 1:0:0:0: [sda] tag#10 Add. Sense: Unrecovered read error - auto reallocate failed
> sd 1:0:0:0: [sda] tag#10 CDB: Read(10) 28 00 12 7e 4b 50 00 00 08 00
> blk_update_request: I/O error, dev sda, sector 310266704 op 0x0:(READ) flags 0x800 phys_seg 1 prio class 0
> ata2: EH complete
> md/raid1:md3: sda: unrecoverable I/O read error for block 308167424
> md: md3: recovery interrupted.
> ata2.00: exception Emask 0x0 SAct 0x2 SErr 0x0 action 0x0
> ata2.00: irq_stat 0x40000008
> ata2.00: failed command: READ FPDMA QUEUED
> ata2.00: cmd 60/08:08:60:56:7e/00:00:12:00:00/40 tag 1 ncq dma 4096 in
>          res 41/40:00:60:56:7e/00:00:12:00:00/40 Emask 0x409 (media error) <F>
> ata2.00: status: { DRDY ERR }
> ata2.00: error: { UNC }
> ata2.00: configured for UDMA/133
> sd 1:0:0:0: [sda] tag#1 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=3s
> sd 1:0:0:0: [sda] tag#1 Sense Key : Medium Error [current]
> sd 1:0:0:0: [sda] tag#1 Add. Sense: Unrecovered read error - auto reallocate failed
> sd 1:0:0:0: [sda] tag#1 CDB: Read(10) 28 00 12 7e 56 60 00 00 08 00
> blk_update_request: I/O error, dev sda, sector 310269536 op 0x0:(READ) flags 0x800 phys_seg 1 prio class 0
> ata2: EH complete
> md/raid1:md3: sda: unrecoverable I/O read error for block 308170240
The steps have been repeated with different combinations of SATA cables and ports, so a cabling or port fault is unlikely.
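
For reading the log, the "CDB: Read(10)" lines can be decoded by hand. A small sketch, assuming the standard Read(10) layout (opcode 0x28, bytes 2-5 the big-endian LBA, bytes 7-8 the sector count) and 512-byte logical sectors. The function name decode_read10 is hypothetical:

```shell
#!/bin/sh
# Decode a SCSI Read(10) CDB as printed by the kernel.
# Arguments $1..$10 are the ten hex bytes of the CDB.
decode_read10() {
    [ "$1" = "28" ] || { echo "not a Read(10) CDB" >&2; return 1; }
    lba=$(( 0x$3$4$5$6 ))      # bytes 2-5: logical block address
    count=$(( 0x$8$9 ))        # bytes 7-8: number of sectors to read
    echo "lba=$lba sectors=$count bytes=$(( count * 512 ))"
}

# The two CDBs from the log above:
decode_read10 28 00 12 7e 4b 50 00 00 08 00
decode_read10 28 00 12 7e 56 60 00 00 08 00
```

The decoded LBAs (310266704 and 310269536) are the same sector numbers that appear in the "blk_update_request" lines for sda, and 8 sectors of 512 bytes match the "dma 4096 in" transfers. The failing reads are therefore on the old disk that the rebuild copies from.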

The "degraded" array still works perfectly for all other operations.

In its 4 years of service, no errors have been logged.
We cannot think of a reason for the rebuild failure.

Should we copy all files to the new partition and re-create the array?

Please kindly advise.
 
Old 10-20-2022, 09:18 AM   #2
dc.901
Senior Member
 
Registered: Aug 2018
Location: Atlanta, GA - USA
Distribution: CentOS/RHEL, openSuSE/SLES, Ubuntu
Posts: 1,005

Rep: Reputation: 370
The I/O errors all point to sda:

Code:
> sd 1:0:0:0: [sda] tag#10 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=3s
> sd 1:0:0:0: [sda] tag#10 Sense Key : Medium Error [current]
> sd 1:0:0:0: [sda] tag#10 Add. Sense: Unrecovered read error - auto reallocate failed
> sd 1:0:0:0: [sda] tag#10 CDB: Read(10) 28 00 12 7e 4b 50 00 00 08 00
> blk_update_request: I/O error, dev sda, sector 310266704 op 0x0:(READ) flags 0x800 phys_seg 1 prio class 0
> ata2: EH complete
> md/raid1:md3: sda: unrecoverable I/O read error for block 308167424
If you have the option, I would go with a new partition and a new array; that will be a cleaner operation.
 
Old 10-20-2022, 11:46 PM   #3
anctop
Member
 
Registered: Feb 2010
Posts: 101

Original Poster
Rep: Reputation: Disabled
Thanks for your suggestion.

I overlooked an important detail about the old disk.
SMART did in fact start reporting errors a couple of weeks ago:

Code:
Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors
As the disks have been in use for 4 years, we are not going to spend effort repairing the old one by locating the corrupted sectors, etc.
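
For anyone who wants to watch the same counter directly, here is a minimal sketch that extracts it from "smartctl -A" output. It assumes the usual ATA attribute table, where attribute 197 is Current_Pending_Sector and the raw value is the 10th column. The function name pending_sectors is hypothetical:

```shell
#!/bin/sh
# Print the raw value of SMART attribute 197 (Current_Pending_Sector)
# from `smartctl -A` output supplied on stdin.
# Returns non-zero if the attribute is not present in the input.
pending_sectors() {
    awk '$1 == 197 { print $10; found = 1 } END { if (!found) exit 1 }'
}

# Example (as root):
#   smartctl -A /dev/sda | pending_sectors
```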

It sounds much easier to copy files from "/dev/md3" to "/dev/sdb3" and then create a new array from "/dev/sdb3" afresh.
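
A dry-run sketch of one way to do this. Note that it uses a different order from the sentence above: the array is created first on the new partition (with the second slot left "missing"), then the filesystem, then the copy, because with mdadm's default metadata format the array's data area does not start at the beginning of the partition. The names /dev/md4, /mnt/old and /mnt/new and the helper functions are hypothetical:

```shell
#!/bin/sh
# Dry-run sketch: build a fresh RAID-1 on the new disk and copy the data over.
# run() only prints by default; set DRY_RUN=0 to execute as root.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = "1" ]; then echo "$*"; else "$@"; fi
}

migrate_to_new_array() {
    new_md="$1"; new_part="$2"; old_mnt="$3"; new_mnt="$4"
    # Create the mirror with one real member; "missing" reserves the second slot.
    run mdadm --create "$new_md" --level=1 --raid-devices=2 "$new_part" missing
    run mkfs.ext4 "$new_md"
    run mount "$new_md" "$new_mnt"
    # Copy files, preserving hard links, ACLs and extended attributes.
    run rsync -aHAX "$old_mnt"/ "$new_mnt"/
}

# Example:
#   migrate_to_new_array /dev/md4 /dev/sdb3 /mnt/old /mnt/new
```

The partition on the second new disk can be added to the new array later with "mdadm --add", once that disk is installed.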
 
Old 10-25-2022, 01:14 AM   #4
anctop
Member
 
Registered: Feb 2010
Posts: 101

Original Poster
Rep: Reputation: Disabled
We have finally created a new RAID-1 array, which has been running continuously for the past few days without error.

We still have two questions about its configuration:
  1. Some documentation suggests increasing the "readahead" of each disk from the default of 256 to 1024.
    Should the same value be set for the RAID device as well (because the array does not inherit the value from its components)?
  2. We used "mkfs.ext4" to create the filesystem.
    The "dumpe2fs" output shows that the new filesystem has "Maximum mount count: -1" and "Check interval: 0 (<none>)".
    The values on the old partitions were "30" and "15552000 (6 months)" respectively.
    Should the old values be applied to the new filesystem to force regular checking?
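
For reference, both settings can be inspected without changing anything. A small sketch, assuming the readahead values are the ones reported by "blockdev --getra" (counted in 512-byte sectors). The helper names ra_to_kib and check_fields, and the placeholder /dev/mdX for the new array, are hypothetical:

```shell
#!/bin/sh
# Convert a readahead value in 512-byte sectors to KiB.
ra_to_kib() {
    echo $(( $1 * 512 / 1024 ))
}

# Extract the two periodic-check fields from `dumpe2fs -h` output on stdin.
check_fields() {
    grep -E '^(Maximum mount count|Check interval):'
}

# Examples (as root; /dev/mdX stands for the new array):
#   blockdev --getra /dev/sda /dev/sdb /dev/mdX
#   dumpe2fs -h /dev/mdX 2>/dev/null | check_fields

ra_to_kib 256     # the default mentioned above
ra_to_kib 1024    # the suggested value
```

So the default of 256 corresponds to 128 KiB of readahead and 1024 to 512 KiB.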

Please advise.
 
Old 10-25-2022, 04:08 AM   #5
lvm_
Senior Member
 
Registered: Jul 2020
Posts: 1,547

Rep: Reputation: 531
Rebooting a server expecting 1 minute of downtime, only to have it start checking filesystems because some clown set up scheduled filesystem checks, is close to the top of my list of misconfiguration annoyances. I have never known a filesystem to go bad on its own either. If it is dirty, fsck will run regardless; otherwise leave it alone. As for readahead, I believe setting it for individual RAID member disks is redundant; I set it only for the RAID device itself. And by the way, stripe_cache_size (on RAID-4/5/6 arrays) has a much stronger effect on md raid performance than setra.
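
A dry-run sketch of the two knobs mentioned here, under assumptions: "setra" is taken to mean "blockdev --setra", and the stripe_cache_size file is looked up under /sys/block/<md>/md/, where it exists only for RAID-4/5/6 arrays (so not for a RAID-1 array such as the one in this thread). The helper names are hypothetical:

```shell
#!/bin/sh
# run() only prints by default; set DRY_RUN=0 to execute as root.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = "1" ]; then echo "$*"; else "$@"; fi
}

# Set readahead (in 512-byte sectors) on the md device only.
set_md_readahead() {
    md_dev="$1"; sectors="$2"
    run blockdev --setra "$sectors" "$md_dev"
}

# sysfs location of stripe_cache_size for an md array name such as "md0".
stripe_cache_path() {
    echo "/sys/block/$1/md/stripe_cache_size"
}

# True only if the array actually exposes the tunable (RAID-4/5/6).
has_stripe_cache() {
    [ -e "$(stripe_cache_path "$1")" ]
}

# Example:
#   set_md_readahead /dev/md3 1024
#   has_stripe_cache md3 && cat "$(stripe_cache_path md3)"
```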

Last edited by lvm_; 10-25-2022 at 04:29 AM.
 
Old 10-25-2022, 04:21 AM   #6
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,389

Rep: Reputation: 4191
Best to read the discussion in the mke2fs and tune2fs manpages re filesystem errors and no dirty flag. I continue to be confounded that distro maintainers don't give this more credence. FWIW I always turn on periodic checks in the conf prior to any mkfs.

Re readahead, I leave it to the kernel's dynamic adjustment based on I/O patterns. But then I have (very) modest loads - and no production entanglements.
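
For an existing filesystem, the old values quoted earlier in the thread (30 mounts, 15552000 seconds) could be restored with tune2fs. A dry-run sketch, assuming "the conf" refers to /etc/mke2fs.conf, where the "enable_periodic_fsck" relation affects only filesystems created afterwards. The helper names and the placeholder /dev/mdX are hypothetical:

```shell
#!/bin/sh
# run() only prints by default; set DRY_RUN=0 to execute as root.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = "1" ]; then echo "$*"; else "$@"; fi
}

# Convert a check interval in seconds (as shown by dumpe2fs) to whole days.
interval_days() {
    echo $(( $1 / 86400 ))
}

# Re-enable periodic checks: -c is the maximum mount count, -i the interval.
enable_periodic_fsck() {
    dev="$1"; max_mounts="$2"; days="$3"
    run tune2fs -c "$max_mounts" -i "${days}d" "$dev"
}

# Example (/dev/mdX stands for the new array):
#   enable_periodic_fsck /dev/mdX 30 "$(interval_days 15552000)"
```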
 