LinuxQuestions.org
03-21-2020, 04:29 PM   #1
81bones
Member
 
Registered: Oct 2006
Location: Chicago, IL
Distribution: Almalinux
Posts: 66

Rep: Reputation: 15
Unable to get system to boot with read error/bad sector on RAID disk


I have a server running CentOS 7.7.1908 that I just rebooted. Instead of booting normally, it stopped at the console and asked for the root password so it could enter emergency mode. It cannot finish booting because it is trying to mount my software RAID volume (a RAID5 array created with mdadm), and one of the disks appears to have errors.

The array consists of four disks (sda, sdb, sdc, and sdd). One disk (sdc) appears to have developed some bad sectors. Smartctl bears this out:

Code:
root@server# smartctl -a /dev/sdc
...
SMART Self-test log structure revision number 1
Num Test_Description   Status                  Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline   Completed: read failure       90%    30731        2058
# 2 Short offline      Completed: read failure       70%    30731        2058
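A minimal sketch of pulling the failing LBA out of the self-test log shown above, so it can be compared with the sector numbers in kernel messages. The function name `first_error_lba` is hypothetical; it only parses text in the layout shown and assumes the LBA is the last field on each line.

```shell
#!/bin/sh
# Print the LBA_of_first_error of every self-test entry that reports a
# read failure. Assumes the log layout shown above, where the LBA is the
# last field on the line. Duplicate LBAs are printed once.
first_error_lba() {
    awk '/^# *[0-9]+ / && /read failure/ { print $NF }' | sort -un
}

# Sample input copied from the smartctl output above. On a live system the
# input would come from: smartctl -l selftest /dev/sdc
first_error_lba <<'EOF'
# 1 Extended offline   Completed: read failure       90%    30731        2058
# 2 Short offline      Completed: read failure       70%    30731        2058
EOF
```

Both self-tests stop at the same LBA, so the sketch prints 2058 once.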
I tried to run fsck from the emergency console but am getting this error and output:

Code:
root@server# fsck /dev/sdc
fsck from util-linux 2.23.2
e2fsck 1.42.9 (28-Dec-2013)
ext2fs_open2: Bad magic number in super-block
fsck.ext2: Superblock invalid, trying backup blocks...
fsck.ext2: Bad magic number in super-block while trying to open /dev/sdc

The superblock could not be read or does not describe a correct ext2
filesystem. If the device is valid and it really contains an ext2
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
   e2fsck -b 8193 <device>

[ 2226.194946] blk_update_request: critical medium error, dev sdc, sector 2058
[ 2226.339391] blk_update_request: critical medium error, dev sdc, sector 2058
[ 2226.339438] Buffer I/O error on dev sdc1, logical block 1, async page read
[ 2226.694976] blk_update_request: critical medium error, dev sdc, sector 2058
[ 2226.839412] blk_update_request: critical medium error, dev sdc, sector 2058
[ 2226.839461] Buffer I/O error on dev sdc1, logical block 1, async page read
I assume at least part of the issue here is that sdc does not, in fact, hold an ext2 filesystem, since it's a member of a RAID array. For reference, here's the output from fdisk and gdisk:

Code:
root@server# fdisk -l /dev/sdc
WARNING: fdisk GPT support is currently new, and therefore in an experimental phase. Use at your own discretion.

Disk /dev/sdc: 2000.4 GB, 2000398934016 bytes, 3907029168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk label type: gpt
Disk identifier: E8379A35-3379-49B6-B4F0-7C109D2BB307

#       Start           End   Size  Type          Name
 1       2048    3907028991   1.8T  Linux RAID    primary


root@server# gdisk -l /dev/sdc
GPT fdisk (gdisk) version 0.8.10

Partition table scan:
  MBR: protective
  BSD: not present
  APM: not present
  GPT: present

Found valid GPT with protective MBR; using GPT.
Disk /dev/sdc: 3907029168 sectors, 1.8 TiB
Disk identifier (GUID): E8379A35-3379-49B6-B4F0-7C109D2BB307
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 3907029134
Partitions will be aligned on 2048-sector boundaries
Total free space is 2157 sectors (1.1 MiB)

Number  Start (sector)    End (sector)  Size     Code  Name
   1            2048      3907028991   1.8 TiB   FD00  primary

[ 2265.233103] blk_update_request: critical medium error, dev sdc, sector 2058
[ 2265.233152] Buffer I/O error on dev sdc1, logical block 1, async page read
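The kernel messages and the partition table above can be tied together with a little arithmetic. The sketch below is illustrative only: the variable names are hypothetical, the failing sector (2058) and partition start (2048) are taken from the output above, and the 4096-byte block size is an assumption based on the page-sized reads that the kernel reports as "logical block 1".

```shell
#!/bin/sh
# Where does the failing disk sector fall inside the sdc1 partition?
bad_sector=2058      # from "critical medium error, dev sdc, sector 2058"
part_start=2048      # start of partition 1 according to fdisk/gdisk
sector_size=512      # logical sector size reported by fdisk
block_size=4096      # assumed size of the buffered (page-sized) reads

offset_sectors=$((bad_sector - part_start))
offset_bytes=$((offset_sectors * sector_size))
block=$((offset_bytes / block_size))

echo "offset into sdc1: $offset_sectors sectors ($offset_bytes bytes)"
echo "block within sdc1: $block"
```

The result (block 1) agrees with the "logical block 1" in the Buffer I/O error lines. The md documentation places the version 1.2 superblock 4 KiB from the start of the member device, which is the start of that same block, so the unreadable sector may be what keeps sdc1 from being recognised as an array member.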
So I'm currently stuck in emergency mode because my system wants to add this drive to my RAID and mount the /dev/md127 device, and it can't add the drive because there are errors. Here are my questions:
  1. Is there a way to force the system to boot without attempting to mount the array? This might cause plenty of other problems, since several applications will look for files and paths on the array that won't exist, but at least I would be able to get back to a normal console.
  2. Is there a way to run fsck (or something similar or more useful) on /dev/sdc so I can try to fix it, or mark the bad sector, so that the disk can be added to the array successfully?
  3. Since this disk is part of a RAID array, is there a way to just mark the disk as bad/failed so that md ignores it and the system boots normally? I understand I'll need to replace the disk ASAP since the array will have lost redundancy, but at least I'll be running again and can access the data on the array.

I am happy to provide any other command output that might be useful. Please help!!
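On question 1, one common approach on systemd-based distributions such as CentOS 7 is to add the `nofail` option to the array's /etc/fstab entry, so that a missing device does not drop the boot into emergency mode. The sketch below only shows the edit on a scratch file. The function name `add_nofail`, the mount point `/mnt/raid`, and the ext4 filesystem type are all hypothetical, since the thread does not show the real fstab entry.

```shell
#!/bin/sh
# add_nofail FILE MOUNTPOINT
# Print FILE with ",nofail" appended to the options (4th field) of the entry
# mounted at MOUNTPOINT, unless it is already there. Comment lines and other
# entries pass through unchanged. Fields on the edited line are re-joined
# with single spaces.
add_nofail() {
    awk -v mp="$2" '
        $1 !~ /^#/ && $2 == mp && $4 !~ /(^|,)nofail(,|$)/ {
            $4 = $4 ",nofail"
        }
        { print }
    ' "$1"
}

# Demonstration on a scratch file with a hypothetical entry.
tmp=$(mktemp)
printf '%s\n' '/dev/md127 /mnt/raid ext4 defaults 0 2' > "$tmp"
add_nofail "$tmp" /mnt/raid
rm -f "$tmp"
```

The output would be reviewed and copied over /etc/fstab by hand, after backing up the original. With `nofail` in place the mount is simply skipped when the array is unavailable, which means anything that expects files on the array will still need attention.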
 
03-21-2020, 05:37 PM   #2
81bones
Original Poster
Some additional info. I thought trying to force the array to repair itself (as noted at https://unix.stackexchange.com/quest...nd/25934#25934) might help fix the problem. Here's the output of mdstat:
Code:
root@server# cat /proc/mdstat
Personalities :
md127 : inactive sda1[0](S) sdb1[1](S) sdd1[4](S)
      5860529953 blocks super 1.2

unused devices: <none>
Running "echo 'repair' > /sys/block/md127/md/sync_action" does not work (I get a permission denied error), presumably because the array has not been properly assembled. Should I attempt to assemble the array and/or re-add the drive using mdadm?
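The mdstat output above shows md127 as inactive, with every listed member flagged (S) and sdc1 missing from the list. A small sketch for spotting that state from a script follows; the function name `inactive_arrays` is hypothetical, and it only parses text in the /proc/mdstat layout shown.

```shell
#!/bin/sh
# Print the name of every md array that is reported as inactive.
# Reads mdstat-formatted text on stdin.
inactive_arrays() {
    awk '$1 ~ /^md/ && $2 == ":" && $3 == "inactive" { print $1 }'
}

# Sample input copied from above. On a live system:
#   inactive_arrays < /proc/mdstat
inactive_arrays <<'EOF'
Personalities :
md127 : inactive sda1[0](S) sdb1[1](S) sdd1[4](S)
      5860529953 blocks super 1.2

unused devices: <none>
EOF
```

For the sample this prints md127. To see what each member's own metadata says, `mdadm --examine /dev/sda1` (and likewise for the other partitions) is the usual next step.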
 
03-21-2020, 06:16 PM   #3
81bones
Original Poster
More updates. I tried to remove the bad drive from the array and re-add it by running the following:
Code:
root@server# mdadm --stop /dev/md127
root@server# mdadm --assemble --scan
Stopping the array succeeded and removed the /dev/md127 device, as expected. The automated assemble briefly dumped some of the same journal errors for sdc to the screen, but then seemed to assemble the array with only the three remaining drives. The screen then switched back to the graphical boot screen with the spinning wheel. Hitting Esc got me back to the console, but no new information was present and it just sat there, seemingly locked up. After a while I decided to cross my fingers and power the system off. After waiting a bit and powering it back on, the system actually booted all the way to the graphical login. Huzzah!

The array is now in a clean but degraded state since sdc is missing:
Code:
root@server# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md127 : active raid5 sda1[0] sdb1[1] sdd1[4]
      5860528128 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UU_U]

unused devices: <none>
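The `[4/3] [UU_U]` field above is what marks the array as degraded: four members expected, three working. A sketch of checking for that from a script follows; the function name `degraded_arrays` is hypothetical, and it relies only on the mdstat layout shown.

```shell
#!/bin/sh
# Print "NAME expected/working" for every array that has fewer working
# members than expected, based on the "[n/m]" field in mdstat-formatted
# text read from stdin.
degraded_arrays() {
    awk '
        # Remember which array the following status line belongs to.
        $1 ~ /^md/ && $2 == ":" { name = $1 }
        # The status line carries "[expected/working]".
        match($0, /\[[0-9]+\/[0-9]+\]/) {
            split(substr($0, RSTART + 1, RLENGTH - 2), n, "/")
            if (n[2] + 0 < n[1] + 0) print name, n[1] "/" n[2]
        }
    '
}

# Sample input copied from above. On a live system:
#   degraded_arrays < /proc/mdstat
degraded_arrays <<'EOF'
Personalities : [raid6] [raid5] [raid4]
md127 : active raid5 sda1[0] sdb1[1] sdd1[4]
      5860528128 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UU_U]

unused devices: <none>
EOF
```

For the sample this prints `md127 4/3`. A healthy array produces no output, so the function is easy to drop into a cron job or monitoring check.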
A new drive has been ordered, and I am currently running "badblocks -vn /dev/sdc > bad-blocks-sdc.txt" on the bad drive to see whether I can get fsck to repair it. There's no real need to do this now since the array is still OK, but I'd like to see if it will work. So I guess the crisis has been averted... for now.
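Once the replacement drive arrives, the usual sequence is to copy the partition layout from a healthy member and then add the new partition to the array. The sketch below only prints the commands rather than running them. The function name `print_replace_plan` is hypothetical, and so is the assumption that the new disk shows up as /dev/sdc; device names should be checked (for example with lsblk) before anything is run for real.

```shell
#!/bin/sh
# print_replace_plan ARRAY HEALTHY_DISK NEW_DISK
# Print (do not execute) the commands to bring NEW_DISK into ARRAY, using
# HEALTHY_DISK as the template for the partition table.
print_replace_plan() {
    array=$1
    healthy=$2
    new=$3
    # Copy the GPT from the healthy disk onto the new disk.
    echo "sgdisk --replicate=$new $healthy"
    # Give the new disk its own disk and partition GUIDs.
    echo "sgdisk --randomize-guids $new"
    # Add the first partition of the new disk; md then rebuilds onto it.
    echo "mdadm --manage $array --add ${new}1"
    # Watch the rebuild progress.
    echo "cat /proc/mdstat"
}

print_replace_plan /dev/md127 /dev/sda /dev/sdc
```

Printing the plan first makes it harder to swap the source and target disks by accident, which matters because `sgdisk --replicate` overwrites the partition table on the disk named in the option.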
 
  



Tags
fsck, raid


