Unable to get system to boot with read error/bad sector on RAID disk

81bones · 03-21-2020, 04:29 PM

I have a server running CentOS 7.7.1908 and I just rebooted it. Instead of booting normally, it dropped to the console and asked for the root password so it could enter emergency mode. It cannot finish booting because it is trying to mount my software RAID volume (a RAID5 array created with mdadm) and one of the disks appears to have errors.

The array consists of four disks (sda, sdb, sdc, and sdd). One disk (sdc) appears to have developed bad sectors, and smartctl bears this out:

Code:

root@server# smartctl -a /dev/sdc
...
SMART Self-test log structure revision number 1
Num Test_Description   Status                  Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline   Completed: read failure       90%    30731        2058
# 2 Short offline      Completed: read failure       70%    30731        2058

I tried to run fsck from the emergency console but am getting this error and output:

Code:

root@server# fsck /dev/sdc
fsck from util-linux 2.23.2
e2fsck 1.42.9 (28-Dec-2013)
ext2fs_open2: Bad magic number in super-block
fsck.ext2: Superblock invalid, trying backup blocks...
fsck.ext2: Bad magic number in super-block while trying to open /dev/sdc

The superblock could not be read or does not describe a correct ext2
filesystem. If the device is valid and it really contains an ext2
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
   e2fsck -b 8193 <device>

[ 2226.194946] blk_update_request: critical medium error, dev sdc, sector 2058
[ 2226.339391] blk_update_request: critical medium error, dev sdc, sector 2058
[ 2226.339438] Buffer I/O error on dev sdc1, logical block 1, async page read
[ 2226.694976] blk_update_request: critical medium error, dev sdc, sector 2058
[ 2226.839412] blk_update_request: critical medium error, dev sdc, sector 2058
[ 2226.839461] Buffer I/O error on dev sdc1, logical block 1, async page read

I assume at least part of the problem is that sdc does not, in fact, contain an ext2 filesystem, since it is a member of the RAID array. For reference, here is the output from fdisk and gdisk:

Code:

root@server# fdisk -l /dev/sdc
WARNING: fdisk GPT support is currently new, and therefore in an experimental phase. Use at your own discretion.

Disk /dev/sdc: 2000.4 GB, 2000398934016 bytes, 3907029168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk label type: gpt
Disk identifier: E8379A35-3379-49B6-B4F0-7C109D2BB307

#       Start           End   Size  Type          Name
 1       2048    3907028991   1.8T  Linux RAID    primary


root@server# gdisk -l /dev/sdc
GPT fdisk (gdisk) version 0.8.10

Partition table scan:
  MBR: protective
  BSD: not present
  APM: not present
  GPT: present

Found valid GPT with protective MBR; using GPT.
Disk /dev/sdc: 3907029168 sectors, 1.8 TiB
Disk identifier (GUID): E8379A35-3379-49B6-B4F0-7C109D2BB307
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 3907029134
Partitions will be aligned on 2048-sector boundaries
Total free space is 2157 sectors (1.1 MiB)

Number  Start (sector)    End (sector)  Size     Code  Name
   1            2048      3907028991   1.8 TiB   FD00  primary

[ 2265.233103] blk_update_request: critical medium error, dev sdc, sector 2058
[ 2265.233152] Buffer I/O error on dev sdc1, logical block 1, async page read
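
As a rough check on what the kernel messages above are saying, the failing sector can be mapped to a block inside sdc1. The sketch below assumes the partition start (2048) and the 512-byte logical sector size from the fdisk output, plus a 4096-byte block size for the "logical block" in the Buffer I/O errors. The function name sector_to_block is made up for illustration.

```shell
#!/bin/sh
# Map an absolute disk sector to a block index inside a partition.
# Arguments: absolute sector, partition start sector,
#            logical sector size in bytes, block size in bytes.
sector_to_block() {
    echo $(( ($1 - $2) * $3 / $4 ))
}

# Values taken from the fdisk output and the kernel log above.
sector_to_block 2058 2048 512 4096
```

This prints 1, which matches the "logical block 1" in the Buffer I/O errors. md metadata version 1.2 keeps its superblock 4 KiB from the start of the member device, so the unreadable block likely overlaps the md superblock on sdc1, which would explain why md cannot add this disk.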

So I'm currently stuck in emergency mode because my system wants to add this drive to my RAID and mount the /dev/md127 device, and it can't add the drive because there are errors. Here are my questions:

1. Is there a way I can force the system to boot without attempting to mount the array? This might generate lots of other problems, as there are several applications that will look for files and paths on the array that won't exist, but at least I would be able to get back to a normal console.
2. Is there a way to run fsck (or something similar or more useful) on /dev/sdc so I can try to fix it or mark the bad sector so the disk will add to the array successfully?
3. Since this disk is part of a RAID array, perhaps there's a way that I can just mark the disk as bad/failed, and thus md will ignore it and the system will boot normally? I understand I'll need to replace the disk ASAP since the array will have lost redundancy, but at least I'll be running again and can access the data on the array.

I am happy to provide any other command output that might be useful. Please help!!
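
On the first question, one commonly suggested approach on systemd-based systems such as CentOS 7 is to add the nofail option to the array's /etc/fstab entry, so that a missing device does not stop the boot. The sketch below is only an illustration: the helper name add_nofail, the mount point /data, and the ext4 type are assumptions, not details from this server. It filters fstab text on stdin and prints the result rather than editing the file in place.

```shell
#!/bin/sh
# Hypothetical helper: append "nofail" to the mount options (field 4)
# of the fstab entry whose first field equals $1.
# Reads fstab text on stdin and writes the modified text to stdout.
add_nofail() {
    awk -v dev="$1" '
        # Only touch the matching entry, and only if nofail is not already set.
        $1 == dev && $4 !~ /(^|,)nofail(,|$)/ { $4 = $4 ",nofail" }
        { print }
    '
}

# Example with an assumed entry (mount point and filesystem type are made up).
printf '%s\n' '/dev/md127 /data ext4 defaults 0 2' | add_nofail /dev/md127
```

The output should be reviewed before it replaces the real /etc/fstab. Note that awk rewrites the fields of a modified line separated by single spaces.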

81bones · 03-21-2020, 05:37 PM

Some additional info. I thought trying to force the array to repair itself (as noted at https://unix.stackexchange.com/quest...nd/25934#25934) might help fix the problem. Here's the output of mdstat:

Code:

root@server# cat /proc/mdstat
Personalities :
md127 : inactive sda1[0](S) sdb1[1](S) sdd1[4](S)
      5860529953 blocks super 1.2

unused devices: <none>

Attempting to do "echo 'repair' > /sys/block/md127/md/sync_action" does not work (I get a permission denied error). This is presumably because the array has not been properly assembled? Should I attempt to assemble and/or re-add the drive using mdadm?
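
The word "inactive" and the (S) flags in the mdstat output above are consistent with that guess: the members are being held as spares of an array that was never started. A small sketch for spotting this state follows. The function name inactive_arrays is hypothetical, and it only reads text in /proc/mdstat format from stdin.

```shell
#!/bin/sh
# Hypothetical helper: print the name of every md array that
# /proc/mdstat-style text on stdin reports as "inactive".
inactive_arrays() {
    awk '$2 == ":" && $3 == "inactive" { print $1 }'
}

# On a live system: inactive_arrays < /proc/mdstat
# Here it is fed the lines shown above.
inactive_arrays <<'EOF'
Personalities :
md127 : inactive sda1[0](S) sdb1[1](S) sdd1[4](S)
      5860529953 blocks super 1.2

unused devices: <none>
EOF
```

With the output above as input, this prints md127.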

81bones · 03-21-2020, 06:16 PM

More updates. I tried to remove and re-add the bad drive by running the following:

Code:

root@server# mdadm --stop /dev/md127
root@server# mdadm --assemble --scan

Stopping the array succeeded and removed the /dev/md127 device, as expected. The automated assemble briefly dumped some of the same journal errors for sdc to the screen, but then appeared to assemble the array with only the three remaining drives. The screen then switched back to the graphical boot screen with the spinning wheel. Hitting Esc got me back to the console, but no new information was shown and it just sat there, seemingly locked up. After a while I decided to cross my fingers and power the system off. After waiting a bit and powering it back on, the system booted all the way to the graphical login. Huzzah!

The array is now in a clean but degraded state since sdc is missing:

Code:

root@server# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md127 : active raid5 sda1[0] sdb1[1] sdd1[4]
      5860528128 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UU_U]

unused devices: <none>
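
For anyone reading the status line above: [4/3] means the array expects four members and has three active, and the underscore in [UU_U] marks the missing slot. Below is a minimal sketch that pulls the missing slot numbers out of /proc/mdstat-style text. The name missing_slots is hypothetical, and slots are counted from 0.

```shell
#!/bin/sh
# Hypothetical helper: for each md array in /proc/mdstat-style text on
# stdin, print "<array> <slot>" for every "_" in its [UU_U]-style map.
missing_slots() {
    awk '
        # Remember the array name from lines like "md127 : active raid5 ..."
        $2 == ":" { name = $1 }
        # Find the member status map, e.g. [UU_U]
        match($0, /\[[U_]+\]/) {
            map = substr($0, RSTART + 1, RLENGTH - 2)
            for (i = 1; i <= length(map); i++)
                if (substr(map, i, 1) == "_") print name, i - 1
        }
    '
}

# On a live system: missing_slots < /proc/mdstat
# Here it is fed the lines shown above.
missing_slots <<'EOF'
Personalities : [raid6] [raid5] [raid4]
md127 : active raid5 sda1[0] sdb1[1] sdd1[4]
      5860528128 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UU_U]

unused devices: <none>
EOF
```

With the output above as input, this prints "md127 2", the third position in the array, where sdc1 used to sit.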

A new drive has been ordered, and I am currently running "badblocks -vn /dev/sdc > bad-blocks-sdc.txt" on the bad drive to see if I can try to get fsck to repair it. There's no real need to do this at this point since the array is still ok, but I'd like to see if it will work. So I guess the crisis has been averted...for now...