I have a server running CentOS 7.7.1908 that I just rebooted. Instead of booting normally, it dropped to the console and asked me for the root password so it could enter emergency mode. It cannot finish booting because it is trying to mount my software RAID volume (a RAID5 array created using mdadm) and there appear to be errors on one of the disks.
The array consists of four disks (sda, sdb, sdc, and sdd). One disk (sdc) appears to have developed some bad sectors. Smartctl bears this out:
Code:
root@server# smartctl -a /dev/sdc
...
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed: read failure 90% 30731 2058
# 2 Short offline Completed: read failure 70% 30731 2058
I tried to run fsck from the emergency console but am getting this error and output:
Code:
root@server# fsck /dev/sdc
fsck from util-linux 2.23.2
e2fsck 1.42.9 (28-Dec-2013)
ext2fs_open2: Bad magic number super-block
fsck.ext2: Superblock invalid, trying backup blocks...
fsck.ext2: Bad magic number in super-block while trying to open /dev/sdc
The superblock could not be read or does not describe a correct ext2
filesystem. If the device is valid and it really contains an ext2
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
e2fsck -b 8193 <device>
[ 2226.194946] blk_update_request: critical medium error, dev sdc, sector 2058
[ 2226.339391] blk_update_request: critical medium error, dev sdc, sector 2058
[ 2226.339438] Buffer I/O error on dev sdc1, logical block 1, async page read
[ 2226.694976] blk_update_request: critical medium error, dev sdc, sector 2058
[ 2226.839412] blk_update_request: critical medium error, dev sdc, sector 2058
[ 2226.839461] Buffer I/O error on dev sdc1, logical block 1, async page read
I assume at least part of the issue here is that /dev/sdc does not, in fact, contain an ext2 filesystem, since it's a member of the RAID array. For reference, here's the output from fdisk and gdisk:
Code:
root@server# fdisk -l /dev/sdc
WARNING: fdisk GPT support is currently new, and therefore in an experimental phase. Use at your own discretion.
Disk /dev/sdc: 2000.4 GB, 2000398934016 bytes, 3907029168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk label type: gpt
Disk identifier: E8379A35-3379-49B6-B4F0-7C109D2BB307
# Start End Size Type Name
1 2048 3907028991 1.8T Linux RAID primary
root@server# gdisk -l /dev/sdc
GPT fdisk (gdisk) version 0.8.10
Partition table scan:
MBR: protective
BSD: not present
APM: not present
GPT: present
Found valid GPT with protective MBR; using GPT.
Disk /dev/sdc: 3907029168 sectors, 1.8 TiB
Disk identifier (GUID): E8379A35-3379-49B6-B4F0-7C109D2BB307
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 3907029134
Partitions will be aligned on 2048-sector boundaries
Total free space is 2157 sectors (1.1 MiB)
Number Start (sector) End (sector) Size Code Name
1 2048 3907028991 1.8 TiB FD00 primary
[ 2265.233103] blk_update_request: critical medium error, dev sdc, sector 2058
[ 2265.233152] Buffer I/O error on dev sdc1, logical block 1, async page read
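As a sanity check, the sector number and the "logical block 1" in those kernel messages seem to describe the same spot on the disk. Here is a rough sketch of the arithmetic, using the values from the output above. The 4096-byte block size is my assumption (it matches the physical sector size fdisk reports); the variable names are just for illustration.

```shell
#!/bin/sh
# Values copied from the smartctl, dmesg, and fdisk output in this post.
bad_lba=2058       # failing sector reported by SMART and the kernel
part_start=2048    # first sector of sdc1 according to fdisk/gdisk
sector_size=512    # logical sector size in bytes

# Assumption: the kernel counts "logical blocks" on sdc1 in 4096-byte units.
block_size=4096

offset_sectors=$(( bad_lba - part_start ))       # sectors into the partition
offset_bytes=$(( offset_sectors * sector_size )) # bytes into the partition
logical_block=$(( offset_bytes / block_size ))   # integer division

echo "bad sector is $offset_sectors sectors ($offset_bytes bytes) into sdc1"
echo "logical block on sdc1: $logical_block"
```

If that reasoning holds, the bad sector sits only about 5 KiB into the sdc1 partition, which is the same place both sets of errors point to.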
So I'm currently stuck in emergency mode because my system wants to add this drive to my RAID and mount the /dev/md127 device, and it can't add the drive because there are errors. Here are my questions:
- Is there a way I can force the system to boot without attempting to mount the array? This might generate lots of other problems, as there are several applications that will look for files and paths on the array that won't exist, but at least I would be able to get back to a normal console.
- Is there a way to run fsck (or something similar or more useful) on /dev/sdc so I can try to fix it or mark the bad sector so the disk will add to the array successfully?
- Since this disk is part of a RAID array, perhaps there's a way that I can just mark the disk as bad/failed, and thus md will ignore it and the system will boot normally? I understand I'll need to replace the disk ASAP since the array will have lost redundancy, but at least I'll be running again and can access the data on the array.
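To make that last option concrete, this is roughly what I have in mind. It is an untested sketch: /dev/md127 is the array from this post, but /dev/sdc1 as the member device is my assumption, and I don't know whether this works while the array is only partially assembled. The `run` helper and `DRY_RUN` variable are my own additions so that, by default, the script only prints the commands instead of running them.

```shell
#!/bin/sh
# Untested sketch: mark the suspect member as failed and remove it from the array.
ARRAY=/dev/md127    # array device named in this post
MEMBER=/dev/sdc1    # assumption: the RAID member is the sdc1 partition

# Hypothetical safety switch: with DRY_RUN=1 (the default) nothing is executed.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

run mdadm --detail "$ARRAY"                     # show current array state
run mdadm --manage "$ARRAY" --fail "$MEMBER"    # mark the member as faulty
run mdadm --manage "$ARRAY" --remove "$MEMBER"  # detach it from the array
```

Running it as-is only prints the three mdadm commands; setting DRY_RUN=0 would actually execute them, which I have not tried.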
I am happy to provide any other command output that might be useful. Please help!!