LinuxQuestions.org - [SOLVED] Raid Repair Now wont boot

Raid Repair Now wont boot - Other mounting problems

OK, so it has been a long 3 days of frustration and waiting.

Here is my situation.

1) The server went down and was booted into rescue mode, where I used PuTTY to find out that my md3 (software RAID) was degraded. That is odd, because md3 is just /var, not /, so in theory the server should still boot; I just wouldn't have my www files.

2) I managed to repair the RAID using mdadm, and the status is all good.

3) I rebooted and the server never came back online in normal mode (boot from HD), so I put it back into rescue mode.

4) I used to be able to go into rescue mode and type

Code:

mount /dev/md3 /mnt/

This would mount the RAID 1 array and let me view the files, etc.

5) Now, a couple of reboots later and still not booting, I can't run

Code:

mount /dev/md3 /mnt/

It comes up with this error:

Code:

root@rescue:/var/log# mount /dev/md3 /mnt

/dev/md3 looks like swapspace - not mounted

mount: you must specify the filesystem type

6) I am upset and frustrated as to why, after repairing a RAID, the server no longer boots. I haven't changed a setting or config in months; it's just a file server.

Any help would be great, and I'd even pay for some help. I have AIM/MSN/Skype/GTalk if anyone knows this stuff well and can lend a quick hand.

Thanks

What is the layout of your disks? That is, how are the physical drives partitioned and how are the various md devices configured?

In particular, which drives/partitions are part of md3?

What exactly did you do in step #2?

Have you checked the S.M.A.R.T. status of your drives with smartctl?

Quote:

Originally Posted by Ser Olmy (Post 4683408)

I have 2 disks (this is an OVH server).

They are laid out like this:

sda
---- sda1 = Raid 1 40GB /
---- sda2 = swap
---- sda3 = Raid 1 ~ 1.7TB /var

sdb
---- sdb1 = Raid 1 40GB /
---- sdb2 = swap
---- sdb3 = Raid 1 ~ 1.7TB /var

So the RAID arrays are

md3 = /var
md1 = /

md1 is built from sda1 and sdb1, and md3 from sda3 and sdb3.

Here is my fdisk output:

Code:

root@rescue:~# fdisk -l



Disk /dev/sda: 2000.4 GB, 2000398934016 bytes

255 heads, 63 sectors/track, 243201 cylinders

Units = cylinders of 16065 * 512 = 8225280 bytes

Sector size (logical/physical): 512 bytes / 512 bytes

I/O size (minimum/optimal): 512 bytes / 512 bytes

Disk identifier: 0x000d0305



  Device Boot      Start        End      Blocks  Id  System

/dev/sda1  *          1        5100    40958976+  fd  Linux raid autodetect

/dev/sda2            5100        8924    30718976  82  Linux swap / Solaris

/dev/sda3            8924      243201  1881830400  fd  Linux raid autodetect



Disk /dev/sdb: 2000.4 GB, 2000398934016 bytes

255 heads, 63 sectors/track, 243201 cylinders

Units = cylinders of 16065 * 512 = 8225280 bytes

Sector size (logical/physical): 512 bytes / 512 bytes

I/O size (minimum/optimal): 512 bytes / 512 bytes

Disk identifier: 0x000e5562



  Device Boot      Start        End      Blocks  Id  System

/dev/sdb1              1        5100    40958976+  fd  Linux raid autodetect

/dev/sdb2            5100        8924    30718976  82  Linux swap / Solaris

/dev/sdb3            8924      243201  1881830400  fd  Linux raid autodetect



Disk /dev/md3: 1927.0 GB, 1926994264064 bytes

2 heads, 4 sectors/track, 470457584 cylinders

Units = cylinders of 8 * 512 = 4096 bytes

Sector size (logical/physical): 512 bytes / 512 bytes

I/O size (minimum/optimal): 512 bytes / 512 bytes

Disk identifier: 0x00000000



Disk /dev/md3 doesn't contain a valid partition table



Disk /dev/md1: 41.9 GB, 41941925888 bytes

2 heads, 4 sectors/track, 10239728 cylinders

Units = cylinders of 8 * 512 = 4096 bytes

Sector size (logical/physical): 512 bytes / 512 bytes

I/O size (minimum/optimal): 512 bytes / 512 bytes

Disk identifier: 0x00000000



Disk /dev/md1 doesn't contain a valid partition table

When I try to run fsck manually I get

Code:

root@rescue:~# fsck -fc /dev/sda1

fsck from util-linux-ng 2.17.2

fsck: fsck.linux_raid_member: not found

fsck: Error 2 while executing fsck.linux_raid_member for /dev/sda1

or if I try it on md1 or md3 I get

Code:

root@rescue:~# fsck -fc /dev/md1

fsck from util-linux-ng 2.17.2

fsck: fsck.swap: not found

fsck: Error 2 while executing fsck.swap for /dev/md1

You still haven't told us what you did in step #2.

What does mdadm --misc --detail /dev/md3 say?

Quote:

Originally Posted by Ser Olmy (Post 4683426)

Since sdb3 was not part of the RAID, I added it back and then ran the repair, following this and modifying it as needed, since no new drive was put in; that partition just had a glitch.

So I ended up doing this:

Code:

mdadm /dev/md3 --manage --add /dev/sdb3

Result of mdadm --misc --detail /dev/md3 (both md1 (boot) and md3 are clean):

Code:

root@rescue:~# mdadm --misc --detail /dev/md3

/dev/md3:

        Version : 0.90

  Creation Time : Fri Jan 27 17:55:17 2012

    Raid Level : raid1

    Array Size : 1881830336 (1794.65 GiB 1926.99 GB)

  Used Dev Size : 1881830336 (1794.65 GiB 1926.99 GB)

  Raid Devices : 2

  Total Devices : 2

Preferred Minor : 3

    Persistence : Superblock is persistent



    Update Time : Sun May 20 20:48:03 2012

          State : clean

 Active Devices : 2

Working Devices : 2

 Failed Devices : 0

  Spare Devices : 0



          UUID : 4f96ca65:0859f8bf:a4d2adc2:26fd5302 (local to host rescue.ovh.net)

        Events : 0.1089706



    Number  Major  Minor  RaidDevice State

      0      8        3        0      active sync  /dev/sda3

      1      8      19        1      active sync  /dev/sdb3

Quote:

Originally Posted by bigstack (Post 4683428)

So you're saying that your RAID device wasn't working at all (in the sense that it didn't contain a valid file system), and further inspection revealed that only /dev/sda3 was part of the array?

That means the RAID 1 array was degraded, not broken. You should still have been able to mount /dev/md3. The fact that you couldn't indicates that the data on /dev/sda3 is corrupt.

You then added /dev/sdb3 to the degraded array with mdadm /dev/md3 --manage --add /dev/sdb3. Wouldn't that initiate a synchronization, causing the entire /dev/sdb3 to be overwritten with the (known corrupt) data from /dev/sda3?

Could you post the output from smartctl -a /dev/sda?

Quote:

Originally Posted by Ser Olmy (Post 4683435)

Here is that output. As for the other statement: sdb3 was the one considered degraded, and it was not showing up as part of the RAID, so I simply re-added sdb3 to the RAID. The "A" disk did not have the problem. I know this because after the array repaired itself I could mount it and see files.

Output for A:

Code:

root@rescue:~# smartctl -a /dev/sda

smartctl 5.40 2010-07-12 r3124 [i686-pc-linux-gnu] (local build)

Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net



=== START OF INFORMATION SECTION ===

Device Model:    Hitachi HDS723020BLA642

Serial Number:    MN1220F32U0U5D

Firmware Version: MN6OA5C0

User Capacity:    2,000,398,934,016 bytes

Device is:        Not in smartctl database [for details use: -P showall]

ATA Version is:  8

ATA Standard is:  ATA-8-ACS revision 4

Local Time is:    Sun May 20 21:08:40 2012 UTC

SMART support is: Available - device has SMART capability.

SMART support is: Enabled



=== START OF READ SMART DATA SECTION ===

SMART overall-health self-assessment test result: PASSED



General SMART Values:

Offline data collection status:  (0x80) Offline data collection activity

                                        was never started.

                                        Auto Offline Data Collection: Enabled.

Self-test execution status:      (  0) The previous self-test routine completed

                                        without error or no self-test has ever

                                        been run.

Total time to complete Offline

data collection:                (18950) seconds.

Offline data collection

capabilities:                    (0x5b) SMART execute Offline immediate.

                                        Auto Offline data collection on/off support.

                                        Suspend Offline collection upon new

                                        command.

                                        Offline surface scan supported.

                                        Self-test supported.

                                        No Conveyance Self-test supported.

                                        Selective Self-test supported.

SMART capabilities:            (0x0003) Saves SMART data before entering

                                        power-saving mode.

                                        Supports SMART auto save timer.

Error logging capability:        (0x01) Error logging supported.

                                        General Purpose Logging supported.

Short self-test routine

recommended polling time:        (  1) minutes.

Extended self-test routine

recommended polling time:        ( 255) minutes.

SCT capabilities:              (0x003d) SCT Status supported.

                                        SCT Error Recovery Control supported.

                                        SCT Feature Control supported.

                                        SCT Data Table supported.



SMART Attributes Data Structure revision number: 16

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE

  1 Raw_Read_Error_Rate    0x000b  100  100  016    Pre-fail  Always      -      0

  2 Throughput_Performance  0x0005  133  133  054    Pre-fail  Offline      -      93

  3 Spin_Up_Time            0x0007  147  147  024    Pre-fail  Always      -      390 (Average 390)

  4 Start_Stop_Count        0x0012  100  100  000    Old_age  Always      -      44

  5 Reallocated_Sector_Ct  0x0033  100  100  005    Pre-fail  Always      -      0

  7 Seek_Error_Rate        0x000b  100  100  067    Pre-fail  Always      -      0

  8 Seek_Time_Performance  0x0005  135  135  020    Pre-fail  Offline      -      26

  9 Power_On_Hours          0x0012  100  100  000    Old_age  Always      -      2737

 10 Spin_Retry_Count        0x0013  100  100  060    Pre-fail  Always      -      0

 12 Power_Cycle_Count      0x0032  100  100  000    Old_age  Always      -      44

192 Power-Off_Retract_Count 0x0032  100  100  000    Old_age  Always      -      44

193 Load_Cycle_Count        0x0012  100  100  000    Old_age  Always      -      44

194 Temperature_Celsius    0x0002  162  162  000    Old_age  Always      -      37 (Lifetime Min/Max 20/46)

196 Reallocated_Event_Count 0x0032  100  100  000    Old_age  Always      -      0

197 Current_Pending_Sector  0x0022  100  100  000    Old_age  Always      -      0

198 Offline_Uncorrectable  0x0008  100  100  000    Old_age  Offline      -      0

199 UDMA_CRC_Error_Count    0x000a  200  200  000    Old_age  Always      -      0



SMART Error Log Version: 1

No Errors Logged



SMART Self-test log structure revision number 1

No self-tests have been logged.  [To run self-tests, use: smartctl -t]





SMART Selective self-test log data structure revision number 1

 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS

    1        0        0  Not_testing

    2        0        0  Not_testing

    3        0        0  Not_testing

    4        0        0  Not_testing

    5        0        0  Not_testing

Selective self-test flags (0x0):

  After scanning selected spans, do NOT read-scan remainder of disk.

If Selective self-test is pending on power-up, resume after 0 minute delay.

Output for B:

Code:

root@rescue:~# smartctl -a /dev/sdb

smartctl 5.40 2010-07-12 r3124 [i686-pc-linux-gnu] (local build)

Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net



=== START OF INFORMATION SECTION ===

Device Model:    WDC WD2002FAEX-007BA0

Serial Number:    WD-WMAY02495554

Firmware Version: 05.01D05

User Capacity:    2,000,398,934,016 bytes

Device is:        Not in smartctl database [for details use: -P showall]

ATA Version is:  8

ATA Standard is:  Exact ATA specification draft version not indicated

Local Time is:    Sun May 20 21:13:12 2012 UTC

SMART support is: Available - device has SMART capability.

SMART support is: Enabled



=== START OF READ SMART DATA SECTION ===

SMART overall-health self-assessment test result: PASSED



General SMART Values:

Offline data collection status:  (0x82) Offline data collection activity

                                        was completed without error.

                                        Auto Offline Data Collection: Enabled.

Self-test execution status:      (  0) The previous self-test routine completed

                                        without error or no self-test has ever

                                        been run.

Total time to complete Offline

data collection:                (29580) seconds.

Offline data collection

capabilities:                    (0x7b) SMART execute Offline immediate.

                                        Auto Offline data collection on/off support.

                                        Suspend Offline collection upon new

                                        command.

                                        Offline surface scan supported.

                                        Self-test supported.

                                        Conveyance Self-test supported.

                                        Selective Self-test supported.

SMART capabilities:            (0x0003) Saves SMART data before entering

                                        power-saving mode.

                                        Supports SMART auto save timer.

Error logging capability:        (0x01) Error logging supported.

                                        General Purpose Logging supported.

Short self-test routine

recommended polling time:        (  2) minutes.

Extended self-test routine

recommended polling time:        ( 255) minutes.

Conveyance self-test routine

recommended polling time:        (  5) minutes.

SCT capabilities:              (0x3037) SCT Status supported.

                                        SCT Feature Control supported.

                                        SCT Data Table supported.



SMART Attributes Data Structure revision number: 16

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE

  1 Raw_Read_Error_Rate    0x002f  157  129  051    Pre-fail  Always      -      171761

  3 Spin_Up_Time            0x0027  253  253  021    Pre-fail  Always      -      8166

  4 Start_Stop_Count        0x0032  100  100  000    Old_age  Always      -      112

  5 Reallocated_Sector_Ct  0x0033  171  171  140    Pre-fail  Always      -      230

  7 Seek_Error_Rate        0x002e  100  253  000    Old_age  Always      -      0

  9 Power_On_Hours          0x0032  095  095  000    Old_age  Always      -      4275

 10 Spin_Retry_Count        0x0032  100  100  000    Old_age  Always      -      0

 11 Calibration_Retry_Count 0x0032  100  100  000    Old_age  Always      -      0

 12 Power_Cycle_Count      0x0032  100  100  000    Old_age  Always      -      110

192 Power-Off_Retract_Count 0x0032  200  200  000    Old_age  Always      -      109

193 Load_Cycle_Count        0x0032  200  200  000    Old_age  Always      -      2

194 Temperature_Celsius    0x0022  112  107  000    Old_age  Always      -      40

196 Reallocated_Event_Count 0x0032  021  021  000    Old_age  Always      -      179

197 Current_Pending_Sector  0x0032  200  198  000    Old_age  Always      -      13

198 Offline_Uncorrectable  0x0030  200  200  000    Old_age  Offline      -      0

199 UDMA_CRC_Error_Count    0x0032  200  200  000    Old_age  Always      -      0

200 Multi_Zone_Error_Rate  0x0008  187  187  000    Old_age  Offline      -      2787



SMART Error Log Version: 1

Warning: ATA error count 5774 inconsistent with error log pointer 3



ATA Error Count: 5774 (device log contains only the most recent five errors)

        CR = Command Register [HEX]

        FR = Features Register [HEX]

        SC = Sector Count Register [HEX]

        SN = Sector Number Register [HEX]

        CL = Cylinder Low Register [HEX]

        CH = Cylinder High Register [HEX]

        DH = Device/Head Register [HEX]

        DC = Device Command Register [HEX]

        ER = Error register [HEX]

        ST = Status register [HEX]

Powered_Up_Time is measured from power on, and printed as

DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,

SS=sec, and sss=millisec. It "wraps" after 49.710 days.



Error 5774 occurred at disk power-on lifetime: 3825 hours (159 days + 9 hours)

  When the command that caused the error occurred, the device was active or idle.



  After command completion occurred, registers were:

  ER ST SC SN CL CH DH

  -- -- -- -- -- -- --

  40 51 58 eb 8a 77 e2  Error: UNC 88 sectors at LBA = 0x02778aeb = 41388779



  Commands leading to the command that caused the error were:

  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name

  -- -- -- -- -- -- -- --  ----------------  --------------------

  c8 00 58 b8 8a 77 e2 08  21d+11:23:24.946  READ DMA

  c8 00 08 b0 8a 77 e2 08  21d+11:23:24.929  READ DMA

  c8 00 38 78 8a 77 e2 08  21d+11:23:24.587  READ DMA

  c8 00 20 50 8a 77 e2 08  21d+11:23:24.121  READ DMA

  c8 00 08 10 8a 77 e2 08  21d+11:23:24.121  READ DMA



Error 5773 occurred at disk power-on lifetime: 3825 hours (159 days + 9 hours)

  When the command that caused the error occurred, the device was active or idle.



  After command completion occurred, registers were:

  ER ST SC SN CL CH DH

  -- -- -- -- -- -- --

  40 51 08 11 90 c5 e2  Error: UNC 8 sectors at LBA = 0x02c59011 = 46501905



  Commands leading to the command that caused the error were:

  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name

  -- -- -- -- -- -- -- --  ----------------  --------------------

  c8 00 08 10 90 c5 e2 08  21d+11:07:40.587  READ DMA

  c8 00 08 10 b5 a2 e2 08  21d+11:07:40.575  READ DMA

  c8 00 08 00 90 c5 e2 08  21d+11:07:40.514  READ DMA

  c8 00 08 e8 a0 73 e2 08  21d+11:07:40.512  READ DMA

  c8 00 08 e0 a0 73 e2 08  21d+11:07:40.499  READ DMA



SMART Self-test log structure revision number 1

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error

# 1  Short offline      Completed without error      00%      1534        -

# 2  Short offline      Completed without error      00%      1523        -

# 3  Short offline      Completed without error      00%      1523        -

# 4  Short offline      Completed without error      00%        24        -

# 5  Short offline      Completed without error      00%        13        -

# 6  Short offline      Completed without error      00%        13        -



SMART Selective self-test log data structure revision number 1

 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS

    1        0        0  Not_testing

    2        0        0  Not_testing

    3        0        0  Not_testing

    4        0        0  Not_testing

    5        0        0  Not_testing

Selective self-test flags (0x0):

  After scanning selected spans, do NOT read-scan remainder of disk.

If Selective self-test is pending on power-up, resume after 0 minute delay.

If I read this correctly, the B drive appears to be failing. Could this keep the server from booting, even though the A drive is just fine? Even if so, I should be able to mount just A and look at it, but I can't do that.
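
As a rough cross-check on where the damage sits, the two LBAs in the sdb error log can be mapped onto the partition table using the cylinder geometry fdisk printed earlier (16065 sectors per cylinder). This is only a sketch: it assumes the LBAs smartctl reports are absolute disk addresses and that fdisk numbers cylinders from 1. The function names are made up for illustration.

```shell
#!/bin/sh
# Sectors per cylinder, from the fdisk header in this thread (255 heads * 63 sectors).
SECTORS_PER_CYL=16065

# Convert an absolute LBA to fdisk's 1-based cylinder number (assumed numbering).
lba_to_cylinder() {
    echo $(( $1 / SECTORS_PER_CYL + 1 ))
}

# Map a cylinder to a partition using the sdb start/end cylinders from fdisk -l.
# Cylinders 5100 and 8924 are shared by two partitions, so they are reported as ambiguous.
cylinder_to_partition() {
    if   [ "$1" -lt 5100 ]; then echo "sdb1"
    elif [ "$1" -eq 5100 ]; then echo "sdb1 or sdb2"
    elif [ "$1" -lt 8924 ]; then echo "sdb2"
    elif [ "$1" -eq 8924 ]; then echo "sdb2 or sdb3"
    else echo "sdb3"
    fi
}

# The two LBAs shown in the SMART error log for sdb (decimal form).
for lba in 41388779 46501905; do
    cyl=$(lba_to_cylinder "$lba")
    echo "LBA $lba -> cylinder $cyl -> $(cylinder_to_partition "$cyl")"
done
```

Under those assumptions both logged errors fall inside sdb1, the member of md1 (/), not sdb3. The log keeps only the most recent of 5774 errors, so this says nothing about where the earlier ones were.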

Maybe the mdadm terminology is a bit off, because "degraded" is one possible state of a RAID array, not the state of any single member of an array.

After you added /dev/sdb3 to /dev/md3, what (if any) other mdadm commands did you run?

Anyway, /dev/sda seems good from the S.M.A.R.T. data. Specifically, "Reallocated_Sector_Ct" and "Current_Pending_Sector" both have a raw value of 0. Could you post the same data for /dev/sdb?
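
The check described here (look at the raw values of the reallocation and pending-sector counters) can be sketched as a small filter over smartctl's attribute table. It assumes the ten-column layout shown in this thread, with the raw value in column 10. `smart_flags` is a made-up name, and the set of attribute IDs to watch (5, 196, 197, 198) is a choice made for this sketch, not a rule.

```shell
#!/bin/sh
# Print SMART attributes whose raw value hints at media trouble.
# Reads smartctl attribute-table text on stdin.
smart_flags() {
    awk '
        # Attribute rows have at least 10 columns: ID in column 1, name in 2, raw value in 10.
        NF >= 10 && ($1 == 5 || $1 == 196 || $1 == 197 || $1 == 198) && $10 + 0 > 0 {
            printf "%s (id %s) raw=%s\n", $2, $1, $10
        }
    '
}

# Demo on a few rows copied from the sdb output in this thread.
smart_flags <<'EOF'
  5 Reallocated_Sector_Ct  0x0033  171  171  140    Pre-fail  Always      -      230
  9 Power_On_Hours          0x0032  095  095  000    Old_age  Always      -      4275
196 Reallocated_Event_Count 0x0032  021  021  000    Old_age  Always      -      179
197 Current_Pending_Sector  0x0032  200  198  000    Old_age  Always      -      13
198 Offline_Uncorrectable  0x0030  200  200  000    Old_age  Offline      -      0
EOF
```

On a live system the same filter could be fed with `smartctl -A /dev/sdb | smart_flags`. An empty result only means these four counters are zero, not that the drive is healthy.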

Quote:

Originally Posted by Ser Olmy (Post 4683439)

I just updated my post with the B data below the A data. That drive appears to be bad.

As for what else I ran, I just used this

Code:

cat /proc/mdstat

where it showed something like the sample below. When that finished it showed 2/2 UU, and it showed green again in the web GUI for rescue mode.

If you look here http://help.ovh.co.uk/RaidSoft I might have managed to mess up the very bottom of that, because the swap commands did nothing and errored. I might not have put in the right letters, I don't know; I'm stuck. I know my data is good on drive A, it is just not accessible for some reason :(

THIS IS JUST A SAMPLE

Code:

Personalities : [linear] [raid0] [raid1] [raid5]

read_ahead 1024 sectors

md1 : active raid1 sdb1[1] sda1[0]

3068288 blocks [2/2] [UU]



md2 : active raid1 sdb2[2] sda2[0]

240597376 blocks [2/1] [U_]

[>....................] recovery = 0.2% (655104/240597376) finish=73.2min speed=54592K/sec

unused devices: <none>
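
To spot a degraded array without eyeballing the status line, the member map in /proc/mdstat can be scanned for an underscore, which marks a missing member. A minimal sketch, assuming the usual layout of an array header line followed by a status line. `degraded_arrays` is a hypothetical helper name, and it accepts the member map with or without square brackets.

```shell
#!/bin/sh
# Print the name of every md array whose member map contains "_" (a missing member).
# Reads /proc/mdstat-style text on stdin.
degraded_arrays() {
    awk '
        # An array header looks like "md2 : active raid1 ...".
        /^md[0-9]+ :/ { name = $1; next }
        name != "" {
            for (i = 1; i <= NF; i++) {
                f = $i
                # Strip optional square brackets around the member map.
                gsub(/\[/, "", f); gsub(/\]/, "", f)
                if (f ~ /^[U_]+$/ && f ~ /_/) { print name; name = ""; break }
            }
        }
    '
}

# Demo on sample text in the same shape as the mdstat sample in this thread.
degraded_arrays <<'EOF'
Personalities : [linear] [raid0] [raid1] [raid5]
md1 : active raid1 sdb1[1] sda1[0]
      3068288 blocks [2/2] [UU]
md2 : active raid1 sdb2[2] sda2[0]
      240597376 blocks [2/1] [U_]
unused devices: <none>
EOF
```

On a running system this would be `degraded_arrays < /proc/mdstat`. Note that an array still resyncing also shows an underscore until the rebuild finishes.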

That is one seriously broken drive. You should unplug /dev/sdb immediately, or at the very least use mdadm /dev/md3 --manage --fail /dev/sdb3 (and repeat the command for md1 and /dev/sdb1).

Quote:

Originally Posted by Ser Olmy (Post 4683447)

I am not able to access the server physically, as it is in another country, lol. Will these commands do the same thing as unplugging the drive, leaving it as a single-drive server until the datacenter is able to put a new drive in?

Code:

mdadm /dev/md3 --manage --fail /dev/sdb3

and

mdadm /dev/md1 --manage --fail /dev/sdb1

Alright, I ran both of those commands and got this output:

Code:

root@rescue:~# mdadm /dev/md1 --manage --fail /dev/sdb1

mdadm: set /dev/sdb1 faulty in /dev/md1

root@rescue:~# mdadm /dev/md3 --manage --fail /dev/sdb3

mdadm: set /dev/sdb3 faulty in /dev/md3

I'm guessing that is good?

Should I tell the server to boot from the hard drive now? Or do I need to change other things to get the server to boot? Shouldn't the RAID just say, hey, there is a good drive here, we can use this?

Yes, that should do the trick.

OK, here goes nothing on the reboot.

Well, the system never comes online after the reboot :( Something must be wrong. Should I be able to mount anything? I would just like to rsync the data somewhere and start over.